The CoVax Project

Background

The development and distribution of Covid-19 vaccines in response to the Covid-19 pandemic demonstrated the world’s ability to act quickly and face threats to public health. The data that the ongoing vaccination has been generating can provide an interesting insight into the effect of governments’ policies concerning vaccine distribution and immunisation eligibility, the willingness of individuals within certain areas to receive immunisation, etc.

In the Czech Republic, Covid-19 data are freely available through the government’s Open Data service [1], which (in collaboration with the National Dashboard of Active Diseases [2]) provides well-made and informative visualisations; these are mostly interactive maps with user-adaptable controls. Currently, the official sources do not provide any information about, or visualisations of, the ratio of fully immunised residents (i.e., those who received 2 doses of an approved Covid-19 vaccine) to the total region population.

Objective

To create a map of the Czech Republic showing the ratio of fully immunised to total population per region.

Visualisation Questions

Is the ratio of fully immunised to total population per region homogeneous across the whole country?
- If heterogeneity is observed, what is (roughly) the difference between the regions with the highest and the lowest ratios?

Source

All data sets used were created in a non-English-speaking environment and contain special characters; UTF-8 encoding is therefore used for reading and saving them.
Reading and saving the data sets in UTF-8 changes the name of the first column (it gains the prefix X.U.FEFF.). There did not seem to be a workaround, so the changed name of the first column is reflected in the code and data set previews.
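The X.U.FEFF. prefix is consistent with a UTF-8 byte-order mark (BOM) at the start of the file being read as part of the first column name. A possible workaround, which was not used in this project and has not been tested against these data sets, is base R's "UTF-8-BOM" file encoding. The sketch below writes a small temporary file with a BOM (the two-column content is a stand-in, not the real file) and reads it back both ways:

```r
# Write a tiny CSV that starts with the UTF-8 byte-order mark (EF BB BF)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("region_id,pop_total\nCZ010,1259413\n"), con)
close(con)

# Reading with encoding = "UTF-8" only marks strings as UTF-8; the BOM
# typically ends up inside the first column name (exact form varies by platform)
names(read.csv(tmp, encoding = "UTF-8"))

# Reading with fileEncoding = "UTF-8-BOM" strips the BOM before parsing
pop_demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(pop_demo)   # "region_id" "pop_total"
```

If this approach were adopted, the column references in the code (e.g. X.U.FEFF.region_id) would need to be changed to the plain names.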

Immunisation data

Czech Covid-19 statistics are published and updated daily by the Ministry of Health [3] on its National Dashboard of Active Diseases. The dashboard provides statistics, visualisations, and access to a number of more or less aggregated data files. Data can be accessed through an API or downloaded directly (.csv).
Data collection, management, and publishing are handled by the Institute of Health Information and Statistics of the Czech Republic [4] through the National Health Information System (NHIS) and its subsystems, which pool data shared by individuals and institutions managing case tracking and healthcare (i.e., healthcare providers and regional public health offices).
The vaccine uptake data set is updated daily (except for weekends) and available from the dashboard:
https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19/ockovani-orp.csv

Population data

Summary population statistics are published by the Czech Statistical Office on its website [5] and updated annually. They are compiled based on observed births, deaths, and within-country mobility. Additionally, the data is checked when the outcomes of the most recent census become available.
The data set is available on the website; however, it is formatted for user-friendly viewing rather than data processing, so it was pre-processed and saved in the project data folder.
The original data set is accessible through the following catalogue:
https://www.czso.cz/csu/czso/population-of-municipalities-1-january-2021
or directly through the following download link:
https://www.czso.cz/documents/10180/142756350/1300722101.xlsx

Variable description

The Codebook provides information about data sets used in the Data Management and Visualisation project. Please note that where variables were dropped and not used further, their Czech name is referenced. Variables that have been manipulated (and therefore translated) are labelled in English.

Immunisation data

The data set is in the long format (i.e., several rows per observation to display unique combinations of all recorded variables).

Used variables

region_name: The official names of the administrative regions in Czech (see Notes).
- Uncategorised entries were dropped.
region_id: The statistical codes of the administrative regions within the European NUTS hierarchy (see Notes).
- Uncategorised entries were dropped.
finished_vax: The number of individuals who completed their Covid-19 immunisation.

Unused variables

id: A pre-assigned ID, possibly for internal identification purposes.
date: The date when individual observations were reported.
orp_bydliste_nazev: The official names of the administrative districts in Czech (see Notes).
- Contains both labelled and unlabelled uncategorised data.
orp_bydliste_kod: The statistical codes of the administrative districts within the Statistical Meta-information System (see Notes).
- Contains unlabelled uncategorised data.
vekova_kategorie: The age group immunisation recipients fall within.
ockovaci_latka_nazev: The name of the European Medicines Agency-approved vaccine that was administered.
ockovaci_latka_kod: A code assigned to the approved vaccines, possibly for internal use of the Ministry.
prvni_davka: The number of individuals falling within observed categories who received the first dose of a Covid-19 vaccine.
druha_davka: The number of individuals falling within observed categories who received the second dose of a Covid-19 vaccine.
posilujici_davka: The number of individuals falling within observed categories who received the booster dose of a Covid-19 vaccine.

Population data

The data set is not in a machine-readable format. It is separated into several groups for which the data are provided: the Czech Republic, administrative regions (used in the visualisation), and administrative districts.

Used variables

pop_total: The number of permanent residents per region.
region_id: The statistical codes of the administrative regions within the European NUTS hierarchy (see Notes).

Unused variables

Population - Males: The number of male residents per category.
Population - Females: The number of female residents per category.
Average age - Total: Total average age per category.
Average age - Males: Average male age per category.
Average age - Females: Average female age per category.

Computed variables

vax_per_pop: The proportion of individuals with completed Covid-19 immunisation per total region population; computed by dividing finished_vax (Immunisation data) by pop_total (Population data).
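As a worked check of this definition, the sketch below recomputes the ratio for Prague (CZ010) from the two figures shown in the tidy-data preview of this document. The variable names mirror the codebook; nothing else is assumed.

```r
# Figures for Prague (CZ010), copied from the tidy-data preview
finished_vax <- 882433    # completed immunisations
pop_total    <- 1259413   # permanent residents

# proportion of fully immunised residents
vax_per_pop <- finished_vax / pop_total

round(vax_per_pop, 7)                    # 0.7006701
sprintf("%.1f%%", 100 * vax_per_pop)     # "70.1%"
```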

Notes

Administrative districts of municipalities with extended competence: the smallest administrative districts in the Czech Republic, centered around 205 municipalities with extended competence [6].
Administrative regions: A total of 14 administrative regions of the Czech Republic with devolved powers, roughly equivalent to UK counties or groups of unitary authorities [7].
NUTS: Eurostat’s Nomenclature of Territorial Units for Statistics used for comparison of devolved administration units across the EU as well as funding purposes [8].

Preview: Immunisation data

Preview (5 rows, 7 columns; 5 columns dropped for ease of visualisation) without displaying IDs:

datum	kraj_bydliste_nazev	kraj_bydliste_kod	orp_bydliste_nazev	orp_bydliste_kod	vekova_kategorie	ockovaci_latka_nazev
2021-01-01	Hlavní město Praha	CZ010	nezařazení	0	45-49	Comirnaty
2021-01-01	Hlavní město Praha	CZ010	nezařazení	0	25-29	Comirnaty
2021-01-01	Hlavní město Praha	CZ010	nezařazení	0	55-59	Comirnaty
2021-01-01	Středočeský kraj	CZ020	Brandýs nad Labem-Stará Boleslav	2003	75-79	Comirnaty
2021-01-01	Středočeský kraj	CZ020	Hořovice	2008	25-29	Comirnaty

Preview: Total region population data

Pre-edited data set preview (3 rows):

X.U.FEFF.region_id	pop_total
CZ010	1259413
CZ020	1372588
CZ031	636422

Walkthrough

The project starts by loading the required packages and the raw data; encoding is set to UTF-8.

# load packages (install any that are missing, then attach them)
# readr, used below for saving the tidy data, is attached with tidyverse
for (pkg in c("here", "tidyverse")) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}

# import data
url_vaccines <- "https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19/ockovani-orp.csv"

df <- read.csv(url_vaccines,
               encoding = "UTF-8")

pop_df <- read.csv(here("data_raw", "2021-01-01_region_population.csv"),
                   encoding = "UTF-8")

In cleaning the data, the first step is to replace NA values with an integer (0) for subsequent data aggregation and manipulation. Redundant columns are dropped and relevant columns are renamed in English:

# NA → 0 for subsequent computations
df[is.na(df)] <- 0

# drop redundant columns
df <- df %>% select(-c(
  X.U.FEFF.id,
  ockovaci_latka_kod,
  orp_bydliste_nazev,
  orp_bydliste_kod,
  vekova_kategorie,
  ockovaci_latka_nazev,
  prvni_davka,
  druha_davka,
  posilujici_davka
  )) %>% 

# translate and simplify column names
  rename(
    date = datum,
    region_name = kraj_bydliste_nazev,
    region_id = kraj_bydliste_kod,
    finished_vax = dokoncene_ockovani)
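The reason for replacing NA values before aggregating can be seen in a minimal base R sketch. The counts are hypothetical; the last lines mirror the approach used in the project, and na.rm = TRUE is shown only as an alternative that was not used here.

```r
# Toy daily counts with one missing value (hypothetical figures)
daily <- c(120, NA, 75)

sum(daily)                  # NA: a single missing value propagates
sum(daily, na.rm = TRUE)    # 195: missing value skipped at summation time

# the approach used in this project: replace NA with 0 up front
daily_zeroed <- daily
daily_zeroed[is.na(daily_zeroed)] <- 0
sum(daily_zeroed)           # 195
```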

The observations are then summed to produce the total of completed immunisations per region:

# sum daily figures into totals
df <- df %>% 
  group_by(region_name, region_id) %>% 
  # .groups = "drop" returns an ungrouped result
  summarise(finished_vax = sum(finished_vax), .groups = "drop")
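To illustrate what this step does to long-format data, here is a small sketch on made-up rows. The counts are hypothetical and the column names follow the renamed variables. The .groups = "drop" argument is an optional dplyr setting that returns an ungrouped result.

```r
library(dplyr)

# Toy long-format data: several rows per region (hypothetical counts)
toy <- data.frame(
  region_name  = c("Hlavní město Praha", "Hlavní město Praha", "Středočeský kraj"),
  region_id    = c("CZ010", "CZ010", "CZ020"),
  finished_vax = c(10, 5, 7)
)

# collapse to one row per region by summing the counts
totals <- toy %>%
  group_by(region_name, region_id) %>%
  summarise(finished_vax = sum(finished_vax), .groups = "drop")

totals
# one row per region: CZ010 -> 15, CZ020 -> 7
```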

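The immunisation data set is then merged with the population data set (total resident population per region), using the statistical region codes as the merge key. Because merge() keeps only matching rows by default, entries whose region code does not appear in the population data are dropped at this step. The ratio of completed immunisations to total population is computed in the same pipeline: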

# merge with population data
df <- merge(df, pop_df, by.x = "region_id", by.y = "X.U.FEFF.region_id") %>% 
  mutate(vax_per_pop = finished_vax / pop_total)
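The behaviour of merge() with respect to unmatched rows can be checked on toy data. In this sketch the figures are hypothetical, and the empty region code is an assumed stand-in for an uncategorised entry, not taken from the real file.

```r
# Toy totals, including one uncategorised row with an empty region code
vax_demo <- data.frame(
  region_id    = c("CZ010", "CZ020", ""),
  finished_vax = c(15, 7, 3)
)
pop_demo <- data.frame(
  region_id = c("CZ010", "CZ020"),
  pop_total = c(20, 10)
)

# default merge keeps only region codes present in both data frames
merged <- merge(vax_demo, pop_demo, by = "region_id")
merged$vax_per_pop <- merged$finished_vax / merged$pop_total
merged
# 2 rows: CZ010 -> 0.75, CZ020 -> 0.70; the row without a region code is gone

# all.x = TRUE would keep the unmatched row, with NA for pop_total
nrow(merge(vax_demo, pop_demo, by = "region_id", all.x = TRUE))   # 3
```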

Finally, the data set is saved for further manipulation while the UTF-8 encoding is preserved.

# save tidy data (in UTF-8)
readr::write_excel_csv(df, file = here("data_tidy", "cz_covax_regions_tidy.csv"))

Preview: Tidy data

Preview (5 rows)

X.U.FEFF.region_id	region_name	finished_vax	pop_total	vax_per_pop
CZ010	Hlavní město Praha	882433	1259413	0.7006701
CZ020	Středočeský kraj	928939	1372588	0.6767792
CZ031	Jihočeský kraj	414295	636422	0.6509753
CZ032	Plzeňský kraj	374386	576358	0.6495720
CZ041	Karlovarský kraj	180331	285020	0.6326960

Visualisation Code
Output & Aesthetics

For the visualisation, the required packages and the tidy data are loaded. The simple features (sf) objects that provide the geo-spatial coordinates of the regions (kraje()) and of the country outline (republika()) are supplied by the RCzechia package. The sf objects were not introduced in the data preparation stage because combining sf objects with data frames is difficult and the resulting files are large. Redundant information is removed from the regional sf object:

# load packages (install any that are missing, then attach them)
for (pkg in c("here", "tidyverse", "ggplot2", "RCzechia", "sf")) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}

# load data
df <- read.csv(here("data_tidy","cz_covax_regions_tidy.csv"),
               encoding = "UTF-8")

# load sf objects
repub <- republika()

regions <- kraje(resolution = "high") %>% 
  select(-c(KOD_KRAJ, NAZ_CZNUTS3))

The tidy data set is then merged with the tidy sf object, and the new object is explicitly coerced into being an sf object (as not doing so has caused issues with the visualisation package recognising the sf data):

df <- merge(df, regions, by.x = "X.U.FEFF.region_id", by.y = "KOD_CZNUTS3")

# df → sf object for ggplot2
df <- st_as_sf(df)
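The need for the explicit coercion can be reproduced without downloading any map data. The sketch below uses two made-up unit squares as stand-ins for the regional polygons. The helper unit_square(), the object names, and the ratios are all hypothetical; only the column name KOD_CZNUTS3 and the region codes are taken from this document.

```r
library(sf)

# Helper (hypothetical): a unit square whose lower-left corner is at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Stand-in for the kraje() object: two made-up polygons
regions_demo <- st_sf(
  KOD_CZNUTS3 = c("CZ010", "CZ020"),
  geometry    = st_sfc(unit_square(0), unit_square(1), crs = 4326)
)

# Stand-in for the tidy data (hypothetical ratios)
stats_demo <- data.frame(
  region_id   = c("CZ010", "CZ020"),
  vax_per_pop = c(0.70, 0.68)
)

# With the plain data frame as the first argument, the data-frame merge
# method is used; check whether the sf class survived
merged <- merge(stats_demo, regions_demo,
                by.x = "region_id", by.y = "KOD_CZNUTS3")
inherits(merged, "sf")

# st_as_sf() picks up the geometry column and restores the sf class
merged_sf <- st_as_sf(merged)
inherits(merged_sf, "sf")   # TRUE
```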

The visualisation is initiated using the ggplot2 package; the df is specified as well as custom aesthetics. The visualisation is labelled and formatted (see below for notes on aesthetics):

# specify df for ggplot
ggmap <- ggplot(data = df) +
  
  # read regional geographic data
  geom_sf(aes(fill = vax_per_pop), colour = "black", lwd = 0.3) +
  
  # read geographic data of the country
  geom_sf(data = repub, color = "black", lwd = 0.6, fill = NA) +
  
  # set colour (suitable for visually impaired)
  scale_fill_viridis_c(labels = scales::comma, direction = -1) +
  
  # label plot elements (x axis = longitude, y axis = latitude)
  labs(
    x = "Longitude",
    y = "Latitude",
    title = "Covid-19-immunised to total region population ratio in the Czech Republic",
    subtitle = "Ratio of fully immunised population (i.e., having received 2 doses of an approved Covid-19 vaccine) to total population per region",
    caption = "Source: IHIS CR & CZSO\nNote: Cases of immunisation of foreign nationals and individuals of unknown region of residence excluded",
    fill = "Immunised-to-total region \nresidents ratio") +
  
  # suitable theme (reading coords)
  theme_bw() +
  
  # sub/title alignment
  theme(plot.title = element_text(hjust = 0),
        plot.subtitle = element_text(hjust = 0))

Output

Finally, the visualisation is displayed:

# view
ggmap

Map of Covid-19 vaccination uptake in the Czech Republic

Aesthetics

Colours: The colour scheme (viridis) was selected for its distinctive colours which guarantee good readability even if exported as black-and-white, and good readability for those who are colourblind. The scheme was inverted (direction = -1) so that the highest ratios could logically be represented as the darkest areas of the map.
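The effect of direction = -1 can be inspected directly on the palette. This sketch assumes the viridisLite package, which is normally installed as a dependency of ggplot2. It only illustrates the ordering of the colours and is not part of the project code.

```r
library(viridisLite)

default_order  <- viridis(5)                   # dark purple -> yellow
inverted_order <- viridis(5, direction = -1)   # yellow -> dark purple

default_order[1]    # "#440154FF" (dark purple)
inverted_order[1]   # "#FDE725FF" (yellow)

# With the inverted order, the lowest values get the lightest colour
# and the highest values the darkest
identical(inverted_order, rev(default_order))
```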

Fonts & lines: Arial, a sans-serif font, is one of the recommended fonts as it ensures good readability for everyone, including individuals with learning difficulties. The border colour was set to black because, of all the colours tried, it distinguished the regions best. The thickness of the lines was set through iterative trial and error to ensure that the regions can be distinguished with ease and that the country border is distinct from the region borders.

Theme: The bw theme was selected for its subtlety (it does not distract from the map) and for its grid, which allows the regions to be located using the latitude and longitude coordinates provided.

Alignment: The layout of the alignment freely follows the APA-7 layout (except for the left footnote alignment, which felt more natural to the author). The title and the subtitle provide essential information for the interpretation while the footnote references the data and provides information about phenomena that need to be taken into account while reading and interpreting the map.

Objectives & Questions

The project successfully visualised the ratio of fully immunised to total population per region. The map clearly indicates that the ratios of completed immunisations to total region population are heterogeneous: the difference between the regions with the highest and the lowest ratios is roughly 10 percentage points.
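The difference between the highest and the lowest ratio can be read off the tidy data with a few lines of base R. The sketch below uses only the five regions shown in the tidy-data preview, so its result is not the figure for the whole country; with the full 14-region data set the same code would apply.

```r
# Ratios for the five regions shown in the tidy-data preview only
preview <- data.frame(
  region_id   = c("CZ010", "CZ020", "CZ031", "CZ032", "CZ041"),
  vax_per_pop = c(0.7006701, 0.6767792, 0.6509753, 0.6495720, 0.6326960)
)

highest <- preview[which.max(preview$vax_per_pop), ]
lowest  <- preview[which.min(preview$vax_per_pop), ]

# difference expressed in percentage points
spread_pp <- 100 * (highest$vax_per_pop - lowest$vax_per_pop)

highest$region_id      # "CZ010"
lowest$region_id       # "CZ041"
round(spread_pp, 1)    # 6.8 for these five rows
```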

The highest ratio is in the capital. This might reflect the abundance of healthcare and vaccination facilities, as well as educational attainment, socio-economic status, and other associated markers: those who move to the capital often come for better employment opportunities, have higher levels of education, and are socially and economically stable, which may go hand in hand with greater trust in the system and in the benefits of being vaccinated. Politically, residents of the capital often vote for progressive liberal-conservative parties.

The lowest ratios are found in regions along the Polish border. These predominantly industrial regions are among the poorer regions of the country; overall educational attainment and socio-economic status (SES) tend to be lower, as does trust in the system. Politically, residents of these regions often vote for populist and/or extremist parties (typically the communist party).

Limitations & Suggestions

Dropping data of foreign nationals and untraceable residents: As a pre-existing package with geo-spatial data was used for the visualisation, observations not tied to any regions were dropped. An alternative to this solution would be to create polygons outside of the map that would represent unmatched cases or the immunisation of foreign nationals.

Interactivity, variable retention, & aggregation: In order to produce a static visualisation, the data set variables (which otherwise provide rich data) have been heavily aggregated. This project could therefore be extended by using some of the dropped/aggregated variables. For example, the dates could be aggregated into weekly or monthly observations and the gganimate package could be used to show changes in the vaccination ratio in regions over time. Alternatively, using the shiny package could allow viewers to subset data in a desired way (e.g., by letting them sort by age groups, vaccines administered, or using a slider to view the time trends in vaccination).
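As a rough idea of the first step such an extension would need, the sketch below aggregates daily records into monthly and cumulative totals in base R. The records are hypothetical and the animation or interactive layer itself is not shown.

```r
# Toy daily records for one region (hypothetical figures)
daily <- data.frame(
  date         = as.Date(c("2021-01-05", "2021-01-20", "2021-02-03")),
  region_id    = "CZ010",
  finished_vax = c(4, 6, 5)
)

# label each record with its calendar month
daily$month <- format(daily$date, "%Y-%m")

# monthly totals per region
monthly <- aggregate(finished_vax ~ region_id + month, data = daily, FUN = sum)

# running total per region, i.e. the quantity a time-based map would display
monthly$cumulative <- ave(monthly$finished_vax, monthly$region_id, FUN = cumsum)
monthly
# 2021-01 -> 10 (cumulative 10), 2021-02 -> 5 (cumulative 15)
```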

References

[1] Portál otevřených dat (Czech Open Data Portal) - https://data.gov.cz/
[2] Onemocnění aktuálně ČR (National Dashboard of Active Diseases) - https://onemocneni-aktualne.mzcr.cz
[3] Ministerstvo zdravotnictví ČR (Ministry of Health of the Czech Republic) - https://www.mzcr.cz
[4] Ústav zdravotnických informací a statistiky ČR (Institute of Health Information and Statistics of the Czech Republic) - https://www.uzis.cz/index-en.php
[5] Český statistický úřad (Czech Statistical Office) - https://www.czso.cz/csu/czso/home
[6] Wikipedia: Districts of the Czech Republic: Municipalities with extended competence - https://en.wikipedia.org/wiki/Districts_of_the_Czech_Republic#Municipalities_with_extended_competence
[7] Wikipedia: Regions of the Czech Republic - https://en.wikipedia.org/wiki/Regions_of_the_Czech_Republic
[8] Eurostat: NUTS - https://ec.europa.eu/eurostat/web/nuts/background