Background
The development and distribution of Covid-19 vaccines in response to the Covid-19 pandemic demonstrated the world’s ability to act quickly and face threats to public health. The data that the ongoing vaccination has been generating can provide an interesting insight into the effect of governments’ policies concerning vaccine distribution and immunisation eligibility, the willingness of individuals within certain areas to receive immunisation, etc.
In the Czech Republic, Covid-19 data are freely available through the government’s Open Data service [1], which (in collaboration with the National Dashboard of Active Diseases [2]) provides well-made and informative visualisations; these are mostly interactive maps with user-adaptable controls. Currently, the official sources do not provide any information about, or visualisations of, the ratio of fully immunised residents (i.e., those who received 2 doses of an approved Covid-19 vaccine) to the total region population.
Objective
- To create a map of the Czech Republic showing the ratio of fully immunised to total population per region.
Visualisation Questions
- Is the ratio of fully immunised to total population per region homogeneous across the whole country?
- If heterogeneity is observed, what is (roughly) the difference between the regions with the highest and the lowest ratios?
Source
- All data sets were created in a non-English-speaking environment and contain special characters; UTF-8 encoding is therefore used for reading and saving them.
- Reading and saving the data sets as UTF-8 changes the name of the first column (it gains the prefix X.U.FEFF.). There did not seem to be a workaround, so the changed name is reflected in the code and data set previews.
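The X.U.FEFF. prefix is what R produces when a UTF-8 byte order mark (BOM) at the start of a file is read as part of the first column name. One option that may avoid it is base R's "UTF-8-BOM" file encoding. The sketch below is a minimal illustration on a temporary file that is assumed to mimic how the project files are stored; tmp and pop_demo are hypothetical names, and the rest of the walkthrough keeps the prefixed column name as the author used it.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The two data rows are taken from the population preview.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("region_id,pop_total\nCZ010,1259413\nCZ020,1372588\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" asks R to strip the mark while reading,
# so the first column keeps its plain name
pop_demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(pop_demo)
```

If this approach were adopted, the by.y and by.x arguments of the later merge() calls would need the plain column name instead of the prefixed one.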
Immunisation data
- Czech Covid-19 statistics are published and updated daily by the Ministry of Health [3] on its National Dashboard of Active Diseases. The dashboard provides statistics, visualisations, and access to a number of data files at varying levels of aggregation. Data can be accessed through an API or downloaded directly (.csv).
- Data collection, management, and publishing are handled by the Institute of Health Information and Statistics of the Czech Republic [4] through the National Health Information System (NHIS) and its subsystems, which pool data shared by the individuals and institutions that manage case tracking and healthcare (i.e., healthcare providers and regional public health offices).
- The vaccine uptake data set is updated daily (except for weekends) and is available from the dashboard: https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19/ockovani-orp.csv
Population data
- Summary population statistics are published by the Czech Statistical Office on its website [5] and updated annually. They are compiled from observed births, deaths, and within-country mobility. Additionally, the data are checked when the outcomes of the most recent census become available.
- The data set is available on the website; however, it is formatted for user-friendly viewing rather than data processing and was therefore pre-processed and saved in the project data folder.
- The original data set is accessible through the following catalogue: https://www.czso.cz/csu/czso/population-of-municipalities-1-january-2021 or directly through the following download link: https://www.czso.cz/documents/10180/142756350/1300722101.xlsx
Variable description
The Codebook provides information about data sets used in the Data Management and Visualisation project. Please note that where variables were dropped and not used further, their Czech name is referenced. Variables that have been manipulated (and therefore translated) are labelled in English.
Immunisation data
- The data set is in the long format (i.e., several rows per observation to display unique combinations of all recorded variables).
Used variables
- region_name: The official names of the administrative regions in Czech (see Notes). Uncategorised entries were dropped.
- region_id: The statistical codes of the administrative regions within the European NUTS hierarchy (see Notes). Uncategorised entries were dropped.
- finished_vax: The number of individuals who completed their Covid-19 immunisation.
Unused variables
- id: A pre-assigned ID, possibly for internal identification purposes.
- date: The date when individual observations were reported.
- orp_bydliste_nazev: The official names of the administrative districts in Czech (see Notes). Contains both labelled and unlabelled uncategorised data.
- orp_bydliste_kod: The statistical codes of the administrative districts within the Statistical Meta-information System (see Notes). Contains unlabelled uncategorised data.
- vekova_kategorie: The age group immunisation recipients fall within.
- ockovaci_latka_nazev: The name of the European Medicines Agency-approved vaccine that was administered.
- ockovaci_latka_kod: A code assigned to the approved vaccines, possibly for internal use of the Ministry.
- prvni_davka: The number of individuals falling within observed categories who received the first dose of a Covid-19 vaccine.
- druha_davka: The number of individuals falling within observed categories who received the second dose of a Covid-19 vaccine.
- posilujici_davka: The number of individuals falling within observed categories who received the booster dose of a Covid-19 vaccine.
Population data
- The data set is not in a machine-readable format. It is split into several groups for which the data are provided: the Czech Republic, administrative regions (used in the visualisation), and administrative districts.
Used variables
- pop_total: The number of permanent residents per region.
- region_id: The statistical codes of the administrative regions within the European NUTS hierarchy (see Notes).
Unused variables
- Population - Males: The number of male residents per category.
- Population - Females: The number of female residents per category.
- Average age - Total: Total average age per category.
- Average age - Males: Average male age per category.
- Average age - Females: Average female age per category.
Computed variables
- vax_per_pop: The proportion of individuals with completed Covid-19 immunisation per total region population; computed by dividing finished_vax (Immunisation data) by pop_total (Population data).
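As a worked example of this computation, the sketch below divides the two values reported for Prague (CZ010) in the tidy data preview; the standalone variables are used here for illustration only and are not part of the project code.

```r
# Values for Prague (CZ010) as shown in the tidy data preview
finished_vax <- 882433   # individuals with completed immunisation
pop_total    <- 1259413  # permanent residents

# Proportion of fully immunised residents
vax_per_pop <- finished_vax / pop_total
round(vax_per_pop, 7)
```

The result agrees with the vax_per_pop value listed for CZ010 in the tidy data preview (about 0.70, i.e., roughly 70 % of residents).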
Notes
- Administrative districts of municipalities with extended competence: the smallest administrative districts in the Czech Republic, centered around 205 municipalities with extended competence [6].
- Administrative regions: A total of 14 administrative regions of the Czech Republic with devolved powers, roughly equivalent to UK counties or groups of unitary authorities [7].
- NUTS: Eurostat’s Nomenclature of Territorial Units for Statistics used for comparison of devolved administration units across the EU as well as funding purposes [8].
Preview: Immunisation data
Preview (5 rows, 7 columns; 5 columns dropped for ease of visualisation) without displaying IDs:
| datum | kraj_bydliste_nazev | kraj_bydliste_kod | orp_bydliste_nazev | orp_bydliste_kod | vekova_kategorie | ockovaci_latka_nazev |
|---|---|---|---|---|---|---|
| 2021-01-01 | Hlavní město Praha | CZ010 | nezařazení | 0 | 45-49 | Comirnaty |
| 2021-01-01 | Hlavní město Praha | CZ010 | nezařazení | 0 | 25-29 | Comirnaty |
| 2021-01-01 | Hlavní město Praha | CZ010 | nezařazení | 0 | 55-59 | Comirnaty |
| 2021-01-01 | Středočeský kraj | CZ020 | Brandýs nad Labem-Stará Boleslav | 2003 | 75-79 | Comirnaty |
| 2021-01-01 | Středočeský kraj | CZ020 | Hořovice | 2008 | 25-29 | Comirnaty |
Preview: Total region population data
Pre-edited data set preview (3 rows):
| X.U.FEFF.region_id | pop_total |
|---|---|
| CZ010 | 1259413 |
| CZ020 | 1372588 |
| CZ031 | 636422 |
Walkthrough
The project starts by loading the required packages and the raw data; the encoding is set to UTF-8.
# load packages, installing any that are missing
# (readr, used later for saving, is attached as part of the tidyverse)
for (pkg in c("here", "tidyverse")) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}
# import data
url_vaccines <- "https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19/ockovani-orp.csv"
df <- read.csv(url_vaccines,
               encoding = "UTF-8")
pop_df <- read.csv(here("data_raw", "2021-01-01_region_population.csv"),
                   encoding = "UTF-8")
In cleaning the data, the first step is to replace NA values with an integer (0) for subsequent data aggregation and manipulation. Redundant columns are dropped and relevant columns are renamed in English:
# NA → 0 for subsequent computations
df[is.na(df)] <- 0
# drop redundant columns
df <- df %>%
  select(-c(
    X.U.FEFF.id,
    ockovaci_latka_kod,
    orp_bydliste_nazev,
    orp_bydliste_kod,
    vekova_kategorie,
    ockovaci_latka_nazev,
    prvni_davka,
    druha_davka,
    posilujici_davka
  )) %>%
  # translate and simplify column names
  rename(
    date = datum,
    region_name = kraj_bydliste_nazev,
    region_id = kraj_bydliste_kod,
    finished_vax = dokoncene_ockovani
  )
The observations are then summed to produce the total of completed immunisations per region:
# sum daily figures into totals (one row per region; grouping is dropped afterwards)
df <- df %>%
  group_by(region_name, region_id) %>%
  summarise(finished_vax = sum(finished_vax), .groups = "drop")
The immunisation data set is then merged with the population data set (total resident population per region), using the statistical region codes as the key:
# merge with population data and compute the immunisation ratio
df <- merge(df, pop_df, by.x = "region_id", by.y = "X.U.FEFF.region_id") %>%
  mutate(vax_per_pop = finished_vax / pop_total)
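Because merge() keeps only rows whose key appears in both data frames by default (all = FALSE), immunisation rows with a region code that has no counterpart in the population data are dropped at this step. The sketch below illustrates this on toy data; vax_demo, pop_demo, the code "XX999", and its count are hypothetical, and how uncategorised entries are actually coded in the source file is an assumption.

```r
# Toy immunisation totals: two real region codes plus one
# hypothetical code standing in for an uncategorised entry
vax_demo <- data.frame(
  region_id    = c("CZ010", "CZ020", "XX999"),
  finished_vax = c(882433, 928939, 1000)   # last value is made up
)

# Toy population table covering only the two real regions
pop_demo <- data.frame(
  region_id = c("CZ010", "CZ020"),
  pop_total = c(1259413, 1372588)
)

# Default merge() is an inner join: unmatched keys are dropped
merged <- merge(vax_demo, pop_demo, by = "region_id")

# List the region codes that were lost in the merge
unmatched <- setdiff(vax_demo$region_id, pop_demo$region_id)
unmatched
```

Checking the unmatched codes before merging makes the exclusion of uncategorised entries explicit rather than a side effect of the join.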
Finally, the data set is saved for further manipulation while the UTF-8 encoding is preserved.
# save tidy data (in UTF-8)
readr::write_excel_csv(df, file = here("data_tidy", "cz_covax_regions_tidy.csv"))
Preview: Tidy data
Preview (5 rows)
| X.U.FEFF.region_id | region_name | finished_vax | pop_total | vax_per_pop |
|---|---|---|---|---|
| CZ010 | Hlavní město Praha | 882433 | 1259413 | 0.7006701 |
| CZ020 | Středočeský kraj | 928939 | 1372588 | 0.6767792 |
| CZ031 | Jihočeský kraj | 414295 | 636422 | 0.6509753 |
| CZ032 | Plzeňský kraj | 374386 | 576358 | 0.6495720 |
| CZ041 | Karlovarský kraj | 180331 | 285020 | 0.6326960 |
For the visualisation, the required packages and the tidy data are loaded. The RCzechia package supplies new simple features (sf) objects: kraje() provides the geo-spatial coordinates for drawing the regions, and republika() provides the outline of the Czech Republic itself. The sf objects were not introduced in the data preparation stages because combining sf objects with data frames is difficult and the resulting files are large. Redundant information is removed from the sf object:
# load packages, installing any that are missing
# (ggplot2 is also attached by the tidyverse; it is listed for clarity)
for (pkg in c("here", "tidyverse", "ggplot2", "RCzechia", "sf")) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}
# load data
df <- read.csv(here("data_tidy", "cz_covax_regions_tidy.csv"),
               encoding = "UTF-8")
# load sf objects
repub <- republika()
regions <- kraje(resolution = "high") %>%
  select(-c(KOD_KRAJ, NAZ_CZNUTS3))
The tidy data set is then merged with the tidy sf object, and the new object is explicitly coerced into being an sf object (as not doing so has caused issues with the visualisation package recognising the sf data):
df <- merge(df, regions, by.x = "X.U.FEFF.region_id", by.y = "KOD_CZNUTS3")
# df → sf object for ggplot2
df <- st_as_sf(df)
The visualisation is initiated using the
ggplot2 package; the df is specified as well as
custom aesthetics. The visualisation is labelled and formatted
(see below for notes on aesthetics):
# specify df for ggplot
ggmap <- ggplot(data = df) +
  # draw the regions, filled by immunisation ratio
  geom_sf(aes(fill = vax_per_pop), colour = "black", lwd = 0.3) +
  # draw the outline of the country
  geom_sf(data = repub, colour = "black", lwd = 0.6, fill = NA) +
  # set colour (suitable for visually impaired)
  scale_fill_viridis_c(labels = scales::comma, direction = -1) +
  # label plot elements (x axis shows longitude, y axis shows latitude)
  labs(
    x = "Longitude",
    y = "Latitude",
    title = "Covid-19-immunised to total region population ratio in the Czech Republic",
    subtitle = "Ratio of fully immunised population (i.e., having received 2 doses of an approved Covid-19 vaccine) to total population per region",
    caption = "Source: IHIS CR & CZSO\nNote: Cases of immunisation of foreign nationals and individuals of unknown region residence excluded",
    fill = "Immunised-to-total region \nresidents ratio") +
  # suitable theme (reading coords)
  theme_bw() +
  # sub/title alignment
  theme(plot.title = element_text(hjust = 0),
        plot.subtitle = element_text(hjust = 0))
Output
Finally, the visualisation is displayed:
# view
ggmap
Aesthetics
Colours: The colour scheme (viridis) was
selected for its distinctive colours which guarantee good
readability even if exported as black-and-white, and good
readability for those who are colourblind. The scheme was
inverted (direction = -1) so that the highest
ratios could logically be represented as the darkest areas of
the map.
Fonts & lines: Arial, a sans-serif font, is one of the recommended fonts as it ensures good readability for everyone, including individuals with learning difficulties. The border colour was set to black as it was the best at distinguishing among the regions out of all the colours used. The thickness of the lines was set through iterative trial-and-error changes to ensure the regions can be distinguished with ease and the country border is distinctive from the region borders.
Theme: The bw theme was selected for its subtlety (it does not distract from the map) and for its grid, which allows regions to be located by the latitude and longitude coordinates provided.
Alignment: The layout of the alignment freely follows the APA-7 layout (except for the left footnote alignment, which felt more natural to the author). The title and the subtitle provide essential information for the interpretation while the footnote references the data and provides information about phenomena that need to be taken into account while reading and interpreting the map.
Objectives & Questions
The project successfully visualised the ratio of fully immunised to total population per region. The map clearly indicates that the ratios of completed immunisations to total region population are heterogeneous, with the difference between the regions with the highest and the lowest immunisation ratio being roughly 10 percentage points.
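The size of this gap can be checked directly from the tidy data. The sketch below is illustrative only: it uses just the five regions shown in the tidy data preview, so its result does not reproduce the country-wide figure, and tidy_demo, highest, lowest, and gap_pp are hypothetical names.

```r
# Five regions copied from the tidy data preview (not the full 14)
tidy_demo <- data.frame(
  region_id   = c("CZ010", "CZ020", "CZ031", "CZ032", "CZ041"),
  vax_per_pop = c(0.7006701, 0.6767792, 0.6509753, 0.6495720, 0.6326960)
)

# Regions with the highest and the lowest ratio
highest <- tidy_demo[which.max(tidy_demo$vax_per_pop), ]
lowest  <- tidy_demo[which.min(tidy_demo$vax_per_pop), ]

# Gap between them, expressed in percentage points
gap_pp <- 100 * (highest$vax_per_pop - lowest$vax_per_pop)
round(gap_pp, 1)
```

Running the same steps on the full tidy data set would give the figure for all regions.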
The highest ratio is in the capital. This might be explained by the abundance of healthcare and vaccination facilities, as well as by educational attainment, socio-economic status, and other associated markers: those who move to the capital often come for better employment opportunities, have higher levels of education, and are socially and economically stable, which may go hand in hand with greater trust in the system and in the benefits of vaccination. Politically, residents of the capital often vote for progressive liberal-conservative parties.
The lowest ratios are found in regions along the Polish border. These predominantly industrial regions are among the poorer regions of the country; overall educational attainment and socio-economic status tend to be lower, as does trust in the system. Politically, residents of these regions often vote for populist and/or extremist parties (typically the communist party).
Limitations & Suggestions
Dropping data of foreign nationals and untraceable residents: As a pre-existing package with geo-spatial data was used for the visualisation, observations not tied to any regions were dropped. An alternative to this solution would be to create polygons outside of the map that would represent unmatched cases or the immunisation of foreign nationals.
Interactivity, variable retention, & aggregation: In order to produce a static visualisation, the data set variables (which otherwise provide rich data) have been heavily aggregated. This project could therefore be extended by using some of the dropped or aggregated variables. For example, the dates could be aggregated into weekly or monthly observations and the gganimate package could be used to show changes in the vaccination ratio in regions over time. Alternatively, the shiny package could allow viewers to subset the data as desired (e.g., by filtering by age group or vaccine administered, or by using a slider to view time trends in vaccination).
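As a starting point for the time-based extension, the daily counts could be rolled up into monthly totals and a running total per region before any animation is attempted. The sketch below uses base R on made-up toy values; daily_demo, monthly, and cum_vax are hypothetical names, and the gganimate step itself is not shown.

```r
# Toy daily records (counts are invented for illustration)
daily_demo <- data.frame(
  date         = as.Date(c("2021-01-04", "2021-01-18", "2021-02-01",
                           "2021-01-11", "2021-02-08")),
  region_id    = c("CZ010", "CZ010", "CZ010", "CZ020", "CZ020"),
  finished_vax = c(10, 20, 5, 7, 3)
)

# Label each record with its calendar month
daily_demo$month <- format(daily_demo$date, "%Y-%m")

# Sum completed immunisations per region and month
monthly <- aggregate(finished_vax ~ region_id + month,
                     data = daily_demo, FUN = sum)

# Order by region and month, then accumulate within each region
monthly <- monthly[order(monthly$region_id, monthly$month), ]
monthly$cum_vax <- ave(monthly$finished_vax, monthly$region_id, FUN = cumsum)
monthly
```

Dividing cum_vax by the region population would then give one ratio per region and month, which is the shape of data an animated or interactive map would need.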
References
[1] Portál otevřených dat (Czech Open Data Portal) - https://data.gov.cz/
[2] Onemocnění aktuálně ČR (National Dashboard of Active Diseases) - https://onemocneni-aktualne.mzcr.cz
[3] Ministerstvo zdravotnictví ČR (Ministry of Health of the Czech Republic) - https://www.mzcr.cz
[4] Ústav zdravotnických informací a statistiky ČR (Institute of Health Information and Statistics of the Czech Republic) - https://www.uzis.cz/index-en.php
[5] Český statistický úřad (Czech Statistical Office) - https://www.czso.cz/csu/czso/home
[6] Wikipedia: Districts of the Czech Republic: Municipalities with extended competence - https://en.wikipedia.org/wiki/Districts_of_the_Czech_Republic#Municipalities_with_extended_competence
[7] Wikipedia: Regions of the Czech Republic - https://en.wikipedia.org/wiki/Regions_of_the_Czech_Republic
[8] Eurostat: NUTS - https://ec.europa.eu/eurostat/web/nuts/background