Skip to content

Instantly share code, notes, and snippets.

@data-henrik
Last active November 25, 2022 01:36
Show Gist options
  • Save data-henrik/7e02c6a89b60dc35b45c23c246fced9c to your computer and use it in GitHub Desktop.
Save data-henrik/7e02c6a89b60dc35b45c23c246fced9c to your computer and use it in GitHub Desktop.
SQL statements to dig into COVID-19 data

Introduction

Let's assume a simple schema consisting of two tables.

  1. STATISTICS for COVID-19 daily statistics like confirmed cases, deaths, etc.
  2. DEMOGRAPHICS with addition per-country data like population, area, population density and more

STATISTICS

This table could have columns such as

  • country_id: Identifies the country by ISO code
  • dt: the date for the reported number
  • confirmed_cases: number of confirmed cases
  • deaths: number of confirmed deaths
  • ...

Data could come from a source like https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases or https://www.who.int/emergencies/diseases/novel-coronavirus-2019

DEMOGRAPHICS

This table could have columns such as

  • country_id: identifies the country by ISO code
  • country_name: the full country name
  • population: total population
  • area: total area
  • density: population density
  • ...

There could be many more columns. See https://unstats.un.org/unsd/demographic-social/index.cshtml for a possible data source.

--- Clause to compute delta from previous day:
---
--- Use LAG to access data from previous day. PARTITION BY country to get the same country.
--- The sorting is by the (ascending) date.
--- The clause is embedded into a simple SELECT statement to show the context. The delta for deaths
--- is computed in a similar way.
SELECT
confirmed_cases - LAG(confirmed_cases,1) OVER (PARTITION BY country_id ORDER BY dt) AS confirmed_cases_delta,
confirmed_cases,
country_id,
dt
FROM statistics
--- Clause to compute smoothed rolling average:
---
--- Use AVG() OVER with PRECEDING / FOLLOWING to smooth out reported numbers. This is necessary to account for
--- "administrative weekends" with fewer reports or missing data which is added later on.
--- The clause is embedded into a simple SELECT statement to show the context. The values for deaths
--- can be computed in a similar way.
SELECT
AVG(confirmed_cases) OVER (PARTITION BY country_id ORDER BY dt ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS cases_smoothed5,
confirmed_cases,
country_id,
dt
FROM STATISTICS
--- Fetch latest available per country data from statistics
---
--- Not all countries may have reported on the latest overall date. Thus, if
--- using MAX(dt), those countries could be missed. Therefore, use ROW_NUMBER()
--- to enumerate available data in a DESCending way by date. Thereafter, pick the
--- first row number only.
WITH latest_deaths_data AS
( SELECT country_id,
dt,
deaths,
confirmed_cases,
ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
FROM STATISTICS)
SELECT country_id,
dt,
deaths,
confirmed_cases,
rn
FROM latest_deaths_data
WHERE rn=1
--- Use the "latest data" from above and demographic information to compute the
--- mortality per 100,000 population. See here for sample results: https://coronavirus.jhu.edu/data/mortality
---
--- Mortality is computed similar to percentages as fraction from per-country deaths and population.
--- The resulting numbers are sorted DESCending and only the top 25 rows returned (LIMIT 25).
WITH latest_deaths_data AS
( SELECT country_id,
dt,
deaths,
ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
FROM STATISTICS)
SELECT s.country_id,
d.country_name,
s.dt,
s.deaths,
d.population,
(s.deaths / d.population)*100000 as mortality_rate
FROM latest_deaths_data S, DEMOGRAPHICS D
WHERE rn=1
AND s.country_id=d.country_id
ORDER BY mortality_rate DESC
LIMIT 25
--- Use RANK() to find the biggest spike in new deaths per country.
--- Sort them by the most recent date, then by the count of new deaths.
---
--- RANK() is used per country on the daily number of deaths, hence PARTITION by country
--- and sorting on deaths_delta.
--- A minimum of 100 deaths total is used for meaningful input.
SELECT s.*, d.country_name
FROM
(SELECT country_id, dt, deaths, deaths_delta,
RANK() OVER (PARTITION BY country_id ORDER BY deaths_delta DESC) deaths_delta_rank
FROM (SELECT country_id, dt, deaths, deaths_delta
FROM statistics
WHERE dt>='2020-03-01' AND deaths>=100)) s JOIN demographics d ON s.country_id=d.country_id
WHERE deaths_delta_rank=1
ORDER BY dt DESC, deaths_delta DESC
LIMIT 30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment