data-henrik/0intro.md

## 0intro.md

      
    Raw
  

              0intro.md
            
          
    Introduction

Let's assume a simple schema consisting of two tables.

STATISTICS for COVID-19 daily statistics like confirmed cases, deaths, etc.
DEMOGRAPHICS with addition per-country data like population, area, population density and more

STATISTICS

This table could have columns such as

country_id: Identifies the country by ISO code
dt: the date for the reported number
confirmed_cases: number of confirmed cases
deaths: number of confirmed deaths
...

Data could come from a source like https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases or https://www.who.int/emergencies/diseases/novel-coronavirus-2019
DEMOGRAPHICS

This table could have columns such as

country_id: identifies the country by ISO code
country_name: the full country name
population: total population
area: total area
density: population density
...

There could be many more columns. See https://unstats.un.org/unsd/demographic-social/index.cshtml for a possible data source.

  
## 10daily-delta.sql
--- Clause to compute delta from previous day:
---
--- Use LAG to access data from previous day. PARTITION BY country to get the same country.
--- The sorting is by the (ascending) date.
--- The clause is embedded into a simple SELECT statement to show the context. The delta for deaths
--- is computed in a similar way.

SELECT
   confirmed_cases - LAG(confirmed_cases,1) OVER (PARTITION BY country_id ORDER BY dt) AS confirmed_cases_delta,
   confirmed_cases,
   country_id,
   dt
FROM statistics

## 20smoothed-avg.sql
--- Clause to compute smoothed rolling average:
---
--- Use AVG() OVER with PRECEDING / FOLLOWING to smooth out reported numbers. This is necessary to account for
--- "administrative weekends" with fewer reports or missing data which is added later on.
--- The clause is embedded into a simple SELECT statement to show the context. The values for deaths
--- can be computed in a similar way.


SELECT
   AVG(confirmed_cases) OVER (PARTITION BY country_id ORDER BY dt ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS cases_smoothed5,
   confirmed_cases,
   country_id,
   dt
FROM STATISTICS

## 30latest-data.sql
--- Fetch latest available per country data from statistics
---
--- Not all countries may have reported on the latest overall date. Thus, if
--- using MAX(dt), those countries could be missed. Therefore, use ROW_NUMBER()
--- to enumerate available data in a DESCending way by date. Thereafter, pick the
--- first row number only.

WITH latest_deaths_data AS
   ( SELECT country_id,
            dt,
            deaths,
            confirmed_cases,
            ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
     FROM STATISTICS)
SELECT country_id,
       dt,
       deaths,
       confirmed_cases,
       rn
FROM latest_deaths_data
WHERE rn=1

## 40mortality.sql
--- Use the "latest data" from above and demographic information to compute the
--- mortality per 100,000 population. See here for sample results: https://coronavirus.jhu.edu/data/mortality
---
--- Mortality is computed similar to percentages as fraction from per-country deaths and population.
--- The resulting numbers are sorted DESCending and only the top 25 rows returned (LIMIT 25).

WITH latest_deaths_data AS
   ( SELECT country_id,
            dt,
            deaths,
            ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
     FROM STATISTICS)
SELECT s.country_id,
       d.country_name,
       s.dt,
       s.deaths,
       d.population,
       (s.deaths / d.population)*100000 as mortality_rate
FROM latest_deaths_data S, DEMOGRAPHICS D
WHERE rn=1
AND s.country_id=d.country_id
ORDER BY mortality_rate DESC
LIMIT 25

## 50ranked.sql
--- Use RANK() to find the biggest spike in new deaths per country.
--- Sort them by the most recent date, then by the count of new deaths.
---
--- RANK() is used per country on the daily number of deaths, hence PARTITION by country
--- and sorting on deaths_delta.
--- A minimum of 100 deaths total is used for meaningful input.

SELECT s.*, d.country_name
FROM
    (SELECT country_id, dt, deaths, deaths_delta,
            RANK() OVER (PARTITION BY country_id ORDER BY deaths_delta DESC) deaths_delta_rank
     FROM (SELECT country_id, dt, deaths, deaths_delta
           FROM statistics
           WHERE dt>='2020-03-01' AND deaths>=100)) s JOIN demographics d ON s.country_id=d.country_id
WHERE deaths_delta_rank=1
ORDER BY dt DESC, deaths_delta DESC
LIMIT 30
	--- Clause to compute delta from previous day:
	---
	--- Use LAG to access data from previous day. PARTITION BY country to get the same country.
	--- The sorting is by the (ascending) date.
	--- The clause is embedded into a simple SELECT statement to show the context. The delta for deaths
	--- is computed in a similar way.

	SELECT
	confirmed_cases - LAG(confirmed_cases,1) OVER (PARTITION BY country_id ORDER BY dt) AS confirmed_cases_delta,
	confirmed_cases,
	country_id,
	dt
	FROM statistics
	--- Clause to compute smoothed rolling average:
	---
	--- Use AVG() OVER with PRECEDING / FOLLOWING to smooth out reported numbers. This is necessary to account for
	--- "administrative weekends" with fewer reports or missing data which is added later on.
	--- The clause is embedded into a simple SELECT statement to show the context. The values for deaths
	--- can be computed in a similar way.


	SELECT
	AVG(confirmed_cases) OVER (PARTITION BY country_id ORDER BY dt ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS cases_smoothed5,
	confirmed_cases,
	country_id,
	dt
	FROM STATISTICS
	--- Fetch latest available per country data from statistics
	---
	--- Not all countries may have reported on the latest overall date. Thus, if
	--- using MAX(dt), those countries could be missed. Therefore, use ROW_NUMBER()
	--- to enumerate available data in a DESCending way by date. Thereafter, pick the
	--- first row number only.

	WITH latest_deaths_data AS
	( SELECT country_id,
	dt,
	deaths,
	confirmed_cases,
	ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
	FROM STATISTICS)
	SELECT country_id,
	dt,
	deaths,
	confirmed_cases,
	rn
	FROM latest_deaths_data
	WHERE rn=1
	--- Use the "latest data" from above and demographic information to compute the
	--- mortality per 100,000 population. See here for sample results: https://coronavirus.jhu.edu/data/mortality
	---
	--- Mortality is computed similar to percentages as fraction from per-country deaths and population.
	--- The resulting numbers are sorted DESCending and only the top 25 rows returned (LIMIT 25).

	WITH latest_deaths_data AS
	( SELECT country_id,
	dt,
	deaths,
	ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
	FROM STATISTICS)
	SELECT s.country_id,
	d.country_name,
	s.dt,
	s.deaths,
	d.population,
	(s.deaths / d.population)*100000 as mortality_rate
	FROM latest_deaths_data S, DEMOGRAPHICS D
	WHERE rn=1
	AND s.country_id=d.country_id
	ORDER BY mortality_rate DESC
	LIMIT 25
	--- Use RANK() to find the biggest spike in new deaths per country.
	--- Sort them by the most recent date, then by the count of new deaths.
	---
	--- RANK() is used per country on the daily number of deaths, hence PARTITION by country
	--- and sorting on deaths_delta.
	--- A minimum of 100 deaths total is used for meaningful input.

	SELECT s.*, d.country_name
	FROM
	(SELECT country_id, dt, deaths, deaths_delta,
	RANK() OVER (PARTITION BY country_id ORDER BY deaths_delta DESC) deaths_delta_rank
	FROM (SELECT country_id, dt, deaths, deaths_delta
	FROM statistics
	WHERE dt>='2020-03-01' AND deaths>=100)) s JOIN demographics d ON s.country_id=d.country_id
	WHERE deaths_delta_rank=1
	ORDER BY dt DESC, deaths_delta DESC
	LIMIT 30