vallka/e.md

Created March 31, 2021 21:34
This is the second part of my exploration of the pandas package. Part 1 can be found here: https://www.vallka.com/blog/leaning-pandas-data-manipulation-and-visualization-covid-19-in-scotland-statistics/

All data are taken from the official site:

weekly-deaths-by-location-age-sex.xlsx

https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/general-publications/weekly-and-monthly-data-on-births-and-deaths/deaths-involving-coronavirus-covid-19-in-scotland/related-statistics

Let's quickly repeat the data-loading process from Part 1:

import numpy as np
import pandas as pd

data_2021 = pd.read_excel('https://www.nrscotland.gov.uk/files//statistics/covid19/weekly-deaths-by-location-age-sex.xlsx',
        sheet_name='Data',
        skiprows=4,
        skipfooter=2,
        usecols='A:F',
        header=None,
        names=['week','location','sex','age','cause','deaths']
        )

data_2021['year'] = data_2021.week.str.slice(0,2).astype(int)
data_2021['week'] = data_2021.week.str.slice(3,5).astype(int)

data_1519 = pd.read_excel ('https://www.nrscotland.gov.uk/files//statistics/covid19/weekly-deaths-by-location-age-group-sex-15-19.xlsx',
        sheet_name='Data',
        skiprows=4,
        skipfooter=2,
        usecols='A:F',
        header=None,
        names=['year','week','location','sex','age','deaths'])

data_1519['cause'] = 'Pre-COVID-19'

data = data_1519.copy()
data = data.append(data_2021, ignore_index=True)
data
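A side note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the same concatenation can be written with pd.concat. A minimal sketch, with toy frames standing in for the real data_1519 and data_2021:

```python
import pandas as pd

# Toy stand-ins for the real data_1519 and data_2021 frames
data_1519 = pd.DataFrame({'year': [15, 15], 'week': [1, 2], 'deaths': [1100, 1050]})
data_2021 = pd.DataFrame({'year': [21, 21], 'week': [1, 2], 'deaths': [1500, 1400]})

# pd.concat is the forward-compatible equivalent of DataFrame.append
data = pd.concat([data_1519, data_2021], ignore_index=True)
print(len(data))            # 4
print(data.loc[3, 'year'])  # 21
```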

One note here. In Part 1 I used a simple assignment of a DataFrame to make a copy. This is wrong. Simple assignment does not create a copy of a DataFrame; it just creates another reference to it, so all changes to the second DataFrame affect the original one. To make things even more interesting, pandas provides two versions of the copy() function - copy(deep=False) and copy(deep=True), with deep=True being the default. Here is some discussion about all three.

It is not yet clear to me what the difference is between simple assignment and a shallow copy() (deep=False), nor do my results confirm that discussion. But a deep copy() (with the default parameter deep=True) seems to work as expected every time.
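The assignment-vs-copy behaviour is easy to demonstrate on a toy frame. A minimal sketch (the shallow copy(deep=False) case is left out, since its data-sharing semantics differ between pandas versions, especially once copy-on-write is enabled):

```python
import pandas as pd

df = pd.DataFrame({'deaths': [10, 20, 30]})

alias = df        # plain assignment: a second name for the SAME object
deep = df.copy()  # deep copy (deep=True is the default): independent data

alias.loc[0, 'deaths'] = 999

print(df.loc[0, 'deaths'])    # 999 - changing the alias changed the original
print(deep.loc[0, 'deaths'])  # 10  - the deep copy is unaffected
```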

Let's look at the original data_1519 to ensure we didn't modify it accidentally:

data_1519

Let's quickly create the totals:

totals = data.groupby(['year','week']).agg({'deaths': np.sum})
# One new column per year; totals.loc[y] selects year y's rows in the multi-index
for y in range(15, 22):
    totals.loc[y, f'deaths_{y}'] = totals['deaths']
totals
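An alternative worth noting: instead of building the deaths_15 … deaths_21 columns by hand, DataFrame.pivot can spread the years into columns in one step. A sketch on a toy frame (not the approach used above):

```python
import pandas as pd

# Toy stand-in for the (year, week, deaths) totals
totals = pd.DataFrame({
    'year':   [15, 15, 16, 16],
    'week':   [1, 2, 1, 2],
    'deaths': [1100, 1050, 1000, 980],
})

# One row per week, one column per year
wide = totals.pivot(index='week', columns='year', values='deaths')
print(wide.columns.tolist())  # [15, 16]
print(wide.loc[1, 15])        # 1100
```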

Let's get rid of the multi-index and transform it into additional columns - I still think this is the quickest way of plotting the data in a single plot:

totals.reset_index(inplace=True)
totals
totals.plot(x='week',y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19','deaths_20','deaths_21'],
    title='Total deaths by week',figsize=(12,8))

Again, let's look at the current year only. OK, a couple of weeks have passed since I wrote Part 1. How many weeks are in this year now? And a more general question: how many weeks are in each year?

wpy = totals.groupby('year').agg({'week':np.max})
wpy['week']
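The answer can also be derived without the data, assuming the NRS registration weeks follow ISO-8601 numbering: most years have 52 weeks, but a year starting on a Thursday (or a leap year starting on a Wednesday) has 53. The standard library can confirm this:

```python
from datetime import date

# December 28 always falls in the last ISO week of its year,
# so its week number equals the number of ISO weeks in that year.
def iso_weeks(year):
    return date(year, 12, 28).isocalendar()[1]

print(iso_weeks(2015))  # 53
print(iso_weeks(2019))  # 52
print(iso_weeks(2020))  # 53
```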

totals[totals['week']<=wpy['week'][21]].plot(x='week',
    y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19','deaths_20','deaths_21'],
    title='Total deaths by week',figsize=(12,8))

Let's move on and add cumulative values (or 'running totals'). Thanks to pandas, this is very easy compared to SQL. (There should be other ways to achieve the same result; I used the one that seemed simplest to me at this stage of learning.)

# Running total within each year for every per-year column
for y in range(15, 22):
    totals[f'cumdeaths_{y}'] = totals.groupby('year')[f'deaths_{y}'].cumsum()

totals
totals.plot(x='week',
    y=['cumdeaths_15','cumdeaths_16','cumdeaths_17','cumdeaths_18','cumdeaths_19','cumdeaths_20','cumdeaths_21'],
    title='Total deaths by week',figsize=(12,8))

For full-year totals, 2020 is definitely the worst. On this graph we can also clearly see that the years have different lengths (in weeks), so it is not quite correct to compare 2020, with its 53 weeks, to 2019, with only 52. Which year looks worse on this graph - 2015 or 2018?

Again, let's take only the beginning of the year:

totals[totals['week']<=wpy['week'][21]].plot(x='week',
   y=['cumdeaths_15','cumdeaths_16','cumdeaths_17','cumdeaths_18','cumdeaths_19','cumdeaths_20','cumdeaths_21'],
   title='Total deaths by week',figsize=(12,8))

OK, what's next? I think it would be interesting to look at a year's worth of data counting back from the current date. That is, from March 2020 to March 2021 - and compare it to the previous years. How to do this? Not obvious...
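One possible approach (a sketch under an assumption, not necessarily the best way): since the rows are already ordered by (year, week), the trailing year is simply the last 52 weekly rows of the totals frame, regardless of where the year boundary falls. With a toy frame standing in for the real totals:

```python
import pandas as pd

# Toy stand-in for totals: ISO year 2020 had 53 weeks, 2021 shown up to week 12
totals = pd.DataFrame({
    'year': [20] * 53 + [21] * 12,
    'week': list(range(1, 54)) + list(range(1, 13)),
    'deaths': 1000,  # placeholder value
})

totals = totals.sort_values(['year', 'week']).reset_index(drop=True)
last_52 = totals.tail(52)  # the most recent 52 weekly rows, crossing the year boundary
print(last_52[['year', 'week']].iloc[0].tolist())   # [20, 14]
print(last_52[['year', 'week']].iloc[-1].tolist())  # [21, 12]
```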
