Skip to content

Instantly share code, notes, and snippets.

@vallka
Created March 14, 2021 19:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save vallka/ec6ac989aef1f1e26c5f282e84040984 to your computer and use it in GitHub Desktop.
Save vallka/ec6ac989aef1f1e26c5f282e84040984 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "covid-in-scotland-part1.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyPutYPmv6l4NUYC3DWedHfu",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.5 64-bit ('covid': pipenv)",
"metadata": {
"interpreter": {
"hash": "40f24880dc83019c30db16cc638ca1ae07502c8c6b3d9394a009f1d54cd49bf9"
}
}
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/vallka/covid/blob/main/covid_in_scotland_part1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q5UUxmV0QkEE"
},
"source": [
"Let's do a little bit of pandas. Pandas is (or are? :) extremely popular. Let's just dive in.\n",
"\n",
"First of all, using pandas is the easiest way to open an Excel file. Just one line of code. Let's take, for example, this file - \n",
"\n",
"Weekly deaths by location of death, age group, sex and cause, 2020 and 2021\n",
"(10 March 2021)\n",
"\n",
"weekly-deaths-by-location-age-sex.xlsx\n",
"\n",
"Taken from official the site: \n",
"\n",
"https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/general-publications/weekly-and-monthly-data-on-births-and-deaths/deaths-involving-coronavirus-covid-19-in-scotland/related-statistics\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "V_plFfPQTk0s",
"outputId": "f3a2a034-9afd-4647-e6c7-5da4b0c933bf"
},
"source": [
"import pandas as pd\n",
"\n",
"data = pd.read_excel ('https://www.nrscotland.gov.uk/files//statistics/covid19/weekly-deaths-by-location-age-sex.xlsx',\n",
" sheet_name='Data',\n",
" skiprows=4,\n",
" skipfooter=2,\n",
" usecols='A:F',\n",
" header=None,\n",
" names=['week','location','sex','age','cause','deaths']\n",
" )\n",
"\n",
"data"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "TffwBmMBUUTi"
},
"source": [
"First of all, we can read the file directly from the Internet. The few parameters we used are self-explaining: we need to skip the first four rows, two last rows also of no interest. We also need to specify column headers, as in the original file, they somehow occupy both third and fourth rows and pandas is not able to detect them correctly.\n",
"There are a lot of data, we'll try to use most of them later. But the first task will be to get simple totals by week.\n",
"In SQL this would be achieved by using group by:\n",
"\n",
" select sum(deaths) from data group by week\n",
"\n",
"What is the equivalent in pandas?"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 450
},
"id": "WRH_-WVtT5y0",
"outputId": "aa9fb381-422c-4d7d-f50e-d8cc0d16048f"
},
"source": [
"import numpy as np\r\n",
"data.groupby('week').agg({'deaths': np.sum})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "yg6w4sR7XsJg"
},
"source": [
"*group by* became **groupby()** method. A bit messy with **sum**: firstly, we had to import numpy as np, secondly, we had to use an additional function - **agg()**. The good news, with **agg()** we can use much more statistical function than any flavour of SQL would allow. np.median is probably the most noticeable example - there is no simple way to get median with SQL. And with **agg()** we can use our own functions as well.\n",
"\n",
"Let's plot a graph immediately! "
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 296
},
"id": "BlAAmGFqZ4P-",
"outputId": "abca7df0-223b-48c0-9b38-0ea7421201f2"
},
"source": [
"data.groupby('week').agg({'deaths': np.sum}).plot()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "guPPR-djb2op"
},
"source": [
"Let's make it a bit bigger and add a title:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "xZB6IGq-Z5l2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 638
},
"outputId": "520e3942-6679-4bd2-91db-507b87d11e0a"
},
"source": [
"data.groupby('week').agg({'deaths': np.sum}).plot(figsize=(24,10),title='Total deaths by week')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "XYeRj4mLcbHx"
},
"source": [
"figsize is measured in inches, according to documentation. 24 x 10 seems to be fine for my screen.\n",
"\n",
"Ok, we have a nice big chart with totals for the years 2020-2021. Let's add data for previous years, they just happened to be on the same web page:\n",
"\n",
"Weekly deaths by location of death, age group and sex, 2015 to 2019\n",
"\n",
"weekly-deaths-by-location-age-group-sex-15-19.xlsx\n",
"\n",
"Web page:\n",
"\n",
"https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/general-publications/weekly-and-monthly-data-on-births-and-deaths/deaths-involving-coronavirus-covid-19-in-scotland/related-statistics \n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "BAgef32Qe4Xs",
"outputId": "459310d4-4f39-4600-c3b4-a58d58595a9a"
},
"source": [
"data_1519 = pd.read_excel ('https://www.nrscotland.gov.uk/files//statistics/covid19/weekly-deaths-by-location-age-group-sex-15-19.xlsx',\r\n",
" sheet_name='Data',\r\n",
" skiprows=4,\r\n",
" skipfooter=2,\r\n",
" usecols='A:F',\r\n",
" header=None,\r\n",
" names=['year','week','location','sex','age','deaths'])\r\n",
"\r\n",
"data_1519"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zCE9CeLVfROK"
},
"source": [
"The structure of the data is almost the same. There is no 'cause' column. Let's just add it:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "M8O7U000gHaG"
},
"source": [
"data_1519['cause'] = 'Pre-COVID-19'"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "afZ9SMr_gUSp"
},
"source": [
"Let's rearrange the colums to make the DataFrame closer to our first data:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "l5-Y1J1pgQbU",
"outputId": "76b2237d-e968-4249-b8d1-7668b0b98ba5"
},
"source": [
"neworder =['year','week','location','sex','age','cause','deaths']\r\n",
"data_1519 = data_1519.reindex(columns=neworder)\r\n",
"data_1519"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "OUMGQbObgunF"
},
"source": [
"Now we need to do something with 'week' column of the first table. Let's split it into year and week to match the second table:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "aQYsobZbhDeA",
"outputId": "b3831c80-8840-4c26-e3ab-a7b1dd17c386"
},
"source": [
"data_2021 = data # let's keep original DataFrame as is and work with a copy from now on\r\n",
"\r\n",
"data_2021['year'] = data_2021.week.str.slice(0,2).astype(int)\r\n",
"data_2021['week'] = data_2021.week.str.slice(3,5).astype(int)\r\n",
"\r\n",
"data_2021"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "KnOMlXgbiciu"
},
"source": [
"And rearrage columns to match"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "7CFkaB8oiFTt",
"outputId": "1ebeb8d5-8edf-4f9c-a0f1-f929ccca0b1e"
},
"source": [
"neworder =['year','week','location','sex','age','cause','deaths']\r\n",
"data_2021 = data_2021.reindex(columns=neworder)\r\n",
"data_2021"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q60yWrMRj4ve"
},
"source": [
"We have to update our groupby function to reflect this change:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 312
},
"id": "JC8dyWEEjjA9",
"outputId": "a1a8fc8b-bc64-452b-9c70-aa2586f50fa4"
},
"source": [
"data_2021.groupby(['year','week']).agg({'deaths': np.sum}).plot(title='Total deaths by week')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YM0I1j5ykuiA"
},
"source": [
"Let's plot the 15/19 data the same way:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 312
},
"id": "keCJT_3XkmsN",
"outputId": "c07d305e-1abc-44b5-b3ac-9f71a0fe4031"
},
"source": [
"data_1519.groupby(['year','week']).agg({'deaths': np.sum}).plot(title='Total deaths by week')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "brtFcjD1qesA"
},
"source": [
"Apparently this is not what we wanted. Let's find an easy way forward.\n",
"\n",
"I think it would make sense to save our groupby'd data in a new DataFrame and play with saved totals:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 450
},
"id": "Z2tlZ5ojleKx",
"outputId": "5af59dca-3ff2-4ce1-c059-e2bf00c06232"
},
"source": [
"totals_1519 = data_1519.groupby(['year','week']).agg({'deaths': np.sum})\r\n",
"totals_1519['deaths_15'] = None\r\n",
"totals_1519['deaths_16'] = None\r\n",
"totals_1519['deaths_17'] = None\r\n",
"totals_1519['deaths_18'] = None\r\n",
"totals_1519['deaths_19'] = None\r\n",
"\r\n",
"totals_1519"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aErvu7Yxzw-Q",
"outputId": "602bbbca-ba08-4a3b-e53f-075377b1b88a"
},
"source": [
"totals_1519.shape"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "oHIylggcvnbL"
},
"source": [
"What I wanted to do: create columns death_xx for each year, None by default, and then copy actual values only for a given year, leaving Nones in the columns for other years. And I couldn't accomplish this before I have realized that the DataFrame structure is somewhat different from what I expected - it has 6 columns (deaths\tdeaths_15\tdeaths_16\tdeaths_17\tdeaths_18\tdeaths_19) and one Multiindex (year,week) (I think Multiindex corresponds to a familiar Composite index in SQL). And the index isn't counted as columns, so we need to re-structure the DataFrame, using the following function:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "sKnEBwVvtt7d",
"outputId": "d86275e8-8a0d-4e2b-f2f4-96f2ac37243b"
},
"source": [
"totals_1519.reset_index(inplace=True)\r\n",
"totals_1519"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "WEK6HXXa0Pqu"
},
"source": [
"Let's populate deaths_xx columns as we wanted:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "gZFodx9ose1K",
"outputId": "613c833b-e7af-414d-8cc6-998189e938d7"
},
"source": [
"totals_1519.loc[totals_1519['year']==15,'deaths_15']=totals_1519['deaths']\r\n",
"totals_1519.loc[totals_1519['year']==16,'deaths_16']=totals_1519['deaths']\r\n",
"totals_1519.loc[totals_1519['year']==17,'deaths_17']=totals_1519['deaths']\r\n",
"totals_1519.loc[totals_1519['year']==18,'deaths_18']=totals_1519['deaths']\r\n",
"totals_1519.loc[totals_1519['year']==19,'deaths_19']=totals_1519['deaths']\r\n",
"totals_1519"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "b73cQ-RD0WAR"
},
"source": [
"Now we can plot all 6 lines in a single graph:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 312
},
"id": "NMkhZOkfx5MS",
"outputId": "0ef269c3-5711-45bc-c2b0-c484d1e12573"
},
"source": [
"totals_1519.plot(x='week',y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19'],title='Total deaths by week')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QPtOnWcp0fjL"
},
"source": [
"Let's do the same with the 20/21 data:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "moaAMg2E0omJ",
"outputId": "353b2651-faf2-48e2-931b-6bc10c027d6d"
},
"source": [
"totals_2021 = data_2021.groupby(['year','week']).agg({'deaths': np.sum})\r\n",
"totals_2021['deaths_20'] = None\r\n",
"totals_2021['deaths_21'] = None\r\n",
"totals_2021.reset_index(inplace=True)\r\n",
"totals_2021.loc[totals_2021['year']==20,'deaths_20']=totals_2021['deaths']\r\n",
"totals_2021.loc[totals_2021['year']==21,'deaths_21']=totals_2021['deaths']\r\n",
"totals_2021"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hMeDKkdz9nX3"
},
"source": [
"Now we are ready to combine two DataFrames into a final totals DataFrame. Tho thing to notice: \n",
"* pandas puts NaN, non None, in empty cells. Doesn't affect us in this case\n",
"* it is really difficult to predict which operations are performed in place and which return the new DataFrame...\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jXdlqV3hi3Ez"
},
"source": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "pA8SsVF62H4x",
"outputId": "f29bca54-42c0-4670-9bd0-a3ffbbcf8cc6"
},
"source": [
"totals = totals_1519\r\n",
"totals=totals.append(totals_2021,ignore_index=True)\r\n",
"totals"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "43xSFykD-lj5"
},
"source": [
"Let's plot the final totals:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 638
},
"id": "VJq7ORA-3aQg",
"outputId": "c3421b5d-d767-40d4-9318-10cd1867cb38"
},
"source": [
"totals.plot(x='week',y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19','deaths_20','deaths_21'],title='Total deaths by week',figsize=(24,10))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "8VI6iazr-we7"
},
"source": [
"And to finish today's exercises let's concentrate on the first 9 weeks of the years:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 638
},
"id": "UaBoUAvq3FiX",
"outputId": "f86beddc-9da1-484b-d4a2-4c8421f15039"
},
"source": [
"totals[totals['week']<=9].plot(x='week',y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19','deaths_20','deaths_21'],title='Total deaths in Scotland by week',figsize=(24,10))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "u-ThHHCi-_Kf"
},
"source": [
"To be continued..."
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment