Skip to content

Instantly share code, notes, and snippets.

@seandavi
Last active September 21, 2020 13:34
Show Gist options
  • Save seandavi/e3589d1320ed86ff454299edc7995c91 to your computer and use it in GitHub Desktop.
Save seandavi/e3589d1320ed86ff454299edc7995c91 to your computer and use it in GitHub Desktop.
sars2pack jupyter notebook for Cornell comp bio talk
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# sars2pack\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "init",
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"knitr::opts_chunk$set(warning=FALSE,message=FALSE, cache=TRUE,\n",
" fig.width=8, fig.height=6, out.width = '100%'\n",
" )\n",
"knitr::opts_knit$set(upload.fun = knitr::imgur_upload)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questions addressed by sars2pack\n",
"\n",
"- What are the current and historical total, new cases, and deaths of COVID-19 at the city, county, state, national, and international levels?\n",
"- How do changes in infection rates differ across locations?\n",
"- What are the non-pharmacological interventions in place at the local and national levels?\n",
"- In the United States, what is the geographical distribution of healthcare capacity (ICU beds, total beds, doctors, etc.)?\n",
"- What are the published values of key epidemic parameters, as curated from the literature?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"eval": false
},
"outputs": [],
"source": [
"# If you do not have BiocManager installed:\n",
"install.packages('BiocManager')\n",
"options(Ncpus=6)\n",
"\n",
"# Then, if sars2pack is not already installed:\n",
"BiocManager::install('seandavi/sars2pack')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the one-time installation, load the packge to get started."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(sars2pack)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Available datasets\n",
"\n",
"The sars2pack package has a computable catalog of available datasets. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"# Load some convenience libraries\n",
"suppressPackageStartupMessages({\n",
"library(knitr)\n",
"library(kableExtra)\n",
"library(tibble)\n",
"library(dplyr)\n",
"library(purrr)\n",
"library(sars2pack)\n",
"library(yaml)})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"# access the available datasets\n",
"ad = available_datasets()\n",
"nrow(ad)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"table(unlist(ad$data_type))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And take a look at the available datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"ad %>% arrange(data_type) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset characterization via the above factors is really useful for exploring data. For example, we can examine datasets and their related:\n",
"\n",
"- names\n",
"- geographic resolution\n",
"- data types covered\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(visNetwork)\n",
"ad = available_datasets()\n",
"ad$resolution[[12]]=''\n",
"nodes = bind_rows(\n",
" list(\n",
" data.frame(id=unique(ad$accessor),label=unique(ad$accessor),\n",
" title=\"dataset\", color='red'),\n",
" data.frame(id=unique(unlist(ad$data_type)),title=\"type\",\n",
" label = unique(unlist(ad$data_type)), color='blue'),\n",
" data.frame(id=unique(unlist(ad$resolution)),title=\"resolution\",\n",
" label=unique(unlist(ad$resolution)), color='green')\n",
" )\n",
")\n",
"edges = data.frame(ad %>% dplyr::select(accessor,data_type) %>% tidyr::unnest(data_type))\n",
"e2 = data.frame(ad %>% dplyr::select(accessor,resolution) %>% tidyr::unnest(resolution))\n",
"\n",
"colnames(edges)=c('from','to')\n",
"colnames(e2)=c('from','to')\n",
"edges = bind_rows(list(edges,e2))\n",
"visNetwork(nodes=nodes,edges=edges) %>% \n",
" visOptions(highlightNearest = list(enabled = T, degree = 0, hover = T), \n",
" width=1200, height=1200) %>% \n",
" visIgraphLayout(layout='layout_in_circle', smooth=TRUE) %>% \n",
" visNodes(shape='ellipse', font=list(size=28)) %>%\n",
" visSave(file='dataset_network.html', selfcontained = FALSE, background='black')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"browseURL('dataset_network.html')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each dataset is associated with extensive documentation, including provenance and original URL or reference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"help(nytimes_state_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case tracking\n",
"\n",
"Updated tracking of city, county, state, national, and international confirmed cases, deaths,\n",
"and testing is critical to driving policy, implementing interventions, and measuring their effectiveness. Case tracking datasets include date, a count of cases, and usually numerous other pieces of information related to location of reporting, etc. \n",
"\n",
"Accessing case-tracking datasets is typically done with one function per dataset. The example here is data from the European Centers for Disease Control, or ECDC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "worldwide"
},
"outputs": [],
"source": [
"ecdc = ecdc_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get a quick overview of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"head(ecdc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ecdc` dataset is just a `data.frame` (actually, a `tibble`), so applying standard R or tidyverse functionality can get answers to basic questions with little code. The next code block generates a `top10` of countries with the most deaths recorded to date. Note that if you do this on your own computer, the data will be updated to today's data values. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"results": "asis"
},
"outputs": [],
"source": [
"library(dplyr)\n",
"top10 = ecdc %>% dplyr::filter(subset=='deaths') %>% \n",
" dplyr::group_by(location_name) %>%\n",
" dplyr::filter(count==max(count)) %>%\n",
" dplyr::arrange(desc(count)) %>%\n",
" head(10) %>% dplyr::select(-starts_with('iso'),-continent,-subset) %>%\n",
" dplyr::mutate(rate_per_100k = 1e5*count/population_2019)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, present a nice table of those countries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"top10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the spread of the pandemic throughout the world by examining cumulative deaths\n",
"reported for the top 10 countries above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "plotcases"
},
"outputs": [],
"source": [
"ecdc_top10 = ecdc %>% filter(location_name %in% top10$location_name & subset=='deaths')\n",
"plot_epicurve(ecdc_top10,\n",
" filter_expression = count > 10, \n",
" color='location_name')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing the features of disease spread is easiest if all curves are shifted to \n",
"\"start\" at the same absolute level of infection. In this case, shift the origin for\n",
"all countries to start at the first time point when more than 100 cumulative cases\n",
"had been observed. Note how some curves cross others which is evidence of less infection\n",
"control at the same relative time in the pandemic for that country (eg., Brazil)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 0
},
"outputs": [],
"source": [
"ecdc_top10 %>% align_to_baseline(count>100,group_vars=c('location_name')) %>%\n",
" plot_epicurve(date_column = 'index',color='location_name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(plotly)\n",
"p = ecdc_top10 %>% dplyr::filter(count>100) %>%\n",
" plot_epicurve(color='location_name') %>% ggplotly(height=800)\n",
"p"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How much testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Start by finding testing datasets. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ad %>% dplyr::filter(sapply(data_type, function(x) 'testing' %in% x))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Focusing on the United States, we can take a look at CovidTracker."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"help('covidtracker_data')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ct = covidtracker_data() %>%\n",
" dplyr::filter(positive>0 & as.integer(fips)<60) %>%\n",
" dplyr::mutate(tests=positive+negative,positive=positive)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next code block simply sets up a trace of the amount of testing done per state over the last two months. Axes are log10(positivity_rate) and log10(tests_performed)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p = ct %>% dplyr::filter(positive>0 & as.integer(fips)<60) %>% dplyr::group_by(state) %>%\n",
" dplyr::filter(date > max(date)-56) %>% \n",
" dplyr::ungroup() %>%\n",
" dplyr::mutate(tests=positive+negative,positive=positive) %>%\n",
" ggplot(aes(x=tests,y=positive/tests, color=state,group=state)) + \n",
" geom_line(alpha=0.7) + theme_light() +\n",
" scale_y_log10() + scale_x_log10() +\n",
" #xlab(expression(log[10](tests))) + ylab(expression(log[10]('proportion positive tests'))) +\n",
" theme(legend.position='none')\n",
"library(plotly)\n",
"ggplotly(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Contributions\n",
"\n",
"Pull requests are gladly accepted on [Github](https://github.com/seandavi/sars2pack).\n",
"\n",
"### Adding new datasets\n",
"\n",
"See the **Adding new datasets** vignette. \n",
"\n",
"## Similar work\n",
"\n",
"- <https://github.com/emanuele-guidotti/COVID19>\n",
"- [Top 25 R resources on Novel COVID-19\n",
" Coronavirus](https://towardsdatascience.com/top-5-r-resources-on-covid-19-coronavirus-1d4c8df6d85f)\n",
"- [COVID-19 epidemiology with\n",
" R](https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/)\n",
"- <https://github.com/RamiKrispin/coronavirus>\n",
"- [Youtube: Using R to analyze\n",
" COVID-19](https://www.youtube.com/watch?v=D_CNmYkGRUc)\n",
"- [DataCamp: Visualize the rise of COVID-19 cases globally with\n",
" ggplot2](https://www.datacamp.com/projects/870)\n",
"- [MackLavielle/covidix R\n",
" package](https://github.com/MarcLavielle/covidix/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "results,tags,name,eval,-all",
"main_language": "R",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.0.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment