@dandanxu
Created March 17, 2015 14:57
{
"metadata": {
"name": "",
"signature": "sha256:2ff2d646f1f92260460f297b9fbeabaf3049dd0c368c9a772bd7f852d648b4b5"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SolveBio Tutorial\n",
"## 2015-01-27 Average Age of Diagnosis in TCGA\n",
"SolveBio provides programmatic access to genomic reference data.\n",
"In this demo, we will use SolveBio's Python package, together with plot.ly and numpy, to quickly analyze and visualize patient characteristics from The Cancer Genome Atlas (TCGA) project."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from solvebio import login, Dataset, Filter\n",
"import numpy as np\n",
"import plotly.plotly as py\n",
"import plotly.tools as tls\n",
"\n",
"from plotly.graph_objs import Data, Layout, XAxis, YAxis, Figure, Box\n",
"\n",
"# Load local SolveBio credentials\n",
"login()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we need to open the TCGA Patient Information dataset. To find interesting datasets to analyze, and to explore the data before querying, you can browse the available fields in the [SolveBio Data Library](https://www.solvebio.com/library/TCGA/1.2.0-2015-02-11/PatientInformation)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tcga = Dataset.retrieve('TCGA/PatientInformation')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we're conducting this analysis by cancer type, we first need to pull out all the possible values for cancer type (aka `cancer_abbreviation`) in this dataset. This is easily accomplished by looking through the dataset's fields and pulling out the 'facets', or possible values. Set `limit=0` to pull out the entire list of unique values."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cancers = [x[0] for x in tcga.fields('cancer_abbreviation').facets(limit=0)['facets']]\n",
"print \"Cancer types: {0}\".format(','.join(cancers))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Cancer types: BRCA,GBM,OV,UCEC,KIRC,LUAD,HNSC,THCA,LUSC,LGG,COAD,STAD,PRAD,SKCM,LIHC,BLCA,CESC,KIRP,SARC,LAML,PCPG,READ,ESCA,PAAD,TGCT,THYM,KICH,ACC,UVM,MESO,UCS,DLBC,CHOL\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have the cancer types we're interested in, we can go through the PatientInformation dataset and pull out the fields we want to analyze. Today, we're going to look at each patient's age at the initial pathologic diagnosis of their cancer. We'll filter out the records where the age wasn't recorded."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cancer_and_age = []\n",
"print \"Retrieving data for cancer type:\"\n",
"for cancer in cancers:\n",
" print \"{0}\".format(cancer), \n",
" f = ~Filter(age_at_initial_pathologic_diagnosis='[Not Available]') & \\\n",
" Filter(cancer_abbreviation=cancer)\n",
" results = tcga.query(fields='age_at_initial_pathologic_diagnosis',\n",
" filters=f)\n",
" ages = [int(r['age_at_initial_pathologic_diagnosis']) for r in results]\n",
" cancer_and_age.append({'cancer_type': cancer, 'ages': ages})"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Retrieving data for cancer type:\n",
"BRCA "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"GBM "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"OV "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"UCEC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"KIRC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"LUAD "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"HNSC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"THCA "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"LUSC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"LGG "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"COAD "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"STAD "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"PRAD "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"SKCM "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"LIHC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"BLCA "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CESC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"KIRP "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"SARC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"LAML "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"PCPG "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"READ "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"ESCA "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"PAAD "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"TGCT "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"THYM "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"KICH "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"ACC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"UVM "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"MESO "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"UCS "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"DLBC "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CHOL\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have the age of diagnosis for every patient in TCGA, by cancer type, let's sort the data by median age for each cancer with numpy and visualize the data with plot.ly."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cancer_and_age = sorted(cancer_and_age, key = lambda x: np.median(x['ages']))\n",
"\n",
"data = Data([Box(y=cancer['ages'], name=cancer['cancer_type'])\n",
" for cancer in cancer_and_age])\n",
"layout = Layout(\n",
" title='Age of Diagnosis for TCGA Patients by Cancer Type',\n",
" xaxis=XAxis(title='Cancer Type'),\n",
" yaxis=YAxis(title='Age of Diagnosis')\n",
")\n",
"fig = Figure(data=data, layout=layout)\n",
"\n",
"plot_url = py.plot(fig, filename='age-of-diagnosis-for-tcga-patients', auto_open=False)\n",
"tls.embed(plot_url)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plot.ly/~dandanxu/60.embed\" height=\"525\" width=\"100%\"></iframe>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"<plotly.tools.PlotlyDisplay at 0x1075b9c10>"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results are as we expect, based on the unique epidemiology of each cancer. For example, we know that testicular germ cell tumors (TGCT) are most common in men between the ages of 15 and 35. This is a pretty simple analysis, but there's a lot of data in SolveBio's TCGA datasets that is ripe for analysis. See the [SolveBio Data Library](https://www.solvebio.com/library) to find your favorite datasets."
]
}
],
"metadata": {}
}
]
}