Skip to content

Instantly share code, notes, and snippets.

@jamescasbon
Last active February 26, 2018 02:03
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save jamescasbon/4678244 to your computer and use it in GitHub Desktop.
Save jamescasbon/4678244 to your computer and use it in GitHub Desktop.
VCF pandas demo
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "VCF in pandas"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Pandas as a VCF data format\n\nExample using pandas to represent VCF data. This currently uses pvcf and then converts the results, but we would want to change that if this is good idea.\n\nThis next block loads some sample data:\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import pandas as pd\nimport numpy as np\n\nimport vcf\nreader = vcf.Reader(file('gatk.vcf'))\n\nn = 0\ndf = {}\nfor line in reader:\n linedat = {\n 'CHROM': line.CHROM,\n 'POS': line.POS,\n 'REF': line.REF,\n 'ALT': tuple(line.ALT)\n }\n for sample, call in zip(reader.samples, line.samples):\n datum = dict(linedat)\n datum['sample'] = sample\n datum.update(zip(call.data._fields, call.data))\n df[n] = datum\n n = n+1\n\ndata = pd.DataFrame.from_dict(df, orient='index')\ndata = data.set_index(['CHROM', 'POS', 'REF', 'sample'])\n",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The data is converted from VCF's wide format to a a long format where we have one sample per row.\n\nThe first four columns below are a hierarchical index.\n\nTODO: the ALT/REF column are not quite right. We don't want to pivot on them (i.e. they are almost part of the index)."
},
{
"cell_type": "code",
"collapsed": false,
"input": "data.head()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th>GT</th>\n <th>AD</th>\n <th>GQ</th>\n <th>PL</th>\n <th>ALT</th>\n <th>DP</th>\n </tr>\n <tr>\n <th>CHROM</th>\n <th>POS</th>\n <th>REF</th>\n <th>sample</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"5\" valign=\"top\">chr22</th>\n <th rowspan=\"5\" valign=\"top\">42522392</th>\n <th rowspan=\"5\" valign=\"top\">G</th>\n <th>BLANK</th>\n <td> 0/0</td>\n <td> [6, 0]</td>\n <td> 18.04</td>\n <td> [0, 18, 211]</td>\n <td> (A)</td>\n <td> 6</td>\n </tr>\n <tr>\n <th>NA12878</th>\n <td> 0/1</td>\n <td> [138, 107]</td>\n <td> 99.00</td>\n <td> [1961, 0, 3049]</td>\n <td> (A)</td>\n <td> 250</td>\n </tr>\n <tr>\n <th>NA12891</th>\n <td> 0/1</td>\n <td> [169, 77]</td>\n <td> 99.00</td>\n <td> [1038, 0, 3533]</td>\n <td> (A)</td>\n <td> 250</td>\n </tr>\n <tr>\n <th>NA12892</th>\n <td> 0/0</td>\n <td> [249, 0]</td>\n <td> 99.00</td>\n <td> [0, 600, 5732]</td>\n <td> (A)</td>\n <td> 250</td>\n </tr>\n <tr>\n <th>NA19238</th>\n <td> 0/0</td>\n <td> [248, 1]</td>\n <td> 99.00</td>\n <td> [0, 627, 6191]</td>\n <td> (A)</td>\n <td> 250</td>\n </tr>\n </tbody>\n</table>\n</div>",
"output_type": "pyout",
"prompt_number": 3,
"text": " GT AD GQ PL ALT DP\nCHROM POS REF sample \nchr22 42522392 G BLANK 0/0 [6, 0] 18.04 [0, 18, 211] (A) 6\n NA12878 0/1 [138, 107] 99.00 [1961, 0, 3049] (A) 250\n NA12891 0/1 [169, 77] 99.00 [1038, 0, 3533] (A) 250\n NA12892 0/0 [249, 0] 99.00 [0, 600, 5732] (A) 250\n NA19238 0/0 [248, 1] 99.00 [0, 627, 6191] (A) 250"
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can get back to the wide format by unstacking the deepest level of the index, the sample.te\nThis returns almost VCF, but with the column axis swapped.\n\nNote: we filter the samples just to make the data width small enough to render."
},
{
"cell_type": "code",
"collapsed": false,
"input": "subdata = data[data.index.map(lambda x: x[3] in ['NA12891', 'NA12878'])]\nsubdata.unstack().head()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th></th>\n <th></th>\n <th colspan=\"2\" halign=\"left\">GT</th>\n <th colspan=\"2\" halign=\"left\">AD</th>\n <th colspan=\"2\" halign=\"left\">GQ</th>\n <th colspan=\"2\" halign=\"left\">PL</th>\n <th colspan=\"2\" halign=\"left\">ALT</th>\n <th colspan=\"2\" halign=\"left\">DP</th>\n </tr>\n <tr>\n <th></th>\n <th></th>\n <th>sample</th>\n <th>NA12878</th>\n <th>NA12891</th>\n <th>NA12878</th>\n <th>NA12891</th>\n <th>NA12878</th>\n <th>NA12891</th>\n <th>NA12878</th>\n <th>NA12891</th>\n <th>NA12878</th>\n <th>NA12891</th>\n <th>NA12878</th>\n <th>NA12891</th>\n </tr>\n <tr>\n <th>CHROM</th>\n <th>POS</th>\n <th>REF</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"5\" valign=\"top\">chr22</th>\n <th>42522392</th>\n <th>G</th>\n <td> 0/1</td>\n <td> 0/1</td>\n <td> [138, 107]</td>\n <td> [169, 77]</td>\n <td> 99</td>\n <td> 99</td>\n <td> [1961, 0, 3049]</td>\n <td> [1038, 0, 3533]</td>\n <td> (A)</td>\n <td> (A)</td>\n <td> 250</td>\n <td> 250</td>\n </tr>\n <tr>\n <th>42522613</th>\n <th>G</th>\n <td> 0/1</td>\n <td> 0/0</td>\n <td> [118, 127]</td>\n <td> [241, 0]</td>\n <td> 99</td>\n <td> 99</td>\n <td> [2396, 0, 1719]</td>\n <td> [0, 459, 4476]</td>\n <td> (C)</td>\n <td> (C)</td>\n <td> 246</td>\n <td> 244</td>\n </tr>\n <tr>\n <th>42522755</th>\n <th>C</th>\n <td> 0/0</td>\n <td> 0/0</td>\n <td> [208, 40]</td>\n <td> [192, 56]</td>\n <td> 99</td>\n <td> 99</td>\n <td> [0, 236, 4169]</td>\n <td> [0, 114, 4292]</td>\n <td> (G)</td>\n <td> (G)</td>\n <td> 248</td>\n <td> 249</td>\n </tr>\n <tr>\n <th>42523003</th>\n <th>A</th>\n <td> 1/1</td>\n <td> 0/1</td>\n <td> [9, 173]</td>\n <td> [153, 95]</td>\n <td> 99</td>\n <td> 99</td>\n <td> [2385, 273, 0]</td>\n <td> [355, 0, 2355]</td>\n <td> (G)</td>\n <td> (G)</td>\n <td> 183</td>\n <td> 249</td>\n </tr>\n <tr>\n <th>42523077</th>\n <th>A</th>\n <td> 0/0</td>\n <td> 0/0</td>\n <td> [249, 1]</td>\n <td> [250, 0]</td>\n <td> 99</td>\n <td> 99</td>\n <td> [0, 544, 6985]</td>\n <td> [0, 577, 6968]</td>\n <td> (G)</td>\n <td> (G)</td>\n <td> 250</td>\n <td> 250</td>\n </tr>\n </tbody>\n</table>\n</div>",
"output_type": "pyout",
"prompt_number": 9,
"text": " GT AD GQ \\\nsample NA12878 NA12891 NA12878 NA12891 NA12878 NA12891 \nCHROM POS REF \nchr22 42522392 G 0/1 0/1 [138, 107] [169, 77] 99 99 \n 42522613 G 0/1 0/0 [118, 127] [241, 0] 99 99 \n 42522755 C 0/0 0/0 [208, 40] [192, 56] 99 99 \n 42523003 A 1/1 0/1 [9, 173] [153, 95] 99 99 \n 42523077 A 0/0 0/0 [249, 1] [250, 0] 99 99 \n\n PL ALT DP \\\nsample NA12878 NA12891 NA12878 NA12891 NA12878 \nCHROM POS REF \nchr22 42522392 G [1961, 0, 3049] [1038, 0, 3533] (A) (A) 250 \n 42522613 G [2396, 0, 1719] [0, 459, 4476] (C) (C) 246 \n 42522755 C [0, 236, 4169] [0, 114, 4292] (G) (G) 248 \n 42523003 A [2385, 273, 0] [355, 0, 2355] (G) (G) 183 \n 42523077 A [0, 544, 6985] [0, 577, 6968] (G) (G) 250 \n\n \nsample NA12891 \nCHROM POS REF \nchr22 42522392 G 250 \n 42522613 G 244 \n 42522755 C 249 \n 42523003 A 249 \n 42523077 A 250 "
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can use the power of pandas to easily get, for example, average depth and quality at each position:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "data.groupby(level=['CHROM', 'POS']).mean().head()\n",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>GQ</th>\n <th>DP</th>\n </tr>\n <tr>\n <th>CHROM</th>\n <th>POS</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"5\" valign=\"top\">chr22</th>\n <th>42522392</th>\n <td> 87.434286</td>\n <td> 215.142857</td>\n </tr>\n <tr>\n <th>42522613</th>\n <td> 93.805714</td>\n <td> 211.428571</td>\n </tr>\n <tr>\n <th>42522755</th>\n <td> 73.802857</td>\n <td> 215.714286</td>\n </tr>\n <tr>\n <th>42523003</th>\n <td> 84.945714</td>\n <td> 204.428571</td>\n </tr>\n <tr>\n <th>42523077</th>\n <td> 98.105714</td>\n <td> 216.857143</td>\n </tr>\n </tbody>\n</table>\n</div>",
"output_type": "pyout",
"prompt_number": 38,
"text": " GQ DP\nCHROM POS \nchr22 42522392 87.434286 215.142857\n 42522613 93.805714 211.428571\n 42522755 73.802857 215.714286\n 42523003 84.945714 204.428571\n 42523077 98.105714 216.857143"
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can also use a split, apply, join style to define our own aggregations such as call rate:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def call_rate(x): \n return 1 - float(sum(x.GT.isnull())) / len(x.GT)\n\ndata.groupby(level=['CHROM', 'POS']).apply(call_rate).tail()\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 14,
"text": "CHROM POS \nchr22 42526694 1.000000\n 42527471 0.857143\n 42527533 0.857143\n 42527793 0.857143\n 42527891 0.857143"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now add some sample groupings. \nI use a function here, but we could easily use pandas join ability to join on another data frame loaded from a spreadsheet or database."
},
{
"cell_type": "code",
"collapsed": false,
"input": "groupdata = data.copy()\ngroupdata['group'] = data.index.map(lambda x: 'case' if x[3].startswith('NA12') else 'control')\ngroupdata.head()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th>GT</th>\n <th>AD</th>\n <th>GQ</th>\n <th>PL</th>\n <th>ALT</th>\n <th>DP</th>\n <th>group</th>\n </tr>\n <tr>\n <th>CHROM</th>\n <th>POS</th>\n <th>REF</th>\n <th>sample</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"5\" valign=\"top\">chr22</th>\n <th rowspan=\"5\" valign=\"top\">42522392</th>\n <th rowspan=\"5\" valign=\"top\">G</th>\n <th>BLANK</th>\n <td> 0/0</td>\n <td> [6, 0]</td>\n <td> 18.04</td>\n <td> [0, 18, 211]</td>\n <td> (A)</td>\n <td> 6</td>\n <td> control</td>\n </tr>\n <tr>\n <th>NA12878</th>\n <td> 0/1</td>\n <td> [138, 107]</td>\n <td> 99.00</td>\n <td> [1961, 0, 3049]</td>\n <td> (A)</td>\n <td> 250</td>\n <td> case</td>\n </tr>\n <tr>\n <th>NA12891</th>\n <td> 0/1</td>\n <td> [169, 77]</td>\n <td> 99.00</td>\n <td> [1038, 0, 3533]</td>\n <td> (A)</td>\n <td> 250</td>\n <td> case</td>\n </tr>\n <tr>\n <th>NA12892</th>\n <td> 0/0</td>\n <td> [249, 0]</td>\n <td> 99.00</td>\n <td> [0, 600, 5732]</td>\n <td> (A)</td>\n <td> 250</td>\n <td> case</td>\n </tr>\n <tr>\n <th>NA19238</th>\n <td> 0/0</td>\n <td> [248, 1]</td>\n <td> 99.00</td>\n <td> [0, 627, 6191]</td>\n <td> (A)</td>\n <td> 250</td>\n <td> control</td>\n </tr>\n </tbody>\n</table>\n</div>",
"output_type": "pyout",
"prompt_number": 5,
"text": " GT AD GQ PL ALT DP \\\nCHROM POS REF sample \nchr22 42522392 G BLANK 0/0 [6, 0] 18.04 [0, 18, 211] (A) 6 \n NA12878 0/1 [138, 107] 99.00 [1961, 0, 3049] (A) 250 \n NA12891 0/1 [169, 77] 99.00 [1038, 0, 3533] (A) 250 \n NA12892 0/0 [249, 0] 99.00 [0, 600, 5732] (A) 250 \n NA19238 0/0 [248, 1] 99.00 [0, 627, 6191] (A) 250 \n\n group \nCHROM POS REF sample \nchr22 42522392 G BLANK control \n NA12878 case \n NA12891 case \n NA12892 case \n NA19238 control "
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We then use the sample group in a 'group by', to produce genotype count aggregations:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "gt_groups = (groupdata.reset_index()\n .groupby(['CHROM', 'POS', 'group'])\n .GT.value_counts()\n .unstack(level='group')\n .unstack()\n .fillna(0)\n)\ngt_groups.head()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th>group</th>\n <th colspan=\"3\" halign=\"left\">case</th>\n <th colspan=\"3\" halign=\"left\">control</th>\n </tr>\n <tr>\n <th></th>\n <th></th>\n <th>0/0</th>\n <th>0/1</th>\n <th>1/1</th>\n <th>0/0</th>\n <th>0/1</th>\n <th>1/1</th>\n </tr>\n <tr>\n <th>CHROM</th>\n <th>POS</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"5\" valign=\"top\">chr22</th>\n <th>42522392</th>\n <td> 1</td>\n <td> 2</td>\n <td> 0</td>\n <td> 4</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>42522613</th>\n <td> 1</td>\n <td> 2</td>\n <td> 0</td>\n <td> 0</td>\n <td> 4</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>42522755</th>\n <td> 2</td>\n <td> 1</td>\n <td> 0</td>\n <td> 4</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>42523003</th>\n <td> 0</td>\n <td> 2</td>\n <td> 1</td>\n <td> 0</td>\n <td> 4</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>42523077</th>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n <td> 3</td>\n <td> 1</td>\n <td> 0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"output_type": "pyout",
"prompt_number": 20,
"text": "group case control \n 0/0 0/1 1/1 0/0 0/1 1/1\nCHROM POS \nchr22 42522392 1 2 0 4 0 0\n 42522613 1 2 0 0 4 0\n 42522755 2 1 0 4 0 0\n 42523003 0 2 1 0 4 0\n 42523077 3 0 0 3 1 0"
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": "def maf(group):\n return ((2 * group['1/1']) + group['0/1']) / (2 * group.sum(axis=1))\n\ndef relative_maf(data):\n return maf(data['case']) / maf(data['control'])\n\nrelative_maf(gt_groups).head()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 30,
"text": "CHROM POS \nchr22 42522392 inf\n 42522613 0.666667\n 42522755 inf\n 42523003 1.333333\n 42523077 0.000000"
}
],
"prompt_number": 30
}
],
"metadata": {}
}
]
}
@everestial
Copy link

Nice stuff. It might come in hand in my future projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment