@theandygross
Created February 25, 2014 20:40
Firehose MAF Inconsistencies
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Firehose pipeline inconsistancies"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here I am comparing the January 15 analysis run to the September 23 analysis run. \n",
"\n",
"* The two files I am comparing are located [here (Jan. 15)](http://gdac.broadinstitute.org/runs/analyses__2014_01_15/reports/cancer/HNSC/MutSigNozzleReportCV/HNSC-TP.final_analysis_set.maf) and [here (Sep. 23)](http://gdac.broadinstitute.org/runs/analyses__2013_09_23/reports/cancer/HNSC/MutSigNozzleReportCV/HNSC-TP.final_analysis_set.maf). \n",
"* The January file seems to be a strict subset of the September file. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"firehose = 'http://gdac.broadinstitute.org/runs'\n",
"jan_date = 'analyses__2014_01_15'\n",
"sep_date = 'analyses__2013_09_23'\n",
"ext = 'reports/cancer/HNSC/MutSigNozzleReportCV'\n",
"maf_file = 'HNSC-TP.final_analysis_set.maf'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Read in MAF files from two different versioned runs"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"maf_jan = pd.read_table('{}/{}/{}/{}'.format(firehose, jan_date, ext, maf_file))\n",
"maf_jan = maf_jan.set_index(['Hugo_Symbol','Chromosome','Start_position'])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"maf_jan.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"(56282, 111)"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"maf_sep = pd.read_table('{}/{}/{}/{}'.format(firehose, sep_date, ext, maf_file))\n",
"maf_sep = maf_sep.set_index(['Hugo_Symbol','Chromosome','Start_position'])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"maf_sep.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"(56914, 111)"
]
}
],
"prompt_number": 6
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"There are 629 filtered calls in this updated file... The input MAFs have not changed on the MAF dashboard. This is not documented anywhere."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filtered_calls = maf_sep.ix[maf_sep.index.diff(maf_jan.index)]\n",
"filtered_calls.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"(629, 111)"
]
}
],
"prompt_number": 8
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"These mutations are spread across different samples"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filtered_calls.Tumor_Sample_Barcode.value_counts()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"TCGA-CN-5369-01A-01D-1434-08 20\n",
"TCGA-CV-5436-01A-01D-1512-08 16\n",
"TCGA-BB-7864-01A-11D-2229-08 11\n",
"TCGA-CV-7183-01A-11D-2012-08 10\n",
"TCGA-CV-5430-01A-02D-1683-08 8\n",
"TCGA-CV-7424-01A-11D-2078-08 6\n",
"TCGA-CV-6940-01A-11D-1912-08 6\n",
"TCGA-BA-4075-01A-01D-1434-08 6\n",
"TCGA-BB-4227-01A-01D-1870-08 6\n",
"TCGA-CV-7409-01A-31D-2229-08 6\n",
"TCGA-CV-7097-01A-11D-2012-08 6\n",
"TCGA-CN-6011-01A-11D-1683-08 5\n",
"TCGA-CV-5432-01A-02D-1683-08 5\n",
"TCGA-CR-7364-01A-11D-2012-08 5\n",
"TCGA-BB-7862-01A-21D-2229-08 5\n",
"...\n",
"TCGA-BA-4076-01A-01D-1434-08 1\n",
"TCGA-CV-6441-01A-11D-1683-08 1\n",
"TCGA-DQ-7592-01A-11D-2078-08 1\n",
"TCGA-CN-4726-01A-01D-1434-08 1\n",
"TCGA-CR-6467-01A-11D-1870-08 1\n",
"TCGA-CV-7263-01A-11D-2012-08 1\n",
"TCGA-CV-7437-01A-21D-2129-08 1\n",
"TCGA-CR-6470-01A-11D-1870-08 1\n",
"TCGA-CQ-7065-01A-11D-2078-08 1\n",
"TCGA-CV-5431-01A-01D-1512-08 1\n",
"TCGA-CV-7425-01A-11D-2078-08 1\n",
"TCGA-BA-5149-01A-01D-1512-08 1\n",
"TCGA-CR-7390-01A-11D-2012-08 1\n",
"TCGA-CV-5444-01A-02D-1512-08 1\n",
"TCGA-CN-6988-01A-11D-1912-08 1\n",
"Length: 248, dtype: int64"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are spread across different mutation types"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filtered_calls.Variant_Classification.value_counts()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"Missense_Mutation 213\n",
"Frame_Shift_Del 105\n",
"Silent 104\n",
"Frame_Shift_Ins 80\n",
"In_Frame_Del 74\n",
"Splice_Site 18\n",
"In_Frame_Ins 14\n",
"RNA 12\n",
"5'Flank 4\n",
"Translation_Start_Site 3\n",
"Nonsense_Mutation 2\n",
"dtype: int64"
]
}
],
"prompt_number": 10
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Many but not all of these are C-> T mutations. Is this the Oxog filter? "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filtered_calls.groupby(['Reference_Allele','Tumor_Seq_Allele1','Tumor_Seq_Allele2']).size().order()[::-1].head(20)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2\n",
"C T T 71\n",
"G A A 64\n",
"A C C 51\n",
"T - - 50\n",
" G G 34\n",
"- G G 31\n",
"A G G 30\n",
"T C C 25\n",
"- T T 22\n",
"A - - 20\n",
"C A A 15\n",
"- A A 15\n",
"G C C 14\n",
"- C C 14\n",
"G T T 13\n",
" - - 12\n",
"C - - 12\n",
" G G 9\n",
"GCT - - 7\n",
"GAA - - 7\n",
"dtype: int64"
]
}
],
"prompt_number": 11
}
],
"metadata": {}
}
]
}
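
The comparison in the notebook above depends on the Firehose URLs still being reachable. As a minimal offline sketch of the same idea, the snippet below checks which calls were dropped and whether one table is a strict subset of the other, keyed on the same three columns. The helper name `compare_calls`, the toy frames `sep_toy` and `jan_toy`, and every gene name and position in them are made up for illustration; none of it comes from the real MAF files.

```python
import pandas as pd

# Same key columns the notebook uses as its index.
KEY = ["Hugo_Symbol", "Chromosome", "Start_position"]


def compare_calls(earlier, later, key=KEY):
    """Return (keys only in `earlier`, keys only in `later`).

    Hypothetical helper: an outer merge with indicator=True labels each
    key as left_only, right_only, or both.
    """
    left = earlier[key].drop_duplicates()
    right = later[key].drop_duplicates()
    merged = left.merge(right, on=key, how="outer", indicator=True)
    only_earlier = merged.loc[merged["_merge"] == "left_only", key]
    only_later = merged.loc[merged["_merge"] == "right_only", key]
    return only_earlier.reset_index(drop=True), only_later.reset_index(drop=True)


# Toy stand-ins for the two runs (placeholder values, not real calls).
sep_toy = pd.DataFrame({
    "Hugo_Symbol": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "Chromosome": ["1", "2", "2", "3"],
    "Start_position": [100, 200, 300, 400],
})
jan_toy = sep_toy.iloc[[0, 1, 3]]  # the later run is missing GENE_C

dropped, gained = compare_calls(sep_toy, jan_toy)

# The later run is a strict subset if it gained nothing and lost something.
is_strict_subset = gained.empty and not dropped.empty

print(dropped)
print("strict subset:", is_strict_subset)
```

Applied to the real tables, this would be called on `maf_sep.reset_index()` and `maf_jan.reset_index()`, since the notebook moves the key columns into the index. A non-empty second return value would contradict the observation that the January file is a strict subset.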