Created
February 25, 2014 20:40
Firehose MAF Inconsistencies
{ | |
"metadata": { | |
"name": "" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "heading", | |
"level": 1, | |
"metadata": {}, | |
"source": [ | |
"Firehose pipeline inconsistencies" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import pandas as pd" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here I compare the January 15, 2014 analysis run with the September 23, 2013 analysis run. \n", | |
"\n", | |
"* The two files I am comparing are located [here (Jan. 15)](http://gdac.broadinstitute.org/runs/analyses__2014_01_15/reports/cancer/HNSC/MutSigNozzleReportCV/HNSC-TP.final_analysis_set.maf) and [here (Sep. 23)](http://gdac.broadinstitute.org/runs/analyses__2013_09_23/reports/cancer/HNSC/MutSigNozzleReportCV/HNSC-TP.final_analysis_set.maf). \n", | |
"* The January file seems to be a strict subset of the September file. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"firehose = 'http://gdac.broadinstitute.org/runs'\n", | |
"jan_date = 'analyses__2014_01_15'\n", | |
"sep_date = 'analyses__2013_09_23'\n", | |
"ext = 'reports/cancer/HNSC/MutSigNozzleReportCV'\n", | |
"maf_file = 'HNSC-TP.final_analysis_set.maf'" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 2 | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 4, | |
"metadata": {}, | |
"source": [ | |
"Read in MAF files from two different versioned runs" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"maf_jan = pd.read_table('{}/{}/{}/{}'.format(firehose, jan_date, ext, maf_file))\n", | |
"maf_jan = maf_jan.set_index(['Hugo_Symbol','Chromosome','Start_position'])" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 3 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"maf_jan.shape" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 4, | |
"text": [ | |
"(56282, 111)" | |
] | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"maf_sep = pd.read_table('{}/{}/{}/{}'.format(firehose, sep_date, ext, maf_file))\n", | |
"maf_sep = maf_sep.set_index(['Hugo_Symbol','Chromosome','Start_position'])" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"maf_sep.shape" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 6, | |
"text": [ | |
"(56914, 111)" | |
] | |
} | |
], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 4, | |
"metadata": {}, | |
"source": [ | |
"629 calls have been filtered out of the updated (January) file. The input MAFs have not changed on the MAF dashboard, and this filtering is not documented anywhere." | |
] | |
}, | |
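{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Rows of the September MAF whose (gene, chromosome, start position) key is absent from January\n", | |
"filtered_calls = maf_sep[~maf_sep.index.isin(maf_jan.index)]\n", | |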
"filtered_calls.shape" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 8, | |
"text": [ | |
"(629, 111)" | |
] | |
} | |
], | |
"prompt_number": 8 | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 4, | |
"metadata": {}, | |
"source": [ | |
"These mutations are spread across 248 different samples" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"filtered_calls.Tumor_Sample_Barcode.value_counts()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 9, | |
"text": [ | |
"TCGA-CN-5369-01A-01D-1434-08 20\n", | |
"TCGA-CV-5436-01A-01D-1512-08 16\n", | |
"TCGA-BB-7864-01A-11D-2229-08 11\n", | |
"TCGA-CV-7183-01A-11D-2012-08 10\n", | |
"TCGA-CV-5430-01A-02D-1683-08 8\n", | |
"TCGA-CV-7424-01A-11D-2078-08 6\n", | |
"TCGA-CV-6940-01A-11D-1912-08 6\n", | |
"TCGA-BA-4075-01A-01D-1434-08 6\n", | |
"TCGA-BB-4227-01A-01D-1870-08 6\n", | |
"TCGA-CV-7409-01A-31D-2229-08 6\n", | |
"TCGA-CV-7097-01A-11D-2012-08 6\n", | |
"TCGA-CN-6011-01A-11D-1683-08 5\n", | |
"TCGA-CV-5432-01A-02D-1683-08 5\n", | |
"TCGA-CR-7364-01A-11D-2012-08 5\n", | |
"TCGA-BB-7862-01A-21D-2229-08 5\n", | |
"...\n", | |
"TCGA-BA-4076-01A-01D-1434-08 1\n", | |
"TCGA-CV-6441-01A-11D-1683-08 1\n", | |
"TCGA-DQ-7592-01A-11D-2078-08 1\n", | |
"TCGA-CN-4726-01A-01D-1434-08 1\n", | |
"TCGA-CR-6467-01A-11D-1870-08 1\n", | |
"TCGA-CV-7263-01A-11D-2012-08 1\n", | |
"TCGA-CV-7437-01A-21D-2129-08 1\n", | |
"TCGA-CR-6470-01A-11D-1870-08 1\n", | |
"TCGA-CQ-7065-01A-11D-2078-08 1\n", | |
"TCGA-CV-5431-01A-01D-1512-08 1\n", | |
"TCGA-CV-7425-01A-11D-2078-08 1\n", | |
"TCGA-BA-5149-01A-01D-1512-08 1\n", | |
"TCGA-CR-7390-01A-11D-2012-08 1\n", | |
"TCGA-CV-5444-01A-02D-1512-08 1\n", | |
"TCGA-CN-6988-01A-11D-1912-08 1\n", | |
"Length: 248, dtype: int64" | |
] | |
} | |
], | |
"prompt_number": 9 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"They are also spread across 11 different variant classifications" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"filtered_calls.Variant_Classification.value_counts()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 10, | |
"text": [ | |
"Missense_Mutation 213\n", | |
"Frame_Shift_Del 105\n", | |
"Silent 104\n", | |
"Frame_Shift_Ins 80\n", | |
"In_Frame_Del 74\n", | |
"Splice_Site 18\n", | |
"In_Frame_Ins 14\n", | |
"RNA 12\n", | |
"5'Flank 4\n", | |
"Translation_Start_Site 3\n", | |
"Nonsense_Mutation 2\n", | |
"dtype: int64" | |
] | |
} | |
], | |
"prompt_number": 10 | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 4, | |
"metadata": {}, | |
"source": [ | |
"Many, but not all, of these are C->T mutations. Is this the OxoG filter?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"filtered_calls.groupby(['Reference_Allele','Tumor_Seq_Allele1','Tumor_Seq_Allele2']).size().sort_values()[::-1].head(20)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 11, | |
"text": [ | |
"Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2\n", | |
"C T T 71\n", | |
"G A A 64\n", | |
"A C C 51\n", | |
"T - - 50\n", | |
" G G 34\n", | |
"- G G 31\n", | |
"A G G 30\n", | |
"T C C 25\n", | |
"- T T 22\n", | |
"A - - 20\n", | |
"C A A 15\n", | |
"- A A 15\n", | |
"G C C 14\n", | |
"- C C 14\n", | |
"G T T 13\n", | |
" - - 12\n", | |
"C - - 12\n", | |
" G G 9\n", | |
"GCT - - 7\n", | |
"GAA - - 7\n", | |
"dtype: int64" | |
] | |
} | |
], | |
"prompt_number": 11 | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
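The notebook above compares the two runs on the key `(Hugo_Symbol, Chromosome, Start_position)`. The two shapes differ by 632 rows (56914 vs. 56282), while the key-based difference finds 629 rows. One possible explanation, not verified here, is that the key is not unique per call (the same position can be mutated in more than one sample), so a few dropped rows may share a key with rows that remain. The following is a minimal sketch of a check that reports both numbers side by side and tests the "strict subset" claim on keys. The helper name `compare_mafs` and the toy data are hypothetical; only the three key column names and `Tumor_Sample_Barcode` come from the notebook.

```python
import pandas as pd

# Composite key used in the notebook to identify a call.
KEY = ["Hugo_Symbol", "Chromosome", "Start_position"]


def compare_mafs(old, new, key=KEY):
    """Summarise how two MAF-like DataFrames differ on a composite key.

    Hypothetical helper: `old` and `new` hold the key columns as
    ordinary columns (not as the index).
    """
    old_keys = pd.MultiIndex.from_frame(old[key])
    new_keys = pd.MultiIndex.from_frame(new[key])

    # Rows whose key does not occur at all in the other table.
    only_old = old[~old_keys.isin(new_keys)]
    only_new = new[~new_keys.isin(old_keys)]

    return {
        "rows_only_in_old": len(only_old),
        "rows_only_in_new": len(only_new),
        "row_count_difference": len(old) - len(new),
        # Repeated keys explain why the two numbers above can disagree.
        "duplicate_keys_in_old": int(old_keys.duplicated().sum()),
        "duplicate_keys_in_new": int(new_keys.duplicated().sum()),
        # True when every key in `new` also occurs in `old`.
        "new_keys_subset_of_old": len(only_new) == 0,
    }


# Toy data standing in for the September (old) and January (new) MAFs.
sep = pd.DataFrame({
    "Hugo_Symbol": ["TP53", "TP53", "NOTCH1", "FAT1", "FAT1"],
    "Chromosome": ["17", "17", "9", "4", "4"],
    "Start_position": [100, 200, 300, 400, 400],
    "Tumor_Sample_Barcode": ["S1", "S2", "S1", "S3", "S4"],
})
jan = sep.iloc[[0, 2]]

summary = compare_mafs(old=sep, new=jan)
for name, value in summary.items():
    print(name, value)
```

With the real files, `old` would be the September table and `new` the January table, read with `pd.read_table` as in the notebook but without calling `set_index`. Adding `Tumor_Sample_Barcode` to `key` would probably give a key closer to one row per call, though that is an assumption about the MAF contents.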