Skip to content

Instantly share code, notes, and snippets.

@gngdb
Created June 19, 2014 14:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gngdb/6fd55578799d5c9bb91a to your computer and use it in GitHub Desktop.
Save gngdb/6fd55578799d5c9bb91a to your computer and use it in GitHub Desktop.
draft notebook extracting GO features
{
"metadata": {
"name": "",
"signature": "sha256:b23e93486e20b6aff97fda2c314462af8b5520f5d167aac630da914b91510d87"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The steps involved in extracting usable features are:\n",
"\n",
"1. Retrieving Gene Ontology annotations for each Entrez IDs\n",
"2. Comparing matches for all possible protein pairs\n",
"\n",
"## Retrieving Gene Ontology annotations for each Entrez ID\n",
"\n",
"To do this we must:\n",
"\n",
"1. Map the Entrez IDs onto corresponding Gene Ontology IDs\n",
"2. Use [goatools][] to retrieve GO terms from `gene_ontology.1_2.obo`\n",
"3. Add these to a dictionary using the Entrez IDs as keys with dictionaries for each term\n",
"\n",
"Alternatively, it looks like these terms are in the `gene2go` file retreived from the [ncbi ftp site][gene2], meaning that we wouldn't need to use [goatools][]:\n",
"\n",
"[goatools]: https://github.com/tanghaibao/goatools/#find-go-enrichment-of-genes-under-study\n",
"[gene2]: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd ../../geneconversion/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/home/gavin/Documents/MRes/geneconversion\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head gene2go"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"#Format: tax_id GeneID GO_ID Evidence Qualifier GO_term PubMed Category (tab is used as a separator, pound sign - start of a comment)\r\n",
"3702\t814629\tGO:0005634\tISM\t-\tnucleus\t-\tComponent\r\n",
"3702\t814629\tGO:0008150\tND\t-\tbiological_process\t-\tProcess\r\n",
"3702\t814630\tGO:0003700\tISS\t-\tsequence-specific DNA binding transcription factor activity\t11118137\tFunction\r\n",
"3702\t814630\tGO:0005634\tISM\t-\tnucleus\t-\tComponent\r\n",
"3702\t814630\tGO:0006355\tTAS\t-\tregulation of transcription, DNA-templated\t11118137\tProcess\r\n",
"3702\t814630\tGO:0006499\tRCA\t-\tN-terminal protein myristoylation\t-\tProcess\r\n",
"3702\t814630\tGO:0006635\tRCA\t-\tfatty acid beta-oxidation\t-\tProcess\r\n",
"3702\t814630\tGO:0006891\tRCA\t-\tintra-Golgi vesicle-mediated transport\t-\tProcess\r\n",
"3702\t814630\tGO:0016558\tRCA\t-\tprotein import into peroxisome matrix\t-\tProcess\r\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, it's not certain and goatools seems to produce fairly good results, so using that for now:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"c = csv.reader(open(\"gene2go\"), delimiter=\"\\t\")\n",
"#skip first line\n",
"c.next()\n",
"#initialise dictionary\n",
"goEntrezdict = {}\n",
"for line in c:\n",
" #on every line use the Entrez ID as a key and initialise a dictionary\n",
" goEntrezdict[line[1]] = {}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then iterating through the file again we can add entries to this dictionary, each of which is a dictionary containing empty lists indexed by GO IDs:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"c = csv.reader(open(\"gene2go\"), delimiter=\"\\t\")\n",
"c.next()\n",
"for line in c:\n",
" goEntrezdict[line[1]][line[2]] = []"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 208
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieving GO terms with goatools\n",
"\n",
"Next we use goatools, along with the Gene Ontology flat database file `gene_ontology.1_2.obo`.\n",
"Iterating through the Entrez IDs we have found and then iterating over the GO IDs which they refer to:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd ../geneontology/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/home/gavin/Documents/MRes/geneontology\n"
]
}
],
"prompt_number": 209
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import goatools"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 211
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"parsedobo = goatools.obo_parser.GODag('gene_ontology.1_2.obo')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"load obo file gene_ontology.1_2.obo\n",
"42995"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
" nodes imported\n"
]
}
],
"prompt_number": 212
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for EntrezID in goEntrezdict.keys():\n",
" for goID in goEntrezdict[EntrezID].keys():\n",
" goEntrezdict[EntrezID][goID] = parsedobo[goID].name"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 213
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at an example entry:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print goEntrezdict[goEntrezdict.keys()[0]]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{'GO:0016021': 'integral component of membrane'}\n"
]
}
],
"prompt_number": 214
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Looking at crossover\n",
"\n",
"We are looking for annotations that will cross over between pairs of proteins.\n",
"To do this we need to pick a number of annotations we can test each pair of Entrez IDs for.\n",
"These should be annotations that are likely to occur in pairs, any annotations that occur in isolation in our set will be useless.\n",
"\n",
"So we have to count all the occurrences of every term in our dictionary of dictionaries.\n",
"Start by flattening the list:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"gotermlist = []\n",
"for EntrezID in goEntrezdict.keys():\n",
" for goID in goEntrezdict[EntrezID].keys():\n",
" gotermlist.append(goEntrezdict[EntrezID][goID])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 215
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then count duplicates in the list using a dictionary:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dupdict = {}\n",
"#initialise counters\n",
"for k in gotermlist:\n",
" dupdict[k] = 0\n",
"#count\n",
"for k in gotermlist:\n",
" dupdict[k] += 1"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 217
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import heapq"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 223
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#find term that is repeated most often\n",
"print \"Terms that occurred most often:\"\n",
"for k in heapq.nlargest(10, dupdict, key=lambda x: dupdict[x]):\n",
" print \"\\t\"+\"%s occurred %i times\"%(k,dupdict[k])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Terms that occurred most often:\n",
"\tmolecular_function occurred 52423 times\n",
"\tbiological_process occurred 49804 times\n",
"\tcellular_component occurred 38992 times\n",
"\tnucleus occurred 38516 times\n",
"\tcytoplasm occurred 28668 times\n",
"\tintegral component of membrane occurred 25343 times\n",
"\tmembrane occurred 21333 times\n",
"\tplasma membrane occurred 18145 times\n",
"\tprotein binding occurred 15731 times\n",
"\tmitochondrion occurred 12599 times\n"
]
}
],
"prompt_number": 225
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three most common terms are __not terms__, they are domains.\n",
"There must be a problem with goatools, or a problem with the database file.\n",
"The rest of the most common terms are all localisations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting by domain\n",
"\n",
"There are [three domains][gohelp] in the Gene Ontology database.\n",
"We could split the above to count within each of these quite easily.\n",
"This was also done by [Qi][qiweb] when using the Gene Ontology as features.\n",
"\n",
"Repeating the above for each domain:\n",
"\n",
"[gohelp]: http://www.geneontology.org/GO.doc.shtml\n",
"[qiweb]: http://www.cs.cmu.edu/%7Eqyj/papers_sulp/proteins05_pages/features.html"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd ../geneconversion/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/home/gavin/Documents/MRes/geneconversion\n"
]
}
],
"prompt_number": 229
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"c = csv.reader(open(\"gene2go\"), delimiter=\"\\t\")\n",
"c.next()\n",
"for line in c:\n",
" goEntrezdict[line[1]][line[2]] = {}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 231
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for EntrezID in goEntrezdict.keys():\n",
" for goID in goEntrezdict[EntrezID].keys():\n",
" goEntrezdict[EntrezID][goID][parsedobo[goID].namespace] = parsedobo[goID].name"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 232
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print goEntrezdict[goEntrezdict.keys()[4]]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{'GO:0005575': {'cellular_component': 'cellular_component'}, 'GO:0045596': {'biological_process': 'negative regulation of cell differentiation'}, 'GO:0000332': {'molecular_function': 'template for synthesis of G-rich strand of telomere DNA activity'}, 'GO:0007004': {'biological_process': 'telomere maintenance via telomerase'}}\n"
]
}
],
"prompt_number": 235
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"gotermdict = {}\n",
"#initialise dictionaries\n",
"for EntrezID in goEntrezdict.keys():\n",
" for goID in goEntrezdict[EntrezID].keys():\n",
" for domain in goEntrezdict[EntrezID][goID].keys():\n",
" gotermdict[domain] = {}\n",
"#initialise counters in dictionaries\n",
"for EntrezID in goEntrezdict.keys():\n",
" for goID in goEntrezdict[EntrezID].keys():\n",
" for domain in goEntrezdict[EntrezID][goID].keys():\n",
" term = goEntrezdict[EntrezID][goID][domain]\n",
" gotermdict[domain][term] = 0\n",
"#count\n",
"for EntrezID in goEntrezdict.keys():\n",
" for goID in goEntrezdict[EntrezID].keys():\n",
" for domain in goEntrezdict[EntrezID][goID].keys():\n",
" term = goEntrezdict[EntrezID][goID][domain]\n",
" gotermdict[domain][term] += 1"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 248
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for k in gotermdict.keys():\n",
" print \"Domain %s:\"%k\n",
" for ke in heapq.nlargest(10, gotermdict[k], key=lambda x: gotermdict[k][x]):\n",
" print \"\\t\"+\"%s occurred %i times\"%(ke,gotermdict[k][ke])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Domain molecular_function:\n",
"\tmolecular_function occurred 52423 times\n",
"\tprotein binding occurred 15731 times\n",
"\tmetal ion binding occurred 12410 times\n",
"\tATP binding occurred 10840 times\n",
"\tDNA binding occurred 10210 times\n",
"\tzinc ion binding occurred 8135 times\n",
"\tnucleotide binding occurred 7783 times\n",
"\tsequence-specific DNA binding transcription factor activity occurred 7446 times\n",
"\thydrolase activity occurred 5134 times\n",
"\tG-protein coupled receptor activity occurred 4838 times\n",
"Domain cellular_component:\n",
"\tcellular_component occurred 38992 times\n",
"\tnucleus occurred 38516 times\n",
"\tcytoplasm occurred 28668 times\n",
"\tintegral component of membrane occurred 25343 times\n",
"\tmembrane occurred 21333 times\n",
"\tplasma membrane occurred 18145 times\n",
"\tmitochondrion occurred 12599 times\n",
"\tcytosol occurred 11760 times\n",
"\textracellular region occurred 8696 times\n",
"\textracellular vesicular exosome occurred 6349 times\n",
"Domain biological_process:\n",
"\tbiological_process occurred 49804 times\n",
"\tregulation of transcription, DNA-templated occurred 10962 times\n",
"\ttranscription, DNA-templated occurred 7165 times\n",
"\tmetabolic process occurred 7122 times\n",
"\ttransport occurred 6638 times\n",
"\tsignal transduction occurred 6188 times\n",
"\tG-protein coupled receptor signaling pathway occurred 6024 times\n",
"\toxidation-reduction process occurred 5158 times\n",
"\tproteolysis occurred 4574 times\n",
"\ttranslation occurred 4021 times\n"
]
}
],
"prompt_number": 250
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The question is, which terms should we pick to be effective features?\n",
"Obviously we want ones with a lot of crossover so there'll actually be something happening.\n",
"But, we also want to make sure they'll be useful on the proteins we're actually going to use the classifier on at some point.\n",
"The terms with the largest crossover on the the bait and prey proteins from crossover experiments, taking from all three domains, would probably be the best choice."
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment