Created
January 19, 2016 13:03
-
-
Save cthoyt/ab428243af37b95c7c79 to your computer and use it in GitHub Desktop.
Biological Databases Exercise 2015
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Week 2: Download and analyse UniProt result files" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"1. Please goto http://www.uniprot.org/uniprot/\n", | |
"2. In the search tab, type in <font color='red'>ache</font> and click search.\n", | |
"3. Above the heading <font color='green'>Protein Names</font>, you will find **<font color='blue'>Columns</font>**, click it.\n", | |
"4. This will show you a page where you can choose additional columns that will be displayed for performing *<font color='green'>step 2</font>* in this guideline.\n", | |
"5. Now choose the following:\n", | |
" * **Organism ID** from **Names and Taxonomy**\n", | |
" * **Sequence** from **Sequences**\n", | |
" * **EC number** from **Function**\n", | |
" * **Features** from **Miscellaneous**\n", | |
" * **Gene Ontology (GO)** from **Gene Ontology**\n", | |
" * **PubMed ID** from **Publications**\n", | |
" * **BioGrid** from **Protein-protein Interaction**\n", | |
" * **dbSNP** from **Polymorphism and mutation**\n", | |
"6. Once done, scroll up the page and click *save*.\n", | |
"7. Now Click **Download** which you will find above column **Entry name**.\n", | |
"8. Change the format to **Tab-separated** and choose **uncompressed** and click **GO** to download the file.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now execute the following lines by pressing **Ctrl+Enter**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%load_ext sql\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%sql mysql+pymysql://root:YourPassword@localhost/" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"*If the above step fails to show **<font color='red'>u'Connected: root@</font>**, this means you have to execute* **<font color='green'>mysqld.exe</font>** in C:\\Program Files\\MySQL\\MySQL Server 5.7\\bin as an administrator using command line. **Hint**: *Please perform step 5 and 6 of the pdf that's inside MySQL folder that you downloaded on the 1st day. **Filename**: HowTo_StartMySQL(blablabla).pdf*" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"create database if not exists biodb;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"use biodb;\n", | |
"drop table if exists ache;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"CREATE TABLE ache (\n", | |
" Entry varchar(255) DEFAULT NULL,\n", | |
" Entry_name VARCHAR(255) DEFAULT NULL,\n", | |
" Status char(255) DEFAULT NULL,\n", | |
" Protein_names varchar(255) DEFAULT NULL,\n", | |
" Gene_names VARCHAR(255) DEFAULT NULL,\n", | |
" Organism char(255) DEFAULT NULL,\n", | |
" Length int(11) DEFAULT NULL,\n", | |
" Sequence text,\n", | |
" Organism_ID text,\n", | |
" EC_number text,\n", | |
" Features text,\n", | |
" Gene_Ontology text, \n", | |
" PMIDs text);" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"show tables;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"describe ache;" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# What have we done so far?\n", | |
"1. Downloaded **ACHE** related data from uniprot\n", | |
"2. Created a database named **biodb** and table named **ache** with some empty columns.\n", | |
"\n", | |
"Note: Remember that the number of columns and the order in table **ache** should be the same as in the file we've downloaded. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# What next?\n", | |
"We will now import the data into our database **<font color='red'>biodb</font>**" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To import the data please execute the following in <font color='green'>command on mysql shell</font> or <font color='red'>ipython notebook</font> (here). This depends where you are working on.\n", | |
"\n", | |
"**mysql -u root -pYourPassword biodb -e \"LOAD DATA LOCAL INFILE '<font color='red'>file location</font>' INTO TABLE ACHE FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\"**\n", | |
"\n", | |
"Note: file location example: C:/Users/jrkarki/Desktop/guidelines/Proteins/ACHE/uniprot-ache.tab (Take care of the slashes)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!mysql -u root -plfc89dude --local-infile biodb -e \"LOAD DATA LOCAL INFILE 'C:\\Users\\jrkarki\\Desktop\\guidelines\\Proteins\\ACHE\\uniprot-ache.tab' INTO TABLE ache FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"Select count(*) from ache limit 10;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"Select * from ache limit 2;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"Select distinct Status from ache;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"Select count(*) number_of_status_type, Status from ache group by Status" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"result = %%sql select count(*),substr(Organism,1,20) from ache group by Organism order by count(*) desc limit 10;\n", | |
"result.bar()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"result.pie()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%sql\n", | |
"Select min(Length),max(Length),avg(Length),std(Length) from ache;" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"result = %%sql Select * from ache where Organism like 'Homo%'\n", | |
"result" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"my_pmids = result[0].PMIDs.split(\"; \")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import urllib2, json\n", | |
"print my_pmids\n", | |
"myfile = urllib2.urlopen(\"http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=EXT_ID:\"+my_pmids[0]+\"&format=json\")\n", | |
"my_pubmed = json.loads(myfile.read())\n", | |
"print type(my_pubmed)\n", | |
"my_pubmed" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import \n", | |
"my_pubmed_result_as_python_opject = my_pubmed_result_as_string" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment