Skip to content

Instantly share code, notes, and snippets.

@cthoyt
Created January 19, 2016 13:03
Show Gist options
  • Save cthoyt/ab428243af37b95c7c79 to your computer and use it in GitHub Desktop.
Save cthoyt/ab428243af37b95c7c79 to your computer and use it in GitHub Desktop.
Biological Databases Exercise 2015
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 2: Download and analyse UniProt result files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Please goto http://www.uniprot.org/uniprot/\n",
"2. In the search tab, type in <font color='red'>ache</font> and click search.\n",
"3. Above the heading <font color='green'>Protein Names</font>, you will find **<font color='blue'>Columns</font>**, click it.\n",
"4. This will show you a page where you can choose additional columns that will be displayed for performing *<font color='green'>step 2</font>* in this guideline.\n",
"5. Now choose the following:\n",
" * **Organism ID** from **Names and Taxonomy**\n",
" * **Sequence** from **Sequences**\n",
" * **EC number** from **Function**\n",
" * **Features** from **Miscellaneous**\n",
" * **Gene Ontology (GO)** from **Gene Ontology**\n",
" * **PubMed ID** from **Publications**\n",
" * **BioGrid** from **Protein-protein Interaction**\n",
" * **dbSNP** from **Polymorphism and mutation**\n",
"6. Once done, scroll up the page and click *save*.\n",
"7. Now Click **Download** which you will find above column **Entry name**.\n",
"8. Change the format to **Tab-separated** and choose **uncompressed** and click **GO** to download the file.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now execute the following lines by pressing **Ctrl+Enter**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%load_ext sql\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%sql mysql+pymysql://root:YourPassword@localhost/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*If the above step fails to show **<font color='red'>u'Connected: root@</font>**, this means you have to execute* **<font color='green'>mysqld.exe</font>** in C:\\Program Files\\MySQL\\MySQL Server 5.7\\bin as an administrator using command line. **Hint**: *Please perform step 5 and 6 of the pdf that's inside MySQL folder that you downloaded on the 1st day. **Filename**: HowTo_StartMySQL(blablabla).pdf*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"create database if not exists biodb;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"use biodb;\n",
"drop table if exists ache;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"CREATE TABLE ache (\n",
" Entry varchar(255) DEFAULT NULL,\n",
" Entry_name VARCHAR(255) DEFAULT NULL,\n",
" Status char(255) DEFAULT NULL,\n",
" Protein_names varchar(255) DEFAULT NULL,\n",
" Gene_names VARCHAR(255) DEFAULT NULL,\n",
" Organism char(255) DEFAULT NULL,\n",
" Length int(11) DEFAULT NULL,\n",
" Sequence text,\n",
" Organism_ID text,\n",
" EC_number text,\n",
" Features text,\n",
" Gene_Ontology text, \n",
" PMIDs text);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"show tables;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"describe ache;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What have we done so far?\n",
"1. Downloaded **ACHE** related data from uniprot\n",
"2. Created a database named **biodb** and table named **ache** with some empty columns.\n",
"\n",
"Note: Remember that the number of columns and the order in table **ache** should be the same as in the file we've downloaded. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What next?\n",
"We will now import the data into our database **<font color='red'>biodb</font>**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To import the data please execute the following in <font color='green'>command on mysql shell</font> or <font color='red'>ipython notebook</font> (here). This depends where you are working on.\n",
"\n",
"**mysql -u root -pYourPassword biodb -e \"LOAD DATA LOCAL INFILE '<font color='red'>file location</font>' INTO TABLE ACHE FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\"**\n",
"\n",
"Note: file location example: C:/Users/jrkarki/Desktop/guidelines/Proteins/ACHE/uniprot-ache.tab (Take care of the slashes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!mysql -u root -plfc89dude --local-infile biodb -e \"LOAD DATA LOCAL INFILE 'C:\\Users\\jrkarki\\Desktop\\guidelines\\Proteins\\ACHE\\uniprot-ache.tab' INTO TABLE ache FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"Select count(*) from ache limit 10;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"Select * from ache limit 2;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"Select distinct Status from ache;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"Select count(*) number_of_status_type, Status from ache group by Status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result = %%sql select count(*),substr(Organism,1,20) from ache group by Organism order by count(*) desc limit 10;\n",
"result.bar()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result.pie()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%sql\n",
"Select min(Length),max(Length),avg(Length),std(Length) from ache;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result = %%sql Select * from ache where Organism like 'Homo%'\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"my_pmids = result[0].PMIDs.split(\"; \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import urllib2, json\n",
"print my_pmids\n",
"myfile = urllib2.urlopen(\"http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=EXT_ID:\"+my_pmids[0]+\"&format=json\")\n",
"my_pubmed = json.loads(myfile.read())\n",
"print type(my_pubmed)\n",
"my_pubmed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import \n",
"my_pubmed_result_as_python_opject = my_pubmed_result_as_string"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment