cthoyt/Day3.ipynb

## Day3.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 2: Download and analyse UniProt result files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Please goto http://www.uniprot.org/uniprot/\n",
    "2. In the search tab, type in <font color='red'>ache</font> and click search.\n",
    "3. Above the heading <font color='green'>Protein Names</font>, you will find **<font color='blue'>Columns</font>**, click it.\n",
    "4. This will show you a page where you can choose additional columns that will be displayed for performing *<font color='green'>step 2</font>* in this guideline.\n",
    "5. Now choose the following:\n",
    "    * **Organism ID** from **Names and Taxonomy**\n",
    "    * **Sequence** from **Sequences**\n",
    "    * **EC number** from **Function**\n",
    "    * **Features** from **Miscellaneous**\n",
    "    * **Gene Ontology (GO)** from **Gene Ontology**\n",
    "    * **PubMed ID** from **Publications**\n",
    "    * **BioGrid** from **Protein-protein Interaction**\n",
    "    * **dbSNP** from **Polymorphism and mutation**\n",
    "6. Once done, scroll up the page and click *save*.\n",
    "7. Now Click **Download** which you will find above column **Entry name**.\n",
    "8. Change the format to **Tab-separated** and choose **uncompressed** and click **GO** to download the file.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now execute the following lines by pressing **Ctrl+Enter**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%load_ext sql\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%sql mysql+pymysql://root:YourPassword@localhost/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*If the above step fails to show **<font color='red'>u'Connected: root@</font>**, this means you have to execute* **<font color='green'>mysqld.exe</font>** in C:\\Program Files\\MySQL\\MySQL Server 5.7\\bin as an administrator using command line. **Hint**: *Please perform step 5 and 6 of the pdf that's inside MySQL folder that you downloaded on the 1st day. **Filename**: HowTo_StartMySQL(blablabla).pdf*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "create database if not exists biodb;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "use biodb;\n",
    "drop table if exists ache;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "CREATE TABLE ache (\n",
    "  Entry varchar(255) DEFAULT NULL,\n",
    "  Entry_name VARCHAR(255) DEFAULT NULL,\n",
    "  Status char(255) DEFAULT NULL,\n",
    "  Protein_names varchar(255) DEFAULT NULL,\n",
    "  Gene_names VARCHAR(255) DEFAULT NULL,\n",
    "  Organism char(255) DEFAULT NULL,\n",
    "  Length int(11) DEFAULT NULL,\n",
    "  Sequence text,\n",
    "  Organism_ID text,\n",
    "  EC_number text,\n",
    "  Features text,\n",
    "  Gene_Ontology text,  \n",
    "  PMIDs text);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "show tables;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "describe ache;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# What have we done so far?\n",
    "1. Downloaded **ACHE** related data from uniprot\n",
    "2. Created a database named **biodb** and table named **ache** with some empty columns.\n",
    "\n",
    "Note: Remember that the number of columns and the order in table **ache** should be the same as in the file we've downloaded.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# What next?\n",
    "We will now import the data into our database **<font color='red'>biodb</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To import the data please execute the following in <font color='green'>command on mysql shell</font> or <font color='red'>ipython notebook</font> (here). This depends where you are working on.\n",
    "\n",
    "**mysql -u root -pYourPassword biodb -e \"LOAD DATA LOCAL INFILE '<font color='red'>file location</font>' INTO TABLE ACHE FIELDS TERMINATED BY '\\t'  IGNORE 1 LINES\"**\n",
    "\n",
    "Note: file location example: C:/Users/jrkarki/Desktop/guidelines/Proteins/ACHE/uniprot-ache.tab (Take care of the slashes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!mysql -u root -plfc89dude --local-infile biodb -e \"LOAD DATA LOCAL INFILE 'C:\\Users\\jrkarki\\Desktop\\guidelines\\Proteins\\ACHE\\uniprot-ache.tab' INTO TABLE ache FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "Select count(*) from ache limit 10;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "Select * from ache limit 2;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "Select distinct Status from ache;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "Select count(*) number_of_status_type, Status from ache group by Status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = %%sql select count(*),substr(Organism,1,20) from ache group by Organism order by count(*) desc limit 10;\n",
    "result.bar()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result.pie()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "Select min(Length),max(Length),avg(Length),std(Length) from ache;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = %%sql Select * from ache where Organism like 'Homo%'\n",
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "my_pmids = result[0].PMIDs.split(\"; \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import urllib2, json\n",
    "print my_pmids\n",
    "myfile = urllib2.urlopen(\"http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=EXT_ID:\"+my_pmids[0]+\"&format=json\")\n",
    "my_pubmed = json.loads(myfile.read())\n",
    "print type(my_pubmed)\n",
    "my_pubmed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import \n",
    "my_pubmed_result_as_python_opject = my_pubmed_result_as_string"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Week 2: Download and analyse UniProt result files"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"1. Please goto http://www.uniprot.org/uniprot/\n",
	"2. In the search tab, type in <font color='red'>ache</font> and click search.\n",
	"3. Above the heading <font color='green'>Protein Names</font>, you will find <font color='blue'>Columns</font>, click it.\n",
	"4. This will show you a page where you can choose additional columns that will be displayed for performing <font color='green'>step 2</font> in this guideline.\n",
	"5. Now choose the following:\n",
	" * Organism ID from Names and Taxonomy\n",
	" * Sequence from Sequences\n",
	" * EC number from Function\n",
	" * Features from Miscellaneous\n",
	" * Gene Ontology (GO) from Gene Ontology\n",
	" * PubMed ID from Publications\n",
	" * BioGrid from Protein-protein Interaction\n",
	" * dbSNP from Polymorphism and mutation\n",
	"6. Once done, scroll up the page and click save.\n",
	"7. Now Click Download which you will find above column Entry name.\n",
	"8. Change the format to Tab-separated and choose uncompressed and click GO to download the file.\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now execute the following lines by pressing Ctrl+Enter"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%load_ext sql\n",
	"%matplotlib inline"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%sql mysql+pymysql://root:YourPassword@localhost/"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"If the above step fails to show <font color='red'>u'Connected: root@</font>, this means you have to execute <font color='green'>mysqld.exe</font> in C:\\Program Files\\MySQL\\MySQL Server 5.7\\bin as an administrator using command line. Hint: Please perform step 5 and 6 of the pdf that's inside MySQL folder that you downloaded on the 1st day. Filename: HowTo_StartMySQL(blablabla).pdf"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"create database if not exists biodb;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"use biodb;\n",
	"drop table if exists ache;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"CREATE TABLE ache (\n",
	" Entry varchar(255) DEFAULT NULL,\n",
	" Entry_name VARCHAR(255) DEFAULT NULL,\n",
	" Status char(255) DEFAULT NULL,\n",
	" Protein_names varchar(255) DEFAULT NULL,\n",
	" Gene_names VARCHAR(255) DEFAULT NULL,\n",
	" Organism char(255) DEFAULT NULL,\n",
	" Length int(11) DEFAULT NULL,\n",
	" Sequence text,\n",
	" Organism_ID text,\n",
	" EC_number text,\n",
	" Features text,\n",
	" Gene_Ontology text, \n",
	" PMIDs text);"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"show tables;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"describe ache;"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# What have we done so far?\n",
	"1. Downloaded ACHE related data from uniprot\n",
	"2. Created a database named biodb and table named ache with some empty columns.\n",
	"\n",
	"Note: Remember that the number of columns and the order in table ache should be the same as in the file we've downloaded. "
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# What next?\n",
	"We will now import the data into our database <font color='red'>biodb</font>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"To import the data please execute the following in <font color='green'>command on mysql shell</font> or <font color='red'>ipython notebook</font> (here). This depends where you are working on.\n",
	"\n",
	"mysql -u root -pYourPassword biodb -e \"LOAD DATA LOCAL INFILE '<font color='red'>file location</font>' INTO TABLE ACHE FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\"\n",
	"\n",
	"Note: file location example: C:/Users/jrkarki/Desktop/guidelines/Proteins/ACHE/uniprot-ache.tab (Take care of the slashes)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"!mysql -u root -plfc89dude --local-infile biodb -e \"LOAD DATA LOCAL INFILE 'C:\\Users\\jrkarki\\Desktop\\guidelines\\Proteins\\ACHE\\uniprot-ache.tab' INTO TABLE ache FIELDS TERMINATED BY '\\t' IGNORE 1 LINES\""
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"Select count(*) from ache limit 10;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"Select * from ache limit 2;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"Select distinct Status from ache;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"Select count(*) number_of_status_type, Status from ache group by Status"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"result = %%sql select count(),substr(Organism,1,20) from ache group by Organism order by count() desc limit 10;\n",
	"result.bar()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"result.pie()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%%sql\n",
	"Select min(Length),max(Length),avg(Length),std(Length) from ache;"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"result = %%sql Select * from ache where Organism like 'Homo%'\n",
	"result"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"my_pmids = result[0].PMIDs.split(\"; \")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import urllib2, json\n",
	"print my_pmids\n",
	"myfile = urllib2.urlopen(\"http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=EXT_ID:\"+my_pmids[0]+\"&format=json\")\n",
	"my_pubmed = json.loads(myfile.read())\n",
	"print type(my_pubmed)\n",
	"my_pubmed"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import \n",
	"my_pubmed_result_as_python_opject = my_pubmed_result_as_string"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.11"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}