Skip to content

Instantly share code, notes, and snippets.

@chuttenh
Last active March 20, 2017 17:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save chuttenh/3c9a6a2ed573e344cf47 to your computer and use it in GitHub Desktop.
Save chuttenh/3c9a6a2ed573e344cf47 to your computer and use it in GitHub Desktop.
Genomic Data Manipulation
.ipynb_checkpoints
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modules and I/O\n",
"\n",
"---\n",
"\n",
"## Bioinformatics functions\n",
"\n",
"We've reached a point where you could now tell the computer to do just about anything you'd like with biological data: you have a good grasp of quantitative methods, you know enough Python to write nearly any program, and you have several different tools available to get data into and out of a computing environment. However, just like a spoken language, what you don't have yet is practice - it's impossible to get familiar with moving bioinformatic data around without doing it. This activity, and the next few problem sets, are designed to help provide that practice while introducing a few additional tools to make computational biology even easier and more convenient.\n",
"\n",
"The first few sections here will thus provide some short examples of how functions to perform specific biological tasks might be developed in Python. Then we'll discuss some pre-existing modules that might be useful (did I mention that \"pre-existing\" means someone else already did the work?), and wrap up with examples of data input and output using the single most common biological data format, a [FASTA file](http://en.wikipedia.org/wiki/FASTA_format) for nucleotide (and amino acid) sequences.\n",
"\n",
"### Locating binding sites\n",
"\n",
"Suppose you're studying a particular transcription factor binding site of interest, and you'd like to determine where it occurs in a set of upstream promoter sequences. We'll keep things simple and suppose that the TFBS is a single, non-degenerate string of nucleotides (an assumption that we'll learn how to work around soon), and that you've already downloaded your promoter sequences from someplace like [BioMart](http://www.biomart.org) (which we'll discuss in detail in a few weeks). The first step in your investigation of what this TFBS does is identifying in which genes' promoters it occurs.\n",
"\n",
"For the sake of argument, let's say your TFBS sequence is `TGACTCA` (bonus points if you can identify what TF this corresponds to, there are at least two correct answers), and your promoter DNA sequences (each arising from a particular gene) are:\n",
"\n",
"> `GGAGTGGTTG\n",
"ATGACTCATA\n",
"CTCCATTCGG\n",
"TTTTTTCGTG\n",
"CATTGACTCA`\n",
"\n",
"The \"correct answer\" would be to identify that the second and fifth genes' promoters include your TFBS:\n",
"\n",
"> `GGAGTGGTTG\n",
"A`**`TGACTCA`**`TA\n",
"CTCCATTCGG\n",
"TTTTTTCGTG\n",
"CAT`**`TGACTCA`**\n",
"\n",
"It's *really* hard to see this type of information using human eyes (even after it's been highlighted!), but it's a great candidate for a task that a computer would be happy to help you with. Python, like any other computational tool, exists to help you tell the computer what tasks you'd like it to help you carry out for your research.\n",
"\n",
"The first step in turning this idea into Python is to define the data you're working with:\n",
"\n",
"* One *input* must be a string representing the one or more nucleotides of the binding site motif.\n",
"\n",
"* One *input* must be a list of strings representing the nucleotides of each gene's promoter sequences.\n",
"\n",
"* For the sake of this example, we'll create as an *output* a list of booleans, of the same length as the input list, indicating `True` for each promoter that contains the motif and `False` otherwise.\n",
"\n",
"Thus our anticipated output for the example above is:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[False, True, False, False, True]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"[False, True, False, False, True]"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's write a function that accepts these inputs and produces the desired output for us. Using our knowledge of Python syntax, we know the appropriate **pseudocode** (something that has the right semantics but not the right syntax) is:\n",
"\n",
" def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" for each promoter string:\n",
" if it contains the TFBS:\n",
" remember True\n",
" else:\n",
" remember False\n",
" return a list of everything we've remembered\n",
"\n",
"Well, you know how to examine each promoter string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" for strPromoter in astrPromoters:\n",
" pass"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you know how to return a list of results (particularly a list of boolean **f**lag values) after each promoter has been processed:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" for strPromoter in astrPromoters:\n",
" pass\n",
" \n",
" return afResults"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do we fill up that list of results? Well, the key test is whether each particular promoter *contains* the TFBS, which is well-satisfied by the [Python string `find` function](http://docs.python.org/2/library/stdtypes.html?highlight=find#str.find). The `find` function **targets** a string, accepts one additional string input argument representing the substring to locate, and returns the integer index of that substring if it's found (starting with `0` as usual) or `-1` if it's not located:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"test\".find( \"es\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"1"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"taste\".find( \"es\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"-1"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"testy\".find( \"t\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"0"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"testy\".find( \"ty\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"3"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"This is a test of the emergency broadborking system\".find( \"bork\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"37"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thus if a string contains a substring, the `find` method returns a value of at least `0`. Let's use that to continue expanding our pseudocode:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" for strPromoter in astrPromoters:\n",
" if strPromoter.find( strTFBS ) >= 0:\n",
" afResults.append( True )\n",
" else:\n",
" afResults.append( False )\n",
" \n",
" return afResults"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks like it might be close, but if we try it out on a very simple test case, we have a problem:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "global name 'afResults' is not defined",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-10-2f3390af411d>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mbindingSites\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m\"T\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"A\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"C\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"T\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"G\"\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m<ipython-input-9-cc9d235805c6>\u001b[0m in \u001b[0;36mbindingSites\u001b[0;34m(strTFBS, astrPromoters)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mafResults\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0mTrue\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mafResults\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0mFalse\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mafResults\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: global name 'afResults' is not defined"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, when reading Python errors, you should pay attention to two pieces of the garbledegook: *what* it said went wrong, at the very bottom, and *where* it said it went wrong (what line of code). Here, Python's telling us that a variable named `afResults` wasn't defined when we tried to use it on line `7`. It might be hard to tell exactly which promoter we were processing at the time, so we can always add `print` statements to help see what's in Python's head while it runs our code:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" for strPromoter in astrPromoters:\n",
" print( \"I'm at location 0, and strPromoter is: \" + strPromoter )\n",
" if strPromoter.find( strTFBS ) >= 0:\n",
" afResults.append( True )\n",
" else:\n",
" print( \"I'm at location 1, and strPromoter is: \" + strPromoter )\n",
" afResults.append( False )\n",
" \n",
" return afResults\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "global name 'afResults' is not defined",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-11-723aff200950>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mafResults\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mbindingSites\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m\"T\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"A\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"C\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"T\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"G\"\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m<ipython-input-11-723aff200950>\u001b[0m in \u001b[0;36mbindingSites\u001b[0;34m(strTFBS, astrPromoters)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mprint\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m\"I'm at location 1, and strPromoter is: \"\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mstrPromoter\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mafResults\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0mFalse\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mafResults\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: global name 'afResults' is not defined"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"I'm at location 0, and strPromoter is: A\n",
"I'm at location 1, and strPromoter is: A\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So it's crashing on the very first promoter sequence, `\"A\"`. If you look closely, though, the reason why has nothing to do with the promoter sequence itself - we've forgotten to tell Python what the initial value of `afResults` should be! After all, we know we'd like it to end up as a list containing boolean values, and we've included instructions to `append` new values into a pre-existing list, but we never gave Python an empty list to fill. Let's fix that and see what happens:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" afResults = []\n",
" for strPromoter in astrPromoters:\n",
" print( \"I'm at location 0, and strPromoter is: \" + strPromoter )\n",
" if strPromoter.find( strTFBS ) >= 0:\n",
" afResults.append( True )\n",
" else:\n",
" print( \"I'm at location 1, and strPromoter is: \" + strPromoter )\n",
" afResults.append( False )\n",
" \n",
" return afResults\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"I'm at location 0, and strPromoter is: A\n",
"I'm at location 1, and strPromoter is: A\n",
"I'm at location 0, and strPromoter is: C\n",
"I'm at location 1, and strPromoter is: C\n",
"I'm at location 0, and strPromoter is: T\n",
"I'm at location 0, and strPromoter is: G\n",
"I'm at location 1, and strPromoter is: G\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"[False, False, True, False]"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not only are our results correct now, the remaining `print` statements let you see exactly where Python's **program counter** is during (almost) every step of the process and what data it was modifying at each point. We can take those `print`s out now that the function appears to be working:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" afResults = []\n",
" for strPromoter in astrPromoters:\n",
" if strPromoter.find( strTFBS ) >= 0:\n",
" afResults.append( True )\n",
" else:\n",
" afResults.append( False )\n",
" \n",
" return afResults\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"[False, False, True, False]"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can even simplify the function a bit by repacing our `if`/`else` combination with a single `append` of an equivalent value:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" afResults = []\n",
" for strPromoter in astrPromoters:\n",
" fFound = ( strPromoter.find( strTFBS ) >= 0 )\n",
" afResults.append( fFound )\n",
" \n",
" return afResults\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"[False, False, True, False]"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or we can get even shorter by cutting out the middleman:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" afResults = []\n",
" for strPromoter in astrPromoters:\n",
" afResults.append( strPromoter.find( strTFBS ) >= 0 )\n",
" \n",
" return afResults\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"[False, False, True, False]"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't want to see Python syntax we haven't covered yet, close your eyes, but if you're curious you can even use a [list comprehension](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) to make this even shorter:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" afResults = [( s.find( strTFBS ) >= 0 ) for s in astrPromoters]\n",
" return afResults\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"[False, False, True, False]"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you can again remove the temporary variable to define this function in just two lines:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def bindingSites( strTFBS, astrPromoters ):\n",
" \n",
" return [( s.find( strTFBS ) >= 0 ) for s in astrPromoters]\n",
"\n",
"bindingSites( \"T\", [\"A\", \"C\", \"T\", \"G\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"[False, False, True, False]"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now would be a good time to try it out on some more complicated test cases, like our original example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strTFBS = \"TGACTCA\"\n",
"astrPromoters = [\"GGAGTGGTTG\", \"ATGACTCATA\", \"CTCCATTCGG\", \"TTTTTTCGTG\", \"CATTGACTCA\"]\n",
"\n",
"bindingSites( strTFBS, astrPromoters )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"[False, True, False, False, True]"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or some randomly generated sequences, taking advantage of the [random module](http://docs.python.org/2/library/random.html), one of dozens in the [Python standard library](http://docs.python.org/2/library/index.html) of built-in modules."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import random\n",
"\n",
"def randomNucleotides( iLength ):\n",
" \n",
" astrNucleotides = []\n",
" for i in xrange( iLength ):\n",
" astrNucleotides.append( \"ACGT\"[random.randrange( 4 )] )\n",
"\n",
"# The join method targets a string, accepts a list of strings as input,\n",
"# and returns a new string created by concatenating all strings in the list\n",
"# separated by the target, e.g.\n",
"# \"-\".join( [\"a\", \"b\", \"c\"] ) => \"a-b-c\"\n",
"# \", \".join( [\"one\", \"two\"] ) => \"one, two\"\n",
"# \"\".join( [\"sm\", \"oo\", \"sh\"] ) => \"smoosh\"\n",
" return \"\".join( astrNucleotides )\n",
"\n",
"strTFBS = randomNucleotides( 4 )\n",
"strTFBS"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"'TTCA'"
]
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"astrPromoters = []\n",
"for i in xrange( 100 ):\n",
" astrPromoters.append( randomNucleotides( 50 ) )\n",
"\n",
"# We'll just show the first 10 for brevity's sake\n",
"astrPromoters[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"['GAAAATTAAGGAGGCGATTCGTCGCTCATTACTTAAAATTACGAATGGAC',\n",
" 'ATTAGTCAGTATCAAAGGTAAGAGCCCGAGTAAACCTTGGCATTTTCCTA',\n",
" 'GCTTCCGTCTTTCGACAATCGTAAAGGCCCATCTTAGAGGTTGAGTGAGA',\n",
" 'CCTGAATGGCAATTTTTACTGAGCTAGACCCATTCATCAAGCGCTTAATT',\n",
" 'CGCCCGACCATCGACGCTTCTGTGGGCCCGTCAACACCGAGTTCAGGGAG',\n",
" 'TTGCATGTGTCGTCCTGGAGGCATATAATTTGTGCCGTCATAAGTGAGGG',\n",
" 'TCGGGATTACGAGCAACTATCTGACCCCCCTGTGGCGAGATCGGTCCCGG',\n",
" 'CGACTGGAGTCTTTGAGCTGCAGCGATCCAATACGAGTTAGAGGTAGGCA',\n",
" 'TGCCAGCCCATGTCGACCTCATTAGAGAGTTACGGGTCTATCTGACGTTT',\n",
" 'ACGTCTGGGCTATGGACTTGTCGTCTCATCAGGATATTCGACCCGCTCCG']"
]
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print( bindingSites( strTFBS, astrPromoters ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, True, True, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, True, False, False, True, False, True, True, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, True, False, False, False, True, False, False]\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on exactly how your random data turned out, you should have a nice mix of `True` and `False` hits there. Great job - now it would be time to get to work figuring out what those promoters' genes were! Fortunately, we can move on to other bioinformatics problems instead.\n",
"\n",
"### Counting codon usage\n",
"\n",
"For our next trick, suppose we're still processing DNA sequences, but we now have an **[open reading frame (ORF)](http://en.wikipedia.org/wiki/Open_reading_frame)** or **coding sequence** instead of a noncoding upstream promoter. That is, we have a (spliced) gene sequence that we know encodes a protein, such that every three nucleotides represent one codon. For example, the first several dozen nucleotides of the first isoform of [BRCA1](http://www.ncbi.nlm.nih.gov/nuccore/237757283) are:\n",
"\n",
"> `ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAAT\n",
"GCTATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAG\n",
"GAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATG\n",
"CTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGT\n",
"AAGAATGATATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGT\n",
"CAACTTGTTGAAGAGCTATTGAAAATCATTTGTGCTTTTCAGCTTGAC\n",
"ACAGGTTTGGAGTATGCAAACAGCTATAATTTTGCAAAAAAGGAAAAT`\n",
"\n",
"High school biochemistry tells us that the first `ATG` is a methionine, the next `GAT` is aspartic acid, `TTA` is leucine, and so forth. And especially for amino acids like leucine that can be coded for by many different possibly codons (`TTA`, `CTT`, `CTG`, etc. etc.), exactly which codons are used in a particular ORF can tell us something about its transcription rate, function, or even charge or hydrophobicity (if there are biases for particular amino acids as well as particular codons). What would it take for us to count how many of each codon occurs in an ORF?\n",
"\n",
"That sounds like a function to me, with some fairly simple parameters:\n",
"\n",
"* One *input* string, containing the nucleotide sequence of the ORF of interest.\n",
"\n",
"* One *output* dictionary pairing each codon string in the ORF with an integer count of how many times it occurs.\n",
"\n",
"Thus in a very simple example, if our input was:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"ATGGATTTAATG\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 22,
"text": [
"'ATGGATTTAATG'"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'd like our output to be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{\"ATG\" : 2, \"GAT\" : 1, \"TTA\" : 1}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
"{'ATG': 2, 'GAT': 1, 'TTA': 1}"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we're making all sorts of assumptions about our input string - it contains only valid nucleotide letters, it's all upper case, its length is a multiple of three - but for the sake of simplicity, that's ok! Just don't forget about when those assumptions might fail in the real world.\n",
"\n",
"That being said, let's try describing this task in Python so the computer can help us carry it out. We know what our function skeleton should look like:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def codonCount( strORF ):\n",
" \n",
" return hashCounts"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we have a pretty good idea of the pseudocode:\n",
"\n",
" def codonCount( strORF ):\n",
" \n",
" for each codon in the ORF:\n",
" increment the number of times we've seen it\n",
" \n",
" return the dictionary of all counts\n",
"\n",
"Let's start by figuring out how to find \"each codon in the ORF.\" A codon is a substring of length exactly three, which we can pluck out of a string using the [slice operator](http://docs.python.org/2/tutorial/introduction.html#strings), which has the syntax `string[optional beginning:optional end]`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Slice the full string from beginning (index 0) to end (index 5)\n",
"\"abcde\"[0:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": [
"'abcde'"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Slice off only the first three characters\n",
"\"abcde\"[0:3]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"'abc'"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Slice off the middle two characters\n",
"\"abcde\"[1:3]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"'bc'"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# A missing beginning index is assumed to be zero\n",
"\"abcde\"[:3]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"'abc'"
]
}
],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# A missing end index is assumed to be the length of the string\n",
"\"abcde\"[1:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 29,
"text": [
"'bcde'"
]
}
],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# And by the way, you can slice lists as well as strings\n",
"[1, 2, 3, 4, 5][1:3]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 30,
"text": [
"[2, 3]"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using slice notation, a the first codon in an ORF is just the first three characters:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"ATGGATTTAATG\"[0:3]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"'ATG'"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next codon is the following three characters:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"ATGGATTTAATG\"[3:6]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 32,
"text": [
"'GAT'"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And in general, a codon at any position `i` is the slice from `i` to three characters later:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"i = 6\n",
"\"ATGGATTTAATG\"[i:( i + 3 )]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
"'TTA'"
]
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can combine this with the fact that the full [range](http://docs.python.org/2/library/functions.html#range) function takes not just one (its usual) but three integer arguments, a *beginning*, an *end*, and a *step*:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Default is just an end\n",
"range( 5 )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 34,
"text": [
"[0, 1, 2, 3, 4]"
]
}
],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Two arguments mean a beginning and an end\n",
"range( 1, 5 )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 35,
"text": [
"[1, 2, 3, 4]"
]
}
],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Three arguments mean a beginning, end, and step, the last of which defaults to one\n",
"range( 1, 5, 2 )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 36,
"text": [
"[1, 3]"
]
}
],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 10 to 89 in multiples of 13\n",
"range( 10, 90, 13 )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 37,
"text": [
"[10, 23, 36, 49, 62, 75, 88]"
]
}
],
"prompt_number": 37
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 100 to 1000 in multiples of 111\n",
"range( 100, 1001, 111 )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 38,
"text": [
"[100, 211, 322, 433, 544, 655, 766, 877, 988]"
]
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a way to examine every codon in an ORF string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def codonCount( strORF ):\n",
" \n",
"# Start our counts as an empty dictionary to avoid our previous mistake!\n",
" hashCounts = {}\n",
" for i in xrange( 0, len( strORF ), 3 ):\n",
" strCodon = strORF[i:( i + 3 )]\n",
" print( strCodon )\n",
" \n",
" return hashCounts"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this function isn't doing what we want just yet, we can sneak in a `print` statement to show that it's finding every codon successfully:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"codonCount( \"ATGGATTTAATG\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"ATG\n",
"GAT\n",
"TTA\n",
"ATG\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 40,
"text": [
"{}"
]
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next part of the pseudocode we need to translate is \"increment the number of times we've seen it\":\n",
"\n",
" def codonCount( strORF ):\n",
" \n",
" for each codon in the ORF:\n",
" increment the number of times we've seen it\n",
" \n",
" return the dictionary of all counts\n",
"\n",
"We know that eventually, `hashCounts[strCodon]` will be \"the number of times we've seen it.\" So how about:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def codonCount( strORF ):\n",
" \n",
" hashCounts = {}\n",
" for i in xrange( 0, len( strORF ), 3 ):\n",
" strCodon = strORF[i:( i + 3 )]\n",
" hashCounts[strCodon] = hashCounts[strCodon] + 1\n",
" \n",
" return hashCounts\n",
"\n",
"codonCount( \"ATGGATTTAATG\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'ATG'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-41-d1e08cff6cb6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mhashCounts\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mcodonCount\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m\"ATGGATTTAATG\"\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m<ipython-input-41-d1e08cff6cb6>\u001b[0m in \u001b[0;36mcodonCount\u001b[0;34m(strORF)\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mxrange\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0mstrORF\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mstrCodon\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstrORF\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0mi\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m3\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mhashCounts\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstrCodon\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mhashCounts\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstrCodon\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mhashCounts\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'ATG'"
]
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's pretty inscrutable - almost all Python errors are at first - but the combination of *what* it said and *where* is still correct. On the line in which we tried to increment the codon count, the first codon (`ATG`) was not a valid *key* to use with the `hashCounts` dictionary. That's because we can't retrieve the *value* of a key that's never been stored:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This is OK since it's in the dictionary\n",
"{\"example\" : 1, \"hash\" : 2}[\"example\"]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 42,
"text": [
"1"
]
}
],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This is not!\n",
"{\"example\" : 1, \"hash\" : 2}[\"barf\"]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'barf'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-43-b4c1fe57ecb3>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# This is not!\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m{\u001b[0m\u001b[0;34m\"example\"\u001b[0m \u001b[0;34m:\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"hash\"\u001b[0m \u001b[0;34m:\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"barf\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m: 'barf'"
]
}
],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using square brackets to retrieve a value from a dictionary only works if the requested key is present, although of course we can add a new key at any time:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hashExample = {\"example\" : 1, \"hash\" : 2}\n",
"hashExample[\"barf\"] = \"barfs no more\"\n",
"hashExample"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 44,
"text": [
"{'barf': 'barfs no more', 'example': 1, 'hash': 2}"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hashExample[\"barf\"]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 45,
"text": [
"'barfs no more'"
]
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python dictionaries support an additional, less gastrointestinally sensitive value retrieval function as well, however, the [`get`](http://docs.python.org/2/library/stdtypes.html) method. `get` targets a dictionary, accepts one additional input of any type (the key), and returns the key's value if present and `None` otherwise (without generating an error):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{\"example\" : 1, \"hash\" : 2}.get( \"example\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 46,
"text": [
"1"
]
}
],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print( {\"example\" : 1, \"hash\" : 2}.get( \"doesn't barf!\" ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"None\n"
]
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thus if the result of `get`ing a key is `None`, it's not in the dictionary; if it's any other value, then it has already been stored. This lets us expand our pseudocode significantly:\n",
"\n",
" def codonCount( strORF ):\n",
" \n",
" for each codon in the ORF:\n",
" if we haven't yet seen the codon:\n",
" set its count to 1\n",
" else:\n",
" increment the number of times we've seen it\n",
" \n",
" return the dictionary of all counts\n",
"\n",
"That looks like something we can sink our Pythonic teeth into, since we know how to `get` the codon's count (or `None`) to see if we've yet seen it or not:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def codonCount( strORF ):\n",
" \n",
" hashCounts = {}\n",
" for i in xrange( 0, len( strORF ), 3 ):\n",
" strCodon = strORF[i:( i + 3 )]\n",
" iCount = hashCounts.get( strCodon )\n",
" if iCount == None:\n",
" hashCounts[strCodon] = 1\n",
" else:\n",
" hashCounts[strCodon] = iCount + 1\n",
" \n",
" return hashCounts\n",
"\n",
"codonCount( \"ATGGATTTAATG\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 48,
"text": [
"{'ATG': 2, 'GAT': 1, 'TTA': 1}"
]
}
],
"prompt_number": 48
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bingo! Suddenly it looks like we have a working Python function. [But wait, there's more!](http://en.wikipedia.org/wiki/Ron_Popeil) `get` takes an optional second argument, the default value to be returned if the requested key is not present. The default value defaults to `None`, but we can request that it be something different:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{\"example\" : 1, \"hash\" : 2}.get( \"doesn't barf!\", \"I'm not here\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 49,
"text": [
"\"I'm not here\""
]
}
],
"prompt_number": 49
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like, say, the value `0`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def codonCount( strORF ):\n",
" \n",
" hashCounts = {}\n",
" for i in xrange( 0, len( strORF ), 3 ):\n",
" strCodon = strORF[i:( i + 3 )]\n",
" iCount = 1 + hashCounts.get( strCodon, 0 )\n",
" hashCounts[strCodon] = iCount\n",
" \n",
" return hashCounts\n",
"\n",
"codonCount( \"ATGGATTTAATG\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 50,
"text": [
"{'ATG': 2, 'GAT': 1, 'TTA': 1}"
]
}
],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And remove the middleman again gets us down to:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def codonCount( strORF ):\n",
" \n",
" hashCounts = {}\n",
" for i in xrange( 0, len( strORF ), 3 ):\n",
" strCodon = strORF[i:( i + 3 )]\n",
" hashCounts[strCodon] = 1 + hashCounts.get( strCodon, 0 )\n",
" \n",
" return hashCounts\n",
"\n",
"codonCount( \"ATGGATTTAATG\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 51,
"text": [
"{'ATG': 2, 'GAT': 1, 'TTA': 1}"
]
}
],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It never hurts to do more tests:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"codonCount( \"ATGGATTTAATGATGGATTTAATGATGGATTTAATGATGGATTTAATGATGGATTTAATG\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 52,
"text": [
"{'ATG': 10, 'GAT': 5, 'TTA': 5}"
]
}
],
"prompt_number": 52
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strORF = randomNucleotides( 100 )\n",
"strORF"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 53,
"text": [
"'AAATCGTGGAAGACGAATACGGTGTTGTACGTGTAGCAAATAAGAGTGACATGTACTGGAGTCTAGCTTGATCTACTATATTGCCCGTTGCGATGTGCAT'"
]
}
],
"prompt_number": 53
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print( codonCount( strORF ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{'CTT': 1, 'AAG': 1, 'AAA': 1, 'ATA': 1, 'ACA': 1, 'AGA': 1, 'AAT': 1, 'CTA': 2, 'ACT': 1, 'ACG': 2, 'CAA': 1, 'CCG': 1, 'TAT': 1, 'TGT': 2, 'CGA': 1, 'GAT': 1, 'TGC': 1, 'TAG': 2, 'GGA': 1, 'TGG': 1, 'TAC': 1, 'TCG': 1, 'TTG': 2, 'GCA': 1, 'GTC': 1, 'GTG': 3, 'T': 1}\n"
]
}
],
"prompt_number": 54
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print( codonCount( randomNucleotides( 1000 ) ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{'CTT': 8, 'ATG': 6, 'ACA': 8, 'AAA': 5, 'ATC': 3, 'AAC': 10, 'ATA': 6, 'AGG': 6, 'CCT': 8, 'ACT': 8, 'AGC': 7, 'AAG': 7, 'AGA': 7, 'CAT': 6, 'AAT': 1, 'ATT': 4, 'CTG': 6, 'CTA': 8, 'CTC': 10, 'CAC': 3, 'ACG': 4, 'CCG': 4, 'AGT': 3, 'CAG': 8, 'CAA': 3, 'CCC': 4, 'TAT': 4, 'GGT': 5, 'GCG': 3, 'TGT': 6, 'CGA': 2, 'CCA': 3, 'CGC': 3, 'GAT': 4, 'CGG': 3, 'TTT': 3, 'TGC': 4, 'GGG': 5, 'TAG': 7, 'GGA': 2, 'TGG': 4, 'GGC': 7, 'TAC': 4, 'TTC': 5, 'TCG': 6, 'TTA': 6, 'GAC': 5, 'CGT': 7, 'GAA': 7, 'TCA': 7, 'GCA': 4, 'GTA': 8, 'GCC': 3, 'GTC': 6, 'TAA': 3, 'GTG': 8, 'GAG': 2, 'GTT': 7, 'GCT': 5, 'ACC': 3, 'TGA': 6, 'C': 1, 'TTG': 4, 'TCC': 4, 'TCT': 5}\n"
]
}
],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But I think you've got that one licked!\n",
"\n",
"### Bioinformatics function exercises\n",
"\n",
"**1.** We never did get around to defining a factorial function in class. Recall that the [factorial operator](http://en.wikipedia.org/wiki/Factorial), written $n!$, operates on an integer, such that:\n",
"\n",
"> $$n! = \\prod_{i=1}^{n} i$$\n",
"\n",
"Thus $1! = 1$, $2! = 2$, $3! = 6$, $4! = 24$, $5! = 120$, and $0! = 1$ by definition. Complete the following function definition so that the examples below all return `True`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def factorial( iN ):\n",
" \n",
" iNFactorial = 1\n",
" return iNFactorial"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 56
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"factorial( 0 ) == 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 57,
"text": [
"True"
]
}
],
"prompt_number": 57
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"factorial( 1 ) == 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 58,
"text": [
"True"
]
}
],
"prompt_number": 58
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"factorial( 6 ) == 720"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 59,
"text": [
"False"
]
}
],
"prompt_number": 59
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"factorial( 20 ) == 2432902008176640000"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 60,
"text": [
"False"
]
}
],
"prompt_number": 60
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2.** Lifted from one of my favorite bioinformatics tutorials, [Project Rosalind](http://rosalind.info), complete the following function definition that inputs two positive integers and returns the sum of all odd integers between them (inclusive). Note that it's useful to figure out which of the two inputs is larger, and that the modulo operator will tell you if an integer is odd or even:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Even\n",
"14 % 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 61,
"text": [
"0"
]
}
],
"prompt_number": 61
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Odd\n",
"13 % 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 62,
"text": [
"1"
]
}
],
"prompt_number": 62
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def anOddSum( iOne, iTwo ):\n",
" \n",
" iSum = 0\n",
" return iSum"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 63
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def anOddSum( iOne, iTwo ):\n",
" \n",
" if iOne > iTwo:\n",
" iOne, iTwo = iTwo, iOne\n",
" if not ( iOne % 2 ):\n",
" iOne += 1\n",
" iSum = 0\n",
" for i in xrange( iOne, iTwo + 1, 2 ):\n",
" iSum += i\n",
" return iSum"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 64
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"anOddSum( 1, 3 ) == 4"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 65,
"text": [
"True"
]
}
],
"prompt_number": 65
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"anOddSum( 3, 1 ) == 4"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 66,
"text": [
"True"
]
}
],
"prompt_number": 66
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 3 + 5 + 7\n",
"anOddSum( 2, 8 ) == 15"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 67,
"text": [
"True"
]
}
],
"prompt_number": 67
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"anOddSum( 100, 200 ) == 7500"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 68,
"text": [
"True"
]
}
],
"prompt_number": 68
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Data input and output\n",
"\n",
"The ability to read and write information in bulk is one of the most useful tools you'll have for manipulating your own biological data. While standard file formats like [FASTAs](http://en.wikipedia.org/wiki/FASTA_format) and [BAMs]() are critical, if you do serious biological research you'll inevitably need to ask a question of your data that nobody's ever asked before. Hence calling it research! Given a set of gene sequences, which ones have one of three degenerate RNA binding protein sites in their first intron or second exon? Given a hundred gene expression conditions, in which are at least one gene from your favorite pathway upregulated and at least one of its genes downregulated? These aren't particularly unusual biological questions, but they'd be very difficult to address with a pre-built system - and very easy to address using a flexible environment like Python. Imagine trying to do science in a foreign language using only a pre-defined phrasebook - it's not impossible, but it's no fun, either. Learning the whole language lets you pursue your own questions on your own terms.\n",
"\n",
"### Standard input and output streams\n",
"\n",
"The easiest ways to get data *into* and *out of* Python are the **standard input** (or **stdin**) and **standard output** (**stdout**) streams. Recall that an **input stream** acts like a special, read-only collection of strings that you can't move backward through. Likewise an **output stream** is a special target object that knows how to **write** strings to the screen or a file. Standard input and output are default streams that exist for every program, on every computer, that are by default attached to the command line and the screen, respectively.\n",
"\n",
"Since the IPython Notebook doesn't have a command line, it's rather difficult to demonstrate standard input here. However, you've already seen standard output over and over. Every time you use the `print` function, it displays its argument to standard output. IPython Notebook *captures* that and displays it in your web browser, just like it would normally go to the screen. Hence we can write:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print( \"This will show up on standard output\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This will show up on standard output\n"
]
}
],
"prompt_number": 69
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is equivalent to writing:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sys.stdout.write( \"This will show up on standard output\\n\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This will show up on standard output\n"
]
}
],
"prompt_number": 70
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the one small but important difference: `print` automatically appends a newline to the end of its argument, `write`ing to an output stream does not. You can see this in the following examples:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print( \"I am the very model\" )\n",
"print( \"of a modern Major General\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"I am the very model\n",
"of a modern Major General\n"
]
}
],
"prompt_number": 71
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sys.stdout.write( \"I am the very model\" )\n",
"sys.stdout.write( \"of a modern Major General\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"I am the very modelof a modern Major General"
]
}
],
"prompt_number": 72
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This can be useful if you want to accumulate a variety of output one piece at a time that's shown all on one line, or it can be irritating if you want to `write` lines to an output stream (standard out or otherwise) and keep forgetting to append newlines. If you're `write`ing string variables rather than literal strings, you can also do this using concatenation:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def someFunctionThatWritesOutput( ostm, astrOutput ):\n",
" \n",
" for strOutput in astrOutput:\n",
" ostm.write( strOutput + \"\\n\" )\n",
"\n",
"someFunctionThatWritesOutput( sys.stdout, [\"animal\", \"vegetable\", \"mineral\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"animal\n",
"vegetable\n",
"mineral\n"
]
}
],
"prompt_number": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a very typical use of an output stream such as standard output: an output stream is a targetable data type on which the `write` method is used to produce output. There's only rarely any reason to use `sys.stdout` rather than `print` if you just need to show data on the screen (or in a standard output file - more on that later). But `sys.stdin` can be very useful to read input from the user or a file - I just can't show you here! Typical usage of `sys.stdin` looks like:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def someFunctionThatReadsInput( istm ):\n",
" \n",
" astrInput = []\n",
" for strLine in istm:\n",
" astrInput.append( strLine )\n",
" \n",
" return astrInput\n",
"\n",
"someFunctionThatReadsInput( sys.stdin )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 74,
"text": [
"[]"
]
}
],
"prompt_number": 74
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But notice that nothing happened - IPython Notebook defines `sys.stdin` to be empty. That's no problem - I can show you more about input streams using one that's opened on a file rather than standard input - but first we have to discuss creating files and non-standard input/output streams in the first place.\n",
"\n",
"### Writing files\n",
"\n",
"You might have noticed above that I created a variable named `ostm` to \"hold\" the standard output stream that's normally contained in `sys.stdout`. This is possible because all output streams are simply special, targetable data values, just like all strings are data values targetable by methods like `find` and lists are data values targetable by methods like `reverse`. Standard output happens to be a pre-defined output stream that writes to the screen, but we can create new output streams that write to files instead using the [`open` function](http://docs.python.org/2/library/functions.html#open).\n",
"\n",
"`open` takes two arguments, a string file name and a **mode** requesting to read from that file (`\"r\"`) for an input stream or to write to that file (`\"w\"`) for an output stream. It returns either an input or an output stream depending on the mode. Since we're going to create files using output streams first, let's try making one using mode `\"w\"` first:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ostmTest = open( \"test.txt\", \"w\" )\n",
"\n",
"ostmTest.write( \"I am the very model\\n\" )\n",
"ostmTest.write( \"of a modern Major General\\n\" )"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 75
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like to see what just happened, use a text editor like [jEdit](http://www.jedit.org) to open up the `test.txt` file that should have been created in the same directory as your IPython Notebook file. It should contain:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"I am the very model\n",
"of a modern Major General"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An `open`ed file stays open either until you stop using the variable in Python or you `close` it. We can continue writing new data to `test.txt`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ostmTest.write( \"I've information vegetable, animal, and mineral\\n\" )\n",
"ostmTest.write( \"I know the kings of England\\n\" )"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 76
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now if you open up the text file again (or refresh it), it should contain:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"I am the very model\n",
"of a modern Major General\n",
"I've information vegetable, animal, and mineral\n",
"I know the kings of England"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we wrote our example functions above to work with any output stream, not just standard output, we can even reuse them with a newly opened file:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"someFunctionThatWritesOutput( open( \"test.txt\", \"w\" ), [\"animal\", \"vegetable\", \"mineral\"] )"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 77
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now if you look at your file... hey, wait, what happened? It just contains:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"animal\n",
"vegetable\n",
"mineral"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where did all of your (other) [Gilbert and Sullivan lyrics](http://en.wikipedia.org/wiki/Major-General's_Song) go? Be warned that when you `open` a file to write to it in mode `\"w\"`, if it already exists, its contents are completely replaced. Let's say that again for emphasis:\n",
"\n",
"> **When you `open` a file, it will be created if it doesn't already exist. But if it does already exist, any previous contents are completely deleted.**\n",
"\n",
"This is a great way to lose data! Never, ever `open` a file in mode `\"w\"` that you don't want to lose - for example, a file named `python03.py`, or `my_important_experimental_data.txt`. It's just fine to open such files for reading (we'll discuss how to do that next), or you can open then using the **append** mode `\"a\"` to add new data without deleting what's already there:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"someFunctionThatWritesOutput( open( \"test.txt\", \"a\" ), [\"Marathon\", \"Waterloo\"] )"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 78
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now your file should contain something more like you expected earlier:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"animal\n",
"vegetable\n",
"mineral\n",
"Marathon\n",
"Waterloo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use a variety of **structured data formats** to create files that store information that's easy for Python to read and write. We'll discuss this a lot more later on, but for now, an easy way to create some example data is using the string `split` and `join` methods. `split` targets a string, accepts one string argument, and returns a list of strings **tokenizing** the targeted string by separating substrings between each occurrence of the argument:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"this is a test\".split( \" \" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 79,
"text": [
"['this', 'is', 'a', 'test']"
]
}
],
"prompt_number": 79
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"the, argument, can, be, more, than, one, character, long\".split( \", \" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 80,
"text": [
"['the', 'argument', 'can', 'be', 'more', 'than', 'one', 'character', 'long']"
]
}
],
"prompt_number": 80
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"aaabbbcccbbbdddbbbeeebbbfffbbb\".split( \"bbb\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 81,
"text": [
"['aaa', 'ccc', 'ddd', 'eee', 'fff', '']"
]
}
],
"prompt_number": 81
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\":Note the empty:first and last strings:\".split( \":\" )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 82,
"text": [
"['', 'Note the empty', 'first and last strings', '']"
]
}
],
"prompt_number": 82
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `join` method also targets a string, accepts one argument that's a list of strings, and reverses a `split` by returning one string combining the given list of tokens separated by the targeted string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\" \".join( [\"this\", \"is\", \"a\", \"test\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 83,
"text": [
"'this is a test'"
]
}
],
"prompt_number": 83
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\", \".join( [\"the\", \"target\", \"can\", \"be\", \"more\", \"than\", \"one\", \"character\", \"long\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 84,
"text": [
"'the, target, can, be, more, than, one, character, long'"
]
}
],
"prompt_number": 84
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"bbb\".join( [\"aaa\", \"ccc\", \"ddd\", \"eee\", \"fff\", \"\"] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 85,
"text": [
"'aaabbbcccbbbdddbbbeeebbbfffbbb'"
]
}
],
"prompt_number": 85
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\":\".join( \":Note the empty:first and last strings:\".split( \":\" ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 86,
"text": [
"':Note the empty:first and last strings:'"
]
}
],
"prompt_number": 86
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use `join` in particular to write an example file that we'll use to demonstrate input streams in the next section. It will contain a series of transcript sequences, one per line, with the exons separated by commas:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"iNumberOfTranscripts = 1000\n",
"\n",
"ostmExample = open( \"transcripts.txt\", \"w\" )\n",
"for i in xrange( iNumberOfTranscripts ):\n",
"# Each gene has between 1 and 10 exons\n",
" iNumberOfExons = random.randrange( 1, 11 )\n",
" astrExons = []\n",
" for j in xrange( iNumberOfExons ):\n",
"# Each exon is between 100 and 1000 nucleotides long\n",
" iNumberOfBases = random.randrange( 100, 1001 )\n",
" astrExons.append( randomNucleotides( iNumberOfBases ) )\n",
" ostmExample.write( \",\".join( astrExons ) + \"\\n\" )"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 87
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a look at `transcripts.txt`, which we just created in the same folder as your IPython Notebook. It should contain a whole bunch of nucleotides, with a few commas thrown in there for good measure. It's no fun to look at by eye - but it's no problem to read back into Python, either!\n",
"\n",
"### Reading files\n",
"\n",
"An input stream is conceptually the reverse of an output stream, unsurprisingly. It's a data object that acts like a collection of strings. If you iterate over it using, for example, a `for` loop, each line of text in the stream is generated one after the other. Unlike a list collection, though, you can't go backwards, only forwards! Again, think of a cursor reading one line at a time through the stream. Although we can't easily see `sys.stdin` provide this behavior for us in IPython Notebook, we can see it by `open`ing a file with read mode `\"r\"`, for example the `test.txt` file created above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"istmTest = open( \"test.txt\", \"r\" )\n",
"\n",
"for strLine in istmTest:\n",
" print( strLine )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"animal\n",
"\n",
"vegetable\n",
"\n",
"mineral\n",
"\n",
"Marathon\n",
"\n",
"Waterloo\n",
"\n"
]
}
],
"prompt_number": 88
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default mode of the `open` function is `\"r\"`, so we can omit it if we want to:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"istmTest = open( \"test.txt\" )\n",
"\n",
"for strLine in istmTest:\n",
" print( strLine )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"animal\n",
"\n",
"vegetable\n",
"\n",
"mineral\n",
"\n",
"Marathon\n",
"\n",
"Waterloo\n",
"\n"
]
}
],
"prompt_number": 89
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And like we have before, we can cut out the middleman (temporary variable) in many cases to iterate over the input stream directly:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for strLine in open( \"test.txt\" ):\n",
" print( strLine )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"animal\n",
"\n",
"vegetable\n",
"\n",
"mineral\n",
"\n",
"Marathon\n",
"\n",
"Waterloo\n",
"\n"
]
}
],
"prompt_number": 90
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or we can pass it as an input data object to any function that expects an input stream, like the one we wrote above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"someFunctionThatReadsInput( open( \"test.txt\" ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 91,
"text": [
"['animal\\n', 'vegetable\\n', 'mineral\\n', 'Marathon\\n', 'Waterloo\\n']"
]
}
],
"prompt_number": 91
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hey, that's funny - why do all of our lines of input text have a newline at the end? We couldn't see it explicitly before when we `print`ed them, but they did look unusually spaced out. Input streams do not typically remove the **line delimiter** at the end of each line of text they read in, so if we don't like it, we have to remove it manually. Under some circumstances, it might not be a problem, say if we just wanted to print some lines of data back out again:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for strLine in open( \"test.txt\" ):\n",
" if strLine[1] == \"a\":\n",
" sys.stdout.write( strLine )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Marathon\n",
"Waterloo\n"
]
}
],
"prompt_number": 92
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But most of the time it's a pain, and we can use a string function like [`rstrip`](http://docs.python.org/2/library/stdtypes.html?highlight=rstrip#str.rstrip) to remove the **trailing newline**:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for strLine in open( \"test.txt\" ):\n",
" print( strLine.rstrip( ) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"animal\n",
"vegetable\n",
"mineral\n",
"Marathon\n",
"Waterloo\n"
]
}
],
"prompt_number": 93
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `rstrip` removes (strips) all whitespace at the end (right-hand side, hence \"r\") of a string, not just newlines, so you do have to be a bit careful with it sometimes if you expect your data to contain spaces or tabs:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"no whitespace, nothing happens\".rstrip( )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 94,
"text": [
"'no whitespace, nothing happens'"
]
}
],
"prompt_number": 94
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"one newline gets nuked\\n\".rstrip( )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 95,
"text": [
"'one newline gets nuked'"
]
}
],
"prompt_number": 95
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"many newlines, all gone\\n\\n\\n\\n\".rstrip( )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 96,
"text": [
"'many newlines, all gone'"
]
}
],
"prompt_number": 96
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"but so is everything else: \\t \\n \".rstrip( )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 97,
"text": [
"'but so is everything else:'"
]
}
],
"prompt_number": 97
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Input streams become most useful for biological data handling when they contain interestingly structured data, though, such as the nucleotide sequences we generated above. Let's revisit that list of transcript sequences and print out 1) which genes contained the motif `GATACAACG` and 2) which exon index they were in:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"iGene = 0\n",
"for strLine in open( \"transcripts.txt\" ):\n",
" iGene += 1\n",
" astrExons = strLine.rstrip( ).split( \",\" )\n",
" for iExon in xrange( len( astrExons ) ):\n",
" if astrExons[iExon].find( \"GATACAACG\" ) >= 0:\n",
" print( \"Found: gene \" + str(iGene) + \", exon \" + str(iExon + 1) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Found: gene 26, exon 5\n",
"Found: gene 33, exon 4\n",
"Found: gene 52, exon 1\n",
"Found: gene 150, exon 4\n",
"Found: gene 175, exon 1\n",
"Found: gene 196, exon 4\n",
"Found: gene 259, exon 1\n",
"Found: gene 325, exon 6\n",
"Found: gene 484, exon 4\n",
"Found: gene 663, exon 4\n",
"Found: gene 821, exon 2\n",
"Found: gene 836, exon 1\n",
"Found: gene 959, exon 9\n"
]
}
],
"prompt_number": 98
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've made a few **design decisions** here - gene and exon indices are 1-based rather than 0-based, for example - but otherwise each step in this Python program should make sense:\n",
"\n",
" counting up lines of input (genes) starting before the first...\n",
" for each line in \"transcripts.txt\":\n",
" increment which gene (line of input) we're working on\n",
" remove the trailing newline, and tokenize the transcript into comma-delimited exons\n",
" for each index in the list of exons:\n",
" if the exon string value at that index contains \"GATACAACG\":\n",
" print which gene and exon index we found it in\n",
"\n",
"If they don't make sense, try changing the program above and seeing what happens, or use the blank cell below to create some new code that reads the file differently (remember, you can always create new cells to experiment using the \"Insert\" menu above, too). Then go on to try the exercises below - probably more useful than comic opera lyrics!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 98
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I/O exercises\n",
"\n",
"**1.** Complete the following function that accepts four arguments:\n",
"\n",
"1. An output stream.\n",
"\n",
"2. A column width.\n",
"\n",
"3. A list of strings, each representing a sequence identifier.\n",
"\n",
"4. A list of strings, each representing a nucleotide or amino acid sequence.\n",
"\n",
"and writes the sequence identifiers and sequences to the output stream as a correctly formatted [multi-FASTA](http://en.wikipedia.org/wiki/FASTA_format) file, with sequences wrapped at the requested width. That is, given the inputs:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"iWidth = 4\n",
"astrIDs = [\"one\", \"two\", \"three\"]\n",
"astrSeqs = [\"AACGTAATGCAATAA\", \"GCTTTGGC\", \"GCAGAATTAAT\"]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 99
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the function should produce the output:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
">one\n",
"AACG\n",
"TAAT\n",
"GCAA\n",
"TAA\n",
">two\n",
"GCTT\n",
"TGGC\n",
">three\n",
"GCAG\n",
"AATT\n",
"AAT"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, if we change the column width to `6`, it should produce:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
">one\n",
"AACGTA\n",
"ATGCAA\n",
"TAA\n",
">two\n",
"GCTTTG\n",
"GC\n",
">three\n",
"GCAGAA\n",
"TTAAT"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def writeFasta( ostm, iWidth, astrIDs, astrSeqs ):\n",
" \n",
" pass"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 100
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2.** Call the function you just wrote so that it produces a file named `sequences.fasta` in your current directory, containing `1000` nucleotide sequences named `gene1` through `gene1000` with lengths randomly chosen between `800` and `1200` nucleotides. Use the `randomNucleotides` function we wrote above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"writeFasta( sys.stdout, 60, [], [] )"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 101
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3.** Complete the following function that accepts one argument, an input stream, and reads a FASTA file from it to return a list containing two lists of strings, the first sequence IDs and the second sequences. That is, given an input stream containing either of the examples above (regardless of column width), it should return:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[[\"one\", \"two\", \"three\"], [\"AACGTAATGCAATAA\", \"GCTTTGGC\", \"GCAGAATTAAT\"]]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 102,
"text": [
"[['one', 'two', 'three'], ['AACGTAATGCAATAA', 'GCTTTGGC', 'GCAGAATTAAT']]"
]
}
],
"prompt_number": 102
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def readFasta( istm ):\n",
" \n",
" return []"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 103
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4.** Complete the following function that accepts three arguments:\n",
"\n",
"1. A nucleotide motif to find.\n",
"\n",
"2. A list of gene IDs.\n",
"\n",
"3. A list of gene sequences.\n",
"\n",
"and returns a list of the gene IDs whose corresponding sequences contain the requested motif. For example, using the example sequences above, the input:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strMotif = \"TAA\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"should produce the output:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[\"one\", \"three\"]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 105,
"text": [
"['one', 'three']"
]
}
],
"prompt_number": 105
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 105
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5.** Use `readFasta` to read in the `sequences.fasta` file you created above, and pass the resulting data to `findMotif` to list which of your randomly generated gene sequences contain the motif `GATACAAC`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 105
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6.** Why did I choose a motif of that particular length?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strAnswer = \"\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 106
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7.** Why is `randomNucleotides` a crummy way to generate biologically realistic sequences?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strAnswer = \"\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 107
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 107
}
],
"metadata": {}
}
]
}
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment