Skip to content

Instantly share code, notes, and snippets.

@jeffhussmann
Created February 15, 2013 21:06
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save jeffhussmann/4963535 to your computer and use it in GitHub Desktop.
Save jeffhussmann/4963535 to your computer and use it in GitHub Desktop.
Interactive lecture on 2/15/13
{
"metadata": {
"name": "jhussmann_021513"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Counting DNA Nucleotides##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "open() returns a [file object](http://docs.python.org/2/library/stdtypes.html#file-objects)."
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_dna.txt'\nfh = open(file_name)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "File objects have a read() method. Execute the cell below to see its documentation. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "fh.read?",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_string = fh.read()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We will use a [dictionary](http://docs.python.org/2/library/stdtypes.html#mapping-types-dict) to keep track of nucleotide counts. Using dictionaries effectively is one of the keys to getting the most out of Python."
},
{
"cell_type": "code",
"collapsed": false,
"input": "counts = {'A': 0,\n 'C': 0,\n 'G': 0,\n 'T': 0,\n }",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Iterating over a string produces the characters of the string one at a time."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for c in dna_string:\n counts[c] += 1",
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'\\n'",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-5-ca52b0f4e7ca>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mc\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mdna_string\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mcounts\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mc\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mKeyError\u001b[0m: '\\n'"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Asking for the value of a nonexistent key raises a KeyError. There are specialized kinds of dictionaries in the [collections](http://docs.python.org/2/library/collections.html) module provide a way around this, but for now, we will just strip() the trailing newline character."
},
{
"cell_type": "code",
"collapsed": false,
"input": "stripped_string = dna_string.strip()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": "counts = {'A': 0,\n 'C': 0,\n 'G': 0,\n 'T': 0,\n }\n\nfor c in stripped_string:\n counts[c] += 1\n \nprint counts",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{'A': 195, 'C': 203, 'T': 226, 'G': 223}\n"
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Iterating over a dictionary produces the keys one at a time in an arbitrary order."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for key in counts:\n print key",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "A\nC\nT\nG\n"
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "List comprehensions are a powerful way to construct lists by peforming some operation on each of the elements in an object that can be iterated over."
},
{
"cell_type": "code",
"collapsed": false,
"input": "count_strings = [str(counts[base]) for base in 'ACGT']",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": "print ' '.join(count_strings)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "195 203 223 226\n"
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Transcribing DNA into RNA##"
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_rna.txt'\nfh = open(file_name)\ndna_string = fh.read().strip()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Strings are immutable."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for i, c in enumerate(dna_string):\n if c == 'T':\n dna_string[i] = 'U'",
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'str' object does not support item assignment",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-12-e1e72f21c172>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mc\u001b[0m \u001b[1;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdna_string\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mc\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m'T'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mdna_string\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'U'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m: 'str' object does not support item assignment"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We could build the string up incrementally ... "
},
{
"cell_type": "code",
"collapsed": false,
"input": "rna_string = ''\nfor c in dna_string:\n if c == 'T':\n rna_string += 'U'\n else:\n rna_string += c\n \nprint rna_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "UUAUGUGUCGCUGACGAGCGGAAUUGUGACAUUAGUACCGUCCGCAGCAUUCGCAUUAAACACGGAAGUCCGAGCCCCCCGUUGUUUUCGGAAACCUAGAACCCCCGACAUAAGGGUCACGAGUAAGUAAUAUCGCGUCUGAUUACCAUUAAACAGGACGGUUAGGAAGUCGGCAACGGGGUGUGAGGUAGGUUACACCUAAGGCAACCCUUACUUCGGCGUGAAGAGCUAGCCAACUGGCAAUAAGUGGGUCGGACUUAAUUUCCCUAUAUAGCUAAUGCGAGCCUGGGCCUCGGCCCCGGUCGUACAUUCCCUUCGGUGAACGACUCGCAAUGCCGCGUCCUAUAAGCCAGAUAGAUCUAUUUGUGCCGGAAGGUUUCCUGCUGCAGGUGCAAAGUGCGAGGUUGGAACCUUGCCUCAAGGGUAGGUUCCCCUGUAGGUGGGAUAGGAUGACAUAAGCUCCGACGCCGAGACACAUUUAAUCUGCAUCAUUUUCCUGAAGUGCGCGCUAUGCAAACUCACUGGGGCAACGCCUCCCUGUGACGUCGGACUCGCCUCACUAUAGACUUUGCUUCCGCGCAUGUGAGACGGAACAAAGCAUAAAGCUUCCGACAUCUAAGCCUGCCUUACUAUGAUAGUCCAUUGAUGCUUAUUUGUCUCGGCUGAUCUCAGUACCUAAAGACCGGGAGCGUGGAACUCUGGUGGAGCUUCUGCGUUACCGUCUUUAUAGUAUCUAAGACUUUCGCCUAAACAGUGGCUACCCGCAAGAAAAGAUUGCGUUCACAGCUACGCCCAAUUAUAGGUCCUUGGGCGCGAGAGAGAGUAGUCGUUGAUAGUAUGGUCAUCUGGGAACGUGGGCACAAGAACAAGGAACGGACUAUCCGCUUCAGGGGAGAGCCUCUGAUCGAACGACAUGUUAACUCUAGGUAGACCGCUAAAUCGGAGACGAUGACCACCUGUCUCAAAUCAGAC\n"
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": "... but string objects have a replace() method that does exactly what we want."
},
{
"cell_type": "code",
"collapsed": false,
"input": "rna_string_2 = dna_string.replace('T', 'U')\nprint rna_string == rna_string_2",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "True\n"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Complementing a Strand of DNA##"
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_revc.txt'\nfh = open(file_name)\ndna_string = fh.read().strip()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": "print dna_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "TTACAGTTCGTACTTTGCTATCATACCTGATACTCCATGCGGGGCCTAGCAAAACAGGCTTTAGTACATACGAAATGTAAATCGGTTATCCATGCAAAACGAGTGAATCTAACGTGTAAAGCATCACGCAACATTGGTATCTACAGGACTTGACGCTCCTTAGTTTTCCGGCGCTTTAACATTAGCTTAATAGATCACAGCTAGATCGGCTGACAACGCCTGGTGGTATGACTCCCTATCTTTCATTAATTCTTCTGGTGCGACGCATTACGGTCTTCGTCGACAGCCGACGAATTGCGCCTCGGCCCGTAATCTCACACCACAAAGCCCCAATTGTTCCCCGTGACCTCGGTTTAGCTCATTTGTTGCTTCAGTCCAATCAAGGCAGTCGTTTCAGCGTGCTGCGTTTCTTGGGATTGGGCATTTGCGCCCGCTGCGGGTAACGGACTCTGGGTTCCCATTGTGAAGAGATGCGTTATAGGCATTGGGGGGGCCGAGTGAGAGTACACAGTACCCGCGATGCATCCTACTCGGATTGCCGTGCCCTCTATGTTACGGAGGGCGTTCACAAGCAAGAGTTATGAAGTTTGCCTCAAGCATGACACGGGAGACGGAGGCCTTGCTAGACAGTGATCGTGGCCCGTATAAGTGAAGGAGACGTGTCGGCTGATCGCCACACGGGAGGTATTAATCGTAAGGCTAAGCTCATCCCGCTTCATCTCGCTAGGACCTAGCCTGCGGCCGTAGAGGCGCGGCAAGCAGCACTCTTCTTGGAGAGGTCATGTTCTCCTATACACAAAGGTATCCTCTCCG\n"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One approach is to use a dictionary to encode how each nucleotide gets complemented and to use the built-in function reversed() to iterate over the DNA string in reversed order."
},
{
"cell_type": "code",
"collapsed": false,
"input": "complement = {'T': 'A',\n 'C': 'G',\n 'A': 'T',\n 'G': 'C',\n }\n\nrc_string = ''\nfor c in reversed(dna_string):\n rc_string += complement[c]\n \nprint rc_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "CGGAGAGGATACCTTTGTGTATAGGAGAACATGACCTCTCCAAGAAGAGTGCTGCTTGCCGCGCCTCTACGGCCGCAGGCTAGGTCCTAGCGAGATGAAGCGGGATGAGCTTAGCCTTACGATTAATACCTCCCGTGTGGCGATCAGCCGACACGTCTCCTTCACTTATACGGGCCACGATCACTGTCTAGCAAGGCCTCCGTCTCCCGTGTCATGCTTGAGGCAAACTTCATAACTCTTGCTTGTGAACGCCCTCCGTAACATAGAGGGCACGGCAATCCGAGTAGGATGCATCGCGGGTACTGTGTACTCTCACTCGGCCCCCCCAATGCCTATAACGCATCTCTTCACAATGGGAACCCAGAGTCCGTTACCCGCAGCGGGCGCAAATGCCCAATCCCAAGAAACGCAGCACGCTGAAACGACTGCCTTGATTGGACTGAAGCAACAAATGAGCTAAACCGAGGTCACGGGGAACAATTGGGGCTTTGTGGTGTGAGATTACGGGCCGAGGCGCAATTCGTCGGCTGTCGACGAAGACCGTAATGCGTCGCACCAGAAGAATTAATGAAAGATAGGGAGTCATACCACCAGGCGTTGTCAGCCGATCTAGCTGTGATCTATTAAGCTAATGTTAAAGCGCCGGAAAACTAAGGAGCGTCAAGTCCTGTAGATACCAATGTTGCGTGATGCTTTACACGTTAGATTCACTCGTTTTGCATGGATAACCGATTTACATTTCGTATGTACTAAAGCCTGTTTTGCTAGGCCCCGCATGGAGTATCAGGTATGATAGCAAAGTACGAACTGTAA\n"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It turns out that there are built-in function in the string module that do exactly what we want."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import string",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": "complement_table = string.maketrans('TCAG', 'AGTC')\nc_string = dna_string.translate(complement_table)\nrc_string_2 = c_string[::-1]\n\nprint rc_string == rc_string_2",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "True\n"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Why does [::-1] work?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten = range(10)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Taking a slice a:b returns the interval of the list starting at index a and going up to but not including index b. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[2:4]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 21,
"text": "[2, 3]"
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Leaving off part of the slice means 'start from the beginning'..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[:4]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 22,
"text": "[0, 1, 2, 3]"
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": "... or 'go to the end.'"
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[4:]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 23,
"text": "[4, 5, 6, 7, 8, 9]"
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": "An optional third slice argument gives a stride length to step along the interval with."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[2:9:3]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 24,
"text": "[2, 5, 8]"
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Stride length can be negative...."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[9:2:-2]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 25,
"text": "[9, 7, 5, 3]"
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": "... so this idiom means 'start at the end and go through to the beginning stepping backwards one index at a time', which is a complicated way to say 'reverse'."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[::-1]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 26,
"text": "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Computing GC Content##"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def fasta_records(fh):\n ''' Returns a list of (name, seq) pairs for fasta records in file_name. '''\n records = []\n name = fh.readline().strip().lstrip('>')\n while name:\n seq = ''\n line = fh.readline().strip()\n while line and not line.startswith('>'):\n seq += line\n line = fh.readline().strip()\n records.append((name, seq))\n name = line.lstrip('>')\n return records",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_gc.txt'\nfh = open(file_name)\nrecords = fasta_records(fh)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Argument unpacking makes code more readable."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for name, seq in records:\n print name\n print seq\n print",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Rosalind_8336\nGCTCACCCGTGTCGCCCGAACATGCTTAACTAACTTGGACAGGTTTGAAATAGCACTTGAAGAAAACGTTTCAGCGATGACCAGACGAGGATACGATCGCTCCTCGGAAGCCCTCTAAGCTCGTCGCAGTGAACAAGACTCTAAGAGTCTTCAGACGCTTGGCGCCGGTAGGTGGTTCTGTGCACTACTCCACGTCGTCCGGTCGAGTAATAAGCAGGGATCTTTCGGCGTTGTACTAACAGCTCATAACTCCAAAGCCAAAAACTCAACTGGGAAAGCGTAGTGTCAGATGACTGTACACTCACACAAGTACGGTCGGCAGGGGCACCGTTACGCCCCAGACCGTTGCGTATTTCTGCAGTCCGGCGTAAGTAGATTGTCCGTGGGCTTACTCTTGGAATTATATCATGTCTCTGGCTTGGAATTGGCGTTACATTGTGGGGACCCGCTGCGGCCGTCCCTATTCATATCATTTTTGAGCATAATGATTTTTTACAACAAAAATGTAGCACGGAATGCCGAATTTAAGTCAGTTTACATCAGTGAAGCGGTTTGCCCAACGGCCAGACTGCGTGGGGCACAGTTGGCCTATGTAGGACCACTCGTGCGCATTAGAACTTCAACCCGTTGACTACGCCGGAGCGGGACAACGTAATTCATTGATCACTGGACCCGACAGTACGGGTGGACACAATGTGTCAATGTAGGCCAGCATAGTTAAACATCGAGCCCCGTGTGACAGAAGTTTGAAGCGATGTAGTGATGGTCTTCACTTGTCATGACGGTGGGGCCTCGGGAAAGTTGGCATCCCTACAGCGTTTCAGATGCGGTCCAATGCTAGCGGACAATTCTAGCACGTTCTAGGCTGCCGCTAGGTCAACCCGCCACATGATCCACTGAAAGT\n\nRosalind_6236\nTATGCAAATTTATGGCGCGGCCATTGAGTAAGACTTCAGCAGCACCTGATGTTCGCATCTAAAACCTGTTAGGTACTGTATTGCAATCGAATATTGTGACCCGGTCCACACGCACGCATACTTGTGCTATGAACAATATCAACTACCCCACACAAGCGTTGGCGGTTGGGTACTCGAAACTATAAGATTCGCCCATGTGTGTAAAGGCGTTGTCATCTCACTGAAACCTTGCCTCGAGTCCTAGAGAAACTGTTTAGAATGAACCGCTATCGGCCAGCGATGCCCCTCCTCTGTCTGGCCGTAATTGGGGCCTAGACTCCGATGTATCTGATGTATGCACGGAAAAAGCTCCACGCAGGTTACGGAGGGTGCGCTAATAATGGTCACAGCGGGTCGATGGGGTTACGTTATACATTCCCAAGCAGTTCGACATCCCGCAGATCGAATACACACTGTCATGCAGCGTCACTGCCCACCTGCACCGCCGGTTAAACAAGAATTAGTTCCGGTCCTCCGCCTGGGTGATGGGTAACTTCGGCATGAGGTTACAACCACGAGATCGACAAATGCCTCACCGGAGGAGATTCGAACCGCGTGGGATGTTCGAATTGGCGCCCCGCATAGTTCCGTTCCAATATCATCTTTTGGCCACTGATCCCTGCGGGTTCTTGCCGGAGCCGGTTGAAATGTGGGCGCGACCTGGCAACATTTGAATGGCAGATCCAGACCACAACTGTAGGAAACCCTCCTATTCTAGTCTGGATTCAACTACCGTGCACCACAGTAGTTGGACTGAACGGACAGCTTCCGTCTGGGCCTTTCCTTCAAGGGTTGCGAATTAACGCCCTTCCCCTCTTCACTAGTCAGCGTTCGCGGACCCTAACGAAGACGAGAGATCATCATTGGTGTACTAGCGTC\n\nRosalind_9301\nGATATGAGGTGACCTTTCGGGTATCGAACAGCGAAACAACGGTGGCATAAGCTCTCTCTGCCTTTGATATTTCCCAAGTGCGTGTCTGTGTGGATGAGCCTTCGACCAGTCTACCACTCAACAGTGAGTTAAAAGCCAAATTACAAATTTTGCGGTTTCCCTTCTTGGTGCGCATGCCTATCAAGCTGCCGTACTTTAAGAGACGGCAAACCTCACAGTACGCCAAGGGGTTATTCGTCTTGTCCGCGGATAGGATATGTCAGCGCGCGATAACGAGCTCATCCAAAGATTCGCGTAAGCATTGCCAGGTCGTAAGGGCTTAGCAGTCGAACTTTATTAGTATCCAAATCAGTCAAGGAGACGGATTCCCTTAACGCCTACCTGATAAACTACCTAGCCCTGTTACGCCTGCGAGAGATACCCGCGGCGAGAGTCGGATGCTCGATGCCGGTAGCCTGTCTAACGGTATACTAGTAAAATCGGGTGTCAGCAAATGTAGGTTACACACGCGTTTAAGTAACGCTGACAAAGAACCTGTCCGTGCTTCTTGCCAACACTATCTTCATCGTATTCTACTTAAGCGTGCACACGGGTCCTCTTAGACCTAGTTATTTGTGTGAAGTAGACCGCTGGGCGAGTAAGCCGGTTTATTCTCCGCAGACTTGGCCAACTCGTAGTAGATCTCACAGAATCCCATCAGGTATAACACAGGCAACTGGGCGGCGGCTCAAGAAGCCGGGATAACTTAGGGGGCTATTCGGAGCTGTCTTGGGGAGCTGGTATCGAGTAGCGAGTTAATAATACTTGCAAA\n\nRosalind_6955\nTGACTTATCAAAGACCTAGGAAGTTTCAGGCCTCTCTTAGTATGCCACCCACGCTGGACGGCGCTGAATGCCTCAACCAGAACGACCCGTAAACAAAAACCGGAATCGGGCTGGGATGTTCTAGGCGGTAAGGTGCGATGCTGTAGCTTATTTCGCACACTCCCCTAATTCATAAGAAACGTAACGCCGTAGCCTAATTTACTCCGATCGAAACGCATGAGTCGTGCGTTTGGGGGAGCAAAGATCATCATTGAGGCGATGGGCCGCCAAGTAAGCATGGTGGGTGTGGGTCTTAGCGGCTGCTTGGGGCGTGGACGTGTCCGCGCTATACCCGAAATTTTGCGGCAAGCTCGGCCTGGGAAGTGAGCTCCAGCGAACGTAGTATCTGGTGCGGACAGACTTAGCAGTATCAAAAAACGTCGGTCCCATCTTACATAGGATCTTTAACTGTATGTGCCATCGCATGCGTAACTTCAGTCGCTCGGTTCCGACGGAGCGGACAGGTCCGTGAACGGTGACCCCATGCAAGCCAGAACTCCAGGGATCGCCGGCGGGCGTCTGAAGGAACAAACACCAATGTAACATGCTTAACCCGCGCGATGTTTAAAACTCTGCACCATGAGCTTAGATACTGGTGACCATTCCGAGCCGCCCTGATCCGAATCTCTTGTAAACTGTCGGAGGGTACTATTGGGGCATGTGTGGAGGACCGAGATGTACGAATGGATATAGCTTAACGCTGCTGCACGAAATCGACTGTTACGTCTCACCTGTTCGGCCTTAGCGTCAGACCTGAACCCAGAGGTGCGCGTTACATTGAGGACCGGTACGAAATTGGCTCTGATTATAAGTCATCGTAAGAGGATTACAAGCGCATCTGCGTGGCTCTCGTGCCGTTCGTTTTGGAACACCTCCGATTGGAGGTTATCAGTCAGCCCATCAATATCAGTGCTAACCAGCTCGTGGCGCCATGCCCGTTACCCGAAGATTCAACC\n\nRosalind_0896\nGCTGCAAAACGTACTCAACAGAATAGTAACCAGCAATCCGGATTCGCATGCTAAGTAGTATCAGGAGCCACGATTATTGCCTGCGCTTGCGTATGGCAGCCAGTTGTGAAGCGGTGAAGGGATATGATAGTAAGAAATCTTTAGAACCAGAGTGCATGACATGTCCGACCCTTTTGGGCCCATTATCCGCGACGTCCAGTCGTACCGTACCGGCGGCCCCTTAATTCTAGCTAAAGTTAGTCTCATAGACGTACGGTCAACCATCTGTCGTTGTCGCAGCAGATTAATCACGCTTATTAAGATCCGCTTCTACATCCACGTCTCTAGTACCTGCTGCCGACGCATCCGCTAACTATGCTTCTTGTGCAGTCGTCTATGCGCCGAGCCCATTATCTAATCTCTGCATACTTGCGTGTCCACGGATTTCCGTATTCTAACCCTCGGATCTACGTTTGCCGGCTGTCGACCGATCTTTAGGCTCACAGTCATCCGACCTGATCTGGGTCTAGTGCCGGTTATTACAGTTGACAGGCCATGCGTTCCACGGTGGAGACTTATACGTTGGGTGGAATGCTTCTTTTAACTGTGCATTTGATCCCTGCGTCAGTCTTTTTCAGACCTTAAAACCGACCGTGCTTGACGGCGAGTCACCGTCCATATAATTGTAGGACCCCGCTCTAGACTCGTGAGGTCATAGTGTAGGCCAAAGAGGTTGTTGTGCTCGTCTGTACTACCAACAGCTCTAGAAGGACGCGTGCAGTCATGTGGAGATGGGGGGGCCGCACGTTGAAAGGACCGGCAAGTCCACGACCCAACAAGCGTTTCGATACCCAAGCAGGCAGATTCATGTTACAATCGTGCGGCTCAGTACCAATTGCATAGTGTCACAGACATTGATATGTCTAAGTACTTATCCCGGGATTGCCACCTGACAGTTT\n\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Conditional expressions can be used to filter list comprehensions."
},
{
"cell_type": "code",
"collapsed": false,
"input": "test_seq = 'GCATATATGCTAG'\ngcs = [char for char in test_seq if char == 'G' or char == 'C']\nprint gcs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['G', 'C', 'G', 'C', 'G']\n"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Leaving out the square brackets produces a generator expression."
},
{
"cell_type": "code",
"collapsed": false,
"input": "gc_count = sum(1 for char in test_seq if char == 'G' or char == 'C')\nprint gc_count",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "5\n"
}
],
"prompt_number": 31
},
{
"cell_type": "code",
"collapsed": false,
"input": "def gc_content(seq):\n ''' Returns the GC content of seq as a percentage. '''\n gc_count = sum(1 for char in seq if char == 'G' or char == 'C')\n return 100 * float(gc_count) / len(seq)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": "lambda expressions can be used to create anonymous functions."
},
{
"cell_type": "code",
"collapsed": false,
"input": "records_gc = [(name, gc_content(seq)) for name, seq in records]\nname, gc = max(records_gc, key=lambda (name, gc): gc)\nprint name\nprint gc",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Rosalind_6955\n52.6633165829\n"
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Counting Point Mutations##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "zip() is useful when you have several lists that you want to process element-wise together. \n\n(Questions: Why do you think this is called zip? What built-in function that we have seen is equivalent to zip(range(len(seq)), seq)?)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "numbers = range(4)\nletters = 'ABCD'\n\nfor n, l in zip(numbers, letters):\n print n, l",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0 A\n1 B\n2 C\n3 D\n"
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Lets use zip in a filtered list comprehension to produce a compact, simple Hamming distance function."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def hamming_distance(first, second):\n ''' Returns the number of positions that differ between first and second. '''\n distance = sum(1 for f, s in zip(first, second) if f != s)\n return distance",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_hamm.txt'\nfh = open(file_name)\nfirst = fh.readline().strip()\nsecond = fh.readline().strip()\n\nprint hamming_distance(first, second)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "482\n"
}
],
"prompt_number": 36
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Enumerating Gene Orders##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The itertools module is full of all kinds of cool stuff."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import itertools",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 37
},
{
"cell_type": "code",
"collapsed": false,
"input": "perms_three = itertools.permutations(range(1, 3 + 1))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In general, it doesn't make sense to ask for the length of an iterator."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(perms_three)",
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "object of type 'itertools.permutations' has no len()",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-39-28a8bc59710c>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mperms_three\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m: object of type 'itertools.permutations' has no len()"
]
}
],
"prompt_number": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": "For iterators we know to be finite, we can 'cast' to a list."
},
{
"cell_type": "code",
"collapsed": false,
"input": "list_perms_three = list(perms_three)\nprint len(list_perms_three)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "6\n"
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To produce the desired output format, we need to convert lists (actually, tuples) of numbers to lists of strings.\n\nmap(function, list) produces a new list that is function applied to each element of list.\n\nThe choice between map and a list comprehension is often just a matter of taste."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for perm in list_perms_three:\n strings = map(str, perm)\n print ' '.join(strings)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1 2 3\n1 3 2\n2 1 3\n2 3 1\n3 1 2\n3 2 1\n"
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Finding a Motif in DNA##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One way to find all occurrences of substring in string_to_search is to check if the substring is a prefix of each suffix of string_to_search."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def find_locations(substring, string_to_search):\n ''' Returns a list of all positions in string where substring occurs. '''\n locations = []\n for i in range(len(string_to_search)):\n if string_to_search[i:i + len(substring)] == substring:\n locations.append(i)\n return locations",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_subs.txt'\nfh = open(file_name)\nstring_to_search = fh.readline().strip()\nsubstring = fh.readline().strip()\n\none_based = [str(l + 1) for l in find_locations(substring, string_to_search)]\nprint ' '.join(one_based)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "2 63 74 92 99 126 143 158 166 174 199 253 315 332 339 363 403 410 426 433 500 526 535 542 582 589 661 733 752 768 775 808 815 822 872 950\n"
}
],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": "When you are searching for matches to a pattern in a string, you should think 'regular expressions'. (Of course, [there is this](http://regex.info/blog/2006-09-15/247).)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 44
},
{
"cell_type": "markdown",
"metadata": {},
"source": "re.finditer(pattern, string_to_search) will produce (an iterator over) objects representing matches of pattern in string_to_search. These match objects have a start() method that gives the location of the match."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pattern = substring\nfor match in re.finditer(pattern, string_to_search):\n print match.start(),",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1 62 73 91 125 142 157 173 198 252 314 331 362 402 425 499 525 534 581 660 732 751 767 807 821 871 949\n"
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A gotcha here is that overlapping pattern matches will not be found. (By default, the re engine 'consumes' the string as it moves through, so regions that are part of one match never get looked at again to be part of a potential second match.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "matches = re.finditer(pattern, string_to_search)\nre_one_based = [str(m.start() + 1) for m in matches]\nprint re_one_based == one_based",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "False\n"
}
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": " Adding a (cryptic to the uninitiated) lookahead flag to the re pattern changes this behavior."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pattern = '(?={0})'.format(substring)\nmatches = re.finditer(pattern, string_to_search)\nre_one_based_2 = [str(m.start() + 1) for m in matches]\nprint re_one_based_2 == one_based",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "True\n"
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Protein Translation##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One of the strengths of Python is the existence of high-quality modules for solving common problems in different domains, such as Biopython."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from Bio.Seq import Seq",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 48
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_prot.txt'\nfh = open(file_name)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 49
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_string = fh.readline().strip()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Make a Seq object out of dna_string."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_seq = Seq(dna_string)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Translate to amino acid sequence, with Biopython taking care of knowing the genetic code."
},
{
"cell_type": "code",
"collapsed": false,
"input": "aa_seq = dna_seq.translate(to_stop=True)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Convert from a Seq object to a string."
},
{
"cell_type": "code",
"collapsed": false,
"input": "aa_string = str(aa_seq)\nprint aa_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "MGMREIQTLADLSSVLGIRYRVVRRRSCQRVIKTWVSYVIRTFHPHRVGLAGLEGRSLGTHMVGANGVMLTVTTLTSLGAIAAISSRPSQSIRGSVFSPRTIYGIQAYQHIHSPAYGKTWPSRERNSRTSSGQNCAAMLLELQHPRLVTCRSRNMFLGTACLPLSGPDSVSIEIMALVEFGGVSWKRIALSVYSRMGAGSTWALTQITKHIDTGTRLAGAHCLCWARGYKFVSGPTPPANWTLEYHSPGTSYNSTSCKRGMITRRMSWHLFGSLPNLSCKIHCCSRCDHWERSIRATCFRWSQCKAQECSDIDVDHICFAARGSRKTSLASWIVGYQWVVVLMETGALKAGLRPRAMRIQSPSSVSSFGTRVAHFRSRGPIGQPSSAESMIVAYWTIAYKLCRAPYLVVERLSNWSMTVYTFRLYKTNHGIPSNIKTPSRTARAFPTWKIPTDPVTAARFGGSRCNTKPKSPRVFEADIDRCSYSCDLKAQYQLSPDEITCLLSLVISVKHPRRRLLQSAFYCPMVGLGRSLKRGCQPFTACNDWTYRPSVKGSLACCAVLLQVSSVGRFIFPKRPKTIFLGERWPRARIISCRTIKAMKQYILIRRPTVAPEVCLRTSDPMHPLRLQPRGSCEYMLYSHLSRSRKARHAIETPRAPITKLLGSRTLPPQRVEVPIDSLVPLSPLQYGRQLWVFRATVWRGFLKTRLSQDLIVQSKMQNFYMLMSLQSTPEKNRPNPKNIRKFRTQCTSLLYCRNQHLRISFPRDGLVAVRGVCYLNRPRSLKVHRYRANIVRFKQFGVITVRYPSVTTGGGEKWSPDGRYMITRVNHRQMRARLPLNDRAHLLYPLGAELIFWYVGAARIHQMADDRNHMASTGLGTNCVREALAILARGRCMPRSHATFTSVRLHLTLTVPESHLCWKELLRAFSLATRNLQLVASIKPSTKHSHTINNLRRWVWLTRWMESYLPLYLVLRRCRVTLLLLLILTLSGQQIADGSDRGNHFTTLLHLYLDTKNHGKVPAVEPSSIIPLHSKPTHSGLLMSQYHRVRSCPVISRYAYCGRQPRVHNVIARNVFRPKVSEASYVKNTSTLRHSEYSNSCGLSALIDQNIFEAIYEGNVNTCENARSVKRRVISVILPYRLLMYICRAEAATPFLLRVNTRPDGDGANININKFVATSNIEAPDYKVNDKAATVGDAEVAIGCLELWSASTFCAVIGGTQCAILLMISDCALHISVRVPLSRAEHRCIGVTGGAHVDLPFARGIRHAGWHHARVDRGYDGVTLHFRQIIRKDFGSAGIIRVDGSHDFVSKDSYTNGVYSKNFTSTDPLNEITYDTERCYRVSITGLRLTYQCRFGSQVQMQTPLTNATIELYSRVNEPRTYDRFSRKLASNYIGKLVASWPNTAVGASRNTRRLGRTVASWGSESVEQTVLSCLLSLSVLQLIESARKPSVPKCYELRALSPMTCALHTSVITATVLDVNPQRVKVWLPGFLQSDARYDVVGPPFVKRSDGKSCPLRGPYPPAISDLSPWPLHPEPLFGTRTGTQPSMYSKRVFLLPTLRVGTSQADVTRMTAYCTLRLAHPTFTGASEQVQDTIPVIRSNGAHFLKAPYPLGHKFRCTSTASAGWHVDLITASGSFTTYVRLNGSTLTIVWRSPWSTHFDKASFSTVVLLRCSFSYCGSPRGANRGLALHQRYLLPWRQGAKKRILLFVRQLCTAMNFSFIETVRLEPNYSKHATVSTAGPARTCCYVFRANAKTGLVELCSPIAYPENLIGVLELKCFRHSPSSARACQPLPVAWRPHNITEPQTIYGYTQFANFSRYSPEYSTFTGVMASLYSPGSAPIAMREEICQLALADAVGPRKSPWLGQLTPLHSYVTSTSMSTDGRWRGPRQSSLCAPGARASQPTTLWAYPSSKPLRRRQIALTRQRYKSGQTMACVLLCLAIFIGLRCNQPKINSISLVTFYVSSVILHYTHGPVLGSRFFHLSGYHAMRPRDACSIKMICGTGPSPCLRVANNTLAVVTHHIYILALDLPSSRSCLTTGRDLRATLVYVYAFPFPNGAFRGGWLLRFNPLLKPSRLYREGYIHESELGPMHASGSGGTSCPIEAGVCIDRIPQERLIQVRLLMLFVLAASLTLGAQNIAIQRARSVVYAIPLYVHKRLRLGSMLHDSTLALGYLKTRFSVFRGRRLAAASIPDLAQLTDDVRLLPHVKQSTCINDDVLQPKSKSVSGITKRIELGPNKKGRTPGVNTLKQLVGDTRGVLRKTRGLSAVSTHPTRSSNCGVYDEGKPGGTSNNLLYKTILATALLGQLVSIAYMCLLCRGFKLANITERTSGSTYCLPVIASTTPAFQPTGNRTLGSTKPTRKGELLGGMPCFCNYRPSSRSRVTTLIRVSYVCSCEINKVGTSPNLPPERLASSYFGSRSLVGLIGVAIVESSDCSRVRKLLFVLCKQPIKPLFDKTPAEPLRAVVTSLCYASIQELPPGLRLATNIRKNAILGPICIRSDHLRSGLVLQNQITRTLATILALTQWEFNPARWLPEPRSDGDDLAASPKAYDPQLCNFFDLRRRISRTCKELSNGCGSFSSIIVLAGLIVSKSYVKEQRSLSGTRIFLSFEAPAVCLMLQIRKKWINGDAMEREAPVTSCIKQKRIINAFSFAVLQQSVLSAPAPKCHHYNRCPFESSPFNAVSVSLGHSRTPYNPCSQQAGDVMTFLEKSCDQVVGIVVRSQYCIATNHVESRCGWKVPSEAWLLNKLLRSISGCTGNNARNTAWSMTRTEVHSWKWQYANGGLRGEGMKHFRQWGHLHMLGFVRDRRSEHYLVPNNPSTCDWPQSSSRAIGELHIFGVSAVRTSGLLFSPLCSGDSGSNLEGALNEYTDHAFKPVTVLTSYKGPRHTRSDRWLHDVITTAIRRRCQFTAELSPRFNAAPHMGRTCVFRCVLPTSNASLNPPMSTRHRSGFQSYNVKIRSSGTYSPGRVESPVCLWQIPLRALTNSPAPDIMFILGKQHLFALELVLRPTNEAKDRDRISYSSPAPALLVTSGGSPVSYNTWPIMNSRCSEFASTVRLTFYLSHRRSRETRVTSIIQVPKSSQCALTERTVDRTVGREAISVRQLLSHSATAVFLIKTANQAPLIRGQPSCFNGFDASHVHHWHTITRRPLQLLTGGNLYIKRHHHLWACKVSVDFWIGCPVPGAIPDVFPGGALGSEVRLGARRFVINPPRPGPLHNPQLNLGSRLDLYCGSRRVLLEKGDGRRTV\n"
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Finding a Protein Motif##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I have placed gc_content.py in the same folder that this notebook is in. Now I can import gc_content and have access to all its guts in a namespace called 'gc_content'."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import gc_content",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This import statement evaluates all of the code in gc_content. Usually this is not what you want to happen. The file being imported will have some stuff it is supposed to do on its own that you don't care about when you are importing it. The \"if \\__name__ = '\\__main__':\" idiom takes care of this problem"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "If we write docstrings for our functions, we can get reminders about what they do."
},
{
"cell_type": "code",
"collapsed": false,
"input": "gc_content.fasta_records?",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Generators are a convenient way to define your own iterators."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def simple_generator(n):\n for i in range(n):\n yield i\n \nmy_gen = simple_generator(4)\nprint [number**2 for number in my_gen]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[0, 1, 4, 9]\n"
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": "urllib provides a way to read files from the web as seamlessly as if they were local. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "import urllib\n\nopened_url = urllib.urlopen('http://www.uniprot.org/uniprot/A2Z669.fasta')\nname, seq = gc_content.fasta_records(opened_url).next()\nprint name\nprint seq",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "sp|A2Z669|CSPLT_ORYSI CASP-like protein OsI_33147 OS=Oryza sativa subsp. indica GN=OsI_33147 PE=3 SV=1\nMRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR\n"
}
],
"prompt_number": 57
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Regular expressions make it easy to look for the N-glycosylation motif."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import urllib\n\nmotif = re.compile(r'(?=(N[^P][ST][^P]))')\n\nfile_name = 'data/rosalind_mprt.txt'\nfh = open(file_name)\n\nfor line in fh:\n uniprot_id = line.strip()\n url = 'http://www.uniprot.org/uniprot/{0}.fasta'.format(uniprot_id)\n opened_url = urllib.urlopen(url)\n\n for name, seq in gc_content.fasta_records(opened_url):\n ps = [str(match.start() + 1) for match in motif.finditer(seq)]\n if ps:\n print uniprot_id\n print ' '.join(ps)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Q3T0C9\n15 38\nQ32LI2"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n157\nP00740_FA9_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n203 213\nP04233_HG2A_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n130 136 256 270\nP22891_PRTZ_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n99 225 233 306 332\nP80195_MPP3_BOVIN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n95\nP02974_FMM1_NEIGO"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n67 68 121\nB3CNE0"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n107\nP98119_URT1_DESRO"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n153 398\nP21810_PGS1_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n270 311\nQ9D9T0"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n154\nQ4FZD7"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n528\nP19823_ITH2_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n118 445\n"
}
],
"prompt_number": 58
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment