Skip to content

Instantly share code, notes, and snippets.

@jeffhussmann
Created January 18, 2014 00:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jeffhussmann/8484499 to your computer and use it in GitHub Desktop.
Save jeffhussmann/8484499 to your computer and use it in GitHub Desktop.
Explanations of some Rosalind problems used in the Python tutorial.
{
"metadata": {
"name": "Jeff Hussmann 01-16-14 Rosalind Bioinformatics Stronghold "
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Counting DNA Nucleotides##"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_string = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We will use a [dictionary](http://docs.python.org/2/library/stdtypes.html#mapping-types-dict) to keep track of nucleotide counts. Using dictionaries effectively is one of the keys to getting the most out of Python."
},
{
"cell_type": "code",
"collapsed": false,
"input": "counts = {'A': 0,\n 'C': 0,\n 'G': 0,\n 'T': 0,\n }",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Iterating over a string produces the characters of the string one at a time."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for c in dna_string:\n counts[c] += 1",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "List comprehensions are a powerful way to construct lists by peforming some operation on each of the elements in an object that can be iterated over."
},
{
"cell_type": "code",
"collapsed": false,
"input": "count_strings = [str(counts[base]) for base in 'ACGT']",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": "print ' '.join(count_strings)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "20 12 17 21\n"
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Transcribing DNA into RNA##"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_string = 'GATGGAACTTGACTACGTAAAT'",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can't directly change the characters in the string - strings are immutable. \n(If you aren't familiar with the `enumerate` built-in function, check out it's documentation.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for i, c in enumerate(dna_string):\n if c == 'T':\n dna_string[i] = 'U'",
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'str' object does not support item assignment",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-7-e1e72f21c172>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mc\u001b[0m \u001b[1;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdna_string\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mc\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m'T'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mdna_string\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'U'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m: 'str' object does not support item assignment"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We could build the string up incrementally ... "
},
{
"cell_type": "code",
"collapsed": false,
"input": "rna_string = ''\nfor c in dna_string:\n if c == 'T':\n rna_string += 'U'\n else:\n rna_string += c\n \nprint rna_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "GAUGGAACUUGACUACGUAAAU\n"
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "... but string objects have a `replace()` method that does exactly what we want."
},
{
"cell_type": "code",
"collapsed": false,
"input": "rna_string_2 = dna_string.replace('T', 'U')\nprint rna_string == rna_string_2",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "True\n"
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Complementing a Strand of DNA##"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_string = 'AAAACCCGGT'",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One approach is to use a dictionary to encode how each nucleotide gets complemented and to use the built-in function reversed() to iterate over the DNA string in reversed order."
},
{
"cell_type": "code",
"collapsed": false,
"input": "complement = {'T': 'A',\n 'C': 'G',\n 'A': 'T',\n 'G': 'C',\n }\n\nrc_string = ''\nfor c in reversed(dna_string):\n rc_string += complement[c]\n \nprint rc_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "ACCGGGTTTT\n"
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It turns out that there are built-in functions in the `string` module that do exactly what we want."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import string",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": "complement_table = string.maketrans('TCAG', 'AGTC')\nc_string = dna_string.translate(complement_table)\nrc_string_2 = c_string[::-1]\n\nprint rc_string == rc_string_2",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "True\n"
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Why does [::-1] work?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten = range(10)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Taking a slice a:b returns the interval of the list starting at index a and going up to but not including index b. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[2:4]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": "[2, 3]"
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Leaving off part of the slice means 'start from the beginning'..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[:4]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": "[0, 1, 2, 3]"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "... or 'go to the end.'"
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[4:]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": "[4, 5, 6, 7, 8, 9]"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "An optional third slice argument gives a stride length to step along the interval with."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[2:9:3]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": "[2, 5, 8]"
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Stride length can be negative...."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[9:2:-2]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": "[9, 7, 5, 3]"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "... so this idiom means 'start at the end and go through to the beginning stepping backwards one index at a time', which is a complicated way to say 'reverse'."
},
{
"cell_type": "code",
"collapsed": false,
"input": "up_to_ten[::-1]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Counting Point Mutations##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "`zip()` is useful when you have several lists that you want to process element-wise together. \n\n(Questions: Why do you think this is called zip? What built-in function that we have seen is equivalent to zip(range(len(seq)), seq)?)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "numbers = range(4)\nletters = 'ABCD'\n\nfor n, l in zip(numbers, letters):\n print n, l",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0 A\n1 B\n2 C\n3 D\n"
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Lets use `zip` in a filtered list comprehension to produce a compact, simple Hamming distance function."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def hamming_distance(first, second):\n ''' Returns the number of positions that differ between first and second. '''\n distance = sum(1 for f, s in zip(first, second) if f != s)\n return distance",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_hamm.txt'\nfh = open(file_name)\nfirst = fh.readline().strip()\nsecond = fh.readline().strip()\n\nprint hamming_distance(first, second)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "482\n"
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Enumerating Gene Orders##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The `itertools` module is full of all kinds of cool stuff."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import itertools",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": "how_many = 3\nperms = itertools.permutations(range(1, how_many + 1))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In general, it doesn't make sense to ask for the length of an iterator."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(perms)",
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "object of type 'itertools.permutations' has no len()",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-26-bb2d7ce08cd2>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mperms\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m: object of type 'itertools.permutations' has no len()"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": "For iterators we know to be finite, we can 'cast' to a list."
},
{
"cell_type": "code",
"collapsed": false,
"input": "list_perms = list(perms)\nprint len(list_perms)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "6\n"
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To produce the desired output format, we need to convert lists (actually, tuples) of numbers to lists of strings.\nLet's use a list comprehension."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for perm in list_perms:\n strings = [str(value) for value in perm]\n print ' '.join(strings)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1 2 3\n1 3 2\n2 1 3\n2 3 1\n3 1 2\n3 2 1\n"
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Finding a Motif in DNA##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One way to find all occurrences of substring in string_to_search is to check if the substring is a prefix of each suffix of string_to_search."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def find_locations(substring, string_to_search):\n ''' Returns a list of all positions in string where substring occurs. '''\n locations = []\n for i in range(len(string_to_search)):\n if string_to_search[i:i + len(substring)] == substring:\n locations.append(i)\n return locations",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_subs.txt'\nfh = open(file_name)\nstring_to_search = fh.readline().strip()\nsubstring = fh.readline().strip()\n\none_based = [str(l + 1) for l in find_locations(substring, string_to_search)]\nprint ' '.join(one_based)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "2 63 74 92 99 126 143 158 166 174 199 253 315 332 339 363 403 410 426 433 500 526 535 542 582 589 661 733 752 768 775 808 815 822 872 950\n"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "When you are searching for matches to a pattern in a string, you should think 'regular expressions'. (Of course, [there is this](http://regex.info/blog/2006-09-15/247).)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": "`re.finditer(pattern, string_to_search)` will produce (an iterator over) objects representing matches of pattern in string_to_search. These match objects have a `start()` method that gives the location of the match."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pattern = substring\nfor match in re.finditer(pattern, string_to_search):\n print match.start(),",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1 62 73 91 125 142 157 173 198 252 314 331 362 402 425 499 525 534 581 660 732 751 767 807 821 871 949\n"
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A gotcha here is that overlapping pattern matches will not be found. (By default, the re engine 'consumes' the string as it moves through, so regions that are part of one match never get looked at again to be part of a potential second match.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "matches = re.finditer(pattern, string_to_search)\nre_one_based = [str(m.start() + 1) for m in matches]\nprint re_one_based == one_based",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "False\n"
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": " Adding a (cryptic to the uninitiated) lookahead flag to the re pattern changes this behavior."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pattern = '(?={0})'.format(substring)\nmatches = re.finditer(pattern, string_to_search)\nre_one_based_2 = [str(m.start() + 1) for m in matches]\nprint re_one_based_2 == one_based",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "True\n"
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Protein Translation##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One of the strengths of Python is the existence of high-quality modules for solving common problems in different domains, such as Biopython."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from Bio.Seq import Seq",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_prot.txt'\nfh = open(file_name)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_string = fh.readline().strip()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Make a Seq object out of dna_string."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dna_seq = Seq(dna_string)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Translate to amino acid sequence, with Biopython taking care of knowing the genetic code."
},
{
"cell_type": "code",
"collapsed": false,
"input": "aa_seq = dna_seq.translate(to_stop=True)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Convert from a Seq object to a string."
},
{
"cell_type": "code",
"collapsed": false,
"input": "aa_string = str(aa_seq)\nprint aa_string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "MGMREIQTLADLSSVLGIRYRVVRRRSCQRVIKTWVSYVIRTFHPHRVGLAGLEGRSLGTHMVGANGVMLTVTTLTSLGAIAAISSRPSQSIRGSVFSPRTIYGIQAYQHIHSPAYGKTWPSRERNSRTSSGQNCAAMLLELQHPRLVTCRSRNMFLGTACLPLSGPDSVSIEIMALVEFGGVSWKRIALSVYSRMGAGSTWALTQITKHIDTGTRLAGAHCLCWARGYKFVSGPTPPANWTLEYHSPGTSYNSTSCKRGMITRRMSWHLFGSLPNLSCKIHCCSRCDHWERSIRATCFRWSQCKAQECSDIDVDHICFAARGSRKTSLASWIVGYQWVVVLMETGALKAGLRPRAMRIQSPSSVSSFGTRVAHFRSRGPIGQPSSAESMIVAYWTIAYKLCRAPYLVVERLSNWSMTVYTFRLYKTNHGIPSNIKTPSRTARAFPTWKIPTDPVTAARFGGSRCNTKPKSPRVFEADIDRCSYSCDLKAQYQLSPDEITCLLSLVISVKHPRRRLLQSAFYCPMVGLGRSLKRGCQPFTACNDWTYRPSVKGSLACCAVLLQVSSVGRFIFPKRPKTIFLGERWPRARIISCRTIKAMKQYILIRRPTVAPEVCLRTSDPMHPLRLQPRGSCEYMLYSHLSRSRKARHAIETPRAPITKLLGSRTLPPQRVEVPIDSLVPLSPLQYGRQLWVFRATVWRGFLKTRLSQDLIVQSKMQNFYMLMSLQSTPEKNRPNPKNIRKFRTQCTSLLYCRNQHLRISFPRDGLVAVRGVCYLNRPRSLKVHRYRANIVRFKQFGVITVRYPSVTTGGGEKWSPDGRYMITRVNHRQMRARLPLNDRAHLLYPLGAELIFWYVGAARIHQMADDRNHMASTGLGTNCVREALAILARGRCMPRSHATFTSVRLHLTLTVPESHLCWKELLRAFSLATRNLQLVASIKPSTKHSHTINNLRRWVWLTRWMESYLPLYLVLRRCRVTLLLLLILTLSGQQIADGSDRGNHFTTLLHLYLDTKNHGKVPAVEPSSIIPLHSKPTHSGLLMSQYHRVRSCPVISRYAYCGRQPRVHNVIARNVFRPKVSEASYVKNTSTLRHSEYSNSCGLSALIDQNIFEAIYEGNVNTCENARSVKRRVISVILPYRLLMYICRAEAATPFLLRVNTRPDGDGANININKFVATSNIEAPDYKVNDKAATVGDAEVAIGCLELWSASTFCAVIGGTQCAILLMISDCALHISVRVPLSRAEHRCIGVTGGAHVDLPFARGIRHAGWHHARVDRGYDGVTLHFRQIIRKDFGSAGIIRVDGSHDFVSKDSYTNGVYSKNFTSTDPLNEITYDTERCYRVSITGLRLTYQCRFGSQVQMQTPLTNATIELYSRVNEPRTYDRFSRKLASNYIGKLVASWPNTAVGASRNTRRLGRTVASWGSESVEQTVLSCLLSLSVLQLIESARKPSVPKCYELRALSPMTCALHTSVITATVLDVNPQRVKVWLPGFLQSDARYDVVGPPFVKRSDGKSCPLRGPYPPAISDLSPWPLHPEPLFGTRTGTQPSMYSKRVFLLPTLRVGTSQADVTRMTAYCTLRLAHPTFTGASEQVQDTIPVIRSNGAHFLKAPYPLGHKFRCTSTASAGWHVDLITASGSFTTYVRLNGSTLTIVWRSPWSTHFDKASFSTVVLLRCSFSYCGSPRGANRGLALHQRYLLPWRQGAKKRILLFVRQLCTAMNFSFIETVRLEPNYSKHATVSTAGPARTCCYVFRANAKTGLVELCSPIAYPENLIGVLELKCFRHSPSSARACQPLPVAWRPHNITEPQTIYGYTQFANFSRYSPEYSTFTGVMASLYSPGSAPIAMREEICQLALADAVGPRKSPWLGQLTPLHSYVTSTSMSTDGRWRGPRQSSLCAPGARASQPTTLWAYPSSKPLRRRQIALTRQRYKSGQTMACVLLCLAIFIGLRCNQPKINSISLVTFYVSSVILHYTHGPVLGSRFFHLSGYHAMRPRDACSIKMICGTGPSPCLRVANNTLAVVTHHIYILALDLPSSRSCLTTGRDLRATLVYVYAFPFPNGAFRGGWLLRFNPLLKPSRLYREGYIHESELGPMHASGSGGTSCPIEAGVCIDRIPQERLIQVRLLMLFVLAASLTLGAQNIAIQRARSVVYAIPLYVHKRLRLGSMLHDSTLALGYLKTRFSVFRGRRLAAASIPDLAQLTDDVRLLPHVKQSTCINDDVLQPKSKSVSGITKRIELGPNKKGRTPGVNTLKQLVGDTRGVLRKTRGLSAVSTHPTRSSNCGVYDEGKPGGTSNNLLYKTILATALLGQLVSIAYMCLLCRGFKLANITERTSGSTYCLPVIASTTPAFQPTGNRTLGSTKPTRKGELLGGMPCFCNYRPSSRSRVTTLIRVSYVCSCEINKVGTSPNLPPERLASSYFGSRSLVGLIGVAIVESSDCSRVRKLLFVLCKQPIKPLFDKTPAEPLRAVVTSLCYASIQELPPGLRLATNIRKNAILGPICIRSDHLRSGLVLQNQITRTLATILALTQWEFNPARWLPEPRSDGDDLAASPKAYDPQLCNFFDLRRRISRTCKELSNGCGSFSSIIVLAGLIVSKSYVKEQRSLSGTRIFLSFEAPAVCLMLQIRKKWINGDAMEREAPVTSCIKQKRIINAFSFAVLQQSVLSAPAPKCHHYNRCPFESSPFNAVSVSLGHSRTPYNPCSQQAGDVMTFLEKSCDQVVGIVVRSQYCIATNHVESRCGWKVPSEAWLLNKLLRSISGCTGNNARNTAWSMTRTEVHSWKWQYANGGLRGEGMKHFRQWGHLHMLGFVRDRRSEHYLVPNNPSTCDWPQSSSRAIGELHIFGVSAVRTSGLLFSPLCSGDSGSNLEGALNEYTDHAFKPVTVLTSYKGPRHTRSDRWLHDVITTAIRRRCQFTAELSPRFNAAPHMGRTCVFRCVLPTSNASLNPPMSTRHRSGFQSYNVKIRSSGTYSPGRVESPVCLWQIPLRALTNSPAPDIMFILGKQHLFALELVLRPTNEAKDRDRISYSSPAPALLVTSGGSPVSYNTWPIMNSRCSEFASTVRLTFYLSHRRSRETRVTSIIQVPKSSQCALTERTVDRTVGREAISVRQLLSHSATAVFLIKTANQAPLIRGQPSCFNGFDASHVHHWHTITRRPLQLLTGGNLYIKRHHHLWACKVSVDFWIGCPVPGAIPDVFPGGALGSEVRLGARRFVINPPRPGPLHNPQLNLGSRLDLYCGSRRVLLEKGDGRRTV\n"
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Computing GC Content##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Parsing a FASTA file is a little messy. After each line that starts with a `>`, we need to read an indeterminant amount of lines until we encounter the next one that starts with a `>`.\nThis function takes in a file name and return a list of tuples of (name, sequence)."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def fasta_records(file_name):\n ''' Returns a list of (name, seq) pairs for fasta records in file_name. '''\n records = []\n fh = open(file_name)\n name = fh.readline().strip().lstrip('>')\n while name:\n seq = ''\n line = fh.readline().strip()\n while line and not line.startswith('>'):\n seq += line\n line = fh.readline().strip()\n records.append((name, seq))\n name = line.lstrip('>')\n return records",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": "file_name = 'data/rosalind_gc.txt'\nrecords = fasta_records(file_name)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Argument unpacking makes code more readable."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for name, seq in records:\n print name\n print seq\n print",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Rosalind_8336\nGCTCACCCGTGTCGCCCGAACATGCTTAACTAACTTGGACAGGTTTGAAATAGCACTTGAAGAAAACGTTTCAGCGATGACCAGACGAGGATACGATCGCTCCTCGGAAGCCCTCTAAGCTCGTCGCAGTGAACAAGACTCTAAGAGTCTTCAGACGCTTGGCGCCGGTAGGTGGTTCTGTGCACTACTCCACGTCGTCCGGTCGAGTAATAAGCAGGGATCTTTCGGCGTTGTACTAACAGCTCATAACTCCAAAGCCAAAAACTCAACTGGGAAAGCGTAGTGTCAGATGACTGTACACTCACACAAGTACGGTCGGCAGGGGCACCGTTACGCCCCAGACCGTTGCGTATTTCTGCAGTCCGGCGTAAGTAGATTGTCCGTGGGCTTACTCTTGGAATTATATCATGTCTCTGGCTTGGAATTGGCGTTACATTGTGGGGACCCGCTGCGGCCGTCCCTATTCATATCATTTTTGAGCATAATGATTTTTTACAACAAAAATGTAGCACGGAATGCCGAATTTAAGTCAGTTTACATCAGTGAAGCGGTTTGCCCAACGGCCAGACTGCGTGGGGCACAGTTGGCCTATGTAGGACCACTCGTGCGCATTAGAACTTCAACCCGTTGACTACGCCGGAGCGGGACAACGTAATTCATTGATCACTGGACCCGACAGTACGGGTGGACACAATGTGTCAATGTAGGCCAGCATAGTTAAACATCGAGCCCCGTGTGACAGAAGTTTGAAGCGATGTAGTGATGGTCTTCACTTGTCATGACGGTGGGGCCTCGGGAAAGTTGGCATCCCTACAGCGTTTCAGATGCGGTCCAATGCTAGCGGACAATTCTAGCACGTTCTAGGCTGCCGCTAGGTCAACCCGCCACATGATCCACTGAAAGT\n\nRosalind_6236\nTATGCAAATTTATGGCGCGGCCATTGAGTAAGACTTCAGCAGCACCTGATGTTCGCATCTAAAACCTGTTAGGTACTGTATTGCAATCGAATATTGTGACCCGGTCCACACGCACGCATACTTGTGCTATGAACAATATCAACTACCCCACACAAGCGTTGGCGGTTGGGTACTCGAAACTATAAGATTCGCCCATGTGTGTAAAGGCGTTGTCATCTCACTGAAACCTTGCCTCGAGTCCTAGAGAAACTGTTTAGAATGAACCGCTATCGGCCAGCGATGCCCCTCCTCTGTCTGGCCGTAATTGGGGCCTAGACTCCGATGTATCTGATGTATGCACGGAAAAAGCTCCACGCAGGTTACGGAGGGTGCGCTAATAATGGTCACAGCGGGTCGATGGGGTTACGTTATACATTCCCAAGCAGTTCGACATCCCGCAGATCGAATACACACTGTCATGCAGCGTCACTGCCCACCTGCACCGCCGGTTAAACAAGAATTAGTTCCGGTCCTCCGCCTGGGTGATGGGTAACTTCGGCATGAGGTTACAACCACGAGATCGACAAATGCCTCACCGGAGGAGATTCGAACCGCGTGGGATGTTCGAATTGGCGCCCCGCATAGTTCCGTTCCAATATCATCTTTTGGCCACTGATCCCTGCGGGTTCTTGCCGGAGCCGGTTGAAATGTGGGCGCGACCTGGCAACATTTGAATGGCAGATCCAGACCACAACTGTAGGAAACCCTCCTATTCTAGTCTGGATTCAACTACCGTGCACCACAGTAGTTGGACTGAACGGACAGCTTCCGTCTGGGCCTTTCCTTCAAGGGTTGCGAATTAACGCCCTTCCCCTCTTCACTAGTCAGCGTTCGCGGACCCTAACGAAGACGAGAGATCATCATTGGTGTACTAGCGTC\n\nRosalind_9301\nGATATGAGGTGACCTTTCGGGTATCGAACAGCGAAACAACGGTGGCATAAGCTCTCTCTGCCTTTGATATTTCCCAAGTGCGTGTCTGTGTGGATGAGCCTTCGACCAGTCTACCACTCAACAGTGAGTTAAAAGCCAAATTACAAATTTTGCGGTTTCCCTTCTTGGTGCGCATGCCTATCAAGCTGCCGTACTTTAAGAGACGGCAAACCTCACAGTACGCCAAGGGGTTATTCGTCTTGTCCGCGGATAGGATATGTCAGCGCGCGATAACGAGCTCATCCAAAGATTCGCGTAAGCATTGCCAGGTCGTAAGGGCTTAGCAGTCGAACTTTATTAGTATCCAAATCAGTCAAGGAGACGGATTCCCTTAACGCCTACCTGATAAACTACCTAGCCCTGTTACGCCTGCGAGAGATACCCGCGGCGAGAGTCGGATGCTCGATGCCGGTAGCCTGTCTAACGGTATACTAGTAAAATCGGGTGTCAGCAAATGTAGGTTACACACGCGTTTAAGTAACGCTGACAAAGAACCTGTCCGTGCTTCTTGCCAACACTATCTTCATCGTATTCTACTTAAGCGTGCACACGGGTCCTCTTAGACCTAGTTATTTGTGTGAAGTAGACCGCTGGGCGAGTAAGCCGGTTTATTCTCCGCAGACTTGGCCAACTCGTAGTAGATCTCACAGAATCCCATCAGGTATAACACAGGCAACTGGGCGGCGGCTCAAGAAGCCGGGATAACTTAGGGGGCTATTCGGAGCTGTCTTGGGGAGCTGGTATCGAGTAGCGAGTTAATAATACTTGCAAA\n\nRosalind_6955\nTGACTTATCAAAGACCTAGGAAGTTTCAGGCCTCTCTTAGTATGCCACCCACGCTGGACGGCGCTGAATGCCTCAACCAGAACGACCCGTAAACAAAAACCGGAATCGGGCTGGGATGTTCTAGGCGGTAAGGTGCGATGCTGTAGCTTATTTCGCACACTCCCCTAATTCATAAGAAACGTAACGCCGTAGCCTAATTTACTCCGATCGAAACGCATGAGTCGTGCGTTTGGGGGAGCAAAGATCATCATTGAGGCGATGGGCCGCCAAGTAAGCATGGTGGGTGTGGGTCTTAGCGGCTGCTTGGGGCGTGGACGTGTCCGCGCTATACCCGAAATTTTGCGGCAAGCTCGGCCTGGGAAGTGAGCTCCAGCGAACGTAGTATCTGGTGCGGACAGACTTAGCAGTATCAAAAAACGTCGGTCCCATCTTACATAGGATCTTTAACTGTATGTGCCATCGCATGCGTAACTTCAGTCGCTCGGTTCCGACGGAGCGGACAGGTCCGTGAACGGTGACCCCATGCAAGCCAGAACTCCAGGGATCGCCGGCGGGCGTCTGAAGGAACAAACACCAATGTAACATGCTTAACCCGCGCGATGTTTAAAACTCTGCACCATGAGCTTAGATACTGGTGACCATTCCGAGCCGCCCTGATCCGAATCTCTTGTAAACTGTCGGAGGGTACTATTGGGGCATGTGTGGAGGACCGAGATGTACGAATGGATATAGCTTAACGCTGCTGCACGAAATCGACTGTTACGTCTCACCTGTTCGGCCTTAGCGTCAGACCTGAACCCAGAGGTGCGCGTTACATTGAGGACCGGTACGAAATTGGCTCTGATTATAAGTCATCGTAAGAGGATTACAAGCGCATCTGCGTGGCTCTCGTGCCGTTCGTTTTGGAACACCTCCGATTGGAGGTTATCAGTCAGCCCATCAATATCAGTGCTAACCAGCTCGTGGCGCCATGCCCGTTACCCGAAGATTCAACC\n\nRosalind_0896\nGCTGCAAAACGTACTCAACAGAATAGTAACCAGCAATCCGGATTCGCATGCTAAGTAGTATCAGGAGCCACGATTATTGCCTGCGCTTGCGTATGGCAGCCAGTTGTGAAGCGGTGAAGGGATATGATAGTAAGAAATCTTTAGAACCAGAGTGCATGACATGTCCGACCCTTTTGGGCCCATTATCCGCGACGTCCAGTCGTACCGTACCGGCGGCCCCTTAATTCTAGCTAAAGTTAGTCTCATAGACGTACGGTCAACCATCTGTCGTTGTCGCAGCAGATTAATCACGCTTATTAAGATCCGCTTCTACATCCACGTCTCTAGTACCTGCTGCCGACGCATCCGCTAACTATGCTTCTTGTGCAGTCGTCTATGCGCCGAGCCCATTATCTAATCTCTGCATACTTGCGTGTCCACGGATTTCCGTATTCTAACCCTCGGATCTACGTTTGCCGGCTGTCGACCGATCTTTAGGCTCACAGTCATCCGACCTGATCTGGGTCTAGTGCCGGTTATTACAGTTGACAGGCCATGCGTTCCACGGTGGAGACTTATACGTTGGGTGGAATGCTTCTTTTAACTGTGCATTTGATCCCTGCGTCAGTCTTTTTCAGACCTTAAAACCGACCGTGCTTGACGGCGAGTCACCGTCCATATAATTGTAGGACCCCGCTCTAGACTCGTGAGGTCATAGTGTAGGCCAAAGAGGTTGTTGTGCTCGTCTGTACTACCAACAGCTCTAGAAGGACGCGTGCAGTCATGTGGAGATGGGGGGGCCGCACGTTGAAAGGACCGGCAAGTCCACGACCCAACAAGCGTTTCGATACCCAAGCAGGCAGATTCATGTTACAATCGTGCGGCTCAGTACCAATTGCATAGTGTCACAGACATTGATATGTCTAAGTACTTATCCCGGGATTGCCACCTGACAGTTT\n\n"
}
],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Conditional expressions can be used to filter list comprehensions."
},
{
"cell_type": "code",
"collapsed": false,
"input": "test_seq = 'GCATATATGCTAG'\ngcs = [char for char in test_seq if char == 'G' or char == 'C']\nprint gcs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['G', 'C', 'G', 'C', 'G']\n"
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": "gc_count = sum(1 for char in test_seq if char == 'G' or char == 'C')\nprint gc_count",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "5\n"
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": "def gc_content(seq):\n ''' Returns the GC content of seq as a percentage. '''\n gc_count = sum(1 for char in seq if char == 'G' or char == 'C')\n return 100 * float(gc_count) / len(seq)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": "lambda expressions can be used to create anonymous functions.\nWe will use a lambda expression to tell `max` to sort a list of tuples by the second element in each tuple."
},
{
"cell_type": "code",
"collapsed": false,
"input": "records_gc = [(name, gc_content(seq)) for name, seq in records]\nname, gc = max(records_gc, key=lambda (name, gc): gc)\nprint name\nprint gc",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Rosalind_6955\n52.6633165829\n"
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": "##Rosalind: Finding a Protein Motif##"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I have placed a file `gc_content.py` in the same folder that this notebook is in. Now I can import `gc_content` and have access to all its guts in a namespace called 'gc_content'."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import gc_content",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 48
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This `import` statement evaluates all of the code in `gc_content`. Sometimes this is not what you want to happen. The file being imported will have some stuff it is supposed to do on its own that you don't care about when you are importing it. The `if __name__ = '__main__':` idiom takes care of this problem."
},
{
"cell_type": "code",
"collapsed": false,
"input": "!cat gc_content.py",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "def fasta_records(fh):\r\n ''' Yields (name, seq) pairs for fasta records in the file-like object fh. '''\r\n name = fh.readline().strip().lstrip('>')\r\n while name:\r\n seq = ''\r\n line = fh.readline().strip()\r\n while line and not line.startswith('>'):\r\n seq += line\r\n line = fh.readline().strip()\r\n yield name, seq\r\n name = line.lstrip('>')\r\n\r\ndef gc_content(seq):\r\n ''' Returns the GC content of seq as a percentage. '''\r\n gc_count = sum(1 for char in seq if char == 'G' or char == 'C')\r\n return 100 * float(gc_count) / len(seq)\r\n\r\nif __name__ == '__main__':\r\n file_name = '../data/rosalind_gc.txt'\r\n fh = open(file_name)\r\n records = [(name, gc_content(seq)) for name, seq in fasta_records(fh)]\r\n name, gc = max(records, key=lambda (name, gc): gc)\r\n print name\r\n print gc\r\n"
}
],
"prompt_number": 49
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Notice that the `fasta_records` has been changed to accept a file-like object instead of a file name, for reasons that will be clear below.\n\nIf we write docstrings for our functions, we can get reminders about what they do. \nThis is what pops up in your browser if you were to evaluate `gc_content.fasta_records?`. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "gc_content.fasta_records.__doc__",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 50,
"text": "' Yields (name, seq) pairs for fasta records in the file-like object fh. '"
}
],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Notice the `yield` keyword in the `fasta_records` code. That makes it a special kind of function called a generator. Generators are a convenient way to define your own iterators."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def simple_generator(n):\n for i in range(n):\n yield i\n \nmy_gen = simple_generator(4)\nprint [number**2 for number in my_gen]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[0, 1, 4, 9]\n"
}
],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": "`urllib` provides a way to read files from the web as seamlessly as if they were local. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "import urllib\n\nopened_url = urllib.urlopen('http://www.uniprot.org/uniprot/A2Z669.fasta')\nname, seq = gc_content.fasta_records(opened_url).next()\nprint name\nprint seq",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "sp|A2Z669|CSPLT_ORYSI CASP-like protein OsI_33147 OS=Oryza sativa subsp. indica GN=OsI_33147 PE=2 SV=1\nMRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR\n"
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Regular expressions make it easy to look for the N-glycosylation motif."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import urllib\n\nmotif = re.compile(r'(?=(N[^P][ST][^P]))')\n\nfile_name = 'data/rosalind_mprt.txt'\nfh = open(file_name)\n\nfor line in fh:\n uniprot_id = line.strip()\n url = 'http://www.uniprot.org/uniprot/{0}.fasta'.format(uniprot_id)\n opened_url = urllib.urlopen(url)\n\n for name, seq in gc_content.fasta_records(opened_url):\n ps = [str(match.start() + 1) for match in motif.finditer(seq)]\n if ps:\n print uniprot_id\n print ' '.join(ps)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Q3T0C9\n15 38\nQ32LI2"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n157\nP00740_FA9_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n203 213\nP04233_HG2A_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n130 136 256 270\nP22891_PRTZ_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n99 225 233 306 332\nP80195_MPP3_BOVIN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n95\nP02974_FMM1_NEIGO"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n67 68 121\nB3CNE0"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n107\nP98119_URT1_DESRO"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n153 398\nP21810_PGS1_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n270 311\nQ9D9T0"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n154\nQ4FZD7"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n528\nP19823_ITH2_HUMAN"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n118 445\n"
}
],
"prompt_number": 53
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment