chuttenh/.gitignore Secret

## .gitignore
.ipynb_checkpoints

## M09: Performance evaluation.ipynb

      
## Manipulating data elements.ipynb

      
## W02: Biological sequences.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Welcome to Jupyter!\n",
    "\n",
    "Welcome to Jupyter Notebook, an interactive IPython environment!  As introduced on the first day of class, Jupyter provides a web application that can interactively mix formatted text, Python code, and real-time results of those Python calculations.  We won't use all of its capabilities just yet, but it's a great way to run your own experiments later on, save the results and documentation, and test new ideas.\n",
    "\n",
    "For now, explore the Jupyter interface.  Start by selecting \"User Interface Tour\" from the \"Help\" menu above, and click the Next arrow until you complete the tour.\n",
    "\n",
    "Next, select \"Keyboard Shortcuts\" from the \"Help\" menu, read the resulting popup, and make sure to scroll all the way to the bottom.\n",
    "\n",
    "Great!  Now you're ready to try running some Python.  As you saw in the tour and shortcuts list, one of the main ways to advance through a Jupyter notebook is to press `Shift-Return`, which runs the currently selected cell (the up and down arrows navigate between cells).  \"Running\" a text cell doesn't do anything, but things get interesting when you try it on Python.\n",
    "\n",
    "I've inserted a simple Python cell below.  ***Press `Shift-Return` until it's highlighted as the current cell...***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello, world!\n",
      "This is a random number: 0.3047380552388047\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "\n",
    "print( \"Hello, world!\" )\n",
    "print( \"This is a random number: %s\" % random.random( ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***...and then press `Shift-Return` once more so that it runs.***  You should see a different random number!  Jupyter runs everything interactively, which means that you can immediately modify and re-run any code snippet.  This can be dangerous for real, large-scale, reproducible research workflows, but it's great for exploratory analyses - and for learning and experimentation!  Try it now - use the arrow keys to move back up, and rerun the Python cell a few times to see different random numbers."
   ]
  },
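  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch, not part of the original exercise: if you ever *do* want the same \"random\" numbers on every run (for example, to make an analysis reproducible), you can give the generator a fixed seed.  The seed value `42` and the name `rng` are arbitrary choices made for this illustration.  Using a separate `random.Random` object leaves the shared `random` module untouched, so the cell above keeps printing different numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "# A private generator with a fixed seed; rerunning this cell repeats the same number\n",
    "rng = random.Random( 42 )\n",
    "print( \"This number repeats on every run: %s\" % rng.random( ) )"
   ]
  },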
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Everything in Jupyter runs sequentially - not necessarily top down, although that's the default.  To see what I mean, consider the following variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iNumber = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print( iNumber )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you ***continue to press `Shift-Return` through the next Python cell***, you should see 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n"
     ]
    }
   ],
   "source": [
    "iNumber += 1\n",
    "print( iNumber )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But now, ***use the arrow keys and `Shift-Return` to rerun the preceding cell a few times.***  `iNumber` just keeps getting bigger and bigger - it never \"resets\" to an earlier value.  This behavior can be good or bad - just keep it in mind!  By default, Jupyter (and everything we'll do here) proceeds from the top down.  However, keep an eye on the number in the `In [ ]` prompt to the left of each Python cell.  That will remind you of the order in which the cells were last run."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, before we really get going, you should know how to edit cells as well.  You can of course just click on them, or press `Return` by itself, to change the contents of a cell.  This is true either for text or for Python - if you've accidentally pressed `Return` on any of these text cells, you'll see what I mean!  But it's most important for making Python cells do what you want.\n",
    "\n",
    "Try it now: in the Python cell below, ***change the variable assignment so that `iNumber` prints a prime number.***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Make me prime: 2\n"
     ]
    }
   ],
   "source": [
    "iNumber = 2\n",
    "print( \"Make me prime: %s\" % iNumber )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Don't forget to `Shift-Return` after editing the cell so that it runs, and just in case you run into problems at any point, you can stop a runaway cell (which will be marked with an asterisk `*` on the left-hand side) using the \"Kernel\" menu's \"Interrupt\" item.\n",
    "\n",
    "Once you see a prime number printed above, you're ready to analyze some biological sequences!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating biological sequences\n",
    "\n",
    "Most biological sequences are stored in, and read from, either data files or shared databases, but we haven't learned how to access either of those in Python yet!  Instead, we'll make some simple, simulated biological sequences of our own, ensuring that they're formatted and manipulated like real sequences would be.  First, run the following cell to give yourself a tool for generating alphabetical sequences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def generate_sequence( iLength, strAlphabet ):\n",
    "    \n",
    "    strRet = \"\"\n",
    "    for i in range( iLength ):\n",
    "        strRet += random.sample( strAlphabet, 1 )[0]\n",
    "    return strRet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll talk more about defining and running functions soon, but for now, take a look at the `generate_sequence` function, and then run the three Python cells below to see how it works when you use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'GGATTGTATA'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_sequence( 10, \"ACGT\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CCACUCAGUACCAUUAAGAA'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_sequence( 20, \"ACGU\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'NRYKLLLTITRKWQGPMSILLPCKCMIILV'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_sequence( 30, \"VADEGFLSYCWPHQRIMTNK\" )"
   ]
  },
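  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside, here is one possible alternative way to write the same tool, offered only as a sketch: `random.choice` picks a single character directly, and `str.join` glues the picks together in one step.  The name `generate_sequence_join` is made up for this illustration and isn't used anywhere else in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def generate_sequence_join( iLength, strAlphabet ):\n",
    "    \n",
    "    # Pick iLength characters, one at a time, and join them into one string\n",
    "    return \"\".join( random.choice( strAlphabet ) for i in range( iLength ) )\n",
    "\n",
    "generate_sequence_join( 10, \"ACGT\" )"
   ]
  },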
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try it yourself!  ***Add Python to each of the following three cells, one at a time, to generate your own sequences.***  They don't even have to be biological - you can make up alphabets of your own.  Don't forget to run each cell after you've added the Python that you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For convenience, we can save each of these \"important\" alphabets to reuse with appropriately-named functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def generate_dna( iLength ):\n",
    "    \n",
    "    return generate_sequence( iLength, \"ACGT\" )\n",
    "\n",
    "def generate_rna( iLength ):\n",
    "    \n",
    "    return generate_sequence( iLength, \"ACGU\" )\n",
    "\n",
    "def generate_peptide( iLength ):\n",
    "    \n",
    "    return generate_sequence( iLength, \"VADEGFLSYCWPHQRIMTNK\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'AGTGTGGGGGGTGAGAAAACAAAGGTACTG'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_dna( 30 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'UGUUGACCCCUGGCUCACGG'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_rna( 20 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'WEIRPAVNCQ'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_peptide( 10 )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, try it yourself!  ***Use the three Python cells below to add code that generates additional sequences, using the more realistic biological helper functions this time***, again remembering to run each cell after it contains the code you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since each of our biological sequences is a Python string, we can save and combine them using normal variables and concatenation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TGCCT\n",
      "TCCGCT\n",
      "TGCCTTCCGCT\n"
     ]
    }
   ],
   "source": [
    "strFirst = generate_dna( 5 )\n",
    "print( strFirst )\n",
    "strSecond = generate_dna( 6 )\n",
    "print( strSecond )\n",
    "strCombined = strFirst + strSecond\n",
    "print( strCombined )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One more chance to try it yourself: ***generate a peptide of length 30 by first generating, printing, and then concatenating three shorter peptides (each of any nonzero length).***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Manipulating biological sequences\n",
    "\n",
    "Now that we can create basic DNA-, RNA-, and protein-like sequences, we can teach Python to manipulate them for us.  Transcription of DNA to RNA is easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def transcribe_dna( strDNA ):\n",
    "    \n",
    "    strRNA = \"\"\n",
    "    for s in strDNA:\n",
    "        strRNA += \"U\" if ( s == \"T\" ) else s\n",
    "    return strRNA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ACGUUGCA'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transcribe_dna( \"ACGTTGCA\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GCAGGTCAAGACGTAGCGAACGAG\n",
      "GCAGGUCAAGACGUAGCGAACGAG\n"
     ]
    }
   ],
   "source": [
    "strDNA = generate_dna( 24 )\n",
    "print( strDNA )\n",
    "print( transcribe_dna( strDNA ) )"
   ]
  },
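  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, Python strings have a built-in `replace` method that can do the same substitution in one step.  This is only a sketch of an alternative (the name `transcribe_dna_replace` is hypothetical), and it should give the same answers as `transcribe_dna` above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def transcribe_dna_replace( strDNA ):\n",
    "    \n",
    "    # Swap every T for a U, leaving all other characters alone\n",
    "    return strDNA.replace( \"T\", \"U\" )\n",
    "\n",
    "print( transcribe_dna_replace( \"ACGTTGCA\" ) )"
   ]
  },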
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, try it yourself!  ***Make a new variable to store some generated DNA, print it, and then print its transcription.***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Change the following Python (and run it) to make it `True`:***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transcribe_dna( \"ACGT\" ) == \"UUGGAACC\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Translation is a little harder, since first we have to teach Python the genetic code.  Let's do that, and remember it in a variable so that we can use it later.  Assuming you're still running everything from (mostly) top to bottom, keep going through the following cell..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "c_hashGeneticCode = {\n",
    "    \"UUU\":\"F\", \"UUC\":\"F\", \"UUA\":\"L\", \"UUG\":\"L\",\n",
    "    \"UCU\":\"S\", \"UCC\":\"S\", \"UCA\":\"S\", \"UCG\":\"S\",\n",
    "    \"UAU\":\"Y\", \"UAC\":\"Y\", \"UAA\":None, \"UAG\":None,\n",
    "    \"UGU\":\"C\", \"UGC\":\"C\", \"UGA\":None, \"UGG\":\"W\",\n",
    "    \"CUU\":\"L\", \"CUC\":\"L\", \"CUA\":\"L\", \"CUG\":\"L\",\n",
    "    \"CCU\":\"P\", \"CCC\":\"P\", \"CCA\":\"P\", \"CCG\":\"P\",\n",
    "    \"CAU\":\"H\", \"CAC\":\"H\", \"CAA\":\"Q\", \"CAG\":\"Q\",\n",
    "    \"CGU\":\"R\", \"CGC\":\"R\", \"CGA\":\"R\", \"CGG\":\"R\",\n",
    "    \"AUU\":\"I\", \"AUC\":\"I\", \"AUA\":\"I\", \"AUG\":\"M\",\n",
    "    \"ACU\":\"T\", \"ACC\":\"T\", \"ACA\":\"T\", \"ACG\":\"T\",\n",
    "    \"AAU\":\"N\", \"AAC\":\"N\", \"AAA\":\"K\", \"AAG\":\"K\",\n",
    "    \"AGU\":\"S\", \"AGC\":\"S\", \"AGA\":\"R\", \"AGG\":\"R\",\n",
    "    \"GUU\":\"V\", \"GUC\":\"V\", \"GUA\":\"V\", \"GUG\":\"V\",\n",
    "    \"GCU\":\"A\", \"GCC\":\"A\", \"GCA\":\"A\", \"GCG\":\"A\",\n",
    "    \"GAU\":\"D\", \"GAC\":\"D\", \"GAA\":\"E\", \"GAG\":\"E\",\n",
    "    \"GGU\":\"G\", \"GGC\":\"G\", \"GGA\":\"G\", \"GGG\":\"G\",}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "...and before we go on, let's try looking up a few individual codons:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'P'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_hashGeneticCode[\"CCC\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'G'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_hashGeneticCode[\"GGG\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "c_hashGeneticCode[\"UAA\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that's weird - why did that happen?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Modify me; the reason that UAA has a special value in our genetic code is:\n",
      "???\n"
     ]
    }
   ],
   "source": [
    "print( \"Modify me; the reason that UAA has a special value in our genetic code is:\" )\n",
    "print( \"???\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It gets tedious to translate codons one at a time in a long RNA, though:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "P\n",
      "T\n",
      "G\n"
     ]
    }
   ],
   "source": [
    "strRNA = \"CCCACCGGC\"\n",
    "print( c_hashGeneticCode[\"CCC\"] )\n",
    "print( c_hashGeneticCode[\"ACC\"] )\n",
    "print( c_hashGeneticCode[\"GGC\"] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can certainly have Python simplify it for us, though:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CCC\n",
      "P\n",
      "ACC\n",
      "T\n",
      "GGC\n",
      "G\n"
     ]
    }
   ],
   "source": [
    "strRNA = \"CCCACCGGC\"\n",
    "strCodon = strRNA[:3]\n",
    "print( strCodon )\n",
    "print( c_hashGeneticCode[strCodon] )\n",
    "strCodon = strRNA[3:6]\n",
    "print( strCodon )\n",
    "print( c_hashGeneticCode[strCodon] )\n",
    "strCodon = strRNA[6:9]\n",
    "print( strCodon )\n",
    "print( c_hashGeneticCode[strCodon] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can further simplify this Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CCC: P\n",
      "ACC: T\n",
      "GGC: G\n"
     ]
    }
   ],
   "source": [
    "strRNA = \"CCCACCGGC\"\n",
    "strCodon = strRNA[:3]\n",
    "print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )\n",
    "strCodon = strRNA[3:6]\n",
    "print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )\n",
    "strCodon = strRNA[6:9]\n",
    "print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And further simplify it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CCC: P\n",
      "ACC: T\n",
      "GGC: G\n"
     ]
    }
   ],
   "source": [
    "strRNA = \"CCCACCGGC\"\n",
    "strCodon = strRNA[0:( 0 + 3 )]\n",
    "print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )\n",
    "strCodon = strRNA[3:( 3 + 3 )]\n",
    "print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )\n",
    "strCodon = strRNA[6:( 6 + 3 )]\n",
    "print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And further simplify it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CCC: P\n",
      "ACC: T\n",
      "GGC: G\n"
     ]
    }
   ],
   "source": [
    "strRNA = \"CCCACCGGC\"\n",
    "for i in (0, 3, 6):\n",
    "    strCodon = strRNA[i:( i + 3 )]\n",
    "    print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And further simplify it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CCC: P\n",
      "ACC: T\n",
      "GGC: G\n"
     ]
    }
   ],
   "source": [
    "strRNA = \"CCCACCGGC\"\n",
    "for i in range( 0, len( strRNA ), 3 ):\n",
    "    strCodon = strRNA[i:( i + 3 )]\n",
    "    print( \"%s: %s\" % (strCodon, c_hashGeneticCode[strCodon]) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally make it into a function that we can reuse for any RNA string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def translate_rna( strRNA ):\n",
    "    c_iCodon = 3\n",
    "    \n",
    "    strAAs = \"\"\n",
    "    for i in range( 0, len( strRNA ), c_iCodon ):\n",
    "        strCodon = strRNA[i:( i + c_iCodon )]\n",
    "        strAA = c_hashGeneticCode.get( strCodon )\n",
    "        if not strAA:\n",
    "            break\n",
    "        strAAs += strAA\n",
    "    return strAAs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whew, that was a lot of work!  First, take a look at the function above, and convince yourself that it works.  Then run these Python cells and let me convince you a bit more:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'PTG'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "translate_rna( \"CCCACCGGC\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GCGUUUCCGAACCAGGUUCAC\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'AFPNQVH'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strRNA = generate_rna( 21 )\n",
    "print( strRNA )\n",
    "translate_rna( strRNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens when you translate a sequence that's not of length divisible by three?  ***Modify (and run) the following to find out!***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ACGCCGCUAGUAGCCACCAGA\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'TPLVATR'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strRNA = generate_rna( 21 )\n",
    "print( strRNA )\n",
    "translate_rna( strRNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens when you translate a sequence that contains a stop codon?  ***Modify (and run) the following to find out!***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'PQDR'"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "translate_rna( \"CCGCAAGAUCGU\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens when you translate a sequence that contains a non-ribonucleotide letter?  ***Modify (and run) the following to find out!***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'PQDR'"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "translate_rna( \"CCGCAAGAUCGU\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if you scroll up and eyeball the `translate_rna` function again, can you convince yourself that you understand why it behaves this way?  Note that you can modify the function definition, re-run the cell, and then re-run the cells below using that function to experiment with changes.  Jupyter is cool like that - just don't forget to make it work again afterward!\n",
    "\n",
    "Is this \"robust\" behavior good or bad?  What if you're writing a program that needs to handle a huge database that might be full of glitchy data?  What if that means your program silently fails to report mistakes in its input data?"
   ]
  },
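  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the tradeoff concrete, here is a sketch of a stricter variant that complains instead of quietly stopping.  It assumes you have already run the cell above that defines `c_hashGeneticCode`; the name `translate_rna_strict` and the choice to raise a `ValueError` are illustrative assumptions, not the \"right\" answer.  Stop codons still end translation normally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def translate_rna_strict( strRNA ):\n",
    "    c_iCodon = 3\n",
    "    \n",
    "    strAAs = \"\"\n",
    "    for i in range( 0, len( strRNA ), c_iCodon ):\n",
    "        strCodon = strRNA[i:( i + c_iCodon )]\n",
    "        # Unknown letters or an incomplete final codon are treated as errors\n",
    "        if strCodon not in c_hashGeneticCode:\n",
    "            raise ValueError( \"Cannot translate codon: %s\" % strCodon )\n",
    "        strAA = c_hashGeneticCode[strCodon]\n",
    "        # Stop codons are stored as None and end translation normally\n",
    "        if strAA is None:\n",
    "            break\n",
    "        strAAs += strAA\n",
    "    return strAAs\n",
    "\n",
    "translate_rna_strict( \"CCGCAAGAUCGU\" )"
   ]
  },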
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can investigate another common type of DNA sequence manipulation that might look familiar from the problem set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "c_hashComplements = {\"A\":\"T\", \"C\":\"G\", \"G\":\"C\", \"T\":\"A\"}\n",
    "\n",
    "def reverse_complement( strDNA ):\n",
    "    \n",
    "    strRet = \"\"\n",
    "    for s in reversed( strDNA ):\n",
    "        strRet += c_hashComplements.get( s, s )\n",
    "    return strRet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'AACCGGTT'"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reverse_complement( \"AACCGGTT\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AATAGTGGCAAGCACCCAGTTATT\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'AATAACTGGGTGCTTGCCACTATT'"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA = generate_dna( 24 )\n",
    "print( strDNA )\n",
    "reverse_complement( strDNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Challenge mode: can you write a new function that reverse-complements DNA without using the `reversed` function?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_reverse_complement( strDNA ):\n",
    "    \n",
    "    return strDNA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Modify `my_reverse_complement` to make the following true!***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TATTTCCTGTCCCTCGTTGAATCT\n",
      "AGATTCAACGAGGGACAGGAAATA\n",
      "TATTTCCTGTCCCTCGTTGAATCT\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA = generate_dna( 24 )\n",
    "print( strDNA )\n",
    "strRC = reverse_complement( strDNA )\n",
    "print( strRC )\n",
    "strMRC = my_reverse_complement( strDNA )\n",
    "print( strMRC )\n",
    "strRC == strMRC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Searching biological sequences\n",
    "\n",
    "You learned about very scalable homology search methods like BLAST, which rapidly query sequence databases using biologically-informed heuristics.  Often, when manipulating smaller sequences locally using Python, you don't need to do anything fancy.  In fact, Python strings have a search function built in that will easily match subsequences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA = \"CGCCGCGCTATAGCCCGCTATAGCCCGC\"\n",
    "strDNA.find( \"TATA\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To understand this output, consider the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'TATAGCCCGCTATAGCCCGC'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA[8:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or more simply:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'TATA'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA[8:( 8 + 4 )]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find` method returns `-1` when a query is not matched:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"ATAT\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can mimic this behavior using a simple example \"find\" function:"
   ]
  },
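  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_find( strQuery, strTarget ):\n",
    "    \n",
    "    # Slide a window the length of the query along the target, one position at a time\n",
    "    for i in range( len( strTarget ) ):\n",
    "        # Compare the window starting at position i against the query\n",
    "        if strTarget[i:( i + len( strQuery ) )] == strQuery:\n",
    "            return i\n",
    "    # No window matched, so report failure the same way `find` does\n",
    "    return -1"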
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_find( \"TATA\", strDNA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_find( \"ATAT\", strDNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Inspect, modify, and otherwise play with `my_find` until you're comfortable with how it works.***  Adding `print` statements to see how it ticks is always a great way to tinker with Python code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_find( strQuery, strTarget ):\n",
    "    \n",
    "    print( \"strQuery is of length: %s\" % len( strQuery ) )\n",
    "    print( \"strTarget is of length: %s\" % len( strTarget ) )\n",
    "    for i in range( len( strTarget ) ):\n",
    "        print( i )\n",
    "        strCur = strTarget[i:( i + len( strQuery ) )]\n",
    "        print( strCur )\n",
    "        if strCur == strQuery:\n",
    "            return i\n",
    "    return -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "strQuery is of length: 4\n",
      "strTarget is of length: 28\n",
      "0\n",
      "CGCC\n",
      "1\n",
      "GCCG\n",
      "2\n",
      "CCGC\n",
      "3\n",
      "CGCG\n",
      "4\n",
      "GCGC\n",
      "5\n",
      "CGCT\n",
      "6\n",
      "GCTA\n",
      "7\n",
      "CTAT\n",
      "8\n",
      "TATA\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_find( \"TATA\", strDNA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "strQuery is of length: 4\n",
      "strTarget is of length: 28\n",
      "0\n",
      "CGCC\n",
      "1\n",
      "GCCG\n",
      "2\n",
      "CCGC\n",
      "3\n",
      "CGCG\n",
      "4\n",
      "GCGC\n",
      "5\n",
      "CGCT\n",
      "6\n",
      "GCTA\n",
      "7\n",
      "CTAT\n",
      "8\n",
      "TATA\n",
      "9\n",
      "ATAG\n",
      "10\n",
      "TAGC\n",
      "11\n",
      "AGCC\n",
      "12\n",
      "GCCC\n",
      "13\n",
      "CCCG\n",
      "14\n",
      "CCGC\n",
      "15\n",
      "CGCT\n",
      "16\n",
      "GCTA\n",
      "17\n",
      "CTAT\n",
      "18\n",
      "TATA\n",
      "19\n",
      "ATAG\n",
      "20\n",
      "TAGC\n",
      "21\n",
      "AGCC\n",
      "22\n",
      "GCCC\n",
      "23\n",
      "CCCG\n",
      "24\n",
      "CCGC\n",
      "25\n",
      "CGC\n",
      "26\n",
      "GC\n",
      "27\n",
      "C\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_find( \"ATAT\", strDNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `find` only locates the first occurrence of a query substring.  However, it takes an optional second argument that indicates the search index from which to start (remembering that the first index is zero):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CGCCGCGCTATAGCCCGCTATAGCCCGC'"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\", 0 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\", 5 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\", 8 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "18"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\", 9 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "18"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\", 15 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strDNA.find( \"TATA\", 19 )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take advantage of this behavior to more closely mimic the behavior of BLAST, finding **all** occurrences of a substring:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def find_all( strQuery, strTarget ):\n",
    "    \n",
    "    aiRet = []\n",
    "    iStart = 0\n",
    "    while iStart < len( strTarget ):\n",
    "        # Find the next match at or after iStart\n",
    "        iCur = strTarget.find( strQuery, iStart )\n",
    "        if iCur < 0:\n",
    "            break\n",
    "        aiRet.append( iCur )\n",
    "        # Resume one position past this match, so overlapping matches are found too\n",
    "        iStart = iCur + 1\n",
    "    return aiRet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[8, 18]"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "find_all( \"TATA\", strDNA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "find_all( \"ATAT\", strDNA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 3, 5, 15, 25]"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "find_all( \"CG\", strDNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***What happens if you query an empty string?  Why?***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "find_all( \"make me empty\", strDNA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***How many steps (approximately) does it take to search for a query of length `M` in a target of length `N`?***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "It takes about: ???\n"
     ]
    }
   ],
   "source": [
    "print( \"It takes about: ???\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Why is this too slow to use for a problem like sequence database search?***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Biological sequence file formats\n",
    "\n",
    "We don't know quite enough Python to read and write biological sequence files like `FASTA`s yet (and it's a pain to do reproducibly from Jupyter anyhow), but we can still use Python strings to represent them.  Let's first convert a biological sequence to a single `FASTA` entry, given an identifier to use in the header:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def sequence_to_fasta_almost( strID, strSequence ):\n",
    "    \n",
    "    return ( \">\" + strID + \"\\n\" + strSequence )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">This is some DNA!\n",
      "AAGAATCCGTCTGCGCGTACCTCC\n"
     ]
    }
   ],
   "source": [
    "strDNA = generate_dna( 24 )\n",
    "print( sequence_to_fasta_almost( \"This is some DNA!\", strDNA ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works well enough to fit the general `FASTA` format, but typically, `FASTA` files wrap long sequences at some fixed width (often 70 characters).  That means that while this is technically correct:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">This is a lot of DNA!\n",
      "GGATCGTCGCTCAGTTGTTTTGTAGCGATGCGGATAGAAAGCCTCGATACTGTGGCTCCGTACCGCTTCATCAATACGGACGTCGTCTTA\n"
     ]
    }
   ],
   "source": [
    "strDNA = generate_dna( 90 )\n",
    "print( sequence_to_fasta_almost( \"This is a lot of DNA!\", strDNA ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "it's aesthetically displeasing.  Let's use a trick not unlike `find_all` to correctly wrap our `FASTA` output at a given width:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def sequence_to_fasta( strID, strSequence, iWidth = 70 ):\n",
    "    \n",
    "    strRet = \">\" + strID\n",
    "    # Step through the sequence iWidth characters at a time, one output line per chunk\n",
    "    for i in range( 0, len( strSequence ), iWidth ):\n",
    "        strRet += \"\\n\" + strSequence[i:( i + iWidth )]\n",
    "    return strRet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">This is a lot of well-formatted DNA!\n",
      "CTGGCCACCTTTTACTGTACTAACCAAGTTGCACAGGTCTTCTGCCGGACGTTGGGTCGTCCCCGATTAA\n",
      "GGGCATCAAATTACTGCACACCGGAATGCACCGCTAAGAGAATCTATCGGGCCAAGATTTCAAGATGCGA\n",
      "GCGCCCTCTAGTCGACAAAAGGGTACTGTGAGAGGCCACTGGTACTCTGTGCGATGGCCCATCTCCCGGT\n",
      "GGCGTTATCCTAATTCGGTGCGCTTGAGACTCCAGGAGCCCCAGCTCGCGGCTTAAGCGATCCTGGTAAG\n",
      "AATCTGCCTTAGGATGTGAC\n"
     ]
    }
   ],
   "source": [
    "strDNA = generate_dna( 300 )\n",
    "print( sequence_to_fasta( \"This is a lot of well-formatted DNA!\", strDNA ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And don't forget that because we wrote useful, reusable functions way back when, we can do the same exact thing for non-DNA biological sequences as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">This is a lot of well-formatted RNA!\n",
      "GUACGGGUGGGUUGCCCAUAAUUGGUUAGCGUGCUCACGGAGAGAUUUCGAAACACCUGCGACAAAAUGG\n",
      "GUGAUCGCCUUGCGCGGUGAUACUGUCAAAACCAGCCUCCACUCGCUAAGCUCUAUGUCCUAGGAAUAGA\n",
      "CAAUCAGGUAACUGAAAGCAGGUGAUUCCCCUCCCAUUGUUUGAGUACACAAUGGCGAUCUGACGACAAC\n",
      "AGAAUUCAGCCACGGUAAUGCAAAAAAAUCAACAAACGCAGUGGAUGGAUAUGCUAAAUCAUAUGUCCAA\n",
      "UCUGUCUCGGCAACUAUCCC\n"
     ]
    }
   ],
   "source": [
    "strRNA = generate_rna( 300 )\n",
    "print( sequence_to_fasta( \"This is a lot of well-formatted RNA!\", strRNA ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your turn - ***modify the Python cell below to do exactly the same thing for a peptide sequence (using appropriate variable names and identifying text).***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, we can chain together multiple sequences into a valid `FASTA` file as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">First sequence\n",
      "GGCCCTTGAGACATCCTGCGAGATACGTCCGAGGATATACAATCCTGGCTTTGAATCGCAGTGTAATGAA\n",
      "GTCGATTCTGTTGTCTTTGTGAAGGTATGTATGGTAGTCCTTGCCCGCTCTCA\n",
      ">Second sequence\n",
      "TTGTTTGGGTTTGGAGGTGTTCCGAGGCTGGGGAAATCGTTAGCCGATTGACGAATGGGAACTCCCTTCT\n",
      "ACGCGAGGGCTCGGCAGTGTTGCAACCTGTGCCTCCGTCGGATTTTCTTACAGCGGCACGAGCCGAAGGA\n",
      "AACTTTGTAAAACCACCAATCCCATTGTTTGGACAAATCACGCAGCTGGTACGAAGATTCTCAGGCAGGT\n",
      "GTGCACTGACGGACAGTACCCACATGGCGAAGCTGGGTGCCGCTTATTAGTAAGCCGCAGTTTGGCTCAC\n",
      "AAGCGCGGGTCTGCGCCAAATAAGACTCTTCGCAACTGAGA\n",
      ">Third sequence\n",
      "GGAAATGTTCTAAGGGAAGCCTCATTGAGGATCAATTGAGTAATGAATTTTGGACTTTCGTTCGAACGCT\n",
      "CCTCATCCAGCGAGTAGCATTCTCAAAGGTAGATGGTGCATCTTCAGTAGCTTCGCTTCAGTATTGGGTT\n",
      "TACAGGCTATAGGACCTATAGAAAGAACTAGGGCCCCGTTAATCGACTACATAGCCTTCGCTTGTGATAG\n",
      "GCCTGCAGCCCT\n"
     ]
    }
   ],
   "source": [
    "strOne = sequence_to_fasta( \"First sequence\", generate_dna( 123 ) )\n",
    "strTwo = sequence_to_fasta( \"Second sequence\", generate_dna( 321 ) )\n",
    "strThree = sequence_to_fasta( \"Third sequence\", generate_dna( 222 ) )\n",
    "print( strOne + \"\\n\" + strTwo + \"\\n\" + strThree )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's actually a useful Python shorthand for that last part:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">First sequence\n",
      "GGCCCTTGAGACATCCTGCGAGATACGTCCGAGGATATACAATCCTGGCTTTGAATCGCAGTGTAATGAA\n",
      "GTCGATTCTGTTGTCTTTGTGAAGGTATGTATGGTAGTCCTTGCCCGCTCTCA\n",
      ">Second sequence\n",
      "TTGTTTGGGTTTGGAGGTGTTCCGAGGCTGGGGAAATCGTTAGCCGATTGACGAATGGGAACTCCCTTCT\n",
      "ACGCGAGGGCTCGGCAGTGTTGCAACCTGTGCCTCCGTCGGATTTTCTTACAGCGGCACGAGCCGAAGGA\n",
      "AACTTTGTAAAACCACCAATCCCATTGTTTGGACAAATCACGCAGCTGGTACGAAGATTCTCAGGCAGGT\n",
      "GTGCACTGACGGACAGTACCCACATGGCGAAGCTGGGTGCCGCTTATTAGTAAGCCGCAGTTTGGCTCAC\n",
      "AAGCGCGGGTCTGCGCCAAATAAGACTCTTCGCAACTGAGA\n",
      ">Third sequence\n",
      "GGAAATGTTCTAAGGGAAGCCTCATTGAGGATCAATTGAGTAATGAATTTTGGACTTTCGTTCGAACGCT\n",
      "CCTCATCCAGCGAGTAGCATTCTCAAAGGTAGATGGTGCATCTTCAGTAGCTTCGCTTCAGTATTGGGTT\n",
      "TACAGGCTATAGGACCTATAGAAAGAACTAGGGCCCCGTTAATCGACTACATAGCCTTCGCTTGTGATAG\n",
      "GCCTGCAGCCCT\n"
     ]
    }
   ],
   "source": [
    "print( \"\\n\".join( [strOne, strTwo, strThree] ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can write a function that creates a `FASTA` out of any number of sequences (with identifiers):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_fasta( astrIDs, astrSequences ):\n",
    "    \n",
    "    astrFASTAs = []\n",
    "    for i in range( len( astrIDs ) ):\n",
    "        astrFASTAs.append( sequence_to_fasta( astrIDs[i], astrSequences[i] ) )\n",
    "    return \"\\n\".join( astrFASTAs )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">One\n",
      "TGATCGGTGCACCTTCTAGGAATCGACTGTCGGTTCACTCAACTAGCCAAGTGGATGCGTCAACGTTCGA\n",
      "GGCTTGTTAGAGCGTGAGTTTTGGAATATGGTAGTGTCATA\n",
      ">Two\n",
      "TGTTAGAATGCAGAGTGCTAGGCAAGACGTCTTCATCCCTAAGTCGGCTTACATGTAGGCAACACGGTGC\n",
      "TTTGAGACGACTATTGTGCCAGAACTTACATAGCAATATGGGGATTCGCGAATGTGACGAACCCAGCAAA\n",
      "CCGGTTACGTTGAGACCTCACCTTGCCAGGCAGCACTGCCACTGTCTCTCCCAGCTAGAGTGGTATTACT\n",
      "GCAGTGGACGAT\n",
      ">Three\n",
      "GACTTTGGATTCATAATCGTCCGAAGGGTCATTGTCTTCGGGACGGCCTCTTCCGACTGCAGGGTTTTGC\n",
      "ACAACTGGTGCTTGAACATTACGGCCCTCCGGGCATCATGGTGTCACAGAGAACTGGGATGTCCACTGGG\n",
      "GTAAGAGCCCGTGCACACTATCTAAGTCCATATAATTAAAACAGCAAGCACGCGTAACTTATAATACATA\n",
      "GACCCGCCTCCCAGCTGGTTCTCTTACGGTAATTCGCATCCCTAGTGCACGGTGTTCGCGCTGTGGGATA\n",
      "TAGTAAAAGCACATGAAAGGAGAAAACCGCTAACGATCGCATTCTAAAATGAG\n"
     ]
    }
   ],
   "source": [
    "astrIDs = [\"One\", \"Two\", \"Three\"]\n",
    "astrSequences = [generate_dna( 111 ), generate_dna( 222 ), generate_dna( 333 )]\n",
    "print( create_fasta( astrIDs, astrSequences ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That gets pretty complicated!  As above, ***add some `print` statements to the demo version of `my_create_fasta` below, then test it out to make sure you understand how it works.***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_create_fasta( astrIDs, astrSequences ):\n",
    "    \n",
    "    astrFASTAs = []\n",
    "    for i in range( len( astrIDs ) ):\n",
    "        astrFASTAs.append( sequence_to_fasta( astrIDs[i], astrSequences[i] ) )\n",
    "    return \"\\n\".join( astrFASTAs )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">One\n",
      "ACGT\n",
      ">Two\n",
      "TGCA\n"
     ]
    }
   ],
   "source": [
    "print( my_create_fasta( [\"One\", \"Two\"], [\"ACGT\", \"TGCA\"] ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, what if we're given data in `FASTA` format and need to read it back into individual sequences?  Well, if the `FASTA` contains a single line of sequence, it's easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">ID\n",
      "ACGT\n"
     ]
    }
   ],
   "source": [
    "strFASTA = sequence_to_fasta( \"ID\", \"ACGT\" )\n",
    "print( strFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['>ID', 'ACGT']\n"
     ]
    }
   ],
   "source": [
    "astrLinesInMyFASTA = strFASTA.split( \"\\n\" )\n",
    "print( astrLinesInMyFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID: ID\n",
      "Sequence: ACGT\n"
     ]
    }
   ],
   "source": [
    "strIDWithHeader = astrLinesInMyFASTA[0]\n",
    "strID = strIDWithHeader[1:]\n",
    "strSequence = astrLinesInMyFASTA[1]\n",
    "print( \"ID: %s\" % strID )\n",
    "print( \"Sequence: %s\" % strSequence )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works no matter how long our sequence is - as long as it's all on one line.  But we worked hard to construct correctly formatted `FASTA`s with line-wrapped sequences!  If we need to read them back in again, we should be able to:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">ID\n",
      "AC\n",
      "GT\n"
     ]
    }
   ],
   "source": [
    "strTwoLineFASTA = sequence_to_fasta( \"ID\", \"ACGT\", 2 )\n",
    "print( strTwoLineFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['>ID', 'AC', 'GT']\n"
     ]
    }
   ],
   "source": [
    "astrLinesInMyFASTA = strTwoLineFASTA.split( \"\\n\" )\n",
    "print( astrLinesInMyFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID: ID\n",
      "Sequence: ACGT\n"
     ]
    }
   ],
   "source": [
    "strIDWithHeader = astrLinesInMyFASTA[0]\n",
    "strID = strIDWithHeader[1:]\n",
    "strSequence = astrLinesInMyFASTA[1] + astrLinesInMyFASTA[2]\n",
    "print( \"ID: %s\" % strID )\n",
    "print( \"Sequence: %s\" % strSequence )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see if we can generalize this process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def fasta_to_sequence( strFASTA ):\n",
    "    \n",
    "    astrLines = strFASTA.split( \"\\n\" )\n",
    "    # The first line is the header; drop its leading \">\" to get the ID\n",
    "    strID = astrLines[0][1:]\n",
    "    # Every remaining line is sequence; join them back into one string\n",
    "    strSeq = \"\"\n",
    "    for strCur in astrLines[1:]:\n",
    "        strSeq += strCur\n",
    "    return [strID, strSeq]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ID', 'ACGT']"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequence( strTwoLineFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ID', 'ACGT']"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequence( strFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Long!',\n",
       " 'TATTGCGTCGGACCTTACAACTGTTAACAGAGTTAATACCCCCGCAACTTGACGCTTGGTTGGTACGTCAGGCAGCGGAAGCCGGCCTTAGAGACGAGGATCGCTGCTGTGTAGTAGACTTAGAGCGTACTACGAATATCCACGTCGTCCTAGGGCCTGAGATGTATCGGTACAAGGAAGGGTTCTGGCCTTCTTCTATTGGGGTCGCCTACTTGTAGCCTTTACAGCTTTTATTACGGTAGGCGTCTCTTAGTAGCTGCTCTGTATTACAAAGTCAGGCATTGCGTAGGGACCTCATCACAGACACGTATGCGCACCATTGGCACAAGGGGA']"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequence( \"\"\">Long!\n",
    "TATTGCGTCGGACCTTACAACTGTTAACAGAGTTAATACCCCCGCAACTTGACGCTTGGTTGGTACGTCA\n",
    "GGCAGCGGAAGCCGGCCTTAGAGACGAGGATCGCTGCTGTGTAGTAGACTTAGAGCGTACTACGAATATC\n",
    "CACGTCGTCCTAGGGCCTGAGATGTATCGGTACAAGGAAGGGTTCTGGCCTTCTTCTATTGGGGTCGCCT\n",
    "ACTTGTAGCCTTTACAGCTTTTATTACGGTAGGCGTCTCTTAGTAGCTGCTCTGTATTACAAAGTCAGGC\n",
    "ATTGCGTAGGGACCTCATCACAGACACGTATGCGCACCATTGGCACAAGGGGA\"\"\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's pretty complicated stuff!  Again, the version of `fasta_to_sequence` above is very compact and uses a lot of sophisticated Python.  As a challenge, ***rewrite `my_fasta_to_sequence` below using only Python that you recognize, so that each of the subsequent three tests returns `True`.***  Your version will likely be longer than `fasta_to_sequence`, and that's ok.  As always, use as many `print` statements as you'd like to test the function while you write it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_fasta_to_sequence( strFASTA ):\n",
    "    \n",
    "    return [\"\", \"\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequence( strFASTA ) == my_fasta_to_sequence( strFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequence( strTwoLineFASTA ) == my_fasta_to_sequence( strTwoLineFASTA )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strLong = \"\"\">Long!\n",
    "TATTGCGTCGGACCTTACAACTGTTAACAGAGTTAATACCCCCGCAACTTGACGCTTGGTTGGTACGTCA\n",
    "GGCAGCGGAAGCCGGCCTTAGAGACGAGGATCGCTGCTGTGTAGTAGACTTAGAGCGTACTACGAATATC\n",
    "CACGTCGTCCTAGGGCCTGAGATGTATCGGTACAAGGAAGGGTTCTGGCCTTCTTCTATTGGGGTCGCCT\n",
    "ACTTGTAGCCTTTACAGCTTTTATTACGGTAGGCGTCTCTTAGTAGCTGCTCTGTATTACAAAGTCAGGC\n",
    "ATTGCGTAGGGACCTCATCACAGACACGTATGCGCACCATTGGCACAAGGGGA\"\"\"\n",
    "fasta_to_sequence( strLong ) == my_fasta_to_sequence( strLong )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We only need one more trick to complete our ability to manipulate biological sequence files: what about `FASTA`s with multiple sequences in them?  That is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">One\n",
      "AATGCGGACGTATTTACTCGACTGCGGATACCGAATACGGATCAGCTACTTCCTCGTGACCTGAAGCAAA\n",
      "ATGCGTGAAATCACTGAGTCTGCAATGGATCTTTGATGGAA\n",
      ">Two\n",
      "CCTCGAATGGGAAACTAGAGTGGACTTTGAAAAATCATCTTATAGAATAGCTTGGCATGAATGTGGGAGG\n",
      "GTAGCATCAGCTAGCGCGATAGGTAATGGGACTGGCAAGCTGAGGCTTCTGGGGGGTTCTGGTTAGGAAA\n",
      "CTGGTTAACCAATCCCATAATGACGAGAGGGCACCAGGGTTAAAAGCCTACTGGTCATGCGGATATGAGG\n",
      "TTCCCTGCTAGC\n",
      ">Three\n",
      "TAATGTGTGAGTCAGTACGGTATTATGGCGTTACGGGGTATCGTGCAGCGAGTGCCGGATTTTCGTCCAC\n",
      "GTGGTCCTACCATATCCGCACAACCCAATGATCCGCGAGTCAGGTTTGTACCGTCGTAGCACCGACCGCT\n",
      "AGGTGAAGGGCAATATATGTCGAGGCCCGCGTCTTTCATACAACGACGTACCGATCGAGAAGGACGTAAC\n",
      "TGATCTGGGCTTCGCGGCCAGCTCGTCGTATAAACGTTAATGCTTGCTCAGCACCGCATCGCCGCGGTGC\n",
      "TTCTATAGAAAATAACGAGTGAGAACTGCCCCTGTCGTATAATAGCATTCTTG\n"
     ]
    }
   ],
   "source": [
    "astrIDs = [\"One\", \"Two\", \"Three\"]\n",
    "astrSequences = [generate_dna( 111 ), generate_dna( 222 ), generate_dna( 333 )]\n",
    "strFASTA = create_fasta( astrIDs, astrSequences )\n",
    "print( strFASTA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used `split` above to separate the lines within a single `FASTA` entry.  The same trick, splitting on `>` instead of on newlines, separates a `FASTA` containing multiple sequences into its individual entries.  Consider:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def fasta_to_sequences( strFASTA ):\n",
    "    \n",
    "    hashRet = {}\n",
    "    # Splitting on \">\" leaves an empty string before the first header, so skip it\n",
    "    astrEntries = strFASTA.split( \">\" )[1:]\n",
    "    for strEntry in astrEntries:\n",
    "        # Restore the \">\" that split removed so each entry is a valid single-sequence FASTA\n",
    "        strID, strSeq = fasta_to_sequence( \">\" + strEntry )\n",
    "        hashRet[strID] = strSeq\n",
    "    return hashRet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'First': 'ACGT', 'Second': 'TGCA'}"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequences( \"\"\">First\n",
    "ACGT\n",
    ">Second\n",
    "TGCA\"\"\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'One': 'AATGCGGACGTATTTACTCGACTGCGGATACCGAATACGGATCAGCTACTTCCTCGTGACCTGAAGCAAAATGCGTGAAATCACTGAGTCTGCAATGGATCTTTGATGGAA',\n",
       " 'Three': 'TAATGTGTGAGTCAGTACGGTATTATGGCGTTACGGGGTATCGTGCAGCGAGTGCCGGATTTTCGTCCACGTGGTCCTACCATATCCGCACAACCCAATGATCCGCGAGTCAGGTTTGTACCGTCGTAGCACCGACCGCTAGGTGAAGGGCAATATATGTCGAGGCCCGCGTCTTTCATACAACGACGTACCGATCGAGAAGGACGTAACTGATCTGGGCTTCGCGGCCAGCTCGTCGTATAAACGTTAATGCTTGCTCAGCACCGCATCGCCGCGGTGCTTCTATAGAAAATAACGAGTGAGAACTGCCCCTGTCGTATAATAGCATTCTTG',\n",
       " 'Two': 'CCTCGAATGGGAAACTAGAGTGGACTTTGAAAAATCATCTTATAGAATAGCTTGGCATGAATGTGGGAGGGTAGCATCAGCTAGCGCGATAGGTAATGGGACTGGCAAGCTGAGGCTTCTGGGGGGTTCTGGTTAGGAAACTGGTTAACCAATCCCATAATGACGAGAGGGCACCAGGGTTAAAAGCCTACTGGTCATGCGGATATGAGGTTCCCTGCTAGC'}"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fasta_to_sequences( strFASTA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used *lots* of new concepts there, so first ***spend a while adding `print` statements to `my_fasta_to_sequences` below and running the associated examples to understand how it works.***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_fasta_to_sequences( strFASTA ):\n",
    "    \n",
    "    hashRet = {}\n",
    "    astrFASTAs = strFASTA.split( \">\" )[1:]\n",
    "    for strFASTA in astrFASTAs:\n",
    "        strID, strSeq = fasta_to_sequence( \">\" + strFASTA )\n",
    "        hashRet[strID] = strSeq\n",
    "    return hashRet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'First': 'ACGT', 'Second': 'TGCA'}"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_fasta_to_sequences( \"\"\">First\n",
    "ACGT\n",
    ">Second\n",
    "TGCA\"\"\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'One': 'AATGCGGACGTATTTACTCGACTGCGGATACCGAATACGGATCAGCTACTTCCTCGTGACCTGAAGCAAAATGCGTGAAATCACTGAGTCTGCAATGGATCTTTGATGGAA',\n",
       " 'Three': 'TAATGTGTGAGTCAGTACGGTATTATGGCGTTACGGGGTATCGTGCAGCGAGTGCCGGATTTTCGTCCACGTGGTCCTACCATATCCGCACAACCCAATGATCCGCGAGTCAGGTTTGTACCGTCGTAGCACCGACCGCTAGGTGAAGGGCAATATATGTCGAGGCCCGCGTCTTTCATACAACGACGTACCGATCGAGAAGGACGTAACTGATCTGGGCTTCGCGGCCAGCTCGTCGTATAAACGTTAATGCTTGCTCAGCACCGCATCGCCGCGGTGCTTCTATAGAAAATAACGAGTGAGAACTGCCCCTGTCGTATAATAGCATTCTTG',\n",
       " 'Two': 'CCTCGAATGGGAAACTAGAGTGGACTTTGAAAAATCATCTTATAGAATAGCTTGGCATGAATGTGGGAGGGTAGCATCAGCTAGCGCGATAGGTAATGGGACTGGCAAGCTGAGGCTTCTGGGGGGTTCTGGTTAGGAAACTGGTTAACCAATCCCATAATGACGAGAGGGCACCAGGGTTAAAAGCCTACTGGTCATGCGGATATGAGGTTCCCTGCTAGC'}"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_fasta_to_sequences( strFASTA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, ***create your own `my_fasta_to_sequences2` using simpler Python that you've seen before.***  Again, it will probably be longer, but that's fine - as long as it makes the examples below all `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_fasta_to_sequences2( strFASTA ):\n",
    "    \n",
    "    return {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strOne = \"\"\">First\n",
    "ACGT\n",
    ">Second\n",
    "TGCA\"\"\"\n",
    "my_fasta_to_sequences2( strOne ) == fasta_to_sequences( strOne )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_fasta_to_sequences2( strFASTA ) == fasta_to_sequences( strFASTA )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Last one, and a challenge: ***finish the `search_fasta` function below so that it returns a list of the IDs of all sequences in a `FASTA` that contain the given query, in the order they appear.***  You can reuse any of the functions defined above, but as usual, make sure the examples return `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def search_fasta( strQuery, strFASTA ):\n",
    "    \n",
    "    return []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "strFASTA = \"\"\">One\n",
    "ACGT\n",
    ">Two\n",
    "TGCA\n",
    ">Three\n",
    "CCGG\"\"\"\n",
    "search_fasta( \"CG\", strFASTA ) == [\"One\", \"Three\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "search_fasta( \"GGCC\", strFASTA ) == []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "search_fasta( \"TGC\", strFASTA ) == [\"Two\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
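
The `split`-on-`>` idea used in the last cells of this notebook can be sketched on its own, without relying on the notebook's `fasta_to_sequence` helper. This is only an illustration under stated assumptions: the names `parse_fasta` and `ids_containing` are hypothetical and do not appear in the notebook, every record is assumed to start with a `>` header line, and the result order relies on dictionaries preserving insertion order (guaranteed from Python 3.7 on).

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its sequence (hypothetical helper)."""
    records = {}
    # Whatever precedes the first ">" is not a record, so drop element 0.
    for chunk in text.split(">")[1:]:
        lines = chunk.split("\n")
        # The first line of a chunk is the header; the rest are sequence lines.
        record_id = lines[0].strip()
        records[record_id] = "".join(line.strip() for line in lines[1:])
    return records


def ids_containing(query, text):
    """List the IDs, in file order, whose sequence contains query (hypothetical helper)."""
    return [record_id for record_id, seq in parse_fasta(text).items() if query in seq]


example = """>One
ACGT
>Two
TGCA
>Three
CCGG"""
print(parse_fasta(example))
print(ids_containing("CG", example))
```

Joining the sequence lines after the header means a sequence wrapped over several lines comes back as one string, which is what makes the substring test with `in` meaningful.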

## W03: Modules and IO.ipynb

      
## W08: Multiple sample hypothesis tests.ipynb

      