Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Last active December 11, 2015 20:39
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gregcaporaso/4657175 to your computer and use it in GitHub Desktop.
Save gregcaporaso/4657175 to your computer and use it in GitHub Desktop.
IPython Notebooks to support Greg Caporaso's Spring 2013 Bioinformatics I class at Northern Arizona University: http://caporaso.us/teaching/courses/bio299_spring_2013/index.html This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/

IPython Notebooks to support Greg Caporaso's Spring 2013 Bioinformatics I class at Northern Arizona University.

This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Feel free to use or modify these notebooks, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.

Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "Lecture8"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import some functions from the python standard library, numpy, and ipython. This will set up our computing environment."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# import some functions from python's random module - these will\n",
"# be used in the modeling process\n",
"from random import choice, random\n",
"# import some math functions from the numpy library (note that this\n",
"# isn't part of the python standard library)\n",
"from numpy import log10, average\n",
"# import FileLink from IPython - this will be used for downloading \n",
"# our results\n",
"from IPython.display import FileLink"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's just experiment with the ``random`` function to learn about what it does."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"random()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a ``count_differences`` function, which counts the number of differences between a pair of sequences of the same length."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def count_differences(sequence1,sequence2):\n",
" \"\"\"Count the number of differences between two sequences of the same length\n",
" \"\"\"\n",
" # confirm that the two sequences are the same length and throw\n",
" # an error if they aren't\n",
" assert len(sequence1) == len(sequence2), \"Sequences differ in length\"\n",
" # initiate a counter for the number of differences\n",
" result = 0\n",
" # iterate over the two sequences and count the number of\n",
" # positions which are not identical\n",
" for base1,base2 in zip(sequence1,sequence2):\n",
" if base1 != base2:\n",
" # this is a commonly used shortcut for incrementing a count\n",
" # and is equivalent to the following statement\n",
" # result = result + 1\n",
" result += 1\n",
" return result\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define an ``evolve_seq`` function. This takes a parent sequence, and returns two child sequences to simulate a possible speciation event, where each input sequence (representing a parent's genetic sequence) has diverged into two child sequences (representing the new species, and incurring point mutations along the way). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def evolve_seq(sequence,\n",
" substitution_probability=0.01,\n",
" mutation_choices=['A','C','G','T']):\n",
" \"\"\"Return two child sequences simulating point mutations\n",
"\n",
" An error occurs with probability substitution_probability\n",
" independently at each position of each child sequence.\n",
" \"\"\"\n",
" # Generate two lists for storing the resulting sequences\n",
" r1 = []\n",
" r2 = []\n",
"\n",
" for base in sequence:\n",
" if random() < substitution_probability:\n",
" # a point mutation will occur at this position\n",
" # what's wrong with the following statement?\n",
" r1.append(choice(mutation_choices))\n",
" else:\n",
" # no point mutation at this position\n",
" r1.append(base)\n",
" if random() < substitution_probability:\n",
" # a point mutation will occur at this position\n",
" # what's wrong with the following statement?\n",
" r2.append(choice(mutation_choices))\n",
" else:\n",
" # no point mutation at this position\n",
" r2.append(base)\n",
" # convert the lists to strings and return them\n",
" return ''.join(r1), ''.join(r2)\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a ``main`` function, which represents the main execution block of the code. This will be used to simulate evolution of sequences over a user-defined number of generations."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def main(root_sequence,generations,verbose=False):\n",
" # initial some values and perform some basic error checking\n",
" assert generations > 0, \"Must simulate one or more generations.\"\n",
" # can you simplify the following test?\n",
" for base in root_sequence:\n",
" assert base != 'A' or base != 'C' or base != 'G' or base != 'T',\\\n",
" \"Invalid base identified: %s. Only A, C, G, or T are allowed.\" % base\n",
" # initialize a list of the previous generation's sequences - this gets used\n",
" # in the for loop below. since we'll start with the first generation of\n",
" # children, root_sequence is the previous generation's sequence\n",
" previous_generation_sequences = [root_sequence]\n",
"\n",
" # iterate over each generation (why do we add one to generations?)\n",
" for i in range(1,generations+1):\n",
" # print the generation number and the current number of sequences\n",
" print \"Generation: %d (Number of child sequences: %d)\" % (i,2**i)\n",
" # create a list to store the current generation of sequences\n",
" current_generation_sequences = []\n",
" # create a list to store the differences in each current generation\n",
" # sequence from the root sequence\n",
" difference_counts = []\n",
" # iterate over the sequences of the previous generation\n",
" for parent_sequence in previous_generation_sequences:\n",
" # evolve two child sequences\n",
" r1, r2 = evolve_seq(parent_sequence)\n",
" # count the differences in the first sequence (from root_sequence)\n",
" r1_diffs = count_differences(root_sequence,r1)\n",
" # append the count of differences to the list of difference counts\n",
" difference_counts.append(r1_diffs)\n",
" # add the new sequence to the list of this generation's sequences\n",
" current_generation_sequences.append(r1)\n",
" # count the differences in the second sequence (from root_sequence)\n",
" r2_diffs = count_differences(root_sequence,r2)\n",
" # append the count of differences to the list of difference counts\n",
" difference_counts.append(r2_diffs)\n",
" # add the new sequence to the list of this generation's sequences\n",
" current_generation_sequences.append(r2)\n",
" if verbose:\n",
" # if the caller specified verbose output, print the actual sequences\n",
" print \" %s %d\" % (r1, r1_diffs)\n",
" print \" %s %d\" % (r2, r2_diffs)\n",
" # print summary information: the average number of differences in the current\n",
" # generation from root_sequence\n",
" print \"Mean differences %1.3f\\n\" % average(difference_counts)\n",
" # current_generation_sequences becomes the next generation's\n",
" # previous_generation_sequences\n",
" previous_generation_sequences = current_generation_sequences\n",
"\n",
" # upon completion of all generations, return the last generation's sequences\n",
" return previous_generation_sequences"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialize some variables to run the analysis. \n",
"\n",
"**IMPORTANT:** Update the value of the ``sid`` variable, or you will not be able to download your output."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sequence_length = 100\n",
"num_generations = 5\n",
"sid = \"your-student-id\""
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"assert num_generations <= 12, \"Reduce num_generations if running on the class cluster. Max is 12.\"\n",
"assert sequence_length <= 100, \"Reduce sequence length if running on the class cluster. Max is 100.\"\n",
"assert sid != \"your-student-id\", \"Enter your student ID in the previous cell.\"\n",
"sequence_fp = \"%s_sequences.fasta\" % sid\n",
"\n",
"# generate a random sequence composed of ['A', 'C', 'G', 'T']\n",
"# of length sequence_length\n",
"root_sequence = []\n",
"for i in range(int(sequence_length)):\n",
" root_sequence.append(choice(list('ACGT')))\n",
"root_sequence = ''.join(root_sequence)\n",
"\n",
"# run the simulation and get the final generation of sequences\n",
"sequences = main(root_sequence,int(num_generations))\n",
"\n",
"# write the final generation of sequences to a fasta file\n",
"output_f = open(sequence_fp,'w')\n",
"for i,s in enumerate(sequences):\n",
" output_f.write('>seq%d\\n%s\\n' % (i,s))\n",
"output_f.close()\n",
"\n",
"# Print a link to download the sequence file\n",
"FileLink(sequence_fp)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment