Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Last active December 11, 2015 20:39
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gregcaporaso/4657175 to your computer and use it in GitHub Desktop.
Save gregcaporaso/4657175 to your computer and use it in GitHub Desktop.
IPython Notebooks to support Greg Caporaso's Spring 2013 Bioinformatics I class at Northern Arizona University: http://caporaso.us/teaching/courses/bio299_spring_2013/index.html This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/

IPython Notebooks to support Greg Caporaso's Spring 2013 Bioinformatics I class at Northern Arizona University.

This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Feel free to use or modify these notebooks, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.

Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "Homework_7"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Homework 7\n",
"\n",
"The cell below contains the functions necessary to parse a vcf file. The only function that needs to be called in order to return the parsed data file is `parse_data_file`. The input to this function is a vcf file handle (i.e., an open `vcf` file), and the output to parse_data_file will be a list of lists, for example: \n",
"\n",
"`[[('10', '89623323', 'G', 'A', [('HG00096',[0,0]), ('HG00097', [0,0]), ('HG00099', [0,0]), ('HG00100', [0,0]), ('HG00101', [0,0])])], \n",
" [('10', '89626323', 'T', 'A', [('HG00096',[0,0]), ('HG00097', [0,0]), ('HG00099', [0,0]), ('HG00100', [0,0]), ('HG00101', [0,0])])]]`\n",
"\n",
"The first entry in the outer list will be details for the first snp in the vcf file, the second entry in the outer list will be details on the second snp in the file, and so on. \n",
"\n",
"In each inner list, the entries are as follows:\n",
" \n",
" 1. the chromosome number\n",
" 2. the location on the chromosome\n",
" 3. the base found at this position in the first allele, \n",
" 4. the base found at this position in the second allele,\n",
" 5. a list of lists, where each netry contains an individual id, followed by their genotype. \n",
" \n",
"I strongly recommend assigning the output of this function to a variable name such as `snps`. You can then access the items in the ``snps``. For example, ``snp[0][4][0][0] == 'HG00096'`.\n",
"\n",
"**IMPORTANT: Do not edit anything in the next cell. You will call the ``parse_data_file`` function from within your code.**"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from random import choice, sample, randint, randrange\n",
"from __future__ import division\n",
"\n",
"def process_data_entry_line (line, ids):\n",
" '''for a line in the vcf file this will return a list of the information \n",
" in that line, see the return statement'''\n",
" fields = line.split('\\t') \n",
" chr_n = int(fields[0])\n",
" pos_n = int(fields[1])\n",
" ref = fields[3]\n",
" alt = fields[4]\n",
" indiv_ids = zip(ids, indiv_snp_variation(line))\n",
" return (chr_n, pos_n, ref, alt, indiv_ids)\n",
"\n",
"\n",
"#Creates a list of allelic variation for each SNP. The list itself contains one entry of two numbers for each individual. \n",
"def indiv_snp_variation(line):\n",
" '''This function takes a single line from a vcf file and returns a list of the \n",
" genotypes of the given snp of each individual''' \n",
" l = line.split('\\t')\n",
" list = l[9:]\n",
" variation_list = [] \n",
" for i in list: \n",
" variation_list.append(map(int,i.split(':')[0].split('|')))\n",
" return variation_list\n",
"\n",
"def parse_data_file(f):\n",
" '''This function takes a filepath for a vcf file and returns a list of each snp with the \n",
" genotype of each individual in a sublist'''\n",
" result = []\n",
" for line in f:\n",
" if line.startswith('##'):\n",
" pass\n",
" elif line.startswith('#CHROM'): \n",
" ids = line.rstrip().split('\\t')[9:]\n",
" else:\n",
" result.append(process_data_entry_line(line, ids))\n",
" return result\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example usage of parse_data_file\n",
"-\n",
"`snps = parse_data_file(open('f.vcf','U'))`\n",
"\n",
"`snps`\n",
"\n",
"`[('10', '89626323', 'T', 'A', [('HG00096',[0,0]), ('HG00097', [0,0]), ('HG00099', [0,0]), ('HG00100', [0,0]), ('HG00101', [0,0])])]`\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Question 4\n",
"You will define a function that will then be used to answer questions 5 and 6. Do not change the inputs and output of this function - add your code in place of `#do some work`. \n",
"\n",
"**Hint:** you may want to write a for loop that calls get_gene_stats for a list of all of the gene regions you are interested in so you only have call it once."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_gene_stats(vcf, gene_coordinates):\n",
" '''This function should take a vcf file and a list of gene coordinates i.e. [(89653782, 89653798), ...]. The function should return \n",
" the total number of snps in the given region, and the snp frequency (snps/base).'''\n",
" #do some work\n",
" return total_snps, snp_freqs\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example input and output for question 4.\n",
"\n",
"After you've written your function, you should test with the following to confirm that it is working as expected.\n",
"\n",
"`print get_gene_stats(open('f.vcf','U'), (89624636, 89655450))`\n",
"\n",
"(352, 0.0114)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Question 7\n",
"\n",
"**IMPORTANT: Do not edit anything in the next cell. You will call these functions from within your code.**\n",
"\n",
"To answer this question you will need to call `fraction_better_or_equivalent_hwe` on the SNP of interest. See below for example usage."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\n",
"def create_new_population(pop_size, genotypes):\n",
" '''create a list of all alleles. The list is the size of the population'''\n",
" result = []\n",
" for i in range(pop_size): \n",
" genotype = choice(genotypes)\n",
" result.append(genotype)\n",
" return result\n",
" \n",
"def create_new_generation(population, pop_size, genotypes):\n",
" '''create a new generation. This will be the same size as the original \n",
" population'''\n",
" males = sample(population, (int(len(population)/2)))\n",
" result = []\n",
" for male in males:\n",
" population.remove(male)\n",
" while len(result) < pop_size:\n",
" new_gene = (choice(choice(males)), choice(choice(population)))\n",
" if new_gene == (genotypes[1][1], genotypes[1][0]):\n",
" result.append((genotypes[1]))\n",
" else:\n",
" result.append((new_gene))\n",
" return result\n",
"\n",
"def count_generations(pop_size, alleles):\n",
" '''count the number of generations until one allele is fixed'''\n",
" genotypes = [(alleles[0], alleles[0]), (alleles[0], alleles[1]), (alleles[1], alleles[1])]\n",
" population = create_new_population(pop_size, genotypes)\n",
" c = 0\n",
" while genotypes[1] in population: \n",
" pop1 = create_new_generation(population, pop_size, genotypes)\n",
" population = pop1\n",
"# print pop1\n",
" c +=1\n",
" return c\n",
" \n",
"def count_allele_frequency(line):\n",
" '''This counts the number of homozygotes and heterozygotes for a given snp'''\n",
" hetero = 0\n",
" ref = 0\n",
" alt = 0\n",
" for i in line[4]:\n",
" if i[1] == [0,1] or i[1] == [1, 0]:\n",
" hetero += 1\n",
" elif i[1] == [0, 0]:\n",
" ref += 1\n",
" elif i[1] == [1, 1]:\n",
" alt += 1\n",
" return ref, alt, hetero\n",
"\n",
"def calculate_hwe(pop_size, alleles, generations):\n",
" '''This takes a given population size and the number of generations and returns \n",
" the predicted number of heterozygotes divided by the actual number present in the last \n",
" generation. A population in complete equilibrium would return a value of 1'''\n",
" genotypes = [(alleles[0], alleles[0]), (alleles[0], alleles[1]), (alleles[1], alleles[1])]\n",
" population = create_new_population(pop_size, genotypes)\n",
" for i in range(generations):\n",
" new_pop = create_new_generation(population, pop_size, genotypes)\n",
" population = new_pop\n",
" p = (2*population.count(genotypes[0]) + (population.count(genotypes[1])))/(2*len(population))\n",
" q = p - 1\n",
" pq2 = ((population.count(genotypes[1])))/(len(population))\n",
" hetero = (2*p*q)\n",
" return pq2/hetero\n",
"\n",
"def fraction_better_or_equivalent_hwe(vcf, alleles, replications, generations, snp=None):\n",
" '''This file compares the ratio of predicted to actual heterozygotes in a vcf file for\n",
" a given snp. It then compares this value to an artificial distribution based on a \n",
" number of iterations of the calculate_hwe function. The return is a p-value for a \n",
" null hypothesis that the snp in question is in hardy weinberg equilibrium.'''\n",
" snps = parse_data_file(vcf)\n",
" pop_len = len(snps[0][4])\n",
" result = []\n",
" total_snps = []\n",
"# If a snp is identified the function will return only the p-value for that snp \n",
"# otherwise it will return a p-value for every snp\n",
" for i in snps:\n",
" if snp == None:\n",
" total_snps.append(i)\n",
" elif snp == int(i[1]):\n",
" total_snps.append(i)\n",
" else: \n",
" pass\n",
" for i in total_snps:\n",
" snp = i[1]\n",
" ref, alt, hetero = count_allele_frequency(i)\n",
" p = ((2*ref) + hetero)/(2*pop_len)\n",
" q = 1 - p\n",
" pq2 = hetero/pop_len\n",
" predicted_2pq = (2*p*q)\n",
" if predicted_2pq == 0:\n",
" pass\n",
" else: \n",
" ratio_actual = pq2/predicted_2pq\n",
" count = 0\n",
" for i in range(replications):\n",
" if (1 - abs(ratio_actual)) >= (1-abs(calculate_hwe(pop_len, alleles, generations))):\n",
" count += 1 \n",
" if count == 0:\n",
" result.append((snp, 1))\n",
" else:\n",
" result.append((snp, round((1 -(count/replications)), 10)))\n",
" else:\n",
" pass\n",
" return result"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Example usage for question 7\n",
"-\n",
"`print fraction_better_or_equivalent_hwe(f.vcf, ('A', 'B'), 1000, 3, 89627667)`\n",
"\n",
"[(89627667, 0.534)]\n",
"\n",
"To answer question 7, 1000 replications and 2 generations should be used"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hints: \n",
"-\n",
"p + q = 1 \n",
"\n",
"P2 + 2pq + q = 1\n",
"\n",
"Think about how the estimated fraction is being created. Then think about what that is being compared to. Look at the code to help understand this.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Credits\n",
"\n",
"This assignment was created by John Chase (`chasejohnh@gmail.com`) with guidance from Greg Caporaso for [BIO/CS 299 Introduction to Bioinformatics at Northern Arizona University](http://www.caporaso.us/teaching). As with all of our course materials, this is licensed under Creative Commons Attribution Only. "
]
}
],
"metadata": {}
}
]
}
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment