dwinter/make_artifact_bams

## make_artifact_bams
{
 "metadata": {
  "name": "",
  "signature": "sha256:98f1acba9791082035860796ccf52294670282ba8deb7745219206b5182498d8"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Creating test data for mutation detection"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Set up some tools for creating random DNA sequences with weighted sampling"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from collections import deque\n",
      "import random \n",
      "\n",
      "def create_alias(weights):\n",
      "    \"\"\" Create the probability and alias tables for \n",
      "    \"\"\"\n",
      "    n = len(weights)\n",
      "    s = sum(weights)\n",
      "    scaled_probs = [float(w)/s*n for w in weights]\n",
      "    Alias = [0 for _ in range(n)]\n",
      "    Prob = [0 for _ in range(n) ]\n",
      "    large = deque()\n",
      "    small = deque() \n",
      "\n",
      "    for i,p in enumerate(scaled_probs):\n",
      "        if p < 1:\n",
      "            small.append(i)\n",
      "        else:\n",
      "            large.append(i)\n",
      "\n",
      "    while (large and small):\n",
      "        l = small.pop()\n",
      "        g = large.pop()\n",
      "        Prob[l] = scaled_probs[l]\n",
      "        Alias[l] = g\n",
      "        scaled_probs[g] = (scaled_probs[g] + scaled_probs[l]) - 1\n",
      "        if scaled_probs[g] < 1:\n",
      "            small.append(g)\n",
      "        else:\n",
      "            large.append(g)\n",
      "\n",
      "    while large:\n",
      "        g = large.pop()\n",
      "        Prob[g] = 1\n",
      "\n",
      "    while small:\n",
      "        l = small.pop()\n",
      "        Prob[l] = 1\n",
      "\n",
      "    return( (Prob, Alias) )\n",
      "\n",
      "def weighted_choice(population, weights):\n",
      "    Prob, Alias = create_alias(weights)\n",
      "    n = len(population) -1\n",
      "    i = random.randint(0,n)\n",
      "    if random.random() < Prob[i]:\n",
      "        return population[i]\n",
      "    return population[ Alias[i] ]\n",
      "\n",
      "def weighted_sample(population, weights, size):\n",
      "    Prob, Alias = create_alias(weights)\n",
      "    n = len(population) - 1\n",
      "    result = []\n",
      "    for rep in range(size):\n",
      "        i = random.randint(0,n)\n",
      "        if random.random() < Prob[i]:\n",
      "            result.append(population[i])\n",
      "        else:\n",
      "            result.append(population[ Alias[i] ])\n",
      "    return(result)\n",
      "    \n",
      "def sample_from_alias(population, prob_table, alias_table):\n",
      "    \"\"\" \"\"\"\n",
      "    n = len(population) - 1\n",
      "    i = random.randint(0,n)\n",
      "    if random.random() < prob_table[i]:\n",
      "        return population[i]\n",
      "    return population[ alias_table[i] ]\n",
      "\n",
      "    \n",
      "def mutate(b, prob, alias):\n",
      "    \"\"\" \"\"\"\n",
      "    m = sample_from_alias(\"ACGT\", prob, alias)\n",
      "    while m == b:\n",
      "        m = sample_from_alias(\"ACGT\", prob, alias)\n",
      "    return(m)\n",
      "\n",
      "        \n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 347
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##The Plan...\n",
      "\n",
      "... is to create test datasets from which simulated reads can be generated. Specifically we want\n",
      "\n",
      "+ A 'good'mutation that's 'easy' to call\n",
      "+ A 'good' mutation that's harder to call because there are indels in the reads [not yet done]\n",
      "+ A sequence at which only the ancestor is non-refernce (to deal with heterozgosity in ancestor)\n",
      "+ A 'double mutant'\n",
      "+ A case were all apparent mutant bases have quality scores below the threshold (default 13)\n",
      "+ A case with strand bias (for the post-procesor to detect)\n",
      "\n",
      "To do this we will make random 'ancestral sequences' for 7 samples, an Ancestor (`A0`) abd descenants (`D1` ...`D6`). To make testing easier we'll only ever introduce mutatnts at the 600th base of the sequence, so to prevent overwriting a mutations we need new descenant line for every case above (and two for the double mutant). \n",
      "\n",
      "The code below requires the execultables `bowtie2` and `art_illumina`. In this notebook they are called by relative paths (i.e. they are not on `$PATH`)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Make some 'ancestral' DNA sequences\n",
      "\n",
      "The goal here is 6 random \"chromosomes\" (on which we'll add the different cases above) for each of 8 samples \n",
      "\n",
      "I first tried this with lists(chromosomes) of lists (samples) of strings (sequences), but found keeping track of things too hard. Instead, here's a a functions to create \"mutable strings\" and methods to indtroduce mutations, and a dictoinary to keep track of them"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import copy\n",
      "\n",
      "class MutableSequence(object):\n",
      "    def __init__(self, start_string):\n",
      "        self.seq = start_string\n",
      "        \n",
      "    def __repr__(self):\n",
      "        if len(self.seq) > 5:\n",
      "            return \"MutableSequence({0}...{1})\".format(self.seq[1:3], self.seq[4:5])\n",
      "        return \"MutableSequence({0})\".format(self.seq)\n",
      "        \n",
      "    def point_mutation(self, index, alias_table, prob_table):\n",
      "        \"\"\" Introduce a point mutation. \n",
      "        \n",
      "        Note, this code requires precomputed alias table for weighted sampling\n",
      "        \"\"\"\n",
      "        listed = list(self.seq)\n",
      "        m = listed[index]\n",
      "        while m == listed[index]:\n",
      "            m = sample_from_alias(\"ACGT\", alias_table, prob_table)\n",
      "        listed[index] = m\n",
      "        self.seq = \"\".join(listed)\n",
      "    \n",
      "    def remove_bases(self, excluded_inidices):\n",
      "        \"\"\"\n",
      "        Remove bases from sequences\n",
      "        \n",
      "        excluded_indices is a tuple list with zero-based half open indices to remove\n",
      "        i.e to remove the first based from a MutableSequence `S`:\n",
      "        \n",
      "        S.remove_bases([(0,1)])\n",
      "        \"\"\"\n",
      "        listed = list(self.seq)\n",
      "        slice_start = 0\n",
      "        new_seq = []\n",
      "        for st,end in excluded_inidices:\n",
      "            new_seq += listed[slice_start:st]\n",
      "            slice_start = end \n",
      "        new_seq += listed[end:]\n",
      "        self.seq = \"\".join(new_seq)\n",
      "        \n",
      "def make_seq_dict(specimen_names, alias_table, prob_table, seq_len = 1200):\n",
      "    aseq = MutableSequence(\"\".join([sample_from_alias(\"ACGT\",alias_table,prob_table) for _ in range(seq_len)]))\n",
      "    return {n:copy.deepcopy(aseq) for n in specimen_names}\n",
      "\n",
      "#start out by creating alias tables for somehting like the Tetrahymena AT bias\n",
      "a,p = create_alias( [0.36, 0.14, 0.14, 0.36])\n",
      "\n",
      "gene_names = [\"good_mutation\", \"indelish\", \"anc_het\", \"double_mutation\", \"qual_score\", \"std_bias\"]\n",
      "line_names = [\"test_ref\", \"A0\", \"D1\", \"D2\", \"D3\", \"D4\", \"D5\", \"D6\"]\n",
      "\n",
      "sequence_set = [make_seq_dict(line_names,a,p) for g in gene_names]\n",
      "sequence_set[0:2]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 348,
       "text": [
        "[{'A0': MutableSequence(TT...G),\n",
        "  'D1': MutableSequence(TT...G),\n",
        "  'D2': MutableSequence(TT...G),\n",
        "  'D3': MutableSequence(TT...G),\n",
        "  'D4': MutableSequence(TT...G),\n",
        "  'D5': MutableSequence(TT...G),\n",
        "  'D6': MutableSequence(TT...G),\n",
        "  'test_ref': MutableSequence(TT...G)},\n",
        " {'A0': MutableSequence(GA...T),\n",
        "  'D1': MutableSequence(GA...T),\n",
        "  'D2': MutableSequence(GA...T),\n",
        "  'D3': MutableSequence(GA...T),\n",
        "  'D4': MutableSequence(GA...T),\n",
        "  'D5': MutableSequence(GA...T),\n",
        "  'D6': MutableSequence(GA...T),\n",
        "  'test_ref': MutableSequence(GA...T)}]"
       ]
      }
     ],
     "prompt_number": 348
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Introduce mutations to some sequences"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#genes names are\n",
      "#[\"good_mutation\", \"indelish\", \"anc_het\", \"double_mutation\", \"qual_score\", \"std_bias\"]\n",
      "\n",
      "#a good mutation\n",
      "sequence_set[0][\"D1\"].point_mutation(600, a,p)\n",
      "# a good mutation around indels\n",
      "sequence_set[1][\"D2\"].point_mutation(600, a,p)\n",
      "sequence_set[1][\"D2\"].remove_bases([(590,596), (605,611)])\n",
      "#apparent mutation in the ancestor\n",
      "sequence_set[2][\"A0\"].point_mutation(600, a,p)\n",
      "#two apparent mutations in difference samples\n",
      "sequence_set[3][\"D3\"].point_mutation(600, a,p)\n",
      "sequence_set[3][\"D4\"].point_mutation(600, a,p)\n",
      "# a good mutation that we'll edit the quality score for\n",
      "sequence_set[4][\"D5\"].point_mutation(600, a,p)\n",
      "# a good mutation that we'll edit the starnd bias\n",
      "sequence_set[5][\"D6\"].point_mutation(600, a,p)\n",
      "\n",
      "#test it all worked\n",
      "print sequence_set[0][\"D1\"].seq[600] !=  sequence_set[0][\"test_ref\"].seq[600]\n",
      "print len(sequence_set[1][\"D2\"].seq) <  len(sequence_set[1][\"test_ref\"].seq)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "True\n"
       ]
      }
     ],
     "prompt_number": 349
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Write out the mutated sequences to fasta files, create reads with `ART`\n",
      "\n",
      "writes fasta sequenes for each line into `/seqs` dir; `.fq` and `.aln` fliles  into `/fq`. Not the small insert/read size, which was chosen to deal with the short squences"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import subprocess\n",
      "\n",
      "results = []\n",
      "for ln in line_names:\n",
      "    fname = \"seqs/{0}.fasta\".format(ln)\n",
      "    handle = open(fname, \"w\")\n",
      "    for i,gn in enumerate(gene_names):\n",
      "        handle.write(\">{0}\\n{1}\\n\".format(gn, sequence_set[i][ln].seq))\n",
      "    handle.close()\n",
      "    if ln != \"test_ref\": \n",
      "        cmd = \"bin/art_illumina -i {0} -p -l 50 -f 30 -m 100 -s 10 -o fq/{1}_r\".format(fname, ln)\n",
      "        results.append( subprocess.call(cmd, shell=True) )\n",
      "\n",
      "#test it worked, and all retuned exit code 0\n",
      "print len(results) == len(line_names)-1\n",
      "print sum(results) == 0\n",
      "        "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "True\n"
       ]
      }
     ],
     "prompt_number": 350
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for i,gn in enumerate(gene_names):\n",
      "    handle = open(\"MSA/{0}.fasta\".format(gn), \"w\")\n",
      "    for k,v in sorted(sequence_set[i].items()):\n",
      "        handle.write(\">{0}\\n{1}\\n\".format(k,v.seq))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 351
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Now we have reads... let's monkey around with a couple of them. \n",
      "\n",
      "**NOTE** These functions move the files created above around, to re-run any cell in this section we should also re-run the sequencing simulations\n",
      "\n",
      "###first, introudce stand bias. \n",
      "using the `.aln` file to identify reads overlapping the mutation site, filter those IDs from the FASTQ:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio import SeqIO\n",
      "\n",
      "def extract_read_id(read):\n",
      "    return read.description.split(\"\\t\")[1].split(\"/\")[0]\n",
      "\n",
      "def overlaps(read):\n",
      "    _, read_info, pos, strand = read.description.split(\"\\t\")\n",
      "    if \"std_bias\" in read_info: #it's in the right gene\n",
      "        if abs(601 - int(pos)) < 75: #it overlaps\n",
      "            return (strand == \"+\") #is it the stand we care about?\n",
      "    return False\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 352
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "to_drop = [extract_read_id(r) for r in SeqIO.parse(\"fq/D6_r1.aln\", \"fasta\") if overlaps(r)] + \\\n",
      "          [extract_read_id(r) for r in SeqIO.parse(\"fq/D6_r2.aln\", \"fasta\") if overlaps(r)]\n",
      "\n",
      "print \"removed \",len(to_drop)\n",
      "#we will overwrite teh fastq files, so preserve the origanls\n",
      "#BE AWARE - you can't re-run this cell without re-running theorignal read-simulatoins\n",
      "!mv fq/D6_r1.fq fq/D6_r1_old.fq\n",
      "!mv fq/D6_r2.fq fq/D6_r2_old.fq\n",
      "\n",
      "\n",
      "\n",
      "print SeqIO.write([r for r in SeqIO.parse(\"fq/D6_r1_old.fq\", \"fastq\") if r.description.split(\"/\")[0] not in to_drop], \n",
      "            \"fq/D6_r1.fq\", \"fastq\")\n",
      "print SeqIO.write([r for r in SeqIO.parse(\"fq/D6_r2_old.fq\", \"fastq\") if r.description.split(\"/\")[0] not in to_drop], \n",
      "            \"fq/D6_r2.fq\", \"fastq\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "removed  56\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2104\n",
        "2104"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 353
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Now, degrade the quality of one would-be mutation"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def target_muants(read):\n",
      "    _, read_info, pos, strand = read.description.split(\"\\t\")\n",
      "    rid = read_info.split(\"/\")[0]\n",
      "    if \"qual_score\" in read_info: #it's in the right gene\n",
      "        pos = int(pos)\n",
      "        if strand == \"-\":\n",
      "            if 0 < (pos - 600) < 50:\n",
      "                return rid, pos - 600\n",
      "        else:\n",
      "            if 0 < (601 - pos) <50:\n",
      "                return rid, 600 -pos\n",
      "    return rid, None"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 354
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "to_downgrade1 = {read:mutant for read,mutant in [target_muants(r) for r in SeqIO.parse(\"fq/D5_r1.aln\", \"fasta\")]}\n",
      "to_downgrade2 = {read:mutant for read,mutant in [target_muants(r) for r in SeqIO.parse(\"fq/D5_r2.aln\", \"fasta\")]}\n",
      "\n",
      "#we will overwrite teh fastq files, so preserve the origanls\n",
      "#BE AWARE - you can't re-run this cell without re-running theorignal read-simulatoins\n",
      "!mv fq/D5_r1.fq fq/D5_r1_old.fq\n",
      "!mv fq/D5_r2.fq fq/D5_r2_old.fq\n",
      "\n",
      "\n",
      "def downgrade_read(rec, dg_dict):\n",
      "    \"\"\"\" \"\"\"\n",
      "    base = dg_dict[ rec.id.split(\"/\")[0] ]\n",
      "    if base:\n",
      "        rec.letter_annotations[\"phred_quality\"][base] = 2\n",
      "    return rec\n",
      "        \n",
      "print SeqIO.write([downgrade_read(r, to_downgrade1) for r in SeqIO.parse(\"fq/D5_r1_old.fq\", \"fastq\")],\n",
      "                  \"fq/D5_r1.fq\", \"fastq\")\n",
      "print SeqIO.write([downgrade_read(r, to_downgrade2) for r in SeqIO.parse(\"fq/D5_r2_old.fq\", \"fastq\")],\n",
      "                  \"fq/D5_r2.fq\", \"fastq\")\n",
      "\n",
      "#how many did we downgrade\n",
      "print \"downgraded\", len([v for v in to_downgrade1.values() if v]) + len([v for v in to_downgrade2.values() if v]) "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2160\n",
        "2160"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "downgraded 28\n"
       ]
      }
     ],
     "prompt_number": 355
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ls fq/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "A0_r1.aln  D1_r2.aln  D3_r1.aln  D4_r2.aln     D5_r2.fq      D6_r2.fq\r\n",
        "A0_r1.fq   D1_r2.fq   D3_r1.fq\t D4_r2.fq      D5_r2_old.fq  D6_r2_old.fq\r\n",
        "A0_r2.aln  D2_r1.aln  D3_r2.aln  D5_r1.aln     D6_r1.aln     test_ref_r1.aln\r\n",
        "A0_r2.fq   D2_r1.fq   D3_r2.fq\t D5_r1.fq      D6_r1.fq      test_ref_r1.fq\r\n",
        "D1_r1.aln  D2_r2.aln  D4_r1.aln  D5_r1_old.fq  D6_r1_old.fq  test_ref_r2.aln\r\n",
        "D1_r1.fq   D2_r2.fq   D4_r1.fq\t D5_r2.aln     D6_r2.aln     test_ref_r2.fq\r\n"
       ]
      }
     ],
     "prompt_number": 356
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Align into BAMS\n",
      "\n",
      "bowtie's index builder is kind of annoying, as far as I can tell it needs to be run from the directory in which it lives. To make the index you need to go into the dir and\n",
      "\n",
      "```[path to bowtie]./bowtie2-build-l [path to cwd]seqs/test_ref.fasta [path to cwd]/seqs/test_ref```"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! /home/david/Downloads/bowtie2-2.2.0/bowtie2-build-l ~/analysis/tetma/test_data/seqs/test_ref.fasta ~/analysis/tetma/test_data/seqs/test_ref &> ref.log"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Settings:\r\n",
        "  Output files: \"/home/david/analysis/tetma/test_data/seqs/test_ref.*.bt2l\"\r\n",
        "  Line rate: 7 (line is 128 bytes)\r\n",
        "  Lines per side: 1 (side is 128 bytes)\r\n",
        "  Offset rate: 4 (one in 16)\r\n",
        "  FTable chars: 10\r\n",
        "  Strings: unpacked\r\n",
        "  Max bucket size: default\r\n",
        "  Max bucket size, sqrt multiplier: default\r\n",
        "  Max bucket size, len divisor: 4\r\n",
        "  Difference-cover sample period: 1024\r\n",
        "  Endianness: little\r\n",
        "  Actual local endianness: little\r\n",
        "  Sanity checking: disabled\r\n",
        "  Assertions: disabled\r\n",
        "  Random seed: 0\r\n",
        "  Sizeofs: void*:8, int:4, long:8, size_t:8\r\n",
        "Input files DNA, FASTA:\r\n",
        "  /home/david/analysis/tetma/test_data/seqs/test_ref.fasta\r\n",
        "Building a LARGE index\r\n",
        "Reading reference sizes\r\n"
       ]
      }
     ],
     "prompt_number": 357
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for ln in line_names[1:]:\n",
      "    cmd = \"bin/bowtie2-align-l -x seqs/test_ref -1 fq/{0}_r1.fq -2 fq/{0}_r2.fq --rg-id={0} | samtools view -uhS - | samtools sort - {0}\".format(ln)\n",
      "    print subprocess.call(cmd, shell=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0\n",
        "0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 358
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "old_header = \"\"\"@HD\\tVN:1.0\\tSO:coordinate\n",
      "@SQ\\tSN:good_mutation\\tLN:1200\n",
      "@SQ\\tSN:indelish\\tLN:1200\n",
      "@SQ\\tSN:anc_het\\tLN:1200\n",
      "@SQ\\tSN:double_mutation\\tLN:1200\n",
      "@SQ\\tSN:qual_score\\tLN:1200\n",
      "@SQ\\tSN:std_bias\\tLN:1200\n",
      "\"\"\"\n",
      "\n",
      "with open(\"new_head.txt\", \"w\") as out:\n",
      "    out.write(old_header)\n",
      "    for ln in line_names[1:]:\n",
      "        out.write(\"@RG\\tID:{0}\\tSM:{0}\\tPL:Simulation\\n\".format(ln))\n",
      "\n",
      "        \n",
      "bams = \" \".join([ln + \".bam\" for ln in line_names[1:] ])\n",
      "print subprocess.call(\"samtools merge -frh new_head.txt merged.bam \" + bams, shell=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0\n"
       ]
      }
     ],
     "prompt_number": 359
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! samtools sort merged.bam sorted\n",
      "! samtools index sorted.bam"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 360
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Finally, make an ini file for `accumulate` "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dm_params=\"\"\"theta=0.0001\n",
      "nfreqs=0.388 #Ugly work around. Need to write a custom\n",
      "nfreqs=0.112 #boost validator to read these in as a line\n",
      "nfreqs=0.112 #in the config file. This works for now\n",
      "nfreqs=0.388 #\n",
      "mu=1.0e-8\n",
      "seq-error=0.01\n",
      "phi-haploid=0.001\n",
      "phi-diploid=0.001\n",
      "\"\"\"\n",
      "\n",
      "\n",
      "with open(\"test_params.ini\", \"w\") as out:\n",
      "    with open(\"pp_params.ini\", \"w\") as pp_out:\n",
      "        out.write(dm_params)\n",
      "        for ln in line_names[1:]:\n",
      "            out.write(\"sample-name={0}\\n\".format(ln))\n",
      "            pp_out.write(\"sample-name={0}\\n\".format(ln))\n",
      "\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 361
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ./accuMUlate -c test_params.ini -b sorted.bam -r seqs/test_ref.fasta -o test.out &> err\n",
      "! ./pp -c pp_params.ini -b sorted.bam -i test.out \n",
      "! cat  test.out\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 363
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 362
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 362
    }
   ],
   "metadata": {}
  }
 ]
}
	{
	"metadata": {
	"name": "",
	"signature": "sha256:98f1acba9791082035860796ccf52294670282ba8deb7745219206b5182498d8"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#Creating test data for mutation detection"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##Set up some tools for creating random DNA sequences with weighted sampling"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"from collections import deque\n",
	"import random \n",
	"\n",
	"def create_alias(weights):\n",
	" \"\"\" Create the probability and alias tables for \n",
	" \"\"\"\n",
	" n = len(weights)\n",
	" s = sum(weights)\n",
	" scaled_probs = [float(w)/s*n for w in weights]\n",
	" Alias = [0 for _ in range(n)]\n",
	" Prob = [0 for _ in range(n) ]\n",
	" large = deque()\n",
	" small = deque() \n",
	"\n",
	" for i,p in enumerate(scaled_probs):\n",
	" if p < 1:\n",
	" small.append(i)\n",
	" else:\n",
	" large.append(i)\n",
	"\n",
	" while (large and small):\n",
	" l = small.pop()\n",
	" g = large.pop()\n",
	" Prob[l] = scaled_probs[l]\n",
	" Alias[l] = g\n",
	" scaled_probs[g] = (scaled_probs[g] + scaled_probs[l]) - 1\n",
	" if scaled_probs[g] < 1:\n",
	" small.append(g)\n",
	" else:\n",
	" large.append(g)\n",
	"\n",
	" while large:\n",
	" g = large.pop()\n",
	" Prob[g] = 1\n",
	"\n",
	" while small:\n",
	" l = small.pop()\n",
	" Prob[l] = 1\n",
	"\n",
	" return( (Prob, Alias) )\n",
	"\n",
	"def weighted_choice(population, weights):\n",
	" Prob, Alias = create_alias(weights)\n",
	" n = len(population) -1\n",
	" i = random.randint(0,n)\n",
	" if random.random() < Prob[i]:\n",
	" return population[i]\n",
	" return population[ Alias[i] ]\n",
	"\n",
	"def weighted_sample(population, weights, size):\n",
	" Prob, Alias = create_alias(weights)\n",
	" n = len(population) - 1\n",
	" result = []\n",
	" for rep in range(size):\n",
	" i = random.randint(0,n)\n",
	" if random.random() < Prob[i]:\n",
	" result.append(population[i])\n",
	" else:\n",
	" result.append(population[ Alias[i] ])\n",
	" return(result)\n",
	" \n",
	"def sample_from_alias(population, prob_table, alias_table):\n",
	" \"\"\" \"\"\"\n",
	" n = len(population) - 1\n",
	" i = random.randint(0,n)\n",
	" if random.random() < prob_table[i]:\n",
	" return population[i]\n",
	" return population[ alias_table[i] ]\n",
	"\n",
	" \n",
	"def mutate(b, prob, alias):\n",
	" \"\"\" \"\"\"\n",
	" m = sample_from_alias(\"ACGT\", prob, alias)\n",
	" while m == b:\n",
	" m = sample_from_alias(\"ACGT\", prob, alias)\n",
	" return(m)\n",
	"\n",
	" \n",
	" "
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 347
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##The Plan...\n",
	"\n",
	"... is to create test datasets from which simulated reads can be generated. Specifically we want\n",
	"\n",
	"+ A 'good'mutation that's 'easy' to call\n",
	"+ A 'good' mutation that's harder to call because there are indels in the reads [not yet done]\n",
	"+ A sequence at which only the ancestor is non-refernce (to deal with heterozgosity in ancestor)\n",
	"+ A 'double mutant'\n",
	"+ A case were all apparent mutant bases have quality scores below the threshold (default 13)\n",
	"+ A case with strand bias (for the post-procesor to detect)\n",
	"\n",
	"To do this we will make random 'ancestral sequences' for 7 samples, an Ancestor (`A0`) abd descenants (`D1` ...`D6`). To make testing easier we'll only ever introduce mutatnts at the 600th base of the sequence, so to prevent overwriting a mutations we need new descenant line for every case above (and two for the double mutant). \n",
	"\n",
	"The code below requires the execultables `bowtie2` and `art_illumina`. In this notebook they are called by relative paths (i.e. they are not on `$PATH`)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##Make some 'ancestral' DNA sequences\n",
	"\n",
	"The goal here is 6 random \"chromosomes\" (on which we'll add the different cases above) for each of 8 samples \n",
	"\n",
	"I first tried this with lists(chromosomes) of lists (samples) of strings (sequences), but found keeping track of things too hard. Instead, here's a a functions to create \"mutable strings\" and methods to indtroduce mutations, and a dictoinary to keep track of them"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"import copy\n",
	"\n",
	"class MutableSequence(object):\n",
	" def __init__(self, start_string):\n",
	" self.seq = start_string\n",
	" \n",
	" def __repr__(self):\n",
	" if len(self.seq) > 5:\n",
	" return \"MutableSequence({0}...{1})\".format(self.seq[1:3], self.seq[4:5])\n",
	" return \"MutableSequence({0})\".format(self.seq)\n",
	" \n",
	" def point_mutation(self, index, alias_table, prob_table):\n",
	" \"\"\" Introduce a point mutation. \n",
	" \n",
	" Note, this code requires precomputed alias table for weighted sampling\n",
	" \"\"\"\n",
	" listed = list(self.seq)\n",
	" m = listed[index]\n",
	" while m == listed[index]:\n",
	" m = sample_from_alias(\"ACGT\", alias_table, prob_table)\n",
	" listed[index] = m\n",
	" self.seq = \"\".join(listed)\n",
	" \n",
	" def remove_bases(self, excluded_inidices):\n",
	" \"\"\"\n",
	" Remove bases from sequences\n",
	" \n",
	" excluded_indices is a tuple list with zero-based half open indices to remove\n",
	" i.e to remove the first based from a MutableSequence `S`:\n",
	" \n",
	" S.remove_bases([(0,1)])\n",
	" \"\"\"\n",
	" listed = list(self.seq)\n",
	" slice_start = 0\n",
	" new_seq = []\n",
	" for st,end in excluded_inidices:\n",
	" new_seq += listed[slice_start:st]\n",
	" slice_start = end \n",
	" new_seq += listed[end:]\n",
	" self.seq = \"\".join(new_seq)\n",
	" \n",
	"def make_seq_dict(specimen_names, alias_table, prob_table, seq_len = 1200):\n",
	" aseq = MutableSequence(\"\".join([sample_from_alias(\"ACGT\",alias_table,prob_table) for _ in range(seq_len)]))\n",
	" return {n:copy.deepcopy(aseq) for n in specimen_names}\n",
	"\n",
	"#start out by creating alias tables for somehting like the Tetrahymena AT bias\n",
	"a,p = create_alias( [0.36, 0.14, 0.14, 0.36])\n",
	"\n",
	"gene_names = [\"good_mutation\", \"indelish\", \"anc_het\", \"double_mutation\", \"qual_score\", \"std_bias\"]\n",
	"line_names = [\"test_ref\", \"A0\", \"D1\", \"D2\", \"D3\", \"D4\", \"D5\", \"D6\"]\n",
	"\n",
	"sequence_set = [make_seq_dict(line_names,a,p) for g in gene_names]\n",
	"sequence_set[0:2]"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 348,
	"text": [
	"[{'A0': MutableSequence(TT...G),\n",
	" 'D1': MutableSequence(TT...G),\n",
	" 'D2': MutableSequence(TT...G),\n",
	" 'D3': MutableSequence(TT...G),\n",
	" 'D4': MutableSequence(TT...G),\n",
	" 'D5': MutableSequence(TT...G),\n",
	" 'D6': MutableSequence(TT...G),\n",
	" 'test_ref': MutableSequence(TT...G)},\n",
	" {'A0': MutableSequence(GA...T),\n",
	" 'D1': MutableSequence(GA...T),\n",
	" 'D2': MutableSequence(GA...T),\n",
	" 'D3': MutableSequence(GA...T),\n",
	" 'D4': MutableSequence(GA...T),\n",
	" 'D5': MutableSequence(GA...T),\n",
	" 'D6': MutableSequence(GA...T),\n",
	" 'test_ref': MutableSequence(GA...T)}]"
	]
	}
	],
	"prompt_number": 348
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##Introduce mutations to some sequences"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"#genes names are\n",
	"#[\"good_mutation\", \"indelish\", \"anc_het\", \"double_mutation\", \"qual_score\", \"std_bias\"]\n",
	"\n",
	"#a good mutation\n",
	"sequence_set[0][\"D1\"].point_mutation(600, a,p)\n",
	"# a good mutation around indels\n",
	"sequence_set[1][\"D2\"].point_mutation(600, a,p)\n",
	"sequence_set[1][\"D2\"].remove_bases([(590,596), (605,611)])\n",
	"#apparent mutation in the ancestor\n",
	"sequence_set[2][\"A0\"].point_mutation(600, a,p)\n",
	"#two apparent mutations in difference samples\n",
	"sequence_set[3][\"D3\"].point_mutation(600, a,p)\n",
	"sequence_set[3][\"D4\"].point_mutation(600, a,p)\n",
	"# a good mutation that we'll edit the quality score for\n",
	"sequence_set[4][\"D5\"].point_mutation(600, a,p)\n",
	"# a good mutation that we'll edit the starnd bias\n",
	"sequence_set[5][\"D6\"].point_mutation(600, a,p)\n",
	"\n",
	"#test it all worked\n",
	"print sequence_set[0][\"D1\"].seq[600] != sequence_set[0][\"test_ref\"].seq[600]\n",
	"print len(sequence_set[1][\"D2\"].seq) < len(sequence_set[1][\"test_ref\"].seq)\n"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"True\n",
	"True\n"
	]
	}
	],
	"prompt_number": 349
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"###Write out the mutated sequences to fasta files, create reads with `ART`\n",
	"\n",
	"writes fasta sequenes for each line into `/seqs` dir; `.fq` and `.aln` fliles into `/fq`. Not the small insert/read size, which was chosen to deal with the short squences"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"import subprocess\n",
	"\n",
	"results = []\n",
	"for ln in line_names:\n",
	" fname = \"seqs/{0}.fasta\".format(ln)\n",
	" handle = open(fname, \"w\")\n",
	" for i,gn in enumerate(gene_names):\n",
	" handle.write(\">{0}\\n{1}\\n\".format(gn, sequence_set[i][ln].seq))\n",
	" handle.close()\n",
	" if ln != \"test_ref\": \n",
	" cmd = \"bin/art_illumina -i {0} -p -l 50 -f 30 -m 100 -s 10 -o fq/{1}_r\".format(fname, ln)\n",
	" results.append( subprocess.call(cmd, shell=True) )\n",
	"\n",
	"#test it worked, and all retuned exit code 0\n",
	"print len(results) == len(line_names)-1\n",
	"print sum(results) == 0\n",
	" "
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"True\n",
	"True\n"
	]
	}
	],
	"prompt_number": 350
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"for i,gn in enumerate(gene_names):\n",
	" handle = open(\"MSA/{0}.fasta\".format(gn), \"w\")\n",
	" for k,v in sorted(sequence_set[i].items()):\n",
	" handle.write(\">{0}\\n{1}\\n\".format(k,v.seq))"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 351
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##Now we have reads... let's monkey around with a couple of them. \n",
	"\n",
	"NOTE These functions move the files created above around, to re-run any cell in this section we should also re-run the sequencing simulations\n",
	"\n",
	"###first, introudce stand bias. \n",
	"using the `.aln` file to identify reads overlapping the mutation site, filter those IDs from the FASTQ:"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"from Bio import SeqIO\n",
	"\n",
	"def extract_read_id(read):\n",
	" return read.description.split(\"\\t\")[1].split(\"/\")[0]\n",
	"\n",
	"def overlaps(read):\n",
	" _, read_info, pos, strand = read.description.split(\"\\t\")\n",
	" if \"std_bias\" in read_info: #it's in the right gene\n",
	" if abs(601 - int(pos)) < 75: #it overlaps\n",
	" return (strand == \"+\") #is it the stand we care about?\n",
	" return False\n",
	"\n"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 352
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"to_drop = [extract_read_id(r) for r in SeqIO.parse(\"fq/D6_r1.aln\", \"fasta\") if overlaps(r)] + \\\n",
	" [extract_read_id(r) for r in SeqIO.parse(\"fq/D6_r2.aln\", \"fasta\") if overlaps(r)]\n",
	"\n",
	"print \"removed \",len(to_drop)\n",
	"#we will overwrite teh fastq files, so preserve the origanls\n",
	"#BE AWARE - you can't re-run this cell without re-running theorignal read-simulatoins\n",
	"!mv fq/D6_r1.fq fq/D6_r1_old.fq\n",
	"!mv fq/D6_r2.fq fq/D6_r2_old.fq\n",
	"\n",
	"\n",
	"\n",
	"print SeqIO.write([r for r in SeqIO.parse(\"fq/D6_r1_old.fq\", \"fastq\") if r.description.split(\"/\")[0] not in to_drop], \n",
	" \"fq/D6_r1.fq\", \"fastq\")\n",
	"print SeqIO.write([r for r in SeqIO.parse(\"fq/D6_r2_old.fq\", \"fastq\") if r.description.split(\"/\")[0] not in to_drop], \n",
	" \"fq/D6_r2.fq\", \"fastq\")"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"removed 56\n"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"2104\n",
	"2104"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n"
	]
	}
	],
	"prompt_number": 353
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"###Now, degrade the quality of one would-be mutation"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"def target_muants(read):\n",
	" _, read_info, pos, strand = read.description.split(\"\\t\")\n",
	" rid = read_info.split(\"/\")[0]\n",
	" if \"qual_score\" in read_info: #it's in the right gene\n",
	" pos = int(pos)\n",
	" if strand == \"-\":\n",
	" if 0 < (pos - 600) < 50:\n",
	" return rid, pos - 600\n",
	" else:\n",
	" if 0 < (601 - pos) <50:\n",
	" return rid, 600 -pos\n",
	" return rid, None"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 354
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"to_downgrade1 = {read:mutant for read,mutant in [target_muants(r) for r in SeqIO.parse(\"fq/D5_r1.aln\", \"fasta\")]}\n",
	"to_downgrade2 = {read:mutant for read,mutant in [target_muants(r) for r in SeqIO.parse(\"fq/D5_r2.aln\", \"fasta\")]}\n",
	"\n",
	"#we will overwrite teh fastq files, so preserve the origanls\n",
	"#BE AWARE - you can't re-run this cell without re-running theorignal read-simulatoins\n",
	"!mv fq/D5_r1.fq fq/D5_r1_old.fq\n",
	"!mv fq/D5_r2.fq fq/D5_r2_old.fq\n",
	"\n",
	"\n",
	"def downgrade_read(rec, dg_dict):\n",
	" \"\"\"\" \"\"\"\n",
	" base = dg_dict[ rec.id.split(\"/\")[0] ]\n",
	" if base:\n",
	" rec.letter_annotations[\"phred_quality\"][base] = 2\n",
	" return rec\n",
	" \n",
	"print SeqIO.write([downgrade_read(r, to_downgrade1) for r in SeqIO.parse(\"fq/D5_r1_old.fq\", \"fastq\")],\n",
	" \"fq/D5_r1.fq\", \"fastq\")\n",
	"print SeqIO.write([downgrade_read(r, to_downgrade2) for r in SeqIO.parse(\"fq/D5_r2_old.fq\", \"fastq\")],\n",
	" \"fq/D5_r2.fq\", \"fastq\")\n",
	"\n",
	"#how many did we downgrade\n",
	"print \"downgraded\", len([v for v in to_downgrade1.values() if v]) + len([v for v in to_downgrade2.values() if v]) "
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"2160\n",
	"2160"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"downgraded 28\n"
	]
	}
	],
	"prompt_number": 355
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"! ls fq/"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"A0_r1.aln D1_r2.aln D3_r1.aln D4_r2.aln D5_r2.fq D6_r2.fq\r\n",
	"A0_r1.fq D1_r2.fq D3_r1.fq\t D4_r2.fq D5_r2_old.fq D6_r2_old.fq\r\n",
	"A0_r2.aln D2_r1.aln D3_r2.aln D5_r1.aln D6_r1.aln test_ref_r1.aln\r\n",
	"A0_r2.fq D2_r1.fq D3_r2.fq\t D5_r1.fq D6_r1.fq test_ref_r1.fq\r\n",
	"D1_r1.aln D2_r2.aln D4_r1.aln D5_r1_old.fq D6_r1_old.fq test_ref_r2.aln\r\n",
	"D1_r1.fq D2_r2.fq D4_r1.fq\t D5_r2.aln D6_r2.aln test_ref_r2.fq\r\n"
	]
	}
	],
	"prompt_number": 356
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##Align into BAMS\n",
	"\n",
	"bowtie's index builder is kind of annoying, as far as I can tell it needs to be run from the directory in which it lives. To make the index you need to go into the dir and\n",
	"\n",
	"```[path to bowtie]./bowtie2-build-l [path to cwd]seqs/test_ref.fasta [path to cwd]/seqs/test_ref```"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"! /home/david/Downloads/bowtie2-2.2.0/bowtie2-build-l ~/analysis/tetma/test_data/seqs/test_ref.fasta ~/analysis/tetma/test_data/seqs/test_ref &> ref.log"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"Settings:\r\n",
	" Output files: \"/home/david/analysis/tetma/test_data/seqs/test_ref.*.bt2l\"\r\n",
	" Line rate: 7 (line is 128 bytes)\r\n",
	" Lines per side: 1 (side is 128 bytes)\r\n",
	" Offset rate: 4 (one in 16)\r\n",
	" FTable chars: 10\r\n",
	" Strings: unpacked\r\n",
	" Max bucket size: default\r\n",
	" Max bucket size, sqrt multiplier: default\r\n",
	" Max bucket size, len divisor: 4\r\n",
	" Difference-cover sample period: 1024\r\n",
	" Endianness: little\r\n",
	" Actual local endianness: little\r\n",
	" Sanity checking: disabled\r\n",
	" Assertions: disabled\r\n",
	" Random seed: 0\r\n",
	" Sizeofs: void*:8, int:4, long:8, size_t:8\r\n",
	"Input files DNA, FASTA:\r\n",
	" /home/david/analysis/tetma/test_data/seqs/test_ref.fasta\r\n",
	"Building a LARGE index\r\n",
	"Reading reference sizes\r\n"
	]
	}
	],
	"prompt_number": 357
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"for ln in line_names[1:]:\n",
	" cmd = \"bin/bowtie2-align-l -x seqs/test_ref -1 fq/{0}_r1.fq -2 fq/{0}_r2.fq --rg-id={0} \| samtools view -uhS - \| samtools sort - {0}\".format(ln)\n",
	" print subprocess.call(cmd, shell=True)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"0\n",
	"0"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"0"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"0"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"0"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"0"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"0"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n"
	]
	}
	],
	"prompt_number": 358
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"old_header = \"\"\"@HD\\tVN:1.0\\tSO:coordinate\n",
	"@SQ\\tSN:good_mutation\\tLN:1200\n",
	"@SQ\\tSN:indelish\\tLN:1200\n",
	"@SQ\\tSN:anc_het\\tLN:1200\n",
	"@SQ\\tSN:double_mutation\\tLN:1200\n",
	"@SQ\\tSN:qual_score\\tLN:1200\n",
	"@SQ\\tSN:std_bias\\tLN:1200\n",
	"\"\"\"\n",
	"\n",
	"with open(\"new_head.txt\", \"w\") as out:\n",
	" out.write(old_header)\n",
	" for ln in line_names[1:]:\n",
	" out.write(\"@RG\\tID:{0}\\tSM:{0}\\tPL:Simulation\\n\".format(ln))\n",
	"\n",
	" \n",
	"bams = \" \".join([ln + \".bam\" for ln in line_names[1:] ])\n",
	"print subprocess.call(\"samtools merge -frh new_head.txt merged.bam \" + bams, shell=True)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"0\n"
	]
	}
	],
	"prompt_number": 359
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"! samtools sort merged.bam sorted\n",
	"! samtools index sorted.bam"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 360
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Finally, make an ini file for `accumulate` "
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"dm_params=\"\"\"theta=0.0001\n",
	"nfreqs=0.388 #Ugly work around. Need to write a custom\n",
	"nfreqs=0.112 #boost validator to read these in as a line\n",
	"nfreqs=0.112 #in the config file. This works for now\n",
	"nfreqs=0.388 #\n",
	"mu=1.0e-8\n",
	"seq-error=0.01\n",
	"phi-haploid=0.001\n",
	"phi-diploid=0.001\n",
	"\"\"\"\n",
	"\n",
	"\n",
	"with open(\"test_params.ini\", \"w\") as out:\n",
	" with open(\"pp_params.ini\", \"w\") as pp_out:\n",
	" out.write(dm_params)\n",
	" for ln in line_names[1:]:\n",
	" out.write(\"sample-name={0}\\n\".format(ln))\n",
	" pp_out.write(\"sample-name={0}\\n\".format(ln))\n",
	"\n",
	"\n"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 361
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"! ./accuMUlate -c test_params.ini -b sorted.bam -r seqs/test_ref.fasta -o test.out &> err\n",
	"! ./pp -c pp_params.ini -b sorted.bam -i test.out \n",
	"! cat test.out\n"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 363
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"\n"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 362
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 362
	}
	],
	"metadata": {}
	}
	]
	}