Created
April 22, 2014 19:31
-
-
Save dwinter/11191449 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"metadata": { | |
"name": "", | |
"signature": "sha256:98f1acba9791082035860796ccf52294670282ba8deb7745219206b5182498d8" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#Creating test data for mutation detection" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##Set up some tools for creating random DNA sequences with weighted sampling" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from collections import deque\n", | |
"import random \n", | |
"\n", | |
"def create_alias(weights):\n", | |
" \"\"\" Create the probability and alias tables for \n", | |
" \"\"\"\n", | |
" n = len(weights)\n", | |
" s = sum(weights)\n", | |
" scaled_probs = [float(w)/s*n for w in weights]\n", | |
" Alias = [0 for _ in range(n)]\n", | |
" Prob = [0 for _ in range(n) ]\n", | |
" large = deque()\n", | |
" small = deque() \n", | |
"\n", | |
" for i,p in enumerate(scaled_probs):\n", | |
" if p < 1:\n", | |
" small.append(i)\n", | |
" else:\n", | |
" large.append(i)\n", | |
"\n", | |
" while (large and small):\n", | |
" l = small.pop()\n", | |
" g = large.pop()\n", | |
" Prob[l] = scaled_probs[l]\n", | |
" Alias[l] = g\n", | |
" scaled_probs[g] = (scaled_probs[g] + scaled_probs[l]) - 1\n", | |
" if scaled_probs[g] < 1:\n", | |
" small.append(g)\n", | |
" else:\n", | |
" large.append(g)\n", | |
"\n", | |
" while large:\n", | |
" g = large.pop()\n", | |
" Prob[g] = 1\n", | |
"\n", | |
" while small:\n", | |
" l = small.pop()\n", | |
" Prob[l] = 1\n", | |
"\n", | |
" return( (Prob, Alias) )\n", | |
"\n", | |
"def weighted_choice(population, weights):\n", | |
" Prob, Alias = create_alias(weights)\n", | |
" n = len(population) -1\n", | |
" i = random.randint(0,n)\n", | |
" if random.random() < Prob[i]:\n", | |
" return population[i]\n", | |
" return population[ Alias[i] ]\n", | |
"\n", | |
"def weighted_sample(population, weights, size):\n", | |
" Prob, Alias = create_alias(weights)\n", | |
" n = len(population) - 1\n", | |
" result = []\n", | |
" for rep in range(size):\n", | |
" i = random.randint(0,n)\n", | |
" if random.random() < Prob[i]:\n", | |
" result.append(population[i])\n", | |
" else:\n", | |
" result.append(population[ Alias[i] ])\n", | |
" return(result)\n", | |
" \n", | |
"def sample_from_alias(population, prob_table, alias_table):\n", | |
" \"\"\" \"\"\"\n", | |
" n = len(population) - 1\n", | |
" i = random.randint(0,n)\n", | |
" if random.random() < prob_table[i]:\n", | |
" return population[i]\n", | |
" return population[ alias_table[i] ]\n", | |
"\n", | |
" \n", | |
"def mutate(b, prob, alias):\n", | |
" \"\"\" \"\"\"\n", | |
" m = sample_from_alias(\"ACGT\", prob, alias)\n", | |
" while m == b:\n", | |
" m = sample_from_alias(\"ACGT\", prob, alias)\n", | |
" return(m)\n", | |
"\n", | |
" \n", | |
" " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 347 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##The Plan...\n", | |
"\n", | |
"... is to create test datasets from which simulated reads can be generated. Specifically we want\n", | |
"\n", | |
"+ A 'good'mutation that's 'easy' to call\n", | |
"+ A 'good' mutation that's harder to call because there are indels in the reads [not yet done]\n", | |
"+ A sequence at which only the ancestor is non-refernce (to deal with heterozgosity in ancestor)\n", | |
"+ A 'double mutant'\n", | |
"+ A case were all apparent mutant bases have quality scores below the threshold (default 13)\n", | |
"+ A case with strand bias (for the post-procesor to detect)\n", | |
"\n", | |
"To do this we will make random 'ancestral sequences' for 7 samples, an Ancestor (`A0`) abd descenants (`D1` ...`D6`). To make testing easier we'll only ever introduce mutatnts at the 600th base of the sequence, so to prevent overwriting a mutations we need new descenant line for every case above (and two for the double mutant). \n", | |
"\n", | |
"The code below requires the execultables `bowtie2` and `art_illumina`. In this notebook they are called by relative paths (i.e. they are not on `$PATH`)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##Make some 'ancestral' DNA sequences\n", | |
"\n", | |
"The goal here is 6 random \"chromosomes\" (on which we'll add the different cases above) for each of 8 samples \n", | |
"\n", | |
"I first tried this with lists(chromosomes) of lists (samples) of strings (sequences), but found keeping track of things too hard. Instead, here's a a functions to create \"mutable strings\" and methods to indtroduce mutations, and a dictoinary to keep track of them" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import copy\n", | |
"\n", | |
"class MutableSequence(object):\n", | |
" def __init__(self, start_string):\n", | |
" self.seq = start_string\n", | |
" \n", | |
" def __repr__(self):\n", | |
" if len(self.seq) > 5:\n", | |
" return \"MutableSequence({0}...{1})\".format(self.seq[1:3], self.seq[4:5])\n", | |
" return \"MutableSequence({0})\".format(self.seq)\n", | |
" \n", | |
" def point_mutation(self, index, alias_table, prob_table):\n", | |
" \"\"\" Introduce a point mutation. \n", | |
" \n", | |
" Note, this code requires precomputed alias table for weighted sampling\n", | |
" \"\"\"\n", | |
" listed = list(self.seq)\n", | |
" m = listed[index]\n", | |
" while m == listed[index]:\n", | |
" m = sample_from_alias(\"ACGT\", alias_table, prob_table)\n", | |
" listed[index] = m\n", | |
" self.seq = \"\".join(listed)\n", | |
" \n", | |
" def remove_bases(self, excluded_inidices):\n", | |
" \"\"\"\n", | |
" Remove bases from sequences\n", | |
" \n", | |
" excluded_indices is a tuple list with zero-based half open indices to remove\n", | |
" i.e to remove the first based from a MutableSequence `S`:\n", | |
" \n", | |
" S.remove_bases([(0,1)])\n", | |
" \"\"\"\n", | |
" listed = list(self.seq)\n", | |
" slice_start = 0\n", | |
" new_seq = []\n", | |
" for st,end in excluded_inidices:\n", | |
" new_seq += listed[slice_start:st]\n", | |
" slice_start = end \n", | |
" new_seq += listed[end:]\n", | |
" self.seq = \"\".join(new_seq)\n", | |
" \n", | |
"def make_seq_dict(specimen_names, alias_table, prob_table, seq_len = 1200):\n", | |
" aseq = MutableSequence(\"\".join([sample_from_alias(\"ACGT\",alias_table,prob_table) for _ in range(seq_len)]))\n", | |
" return {n:copy.deepcopy(aseq) for n in specimen_names}\n", | |
"\n", | |
"#start out by creating alias tables for somehting like the Tetrahymena AT bias\n", | |
"a,p = create_alias( [0.36, 0.14, 0.14, 0.36])\n", | |
"\n", | |
"gene_names = [\"good_mutation\", \"indelish\", \"anc_het\", \"double_mutation\", \"qual_score\", \"std_bias\"]\n", | |
"line_names = [\"test_ref\", \"A0\", \"D1\", \"D2\", \"D3\", \"D4\", \"D5\", \"D6\"]\n", | |
"\n", | |
"sequence_set = [make_seq_dict(line_names,a,p) for g in gene_names]\n", | |
"sequence_set[0:2]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 348, | |
"text": [ | |
"[{'A0': MutableSequence(TT...G),\n", | |
" 'D1': MutableSequence(TT...G),\n", | |
" 'D2': MutableSequence(TT...G),\n", | |
" 'D3': MutableSequence(TT...G),\n", | |
" 'D4': MutableSequence(TT...G),\n", | |
" 'D5': MutableSequence(TT...G),\n", | |
" 'D6': MutableSequence(TT...G),\n", | |
" 'test_ref': MutableSequence(TT...G)},\n", | |
" {'A0': MutableSequence(GA...T),\n", | |
" 'D1': MutableSequence(GA...T),\n", | |
" 'D2': MutableSequence(GA...T),\n", | |
" 'D3': MutableSequence(GA...T),\n", | |
" 'D4': MutableSequence(GA...T),\n", | |
" 'D5': MutableSequence(GA...T),\n", | |
" 'D6': MutableSequence(GA...T),\n", | |
" 'test_ref': MutableSequence(GA...T)}]" | |
] | |
} | |
], | |
"prompt_number": 348 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##Introduce mutations to some sequences" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#genes names are\n", | |
"#[\"good_mutation\", \"indelish\", \"anc_het\", \"double_mutation\", \"qual_score\", \"std_bias\"]\n", | |
"\n", | |
"#a good mutation\n", | |
"sequence_set[0][\"D1\"].point_mutation(600, a,p)\n", | |
"# a good mutation around indels\n", | |
"sequence_set[1][\"D2\"].point_mutation(600, a,p)\n", | |
"sequence_set[1][\"D2\"].remove_bases([(590,596), (605,611)])\n", | |
"#apparent mutation in the ancestor\n", | |
"sequence_set[2][\"A0\"].point_mutation(600, a,p)\n", | |
"#two apparent mutations in difference samples\n", | |
"sequence_set[3][\"D3\"].point_mutation(600, a,p)\n", | |
"sequence_set[3][\"D4\"].point_mutation(600, a,p)\n", | |
"# a good mutation that we'll edit the quality score for\n", | |
"sequence_set[4][\"D5\"].point_mutation(600, a,p)\n", | |
"# a good mutation that we'll edit the starnd bias\n", | |
"sequence_set[5][\"D6\"].point_mutation(600, a,p)\n", | |
"\n", | |
"#test it all worked\n", | |
"print sequence_set[0][\"D1\"].seq[600] != sequence_set[0][\"test_ref\"].seq[600]\n", | |
"print len(sequence_set[1][\"D2\"].seq) < len(sequence_set[1][\"test_ref\"].seq)\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"True\n", | |
"True\n" | |
] | |
} | |
], | |
"prompt_number": 349 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"###Write out the mutated sequences to fasta files, create reads with `ART`\n", | |
"\n", | |
"writes fasta sequenes for each line into `/seqs` dir; `.fq` and `.aln` fliles into `/fq`. Not the small insert/read size, which was chosen to deal with the short squences" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import subprocess\n", | |
"\n", | |
"results = []\n", | |
"for ln in line_names:\n", | |
" fname = \"seqs/{0}.fasta\".format(ln)\n", | |
" handle = open(fname, \"w\")\n", | |
" for i,gn in enumerate(gene_names):\n", | |
" handle.write(\">{0}\\n{1}\\n\".format(gn, sequence_set[i][ln].seq))\n", | |
" handle.close()\n", | |
" if ln != \"test_ref\": \n", | |
" cmd = \"bin/art_illumina -i {0} -p -l 50 -f 30 -m 100 -s 10 -o fq/{1}_r\".format(fname, ln)\n", | |
" results.append( subprocess.call(cmd, shell=True) )\n", | |
"\n", | |
"#test it worked, and all retuned exit code 0\n", | |
"print len(results) == len(line_names)-1\n", | |
"print sum(results) == 0\n", | |
" " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"True\n", | |
"True\n" | |
] | |
} | |
], | |
"prompt_number": 350 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"for i,gn in enumerate(gene_names):\n", | |
" handle = open(\"MSA/{0}.fasta\".format(gn), \"w\")\n", | |
" for k,v in sorted(sequence_set[i].items()):\n", | |
" handle.write(\">{0}\\n{1}\\n\".format(k,v.seq))" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 351 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##Now we have reads... let's monkey around with a couple of them. \n", | |
"\n", | |
"**NOTE** These functions move the files created above around, to re-run any cell in this section we should also re-run the sequencing simulations\n", | |
"\n", | |
"###first, introudce stand bias. \n", | |
"using the `.aln` file to identify reads overlapping the mutation site, filter those IDs from the FASTQ:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from Bio import SeqIO\n", | |
"\n", | |
"def extract_read_id(read):\n", | |
" return read.description.split(\"\\t\")[1].split(\"/\")[0]\n", | |
"\n", | |
"def overlaps(read):\n", | |
" _, read_info, pos, strand = read.description.split(\"\\t\")\n", | |
" if \"std_bias\" in read_info: #it's in the right gene\n", | |
" if abs(601 - int(pos)) < 75: #it overlaps\n", | |
" return (strand == \"+\") #is it the stand we care about?\n", | |
" return False\n", | |
"\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 352 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"to_drop = [extract_read_id(r) for r in SeqIO.parse(\"fq/D6_r1.aln\", \"fasta\") if overlaps(r)] + \\\n", | |
" [extract_read_id(r) for r in SeqIO.parse(\"fq/D6_r2.aln\", \"fasta\") if overlaps(r)]\n", | |
"\n", | |
"print \"removed \",len(to_drop)\n", | |
"#we will overwrite teh fastq files, so preserve the origanls\n", | |
"#BE AWARE - you can't re-run this cell without re-running theorignal read-simulatoins\n", | |
"!mv fq/D6_r1.fq fq/D6_r1_old.fq\n", | |
"!mv fq/D6_r2.fq fq/D6_r2_old.fq\n", | |
"\n", | |
"\n", | |
"\n", | |
"print SeqIO.write([r for r in SeqIO.parse(\"fq/D6_r1_old.fq\", \"fastq\") if r.description.split(\"/\")[0] not in to_drop], \n", | |
" \"fq/D6_r1.fq\", \"fastq\")\n", | |
"print SeqIO.write([r for r in SeqIO.parse(\"fq/D6_r2_old.fq\", \"fastq\") if r.description.split(\"/\")[0] not in to_drop], \n", | |
" \"fq/D6_r2.fq\", \"fastq\")" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"removed 56\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"2104\n", | |
"2104" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 353 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"###Now, degrade the quality of one would-be mutation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"def target_muants(read):\n", | |
" _, read_info, pos, strand = read.description.split(\"\\t\")\n", | |
" rid = read_info.split(\"/\")[0]\n", | |
" if \"qual_score\" in read_info: #it's in the right gene\n", | |
" pos = int(pos)\n", | |
" if strand == \"-\":\n", | |
" if 0 < (pos - 600) < 50:\n", | |
" return rid, pos - 600\n", | |
" else:\n", | |
" if 0 < (601 - pos) <50:\n", | |
" return rid, 600 -pos\n", | |
" return rid, None" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 354 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"to_downgrade1 = {read:mutant for read,mutant in [target_muants(r) for r in SeqIO.parse(\"fq/D5_r1.aln\", \"fasta\")]}\n", | |
"to_downgrade2 = {read:mutant for read,mutant in [target_muants(r) for r in SeqIO.parse(\"fq/D5_r2.aln\", \"fasta\")]}\n", | |
"\n", | |
"#we will overwrite teh fastq files, so preserve the origanls\n", | |
"#BE AWARE - you can't re-run this cell without re-running theorignal read-simulatoins\n", | |
"!mv fq/D5_r1.fq fq/D5_r1_old.fq\n", | |
"!mv fq/D5_r2.fq fq/D5_r2_old.fq\n", | |
"\n", | |
"\n", | |
"def downgrade_read(rec, dg_dict):\n", | |
" \"\"\"\" \"\"\"\n", | |
" base = dg_dict[ rec.id.split(\"/\")[0] ]\n", | |
" if base:\n", | |
" rec.letter_annotations[\"phred_quality\"][base] = 2\n", | |
" return rec\n", | |
" \n", | |
"print SeqIO.write([downgrade_read(r, to_downgrade1) for r in SeqIO.parse(\"fq/D5_r1_old.fq\", \"fastq\")],\n", | |
" \"fq/D5_r1.fq\", \"fastq\")\n", | |
"print SeqIO.write([downgrade_read(r, to_downgrade2) for r in SeqIO.parse(\"fq/D5_r2_old.fq\", \"fastq\")],\n", | |
" \"fq/D5_r2.fq\", \"fastq\")\n", | |
"\n", | |
"#how many did we downgrade\n", | |
"print \"downgraded\", len([v for v in to_downgrade1.values() if v]) + len([v for v in to_downgrade2.values() if v]) " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"2160\n", | |
"2160" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"downgraded 28\n" | |
] | |
} | |
], | |
"prompt_number": 355 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"! ls fq/" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"A0_r1.aln D1_r2.aln D3_r1.aln D4_r2.aln D5_r2.fq D6_r2.fq\r\n", | |
"A0_r1.fq D1_r2.fq D3_r1.fq\t D4_r2.fq D5_r2_old.fq D6_r2_old.fq\r\n", | |
"A0_r2.aln D2_r1.aln D3_r2.aln D5_r1.aln D6_r1.aln test_ref_r1.aln\r\n", | |
"A0_r2.fq D2_r1.fq D3_r2.fq\t D5_r1.fq D6_r1.fq test_ref_r1.fq\r\n", | |
"D1_r1.aln D2_r2.aln D4_r1.aln D5_r1_old.fq D6_r1_old.fq test_ref_r2.aln\r\n", | |
"D1_r1.fq D2_r2.fq D4_r1.fq\t D5_r2.aln D6_r2.aln test_ref_r2.fq\r\n" | |
] | |
} | |
], | |
"prompt_number": 356 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##Align into BAMS\n", | |
"\n", | |
"bowtie's index builder is kind of annoying, as far as I can tell it needs to be run from the directory in which it lives. To make the index you need to go into the dir and\n", | |
"\n", | |
"```[path to bowtie]./bowtie2-build-l [path to cwd]seqs/test_ref.fasta [path to cwd]/seqs/test_ref```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"! /home/david/Downloads/bowtie2-2.2.0/bowtie2-build-l ~/analysis/tetma/test_data/seqs/test_ref.fasta ~/analysis/tetma/test_data/seqs/test_ref &> ref.log" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Settings:\r\n", | |
" Output files: \"/home/david/analysis/tetma/test_data/seqs/test_ref.*.bt2l\"\r\n", | |
" Line rate: 7 (line is 128 bytes)\r\n", | |
" Lines per side: 1 (side is 128 bytes)\r\n", | |
" Offset rate: 4 (one in 16)\r\n", | |
" FTable chars: 10\r\n", | |
" Strings: unpacked\r\n", | |
" Max bucket size: default\r\n", | |
" Max bucket size, sqrt multiplier: default\r\n", | |
" Max bucket size, len divisor: 4\r\n", | |
" Difference-cover sample period: 1024\r\n", | |
" Endianness: little\r\n", | |
" Actual local endianness: little\r\n", | |
" Sanity checking: disabled\r\n", | |
" Assertions: disabled\r\n", | |
" Random seed: 0\r\n", | |
" Sizeofs: void*:8, int:4, long:8, size_t:8\r\n", | |
"Input files DNA, FASTA:\r\n", | |
" /home/david/analysis/tetma/test_data/seqs/test_ref.fasta\r\n", | |
"Building a LARGE index\r\n", | |
"Reading reference sizes\r\n" | |
] | |
} | |
], | |
"prompt_number": 357 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"for ln in line_names[1:]:\n", | |
" cmd = \"bin/bowtie2-align-l -x seqs/test_ref -1 fq/{0}_r1.fq -2 fq/{0}_r2.fq --rg-id={0} | samtools view -uhS - | samtools sort - {0}\".format(ln)\n", | |
" print subprocess.call(cmd, shell=True)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"0\n", | |
"0" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"0" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"0" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"0" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"0" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"0" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 358 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"old_header = \"\"\"@HD\\tVN:1.0\\tSO:coordinate\n", | |
"@SQ\\tSN:good_mutation\\tLN:1200\n", | |
"@SQ\\tSN:indelish\\tLN:1200\n", | |
"@SQ\\tSN:anc_het\\tLN:1200\n", | |
"@SQ\\tSN:double_mutation\\tLN:1200\n", | |
"@SQ\\tSN:qual_score\\tLN:1200\n", | |
"@SQ\\tSN:std_bias\\tLN:1200\n", | |
"\"\"\"\n", | |
"\n", | |
"with open(\"new_head.txt\", \"w\") as out:\n", | |
" out.write(old_header)\n", | |
" for ln in line_names[1:]:\n", | |
" out.write(\"@RG\\tID:{0}\\tSM:{0}\\tPL:Simulation\\n\".format(ln))\n", | |
"\n", | |
" \n", | |
"bams = \" \".join([ln + \".bam\" for ln in line_names[1:] ])\n", | |
"print subprocess.call(\"samtools merge -frh new_head.txt merged.bam \" + bams, shell=True)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"0\n" | |
] | |
} | |
], | |
"prompt_number": 359 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"! samtools sort merged.bam sorted\n", | |
"! samtools index sorted.bam" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 360 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Finally, make an ini file for `accumulate` " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"dm_params=\"\"\"theta=0.0001\n", | |
"nfreqs=0.388 #Ugly work around. Need to write a custom\n", | |
"nfreqs=0.112 #boost validator to read these in as a line\n", | |
"nfreqs=0.112 #in the config file. This works for now\n", | |
"nfreqs=0.388 #\n", | |
"mu=1.0e-8\n", | |
"seq-error=0.01\n", | |
"phi-haploid=0.001\n", | |
"phi-diploid=0.001\n", | |
"\"\"\"\n", | |
"\n", | |
"\n", | |
"with open(\"test_params.ini\", \"w\") as out:\n", | |
" with open(\"pp_params.ini\", \"w\") as pp_out:\n", | |
" out.write(dm_params)\n", | |
" for ln in line_names[1:]:\n", | |
" out.write(\"sample-name={0}\\n\".format(ln))\n", | |
" pp_out.write(\"sample-name={0}\\n\".format(ln))\n", | |
"\n", | |
"\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 361 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"! ./accuMUlate -c test_params.ini -b sorted.bam -r seqs/test_ref.fasta -o test.out &> err\n", | |
"! ./pp -c pp_params.ini -b sorted.bam -i test.out \n", | |
"! cat test.out\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 363 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 362 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 362 | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment