@dereneaton
Last active April 16, 2018 21:04
{
"metadata": {
"name": "",
"signature": "sha256:f2244fdfaa27f085e34a76df82ed1758f53a090e50df789ec2e7f7dfb2c421fd"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Example _de novo_ RADseq assembly using _pyRAD_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------- \n",
"\n",
"Please direct questions about _pyRAD_ analyses to the google group thread ([link](https://groups.google.com/forum/#!forum/pyrad-users)) \n",
"\n",
"-------------- \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ This tutorial is meant as a walkthrough for a single-end RADseq analyses. If you have not yet read the [__full tutorial__](http://www.dereneaton.com/software/pyrad), you should start there for a broader description of how _pyRAD_ works. If you are new to RADseq analyses, this tutorial will provide a simple overview of how to execute _pyRAD_, what the data files look like, and how to check that your analysis is working, and the expected output formats. \n",
"\n",
"\n",
"\n",
"+ Each cell in this tutorial begins with the header (%%bash) indicating that the code should be executed in a command line shell, for example by copying and pasting the text into your terminal (but excluding the %%bash header).\n",
"\n",
"------------- \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Begin by executing the command below. This will download an example simulated RADseq data set and unarchive it into your current directory."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"wget -q dereneaton.com/downloads/simRADs.zip\n",
"unzip simRADs.zip"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Archive: simRADs.zip\n",
" inflating: simRADs.barcodes \n",
" inflating: simRADs_R1.fastq.gz \n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------- \n",
"\n",
"#### The two necessary files below should now be located in your current directory.\n",
"\n",
"+ simRADs.fastq.gz : Illumina fastQ formatted reads (gzip compressed)\n",
"+ simRADs.barcodes : barcode map file \n",
"\n",
"----------------- \n",
"\n"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"We begin by creating the params.txt file which is used to set all parameters for an analysis."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## here I'm also creating a symlink from the location of pyrad on my \n",
"## machine so that I can call the program using just 'pyrad'\n",
"## if you wish to do this uncomment the code on the line below\n",
"## and replace the location with its location on your machine\n",
"#ln -s ~/Dropbox/pyrad-github/pyRAD pyrad\n",
"\n",
"## call pyRAD with the (-n) option\n",
"./pyrad -n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\tnew params.txt file created\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------ \n",
"\n",
"The params file lists on each line one parameter followed by a __##__ mark, after which any comments can be left. In the comments section there is a description of the parameter and in parentheses the step of the analysis affected by the parameter. Lines 1-14 are required, the remaining lines are optional. The params.txt file is further described in the general tutorial."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Let's take a look at the default settings. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"2 ## 7. N processors to use in parallel (all)\n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".88 ## 10. Wclust: clustering threshold as a decimal (s3,s6)\n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c88d6m4p3 ## 14. prefix name for final output (no spaces) (s7)\n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
" ## 24.opt.: maxH: hetero. sites in consensus seq (def. 5) (s5)\n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
" ## 30.opt.: add output formats: a,n,s,u (see documentation) (s7)\n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"To change parameters you can edit params.txt in any text editor. Here to automate things I use the script below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"sed -i '/## 7. /c\\8 ## 7. N processors... ' params.txt\n",
"sed -i '/## 10. /c\\.85 ## 10. lowered clust thresh... ' params.txt\n",
"sed -i '/## 14. /c\\c85m4p3 ## 14. outprefix... ' params.txt\n",
"sed -i '/## 24./c\\8 ## 24. maxH raised ... ' params.txt\n",
"sed -i '/## 30./c\\a,s,n,u,k ## 30. more output formats... ' params.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Let's have a look at the changes:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"8 ## 7. N processors... \n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".85 ## 10. lowered clust thresh... \n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c85m4p3 ## 14. outprefix... \n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
"8 ## 24. maxH raised ... \n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
"a,s,n,u,k ## 30. more output formats... \n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n"
]
}
],
"prompt_number": 5
},
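{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, the edited lines can be pulled out on their own instead of reading the whole file. This is only a sketch: it assumes the comment on each edited line still begins with `## N. ` (the number, a period, then a space), which is the case after the sed commands used in this tutorial."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## print only params lines 7, 10, 14, 24 and 30\n",
"grep -E '## (7|10|14|24|30)\\. ' params.txt"
],
"language": "python",
"metadata": {},
"outputs": []
},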
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------- \n",
"\n",
"__Let's take a look at what the raw data look like.__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your input data will be in fastQ format, usually ending in .fq or .fastq. Your data could be split among multiple files, or all within a single file (de-multiplexing goes much faster if they happen to be split into multiple files). The file/s may be compressed with gzip so that they have a .gz ending, but they do not need to be. The location of these files should be entered on line 2 of the params file. Below are the first three reads in the example file."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less simRADs_R1.fastq.gz | head -n 12 | cut -c 1-90"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"@lane1_fakedata0_R1_0 1:N:0:\n",
"TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R1_1 1:N:0:\n",
"TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R1_2 1:N:0:\n",
"TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------ \n",
"\n",
"Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer. And the fourth line the quality scores for the base calls. In this case arbitrarily high since the data were simulated. \n",
"\n",
"These are 100 bp single-end reads prepared as RADseq. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data. "
]
},
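{
"cell_type": "markdown",
"metadata": {},
"source": [
"The read layout described above can be checked directly on the first read. This is a minimal sketch, assuming `gunzip` is available and using the example file name from this tutorial; the column positions (1-6 for the barcode, 7-11 for the overhang) simply follow the description above. For the first read shown earlier it should print TTTTAA and TGCAG."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## sequence line of the first read (line 2 of the file)\n",
"seq=$(gunzip -c simRADs_R1.fastq.gz | head -n 2 | tail -n 1)\n",
"## bases 1-6: barcode\n",
"echo \"$seq\" | cut -c 1-6\n",
"## bases 7-11: restriction site overhang\n",
"echo \"$seq\" | cut -c 7-11"
],
"language": "python",
"metadata": {},
"outputs": []
},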
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------- \n",
"\n",
"## Step 1: de-multiplexing ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step uses information in the barcodes file to sort data into a separate file for each sample. Below is the barcodes file, with sample names and their barcodes each on a separate line with a tab between them."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat simRADs.barcodes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0\tCATCAT\n",
"1B0\tTTTTAA\n",
"1C0\tAGGGGA\n",
"1D0\tTAAGGT\n",
"2E0\tTTTATA\n",
"2F0\tGAGTAT\n",
"2G0\tATAGAG\n",
"2H0\tATGAGG\n",
"3I0\tGGGTTT\n",
"3J0\tTTAAAA\n",
"3K0\tGGATTG\n",
"3L0\tAAGAAG\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 1 writes the de-multiplexed data to a new file for each sample in a new directory created within the working directory called fastq/."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 1: sorting reads by barcode\n",
" ."
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that this created a new file for each sample in the directory 'fastq/'"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls fastq/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0_R1.fq.gz\n",
"1B0_R1.fq.gz\n",
"1C0_R1.fq.gz\n",
"1D0_R1.fq.gz\n",
"2E0_R1.fq.gz\n",
"2F0_R1.fq.gz\n",
"2G0_R1.fq.gz\n",
"2H0_R1.fq.gz\n",
"3I0_R1.fq.gz\n",
"3J0_R1.fq.gz\n",
"3K0_R1.fq.gz\n",
"3L0_R1.fq.gz\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The statistics for step 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A new directory called stats will also have been created. Each step of the _pyRAD_ analysis will create a new stats output file in this directory. The stats output for step 1 is below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/s1.sorting.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"file \tNreads\tcut_found\tbar_matched\n",
"simRADs_R1.fastq.gz\t480000\t480000\t480000\n",
"\n",
"\n",
"sample\ttrue_bar\tobs_bars\tN_obs\n",
"3L0 \tAAGAAG \tAAGAAG\t40000 \n",
"1C0 \tAGGGGA \tAGGGGA\t40000 \n",
"2G0 \tATAGAG \tATAGAG\t40000 \n",
"2H0 \tATGAGG \tATGAGG\t40000 \n",
"1A0 \tCATCAT \tCATCAT\t40000 \n",
"2F0 \tGAGTAT \tGAGTAT\t40000 \n",
"3K0 \tGGATTG \tGGATTG\t40000 \n",
"3I0 \tGGGTTT \tGGGTTT\t40000 \n",
"1D0 \tTAAGGT \tTAAGGT\t40000 \n",
"3J0 \tTTAAAA \tTTAAAA\t40000 \n",
"2E0 \tTTTATA \tTTTATA\t40000 \n",
"1B0 \tTTTTAA \tTTTTAA\t40000 \n",
"\n",
"nomatch \t_ \t0\n"
]
}
],
"prompt_number": 10
},
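{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-sample counts in the stats file can be cross-checked against the de-multiplexed files themselves. This is a minimal sketch, assuming the files are in fastq/ with the `_R1.fq.gz` ending listed above. Because each FASTQ record takes four lines, the number of reads is the line count divided by four, and it should agree with the N_obs column."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## print each file name followed by its number of reads\n",
"for f in fastq/*_R1.fq.gz; do\n",
"    n=$(gunzip -c \"$f\" | wc -l)\n",
"    echo \"$f $((n / 4))\"\n",
"done"
],
"language": "python",
"metadata": {},
"outputs": []
},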
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Step 2: quality filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step filters reads based on quality scores, and can be used to detect Illumina adapters in your reads, which is sometimes a problem with homebrew type library preparations. Here the filter is set to the default value of 0, meaning it filters only based on quality scores of base calls. The filtered files are written to a new directory called edits/."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\tstep 2: editing raw reads \n",
"\t............"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls edits/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0.edit\n",
"1B0.edit\n",
"1C0.edit\n",
"1D0.edit\n",
"2E0.edit\n",
"2F0.edit\n",
"2G0.edit\n",
"2H0.edit\n",
"3I0.edit\n",
"3J0.edit\n",
"3K0.edit\n",
"3L0.edit\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filtered data are written in fasta format (quality scores removed) into a new directory called edits/. Below I show a preview of the file which you can view most easily using the `less` command."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"head -n 10 edits/1A0.edit | cut -c 1-80"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_0_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_1_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_2_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_3_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_4_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n"
]
}
],
"prompt_number": 13
},
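{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how many reads passed filtering in each sample, the FASTA headers in the .edit files can be counted. This is a minimal sketch, assuming every read in these files has a header line starting with `>`, as in the preview above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## one line per file, in the form filename:number_of_reads\n",
"grep -c '^>' edits/*.edit"
],
"language": "python",
"metadata": {},
"outputs": []
},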
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Step 3: clustering within-samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 3 de-replicates and then clusters reads within each sample by the set clustering threshold and writes the clusters to new files in a directory called clust.xx"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 3"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tde-replicating files for clustering...\n",
"\n",
"\tstep 3: within-sample clustering of 12 samples at \n",
"\t '.85' similarity using up to 8 processors\n",
"\t2E0.edit finished, 2000loci\n",
"\t1A0.edit finished, 2000loci\n",
"\t1C0.edit finished, 2000loci\n",
"\t3I0.edit finished, 2000loci\n",
"\t2H0.edit finished, 2000loci\n",
"\t1B0.edit finished, 2000loci\n",
"\t3L0.edit finished, 2000loci\n",
"\t3J0.edit finished, 2000loci\n",
"\t2F0.edit finished, 2000loci\n",
"\t1D0.edit finished, 2000loci\n",
"\t2G0.edit finished, 2000loci\n",
"\t3K0.edit finished, 2000loci\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, I recommend you use the unix command 'less' to look at the clustS files. These contain each cluster separated by \"//\". For the first few clusters below you can see that there is one or two alleles in the cluster and one or a few reads that contained a (simulated) sequencing error. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_2540_r1;size=17;\n",
"TGCAGTGTAACGTTGTATCCATCGAGTCGATCATAGCCTAAAATAAGTAACACTAATCAGGCGCGCTGGTTGGGGGATCA\n",
">1A0_2549_r1;size=1;+\n",
"TGCAGTGTAACGTTGTATCCATCGAGTCGATCATAGCCTAAAATAAGTAACGCTAATCAGGCGCGCTGGTTGGGGGATCA\n",
">1A0_2541_r1;size=1;+\n",
"TGCAGTGTAACGTTGTATCCAACGAGTCGATCATAGCCTAAAATAAGTAACACTAATCAGGCGCGCTGGTTGGGGGATCA\n",
">1A0_2551_r1;size=1;+\n",
"TGCAGTGTAACGTTGTATCCATCGAGTCGATCATAGCCTAAAATAAGTAACACTAATCAGGCGCGTTGGTTGGGGGATCA\n",
"//\n",
"//\n",
">1A0_2140_r1;size=19;\n",
"TGCAGCTCCGTCACTGCTCAGCGAACCTACTATCTAGTCGGAAAAGGTTCCGGCCCTTATGCTAAGTGCAAGCTGCCAGT\n",
">1A0_2155_r1;size=1;+\n",
"TGCAGCTCCCTCACTGCTCAGCGAACCTACTATCTAGTCGGAAAAGGTTCCGGCCCTTATGCTAAGTGCAAGCTGCCAGT\n",
"//\n",
"//\n",
">1A0_8280_r1;size=10;\n",
"TGCAGCGTATATGATCAGAACCGGGTGAGTGGGTACCGCGAACCGAAAGGCATCGAAAGTTTAGCGCAGCACTAATCTCA\n",
">1A0_8290_r1;size=8;+\n",
"TGCAGCGTATATGATCAGAACCGGGTGAGTGGGTACCGCGAACCGAAAGGCACCGAAAGTTTAGCGCAGCACTAATCTCA\n",
">1A0_8297_r1;size=1;+\n",
"TGCAGCGTATATGATCAGAACCGGGTGAGTGGGAACCGCGAACCGAAAGGCACCGAAAGTTTAGCGCAGCACTAATCTCA\n",
">1A0_8292_r1;size=1;+\n",
"TGCAGCCTATATGATCAGAACCGGGTGAGTGGGTACCGCGAACCGAAAGGCACCGAAAGTTTAGCGCAGCACTAATCTCA\n",
"//\n",
"//\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------\n",
"\n",
"\n",
"The stats output tells you how many clusters were found, and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide if you wish to increase or decrease the mindepth before it is applied for making consensus base calls in steps 4 & 5."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/s3.clusters.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"taxa\ttotal\tdpt.me\tdpt.sd\td>5.tot\td>5.me\td>5.sd\tbadpairs\n",
"1A0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"1B0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"1C0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"1D0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2E0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2F0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2G0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2H0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3I0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3J0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3K0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3L0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"\n",
" ## total = total number of clusters, including singletons\n",
" ## dpt.me = mean depth of clusters\n",
" ## dpt.sd = standard deviation of cluster depth\n",
" ## >N.tot = number of clusters with depth greater than N\n",
" ## >N.me = mean depth of clusters with depth greater than N\n",
" ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N\n",
" ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)\n",
"\n",
"HISTOGRAMS\n",
"\n",
" \n",
"sample: 1A0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 1B0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 1C0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 1D0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2E0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2F0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2G0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2H0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3I0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3J0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3K0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3L0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Steps 4 & 5: Call consensus sequences"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Step 4 jointly infers the error-rate and heterozygosity across samples."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 4"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 4: estimating error rate and heterozygosity\n",
"\t............"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less stats/Pi_E_estimate.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"taxa\tH\tE\n",
"3K0\t0.00135982\t0.00048078\t\n",
"1C0\t0.00134858\t0.00048372\t\n",
"1D0\t0.00135375\t0.00048822\t\n",
"3I0\t0.00129751\t0.00048694\t\n",
"2H0\t0.00133223\t0.00049211\t\n",
"2F0\t0.00135365\t0.0004995\t\n",
"1A0\t0.00136043\t0.00051028\t\n",
"2E0\t0.00126915\t0.00051556\t\n",
"1B0\t0.00149924\t0.00049663\t\n",
"3J0\t0.00144422\t0.0005089\t\n",
"2G0\t0.00138185\t0.00051206\t\n",
"3L0\t0.00143349\t0.00051991\t\n"
]
}
],
"prompt_number": 18
},
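{
"cell_type": "markdown",
"metadata": {},
"source": [
"The step 5 log further down reports a single H and a single E value for all samples. The sketch below averages the per-sample estimates from the table above. Treating the step 5 values as simple means of these columns is an assumption made here, not something this tutorial states, but the averages should come out close to the values in that log."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## skip the header line, then average column 2 (H) and column 3 (E)\n",
"awk 'NR>1 {h+=$2; e+=$3; n++} END {printf \"mean H=%.5f mean E=%.5f\\n\", h/n, e/n}' stats/Pi_E_estimate.txt"
],
"language": "python",
"metadata": {},
"outputs": []
},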
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 5: creating consensus seqs for 12 samples, using H=0.00137 E=0.00050\n",
"\t............"
]
}
],
"prompt_number": 19
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The stats output for step 5"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less stats/s5.consens.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"taxon \tnloci\tf1loci\tf2loci\tnsites\tnpoly\tpoly\n",
"3L0.clustS.gz \t2000\t2000\t2000\t178003\t255\t0.0014326\n",
"2F0.clustS.gz \t2000\t2000\t2000\t178002\t241\t0.0013539\n",
"1B0.clustS.gz \t2000\t2000\t2000\t178005\t267\t0.0015\n",
"3J0.clustS.gz \t2000\t2000\t2000\t178003\t257\t0.0014438\n",
"1A0.clustS.gz \t2000\t2000\t2000\t178002\t242\t0.0013595\n",
"2H0.clustS.gz \t2000\t2000\t2000\t178001\t237\t0.0013315\n",
"2E0.clustS.gz \t2000\t2000\t2000\t178002\t226\t0.0012696\n",
"2G0.clustS.gz \t2000\t2000\t2000\t178001\t246\t0.001382\n",
"1D0.clustS.gz \t2000\t2000\t2000\t178002\t241\t0.0013539\n",
"3I0.clustS.gz \t2000\t2000\t2000\t178003\t231\t0.0012977\n",
"1C0.clustS.gz \t2000\t2000\t2000\t178002\t240\t0.0013483\n",
"3K0.clustS.gz \t2000\t2000\t2000\t178001\t242\t0.0013595\n",
"\n",
" ## nloci = number of loci\n",
" ## f1loci = number of loci with >N depth coverage\n",
" ## f2loci = number of loci with >N depth and passed paralog filter\n",
" ## nsites = number of sites across f loci\n",
" ## npoly = number of polymorphic sites in nsites\n",
" ## poly = frequency of polymorphic sites\n"
]
}
],
"prompt_number": 20
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Step 6: Cluster across samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 6 clusters consensus sequences across samples. This step can take a long time for very large data sets (>100 individuals). I suggest trying it first. It will print its progress and if it looks to be taking way too long then you can implement the hierarchical clustering method instead, described in detail in a separate tutorial)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 6 "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"usearch v7.0.1090_i86linux32, 4.0Gb RAM (65.8Gb total), 40 cores\n",
"(C) Copyright 2013 Robert C. Edgar, all rights reserved.\n",
"http://drive5.com/usearch\n",
"\n",
"Licensed to: daeaton.chicago@gmail.com\n",
"\n",
"\n",
"\tfinished clustering\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 6: clustering across 12 samples at '.85' similarity \n",
"\n",
"00:00 21Mb 0.1% 0 clusters, max size 0, avg 0.0\r",
"00:01 26Mb 1.0% 192 clusters, max size 2, avg 1.1\r",
"00:02 28Mb 3.2% 624 clusters, max size 4, avg 1.2\r",
"00:03 28Mb 4.9% 881 clusters, max size 4, avg 1.3\r",
"00:04 29Mb 6.5% 1085 clusters, max size 5, avg 1.4\r",
"00:05 29Mb 8.2% 1266 clusters, max size 5, avg 1.5\r",
"00:06 29Mb 10.1% 1435 clusters, max size 5, avg 1.7\r",
"00:07 29Mb 12.5% 1591 clusters, max size 7, avg 1.9\r",
"00:08 29Mb 15.5% 1733 clusters, max size 7, avg 2.1\r",
"00:09 30Mb 19.9% 1865 clusters, max size 8, avg 2.5\r",
"00:10 30Mb 29.4% 1972 clusters, max size 10, avg 3.6\r",
"00:11 30Mb 75.2% 2000 clusters, max size 12, avg 9.0\r",
"00:11 30Mb 100.0% 2000 clusters, max size 12, avg 12.0\r\n",
" \n",
" Seqs 24000 (24.0k)\n",
" Clusters 2000\n",
" Max size 12\n",
" Avg size 12.0\n",
" Min size 12\n",
"Singletons 0, 0.0% of seqs, 0.0% of clusters\n",
" Max mem 30Mb\n",
" Time 11.0s\n",
"Throughput 2181.8 seqs/sec.\n",
"\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Step 7: Assemble final data sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step is to output data only for the loci that you want to have included in your data set. This filters once again for potential paralogs or highly repetitive regions, and includes options to minimize the amount of missing data in the output. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 7"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\tingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0\n",
"\taddon \n",
"\texclude \n",
"\t\n",
"\tfinal stats written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/stats/c85m4p3.stats\n",
"\toutput files written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/outfiles/ directory\n",
"\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"........"
]
}
],
"prompt_number": 51
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Final stats output"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less stats/c85m4p3.stats"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"2000 ## loci with > minsp containing data\n",
"2000 ## loci with > minsp containing data & paralogs removed\n",
"2000 ## loci with > minsp containing data & paralogs removed & final filtering\n",
"\n",
"## number of loci recovered in final data set for each taxon.\n",
"taxon\tnloci\n",
"1A0\t2000\n",
"1B0\t2000\n",
"1C0\t2000\n",
"1D0\t2000\n",
"2E0\t2000\n",
"2F0\t2000\n",
"2G0\t2000\n",
"2H0\t2000\n",
"3I0\t2000\n",
"3J0\t2000\n",
"3K0\t2000\n",
"3L0\t2000\n",
"\n",
"\n",
"## nloci = number of loci with data for exactly ntaxa\n",
"## ntotal = number of loci for which at least ntaxa have data\n",
"ntaxa\tnloci\tsaved\tntotal\n",
"1\t-\n",
"2\t-\t\t-\n",
"3\t-\t\t-\n",
"4\t0\t*\t2000\n",
"5\t0\t*\t2000\n",
"6\t0\t*\t2000\n",
"7\t0\t*\t2000\n",
"8\t0\t*\t2000\n",
"9\t0\t*\t2000\n",
"10\t0\t*\t2000\n",
"11\t0\t*\t2000\n",
"12\t2000\t*\t2000\n",
"\n",
"\n",
"## var = number of loci containing n variable sites.\n",
"## pis = number of loci containing n parsimony informative var sites.\n",
"n\tvar\tPIS\n",
"0\t145\t551\n",
"1\t1083\t699\n",
"2\t945\t475\n",
"3\t637\t187\n",
"4\t367\t69\n",
"5\t182\t13\n",
"6\t59\t3\n",
"7\t18\t2\n",
"8\t12\t1\n",
"9\t1\t0\n",
"total var= 7847\n",
"total pis= 2591\n"
]
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--------------- \n",
"\n",
"## Output formats ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We created 8 output files from our analysis. The standard four (.loci, .phy, .excluded_loci, and .unlinked_snps), as well as the four additional formats we requested in the params file (.snps, .alleles, .str and .nex). These are all fully explained the general tutorial."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"ls outfiles"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"c85m4p3.alleles\n",
"c85m4p3.excluded_loci\n",
"c85m4p3.loci\n",
"c85m4p3.nex\n",
"c85m4p3.phy\n",
"c85m4p3.snps\n",
"c85m4p3.str\n",
"c85m4p3.unlinked_snps\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loci format \n",
"The \".loci\" file contains each locus listed in a fasta-like format that also shows which sites are variable below each locus. Autapomorphies are listed as '-' and shared SNPs as '*'. This is a custom format that is human readable and also used as input to perform D-statistic tests in pyRAD. This is the easiest way to visualize your results. I recommend viewing the file with the command `less`. Below I use a head and cut to make it easy to view in this window."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 39 outfiles/c85m4p3.loci | cut -c 1-75"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">1B0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">1C0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">1D0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAMTGTTGGCGAGTCTCATCGCGAGGCA\n",
">2E0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">2F0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">2G0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">2H0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTMAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3I0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3J0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3K0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3L0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
"// - - - \n",
">1A0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">1B0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">1C0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">1D0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTMGTCA\n",
">2E0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">2F0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">2G0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">2H0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3I0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3J0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3K0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3L0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
"// * - \n",
">1A0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">1B0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">1C0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">1D0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2E0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2F0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2G0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2H0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">3I0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGAGCATCGGTGCG\n",
">3J0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGGGCATCGGTGCG\n",
">3K0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">3L0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
"// * * * - \n"
]
}
],
"prompt_number": 36
},
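{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `.loci` layout described above can be read with a few lines of Python. This is a minimal sketch and not part of _pyRAD_: the function name `summarize_loci` is hypothetical, and it assumes that sample lines start with `>`, that every locus ends with a line starting with `//`, and that `-` and `*` appear on that line only as site markers. The example locus is shortened and invented for illustration.\n",
"\n",
"```python\n",
"def summarize_loci(lines):\n",
"    '''Return one (n_samples, n_autapomorphies, n_shared_snps) tuple per locus.'''\n",
"    summaries = []\n",
"    n_samples = 0\n",
"    for line in lines:\n",
"        if line.startswith('>'):\n",
"            # a sample line: name, whitespace, then the sequence\n",
"            n_samples += 1\n",
"        elif line.startswith('//'):\n",
"            # the marker line closes the current locus\n",
"            summaries.append((n_samples, line.count('-'), line.count('*')))\n",
"            n_samples = 0\n",
"    return summaries\n",
"\n",
"# shortened, invented locus used only for illustration\n",
"example = [\n",
"    '>1A0    AGACGT',\n",
"    '>1B0    AGACGT',\n",
"    '>3I0    TGACGA',\n",
"    '>3J0    TGACGT',\n",
"    '//      *    -',\n",
"]\n",
"print(summarize_loci(example))\n",
"```\n",
"\n",
"To try it on the real output, an open file handle such as `open('outfiles/c85m4p3.loci')` can be passed in place of `example`."
]
},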
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PHY format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.phy | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"12 178083\n",
"1A0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"1B0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"1C0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTMCGAG\n",
"1D0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAMTGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2E0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2F0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2G0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2H0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTMAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3I0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3J0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3K0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3L0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### NEX format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.nex | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"#NEXUS\n",
"BEGIN DATA;\n",
" DIMENSIONS NTAX=12 NCHAR=178083;\n",
" FORMAT DATATYPE=DNA MISSING=N GAP=- INTERLEAVE=YES;\n",
" MATRIX\n",
" 1B0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2G0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2F0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 1A0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2H0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTMAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2E0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 3I0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 1C0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTMCG\n",
" 3L0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 1D0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAMTGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 3J0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 3K0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
"\n",
" 1B0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACATCAAGGGTACC\n",
" 2G0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 2F0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 1A0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACATCAAGGGTACC\n",
" 2H0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 2E0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACARCAAGGGTACC\n",
" 3I0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 1C0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACATCAAGGGTACC\n",
" 3L0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 1D0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTMGTCAATGTTCCACATCAAGGGTACC\n",
" 3J0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 3K0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
"\n",
" 1B0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2G0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2F0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 1A0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2H0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2E0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGWAGAGCTGA\n",
" 3I0 CGACTACCGGCGTGATGTCGGGTTCACCCCCCGAGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGG\n",
" 1C0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 3L0 CGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 1D0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 3J0 CGACTACCGGCGTGATGTCGGGTTCACCCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGG\n",
" 3K0 CGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGR\n",
"\n",
" 1B0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2G0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2F0 CCTGCGYGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 1A0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2H0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2E0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n"
]
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alleles format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.alleles| cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1A0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1B0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1B0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1C0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1C0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTACG\n",
">1D0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCACTGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1D0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2E0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2E0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2F0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2F0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2G0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2G0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2H0_0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2H0_1 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTAAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3I0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3I0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3J0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3J0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3K0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3K0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3L0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3L0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
"// - - - - \n",
">1A0_0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1A0_1 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1B0_0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1B0_1 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1C0_0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1C0_1 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1D0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTCGTCAATGTTCCACA\n",
">1D0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2E0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2E0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2F0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2F0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2G0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2G0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2H0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2H0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3I0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3I0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3J0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3J0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3K0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3K0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3L0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3L0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
"// * - \n"
]
}
],
"prompt_number": 39
},
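{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing this output with the `.loci` output suggests that each IUPAC ambiguity code in a consensus sequence is written as its two bases, one in each allele copy (for example, the `M` in 1D0 appears as `C` in `1D0_0` and `A` in `1D0_1`). The sketch below illustrates that idea only. It is not _pyRAD_'s implementation: `split_alleles` is a hypothetical helper, the order in which the two bases are assigned is an arbitrary choice, and no phasing is attempted when a locus has more than one heterozygous site.\n",
"\n",
"```python\n",
"# standard IUPAC codes for two-base ambiguities (base order is arbitrary here)\n",
"IUPAC = {\n",
"    'R': ('A', 'G'), 'Y': ('C', 'T'), 'S': ('C', 'G'),\n",
"    'W': ('A', 'T'), 'K': ('G', 'T'), 'M': ('C', 'A'),\n",
"}\n",
"\n",
"def split_alleles(consensus):\n",
"    '''Split a diploid consensus sequence into two allele strings.'''\n",
"    first, second = [], []\n",
"    for base in consensus:\n",
"        # unambiguous bases are copied to both alleles\n",
"        a, b = IUPAC.get(base, (base, base))\n",
"        first.append(a)\n",
"        second.append(b)\n",
"    return ''.join(first), ''.join(second)\n",
"\n",
"# fragment of the 1D0 sequence from the first locus shown above\n",
"print(split_alleles('CTTCAMTGTT'))\n",
"```"
]
},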
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### STRUCTURE (.str) format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.str | cut -c 1-20"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1A0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1B0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1B0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1C0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1C0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1D0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"1D0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2E0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2E0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2F0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2F0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2G0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2G0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2H0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2H0 \t\t\t\t\t\t\t0\t0\t2\t3\n",
"3I0 \t\t\t\t\t\t\t3\t0\t0\t3\n",
"3I0 \t\t\t\t\t\t\t3\t0\t0\t3\n",
"3J0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3J0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3K0 \t\t\t\t\t\t\t3\t0\t2\t2\n",
"3K0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3L0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3L0 \t\t\t\t\t\t\t3\t0\t2\t3\n"
]
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SNPs format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.snps | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"## 12 taxa, 2000 loci, 7888 snps\n",
"1A0 CCACT TAT TTAG AATC TAYC ACGTACT AAGG TCCTTG AAA GGCTT CTGTA GCAACT G A GATGC T\n",
"1B0 CCACT TAT TTAG AATC TATC ACGTACT AAGG TCCTTG AAA GGCTT CTGTA GCAACT G A GATGC T\n",
"1C0 CCAMT TAT TTAG AATC TATC ACGTTCT AAGG TCCTTG AAA GGCTT YTATA AAACCT G A GATGC T\n",
"1D0 CCMCT AMT TTAG AATC TATT ACGTTCA AAAG TCCTTG AAA GGCTT CTGTT GCACCT C A GCTGC T\n",
"2E0 CCACT AAR TTAG WATC TATC ACKTTAT AAGG TCCTKG TAG TGCTT CAGTA GCTCCT C A GATGY T\n",
"2F0 CCACT AAA TTAG AAYC TATC ACGTTAT AAGG TCCTGG TAG GGCAT CAGTA GCACCT C A GATGC T\n",
"2G0 CCACT AAA TTAG AATC TATC ACGTTAT AWGG TTCTTA TAG TGCTT CAGTA GCACCT C A GATGC T\n",
"2H0 AMACT AAA TTAG AATC TATC ACGTTCT AAGR TCTTTG TAA GGCTK CAGCA GCACCY C T GATGC T\n",
"3I0 CCACG AAA AACA AGTC TATC AGGTTCT TAGG TCCTTG AAA GGCTT CTGTA GCACCT C A AAKAC T\n",
"3J0 CCACG AAA AACG AGTC TATC AGGKTCT TAGG TCCTTG AAA GGCTT CTGTA GCACCT C A AATAC T\n",
"3K0 CCACT AAA AAAG ARTS TWTC WGGTTCT TAGG WCCTTG AAA GGCTT CTGTA GCACCT C A GATAC T\n",
"3L0 CCACG AAA AAAG AATC CATC AGGTTCT TAGG TCCATG AMA GAMTT CTGTA GCACTT C A GATAC C\n"
]
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### UNLINKED_SNPs format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.unlinked_snps | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"12 1959\n",
"1A0 CTGCYAGTATGTGAGARKGACARGAGGTGACTCGCTSACAGAACTCTTTCGCGCGGCCTACGGCAGTAGACAAATTTCA\n",
"1B0 CTGCTAGTATGTGAGAGTGACAGGAGGTGACTCGCTGACTGAACTCTTCCGCGCGGCCTACGACAGTAGACAAATTTCA\n",
"1C0 CTGCTTGTATATGAGAGTGACAGGAGGTGAATCGCTGACAGAACTCTTTCTCGCGGCCTACGACAGTCGACAAATTTCA\n",
"1D0 CAGCTTGTATGTCAGTGTGASAGGAGGTGACTCGCTGACAGAACTCTATCGCGCGTCCTACGCTAGTAGACAAATTTCA\n",
"2E0 CAGCTTGTGTGTCAGTGTGACAGGAACTGACTCGCAGACAGAACTCTATCGCGCGGCCTAMGACAGTAGACGAATCTTM\n",
"2F0 CAGCTTGTGTGTCAGTGTGACAGGAACTGACTCGCAGACAGAACTCTATCGMGCGGCCTACGACAGTAGACGAATCTTA\n",
"2G0 CAGCTTGTGTGTCAGTGTAACAGCAACTGACTCGCTGACAGAACTCTATCGCGCGGCCTACRACATYAGACAACTYTTA\n",
"2H0 MAGCTTRTAKGYCTGTGTGACAGGAGCTGACTCGCTGACAGAACTCTATTGCGCGGCCWACGACAGTAGAAAAATTGCA\n",
"3I0 CAACTTGTATGTCAATGTGCCTGGRGGGGCCACGCTGRGAKCACTCAATCGCGGGGCCTTCGACAGTACAAAGACTTCA\n",
"3J0 CAGCTTGTATGTCAATGTGCCAGGAGGTGCCTCGCTGAGAKAWTTTAATCGCAGGGCCTTCGACAGTACWAAGACTTCA\n",
"3K0 CAGSTTGWATGTCAATGTGCCAGGAGGTCCCTCGCTGAGATAACTCAATCGCGGGGCCTTCGACAGTACAAAAATTTCA\n",
"3L0 CAGCTTGTATGTCAATGTGACAGGAGGTGCCTASSTGAGAGAACKCAATCGCGGRGATTTCGACGGTACAAAAATTTCA\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## OTHER FORMATS \n",
"\n",
"You may also produce some more complicated formatting options that involve pooling individuals into groups or populations. This can be done for the \"treemix\" and \"migrate\" outputs, which are formatted for input into the programs _TreeMix_ and _migrate-n_, respectively. Grouping individuals into populations is done with the final lines of the params file as shown below, and similar to the assignment of individuals into clades for hierarchical clustering (see full tutorial). \n",
"\n",
"Each line designates a group, and has three arguments that are separated __by a single space__. The first is the group name, the second is the minimum number of individuals that must have data in that group for a locus to be included in the output, and the third is a list of the members of that group. Example below:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"## append group designations to the params file\n",
"echo \"pop1 4 1A0,1B0,1C0,1D0 ##\" >> params.txt\n",
"echo \"pop2 4 2E0,2F0,2G0,2H0 ##\" >> params.txt\n",
"echo \"pop3 4 3I0,3J0,3K0,3L0 ##\" >> params.txt\n",
"\n",
"## view params file\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"8 ## 7. N processors... \n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".85 ## 10. lowered clust thresh... \n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c85m4p3 ## 14. outprefix... \n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
"8 ## 24. maxH raised ... \n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
"m,t ## 30. more output formats... \n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n",
"pop1 4 1A0,1B0,1C0,1D0 ##\n",
"pop2 4 2E0,2F0,2G0,2H0 ##\n",
"pop3 4 3I0,3J0,3K0,3L0 ##\n"
]
}
],
"prompt_number": 66
},
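{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three fields of a group line can be pulled apart as follows. This is an illustrative sketch rather than _pyRAD_'s own parser: `parse_group_line` is a hypothetical name, the trailing `##` marker is assumed to be optional, and splitting on any whitespace is more lenient than the single-space rule stated above.\n",
"\n",
"```python\n",
"def parse_group_line(line):\n",
"    '''Return (group name, minimum samples with data, list of members).'''\n",
"    # drop the trailing comment marker, then split on whitespace\n",
"    fields = line.split('##')[0].split()\n",
"    name, min_samples, members = fields\n",
"    return name, int(min_samples), members.split(',')\n",
"\n",
"print(parse_group_line('pop1 4 1A0,1B0,1C0,1D0 ##'))\n",
"```\n",
"\n",
"A line with a missing or extra field raises a `ValueError` at the unpacking step, which is a cheap way to catch formatting mistakes before running step 7."
]
},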
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating population output files \n",
"Now if we run _pyRAD_ with the 'm' (migrate) or 't' (treemix) output options, it will create their output files. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"\n",
"## add m and t to output options\n",
"sed -i '/## 30./c\\m,t ## 30. more output formats... ' params.txt\n",
"\n",
"## assemble data set\n",
"./pyrad -p params.txt -s 7"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\tgroups for 't' or 'm' outputs: ['pop1', 'pop2', 'pop3']\n",
"\tingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0\n",
"\taddon \n",
"\texclude \n",
"\t\n",
"\tfinal stats written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/stats/c85m4p3.stats\n",
"\toutput files written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/outfiles/ directory\n",
"\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tCluster input file: using \n",
"\t/home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/clust.85/cat.clust_.gz\n",
"\n",
"........"
]
}
],
"prompt_number": 87
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TREEMIX format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"less outfiles/c85m4p3.treemix.gz | head -n 30"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"pop3 pop2 pop1\n",
"8,0 7,1 8,0\n",
"8,0 8,0 2,6\n",
"6,2 8,0 8,0\n",
"7,1 8,0 8,0\n",
"8,0 8,0 7,1\n",
"8,0 8,0 4,4\n",
"8,0 7,1 8,0\n",
"7,1 8,0 8,0\n",
"8,0 2,6 8,0\n",
"8,0 7,1 8,0\n",
"8,0 8,0 6,2\n",
"8,0 7,1 8,0\n",
"8,0 8,0 2,6\n",
"8,0 6,2 8,0\n",
"0,8 8,0 8,0\n",
"8,0 8,0 2,6\n",
"8,0 8,0 7,1\n",
"8,0 8,0 7,1\n",
"8,0 6,2 8,0\n",
"2,6 8,0 8,0\n",
"8,0 8,0 7,1\n",
"6,2 8,0 8,0\n",
"8,0 8,0 7,1\n",
"8,0 6,2 8,0\n",
"7,1 8,0 8,0\n",
"8,0 2,6 8,0\n",
"8,0 0,8 8,0\n",
"6,2 8,0 8,0\n",
"6,2 8,0 8,0\n"
]
}
],
"prompt_number": 88
},
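{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of this file is one SNP, and each column holds the counts of the two alleles in one population, separated by a comma. As a quick sanity check, the counts can be turned into allele frequencies. This is a minimal sketch under that reading of the format: `parse_treemix` is a hypothetical helper, not part of _pyRAD_ or _TreeMix_.\n",
"\n",
"```python\n",
"def parse_treemix(lines):\n",
"    '''Return population names and, per SNP, the frequency of the first allele.'''\n",
"    pops = lines[0].split()\n",
"    freqs = []\n",
"    for line in lines[1:]:\n",
"        row = []\n",
"        for field in line.split():\n",
"            a, b = (int(x) for x in field.split(','))\n",
"            # frequency of the first allele; None if a population has no data\n",
"            row.append(float(a) / (a + b) if a + b else None)\n",
"        freqs.append(row)\n",
"    return pops, freqs\n",
"\n",
"# header and first three SNP rows from the output above\n",
"example = ['pop3 pop2 pop1', '8,0 7,1 8,0', '8,0 8,0 2,6', '6,2 8,0 8,0']\n",
"pops, freqs = parse_treemix(example)\n",
"print(pops)\n",
"print(freqs[1])\n",
"```\n",
"\n",
"With four diploid individuals per population, every column sums to eight allele copies when all samples have data, as in the rows shown."
]
},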
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MIGRATE-n FORMAT"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 40 outfiles/c85m4p3.migrate | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"3 2000 ( npops nloci for data set c85m4p3.loci )\n",
"89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 90 89 89 89 89 89 89 89 89 89 89 90 8\n",
"4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4\n",
"ind_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_2 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_3 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_2 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_3 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGAGCATCGGTGCGAAGGATAT\n",
"ind_1 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGGGCATCGGTGCGAAGGATAT\n",
"ind_2 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATAT\n",
"ind_3 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATAT\n",
"ind_0 GAAGAGCTGGGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_1 GAAGAGCTGGGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_2 GAAGAGCTGRGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCSTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_3 GAAGAGCTGAGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_0 TTTGTGTTAACCGCCCTTTGCTTTGATATTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_1 TTTGTGTTAACCGCCCTTTGCTTTGATATTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_2 TTTGTGTTAACCGCCCTTTGCTTTGATWTTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_3 TTTGTGTTAACCGCCCCTTGCTTTGATATTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_0 TCGAATCAAACCGTACTCGCAAGCCTTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_1 TCGAATCAAACCGTACTCGCAAGCCKTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_2 TCGAATCAAACCGTWCTCGCAAGCCTTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_3 TCGAATCAAACCGTACTCGCAAGCCTTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_0 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_1 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_2 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_3 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_0 GTTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGTCGGTAGAAC\n",
"ind_1 GTTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGTCGGTAGAAC\n",
"ind_2 GWTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGTCGGTAGAAC\n",
"ind_3 GTTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGACGGTAGAAC\n",
"ind_0 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACAAT\n",
"ind_1 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACAAT\n",
"ind_2 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACAAT\n",
"ind_3 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACMAT\n",
"ind_0 TAGCTAGAAATTAAGAAGGCTGTAACCCGGCGCGCGCAATGACTATCGCCGATTACAAGGGCAGGTGGTGACACT\n"
]
}
],
"prompt_number": 89
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 58
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
{
"metadata": {
"name": "",
"signature": "sha256:f2244fdfaa27f085e34a76df82ed1758f53a090e50df789ec2e7f7dfb2c421fd"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Example _de novo_ RADseq assembly using _pyRAD_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------- \n",
"\n",
"Please direct questions about _pyRAD_ analyses to the google group thread ([link](https://groups.google.com/forum/#!forum/pyrad-users)) \n",
"\n",
"-------------- \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ This tutorial is meant as a walkthrough for a single-end RADseq analyses. If you have not yet read the [__full tutorial__](http://www.dereneaton.com/software/pyrad), you should start there for a broader description of how _pyRAD_ works. If you are new to RADseq analyses, this tutorial will provide a simple overview of how to execute _pyRAD_, what the data files look like, and how to check that your analysis is working, and the expected output formats. \n",
"\n",
"\n",
"\n",
"+ Each cell in this tutorial begins with the header (%%bash) indicating that the code should be executed in a command line shell, for example by copying and pasting the text into your terminal (but excluding the %%bash header).\n",
"\n",
"------------- \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Begin by executing the command below. This will download an example simulated RADseq data set and unarchive it into your current directory."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"wget -q dereneaton.com/downloads/simRADs.zip\n",
"unzip simRADs.zip"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Archive: simRADs.zip\n",
" inflating: simRADs.barcodes \n",
" inflating: simRADs_R1.fastq.gz \n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------- \n",
"\n",
"#### The two necessary files below should now be located in your current directory.\n",
"\n",
"+ simRADs.fastq.gz : Illumina fastQ formatted reads (gzip compressed)\n",
"+ simRADs.barcodes : barcode map file \n",
"\n",
"----------------- \n",
"\n"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"We begin by creating the params.txt file which is used to set all parameters for an analysis."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## here I'm also creating a symlink from the location of pyrad on my \n",
"## machine so that I can call the program using just 'pyrad'\n",
"## if you wish to do this uncomment the code on the line below\n",
"## and replace the location with its location on your machine\n",
"#ln -s ~/Dropbox/pyrad-github/pyRAD pyrad\n",
"\n",
"## call pyRAD with the (-n) option\n",
"./pyrad -n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\tnew params.txt file created\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------ \n",
"\n",
"The params file lists on each line one parameter followed by a __##__ mark, after which any comments can be left. In the comments section there is a description of the parameter and in parentheses the step of the analysis affected by the parameter. Lines 1-14 are required, the remaining lines are optional. The params.txt file is further described in the general tutorial."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Let's take a look at the default settings. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"2 ## 7. N processors to use in parallel (all)\n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".88 ## 10. Wclust: clustering threshold as a decimal (s3,s6)\n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c88d6m4p3 ## 14. prefix name for final output (no spaces) (s7)\n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
" ## 24.opt.: maxH: hetero. sites in consensus seq (def. 5) (s5)\n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
" ## 30.opt.: add output formats: a,n,s,u (see documentation) (s7)\n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"To change parameters you can edit params.txt in any text editor. Here to automate things I use the script below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"sed -i '/## 7. /c\\8 ## 7. N processors... ' params.txt\n",
"sed -i '/## 10. /c\\.85 ## 10. lowered clust thresh... ' params.txt\n",
"sed -i '/## 14. /c\\c85m4p3 ## 14. outprefix... ' params.txt\n",
"sed -i '/## 24./c\\8 ## 24. maxH raised ... ' params.txt\n",
"sed -i '/## 30./c\\a,s,n,u,k ## 30. more output formats... ' params.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Let's have a look at the changes:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"8 ## 7. N processors... \n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".85 ## 10. lowered clust thresh... \n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c85m4p3 ## 14. outprefix... \n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
"8 ## 24. maxH raised ... \n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
"a,s,n,u,k ## 30. more output formats... \n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n"
]
}
],
"prompt_number": 5
},
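{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same kind of edit can be sketched in Python for systems without GNU sed. This is a minimal sketch, not part of _pyRAD_: the helper name `set_param` is hypothetical, and it assumes every parameter line has the form `value ## N. description`, as in the listing above. Unlike the other cells in this tutorial, it is plain Python rather than a %%bash cell. To apply it to a real file, the lines could be read with `open('params.txt').read().splitlines()` and the result written back joined by newlines."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical helper (not part of pyRAD): set parameter `number` to `value`.\n",
"def set_param(lines, number, value):\n",
"    out = []\n",
"    for line in lines:\n",
"        # split 'value ## N. description' at the first '##' marker\n",
"        head, sep, comment = line.partition('##')\n",
"        comment = comment.strip()\n",
"        # the comment starts with the parameter number, e.g. '10. Wclust: ...'\n",
"        if sep and comment.split('.')[0] == str(number):\n",
"            line = '{0:<26}## {1}'.format(value, comment)\n",
"        out.append(line)\n",
"    return out\n",
"\n",
"# two lines copied (shortened) from the default params.txt shown above\n",
"example = [\n",
"    '2                         ## 7. N processors to use in parallel',\n",
"    '.88                       ## 10. Wclust: clustering threshold as a decimal',\n",
"]\n",
"for line in set_param(example, 10, '.85'):\n",
"    print(line)"
],
"language": "python",
"metadata": {},
"outputs": []
},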
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------- \n",
"\n",
"__Let's take a look at what the raw data look like.__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your input data will be in fastQ format, usually ending in .fq or .fastq. Your data could be split among multiple files, or all within a single file (de-multiplexing goes much faster if they happen to be split into multiple files). The file/s may be compressed with gzip so that they have a .gz ending, but they do not need to be. The location of these files should be entered on line 2 of the params file. Below are the first three reads in the example file."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less simRADs_R1.fastq.gz | head -n 12 | cut -c 1-90"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"@lane1_fakedata0_R1_0 1:N:0:\n",
"TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R1_1 1:N:0:\n",
"TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R1_2 1:N:0:\n",
"TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------ \n",
"\n",
"Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer. And the fourth line the quality scores for the base calls. In this case arbitrarily high since the data were simulated. \n",
"\n",
"These are 100 bp single-end reads prepared as RADseq. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data. "
]
},
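{
"cell_type": "markdown",
"metadata": {},
"source": [
"The read layout described above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: `split_read` is a hypothetical helper (not a _pyRAD_ function), the barcode is assumed to be exactly six bases, the quality offset is assumed to be 33 (the default listed on line 20 of params.txt), and the example is the first read shown above, shortened to 30 bases."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical helper (not part of pyRAD): split one four-line fastQ record.\n",
"def split_read(record, barcode_len=6, overhang='TGCAG', offset=33):\n",
"    name, seq, spacer, qual = record\n",
"    barcode = seq[:barcode_len]\n",
"    # the restriction overhang should directly follow the barcode\n",
"    if seq[barcode_len:barcode_len + len(overhang)] != overhang:\n",
"        raise ValueError('restriction overhang not found after barcode')\n",
"    # convert quality characters to Phred scores\n",
"    scores = [ord(char) - offset for char in qual]\n",
"    # drop the barcode but keep the overhang with the sequence\n",
"    return name.lstrip('@'), barcode, seq[barcode_len:], scores\n",
"\n",
"# first read of the example file, shortened to 30 bases for display\n",
"record = [\n",
"    '@lane1_fakedata0_R1_0 1:N:0:',\n",
"    'TTTTAATGCAGTGAGTGGCCATGCAATATA',\n",
"    '+',\n",
"    'B' * 30,\n",
"]\n",
"name, barcode, seq, scores = split_read(record)\n",
"print(barcode)      # TTTTAA\n",
"print(seq[:5])      # TGCAG\n",
"print(min(scores))  # 33"
],
"language": "python",
"metadata": {},
"outputs": []
},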
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------- \n",
"\n",
"## Step 1: de-multiplexing ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step uses information in the barcodes file to sort data into a separate file for each sample. Below is the barcodes file, with sample names and their barcodes each on a separate line with a tab between them."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat simRADs.barcodes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0\tCATCAT\n",
"1B0\tTTTTAA\n",
"1C0\tAGGGGA\n",
"1D0\tTAAGGT\n",
"2E0\tTTTATA\n",
"2F0\tGAGTAT\n",
"2G0\tATAGAG\n",
"2H0\tATGAGG\n",
"3I0\tGGGTTT\n",
"3J0\tTTAAAA\n",
"3K0\tGGATTG\n",
"3L0\tAAGAAG\n"
]
}
],
"prompt_number": 7
},
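{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, de-multiplexing looks up the first bases of each read in this barcode map. A minimal sketch follows, assuming exact barcode matches only (line 19 of params.txt indicates that _pyRAD_ itself can allow mismatches); `load_barcodes` and `assign_sample` are hypothetical names, not _pyRAD_ functions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical helpers (not part of pyRAD) illustrating barcode sorting.\n",
"def load_barcodes(lines):\n",
"    # each line holds a sample name and its barcode, separated by whitespace\n",
"    mapping = {}\n",
"    for line in lines:\n",
"        fields = line.split()\n",
"        if len(fields) == 2:\n",
"            mapping[fields[1]] = fields[0]\n",
"    return mapping\n",
"\n",
"def assign_sample(seq, mapping, barcode_len=6):\n",
"    # exact match only; returns None when no barcode matches\n",
"    return mapping.get(seq[:barcode_len])\n",
"\n",
"# two entries from simRADs.barcodes (the real file is tab-separated)\n",
"barcodes = load_barcodes(['1A0 CATCAT', '1B0 TTTTAA'])\n",
"print(assign_sample('TTTTAATGCAGTGAGTGG', barcodes))  # 1B0\n",
"print(assign_sample('GGGGGGTGCAGTGAGTGG', barcodes))  # None"
],
"language": "python",
"metadata": {},
"outputs": []
},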
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 1 writes the de-multiplexed data to a new file for each sample in a new directory created within the working directory called fastq/."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 1: sorting reads by barcode\n",
" ."
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that this created a new file for each sample in the directory 'fastq/'"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls fastq/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0_R1.fq.gz\n",
"1B0_R1.fq.gz\n",
"1C0_R1.fq.gz\n",
"1D0_R1.fq.gz\n",
"2E0_R1.fq.gz\n",
"2F0_R1.fq.gz\n",
"2G0_R1.fq.gz\n",
"2H0_R1.fq.gz\n",
"3I0_R1.fq.gz\n",
"3J0_R1.fq.gz\n",
"3K0_R1.fq.gz\n",
"3L0_R1.fq.gz\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The statistics for step 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A new directory called stats will also have been created. Each step of the _pyRAD_ analysis will create a new stats output file in this directory. The stats output for step 1 is below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/s1.sorting.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"file \tNreads\tcut_found\tbar_matched\n",
"simRADs_R1.fastq.gz\t480000\t480000\t480000\n",
"\n",
"\n",
"sample\ttrue_bar\tobs_bars\tN_obs\n",
"3L0 \tAAGAAG \tAAGAAG\t40000 \n",
"1C0 \tAGGGGA \tAGGGGA\t40000 \n",
"2G0 \tATAGAG \tATAGAG\t40000 \n",
"2H0 \tATGAGG \tATGAGG\t40000 \n",
"1A0 \tCATCAT \tCATCAT\t40000 \n",
"2F0 \tGAGTAT \tGAGTAT\t40000 \n",
"3K0 \tGGATTG \tGGATTG\t40000 \n",
"3I0 \tGGGTTT \tGGGTTT\t40000 \n",
"1D0 \tTAAGGT \tTAAGGT\t40000 \n",
"3J0 \tTTAAAA \tTTAAAA\t40000 \n",
"2E0 \tTTTATA \tTTTATA\t40000 \n",
"1B0 \tTTTTAA \tTTTTAA\t40000 \n",
"\n",
"nomatch \t_ \t0\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Step 2: quality filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step filters reads based on quality scores, and can be used to detect Illumina adapters in your reads, which is sometimes a problem with homebrew type library preparations. Here the filter is set to the default value of 0, meaning it filters only based on quality scores of base calls. The filtered files are written to a new directory called edits/."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\tstep 2: editing raw reads \n",
"\t............"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls edits/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0.edit\n",
"1B0.edit\n",
"1C0.edit\n",
"1D0.edit\n",
"2E0.edit\n",
"2F0.edit\n",
"2G0.edit\n",
"2H0.edit\n",
"3I0.edit\n",
"3J0.edit\n",
"3K0.edit\n",
"3L0.edit\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filtered data are written in fasta format (quality scores removed) into a new directory called edits/. Below I show a preview of the file which you can view most easily using the `less` command."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"head -n 10 edits/1A0.edit | cut -c 1-80"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_0_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_1_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_2_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_3_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n",
">1A0_4_r1\n",
"TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Step 3: clustering within-samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 3 de-replicates and then clusters reads within each sample by the set clustering threshold and writes the clusters to new files in a directory called clust.xx"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 3"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tde-replicating files for clustering...\n",
"\n",
"\tstep 3: within-sample clustering of 12 samples at \n",
"\t '.85' similarity using up to 8 processors\n",
"\t2E0.edit finished, 2000loci\n",
"\t1A0.edit finished, 2000loci\n",
"\t1C0.edit finished, 2000loci\n",
"\t3I0.edit finished, 2000loci\n",
"\t2H0.edit finished, 2000loci\n",
"\t1B0.edit finished, 2000loci\n",
"\t3L0.edit finished, 2000loci\n",
"\t3J0.edit finished, 2000loci\n",
"\t2F0.edit finished, 2000loci\n",
"\t1D0.edit finished, 2000loci\n",
"\t2G0.edit finished, 2000loci\n",
"\t3K0.edit finished, 2000loci\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, I recommend you use the unix command 'less' to look at the clustS files. These contain each cluster separated by \"//\". For the first few clusters below you can see that there is one or two alleles in the cluster and one or a few reads that contained a (simulated) sequencing error. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_2540_r1;size=17;\n",
"TGCAGTGTAACGTTGTATCCATCGAGTCGATCATAGCCTAAAATAAGTAACACTAATCAGGCGCGCTGGTTGGGGGATCA\n",
">1A0_2549_r1;size=1;+\n",
"TGCAGTGTAACGTTGTATCCATCGAGTCGATCATAGCCTAAAATAAGTAACGCTAATCAGGCGCGCTGGTTGGGGGATCA\n",
">1A0_2541_r1;size=1;+\n",
"TGCAGTGTAACGTTGTATCCAACGAGTCGATCATAGCCTAAAATAAGTAACACTAATCAGGCGCGCTGGTTGGGGGATCA\n",
">1A0_2551_r1;size=1;+\n",
"TGCAGTGTAACGTTGTATCCATCGAGTCGATCATAGCCTAAAATAAGTAACACTAATCAGGCGCGTTGGTTGGGGGATCA\n",
"//\n",
"//\n",
">1A0_2140_r1;size=19;\n",
"TGCAGCTCCGTCACTGCTCAGCGAACCTACTATCTAGTCGGAAAAGGTTCCGGCCCTTATGCTAAGTGCAAGCTGCCAGT\n",
">1A0_2155_r1;size=1;+\n",
"TGCAGCTCCCTCACTGCTCAGCGAACCTACTATCTAGTCGGAAAAGGTTCCGGCCCTTATGCTAAGTGCAAGCTGCCAGT\n",
"//\n",
"//\n",
">1A0_8280_r1;size=10;\n",
"TGCAGCGTATATGATCAGAACCGGGTGAGTGGGTACCGCGAACCGAAAGGCATCGAAAGTTTAGCGCAGCACTAATCTCA\n",
">1A0_8290_r1;size=8;+\n",
"TGCAGCGTATATGATCAGAACCGGGTGAGTGGGTACCGCGAACCGAAAGGCACCGAAAGTTTAGCGCAGCACTAATCTCA\n",
">1A0_8297_r1;size=1;+\n",
"TGCAGCGTATATGATCAGAACCGGGTGAGTGGGAACCGCGAACCGAAAGGCACCGAAAGTTTAGCGCAGCACTAATCTCA\n",
">1A0_8292_r1;size=1;+\n",
"TGCAGCCTATATGATCAGAACCGGGTGAGTGGGTACCGCGAACCGAAAGGCACCGAAAGTTTAGCGCAGCACTAATCTCA\n",
"//\n",
"//\n"
]
}
],
"prompt_number": 15
},
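{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each header carries a `size=` value giving the number of identical reads that were collapsed during de-replication, so the depth of a cluster appears to be the sum of these values. A minimal sketch, assuming the header layout shown above; `cluster_depths` is a hypothetical helper (not a _pyRAD_ function) and the sequences in the example are shortened placeholders."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical helper (not part of pyRAD): total read depth of each cluster.\n",
"def cluster_depths(lines):\n",
"    depths = []\n",
"    depth = 0\n",
"    for line in lines:\n",
"        line = line.strip()\n",
"        if line.startswith('>'):\n",
"            # header looks like '>1A0_2540_r1;size=17;' (sometimes followed by '+')\n",
"            sizes = [f for f in line.split(';') if f.startswith('size=')]\n",
"            depth += int(sizes[0][len('size='):])\n",
"        elif line == '//' and depth:\n",
"            # clusters end with '//'; the repeated '//' line is skipped\n",
"            depths.append(depth)\n",
"            depth = 0\n",
"    return depths\n",
"\n",
"# headers of the three clusters previewed above, with placeholder sequences\n",
"preview = [\n",
"    '>1A0_2540_r1;size=17;', 'TGCAG', '>1A0_2549_r1;size=1;+', 'TGCAG',\n",
"    '>1A0_2541_r1;size=1;+', 'TGCAG', '>1A0_2551_r1;size=1;+', 'TGCAG',\n",
"    '//', '//',\n",
"    '>1A0_2140_r1;size=19;', 'TGCAG', '>1A0_2155_r1;size=1;+', 'TGCAG',\n",
"    '//', '//',\n",
"    '>1A0_8280_r1;size=10;', 'TGCAG', '>1A0_8290_r1;size=8;+', 'TGCAG',\n",
"    '>1A0_8297_r1;size=1;+', 'TGCAG', '>1A0_8292_r1;size=1;+', 'TGCAG',\n",
"    '//', '//',\n",
"]\n",
"print(cluster_depths(preview))  # [20, 20, 20]"
],
"language": "python",
"metadata": {},
"outputs": []
},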
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------\n",
"\n",
"\n",
"The stats output tells you how many clusters were found, and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide if you wish to increase or decrease the mindepth before it is applied for making consensus base calls in steps 4 & 5."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/s3.clusters.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"taxa\ttotal\tdpt.me\tdpt.sd\td>5.tot\td>5.me\td>5.sd\tbadpairs\n",
"1A0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"1B0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"1C0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"1D0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2E0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2F0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2G0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"2H0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3I0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3J0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3K0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"3L0\t2000\t20.0\t0.0\t2000\t20.0\t0.0\t0\n",
"\n",
" ## total = total number of clusters, including singletons\n",
" ## dpt.me = mean depth of clusters\n",
" ## dpt.sd = standard deviation of cluster depth\n",
" ## >N.tot = number of clusters with depth greater than N\n",
" ## >N.me = mean depth of clusters with depth greater than N\n",
" ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N\n",
" ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)\n",
"\n",
"HISTOGRAMS\n",
"\n",
" \n",
"sample: 1A0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 1B0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 1C0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 1D0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2E0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2F0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2G0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 2H0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3I0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3J0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3K0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n",
"sample: 3L0\n",
"bins\tdepth_histogram\tcnts\n",
" :\t0------------50-------------100%\n",
"0 \t 0\n",
"5 \t 0\n",
"10 \t 0\n",
"15 \t 0\n",
"20 \t******************************* 2000\n",
"25 \t 0\n",
"30 \t 0\n",
"35 \t 0\n",
"40 \t 0\n",
"50 \t 0\n",
"100 \t 0\n",
"250 \t 0\n",
"500 \t 0\n",
"\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Steps 4 & 5: Call consensus sequences"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Step 4 jointly infers the error-rate and heterozygosity across samples."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 4"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 4: estimating error rate and heterozygosity\n",
"\t............"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less stats/Pi_E_estimate.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"taxa\tH\tE\n",
"3K0\t0.00135982\t0.00048078\t\n",
"1C0\t0.00134858\t0.00048372\t\n",
"1D0\t0.00135375\t0.00048822\t\n",
"3I0\t0.00129751\t0.00048694\t\n",
"2H0\t0.00133223\t0.00049211\t\n",
"2F0\t0.00135365\t0.0004995\t\n",
"1A0\t0.00136043\t0.00051028\t\n",
"2E0\t0.00126915\t0.00051556\t\n",
"1B0\t0.00149924\t0.00049663\t\n",
"3J0\t0.00144422\t0.0005089\t\n",
"2G0\t0.00138185\t0.00051206\t\n",
"3L0\t0.00143349\t0.00051991\t\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 5: creating consensus seqs for 12 samples, using H=0.00137 E=0.00050\n",
"\t............"
]
}
],
"prompt_number": 19
},
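{
"cell_type": "markdown",
"metadata": {},
"source": [
"The H and E values printed by step 5 agree with the simple means of the per-sample estimates from step 4. The sketch below checks this with the numbers shown above. That _pyRAD_ computes them as plain arithmetic means is an assumption based only on these numbers."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# per-sample estimates copied from stats/Pi_E_estimate.txt above (taxa, H, E)\n",
"table = [\n",
"    '3K0 0.00135982 0.00048078',\n",
"    '1C0 0.00134858 0.00048372',\n",
"    '1D0 0.00135375 0.00048822',\n",
"    '3I0 0.00129751 0.00048694',\n",
"    '2H0 0.00133223 0.00049211',\n",
"    '2F0 0.00135365 0.0004995',\n",
"    '1A0 0.00136043 0.00051028',\n",
"    '2E0 0.00126915 0.00051556',\n",
"    '1B0 0.00149924 0.00049663',\n",
"    '3J0 0.00144422 0.0005089',\n",
"    '2G0 0.00138185 0.00051206',\n",
"    '3L0 0.00143349 0.00051991',\n",
"]\n",
"rows = [line.split() for line in table]\n",
"# arithmetic means across the 12 samples\n",
"mean_h = sum(float(row[1]) for row in rows) / len(rows)\n",
"mean_e = sum(float(row[2]) for row in rows) / len(rows)\n",
"print('H={0:.5f} E={1:.5f}'.format(mean_h, mean_e))  # H=0.00137 E=0.00050"
],
"language": "python",
"metadata": {},
"outputs": []
},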
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The stats output for step 5"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less stats/s5.consens.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"taxon \tnloci\tf1loci\tf2loci\tnsites\tnpoly\tpoly\n",
"3L0.clustS.gz \t2000\t2000\t2000\t178003\t255\t0.0014326\n",
"2F0.clustS.gz \t2000\t2000\t2000\t178002\t241\t0.0013539\n",
"1B0.clustS.gz \t2000\t2000\t2000\t178005\t267\t0.0015\n",
"3J0.clustS.gz \t2000\t2000\t2000\t178003\t257\t0.0014438\n",
"1A0.clustS.gz \t2000\t2000\t2000\t178002\t242\t0.0013595\n",
"2H0.clustS.gz \t2000\t2000\t2000\t178001\t237\t0.0013315\n",
"2E0.clustS.gz \t2000\t2000\t2000\t178002\t226\t0.0012696\n",
"2G0.clustS.gz \t2000\t2000\t2000\t178001\t246\t0.001382\n",
"1D0.clustS.gz \t2000\t2000\t2000\t178002\t241\t0.0013539\n",
"3I0.clustS.gz \t2000\t2000\t2000\t178003\t231\t0.0012977\n",
"1C0.clustS.gz \t2000\t2000\t2000\t178002\t240\t0.0013483\n",
"3K0.clustS.gz \t2000\t2000\t2000\t178001\t242\t0.0013595\n",
"\n",
" ## nloci = number of loci\n",
" ## f1loci = number of loci with >N depth coverage\n",
" ## f2loci = number of loci with >N depth and passed paralog filter\n",
" ## nsites = number of sites across f loci\n",
" ## npoly = number of polymorphic sites in nsites\n",
" ## poly = frequency of polymorphic sites\n"
]
}
],
"prompt_number": 20
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Step 6: Cluster across samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 6 clusters consensus sequences across samples. This step can take a long time for very large data sets (>100 individuals). I suggest trying it first. It will print its progress and if it looks to be taking way too long then you can implement the hierarchical clustering method instead, described in detail in a separate tutorial)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 6 "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"usearch v7.0.1090_i86linux32, 4.0Gb RAM (65.8Gb total), 40 cores\n",
"(C) Copyright 2013 Robert C. Edgar, all rights reserved.\n",
"http://drive5.com/usearch\n",
"\n",
"Licensed to: daeaton.chicago@gmail.com\n",
"\n",
"\n",
"\tfinished clustering\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 6: clustering across 12 samples at '.85' similarity \n",
"\n",
"00:00 21Mb 0.1% 0 clusters, max size 0, avg 0.0\r",
"00:01 26Mb 1.0% 192 clusters, max size 2, avg 1.1\r",
"00:02 28Mb 3.2% 624 clusters, max size 4, avg 1.2\r",
"00:03 28Mb 4.9% 881 clusters, max size 4, avg 1.3\r",
"00:04 29Mb 6.5% 1085 clusters, max size 5, avg 1.4\r",
"00:05 29Mb 8.2% 1266 clusters, max size 5, avg 1.5\r",
"00:06 29Mb 10.1% 1435 clusters, max size 5, avg 1.7\r",
"00:07 29Mb 12.5% 1591 clusters, max size 7, avg 1.9\r",
"00:08 29Mb 15.5% 1733 clusters, max size 7, avg 2.1\r",
"00:09 30Mb 19.9% 1865 clusters, max size 8, avg 2.5\r",
"00:10 30Mb 29.4% 1972 clusters, max size 10, avg 3.6\r",
"00:11 30Mb 75.2% 2000 clusters, max size 12, avg 9.0\r",
"00:11 30Mb 100.0% 2000 clusters, max size 12, avg 12.0\r\n",
" \n",
" Seqs 24000 (24.0k)\n",
" Clusters 2000\n",
" Max size 12\n",
" Avg size 12.0\n",
" Min size 12\n",
"Singletons 0, 0.0% of seqs, 0.0% of clusters\n",
" Max mem 30Mb\n",
" Time 11.0s\n",
"Throughput 2181.8 seqs/sec.\n",
"\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Step 7: Assemble final data sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step outputs data only for the loci that you want included in your data set. It filters once again for potential paralogs or highly repetitive regions, and includes options to minimize the amount of missing data in the output."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"./pyrad -p params.txt -s 7"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\tingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0\n",
"\taddon \n",
"\texclude \n",
"\t\n",
"\tfinal stats written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/stats/c85m4p3.stats\n",
"\toutput files written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/outfiles/ directory\n",
"\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"........"
]
}
],
"prompt_number": 51
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Final stats output"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"less stats/c85m4p3.stats"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"2000 ## loci with > minsp containing data\n",
"2000 ## loci with > minsp containing data & paralogs removed\n",
"2000 ## loci with > minsp containing data & paralogs removed & final filtering\n",
"\n",
"## number of loci recovered in final data set for each taxon.\n",
"taxon\tnloci\n",
"1A0\t2000\n",
"1B0\t2000\n",
"1C0\t2000\n",
"1D0\t2000\n",
"2E0\t2000\n",
"2F0\t2000\n",
"2G0\t2000\n",
"2H0\t2000\n",
"3I0\t2000\n",
"3J0\t2000\n",
"3K0\t2000\n",
"3L0\t2000\n",
"\n",
"\n",
"## nloci = number of loci with data for exactly ntaxa\n",
"## ntotal = number of loci for which at least ntaxa have data\n",
"ntaxa\tnloci\tsaved\tntotal\n",
"1\t-\n",
"2\t-\t\t-\n",
"3\t-\t\t-\n",
"4\t0\t*\t2000\n",
"5\t0\t*\t2000\n",
"6\t0\t*\t2000\n",
"7\t0\t*\t2000\n",
"8\t0\t*\t2000\n",
"9\t0\t*\t2000\n",
"10\t0\t*\t2000\n",
"11\t0\t*\t2000\n",
"12\t2000\t*\t2000\n",
"\n",
"\n",
"## var = number of loci containing n variable sites.\n",
"## pis = number of loci containing n parsimony informative var sites.\n",
"n\tvar\tPIS\n",
"0\t145\t551\n",
"1\t1083\t699\n",
"2\t945\t475\n",
"3\t637\t187\n",
"4\t367\t69\n",
"5\t182\t13\n",
"6\t59\t3\n",
"7\t18\t2\n",
"8\t12\t1\n",
"9\t1\t0\n",
"total var= 7847\n",
"total pis= 2591\n"
]
}
],
"prompt_number": 52
},
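{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two totals at the bottom of the stats file follow directly from the last table: each is the sum over rows of n multiplied by the number of loci with n such sites. The Python cell below recomputes them from the values printed above. This is just an illustrative check (the helper `total_sites` is a made-up name, not a _pyRAD_ function)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# loci counts copied from the var and PIS columns above, for n = 0..9\n",
"var_counts = [145, 1083, 945, 637, 367, 182, 59, 18, 12, 1]\n",
"pis_counts = [551, 699, 475, 187, 69, 13, 3, 2, 1, 0]\n",
"\n",
"def total_sites(counts):\n",
"    '''Sum n * (number of loci with n sites) over all rows.'''\n",
"    return sum(n * nloci for n, nloci in enumerate(counts))\n",
"\n",
"print(total_sites(var_counts))  # compare with 'total var' above\n",
"print(total_sites(pis_counts))  # compare with 'total pis' above"
],
"language": "python",
"metadata": {},
"outputs": []
},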
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--------------- \n",
"\n",
"## Output formats ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our analysis created 8 output files: the standard four (.loci, .phy, .excluded_loci, and .unlinked_snps), plus the four additional formats we requested in the params file (.snps, .alleles, .str, and .nex). These are all fully explained in the general tutorial."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"ls outfiles"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"c85m4p3.alleles\n",
"c85m4p3.excluded_loci\n",
"c85m4p3.loci\n",
"c85m4p3.nex\n",
"c85m4p3.phy\n",
"c85m4p3.snps\n",
"c85m4p3.str\n",
"c85m4p3.unlinked_snps\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loci format \n",
"The \".loci\" file lists each locus in a fasta-like format, with a line below each locus marking which sites are variable. Autapomorphies are marked with '-' and shared SNPs with '*'. This custom format is human readable and is also used as input for D-statistic tests in _pyRAD_. It is the easiest way to visualize your results. I recommend viewing the file with the command `less`. Below I use `head` and `cut` to make it easy to view in this window."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 39 outfiles/c85m4p3.loci | cut -c 1-75"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">1B0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">1C0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">1D0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAMTGTTGGCGAGTCTCATCGCGAGGCA\n",
">2E0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">2F0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">2G0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">2H0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTMAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3I0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3J0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3K0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
">3L0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCA\n",
"// - - - \n",
">1A0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">1B0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">1C0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">1D0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTMGTCA\n",
">2E0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">2F0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">2G0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">2H0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3I0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3J0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3K0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
">3L0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCA\n",
"// * - \n",
">1A0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">1B0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">1C0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">1D0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2E0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2F0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2G0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">2H0 TCCGATAGCCAGGTCTCGAGGTCGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">3I0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGAGCATCGGTGCG\n",
">3J0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGGGCATCGGTGCG\n",
">3K0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
">3L0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCG\n",
"// * * * - \n"
]
}
],
"prompt_number": 36
},
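{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because every locus ends with a line beginning with `//`, the variable-site markers are easy to tally with a few lines of Python. The cell below is a minimal sketch, not part of _pyRAD_: the function name `count_snp_markers` and the four-sample locus are made up for illustration, and it assumes the `//` lines contain only the '-' and '*' markers shown above. To run it on the real file, pass an open file handle for `outfiles/c85m4p3.loci` instead of `sample`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def count_snp_markers(lines):\n",
"    '''Return one (n_autapomorphies, n_shared) tuple per locus.'''\n",
"    counts = []\n",
"    for line in lines:\n",
"        # a line starting with '//' closes a locus and flags its variable sites\n",
"        if line.startswith('//'):\n",
"            marks = line[2:]\n",
"            counts.append((marks.count('-'), marks.count('*')))\n",
"    return counts\n",
"\n",
"# tiny made-up locus laid out like the .loci output above\n",
"sample = [\n",
"    '>1A0    AGACT',\n",
"    '>1B0    AGACA',\n",
"    '>1C0    AGTCT',\n",
"    '>1D0    AGTCT',\n",
"    '//        * -',\n",
"]\n",
"print(count_snp_markers(sample))"
],
"language": "python",
"metadata": {},
"outputs": []
},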
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PHY format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.phy | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"12 178083\n",
"1A0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"1B0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"1C0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTMCGAG\n",
"1D0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAMTGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2E0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2F0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2G0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"2H0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTMAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3I0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3J0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3K0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n",
"3L0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCGAG\n"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### NEX format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.nex | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"#NEXUS\n",
"BEGIN DATA;\n",
" DIMENSIONS NTAX=12 NCHAR=178083;\n",
" FORMAT DATATYPE=DNA MISSING=N GAP=- INTERLEAVE=YES;\n",
" MATRIX\n",
" 1B0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2G0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2F0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 1A0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2H0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTMAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 2E0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 3I0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 1C0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTMCG\n",
" 3L0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 1D0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAMTGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 3J0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
" 3K0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
"\n",
" 1B0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACATCAAGGGTACC\n",
" 2G0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 2F0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 1A0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACATCAAGGGTACC\n",
" 2H0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 2E0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACARCAAGGGTACC\n",
" 3I0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 1C0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACATCAAGGGTACC\n",
" 3L0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 1D0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTMGTCAATGTTCCACATCAAGGGTACC\n",
" 3J0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
" 3K0 CCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACAACAAGGGTACC\n",
"\n",
" 1B0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2G0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2F0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 1A0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2H0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 2E0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGWAGAGCTGA\n",
" 3I0 CGACTACCGGCGTGATGTCGGGTTCACCCCCCGAGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGG\n",
" 1C0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 3L0 CGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 1D0 CGACTTCCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGA\n",
" 3J0 CGACTACCGGCGTGATGTCGGGTTCACCCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGG\n",
" 3K0 CGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATATGGATACGCCGAGAGGAAGAGCTGR\n",
"\n",
" 1B0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2G0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2F0 CCTGCGYGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 1A0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2H0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n",
" 2E0 CCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTTTGGAGATCAGCGCCTTTGTGTTAACCGCCCTTTGC\n"
]
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alleles format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.alleles| cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1A0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1B0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1B0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1C0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1C0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTACG\n",
">1D0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCACTGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">1D0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2E0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2E0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2F0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2F0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2G0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2G0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2H0_0 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">2H0_1 AGAAGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTAAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3I0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3I0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3J0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3J0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3K0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3K0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3L0_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
">3L0_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTCCG\n",
"// - - - - \n",
">1A0_0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1A0_1 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1B0_0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1B0_1 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1C0_0 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1C0_1 CGGTAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">1D0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTCGTCAATGTTCCACA\n",
">1D0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2E0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2E0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2F0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2F0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2G0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2G0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2H0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">2H0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3I0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3I0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3J0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3J0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3K0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3K0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3L0_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
">3L0_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCACA\n",
"// * - \n"
]
}
],
"prompt_number": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### STRUCTURE (.str) format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.str | cut -c 1-20"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1A0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1B0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1B0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1C0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1C0 \t\t\t\t\t\t\t3\t1\t2\t3\n",
"1D0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"1D0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2E0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2E0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2F0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2F0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2G0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2G0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2H0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"2H0 \t\t\t\t\t\t\t0\t0\t2\t3\n",
"3I0 \t\t\t\t\t\t\t3\t0\t0\t3\n",
"3I0 \t\t\t\t\t\t\t3\t0\t0\t3\n",
"3J0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3J0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3K0 \t\t\t\t\t\t\t3\t0\t2\t2\n",
"3K0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3L0 \t\t\t\t\t\t\t3\t0\t2\t3\n",
"3L0 \t\t\t\t\t\t\t3\t0\t2\t3\n"
]
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SNPs format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.snps | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"## 12 taxa, 2000 loci, 7888 snps\n",
"1A0 CCACT TAT TTAG AATC TAYC ACGTACT AAGG TCCTTG AAA GGCTT CTGTA GCAACT G A GATGC T\n",
"1B0 CCACT TAT TTAG AATC TATC ACGTACT AAGG TCCTTG AAA GGCTT CTGTA GCAACT G A GATGC T\n",
"1C0 CCAMT TAT TTAG AATC TATC ACGTTCT AAGG TCCTTG AAA GGCTT YTATA AAACCT G A GATGC T\n",
"1D0 CCMCT AMT TTAG AATC TATT ACGTTCA AAAG TCCTTG AAA GGCTT CTGTT GCACCT C A GCTGC T\n",
"2E0 CCACT AAR TTAG WATC TATC ACKTTAT AAGG TCCTKG TAG TGCTT CAGTA GCTCCT C A GATGY T\n",
"2F0 CCACT AAA TTAG AAYC TATC ACGTTAT AAGG TCCTGG TAG GGCAT CAGTA GCACCT C A GATGC T\n",
"2G0 CCACT AAA TTAG AATC TATC ACGTTAT AWGG TTCTTA TAG TGCTT CAGTA GCACCT C A GATGC T\n",
"2H0 AMACT AAA TTAG AATC TATC ACGTTCT AAGR TCTTTG TAA GGCTK CAGCA GCACCY C T GATGC T\n",
"3I0 CCACG AAA AACA AGTC TATC AGGTTCT TAGG TCCTTG AAA GGCTT CTGTA GCACCT C A AAKAC T\n",
"3J0 CCACG AAA AACG AGTC TATC AGGKTCT TAGG TCCTTG AAA GGCTT CTGTA GCACCT C A AATAC T\n",
"3K0 CCACT AAA AAAG ARTS TWTC WGGTTCT TAGG WCCTTG AAA GGCTT CTGTA GCACCT C A GATAC T\n",
"3L0 CCACG AAA AAAG AATC CATC AGGTTCT TAGG TCCATG AMA GAMTT CTGTA GCACTT C A GATAC C\n"
]
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### UNLINKED_SNPs format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 50 outfiles/c85m4p3.unlinked_snps | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"12 1959\n",
"1A0 CTGCYAGTATGTGAGARKGACARGAGGTGACTCGCTSACAGAACTCTTTCGCGCGGCCTACGGCAGTAGACAAATTTCA\n",
"1B0 CTGCTAGTATGTGAGAGTGACAGGAGGTGACTCGCTGACTGAACTCTTCCGCGCGGCCTACGACAGTAGACAAATTTCA\n",
"1C0 CTGCTTGTATATGAGAGTGACAGGAGGTGAATCGCTGACAGAACTCTTTCTCGCGGCCTACGACAGTCGACAAATTTCA\n",
"1D0 CAGCTTGTATGTCAGTGTGASAGGAGGTGACTCGCTGACAGAACTCTATCGCGCGTCCTACGCTAGTAGACAAATTTCA\n",
"2E0 CAGCTTGTGTGTCAGTGTGACAGGAACTGACTCGCAGACAGAACTCTATCGCGCGGCCTAMGACAGTAGACGAATCTTM\n",
"2F0 CAGCTTGTGTGTCAGTGTGACAGGAACTGACTCGCAGACAGAACTCTATCGMGCGGCCTACGACAGTAGACGAATCTTA\n",
"2G0 CAGCTTGTGTGTCAGTGTAACAGCAACTGACTCGCTGACAGAACTCTATCGCGCGGCCTACRACATYAGACAACTYTTA\n",
"2H0 MAGCTTRTAKGYCTGTGTGACAGGAGCTGACTCGCTGACAGAACTCTATTGCGCGGCCWACGACAGTAGAAAAATTGCA\n",
"3I0 CAACTTGTATGTCAATGTGCCTGGRGGGGCCACGCTGRGAKCACTCAATCGCGGGGCCTTCGACAGTACAAAGACTTCA\n",
"3J0 CAGCTTGTATGTCAATGTGCCAGGAGGTGCCTCGCTGAGAKAWTTTAATCGCAGGGCCTTCGACAGTACWAAGACTTCA\n",
"3K0 CAGSTTGWATGTCAATGTGCCAGGAGGTCCCTCGCTGAGATAACTCAATCGCGGGGCCTTCGACAGTACAAAAATTTCA\n",
"3L0 CAGCTTGTATGTCAATGTGACAGGAGGTGCCTASSTGAGAGAACKCAATCGCGGRGATTTCGACGGTACAAAAATTTCA\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## OTHER FORMATS \n",
"\n",
"You may also produce some more complicated output formats that involve pooling individuals into groups or populations. This can be done for the \"treemix\" and \"migrate\" outputs, which are formatted for input into the programs _TreeMix_ and _migrate-n_, respectively. Individuals are grouped into populations using the final lines of the params file, as shown below. This is similar to the assignment of individuals into clades for hierarchical clustering (see full tutorial). \n",
"\n",
"Each line designates a group, and has three arguments that are separated __by a single space__. The first is the group name, the second is the minimum number of individuals that must have data in that group for a locus to be included in the output, and the third is a list of the members of that group. Example below:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"## append group designations to the params file\n",
"echo \"pop1 4 1A0,1B0,1C0,1D0 ##\" >> params.txt\n",
"echo \"pop2 4 2E0,2F0,2G0,2H0 ##\" >> params.txt\n",
"echo \"pop3 4 3I0,3J0,3K0,3L0 ##\" >> params.txt\n",
"\n",
"## view params file\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"8 ## 7. N processors... \n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".85 ## 10. lowered clust thresh... \n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c85m4p3 ## 14. outprefix... \n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
"8 ## 24. maxH raised ... \n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
"m,t ## 30. more output formats... \n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n",
"pop1 4 1A0,1B0,1C0,1D0 ##\n",
"pop2 4 2E0,2F0,2G0,2H0 ##\n",
"pop3 4 3I0,3J0,3K0,3L0 ##\n"
]
}
],
"prompt_number": 66
},
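{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the three fields of a group line concrete, the Python cell below splits one of the lines appended above into its name, minimum count, and member list. It is only a sketch for illustration and says nothing about how _pyRAD_ itself reads these lines: the function name `parse_group_line` is made up, and the check that the minimum does not exceed the group size is an assumption of this example."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def parse_group_line(line):\n",
"    '''Split one group line into (name, min_individuals, members).'''\n",
"    # drop the trailing '##' marker, then split on whitespace\n",
"    fields = line.split('##')[0].split()\n",
"    if len(fields) != 3:\n",
"        raise ValueError('expected 3 fields, got %d' % len(fields))\n",
"    name, min_inds, members = fields[0], int(fields[1]), fields[2].split(',')\n",
"    # sanity check added for this sketch only\n",
"    if min_inds > len(members):\n",
"        raise ValueError('minimum exceeds group size for ' + name)\n",
"    return name, min_inds, members\n",
"\n",
"print(parse_group_line('pop1 4 1A0,1B0,1C0,1D0 ##'))"
],
"language": "python",
"metadata": {},
"outputs": []
},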
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating population output files \n",
"Now if we run step 7 of _pyRAD_ with the 'm' (migrate) and/or 't' (treemix) output options set on line 30 of the params file, it will create the corresponding output files. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"\n",
"## add m and t to output options\n",
"sed -i '/## 30./c\\m,t ## 30. more output formats... ' params.txt\n",
"\n",
"## assemble data set\n",
"./pyrad -p params.txt -s 7"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\tgroups for 't' or 'm' outputs: ['pop1', 'pop2', 'pop3']\n",
"\tingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0\n",
"\taddon \n",
"\texclude \n",
"\t\n",
"\tfinal stats written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/stats/c85m4p3.stats\n",
"\toutput files written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/outfiles/ directory\n",
"\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tCluster input file: using \n",
"\t/home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_RAD/clust.85/cat.clust_.gz\n",
"\n",
"........"
]
}
],
"prompt_number": 87
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TREEMIX format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"gunzip -c outfiles/c85m4p3.treemix.gz | head -n 30"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"pop3 pop2 pop1\n",
"8,0 7,1 8,0\n",
"8,0 8,0 2,6\n",
"6,2 8,0 8,0\n",
"7,1 8,0 8,0\n",
"8,0 8,0 7,1\n",
"8,0 8,0 4,4\n",
"8,0 7,1 8,0\n",
"7,1 8,0 8,0\n",
"8,0 2,6 8,0\n",
"8,0 7,1 8,0\n",
"8,0 8,0 6,2\n",
"8,0 7,1 8,0\n",
"8,0 8,0 2,6\n",
"8,0 6,2 8,0\n",
"0,8 8,0 8,0\n",
"8,0 8,0 2,6\n",
"8,0 8,0 7,1\n",
"8,0 8,0 7,1\n",
"8,0 6,2 8,0\n",
"2,6 8,0 8,0\n",
"8,0 8,0 7,1\n",
"6,2 8,0 8,0\n",
"8,0 8,0 7,1\n",
"8,0 6,2 8,0\n",
"7,1 8,0 8,0\n",
"8,0 2,6 8,0\n",
"8,0 0,8 8,0\n",
"6,2 8,0 8,0\n",
"6,2 8,0 8,0\n"
]
}
],
"prompt_number": 88
},
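{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this file the first line names the populations, and each following line is one SNP with the counts of its two alleles in each population, separated by a comma. The Python cell below is a rough sketch of how those counts could be turned into allele frequencies. The function name `allele_frequencies` is made up for this example, and the sample lines are the first rows of the output above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def allele_frequencies(lines):\n",
"    '''Yield, per SNP, a dict mapping population to first-allele frequency.'''\n",
"    rows = iter(lines)\n",
"    pops = next(rows, '').split()\n",
"    for row in rows:\n",
"        freqs = {}\n",
"        for pop, pair in zip(pops, row.split()):\n",
"            first, second = [int(n) for n in pair.split(',')]\n",
"            total = first + second\n",
"            # None marks a population with no allele counts at this SNP\n",
"            freqs[pop] = float(first) / total if total else None\n",
"        yield freqs\n",
"\n",
"# header and first two SNP rows copied from the treemix output above\n",
"sample = ['pop3 pop2 pop1', '8,0 7,1 8,0', '8,0 8,0 2,6']\n",
"for freqs in allele_frequencies(sample):\n",
"    print(freqs)"
],
"language": "python",
"metadata": {},
"outputs": []
},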
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MIGRATE-n format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"head -n 40 outfiles/c85m4p3.migrate | cut -c 1-85"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"3 2000 ( npops nloci for data set c85m4p3.loci )\n",
"89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 90 89 89 89 89 89 89 89 89 89 89 90 8\n",
"4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4\n",
"ind_0 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_1 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_2 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_3 AGACGGCCTCGTTTCTTTACGAAACATAGGGACTCACTTCAATGTTGGCGAGTCTCATCGCGAGGCATCCTCCTC\n",
"ind_0 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_1 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_2 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_3 CGGAAAACGACCCTACCTGAGGGGAACAGACTGGGGACAATGTTTGATTTGTAATTGAGCCTAGTCAATGTTCCA\n",
"ind_0 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGAGCATCGGTGCGAAGGATAT\n",
"ind_1 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCACCCCCCGGGCATCGGTGCGAAGGATAT\n",
"ind_2 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATAT\n",
"ind_3 TCCGATAGCCAGGACTCGAGGTCGACTACCGGCGTGATGTCGGGTTCAACCCCCGGGCATCGGTGCGAAGGATAT\n",
"ind_0 GAAGAGCTGGGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_1 GAAGAGCTGGGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_2 GAAGAGCTGRGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCSTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_3 GAAGAGCTGAGTGTATCTTCCGAAACATCCCCCCCTGCGTGTGTTCCTCGCACAGCTACAATTCTACTTGTAGTT\n",
"ind_0 TTTGTGTTAACCGCCCTTTGCTTTGATATTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_1 TTTGTGTTAACCGCCCTTTGCTTTGATATTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_2 TTTGTGTTAACCGCCCTTTGCTTTGATWTTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_3 TTTGTGTTAACCGCCCCTTGCTTTGATATTGCCCGCCAAGCGTCTATTGGCAATTCAGAAGGCTATCAAACGTCT\n",
"ind_0 TCGAATCAAACCGTACTCGCAAGCCTTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_1 TCGAATCAAACCGTACTCGCAAGCCKTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_2 TCGAATCAAACCGTWCTCGCAAGCCTTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_3 TCGAATCAAACCGTACTCGCAAGCCTTGTGTTCGCACCCACCTCGATACGATCGTTGAGCTACAGCGTAGTTTTC\n",
"ind_0 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_1 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_2 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_3 TGTATTTTGGGTTTCTCACTGCTTCTTTGAAAACCGCGCCCTCCATGCTCCTGAAAGGCGCACAAGGCCACGCGG\n",
"ind_0 GTTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGTCGGTAGAAC\n",
"ind_1 GTTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGTCGGTAGAAC\n",
"ind_2 GWTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGTCGGTAGAAC\n",
"ind_3 GTTTCGAGCGAATCTAGGCTTGGCCGCCCCAAGTCACAGCGAGGATGATCCCATTTAATGCTATGACGGTAGAAC\n",
"ind_0 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACAAT\n",
"ind_1 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACAAT\n",
"ind_2 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACAAT\n",
"ind_3 CCTTGTGTACGCTCATCACCCTAAATAGCGCTCCCGTTACCCGGCTACCCAGTGGTTCTTTCCCTATCGAACMAT\n",
"ind_0 TAGCTAGAAATTAAGAAGGCTGTAACCCGGCGCGCGCAATGACTATCGCCGATTACAAGGGCAGGTGGTGACACT\n"
]
}
],
"prompt_number": 89
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 58
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}