{
"metadata": {
"name": "",
"signature": "sha256:62fdcda3f6fd04b57877c72594a4f561b15cf4a649107a3e01b9e0642e869ea4"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic Tutorial: paired-end ddRAD analysis\n",
"### _pyRAD_ v.2.16"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------------------------- \n",
"\n",
"###__Topics__: \n",
"\n",
" + Setup of params file for paired-end ddRAD\n",
" + Visualize ideal results on simulated data\n",
"\n",
"----------------------------- "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"What is ddRAD?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The double digest library preparation method was developed and described by [Peterson et al. 2012](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0037135). Here I will be talking about __*paired-end ddRAD*__ data and describe my recommendations for how to set up a params file to analyze them in _pyRAD_. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These will contain overhangs from the respective cut sites on either side. One side will have a barcode + Illumina adapter ligated, and the other end will have only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with \"\\_R1\\_\" in the name, and the second reads are in a separate file(s) with \"\\_R2\\_\". Second-read files contain the corresponding pairs from the first-read files in the exact same order.\n",
"\n",
"-------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![alt](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/diag_ddradpair.svg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--------------------- \n",
"\n",
"Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using _pyRAD_ or separate software. For simplicity I suggest using _pyRAD_, but for data with combinatorial barcodes (different barcodes on both ends of paired reads) you will need to use more advanced de-multiplexing software. \n",
"\n",
"A number of programs are available to check for overlap of paired-end reads, and you can run your data through them before input to _pyRAD_. However, as of v.2.0 _pyRAD_ includes a robust built-in overlap detection method utilizing _USEARCH_, which can separate merged from non-merged reads so that each can be analyzed separately. This method is illustrated in the Advanced pairddRAD tutorial. \n"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"The Example data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this tutorial I simulated paired-end ddRAD loci on a 12-taxon species tree, shown below. You can download the data with the script that follows and assemble them by following along with the rest of the instructions: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/fig_tree4sims.svg)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## download the data\n",
"wget -q http://www.dereneaton.com/downloads/simpairddrads.zip -O simpairddrads.zip\n",
"## unzip the data\n",
"unzip simpairddrads.zip"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Where the example data come from (skip this section if you want)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can simply download the data, but it might also be worth describing how the data were generated. I simulated ddRAD-like data using the egglib coalescent simulator with the following options. \n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"indels = 0.005 ## low rate of indels (prob. mutation is indel)\n",
"dropout = 0 ## if 1, mutations to restriction site can cause locus dropout.\n",
"nloci = 1000 ## Total Nloci simulated (less if locus dropout or merged reads) \n",
"ninds = 1 ## sampled individuals per tip taxon\n",
"shortfrag = 300 ## smallest size of digested fragments\n",
"longfrag = 600 ## largest size of digested fragments"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Because the data are simulated at 20X coverage, \n",
"## and the reads are sequenced from both ends (paired-end)\n",
"## the total number of reads is: \n",
"reads = 12*ninds*nloci*20*2\n",
"print \"%s sequenced reads\" % reads"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"480000 sequenced reads\n"
]
}
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here I execute the simulation script ([link](http://www.dereneaton.com/software/simrrls/)) to simulate the data and write them to files with the prefix simpairddrads. If you simply downloaded the data you do not need to do this step; I show it only for those interested. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## download the simulation program\n",
"wget -q http://www.dereneaton.com/downloads/simRRLs.py\n",
"\n",
"## simulate data with param settings described above\n",
"## (requires the Python package `Egglib`)\n",
"python simRRLs.py 0.005 0 1000 1 300,600 pairddrad simpairddrads"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\tsimulating pairddrad data\n",
"\tsimulating 1000 loci at 20X coverage across 12 tip taxa with 1 samples per taxon\n",
"\tindels arise at frequency of 0.005000 per mutation\n",
"\tmutations in restriction site = False\n",
"\tsequencing error rate = 0.0005\n",
"\ttheta=4Nu= 0.0014\n",
"\tcreating new barcode map\n",
".\n"
]
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We simulated 1000 loci per sample. Because we simulated no sequence overlap (min fragment size = 300) and no mutations to restriction sites, we should recover all 1000 loci shared across our entire data set. \n",
"\n",
"------------------------ \n"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Assembling the data set with _pyRAD_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first create an empty params.txt input file for the _pyRAD_ analysis. \n",
"The following command will create a template which we will fill in with all relevant parameter settings for the analysis."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\tnew params.txt file created\n"
]
}
],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------- \n",
"\n",
"Take a look at the default options. Each line designates a parameter and contains a \"##\" symbol, after which comments can be added and a description of the parameter is given. The affected step for each parameter is shown in parentheses. The first 14 parameters are required; numbers 15-36 are optional."
]
},
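{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the file's layout (a hypothetical snippet, not part of _pyRAD_ itself), each parameter line holds a value followed by a \"##\" comment, so the values can be pulled out like this:\n",
"\n",
"```python\n",
"## a minimal sketch: split each params.txt line into value and comment\n",
"def read_params(path):\n",
"    values = []\n",
"    for line in open(path):\n",
"        if \"##\" in line:\n",
"            value, comment = line.split(\"##\", 1)\n",
"            values.append(value.strip())\n",
"    return values\n",
"```"
]
},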
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16.1 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)\n",
"2 ## 7. N processors to use in parallel (all)\n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".88 ## 10. Wclust: clustering threshold as a decimal (s3,s6)\n",
"rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge (all)\n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c88d6m4p3 ## 14. prefix name for final output (no spaces) (s7)\n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
" ## 19.opt.: maxM: N mismatches in barcodes (def. 1) (s1)\n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
" ## 24.opt.: maxH: hetero. sites in consensus seq (def. 5) (s5)\n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
" ## 30.opt.: add output formats: a,n,s,u (see documentation) (s7)\n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n"
]
}
],
"prompt_number": 56
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Edit the params file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"I will use the script below to substitute new values, but you should simply __use any text editor__ to make changes. For this analysis I made the following changes from the defaults: \n",
"\n",
"-------------------- \n",
"\n",
" 6. set the two restriction enzymes used to generate the ddRAD data\n",
" 10. lowered clustering threshold to .85\n",
" 11. set datatype to pairddrad\n",
" 14. changed the output name prefix\n",
" 19. mismatches for demultiplexing set to 0 (exact match)\n",
" 24. Raised maxH. Lower is better for filtering paralogs.\n",
" 30. added additional output formats (e.g., nexus,SNPs,STRUCTURE)\n",
"\n",
"-------------------- \n",
" "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"sed -i '/## 6. /c\\TGCAG,AATT ## 6. cutsites... ' ./params.txt\n",
"sed -i '/## 10. /c\\.85 ## 10. lowered clust thresh' ./params.txt\n",
"sed -i '/## 11. /c\\pairddrad ## 11. datatype... ' ./params.txt\n",
"sed -i '/## 14. /c\\c85d6m4p3 ## 14. prefix name ' ./params.txt\n",
"sed -i '/## 19./c\\0 ## 19. errbarcode... ' ./params.txt\n",
"sed -i '/## 24./c\\10 ## 24. maxH... ' ./params.txt\n",
"sed -i '/## 30./c\\a,s,u,n,k ## 30. outformats... ' ./params.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 57
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------------- \n",
"\n",
"Let's take a look at the edited params.txt file\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat params.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==== parameter inputs for pyRAD version 2.16.1 ============================ affected step ==\n",
"./ ## 1. Working directory (all)\n",
"./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\n",
"./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\n",
"usearch7.0.1090_i86linux32 ## 4. command (or path) to call usearch v.7 (s3,s6)\n",
"muscle ## 5. command (or path) to call muscle (s3,s7)\n",
"TGCAG,AATT ## 6. cutsites... \n",
"2 ## 7. N processors to use in parallel (all)\n",
"6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\n",
"4 ## 9. NQual: max # sites with qual < 20 (line 20) (s2)\n",
".85 ## 10. lowered clust thresh\n",
"pairddrad ## 11. datatype... \n",
"4 ## 12. MinCov: min samples in a final locus (s7)\n",
"3 ## 13. MaxSH: max inds with shared hetero site (s7)\n",
"c85d6m4p3 ## 14. prefix name \n",
"==== optional params below this line =================================== affected step ==\n",
" ## 15.opt.: select subset (prefix* selector) (s2-s7)\n",
" ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\n",
" ## 17.opt.: exclude taxa (list or prefix*) (s7)\n",
" ## 18.opt.: Loc. of de-multiplexed data (s2)\n",
"0 ## 19. errbarcode... \n",
" ## 20.opt.: Phred Qscore offset (def. 33) (s2)\n",
" ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict (s2)\n",
" ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s5)\n",
" ## 23.opt.: maxN: Ns in a consensus seq (def. 5) (s5)\n",
"10 ## 24. maxH... \n",
" ## 25.opt.: ploidy: max alleles in consens (def. 2) see doc (s5)\n",
" ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)\n",
" ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\n",
" ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\n",
" ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)\n",
"a,s,u,n,k ## 30. outformats... \n",
" ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)\n",
" ## 32.opt.: merge/remove paired overlap (def 0), 1=check (s2)\n",
" ## 33.opt.: keep trimmed reads (def=0). Enter min length. (s2)\n",
" ## 34.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)\n",
" ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)\n",
" ## 36.opt.: hierarch. cluster groups (def.=0, 1=yes) (s6)\n",
"==== list hierachical cluster groups below this line =====================================\n"
]
}
],
"prompt_number": 58
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Step 1: De-multiplexing the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------- \n",
" \n",
"Four examples of acceptable input file name formats for paired-end data: \n",
"\n",
" 1. xxxx_R1_001.fastq xxxx_R2_001.fastq\n",
" 2. xxxx_R1_001.fastq.gz xxxx_R2_001.fastq.gz\n",
" 3. xxxx_R1_100.fq.gz xxxx_R2_100.fq.gz\n",
" 4. xxxx_R1_.fq xxxx_R2_.fq\n",
"\n",
"The file ending can be .fastq, .fq, or .gz. \n",
"There should be a unique name or number shared by each pair and the characters \\_R1\\_ and \\_R2\\_. \n",
"For every file name with \\_R1\\_ there should be a corresponding \\_R2\\_ file. \n",
"\n",
"If your data are _already_ demultiplexed skip step 1 and see step 2 below. \n",
"\n",
"----------------- \n"
]
},
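{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pairing convention can be checked with a short script before running step 1. This is a hypothetical helper, not part of _pyRAD_:\n",
"\n",
"```python\n",
"import glob, os\n",
"\n",
"## a minimal sketch: report _R1_ files that lack a matching _R2_ file\n",
"def unpaired_firsts(directory):\n",
"    r2s = set(glob.glob(os.path.join(directory, \"*_R2_*\")))\n",
"    return [f for f in glob.glob(os.path.join(directory, \"*_R1_*\"))\n",
"            if f.replace(\"_R1_\", \"_R2_\") not in r2s]\n",
"```"
]
},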
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -p params.txt -s 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 1: sorting reads by barcode\n",
" ."
]
}
],
"prompt_number": 59
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can look at the stats output for this below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/s1.sorting.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"file \tNreads\tcut_found\tbar_matched\n",
"simpairddrads_.fastq.gz\t240000\t240000\t240000\n",
"\n",
"\n",
"sample\ttrue_bar\tobs_bars\tN_obs\n",
"3K0 \tAATGTG \tAATGTG\t20000 \n",
"3L0 \tAGGGAT \tAGGGAT\t20000 \n",
"2G0 \tAGTTGA \tAGTTGA\t20000 \n",
"3I0 \tATGGGA \tATGGGA\t20000 \n",
"2H0 \tATGTTA \tATGTTA\t20000 \n",
"1D0 \tATTTTT \tATTTTT\t20000 \n",
"1A0 \tCATCAT \tCATCAT\t20000 \n",
"1C0 \tGAATGT \tGAATGT\t20000 \n",
"3J0 \tGTGTTT \tGTGTTT\t20000 \n",
"2E0 \tTAGTGA \tTAGTGA\t20000 \n",
"2F0 \tTGGGAG \tTGGGAG\t20000 \n",
"1B0 \tTTAGAA \tTTAGAA\t20000 \n",
"\n",
"nomatch \t_ \t0\n"
]
}
],
"prompt_number": 60
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The de-multiplexed reads are written to a separate file for each individual in a new directory called fastq/, created within your working directory."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"ls -l fastq/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0_R1.fq.gz\n",
"1A0_R2.fq.gz\n",
"1B0_R1.fq.gz\n",
"1B0_R2.fq.gz\n",
"1C0_R1.fq.gz\n",
"1C0_R2.fq.gz\n",
"1D0_R1.fq.gz\n",
"1D0_R2.fq.gz\n",
"2E0_R1.fq.gz\n",
"2E0_R2.fq.gz\n",
"2F0_R1.fq.gz\n",
"2F0_R2.fq.gz\n",
"2G0_R1.fq.gz\n",
"2G0_R2.fq.gz\n",
"2H0_R1.fq.gz\n",
"2H0_R2.fq.gz\n",
"3I0_R1.fq.gz\n",
"3I0_R2.fq.gz\n",
"3J0_R1.fq.gz\n",
"3J0_R2.fq.gz\n",
"3K0_R1.fq.gz\n",
"3K0_R2.fq.gz\n",
"3L0_R1.fq.gz\n",
"3L0_R2.fq.gz\n"
]
}
],
"prompt_number": 61
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An individual file will look like the following:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## FIRST READS file\n",
"less fastq/1A0_R1.fq.gz | head -n 12 "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"@lane1_fakedata0_R1_0 1:N:0:\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGA\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R1_1 1:N:0:\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGA\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R1_2 1:N:0:\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGA\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n"
]
}
],
"prompt_number": 75
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## SECOND READS file\n",
"less fastq/1A0_R2.fq.gz | head -n 12 "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"@lane1_fakedata0_R2_0 1:N:0:\n",
"AATTACCGTGCCGTTAACACAGGGGACGATATTTGGATACTGGTGAGTATCTCGACCGTGCATGTGGGTGGACTGGGTGATAAAAGAGCGAGCAGTACAT\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R2_1 1:N:0:\n",
"AATTACCGTGCCGTTAACACAGGGGACGATATTTGGATACTGGTGAGTATCTCGACCGTGCATGTGGGTGGACTGGGTGATAAAAGAGCGAGCAGTACAT\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n",
"@lane1_fakedata0_R2_2 1:N:0:\n",
"AATTACCGTGCCGTTAACACAGGGGACGATATTTGGATACTGGTGAGTATCTCGACCGTGCATGTGGGTGGACTGGGTGATAAAAGAGCGAGCAGTACAT\n",
"+\n",
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n"
]
}
],
"prompt_number": 76
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------------\n",
"\n",
"The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual. \n",
"\n",
"__If your data were previously de-multiplexed__ you need the following things before step 2: \n",
"\n",
"+ your sorted file names should be formatted like those above, but with your sample names substituted for 1A0, 1B0, etc. \n",
"+ the files can be zipped (.gz) or not (.fq or .fastq). \n",
"+ the barcode should already be removed (no longer on the left side of reads) \n",
"+ the restriction site should _not_ be removed, but if it is, enter a '@' symbol before the location of your sorted data.\n",
"\n",
"+ __Enter on line 18 of the params file the location of your sorted data.__\n",
"\n",
"-------------------- \n"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Step 2: Quality filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we apply the quality filtering. \n",
"\n",
"+ We left the filter setting (line 21) at its default of 0, meaning only a filter based on quality scores is applied. Low-quality sites are converted to Ns, and any read with more than X Ns is discarded, where X is the number set on line 9 of the params file (NQual). \n",
"\n",
"\n",
"+ We could also check for overlap of paired-end reads and filter for the presence of Illumina adapters in this step; however, we assume here that the data were properly size selected and do not contain either of these problems. To see those filters applied, see the Advanced tutorials. \n"
]
},
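{
"cell_type": "markdown",
"metadata": {},
"source": [
"The logic of the quality filter can be sketched as follows. This is an illustration only, not _pyRAD_'s actual code; the Phred offset of 33 is the default from line 20 of the params file:\n",
"\n",
"```python\n",
"## convert low-quality bases to N, then discard reads with too many Ns\n",
"def filter_read(seq, quals, minq=20, maxN=4, offset=33):\n",
"    masked = \"\".join(b if ord(q) - offset >= minq else \"N\"\n",
"                     for b, q in zip(seq, quals))\n",
"    return masked if masked.count(\"N\") <= maxN else None\n",
"```"
]
},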
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -p params.txt -s 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 2: quality filtering \n",
"\t............"
]
}
],
"prompt_number": 81
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Statistics for the number of reads that passed filtering can be found in the stats/ directory. \n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"cat stats/s2.rawedit.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"sample\tNreads\texclude\ttrimmed\tpassed\n",
"3K0\t20000\t0\t0\t20000\n",
"1C0\t20000\t0\t0\t20000\n",
"1D0\t20000\t0\t0\t20000\n",
"3L0\t20000\t0\t0\t20000\n",
"3J0\t20000\t0\t0\t20000\n",
"1A0\t20000\t0\t0\t20000\n",
"2G0\t20000\t0\t0\t20000\n",
"1B0\t20000\t0\t0\t20000\n",
"2E0\t20000\t0\t0\t20000\n",
"3I0\t20000\t0\t0\t20000\n",
"2H0\t20000\t0\t0\t20000\n",
"2F0\t20000\t0\t0\t20000\n",
"\n",
" Nreads = total number of reads for a sample\n",
" exclude = reads that were excluded\n",
" trimmed = reads that had adapter trimmed but were kept\n",
" passed = total kept reads\n",
"\n",
" note: you can set minimum length of kept trimmed reads on line 35 the params file \n",
" kept trimmed (fragment) reads written to /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/mergedreads/\n",
"\n"
]
}
],
"prompt_number": 82
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filtered data files are converted to fasta format and written to a directory called edits/. I show this below:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls edits/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1A0.edit\n",
"1B0.edit\n",
"1C0.edit\n",
"1D0.edit\n",
"2E0.edit\n",
"2F0.edit\n",
"2G0.edit\n",
"2H0.edit\n",
"3I0.edit\n",
"3J0.edit\n",
"3K0.edit\n",
"3L0.edit\n"
]
}
],
"prompt_number": 83
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------ \n",
"\n",
"In these files you can see that the first and second reads have been concatenated into a single read, separated by 'xxxx', and that the quality scores have been discarded. \n",
"\n",
"---------- \n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"head -n 8 edits/1A0.edit"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
">1A0_0_pair\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGxxxxATGTACTGCTCGCTCTTTTATCACCCAGTCCACCCACATGCACGGTCGAGATACTCACCAGTATCCAAATATCGTCCCCTGTGTTAACGGCACGGTAATT\n",
">1A0_1_pair\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGxxxxATGTACTGCTCGCTCTTTTATCACCCAGTCCACCCACATGCACGGTCGAGATACTCACCAGTATCCAAATATCGTCCCCTGTGTTAACGGCACGGTAATT\n",
">1A0_2_pair\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGxxxxATGTACTGCTCGCTCTTTTATCACCCAGTCCACCCACATGCACGGTCGAGATACTCACCAGTATCCAAATATCGTCCCCTGTGTTAACGGCACGGTAATT\n",
">1A0_3_pair\n",
"TGCAGTTCTAATTTTCCGCGACCATTCACAATTCGCTAGGGCAGAGACTTAGGGACTGGTCAGCGTGTACTTCGTGGGCCATGGCCCAACGCGxxxxATGTACTGCTCGCTCTTTTATCACCCAGTCCACCCACATGCACGGTCGAGATACTCACCAGTATCCAAATATCGTCCCCTGTGTTAACGGCACGGTAATT\n"
]
}
],
"prompt_number": 84
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------- \n",
"\n",
"## Step 3: Within-sample clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here you have two options for how to proceed. The first is to treat the concatenated reads like single-end ddRAD sequences and cluster them as such. To do this, change the datatype option in the params file from \"pairddrad\" to \"ddrad\" and proceed with steps 3-7. _pyRAD_ will detect and remove the 'xxxx' separator and cluster the long reads. \n",
"\n",
"The more complex option, and the one I prefer, is to cluster the reads separately, which I call the \"split clustering\" method. This will do the following: \n",
"\n",
"+ de-replicate concatenated reads \n",
"+ split them and cluster first reads only \n",
"+ match second reads back to firsts\n",
"+ align second read clusters separately from first-read clusters\n",
"+ discard loci for which second reads align poorly (incomplete digestion or paralogs)\n",
"\n",
"This is illustrated graphically below:"
]
},
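{
"cell_type": "markdown",
"metadata": {},
"source": [
"In pseudocode, the split-clustering steps above amount to something like the following. This is only a schematic; `dereplicate`, `cluster`, `match_second`, `align`, `too_many_indels`, and `discard` stand in for the _USEARCH_ and _MUSCLE_ calls _pyRAD_ makes internally:\n",
"\n",
"```python\n",
"## schematic pseudocode of split clustering; helper names are placeholders\n",
"def split_cluster(concat_reads):\n",
"    dereps = dereplicate(concat_reads)               ## de-replicate concatenated reads\n",
"    firsts = [r.split(\"xxxx\")[0] for r in dereps]    ## split off first reads\n",
"    for clust in cluster(firsts, threshold=0.85):    ## cluster first reads only\n",
"        seconds = [match_second(r) for r in clust]   ## match seconds back to firsts\n",
"        aln = align(seconds)                         ## align second reads separately\n",
"        if too_many_indels(aln):                     ## discard poorly aligning loci\n",
"            discard(clust)\n",
"```"
]
},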
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/pairclustermethod.svg)"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"The screen captures below show split clustering performed on an empirical paired ddRAD data set: \n"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The well-aligned clusters are written to a .clustS.gz file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/goodpairs.png)"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"And the poorly aligned pairs, which are excluded from further analysis, are written to a separate file ending in .badpairs.gz."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/badpairs.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------- \n",
"\n",
"The benefit of this method over clustering concatenated first+second reads is that second reads which are more divergent than the clustering threshold, or which contain many low-quality base calls (Ns), can still be retained, potentially yielding more SNPs. Also, for very large data sets it is _much_ faster to cluster shorter reads. The drawbacks are that it discards sequences whose first and second reads do not appear to come from the same DNA fragment, whereas the concatenated clustering method _might_ cluster them separately; moreover, even a single incompletely digested second read can cause an otherwise good cluster to be discarded."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split clustering method filters the alignment of second reads based on the number of indels in the alignment. This number can be set on line 27 of the params file. The default values are 3,6,99,99, meaning that for within-sample clusters we allow 3 indels in first-read clusters and 6 in second-read clusters, and for across-sample clusters we allow 99 in both first-read and second-read clusters. If only two numbers are set (e.g., 3,10) this is equivalent to 3,3,10,10. "
]
},
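{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expansion of the line-27 setting can be sketched as follows (an illustration, not _pyRAD_'s code):\n",
"\n",
"```python\n",
"## expand maxIndels to (within-1st, within-2nd, across-1st, across-2nd)\n",
"def expand_indels(setting):\n",
"    vals = [int(i) for i in setting.split(\",\")]\n",
"    if len(vals) == 2:   ## e.g., \"3,10\" is equivalent to \"3,3,10,10\"\n",
"        vals = [vals[0], vals[0], vals[1], vals[1]]\n",
"    return vals\n",
"```"
]
},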
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"## we are using the split-clustering method since the \n",
"## datatype is still set to pairddrad\n",
"pyrad -p params.txt -s 3 "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tde-replicating files for clustering...\n",
"\n",
"\tstep 3: within-sample clustering of 12 samples at \n",
"\t '.85' similarity using up to 2 processors\n",
"\t2E0.edit finished, 1000loci\n",
"\t1A0.edit finished, 1000loci\n",
"\t1C0.edit finished, 1000loci\n",
"\t3J0.edit finished, 1000loci\n",
"\t3I0.edit finished, 1000loci\n",
"\t1B0.edit finished, 1000loci\n",
"\t3L0.edit finished, 1000loci\n",
"\t2H0.edit finished, 1000loci\n",
"\t2F0.edit finished, 1000loci\n",
"\t3K0.edit finished, 1000loci\n",
"\t1D0.edit finished, 1000loci\n",
"\t2G0.edit finished, 1000loci\n"
]
}
],
"prompt_number": 85
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"head -n 23 stats/s3.clusters.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"taxa\ttotal\tdpt.me\tdpt.sd\td>5.tot\td>5.me\td>5.sd\tbadpairs\n",
"1A0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"1B0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"1C0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"1D0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"2E0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"2F0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"2G0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"2H0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"3I0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"3J0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"3K0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"3L0\t1000\t20.0\t0.0\t1000\t20.0\t0.0\t0\n",
"\n",
" ## total = total number of clusters, including singletons\n",
" ## dpt.me = mean depth of clusters\n",
" ## dpt.sd = standard deviation of cluster depth\n",
" ## >N.tot = number of clusters with depth greater than N\n",
" ## >N.me = mean depth of clusters with depth greater than N\n",
" ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N\n",
" ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)\n",
"\n"
]
}
],
"prompt_number": 89
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Steps 4 & 5: Consensus base calling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:"
]
},
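{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The idea behind this step can be sketched as a simple maximum-likelihood calculation (an illustrative sketch only, not _pyRAD_'s actual implementation): the base counts observed at each site are modeled as a mixture of homozygous sites, where mismatches arise from sequencing error (E), and heterozygous sites, where the two alleles are sampled roughly equally, and H and E are chosen to maximize the joint likelihood."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "import numpy as np\n",
  "from scipy.stats import binom\n",
  "from scipy.optimize import minimize\n",
  "\n",
  "## illustrative sketch, not pyRAD's implementation.\n",
  "## sites = [(depth, count of most common base), ...]\n",
  "def neg_loglik(params, sites):\n",
  "    h, e = params\n",
  "    n = np.array([s[0] for s in sites])\n",
  "    k = np.array([s[1] for s in sites])\n",
  "    p_homo = binom.pmf(n - k, n, e)    ## errors at a homozygous site\n",
  "    p_het = binom.pmf(k, n, 0.5)       ## ~50/50 alleles at a het site\n",
  "    return -np.log((1 - h) * p_homo + h * p_het + 1e-300).sum()\n",
  "\n",
  "sites = [(20, 20)] * 95 + [(20, 19)] * 4 + [(20, 10)]  ## toy counts\n",
  "fit = minimize(neg_loglik, [0.01, 0.001], args=(sites,),\n",
  "               bounds=[(1e-7, 0.5), (1e-7, 0.1)], method=\"L-BFGS-B\")\n",
  "print(fit.x)"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},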
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -p params.txt -s 4"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 4: estimating error rate and heterozygosity\n",
"\t............"
]
}
],
"prompt_number": 90
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling consensus sequences applies a number of filters, listed in the params file, to remove potential paralogs and highly repetitive markers from the data set. For paired-end data the maxN filter applies only to first reads, since we don't mind allowing low-quality base calls in second reads. The maxH filter applies to both reads together, as does the diploid filter (which allows at most two haplotypes in a consensus sequence)."
]
},
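{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a rough illustration of the first two filters (again a sketch, not _pyRAD_'s own code; the cutoff values used here are hypothetical):"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "AMBIG = set(\"RYSWKM\")  ## IUPAC codes marking heterozygous sites\n",
  "\n",
  "def passes_filters(read1, read2, maxN=4, maxH=5):\n",
  "    ## maxN applies to first reads only\n",
  "    if read1.upper().count(\"N\") > maxN:\n",
  "        return False\n",
  "    ## maxH applies to both reads together\n",
  "    both = (read1 + read2).upper()\n",
  "    return sum(both.count(a) for a in AMBIG) <= maxH\n",
  "\n",
  "print(passes_filters(\"ACGTR\" * 10, \"ACGT\" * 10))  ## too many hets: False"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},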
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -p params.txt -s 5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 5: created consensus seqs for 12 samples, using H=0.00135 E=0.00050\n",
"\t............"
]
}
],
"prompt_number": 91
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/s5.consens.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"taxon \tnloci\tf1loci\tf2loci\tnsites\tnpoly\tpoly\n",
"1D0.clustS.gz \t1000\t1000\t1000\t183006\t254\t0.0013879\n",
"3J0.clustS.gz \t1000\t1000\t1000\t183005\t256\t0.0013989\n",
"3K0.clustS.gz \t1000\t1000\t1000\t183003\t233\t0.0012732\n",
"1C0.clustS.gz \t1000\t1000\t1000\t183006\t238\t0.0013005\n",
"3L0.clustS.gz \t1000\t1000\t1000\t183006\t281\t0.0015355\n",
"1A0.clustS.gz \t1000\t1000\t1000\t183006\t294\t0.0016065\n",
"1B0.clustS.gz \t1000\t1000\t1000\t183004\t235\t0.0012841\n",
"2G0.clustS.gz \t1000\t1000\t1000\t183001\t241\t0.0013169\n",
"3I0.clustS.gz \t1000\t1000\t1000\t183006\t230\t0.0012568\n",
"2E0.clustS.gz \t1000\t1000\t1000\t183006\t249\t0.0013606\n",
"2F0.clustS.gz \t1000\t1000\t1000\t183005\t211\t0.001153\n",
"2H0.clustS.gz \t1000\t1000\t1000\t183006\t257\t0.0014043\n",
"\n",
" ## nloci = number of loci\n",
" ## f1loci = number of loci with >N depth coverage\n",
" ## f2loci = number of loci with >N depth and passed paralog filter\n",
" ## nsites = number of sites across f loci\n",
" ## npoly = number of polymorphic sites in nsites\n",
" ## poly = frequency of polymorphic sites\n"
]
}
],
"prompt_number": 92
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Step 6: Across-sample clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step clusters consensus sequences across samples. It can take a very long time for very large data sets. Run normally, it prints its progress to the screen so you can judge how long it might take. Try this first; if it seems too slow (many days or weeks), cancel the job and try the two-step (hierarchical) clustering method instead (see the Advanced tutorials). Here I show the results for a data set that runs very quickly."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -p params.txt -s 6"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"usearch v7.0.1090_i86linux32, 4.0Gb RAM (65.8Gb total), 40 cores\n",
"(C) Copyright 2013 Robert C. Edgar, all rights reserved.\n",
"http://drive5.com/usearch\n",
"\n",
"Licensed to: daeaton.chicago@gmail.com\n",
"\n",
"\n",
"\tfinished clustering\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"\n",
"\tstep 6: clustering across 12 samples at '.85' similarity \n",
"\n",
"00:00 21Mb 0.1% 0 clusters, max size 0, avg 0.0\r",
"00:01 27Mb 4.3% 399 clusters, max size 4, avg 1.2\r",
"00:02 28Mb 9.8% 697 clusters, max size 5, avg 1.7\r",
"00:03 28Mb 18.7% 914 clusters, max size 7, avg 2.5\r",
"00:04 28Mb 74.2% 1000 clusters, max size 12, avg 8.9\r",
"00:04 28Mb 100.0% 1000 clusters, max size 12, avg 12.0\r\n",
" \n",
" Seqs 12000 (12.0k)\n",
" Clusters 1000\n",
" Max size 12\n",
" Avg size 12.0\n",
" Min size 12\n",
"Singletons 0, 0.0% of seqs, 0.0% of clusters\n",
" Max mem 28Mb\n",
" Time 4.00s\n",
"Throughput 3000.0 seqs/sec.\n",
"\n"
]
}
],
"prompt_number": 93
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Step 7: Statistics and output files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step aligns loci, writes final statistics, and outputs data files in a variety of formats."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"pyrad -p params.txt -s 7"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\tingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0\n",
"\taddon \n",
"\texclude \n",
"\t\n",
"\tfinal stats written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/stats/c85d6m4p3.stats\n",
"\toutput files written to:\n",
"\t /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory\n",
"\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n",
"\n",
" ------------------------------------------------------------\n",
" pyRAD : RADseq for phylogenetics & introgression analyses\n",
" ------------------------------------------------------------\n",
"\n",
"..."
]
}
],
"prompt_number": 94
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stats for this run. In this case all 1000 loci were recovered across all 12 samples."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"cat stats/c85d6m4p3.stats"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"1000 ## loci with > minsp containing data\n",
"1000 ## loci with > minsp containing data & paralogs removed\n",
"1000 ## loci with > minsp containing data & paralogs removed & final filtering\n",
"\n",
"## number of loci recovered in final data set for each taxon.\n",
"taxon\tnloci\n",
"1A0\t1000\n",
"1B0\t1000\n",
"1C0\t1000\n",
"1D0\t1000\n",
"2E0\t1000\n",
"2F0\t1000\n",
"2G0\t1000\n",
"2H0\t1000\n",
"3I0\t1000\n",
"3J0\t1000\n",
"3K0\t1000\n",
"3L0\t1000\n",
"\n",
"\n",
"## nloci = number of loci with data for exactly ntaxa\n",
"## ntotal = number of loci for which at least ntaxa have data\n",
"ntaxa\tnloci\tsaved\tntotal\n",
"1\t-\n",
"2\t-\t\t-\n",
"3\t-\t\t-\n",
"4\t0\t*\t1000\n",
"5\t0\t*\t1000\n",
"6\t0\t*\t1000\n",
"7\t0\t*\t1000\n",
"8\t0\t*\t1000\n",
"9\t0\t*\t1000\n",
"10\t0\t*\t1000\n",
"11\t0\t*\t1000\n",
"12\t1000\t*\t1000\n",
"\n",
"\n",
"## var = number of loci containing n variable sites.\n",
"## pis = number of loci containing n parsimony informative var sites.\n",
"n\tvar\tPIS\n",
"0\t4\t73\n",
"1\t224\t202\n",
"2\t310\t238\n",
"3\t322\t205\n",
"4\t313\t142\n",
"5\t213\t69\n",
"6\t199\t46\n",
"7\t142\t12\n",
"8\t87\t6\n",
"9\t56\t4\n",
"10\t32\t2\n",
"11\t12\t0\n",
"12\t8\t0\n",
"13\t2\t0\n",
"14\t1\t1\n",
"15\t0\t0\n",
"16\t1\t0\n",
"17\t1\t0\n",
"total var= 8136\n",
"total pis= 2684\n"
]
}
],
"prompt_number": 95
},
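{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The var/pis distributions above can be used to verify the reported totals (values copied by hand from the table; each total is the sum of n \u00d7 count over loci):"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "var = {0: 4, 1: 224, 2: 310, 3: 322, 4: 313, 5: 213, 6: 199, 7: 142,\n",
  "       8: 87, 9: 56, 10: 32, 11: 12, 12: 8, 13: 2, 14: 1, 16: 1, 17: 1}\n",
  "pis = {0: 73, 1: 202, 2: 238, 3: 205, 4: 142, 5: 69, 6: 46, 7: 12,\n",
  "       8: 6, 9: 4, 10: 2, 14: 1}\n",
  "print(sum(n * c for n, c in var.items()))  ## 8136\n",
  "print(sum(n * c for n, c in pis.items()))  ## 2684"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},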
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Output files"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls outfiles/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"c85d6m4p3.alleles\n",
"c85d6m4p3.excluded_loci\n",
"c85d6m4p3.loci\n",
"c85d6m4p3.nex\n",
"c85d6m4p3.phy\n",
"c85d6m4p3.snps\n",
"c85d6m4p3.str\n",
"c85d6m4p3.unlinked_snps\n"
]
}
],
"prompt_number": 100
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stats files"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls stats/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"c85d6m4p3.stats\n",
"Pi_E_estimate.txt\n",
"s1.sorting.txt\n",
"s2.rawedit.txt\n",
"s3.clusters.txt\n",
"s5.consens.txt\n"
]
}
],
"prompt_number": 101
}
],
"metadata": {}
}
]
}