aaronsaunders/UPARSE_workflow

## UPARSE_workflow
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# UPARSE analysis pipeline form amplicon sequences\n",
    "\n",
    "We will follow the Illumina paired end tutorial.\n",
    "\n",
    "http://drive5.com/usearch/manual/upp_ill_pe.html\n",
    "\n",
    "\n",
    "This page gives an example of 2 pipelines using UPARSE for Illumina paired reads. Both use the same preprocessing.\n",
    "\n",
    "### Preprocessing\n",
    "\n",
    "These commands make the following assumptions (which are usually but not always true).\n",
    "\n",
    "\n",
    "1. The forward and reverse read have a significant overlap, say 32 bases or more.\n",
    "\n",
    "2. There are no non-biological bases in the read such as adapters or barcodes.\n",
    "\n",
    "3. Sequences complementary to PCR primers are not included in the reads.\n",
    "\n",
    "4. The reads have been demultiplexed, i.e. split into separate FASTQ files for each sample.\n",
    "\n",
    "5. The FASTQ filenames start with the sample name.\n",
    "\n",
    "6. The reads are all on the same strand.\n",
    "\n",
    "\n",
    "Commands:\n",
    "\n",
    "    usearch -fastq_mergepairs *_R1_*.fastq -relabel @ -fastqout merged.fq\n",
    "\n",
    "    usearch -fastq_filter merged.fq -fastq_maxee 1.0 -relabel Filt -fastaout filtered.fa\n",
    "\n",
    "    usearch -fastx_uniques filtered.fa -relabel Uniq -sizeout -fastaout uniques.fa\n",
    "\n",
    "\n",
    "### Original UPARSE (with UCHIME)\n",
    "\n",
    "http://drive5.com/usearch/manual/uparse_pipeline.html\n",
    "\n",
    "Takes the good quality, unique reads and clusters at 97% to identify centroids (which are screened for chimeras with UCHIME). Then clusters all of the sequences within 3% of a centroid as an 97% OTU.\n",
    "\n",
    "\n",
    "Commands:\n",
    "\n",
    "    usearch -cluster_otus uniques.fa -minsize 2 -otus otus.fa -relabel Otu\n",
    "\n",
    "    usearch -usearch_global merged.fq -db otus.fa -strand plus -id 0.97 \\\n",
    "      -otutabout otutab.txt -biomout otutab.json\n",
    "\n",
    " \n",
    " ### UNOISE pipeline (NEW!)\n",
    " http://www.drive5.com/usearch/manual/unoise_pipeline.html\n",
    " \n",
    "Takes the good quality, unique reads and identifies centroids with an error model. These are also screened for chimeras (with UCHIME) in the same script. Then clusters all of the sequences within 3% of a centroid as an 97% OTU.\n",
    "\n",
    "Commands:\n",
    " \n",
    "    usearch -unoise uniques.fa -tabbedout out.txt -fastaout denoised.fa\n",
    "\n",
    "    usearch -usearch_global reads.fq -db denoised.fa -strand plus -id 0.97 -otutabout otu_table.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download Sample data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!rm -rf data # delete output and start fresh\n",
    "!mkdir data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fetch data\n",
      "URL transformed to HTTPS due to an HSTS policy\n",
      "--2017-01-11 10:20:31--  https://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip\n",
      "Resolving www.mothur.org (www.mothur.org)... 141.214.120.22\n",
      "Connecting to www.mothur.org (www.mothur.org)|141.214.120.22|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 36910055 (35M) [application/zip]\n",
      "Saving to: ‘data/MiSeqSOPData.zip’\n",
      "\n",
      "MiSeqSOPData.zip    100%[===================>]  35.20M   598KB/s    in 86s     \n",
      "\n",
      "2017-01-11 10:21:58 (419 KB/s) - ‘data/MiSeqSOPData.zip’ saved [36910055/36910055]\n",
      "\n",
      "Archive:  data/MiSeqSOPData.zip\n",
      "   creating: MiSeq_SOP/\n",
      "  inflating: MiSeq_SOP/F3D0_S188_L001_R1_001.fastq  \n",
      "   creating: __MACOSX/\n",
      "   creating: __MACOSX/MiSeq_SOP/\n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D0_S188_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D0_S188_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D0_S188_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D141_S207_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D141_S207_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D141_S207_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D141_S207_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D142_S208_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D142_S208_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D142_S208_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D142_S208_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D143_S209_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D143_S209_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D143_S209_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D143_S209_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D144_S210_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D144_S210_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D144_S210_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D144_S210_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D145_S211_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D145_S211_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D145_S211_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D145_S211_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D146_S212_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D146_S212_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D146_S212_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D146_S212_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D147_S213_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D147_S213_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D147_S213_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D147_S213_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D148_S214_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D148_S214_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D148_S214_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D148_S214_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D149_S215_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D149_S215_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D149_S215_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D149_S215_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D150_S216_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D150_S216_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D150_S216_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D150_S216_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D1_S189_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D1_S189_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D1_S189_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D1_S189_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D2_S190_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D2_S190_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D2_S190_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D2_S190_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D3_S191_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D3_S191_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D3_S191_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D3_S191_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D5_S193_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D5_S193_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D5_S193_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D5_S193_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D6_S194_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D6_S194_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D6_S194_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D6_S194_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D7_S195_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D7_S195_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D7_S195_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D7_S195_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D8_S196_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D8_S196_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D8_S196_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D8_S196_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D9_S197_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D9_S197_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/F3D9_S197_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._F3D9_S197_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/HMP_MOCK.v35.fasta  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._HMP_MOCK.v35.fasta  \n",
      "  inflating: MiSeq_SOP/Mock_S280_L001_R1_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._Mock_S280_L001_R1_001.fastq  \n",
      "  inflating: MiSeq_SOP/Mock_S280_L001_R2_001.fastq  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._Mock_S280_L001_R2_001.fastq  \n",
      "  inflating: MiSeq_SOP/mouse.dpw.metadata  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._mouse.dpw.metadata  \n",
      "  inflating: MiSeq_SOP/mouse.time.design  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._mouse.time.design  \n",
      "  inflating: MiSeq_SOP/stability.batch  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._stability.batch  \n",
      "  inflating: MiSeq_SOP/stability.files  \n",
      "  inflating: __MACOSX/MiSeq_SOP/._stability.files  \n",
      "Fetch taxonomy reference data\n",
      "--2017-01-11 10:22:00--  http://drive5.com/sintax/rdp_16s_v16.fa.gz\n",
      "Resolving drive5.com (drive5.com)... 199.195.116.69\n",
      "Connecting to drive5.com (drive5.com)|199.195.116.69|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 4602323 (4.4M) [application/x-gzip]\n",
      "Saving to: ‘data/rdp_16s_v16.fa.gz’\n",
      "\n",
      "rdp_16s_v16.fa.gz   100%[===================>]   4.39M  2.44MB/s    in 1.8s    \n",
      "\n",
      "2017-01-11 10:22:03 (2.44 MB/s) - ‘data/rdp_16s_v16.fa.gz’ saved [4602323/4602323]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!sh code/download.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 56M\r\n",
      "drwxr-xr-x 2 amsa amsa 4.0K Aug 14  2014 MiSeq_SOP\r\n",
      "-rw-rw-r-- 1 amsa amsa  36M Aug 14  2014 MiSeqSOPData.zip\r\n",
      "-rw-rw-r-- 1 amsa amsa  20M Sep 10 01:16 rdp_16s_v16.fa\r\n",
      "-rw-rw-r-- 1 amsa amsa  206 Jan 11 10:22 README\r\n"
     ]
    }
   ],
   "source": [
    "!rm -r data/__MACOSX\n",
    "!ls -lh data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sequence preprocessing\n",
    "\n",
    "### Step 01: Merging pairs\n",
    "\n",
    "http://drive5.com/usearch/manual/cmd_fastq_mergepairs.html\n",
    "\n",
    "The fastq_mergepairs command merges (assembles) paired-end reads to create consensus sequences and, optionally, consensus quality scores.\n",
    "\n",
    "If given the forward reads eg. `F3D*_R1_*.fastq` the script automatically finds the R2 reads and matches the samples\n",
    "\n",
    "---\n",
    "\n",
    "After the merge examine the [merge report](http://drive5.com/usearch/manual/merge_report.html), which is part of the output.\n",
    "\n",
    "![](http://drive5.com/usearch/manual/mergerep.gif)\n",
    "\n",
    "With long overlaps (here shown by the mean alignment length of 248). Misalignments are therefore very unlikely, and it would be reasonable to increase the -fastq_maxdiffs and -fastq_maxdiffspct values to increase the number of merged pairs. Quality filtering will take care of discarding reads where many mismatches induce a large number of expected errors. This doesn't necessarily happen -- e.g., if low quality base calls in R2 are mismatches against high-quality base calls in R1 then the merged Q scores can still be high.\n",
    "\n",
    "**Filtering**\n",
    "\n",
    "`-fastq_maxdiffs`  Maximum number of mismatches in the alignment. Default 5. Consider increasing if you have long overlaps.\n",
    "`-fastq_maxdiffpct`  Maximum number of mismatches as an integer percentage. Default 5. Consider increasing if you have long overlaps.\n",
    "\n",
    "Deafult params work well for merging 2 x 300 reads with short overlap shuch as the V3-V4 product.\n",
    "\n",
    "**Merged read labels**\n",
    "\n",
    "  `-relabel`  Prefix string from sample file for output labels, then the read number 1, 2, 3... is appended after the prefix.\n",
    "  \n",
    "`F3D9_S197_L001_R1_001.fastq` becomes\n",
    "\n",
    "      @F3D9.1\n",
    "      @F3D9.2\n",
    "      @F3D9.3 ...\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!rm -r uparse_out\n",
    "!mkdir uparse_out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "\n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D0_S188_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D0_S188_L001_R2_001.fastq\n",
      "  Relabel reads as F3D0.#\n",
      "\n",
      "00:00 100Mb   100.0% 84.8% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D141_S207_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D141_S207_L001_R2_001.fastq\n",
      "  Relabel reads as F3D141.#\n",
      "\n",
      "00:00 100Mb     0.1% 0% merged00:00 100Mb   100.0% 83.9% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D142_S208_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D142_S208_L001_R2_001.fastq\n",
      "  Relabel reads as F3D142.#\n",
      "\n",
      "00:01 100Mb     0.2% 83.9% merged00:01 100Mb   100.0% 83.3% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D143_S209_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D143_S209_L001_R2_001.fastq\n",
      "  Relabel reads as F3D143.#\n",
      "\n",
      "00:01 100Mb     0.1% 0% merged00:01 100Mb   100.0% 83.1% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D144_S210_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D144_S210_L001_R2_001.fastq\n",
      "  Relabel reads as F3D144.#\n",
      "\n",
      "00:01 100Mb     0.1% 0% merged00:01 100Mb   100.0% 81.9% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D145_S211_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D145_S211_L001_R2_001.fastq\n",
      "  Relabel reads as F3D145.#\n",
      "\n",
      "00:01 100Mb     0.1% 0% merged00:01 100Mb   100.0% 80.8% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D146_S212_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D146_S212_L001_R2_001.fastq\n",
      "  Relabel reads as F3D146.#\n",
      "\n",
      "00:01 100Mb     0.1% 0% merged00:01 100Mb   100.0% 81.2% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D147_S213_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D147_S213_L001_R2_001.fastq\n",
      "  Relabel reads as F3D147.#\n",
      "\n",
      "00:01 100Mb     0.1% 0% merged00:01 100Mb   100.0% 79.4% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D148_S214_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D148_S214_L001_R2_001.fastq\n",
      "  Relabel reads as F3D148.#\n",
      "\n",
      "00:01 100Mb     0.1% 0% merged00:01 100Mb   100.0% 79.6% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D149_S215_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D149_S215_L001_R2_001.fastq\n",
      "  Relabel reads as F3D149.#\n",
      "\n",
      "00:02 100Mb     0.2% 79.6% merged00:02 100Mb   100.0% 80.1% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D150_S216_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D150_S216_L001_R2_001.fastq\n",
      "  Relabel reads as F3D150.#\n",
      "\n",
      "00:02 100Mb     0.1% 0% merged00:02 100Mb   100.0% 80.1% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D1_S189_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D1_S189_L001_R2_001.fastq\n",
      "  Relabel reads as F3D1.#\n",
      "\n",
      "00:02 100Mb     0.1% 0% merged00:02 100Mb   100.0% 80.5% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D2_S190_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D2_S190_L001_R2_001.fastq\n",
      "  Relabel reads as F3D2.#\n",
      "\n",
      "00:02 100Mb     0.1% 0% merged00:02 101Mb   100.0% 81.3% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D3_S191_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D3_S191_L001_R2_001.fastq\n",
      "  Relabel reads as F3D3.#\n",
      "\n",
      "00:02 101Mb     0.1% 0% merged00:03 101Mb    48.1% 81.3% merged00:03 101Mb   100.0% 81.3% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D5_S193_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D5_S193_L001_R2_001.fastq\n",
      "  Relabel reads as F3D5.#\n",
      "\n",
      "00:03 101Mb     0.1% 0% merged00:03 104Mb   100.0% 81.3% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D6_S194_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D6_S194_L001_R2_001.fastq\n",
      "  Relabel reads as F3D6.#\n",
      "\n",
      "00:03 104Mb     0.1% 0% merged00:03 104Mb   100.0% 81.6% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D7_S195_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D7_S195_L001_R2_001.fastq\n",
      "  Relabel reads as F3D7.#\n",
      "\n",
      "00:03 104Mb     0.1% 0% merged00:03 104Mb   100.0% 81.6% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D8_S196_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D8_S196_L001_R2_001.fastq\n",
      "  Relabel reads as F3D8.#\n",
      "\n",
      "00:03 104Mb     0.1% 0% merged00:03 104Mb   100.0% 81.8% merged\n",
      "                                 \n",
      "Merging\n",
      "  Fwd data/MiSeq_SOP/F3D9_S197_L001_R1_001.fastq\n",
      "  Rev data/MiSeq_SOP/F3D9_S197_L001_R2_001.fastq\n",
      "  Relabel reads as F3D9.#\n",
      "\n",
      "00:03 104Mb     0.1% 0% merged00:03 104Mb   100.0% 82.1% merged\n",
      "\n",
      "Totals:\n",
      "    147581  Pairs (147.6k)\n",
      "    121176  Merged (121.2k, 82.11%)\n",
      "     40086  Alignments with zero diffs (27.16%)\n",
      "     26292  Too many diffs (> 10) (17.82%)\n",
      "         0  Fwd tails Q <= 2 trimmed (0.00%)\n",
      "       148  Rev tails Q <= 2 trimmed (0.10%)\n",
      "         0  Fwd too short (< 64) after tail trimming (0.00%)\n",
      "         5  Rev too short (< 64) after tail trimming (0.00%)\n",
      "       108  No alignment found (0.07%)\n",
      "         0  Alignment too short (< 16) (0.00%)\n",
      "        22  Staggered pairs (0.01%) merged & trimmed\n",
      "    249.23  Mean alignment length\n",
      "    252.47  Mean merged length\n",
      "      0.39  Mean fwd expected errors\n",
      "      1.14  Mean rev expected errors\n",
      "      0.08  Mean merged expected errors\n"
     ]
    }
   ],
   "source": [
    "!usearch -fastq_mergepairs data/MiSeq_SOP/F3D*_R1_*.fastq -fastqout 'uparse_out/01_merged.fq' -relabel @ -log 'uparse_out/01_merge.log' -fastq_maxdiffs 10 -fastq_maxdiffpct 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Relabel reads as F3D9.#\r\n",
      "\r\n",
      "00:03 104Mb   100.0% 82.1% merged\r\n",
      "\r\n",
      "Totals:\r\n",
      "    147581  Pairs (147.6k)\r\n",
      "    121176  Merged (121.2k, 82.11%)\r\n",
      "     40086  Alignments with zero diffs (27.16%)\r\n",
      "     26292  Too many diffs (> 10) (17.82%)\r\n",
      "         0  Fwd tails Q <= 2 trimmed (0.00%)\r\n",
      "       148  Rev tails Q <= 2 trimmed (0.10%)\r\n",
      "         0  Fwd too short (< 64) after tail trimming (0.00%)\r\n",
      "         5  Rev too short (< 64) after tail trimming (0.00%)\r\n",
      "       108  No alignment found (0.07%)\r\n",
      "         0  Alignment too short (< 16) (0.00%)\r\n",
      "        22  Staggered pairs (0.01%) merged & trimmed\r\n",
      "    249.23  Mean alignment length\r\n",
      "    252.47  Mean merged length\r\n",
      "      0.39  Mean fwd expected errors\r\n",
      "      1.14  Mean rev expected errors\r\n",
      "      0.08  Mean merged expected errors\r\n",
      "\r\n",
      "Finished Wed Jan 11 12:02:31 2017\r\n",
      "Elapsed time 00:03\r\n",
      "Max memory 104Mb\r\n"
     ]
    }
   ],
   "source": [
    "!tail -n 25 'uparse_out/01_merge.log'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "@F3D0.1\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGG\r\n",
      "+\r\n",
      "35JJJJJJJJJJJJJJ/JJJ5JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJB\r\n"
     ]
    }
   ],
   "source": [
    "!head -n 4 'uparse_out/01_merged.fq'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "!! Need a quality check here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 02: filter fastq\n",
    "http://drive5.com/usearch/manual/upp_readprep.html\n",
    "\n",
    "**Quality filtering**\n",
    "\n",
    "Quality filtering of FASTQ reads is done by the `fastq_filter` command. This enables length trimming and quality filtering to be done in a single step. If you have overlapping paired reads, then you probably don't need to trim the length.\n",
    "\n",
    "`-fastq_maxee`: set an expected error threshold (1.0 is recommended as a default). \n",
    "\n",
    "Don't worry if this filter appears to be stringent; you will probably find that most of the discarded reads map to OTUs after clustering.\n",
    "\n",
    "**Length trimming**\n",
    "\n",
    "`-fastq_trunclen`: If you have variable-length reads, such as 454, then you should trim them to a fixed length. Not a problem for Illumina unless the reads are truncated on quality score (above rejects reads based on average quality) and this is usually not a problem with overlapping reads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:02 32Mb    100.0% Filtering, 98.9% passed00:02 66Mb     98.9% Filtering, 98.9% passed\n",
      "    121176  Reads (121.2k)                  \n",
      "      1370  Discarded reads with expected errs > 1.00\n",
      "    119806  Filtered reads (119.8k, 98.9%)\n"
     ]
    }
   ],
   "source": [
    "!usearch -fastq_filter 'uparse_out/01_merged.fq' -fastq_maxee 1.0 -fastaout 'uparse_out/02_filtered.fa' -relabel Filt -log 'uparse_out/02_filter.log'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These data are high quality, paired end reads that have been merged so only 1.1% failed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">Filt1\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGC\r\n",
      "GGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGG\r\n",
      "GGATCAAACAGG\r\n",
      ">Filt2\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGA\r\n",
      "GAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGG\r\n",
      "GTATCGAACAGG\r\n"
     ]
    }
   ],
   "source": [
    "!head 'uparse_out/02_filtered.fa'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 03: Dereplication\n",
    "\n",
    "The input sequences to the first `cluster_otus` (that generates the zOTUs) must be a set of unique sequences sorted in order of decreasing abundance with size annotations in the labels. The `fastx_uniques` command can do this. The input to `fastx_uniques` should be the reads after any quality filtering or length trimming.\n",
    "\n",
    "`-relabel Uniq` so that the unique sequences are labeled `Uniq1, Uniq2` and so on. \n",
    "\n",
    "**Samples should be pooled before dereplication**\n",
    "\n",
    "I recommend pooling samples, i.e. concatenating reads for all samples that were sequenced in the same run. This is important for getting the best detection of chimeras and cross-talk, and for getting the best sensitivity to low-abundance sequences that could be lost if individual samples or subsets of samples are clustered separately. See pooling samples for discussion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 73Mb    100.0% Reading uparse_out/02_filtered.fa\n",
      "00:01 78Mb    100.0% DF                               00:01 78Mb     68.2% DF\n",
      "00:01 80Mb   119806 seqs, 15780 uniques, 12315 singletons (78.0%)\n",
      "00:01 80Mb   Min size 1, median 1, max 9798, avg 7.59\n",
      "00:01 72Mb      0.0% Writing uparse_out/03_uniques.fa00:01 72Mb    100.0% Writing uparse_out/03_uniques.fa\n"
     ]
    }
   ],
   "source": [
    "!usearch -fastx_uniques 'uparse_out/02_filtered.fa' -sizeout -relabel Uniq -fastaout 'uparse_out/03_uniques.fa' -log 'uparse_out/03_uniques.log'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">Uniq1;size=9798;\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGA\r\n",
      "GAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGG\r\n",
      "GTATCGAACAGG\r\n",
      ">Uniq2;size=7743;\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCTCAAGTCAGCGGTCAAATCGC\r\n",
      "GGGGCTCAACCCCGTTCCGCCGTTGAAACTGGGAGCCTTGAGTGCGCGAGAAGTAGGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCCTACCGGCGCGCAACTGACGCTCATGCACGAAAGCGTGG\r\n",
      "GTATCGAACAGG\r\n"
     ]
    }
   ],
   "source": [
    "!head 'uparse_out/03_uniques.fa'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# UPARSE\n",
    "\n",
    "\n",
    "### Step 04: : Identify centroids and chimera filter: UCHIME\n",
    "\n",
    "The purpose of this step is to identify the zOTUs in the population. The high-abundance, good quality sequences that we assume are the centroids of the actual population OTUs. In a UPARSE pipeline, clustering and chimera filtering is performed by the `cluster_otus` command. \n",
    "\n",
    "The chimera filter in cluster_otus is very effective, so I generally don't do any additional filtering. It may be possible to find some additional chimeras by running uchime2_ref on the OTU sequences generated by `cluster_otus`.\n",
    "\n",
    "A typical command line is:\n",
    "\n",
    "    usearch -cluster_otus derep.fa -otus otus.fa -relabel OTU -minsize 2 -uparseout out.up \n",
    "\n",
    "`-otu_radius_pct` Default is 3.0, corresponding to a minimum identity of 97%.\n",
    "\n",
    "`-otus` specifies a FASTA filename for the OTU representative sequences. By default, labels are taken from the input file, with size annotations stripped. \n",
    "\n",
    "`-relabel` option specifies a string that is used to re-label OTUs. If -relabel xxx is specified, then the labels are xxx followed by 1, 2 ... up to the number of OTUs. OTU identifiers in the labels is required for making an OTU table using usearch_global\n",
    "\n",
    "`‑minsize 2` option discards singleton unique sequences, most of which will contain sequencer error. See discarding singletons for discussion. Most singletons will be assigned to OTUs when the OTU table is constructed, so the data is not lost.\n",
    "\n",
    "\n",
    "Optional:\n",
    "\n",
    "`‑mode high_confidence` can be used to reduce the risk of false positive chimeras.\n",
    "\n",
    "`-uparseout` specifies a tabbed text output file documenting how the input sequences were classified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:01 47Mb      6.5% 183 OTUs, 59 chimeras00:01 48Mb    100.0% 207 OTUs, 165 chimeras\n"
     ]
    }
   ],
   "source": [
    "!usearch -cluster_otus 'uparse_out/03_uniques.fa' -minsize 2 -otus 'uparse_out/04_otus.fa' -relabel Otu -log 'uparse_out/04_cluster_otus.log'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classify OTUs: SINTAX\n",
    "\n",
    "Classify the centroids (repseqs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 61Mb    100.0% Reading data/rdp_16s_v16.fa\n",
      "00:00 27Mb      0.1% Masking (fastnucleo)       00:01 27Mb     76.8% Masking (fastnucleo)00:01 27Mb    100.0% Masking (fastnucleo)\n",
      "00:02 28Mb    100.0% Word stats          00:02 28Mb     98.4% Word stats\n",
      "00:02 28Mb    100.0% Alloc rows\n",
      "00:02 99Mb      0.1% Build index00:03 99Mb     64.8% Build index00:03 99Mb    100.0% Build index\n",
      "00:03 99Mb      0.0% Initialize taxonomy data00:03 100Mb   100.0% Initialize taxonomy data\n",
      "00:03 100Mb   100.0% Building name table     \n",
      "00:03 100Mb  3172 names, tax levels min 3, avg 5.9, max 6\n",
      "00:03 133Mb     0.1% Processing00:04 163Mb    20.3% Processing00:05 163Mb    75.3% Processing00:05 163Mb   100.0% Processing\n"
     ]
    }
   ],
   "source": [
    "!usearch -sintax 'uparse_out/04_otus.fa' -db data/rdp_16s_v16.fa -strand plus -tabbedout 'uparse_out/04_otus_sintax.txt' -log 'uparse_out/04_sintax.log'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 05: Populate OTU tables with clustering at 97% identity\n",
    "\n",
    "http://drive5.com/usearch/manual/mapreadstootus.html\n",
    "\n",
    "Now that the high quality centroids have been identified the surrounding _member_ (point error) reads can then be clustered with each centroid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:01 56Mb     76.7% Reading data/rdp_16s_v16.fa00:01 61Mb    100.0% Reading data/rdp_16s_v16.fa\n",
      "00:01 28Mb      0.1% Masking (fastnucleo)       00:01 28Mb    100.0% Masking (fastnucleo)\n",
      "00:01 28Mb      0.1% Word stats          00:02 28Mb     54.6% Word stats00:02 28Mb    100.0% Word stats\n",
      "00:02 28Mb    100.0% Alloc rows\n",
      "00:02 99Mb      0.1% Build index00:03 99Mb     34.9% Build index00:04 99Mb     95.3% Build index00:04 99Mb    100.0% Build index\n",
      "00:04 132Mb     0.1% Searching 04_otus.fa, 0.0% matched00:04 166Mb   100.0% Searching 04_otus.fa, 14.5% matched\n",
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 40Mb    100.0% Reading uparse_out/04_otus.fa\n",
      "00:00 6.5Mb   100.0% Masking (fastnucleo)         \n",
      "00:00 7.5Mb   100.0% Word stats          \n",
      "00:00 7.5Mb   100.0% Alloc rows\n",
      "00:00 7.7Mb   100.0% Build index\n",
      "00:00 41Mb      0.1% Searching 01_merged.fq, 0.0% matched00:01 74Mb      6.7% Searching 01_merged.fq, 96.9% matched00:02 74Mb     41.6% Searching 01_merged.fq, 96.0% matched00:03 74Mb     92.2% Searching 01_merged.fq, 96.4% matched00:03 74Mb    100.0% Searching 01_merged.fq, 96.5% matched\n",
      "116831 / 121176 mapped to OTUs (96.4%)                    \n",
      "00:03 74Mb   Writing uparse_out/05_otutab.txt\n",
      "00:03 74Mb   Writing uparse_out/05_otutab.txt ...done.\n",
      "00:03 74Mb   Writing uparse_out/05_otutab.json\n",
      "00:03 74Mb   Writing uparse_out/05_otutab.json ...done.\n",
      "00:03 74Mb   Writing uparse_out/05_otutab.mothur\n",
      "00:03 74Mb   Writing uparse_out/05_otutab.mothur ...done.\n"
     ]
    }
   ],
   "source": [
    "!usearch -usearch_global 'uparse_out/04_otus.fa' -db 'data/rdp_16s_v16.fa' -strand plus -id 0.97 -alnout 'uparse_out/04_otus_ref.aln' -userout 'uparse_out/04_otus_ref.user' -userfields query+target+id\n",
    "\n",
    "!usearch -usearch_global 'uparse_out/01_merged.fq' -db 'uparse_out/04_otus.fa' -strand plus -id 0.97 -log 'uparse_out/05_make_otutab.log' -otutabout 'uparse_out/05_otutab.txt' -biomout 'uparse_out/05_otutab.json' -mothur_shared_out 'uparse_out/05_otutab.mothur'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 06: Remove sequencing errors: UNOISE2\n",
    "\n",
    "http://www.drive5.com/usearch/manual/cmd_unoise2.html\n",
    "\n",
    "Edgar, R.C. (2016), UNOISE2: Improved error-correction for Illumina 16S and ITS amplicon reads.http://dx.doi.org/10.1101/081257\n",
    "\n",
    "#### UNOISE algorithm\n",
    "\n",
    "**Skew**\n",
    "\n",
    "Skew is the relative abundance of member sequence $M$ (abundance $a_M$) and a centroid ($C$, abundance $a_C$).\n",
    "\n",
    "$skew(M, C) = a_M / a_C$\n",
    "\n",
    "**distance**\n",
    "\n",
    "$d$ is the Levenshtein distance (n diffs inc. substitutions and gaps) between $C$ & $M$\n",
    "\n",
    "\n",
    "**Error model**\n",
    "\n",
    "A member $M$ is is probably an erroneous sequence derived from $C$ when it has:\n",
    "\n",
    "1. It is in low enough abundance compared to $C$ => small enough $skew$ with respect to $M$ (higher $skew$ more likely an error)\n",
    "2. small enough $d$ (more differences => higher $d$ => more likely an error)\n",
    "\n",
    "This is formalised in a distribution model where $\\beta$ is the maximum skew allowed for a member ($M$) with $d$ differences:\n",
    "\n",
    "$\\beta(d) = 1 / 2^{\\alpha d + 1}$\n",
    "\n",
    "\n",
    "where: $\\alpha$ = 2 (default). \n",
    "\n",
    "$\\beta(1) = 1/8, \\beta(2) = 1/32, \\beta(3) = 1/128 ... $\n",
    "\n",
    "So if $skew(M, C) >= \\beta(d)$ then the $M$ is an error, and $\\beta$ increases exponentially with $d$.\n",
    "\n",
    "Higher $\\alpha$ means you get more \"real\" sequences but increases in the number of bad sequences that are assumed to be good. \n",
    "\n",
    "\n",
    "Where $\\alpha$ = 3. \n",
    "\n",
    "$\\beta(1) = 1/16, \\beta(2) = 1/128, \\beta(3) = 1/1024 ... $ \n",
    "\n",
    "so sequences with $d = 1$ and  $skew(M, C)$ between 1/8 and 1/16 are assumed errors when $\\alpha$ = 2, but accepted as \"real\" when $\\alpha$ = 3.\n",
    "\n",
    "---\n",
    "\n",
    "The input to UNOISE2 is the good quality, unique reads from preprocessing with size=nnn; abundance annotations. \n",
    "\n",
    "It uses a one-pass clustering that does not use quality scores (Q).\n",
    "\n",
    "It has two parameters with preset values:\n",
    "\n",
    "$\\alpha$ = 2 (default) is the sensitivity to difference in abundance with a centroid\n",
    "$\\gamma$ = 4 (default) is a threshold that sets the the minimum number of reads with a given unique sequence. A smaller $\\gamma$ will increase the number of spurious zOTUs. \n",
    "\n",
    "It is also therefore better to pool samples sa that there is more data and if they fail it is due to the $\\alpha$ nit the $\\gamma$.\n",
    "\n",
    "\n",
    "Errors are corrected as follows:\n",
    "  - Reads with sequencing error are identified and corrected.\n",
    "  - Chimeras are removed.\n",
    "  - PhiX sequences are removed.\n",
    "  \n",
    "`-fastaout` denoised sequences (\"Corrected biological sequences\"). \n",
    "\n",
    "Labels are formatted as Otunnn;Uniqlabel; where nnn is 1, 2, 3... and Uniqlabel is the label from the input file (truncated at the first semi-colon, to strip any annotations).\n",
    "\n",
    "\n",
    "`-minampsize` option specifies the minimum abundance (size= annotation) for an error-corrected amplicon. Default is 4.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 208Mb   100.0% Word stats\n",
      "00:00 208Mb   100.0% Alloc rows\n",
      "00:00 208Mb   100.0% Build index\n",
      "00:01 175Mb     0.1% Reading uparse_out/03_uniques.fa00:01 182Mb   100.0% Reading uparse_out/03_uniques.fa\n",
      "00:01 150Mb     0.1% 0 amplicons, 0 bad (size >= 9798)00:01 159Mb   100.0% 607 amplicons, 6259 bad (size >= 4)\n",
      "00:01 162Mb     0.2% 1 good, 0 chimeras                 00:02 166Mb    64.1% 254 good, 136 chimeras00:03 166Mb    89.0% 296 good, 245 chimeras00:03 166Mb   100.0% 303 good, 305 chimeras\n",
      "00:03 166Mb   100.0% Writing amplicons     \n"
     ]
    }
   ],
   "source": [
    "!usearch -unoise2 'uparse_out/03_uniques.fa' -fastaout 'uparse_out/06_denoised.fa' -log 'uparse_out/06_unoise.log' -minampsize 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 61Mb    100.0% Reading data/rdp_16s_v16.fa\n",
      "00:00 27Mb      0.1% Masking (fastnucleo)       00:01 27Mb     28.2% Masking (fastnucleo)00:01 27Mb    100.0% Masking (fastnucleo)\n",
      "00:01 28Mb      0.1% Word stats          00:02 28Mb     65.6% Word stats00:02 28Mb    100.0% Word stats\n",
      "00:02 28Mb    100.0% Alloc rows\n",
      "00:02 99Mb      0.1% Build index00:03 99Mb     34.7% Build index00:03 99Mb    100.0% Build index\n",
      "00:03 99Mb      0.0% Initialize taxonomy data00:03 100Mb   100.0% Initialize taxonomy data\n",
      "00:03 100Mb   100.0% Building name table     \n",
      "00:03 100Mb  3172 names, tax levels min 3, avg 5.9, max 6\n",
      "00:03 133Mb     0.1% Processing00:04 164Mb     6.0% Processing00:05 164Mb    56.4% Processing00:05 164Mb   100.0% Processing\n"
     ]
    }
   ],
   "source": [
    "!usearch -sintax 'uparse_out/06_denoised.fa' -db data/rdp_16s_v16.fa -strand plus -tabbedout 'uparse_out/06_otus_sintax.txt' -log 'uparse_out/06_sintax.log'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">Otu1;uniq=Uniq1;size=9798;\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGA\r\n",
      "GAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGG\r\n",
      "GTATCGAACAGG\r\n",
      ">Otu2;uniq=Uniq2;size=7743;\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCTCAAGTCAGCGGTCAAATCGC\r\n",
      "GGGGCTCAACCCCGTTCCGCCGTTGAAACTGGGAGCCTTGAGTGCGCGAGAAGTAGGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCCTACCGGCGCGCAACTGACGCTCATGCACGAAAGCGTGG\r\n",
      "GTATCGAACAGG\r\n"
     ]
    }
   ],
   "source": [
    "!head uparse_out/06_denoised.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 07: Populate OTU table\n",
    "\n",
    "Search for one (default) or a few high-identity hits to a database using the USEARCH algorithm. Alignments are global. \n",
    "\n",
    "To get more than one hit, increase `-maxaccepts` (see accept options).\n",
    "\n",
    "`‑id` specifies the identity threshold. \n",
    "\n",
    "If full-length, exact matches are required, then it is better to use search_exact than usearch_global with -id 1.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 40Mb    100.0% Reading uparse_out/06_denoised.fa\n",
      "00:00 6.7Mb   100.0% Masking (fastnucleo)             \n",
      "00:00 7.5Mb   100.0% Word stats          \n",
      "00:00 7.5Mb   100.0% Alloc rows\n",
      "00:00 7.8Mb   100.0% Build index\n",
      "00:01 74Mb     38.3% Searching 01_merged.fq, 97.4% matched00:02 74Mb     89.0% Searching 01_merged.fq, 97.8% matched00:02 74Mb    100.0% Searching 01_merged.fq, 97.8% matched\n",
      "118456 / 121176 mapped to OTUs (97.8%)                    \n",
      "00:02 74Mb   Writing uparse_out/07_otutab_den.txt\n",
      "00:02 74Mb   Writing uparse_out/07_otutab_den.txt ...done.\n",
      "00:02 74Mb   Writing uparse_out/07_otutab_den.json\n",
      "00:02 74Mb   Writing uparse_out/07_otutab_den.json ...done.\n",
      "00:02 74Mb   Writing uparse_out/07_otutab_den.mothur\n",
      "00:02 74Mb   Writing uparse_out/07_otutab_den.mothur ...done.\n"
     ]
    }
   ],
   "source": [
    "!usearch -usearch_global 'uparse_out/01_merged.fq' -db 'uparse_out/06_denoised.fa' -strand plus -id 0.97 -log 'uparse_out/07_make_otutab_den.log' -otutabout 'uparse_out/07_otutab_den.txt' -biomout 'uparse_out/07_otutab_den.json' -mothur_shared_out 'uparse_out/07_otutab_den.mothur'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## second option: Make otudb and use usearch_exact (faster)\n",
    "\n",
    "### Step 08: UNOISE\n",
    "\n",
    "`-otudbout` option specifies a FASTA filename which contains all non-chimeric unique sequences annotated with the Otu identifier of the corrected sequence. This enables generating a OTU table with `search_exact` rather than `usearch_global`, which is much faster than `usearch_global`. \n",
    "\n",
    "Sequences in the FASTA file are annotated as `Otun;Ampm;uniq=u;diffs=d`; To generate an OTU table, this file is used as a database and reads with sample identifiers are used as the query set. \n",
    "\n",
    "The `-otutabout`, `-biomout` and -`mothur_shared_out` options of `search_exact` specify OTU table file names. An OTU table can also be generated using `usearch_global`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 208Mb   100.0% Word stats\n",
      "00:00 208Mb   100.0% Alloc rows\n",
      "00:00 208Mb   100.0% Build index\n",
      "00:00 175Mb     0.1% Reading uparse_out/03_uniques.fa00:00 182Mb   100.0% Reading uparse_out/03_uniques.fa\n",
      "00:00 150Mb     0.1% 0 amplicons, 0 bad (size >= 9798)00:00 159Mb   100.0% 607 amplicons, 6259 bad (size >= 4)\n",
      "00:00 162Mb     0.2% 1 good, 0 chimeras                 00:01 166Mb    18.8% 103 good, 11 chimeras00:02 166Mb    70.4% 266 good, 162 chimeras00:03 166Mb    91.0% 298 good, 255 chimeras00:03 166Mb   100.0% 303 good, 305 chimeras\n",
      "00:03 166Mb   100.0% Writing OTU db        \n"
     ]
    }
   ],
   "source": [
    "!usearch -unoise2 'uparse_out/03_uniques.fa' -otudbout 'uparse_out/08_denoised_db.fa' -log 'uparse_out/08_unoise_db.log' -minampsize 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">Otu1;amp=Amp1;uniq=152886412;diffs=0;\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGA\r\n",
      "GAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGG\r\n",
      "GTATCGAACAGG\r\n",
      ">Otu2;amp=Amp2;uniq=152886412;diffs=0;\r\n",
      "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCTCAAGTCAGCGGTCAAATCGC\r\n",
      "GGGGCTCAACCCCGTTCCGCCGTTGAAACTGGGAGCCTTGAGTGCGCGAGAAGTAGGCGGAATGCGTGGTGTAGCGGTGA\r\n",
      "AATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCCTACCGGCGCGCAACTGACGCTCATGCACGAAAGCGTGG\r\n",
      "GTATCGAACAGG\r\n"
     ]
    }
   ],
   "source": [
    "!head 'uparse_out/08_denoised_db.fa'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classify: SINTAX"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:01 57Mb     82.3% Reading data/rdp_16s_v16.fa00:01 61Mb    100.0% Reading data/rdp_16s_v16.fa\n",
      "00:01 27Mb      0.1% Masking (fastnucleo)       00:01 27Mb    100.0% Masking (fastnucleo)\n",
      "00:01 28Mb      0.1% Word stats          00:02 28Mb     38.9% Word stats00:02 28Mb    100.0% Word stats\n",
      "00:02 28Mb    100.0% Alloc rows\n",
      "00:02 99Mb      0.1% Build index00:03 99Mb      6.8% Build index00:04 99Mb     71.1% Build index00:04 99Mb    100.0% Build index\n",
      "00:04 99Mb      0.0% Initialize taxonomy data00:04 100Mb   100.0% Initialize taxonomy data\n",
      "00:04 100Mb   100.0% Building name table     \n",
      "00:04 100Mb  3172 names, tax levels min 3, avg 5.9, max 6\n",
      "00:04 133Mb     0.1% Processing00:05 163Mb     5.1% Processing00:06 163Mb    17.6% Processing00:07 163Mb    28.4% Processing00:08 163Mb    44.4% Processing00:09 163Mb    58.7% Processing00:10 163Mb    70.8% Processing00:11 163Mb    85.0% Processing00:12 163Mb    98.5% Processing00:12 163Mb   100.0% Processing\n"
     ]
    }
   ],
   "source": [
    "!usearch -sintax 'uparse_out/08_denoised_db.fa' -db data/rdp_16s_v16.fa -strand plus -tabbedout 'uparse_out/08_otus_sintax.txt' -log 'uparse_out/08_sintax.log'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 09: Populate OTU table (search_exact)\n",
    "\n",
    "`search_exact` command searches for exact, full-length matches to a database sequence. Same as `usearch_global` with -id 1.0, but the algorithm is faster, uses less memory, and is guaranteed to find all correct matches, i.e. there are no heuristics.\n",
    "\n",
    "\n",
    "`-notmatched` output file can be used to save query sequences that are not found in the database\n",
    "\n",
    "`-dbnotmatched` option can be used to save database sequences that are not present in the query set. This enables incremental updating of dereplicated sequence sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usearch v9.2.64_i86linux32, 4.0Gb RAM (8.1Gb total), 4 cores\n",
      "(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.\n",
      "http://drive5.com/usearch\n",
      "\n",
      "License: amsa@eaaa.dk\n",
      "\n",
      "00:00 40Mb    100.0% Reading uparse_out/08_denoised_db.fa\n",
      "00:00 6.8Mb   100.0% Masking (fastnucleo)                \n",
      "00:01 69Mb     61.5% Processing          00:01 69Mb    100.0% Processing\n",
      "96236 / 121176 mapped to OTUs (79.4%)\n",
      "00:01 69Mb   Writing uparse_out/09_otutab_den.txt\n",
      "00:01 69Mb   Writing uparse_out/09_otutab_den.txt ...done.\n",
      "00:01 69Mb   Writing uparse_out/09_otutab_den.json\n",
      "00:01 69Mb   Writing uparse_out/09_otutab_den.json ...done.\n"
     ]
    }
   ],
   "source": [
    "!usearch -search_exact 'uparse_out/01_merged.fq' -db 'uparse_out/08_denoised_db.fa' -strand plus -log 'uparse_out/09_make_otutab_dendb.log' -otutabout 'uparse_out/09_otutab_den.txt' -biomout 'uparse_out/09_otutab_den.json' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Saved 2 seconds! OK it is a small dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ls: cannot access 'out': No such file or directory\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lh out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}