gregcaporaso/cloud_demo_complete.ipynb

## cloud_demo_complete.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              cloud_demo_complete.ipynb
            
          
      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## cloud_demo_fast.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              cloud_demo_fast.ipynb
            
          
      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## pearson_v_rf.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              pearson_v_rf.ipynb
            
          
      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## timing.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              timing.ipynb
            
          
      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## v3_v4_regions_only.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              v3_v4_regions_only.ipynb
            
          
      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## variable_region_position_calculations.ipynb
{
 "metadata": {
  "name": "Variable Region Position Boundaries"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": "This notebook is intended to calculate the positions of primers in an alignment, using functions from PrimerProspector."
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Import the needed functions, and define the primer sequences"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# Code modified from PrimerProspector library slice_aligned_region.py (development version)\n\n# Imports and definitions\nfrom string import lower, upper\nfrom operator import itemgetter\n\nfrom cogent import LoadSeqs, DNA\nfrom cogent.core.alphabet import AlphabetError\nfrom cogent.align.align import make_dna_scoring_dict, local_pairwise\nfrom cogent.parse.fasta import MinimalFastaParser\nfrom cogent.core.moltype import IUPAC_DNA_ambiguities\n\nDNA_CODES = ['A', 'C', 'T', 'G', 'R', 'Y', 'M', 'K',\n 'W', 'S', 'B', 'D', 'H', 'V', 'N']\n\n\n# Note that these are all written 5'->3', the reverse primers are reverse complemented for the local alignment\n\n# If one wanted to test different primers, they would be defined here.\n\n# 27f/338r = V2 (also includes V1, but generally just referred to as V2)\n# 349f/534r = V3\n# 515f/806r = V4\n# 967f/1046r = V6\n# 1391f/1492r = V9\n\nprimer_seqs = {\n           '27f':'AGAGTTTGATCMTGGCTCAG',\n           '338r':DNA.rc('GCTGCCTCCCGTAGGAGT'),\n           '349f':'GYGCASCAGKCGMGAAW',\n           '534r':DNA.rc('ATTACCGCGGCTGCTGG'),\n           '515f':'GTGCCAGCMGCCGCGGTAA',\n           '806r':DNA.rc('GGACTACVSGGGTATCTAAT'),\n           '967f':'CAACGCGAAGAACCTTACC',\n           '1048r':DNA.rc('CGRCRGCCATGYACCWC'),\n           '1391f':'TGYACACACCGCCCGTC',\n           '1492r':DNA.rc('GGCTACCTTGTTACGACTT'),\n           '1391r':'TGYACACACCGCCCGTC' # Need this rather than forward primer to get proper 3' position of reverse version\n           }\n\nreference_aligned_file = '/home/ubuntu/qiime_software/gg_otus-4feb2011-release/rep_set/gg_76_otus_4feb2011_aligned.fasta'",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Functions from the PrimerProspector code are here.  Should only be calling get_primer_indices_counts and get_final_indices directly to get aligned indices.  Takes a while to run, as this has to do many local alignments for each primer (and takes the mode of the index in the alignment as the most accurate aligned index)."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "\n# Begin functions\n\n\"\"\" This library was written for the purpose of slicing out a region of a reference sequence set to \nmatch a target region amplified by a set of primers to improve taxonomic classification as seen in\nWerner et al.  It has been modified here to simply return the positions (3' end of the primer) in the\nalignment of the given primer pair, as well as the position of simulated short reads of 150, 250, and\n400 base pairs.  Note that some of these reads are larger than the size of the amplicon itself, and in\nthis case the simulated read simply isn't created.\"\"\"\n\ndef match_scorer_ambigs(match=1,\n                       mismatch=-1,\n                       matches=None):\n    \"\"\" Alternative scorer factory for sw_align, allows match to ambiguous chars\n\n    It allows for matching to ambiguous characters which is useful for \n     primer/sequence matching. Not sure what should happen with gaps, but they\n     shouldn't be passed to this function anyway. Currently a gap will only match\n     a gap.\n\n    match and mismatch should both be numbers. Typically, match should be \n    positive and mismatch should be negative.\n\n    Resulting function has signature f(x,y) -> number.\n    \n    match: score for nucleotide match\n    mismatch: score for nucleotide mismatch\n    matches: dictionary for matching nucleotides, including degenerate bases\n    \"\"\"\n    \n    matches = matches or \\\n     {'A':{'A':None},'G':{'G':None},'C':{'C':None},\\\n      'T':{'T':None},'-':{'-':None}}\n    for ambig, chars in IUPAC_DNA_ambiguities.items():\n        try:\n            matches[ambig].update({}.fromkeys(chars))\n        except KeyError:\n            matches[ambig] = {}.fromkeys(chars)\n        \n        for char in chars:\n            try:\n                matches[char].update({ambig:None})\n            except KeyError:\n                matches[char] = {ambig:None}\n            \n    def scorer(x, y):\n        # need a better way to disallow unknown characters (could\n        # try/except for a KeyError on the next step, but that would only \n        # test one of the characters)\n        if x not in matches or y not in matches:\n            raise ValueError, \"Unknown character: %s or %s\" % (x,y)\n        if y in matches[x]:\n            return match\n        else:\n            return mismatch\n    return scorer\n\n\ndef get_primer_mismatches(primer_hit,\n                          target_hit,\n                          sw_scorer=match_scorer_ambigs(1, -1)):\n    \"\"\" Gets mismatches for a given primer sequence and target hit\n    \n    Specifically this returns total mismatches\n    \n    primer_hit: Alignment object for primer, normally matches primer unless\n     gaps were used in the alignment\n    target_hit: Alignment object for segment of sequence where primer was\n     aligned to.  Can contain gaps.\n    sw_scorer: Gives scores for mismatches, gap insertions in alignment.\n    \n    \"\"\"\n\n    # Sum mismatches\n    mismatches = 0\n    for i in range(len(primer_hit)):\n        # using the scoring function to check for\n        # matches, but might want to just access the dict\n        if sw_scorer(target_hit[i], primer_hit[i]) == -1: \n            mismatches += 1\n            \n    return mismatches\n\ndef get_aligned_pos(unaligned_index, aligned_seq):\n    \"\"\" Returns aligned index given index in degapped sequence\n    \n    unaligned_index: int with index in unaligned sequence\n    aligned_seq: aligned sequence to find index\"\"\"\n    \n    nt_counter = 0\n    total_counter = 0\n    \n    for nt in aligned_seq:\n        if nt in DNA_CODES:\n            nt_counter += 1\n        total_counter += 1\n        if nt_counter == unaligned_index:\n            break\n            \n    return total_counter\n\ndef pair_hmm_align_unaligned_seqs(seqs,\n                                  moltype=DNA,\n                                  params={}):\n    \"\"\"\n        Handles pairwise alignment of given sequence pair\n        \n        seqs: list of [primer, target sequence] in string format\n        moltype: molecule type tested.  Only DNA supported.\n        params: Used to set parameters for opening, extending gaps  and score\n         matrix if something other than the default given in this function \n         is desired.\n    \"\"\"\n    \n    seqs = LoadSeqs(data=seqs,moltype=moltype,aligned=False)\n    try:\n        s1, s2 = seqs.values()\n    except ValueError:\n        raise ValueError,\\\n         \"Pairwise aligning of seqs requires exactly two seqs.\"\n    \n    try:\n        gap_open = params['gap_open']\n    except KeyError:\n        gap_open = 5\n    try:\n        gap_extend = params['gap_extend']\n    except KeyError:\n        gap_extend = 2\n    try:\n        score_matrix = params['score_matrix']\n    except KeyError:\n        score_matrix = make_dna_scoring_dict(\\\n         match=1, transition=-1, transversion=-1)\n    \n    return local_pairwise(s1, s2, score_matrix, gap_open, gap_extend)\n\ndef local_align_primer_seq(primer,\n                           sequence):\n    \"\"\"Perform local alignment of primer and sequence\n    \n        primer: Current primer being tested\n        sequence: Current sequence\n        \n        Returns the Alignment object primer sequence and target sequence, \n         and the start position in sequence of the hit.\n    \"\"\"\n\n\n    query_sequence = sequence\n     \n    # Get alignment object from primer, target sequence\n    alignment = pair_hmm_align_unaligned_seqs([primer,query_sequence])\n\n    # Extract sequence of primer, target site, may have gaps in insertions\n    # or deletions have occurred.\n    primer_hit = str(alignment.Seqs[0])\n    target_hit = str(alignment.Seqs[1])\n    \n    # Get index of primer hit in target sequence.\n    try:\n        hit_start = query_sequence.index(target_hit.replace('-',''))\n    except ValueError:\n        raise ValueError,('substring not found, query string %s, target_hit %s'\\\n         % (query_sequence, target_hit))\n         \n    \n    return primer_hit, target_hit, hit_start\n\ndef get_primer_indices_counts(curr_fasta,\n                              f_primer_seq,\n                              r_primer_seq,\n                              mismatch_threshold=3):\n    \"\"\" Gets counts of primer hit indices for forward and reverse primer\n    \n    curr_fasta:  current aligned fasta filepath\n    f_primer_seq: forward primer sequence\n    r_primer_seq: reverse primer sequence (already reverse complemented)\n    mismatch_threshold: used to determine which primer hits are counted for\n     getting the final aligned index\"\"\"\n    \n    f_primer_indices = {}\n    r_primer_indices = {}\n    \n    for label, seq in MinimalFastaParser(open(curr_fasta), \"U\"):\n            \n            unaligned_seq = upper(seq.replace(\".\",\"\").replace(\"-\",\"\"))\n            \n            primer_hit, target_hit, hit_start = \\\n             local_align_primer_seq(f_primer_seq, unaligned_seq)\n             \n            # get mismatches\n            mismatches = get_primer_mismatches(primer_hit, target_hit)\n            \n            if mismatches <= mismatch_threshold:\n                # Correction for forward primer to get 3' position\n                aligned_pos = get_aligned_pos(hit_start + len(target_hit), seq)\n                \n                try:\n                    f_primer_indices[aligned_pos] += 1\n                except KeyError:\n                    f_primer_indices[aligned_pos] = 0\n            \n            primer_hit, target_hit, hit_start = \\\n             local_align_primer_seq(r_primer_seq, unaligned_seq)\n             \n            # get mismatches\n            mismatches = get_primer_mismatches(primer_hit, target_hit)\n            \n            if mismatches <= mismatch_threshold:\n                # No correction for reverse primer hit\n                aligned_pos = get_aligned_pos(hit_start, seq)\n                \n                try:\n                    r_primer_indices[aligned_pos] += 1\n                except KeyError:\n                    r_primer_indices[aligned_pos] = 0\n                    \n    return f_primer_indices, r_primer_indices\n\ndef get_final_indices(f_primer_indices,\n                      r_primer_indices):\n    \"\"\" Sorts, retrieves mode of aligning primer hits indices\n    \n    f_primer_indices:  dictionary of primer_indices:counts\n    r_primer_indices:  dictionary of primer_indices:counts\n    \"\"\"\n    \n    f_primer_final = []\n    r_primer_final = []\n    \n    for index in f_primer_indices:\n        f_primer_final.append((index, f_primer_indices[index]))\n    f_primer_final.sort(key=itemgetter(1), reverse=True)\n    f_primer_index = f_primer_final[0][0]\n\n    for index in r_primer_indices:\n        r_primer_final.append((index, r_primer_indices[index]))\n    r_primer_final.sort(key=itemgetter(1), reverse=True)\n    r_primer_index = r_primer_final[0][0]\n    \n    \n    return f_primer_index, r_primer_index\n\nprimer_indices = {}\n\nprimer_f_indices, primer_r_indices = get_primer_indices_counts(reference_aligned_file, primer_seqs['27f'], primer_seqs['338r'])\nprimer_indices['27f'], primer_indices['338r'] = get_final_indices(primer_f_indices, primer_r_indices)\n\nprimer_f_indices, primer_r_indices = get_primer_indices_counts(reference_aligned_file, primer_seqs['349f'], primer_seqs['534r'])\nprimer_indices['349f'], primer_indices['534r'] = get_final_indices(primer_f_indices, primer_r_indices)\n\nprimer_f_indices, primer_r_indices = get_primer_indices_counts(reference_aligned_file, primer_seqs['515f'], primer_seqs['806r'])\nprimer_indices['515f'], primer_indices['806r'] = get_final_indices(primer_f_indices, primer_r_indices)\n\nprimer_f_indices, primer_r_indices = get_primer_indices_counts(reference_aligned_file, primer_seqs['967f'], primer_seqs['1048r'])\nprimer_indices['967f'], primer_indices['1048r'] = get_final_indices(primer_f_indices, primer_r_indices)\n\nprimer_f_indices, primer_r_indices = get_primer_indices_counts(reference_aligned_file, primer_seqs['1391f'], primer_seqs['1492r'])\nprimer_indices['1391f'], primer_indices['1492r'] = get_final_indices(primer_f_indices, primer_r_indices)\n\nprimer_f_indices, primer_r_indices = get_primer_indices_counts(reference_aligned_file, primer_seqs['515f'], primer_seqs['1391r'])\nprimer_indices['515f'], primer_indices['1391r'] = get_final_indices(primer_f_indices, primer_r_indices)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Code for getting 150, 250, and 400 base pair \"read\" indices from the previously calculated primer indices.\nThere will be some duplicates, such as the forward 150 reads from v2 and v2.v3 regions.  We manually deleted the duplicates for our demo."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# Need to manually add the full length read, ('full.length', 0, 7682), which covers the entire Greengenes alignment.\n\nregion_boundaries = [\n      ('v2', primer_indices['27f'], primer_indices['338r']),\n      ('v2.v3', primer_indices['27f'], primer_indices['534r']),\n      ('v2.v4', primer_indices['27f'], primer_indices['806r']),\n      ('v2.v6', primer_indices['27f'], primer_indices['1048r']),\n      ('v2.v8', primer_indices['27f'], primer_indices['1391r']),\n      ('v2.v9', primer_indices['27f'], primer_indices['1492r']),\n      ('v3', primer_indices['349f'], primer_indices['534r']),\n      ('v3.v4', primer_indices['349f'], primer_indices['806r']),\n      ('v3.v6', primer_indices['349f'], primer_indices['1048r']),\n      ('v3.v8', primer_indices['349f'], primer_indices['1391r']),\n      ('v3.v9', primer_indices['349f'], primer_indices['1492r']),\n      ('v4', primer_indices['515f'], primer_indices['806r']),\n      ('v4.v6', primer_indices['515f'], primer_indices['1048r']),\n      ('v4.v8', primer_indices['515f'], primer_indices['1391r']),\n      ('v4.v9', primer_indices['515f'], primer_indices['1492r']),\n      ('v6', primer_indices['967f'], primer_indices['1048r']),\n      ('v6.v8', primer_indices['967f'], primer_indices['1391r']),\n      ('v6.v9', primer_indices['967f'], primer_indices['1492r']),\n      ('v9', primer_indices['1391f'],  primer_indices['1492r'])\n]\n\nregion_index = 0\nstart_index = 1\nend_index = 2\n\n# Print in easy format to generate a list\nfor curr_region in region_boundaries:\n    print \"('%s', %d, %d),\" % (curr_region[region_index], curr_region[start_index], curr_region[end_index])\n    \n# Manually print full length alignment\nprint \"('full.length', 0, 7682),\"\n\nix_sizes = [150, 250, 400]\n\nnames_ix = ['.150', '.250', '.400']\n\n# We used the first sequence in the 97% clustered file for the demonstration, so using the\n# same file here to recreate exact results\nreference97_aligned_file = '/home/ubuntu/qiime_software/gg_otus-4feb2011-release/rep_set/gg_97_otus_4feb2011_aligned.fasta'\n\nf = open(reference97_aligned_file, \"U\")\n\nfor label, seq in MinimalFastaParser(f):\n    curr_seq = seq\n    break\n    \nf.close()\n\nnts = ['A', 'T', 'C', 'G', 'N']\n\n\nfor curr_region in region_boundaries:\n\n    for forward_ix in range(len(ix_sizes)):\n        curr_name = curr_region[region_index]\n        curr_name += names_ix[forward_ix]\n        \n        start_ix = curr_region[start_index]\n        end_ix = curr_region[end_index]\n\n        curr_slice = curr_seq[start_ix:end_ix]\n\n        \n        counter = 0\n        target_count = ix_sizes[forward_ix]\n        \n        curr_forward_ix = start_ix\n        for nt in curr_slice:\n            if nt in nts:\n                counter += 1\n            curr_forward_ix += 1\n            # Skip if amplicon is smaller than current read size\n            if curr_forward_ix == end_ix:\n                break\n            if counter == target_count:\n                print \"('%s', %d, %d),\" % (curr_name, start_ix, curr_forward_ix)\n                break ",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "('v2', 136, 1868),\n('v2.v3', 136, 2232),\n('v2.v4', 136, 4051),\n('v2.v6', 136, 4932),\n('v2.v8', 136, 6426),\n('v2.v9', 136, 6791),\n('v3', 1916, 2232),\n('v3.v4', 1916, 4051),\n('v3.v6', 1916, 4932),\n('v3.v8', 1916, 6426),\n('v3.v9', 1916, 6791),\n('v4', 2263, 4051),\n('v4.v6', 2263, 4932),\n('v4.v8', 2263, 6426),\n('v4.v9', 2263, 6791),\n('v6', 4653, 4932),\n('v6.v8', 4653, 6426),\n('v6.v9', 4653, 6791),\n('v9', 6450, 6791),\n('full.length', 0, 7682),\n('v2.150', 136, 702),\n('v2.250', 136, 1752),\n('v2.v3.150', 136, 702),\n('v2.v3.250', 136, 1752),\n('v2.v3.400', 136, 2036),\n('v2.v4.150', 136, 702),\n('v2.v4.250', 136, 1752),\n('v2.v4.400', 136, 2036),\n('v2.v6.150', 136, 702),\n('v2.v6.250', 136, 1752),\n('v2.v6.400', 136, 2036),\n('v2.v8.150', 136, 702),\n('v2.v8.250', 136, 1752),\n('v2.v8.400', 136, 2036),\n('v2.v9.150', 136, 702),\n('v2.v9.250', 136, 1752),\n('v2.v9.400', 136, 2036),\n('v3.v4.150', 1916, 2235),\n('v3.v4.250', 1916, 2493),\n('v3.v4.400', 1916, 4014),\n('v3.v6.150', 1916, 2235),\n('v3.v6.250', 1916, 2493),\n('v3.v6.400', 1916, 4014),\n('v3.v8.150', 1916, 2235),\n('v3.v8.250', 1916, 2493),\n('v3.v8.400', 1916, 4014),\n('v3.v9.150', 1916, 2235),\n('v3.v9.250', 1916, 2493),\n('v3.v9.400', 1916, 4014),\n('v4.150', 2263, 3794),\n('v4.250', 2263, 4046),\n('v4.v6.150', 2263, 3794),\n('v4.v6.250', 2263, 4046),\n('v4.v6.400', 2263, 4574),\n('v4.v8.150', 2263, 3794),\n('v4.v8.250', 2263, 4046),\n('v4.v8.400', 2263, 4574),\n('v4.v9.150', 2263, 3794),\n('v4.v9.250', 2263, 4046),\n('v4.v9.400', 2263, 4574),\n('v6.v8.150', 4653, 5085),\n('v6.v8.250', 4653, 5903),\n('v6.v8.400', 4653, 6419),\n('v6.v9.150', 4653, 5085),\n('v6.v9.250', 4653, 5903),\n('v6.v9.400', 4653, 6419),\n"
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Example output and curation for final list of indices"
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": "Here is the raw output of the previous steps that can be easily put into the list base_region_boundaries in the main notebook:\n('v2', 136, 1868),\n('v2.v3', 136, 2232),\n('v2.v4', 136, 4051),\n('v2.v6', 136, 4932),\n('v2.v8', 136, 6426),\n('v2.v9', 136, 6791),\n('v3', 1916, 2232),\n('v3.v4', 1916, 4051),\n('v3.v6', 1916, 4932),\n('v3.v8', 1916, 6426),\n('v3.v9', 1916, 6791),\n('v4', 2263, 4051),\n('v4.v6', 2263, 4932),\n('v4.v8', 2263, 6426),\n('v4.v9', 2263, 6791),\n('v6', 4653, 4932),\n('v6.v8', 4653, 6426),\n('v6.v9', 4653, 6791),\n('v9', 6450, 6791),\n('full.length', 0, 7682),\n('v2.150', 136, 702),\n('v2.250', 136, 1752),\n('v2.v3.150', 136, 702),\n('v2.v3.250', 136, 1752),\n('v2.v3.400', 136, 2036),\n('v2.v4.150', 136, 702),\n('v2.v4.250', 136, 1752),\n('v2.v4.400', 136, 2036),\n('v2.v6.150', 136, 702),\n('v2.v6.250', 136, 1752),\n('v2.v6.400', 136, 2036),\n('v2.v8.150', 136, 702),\n('v2.v8.250', 136, 1752),\n('v2.v8.400', 136, 2036),\n('v2.v9.150', 136, 702),\n('v2.v9.250', 136, 1752),\n('v2.v9.400', 136, 2036),\n('v3.v4.150', 1916, 2235),\n('v3.v4.250', 1916, 2493),\n('v3.v4.400', 1916, 4014),\n('v3.v6.150', 1916, 2235),\n('v3.v6.250', 1916, 2493),\n('v3.v6.400', 1916, 4014),\n('v3.v8.150', 1916, 2235),\n('v3.v8.250', 1916, 2493),\n('v3.v8.400', 1916, 4014),\n('v3.v9.150', 1916, 2235),\n('v3.v9.250', 1916, 2493),\n('v3.v9.400', 1916, 4014),\n('v4.150', 2263, 3794),\n('v4.250', 2263, 4046),\n('v4.v6.150', 2263, 3794),\n('v4.v6.250', 2263, 4046),\n('v4.v6.400', 2263, 4574),\n('v4.v8.150', 2263, 3794),\n('v4.v8.250', 2263, 4046),\n('v4.v8.400', 2263, 4574),\n('v4.v9.150', 2263, 3794),\n('v4.v9.250', 2263, 4046),\n('v4.v9.400', 2263, 4574),\n('v6.v8.150', 4653, 5085),\n('v6.v8.250', 4653, 5903),\n('v6.v8.400', 4653, 6419),\n('v6.v9.150', 4653, 5085),\n('v6.v9.250', 4653, 5903),\n('v6.v9.400', 4653, 6419),\n\nBut there are some duplicate regions here, such as v2.150 and v2.v3.150. These can be manually removed to get this:\n('v2', 136, 1868),\n('v2.v3', 136, 2232),\n('v2.v4', 136, 4051),\n('v2.v6', 136, 4932),\n('v2.v8', 136, 6426),\n('v2.v9', 136, 6791),\n('v3', 1916, 2232),\n('v3.v4', 1916, 4051),\n('v3.v6', 1916, 4932),\n('v3.v8', 1916, 6426),\n('v3.v9', 1916, 6791),\n('v4', 2263, 4051),\n('v4.v6', 2263, 4932),\n('v4.v8', 2263, 6426),\n('v4.v9', 2263, 6791),\n('v6', 4653, 4932),\n('v6.v8', 4653, 6426),\n('v6.v9', 4653, 6791),\n('v9', 6450, 6791),\n('full.length', 0, 7682),\n('v2.150', 136, 702),\n('v2.250', 136, 1752),\n('v2.v3.400', 136, 2036),\n('v3.v4.150', 1916, 2235),\n('v3.v4.250', 1916, 2493),\n('v3.v4.400', 1916, 4014),\n('v4.150', 2263, 3794),\n('v4.250', 2263, 4046),\n('v4.v6.400', 2263, 4574),\n('v6.v8.150', 4653, 5085),\n('v6.v8.250', 4653, 5903),\n('v6.v8.400', 4653, 6419)\n\nWhich is the same subset of sequences used to define base_region_boundaries (copied from the main demonstration notebook):\n      ('v2', 136, 1868), #27f-338r\n      ('v2.v3', 136, 2232),\n      ('v2.v4', 136, 4051),\n      ('v2.v6', 136, 4932),\n      ('v2.v8', 136, 6426),\n      ('v2.v9', 136, 6791),\n      ('v3', 1916, 2232), #349f-534r\n      ('v3.v4', 1916, 4051),\n      ('v3.v6', 1916, 4932),\n      ('v3.v8', 1916, 6426),\n      ('v3.v9', 1916, 6791),\n      ('v4', 2263, 4051), #515f-806r\n      ('v4.v6', 2263, 4932),\n      ('v4.v8', 2263, 6426),\n      ('v4.v9', 2263, 6791),\n      ('v6', 4653, 4932), #967f-1048r\n      ('v6.v8', 4653, 6426),\n      ('v6.v9', 4653, 6791),\n      ('v9', 6450, 6791), #1391f-1492r\n      ('full.length', 0, 7682),  # Start 150, 250, 400 base pair reads\n      ('v2.150', 136, 702),\n      ('v2.250', 136, 1752),\n      ('v2.v3.400', 136, 2036), # Skips reads that are larger than amplicon size\n      ('v3.v4.150', 1916, 2235),\n      ('v3.v4.250', 1916, 2493),\n      ('v3.v4.400', 1916, 4014),\n      ('v4.150', 2263, 3794),\n      ('v4.250', 2263, 4046),\n      ('v4.v6.400', 2263, 4574),\n      ('v6.v8.150', 4653, 5085),\n      ('v6.v8.250', 4653, 5903),\n      ('v6.v8.400', 4653, 6419)"
    }
   ],
   "metadata": {}
  }
 ]
}