Here I build a hash of the expected 515F/806R amplicons from the Greengenes OTUs (at the 70%, 97%, and 99% OTU similarity thresholds) and compare the uniqueness of the sequences with the number of different taxonomic identities at each level.
There are basically three categories of sequences:
- those that are unique, and therefore can only map to a single taxon
- those that are not unique, but still only map to a single taxon
- those that are not unique, and map to multiple taxa.
The sequences in the third category are the problematic ones: we need a sequence with more taxonomic resolution to differentiate those taxa. Surprisingly, this seems to be a very small fraction (e.g., around 1.3% of the simulated amplicons map to multiple species in the 99% OTU data set), which means that this region of the 16S does give us enough resolution to differentiate species in most cases.
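The three categories can be illustrated with a small Python sketch. This is not the code in this repository; the function name `categorize`, the category labels, and the toy sequences are all made up for illustration, and it assumes each reference OTU contributes one (sequence, taxon) pair.

```python
from collections import defaultdict


def categorize(amplicons):
    """Label each distinct sequence with one of the three categories.

    amplicons: iterable of (sequence, taxon) pairs, one pair per reference OTU.
    """
    # Hash every sequence to the list of taxa it was observed with.
    taxa_by_seq = defaultdict(list)
    for seq, taxon in amplicons:
        taxa_by_seq[seq].append(taxon)

    categories = {}
    for seq, taxa in taxa_by_seq.items():
        if len(taxa) == 1:
            # Seen once, so it can only map to one taxon.
            categories[seq] = "unique"
        elif len(set(taxa)) == 1:
            # Seen several times, but always with the same taxon.
            categories[seq] = "shared, single taxon"
        else:
            # Seen with more than one taxon: the problematic case.
            categories[seq] = "shared, multiple taxa"
    return categories


# Toy input (hypothetical sequences and taxa).
example = [
    ("ACGT", "s__A"),
    ("TTGA", "s__B"),
    ("TTGA", "s__B"),
    ("GGCC", "s__C"),
    ("GGCC", "s__D"),
]
result = categorize(example)
```

Only sequences labelled "shared, multiple taxa" need a more informative region to be resolved.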
ALL CODE IN THIS REPOSITORY IS COMPLETELY UNTESTED.
Get Greengenes 13_5 and unpack it.
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_5_otus.tar.gz
tar -xzf gg_13_5_otus.tar.gz
Slice the region that would be amplified and sequenced from the Greengenes reference alignment. This step requires a working PrimerProspector installation.
slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/70_otus.fasta -P primers.txt -o sliced_reads_70/ -p 515f -r 806r
slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/97_otus.fasta -P primers.txt -o sliced_reads_97/ -p 515f -r 806r
slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/99_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r
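To make the slicing idea concrete, here is a rough Python sketch of cutting the same column range out of every aligned sequence. It is not how PrimerProspector works internally: PrimerProspector locates the primer positions itself, whereas here the `start` and `end` column indices are hypothetical values supplied by hand, and removing gap characters from the result is an assumption.

```python
def slice_alignment(aligned_seqs, start, end):
    """Return the ungapped region between alignment columns [start, end).

    aligned_seqs: dict mapping sequence id to an aligned sequence string.
    start, end: 0-based column indices (hypothetical, chosen by the caller).
    """
    sliced = {}
    for seq_id, aligned in aligned_seqs.items():
        region = aligned[start:end]
        # Drop alignment gap characters ('-' and '.') from the sliced region.
        sliced[seq_id] = region.replace("-", "").replace(".", "")
    return sliced


# Toy alignment (made-up sequences, not Greengenes data).
toy = {"otu1": "--ACG-TTA..", "otu2": "GGACGCTTAAA"}
amplicons = slice_alignment(toy, 2, 9)
```

Because all sequences share the same alignment coordinates, a single column range picks out the homologous region in each of them.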
Determine the taxonomic specificity of the amplicons at levels 1-7 (i.e., kingdom through species). A summary will be printed to the screen, and a report of the non-specific taxa at each level will be written to file (currently the output files will be named non_taxa_specific_seqs_LX.txt, where the X indicates the taxonomic level).
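As a rough illustration of what "specificity at a level" means, the sketch below truncates each lineage to its first N ranks and asks whether all copies of a sequence agree. It is a minimal sketch and not the logic of summarize_taxa_specific_reads.py: the function names are hypothetical, it assumes lineages are written as semicolon-separated ranks in the Greengenes style (k__...; p__...; ...), and its definition of the fraction may differ from the script's.

```python
def truncate_taxonomy(lineage, level):
    """Keep the first `level` ranks of a semicolon-separated lineage string."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    return tuple(ranks[:level])


def fraction_taxa_specific(records, level):
    """Fraction of distinct sequences that map to a single taxon at `level`.

    records: iterable of (sequence, lineage) pairs.
    """
    lineages_by_seq = {}
    for seq, lineage in records:
        truncated = truncate_taxonomy(lineage, level)
        lineages_by_seq.setdefault(seq, set()).add(truncated)
    # A sequence is taxa specific if all its truncated lineages are identical.
    specific = sum(1 for taxa in lineages_by_seq.values() if len(taxa) == 1)
    return specific / len(lineages_by_seq)


# Toy records (made-up sequences; lineages shortened to three ranks).
records = [
    ("ACGT", "k__Bacteria; p__Firmicutes; c__Bacilli"),
    ("ACGT", "k__Bacteria; p__Firmicutes; c__Clostridia"),
    ("TTGA", "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria"),
]
by_level = {level: fraction_taxa_specific(records, level) for level in (1, 2, 3)}
```

In this toy data the sequence ACGT is specific at levels 1 and 2 but not at level 3, which is the same pattern of decreasing specificity at deeper levels seen in the real summaries.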
Perform analysis with 70% OTUs.
summarize_taxa_specific_reads.py sliced_reads_70/70_otus_alignment_region.fasta gg_13_5_otus/taxonomy/70_otu_taxonomy.txt 1,2,3,4,5,6,7
Level  Fraction of unique that are taxa specific  Fraction of unique that are not taxa specific  Fraction of total that are unique
1      1.0                                        0.0                                            1.0
2      1.0                                        0.0                                            1.0
3      1.0                                        0.0                                            1.0
4      1.0                                        0.0                                            1.0
5      1.0                                        0.0                                            1.0
6      1.0                                        0.0                                            1.0
7      1.0                                        0.0                                            1.0
Perform analysis with 97% OTUs.
summarize_taxa_specific_reads.py sliced_reads_97/97_otus_alignment_region.fasta gg_13_5_otus/taxonomy/97_otu_taxonomy.txt 1,2,3,4,5,6,7
Level  Fraction of unique that are taxa specific  Fraction of unique that are not taxa specific  Fraction of total that are unique
1      1.0                                        0.0                                            0.877036306156
2      0.999804842209                             0.000195157790814                              0.877036306156
3      0.999586724678                             0.000413275321723                              0.877036306156
4      0.998036942222                             0.00196305777819                               0.877036306156
5      0.993008759141                             0.00699124085915                               0.877036306156
6      0.988416811122                             0.0115831888783                                0.877036306156
7      0.987073666326                             0.0129263336739                                0.877036306156
Perform analysis with 99% OTUs.
summarize_taxa_specific_reads.py sliced_reads_99/99_otus_alignment_region.fasta gg_13_5_otus/taxonomy/99_otu_taxonomy.txt 1,2,3,4,5,6,7
Level  Fraction of unique that are taxa specific  Fraction of unique that are not taxa specific  Fraction of total that are unique
1      1.0                                        0.0                                            0.729867487171
2      0.999858579192                             0.000141420807715                              0.729867487171
3      0.999717158385                             0.00028284161543                               0.729867487171
4      0.998531917329                             0.00146808267056                               0.729867487171
5      0.994666415252                             0.0053335847481                                0.729867487171
6      0.98903652024                              0.01096347976                                  0.729867487171
7      0.986820927586                             0.0131790724142                                0.729867487171
Code and notes compiled by Greg Caporaso.
From Tal: