Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Last active December 21, 2015 02:39
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gregcaporaso/6236268 to your computer and use it in GitHub Desktop.
Save gregcaporaso/6236268 to your computer and use it in GitHub Desktop.
Code and analysis notes for determine the taxonomic-specificity of a set of sequences with associated taxonomy strings. This has been tested with the Greengenes 13_5 database. See README.md for usage instructions and some analysis notes.

Taxonomic specificity of sequences in Greengenes

Here I'm creating a hash of expected 515F/806R amplicons from the Greengenes OTUs (for a couple of different sizes of OTUs), and comparing the uniqueness of sequences with the number of different taxonomic identities at each level.

There are basically three categories of sequences:

  1. those that are unique, and therefore can only map to a single taxa
  2. those that are not unique, but still only map to a single taxa
  3. those that are not unique, and map to multiple taxa.

The reads in category 3 are the problematic ones: we need a sequence with more taxonomic resolution to differentiate taxa. Surprisingly, this seems to be a very small fraction (e.g., around 1.3% of the simulated amplicons map to multiple species in the 99% OTU data set) which means that this region of the 16S does give us resolution to differentiate species in most cases.

ALL CODE IN THIS REPOSITORY IS COMPLETELY UNTESTED.

Analysis

Get Greengenes 13_5 and unpack it.

wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_5_otus.tar.gz
tar -xzf gg_13_5_otus.tar.gz

Slicing the region that would be amplified/sequence from the greengenes reference file this steps requires a working PrimerProspector installation.

slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/70_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r

slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/97_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r

slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/99_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r

Determine the taxonomic specificity of the amplicons at levels 1-7 (i.e., kingdom through species). A summary will be printed to the screen, and a report of the non-specific taxa at each level will be written to file (currently the output files will be named non_taxa_specific_seqs_LX.txt where the X indicates the taxonomic level).

Perform analysis with 70% OTUs.

summarize_taxa_specific_reads.py sliced_reads_70/70_otus_alignment_region.fasta gg_13_5_otus/taxonomy/70_otu_taxonomy.txt 1,2,3,4,5,6,7

Level	Fraction of unique that are taxa specific	Fraction of unique that are not taxa specific	Fraction of total that are unique
1	1.0	0.0	1.0
2	1.0	0.0	1.0
3	1.0	0.0	1.0
4	1.0	0.0	1.0
5	1.0	0.0	1.0
6	1.0	0.0	1.0
7	1.0	0.0	1.0

Perform analysis with 97% OTUs.

summarize_taxa_specific_reads.py sliced_reads_97/97_otus_alignment_region.fasta gg_13_5_otus/taxonomy/97_otu_taxonomy.txt 1,2,3,4,5,6,7

Level	Fraction of unique that are taxa specific	Fraction of unique that are not taxa specific	Fraction of total that are unique
1	1.0	0.0	0.877036306156
2	0.999804842209	0.000195157790814	0.877036306156
3	0.999586724678	0.000413275321723	0.877036306156
4	0.998036942222	0.00196305777819	0.877036306156
5	0.993008759141	0.00699124085915	0.877036306156
6	0.988416811122	0.0115831888783	0.877036306156
7	0.987073666326	0.0129263336739	0.877036306156

Perform analysis with 99% OTUs.

summarize_taxa_specific_reads.py sliced_reads_99/99_otus_alignment_region.fasta gg_13_5_otus/taxonomy/99_otu_taxonomy.txt 1,2,3,4,5,6,7

Level	Fraction of unique that are taxa specific	Fraction of unique that are not taxa specific	Fraction of total that are unique
1	1.0	0.0	0.729867487171
2	0.999858579192	0.000141420807715	0.729867487171
3	0.999717158385	0.00028284161543	0.729867487171
4	0.998531917329	0.00146808267056	0.729867487171
5	0.994666415252	0.0053335847481	0.729867487171
6	0.98903652024	0.01096347976	0.729867487171
7	0.986820927586	0.0131790724142	0.729867487171

Code and notes compiled by Greg Caporaso.

scikit-bio

515f GTGCCAGCMGCCGCGGTAA
806r GGACTACVSGGGTATCTAAT
#!/usr/bin/env python
from __future__ import division
from sys import argv
from skbio import parse_fasta
import pandas as pd
def generate_id_to_tax_lookup(lines,level):
result = {}
for line in lines:
id_, tax = line.strip().split('\t')
result[id_] = ';'.join(tax.split(';')[:level])
return result
def generate_seq_to_taxa_lookup(seqs,id_to_tax):
result = {}
for seq_id, seq in seqs:
try:
result[seq].append(id_to_tax[seq_id])
except KeyError:
result[seq] = [id_to_tax[seq_id]]
return result
def main():
fasta_fp = argv[1]
taxa_map_fp = argv[2]
levels = map(int,argv[3].split(','))
columns = ["Fraction of unique that are taxa specific",
"Fraction of unique that are not taxa specific",
"Fraction of total that are unique"]
results = []
for level in levels:
id_to_tax = generate_id_to_tax_lookup(open(taxa_map_fp,'U'),level)
seq_to_taxa = generate_seq_to_taxa_lookup(parse_fasta(open(fasta_fp,'U')),
id_to_tax)
total_seqs = len(list(parse_fasta(open(fasta_fp,'U'))))
unique_seqs = len(seq_to_taxa)
taxa_specific = 0
non_taxa_specific = 0
non_taxa_specific_seqs_f = open('non_taxa_specific_seqs_L%d.txt' % level, 'w')
for seq, taxa in seq_to_taxa.items():
if len(set(taxa)) == 1:
taxa_specific += 1
else:
non_taxa_specific += 1
non_taxa_specific_seqs_f.write('\t'.join(taxa + [seq]))
non_taxa_specific_seqs_f.write('\n')
non_taxa_specific_seqs_f.close()
results.append([taxa_specific / unique_seqs,
non_taxa_specific / unique_seqs,
unique_seqs / total_seqs])
df = pd.DataFrame(results, columns=columns, index=levels)
print df
if __name__ == "__main__":
main()
@gregcaporaso
Copy link
Author

From Tal:

So maybe add some notes about next steps:

  1.   What taxa and what degree of resolution are in that 3rd category.
    
  2.   Pull out sequences associated with the most common niches – gut, skin, aquatic etc…
    

a. What % of reads are in that 3rd category?
b. What are those taxa and what degree of resolution do we get?

@walterst
Copy link

You want to be sure to use the pynast aligned Greengenes (gg_13_5_pynast.fasta.gz from http://greengenes.secondgenome.com/downloads/database/13_5) reference sequences when using slice_aligned_region.py (or any process where you want the unaligned sequences to match the degapped aligned regions).

@wasade
Copy link

wasade commented Mar 11, 2015

@walterst @gregcaporaso, good spot. rep_set_aligned are ssu-aligned, not pynast. The issue is that ssu-align will drop internal positions, so those sequences (unaligned) are not the same as the files used for when picking OTUs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment