gregcaporaso/README.md

## README.md

      
    Raw
  

              README.md
            
          
    Taxonomic specificity of sequences in Greengenes

Here I'm creating a hash of expected 515F/806R amplicons from the Greengenes OTUs (for a couple of different sizes of OTUs), and comparing the uniqueness of sequences with the number of different taxonomic identities at each level.
There are basically three categories of sequences:

those that are unique, and therefore can only map to a single taxa
those that are not unique, but still only map to a single taxa
those that are not unique, and map to multiple taxa.

The reads in category 3 are the problematic ones: we need a sequence with more taxonomic resolution to differentiate taxa. Surprisingly, this seems to be a very small fraction (e.g., around 1.3% of the simulated amplicons map to multiple species in the 99% OTU data set) which means that this region of the 16S does give us resolution to differentiate species in most cases.
ALL CODE IN THIS REPOSITORY IS COMPLETELY UNTESTED.
Analysis

Get Greengenes 13_5 and unpack it.
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_5_otus.tar.gz
tar -xzf gg_13_5_otus.tar.gz

Slicing the region that would be amplified/sequence from the greengenes reference file this steps requires a working PrimerProspector installation.
slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/70_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r

slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/97_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r

slice_aligned_region.py -f gg_13_5_otus/rep_set_aligned/99_otus.fasta -P primers.txt -o sliced_reads_99/ -p 515f -r 806r

Determine the taxonomic specificity of the amplicons at levels 1-7 (i.e., kingdom through species). A summary will be printed to the screen, and a report of the non-specific taxa at each level will be written to file (currently the output files will be named non_taxa_specific_seqs_LX.txt where the X indicates the taxonomic level).
Perform analysis with 70% OTUs.
summarize_taxa_specific_reads.py sliced_reads_70/70_otus_alignment_region.fasta gg_13_5_otus/taxonomy/70_otu_taxonomy.txt 1,2,3,4,5,6,7

Level	Fraction of unique that are taxa specific	Fraction of unique that are not taxa specific	Fraction of total that are unique
1	1.0	0.0	1.0
2	1.0	0.0	1.0
3	1.0	0.0	1.0
4	1.0	0.0	1.0
5	1.0	0.0	1.0
6	1.0	0.0	1.0
7	1.0	0.0	1.0

Perform analysis with 97% OTUs.
summarize_taxa_specific_reads.py sliced_reads_97/97_otus_alignment_region.fasta gg_13_5_otus/taxonomy/97_otu_taxonomy.txt 1,2,3,4,5,6,7

Level	Fraction of unique that are taxa specific	Fraction of unique that are not taxa specific	Fraction of total that are unique
1	1.0	0.0	0.877036306156
2	0.999804842209	0.000195157790814	0.877036306156
3	0.999586724678	0.000413275321723	0.877036306156
4	0.998036942222	0.00196305777819	0.877036306156
5	0.993008759141	0.00699124085915	0.877036306156
6	0.988416811122	0.0115831888783	0.877036306156
7	0.987073666326	0.0129263336739	0.877036306156

Perform analysis with 99% OTUs.
summarize_taxa_specific_reads.py sliced_reads_99/99_otus_alignment_region.fasta gg_13_5_otus/taxonomy/99_otu_taxonomy.txt 1,2,3,4,5,6,7

Level	Fraction of unique that are taxa specific	Fraction of unique that are not taxa specific	Fraction of total that are unique
1	1.0	0.0	0.729867487171
2	0.999858579192	0.000141420807715	0.729867487171
3	0.999717158385	0.00028284161543	0.729867487171
4	0.998531917329	0.00146808267056	0.729867487171
5	0.994666415252	0.0053335847481	0.729867487171
6	0.98903652024	0.01096347976	0.729867487171
7	0.986820927586	0.0131790724142	0.729867487171

Code and notes compiled by Greg Caporaso.


## primers.txt
515f	GTGCCAGCMGCCGCGGTAA
806r	GGACTACVSGGGTATCTAAT

## summarize_taxa_specific_reads.py
#!/usr/bin/env python
from __future__ import division

from sys import argv

from skbio import parse_fasta
import pandas as pd


def generate_id_to_tax_lookup(lines,level):
    result = {}
    for line in lines:
        id_, tax = line.strip().split('\t')
        result[id_] = ';'.join(tax.split(';')[:level])
    return result

def generate_seq_to_taxa_lookup(seqs,id_to_tax):
    result = {}
    for seq_id, seq in seqs:
        try:
            result[seq].append(id_to_tax[seq_id])
        except KeyError:
            result[seq] = [id_to_tax[seq_id]]
    return result

def main():
    fasta_fp = argv[1]
    taxa_map_fp = argv[2]
    levels = map(int,argv[3].split(','))

    columns = ["Fraction of unique that are taxa specific",
               "Fraction of unique that are not taxa specific",
               "Fraction of total that are unique"]
    results = []
    for level in levels:
        id_to_tax = generate_id_to_tax_lookup(open(taxa_map_fp,'U'),level)
        seq_to_taxa = generate_seq_to_taxa_lookup(parse_fasta(open(fasta_fp,'U')),
                                                id_to_tax)
        total_seqs = len(list(parse_fasta(open(fasta_fp,'U'))))
        unique_seqs = len(seq_to_taxa)
        taxa_specific = 0
        non_taxa_specific = 0
        non_taxa_specific_seqs_f = open('non_taxa_specific_seqs_L%d.txt' % level, 'w')
        for seq, taxa in seq_to_taxa.items():
            if len(set(taxa)) == 1:
                taxa_specific += 1
            else:
                non_taxa_specific += 1
                non_taxa_specific_seqs_f.write('\t'.join(taxa + [seq]))
                non_taxa_specific_seqs_f.write('\n')

        non_taxa_specific_seqs_f.close()
        results.append([taxa_specific / unique_seqs,
                        non_taxa_specific / unique_seqs,
                        unique_seqs / total_seqs])
    df = pd.DataFrame(results, columns=columns, index=levels)
    print df

if __name__ == "__main__":
    main()
	#!/usr/bin/env python
	from __future__ import division

	from sys import argv

	from skbio import parse_fasta
	import pandas as pd


	def generate_id_to_tax_lookup(lines,level):
	result = {}
	for line in lines:
	id_, tax = line.strip().split('\t')
	result[id_] = ';'.join(tax.split(';')[:level])
	return result

	def generate_seq_to_taxa_lookup(seqs,id_to_tax):
	result = {}
	for seq_id, seq in seqs:
	try:
	result[seq].append(id_to_tax[seq_id])
	except KeyError:
	result[seq] = [id_to_tax[seq_id]]
	return result

	def main():
	fasta_fp = argv[1]
	taxa_map_fp = argv[2]
	levels = map(int,argv[3].split(','))

	columns = ["Fraction of unique that are taxa specific",
	"Fraction of unique that are not taxa specific",
	"Fraction of total that are unique"]
	results = []
	for level in levels:
	id_to_tax = generate_id_to_tax_lookup(open(taxa_map_fp,'U'),level)
	seq_to_taxa = generate_seq_to_taxa_lookup(parse_fasta(open(fasta_fp,'U')),
	id_to_tax)
	total_seqs = len(list(parse_fasta(open(fasta_fp,'U'))))
	unique_seqs = len(seq_to_taxa)
	taxa_specific = 0
	non_taxa_specific = 0
	non_taxa_specific_seqs_f = open('non_taxa_specific_seqs_L%d.txt' % level, 'w')
	for seq, taxa in seq_to_taxa.items():
	if len(set(taxa)) == 1:
	taxa_specific += 1
	else:
	non_taxa_specific += 1
	non_taxa_specific_seqs_f.write('\t'.join(taxa + [seq]))
	non_taxa_specific_seqs_f.write('\n')

	non_taxa_specific_seqs_f.close()
	results.append([taxa_specific / unique_seqs,
	non_taxa_specific / unique_seqs,
	unique_seqs / total_seqs])
	df = pd.DataFrame(results, columns=columns, index=levels)
	print df

	if __name__ == "__main__":
	main()