Notes by Greg Caporaso (gregcaporaso@gmail.com)
From email to picrust-developers on 4 Oct 2012:
-
Filter the HMP (not HMP-mock) data set to ~50-100k sequences at random to form a filtered dataset (for decreased run time).
-
Select ~12 of the most abundant IMG-defined OTUs from the HMP, and slice the reference sequence to the amplified region in that dataset. "IMG-defined" here means that we have an IMG genome attached to the OTU, opposed to the Greengenes-defined OTUs where we don't have a genome for that specific OTU. "most abundant" will be somewhat arbitrary - I'm thinking something like a random 12 IMG-defined from the 25% most abundant OTUs in the dataset.
-
Spike a combined total of 50k of the sliced reads from Step 2 into the filtered dataset from Step 1.
-
Pick OTUs with available methods (uclust-fast, uclust-strict, BWA-short, BLAT, usearch, bowtie2 (when available)).
-
Share OTU tables for PICRUST prediction accuracy analysis "positive control" (i.e., we expect these to achieve higher accuracy as we have enriched for genomes that we suspect are very likely to be present in the samples).
Run on compy.colorado.edu on 26 Oct 2012 by Greg Caporaso (gregcaporaso@gmail.com)
python ../spike_control/picrust_spike_control.py -f /scratch/caporaso/picrust/otus/mammals/seqs.fna -i /scratch/caporaso/picrust/otus/mammals/ucrC88.strict/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/mammals/spike_control/ -p 0.3 -g /scratch/caporaso/gg_otus_4feb2011/taxonomies/greengenes_tax.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta -c /scratch/caporaso/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta
# w spiked sequences
echo "pick_reference_otus_through_otu_table.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/seqs.subsampled0.300000.spiked.fna -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000_plus_spike/ -p /scratch/caporaso/picrust/otus/otu_picking_params_97.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta" | qsub -keo -N picrustspk -l pvmem=4gb -q memroute
echo "beta_diversity_through_plots.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000_plus_spike/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000_plus_spike/bdiv_even488 -e 488 -t /Users/caporaso/data/img_gg_otus_18may2012/trees/97_otus_img_gg_18may2012.tre -m /scratch/caporaso/picrust/otus/mammals/map.txt" | qsub -keo -N spkbdiv -l pvmem=4gb -q memroute
# w/o spiked sequences
echo "pick_reference_otus_through_otu_table.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/seqs.subsampled0.300000.fna -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000/ -p /scratch/caporaso/picrust/otus/otu_picking_params_97.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta" | qsub -keo -N picrustspk -l pvmem=4gb -q memroute
echo "beta_diversity_through_plots.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000/bdiv_even298 -e 298 -t /Users/caporaso/data/img_gg_otus_18may2012/trees/97_otus_img_gg_18may2012.tre -m /scratch/caporaso/picrust/otus/mammals/map.txt" | qsub -keo -N spkbdiv -l pvmem=4gb -q memroute
Input md5 sums
7eae6b90a80419c402930e16ba790f48 /scratch/caporaso/picrust/otus/mammals/ucrC88.strict/uclust_ref_picked_otus/otu_table.biom
a8f087f46afcfc0ec48dd9a69d4faffa /scratch/caporaso/gg_otus_4feb2011/taxonomies/greengenes_tax.txt
81b2a0624bac27eee6a59ab3e3d55bbf /scratch/caporaso/picrust/otus/mammals/seqs.fna
7efa9b27e4eeb3985ca5614269cb7cb8 /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta
afea7f81e67b77d3eb4917ede7b7f914 /scratch/caporaso/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta
4c6fa9441fac31f4df448469e3b44506 /Users/caporaso/data/img_gg_otus_18may2012/trees/97_otus_img_gg_18may2012.tre
f52d162d44591c9cd8d81379a89ad240 /scratch/caporaso/picrust/otus/mammals/map.txt
Initial investigation was performed using HMP, but that data was problematic for several reasons. Those notes, which formed the basis of this script, are below.
Step 1. Will work with the HMP v35 mock community data as we seem to be getting a higher fraction of read assignments with that data relative to the v13 data.
compy:/scratch/caporaso/picrust/otus/hmp/seqs_v35.fna : 710e2aa798629d2c3e591a83dd3db009
count_seqs.py -i seqs_v35.fna
30336944 : seqs_v35.fna (Sequence lengths (mean +/- std): 528.0545 +/- 16.3430)
echo "subsample_fasta.py -i /scratch/caporaso/picrust/otus/hmp/seqs_v35.fna -o /scratch/caporaso/picrust/otus/hmp/spiked_control/seqs_v35.subsample003.fna -p 0.003" | qsub -keo -N subsamplehmp
count_seqs.py -i seqs_v35.subsample003.fna
91006 : seqs_v35.subsample003.fna (Sequence lengths (mean +/- std): 528.1065 +/- 16.4102)
Step 2. Select only the IMG OTUs
echo "filter_otus_from_otu_table.py -i /scratch/caporaso/picrust/otus/hmp/ucrC88_v35_new/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/hmp/spiked_control/otu_table_img_only.biom -e /scratch/caporaso/gg_otus_4feb2011/taxonomies/greengenes_tax.txt" | qsub -keo -N filter_hmp -l pvmem=32gb -q memroute
#Filter the OTU table to include samples with at least 3000 observations (we’ll drop the rest from the final table).
filter_samples_from_otu_table.py -i otu_table_img_only.biom -o otu_table_img_only.mc3000.biom -n 3000
#Identify the most abundant IMG OTUs on a per-sample basis.
In [1]: from biom.parse import parse_biom_table
In [2]: t = parse_biom_table(open('./otu_table_img_only.mc3000.biom'))
In [16]: def get_abundant_oids(v):
....: x = [(e,i) for i,e in enumerate(v)]
....: x.sort()
....: x.reverse()
....: result = [(t.ObservationIds[i],e) for e,i in x if e > 0]
....: return result
In [22]: abundant_otus = {}
In [23]: for sv, si, sm in t.iterSamples():
abundant_otus[si] = get_abundant_oids(sv)
Write the set of identifiers of abundant OTUs to file:
In [26]: most_abundant = set()
In [27]: for sid, a in abundant_otus.items():
....: most_abundant.update(set([e[0] for e in a]))
....:
In [28]: len(most_abundant)
Out[28]: 332
In [30]: f = open('most_abundant_seq_ids.txt','w')
In [31]: f.write('\n'.join(most_abundant))
In [32]: f.close()
#Filter the abundant OTUs from the reference set:
img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta
(md5: 7efa9b27e4eeb3985ca5614269cb7cb8)
filter_fasta.py -f /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta -o /scratch/caporaso/picrust/otus/hmp/spiked_control/abundant_img_sequences.fasta -s /scratch/caporaso/picrust/otus/hmp/spiked_control/most_abundant_seq_ids.txt
#Align those sequences
align_seqs.py -i /scratch/caporaso/picrust/otus/hmp/spiked_control/abundant_img_sequences.fasta -t /scratch/caporaso/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta
#Trim to the v35 region using the HMP primers (slice_aligned_region.py is a PrimerProspector script). The primers appear to be 357F (CCTACGGGAGGCAGCAG) and 926R (CCGTCAATTCMTTTRAGT) (found these here (see page 11))
slice_aligned_region.py -f abundant_img_sequences_aligned.fasta -x CTGCTGCCTCCCGTAGG -y CCGTCAATTCMTTTRAG
Step 3. Generate fasta of sequences to spike into the subsample
In [1]: from cogent.parse.fasta import MinimalFastaParser
In [2]: sequences_to_spike = {}
In [3]: for seq_id, seq in MinimalFastaParser(open('pynast_aligned/abundant_img_sequences_aligned_alignment_region.fasta','U')):
sequences_to_spike[seq_id.split()[0]] = seq
In [5]: f = open('spike.fasta','w')
In [6]: i = 30336945
In [12]: for sid, o in abundant_otus.items():
for e in o[:2]:
for j in range(50):
try:
f.write('>%s_%d %s\n%s\n' % (sid,i,e[0],sequences_to_spike[e[0]]))
except KeyError:
pass
i += 1
....:
In [13]: f.close()
In [14]: !count_seqs.py -i spike.fasta
97750 : spike.fasta (Sequence lengths (mean +/- std): 545.1565 +/- 8.8109)
97750 : Total
Step 4: Spike the sequences into the subsample
cat seqs_v35.subsample003.fna spike.fasta >> seqs_v35.subsample003_plus_spike.fna
count_seqs.py -i seqs_v35.subsample003_plus_spike.fna
188756 : seqs_v35.subsample003_plus_spike.fna (Sequence lengths (mean +/- std): 536.9361 +/- 15.5764)
188756 : Total
Step 5: Pick OTUs and generate OTU table on spiked data
Ran this in serial as the queue was full at the time
echo "pick_reference_otus_through_otu_table.py -i /scratch/caporaso/picrust/otus/hmp/spiked_control/seqs_v35.subsample003_plus_spike.fna -o /scratch/caporaso/picrust/otus/hmp/spiked_control/ucrC97_v35.subsample003_plus_spike/ -p /scratch/caporaso/picrust/otus/otu_picking_params_97.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta" | qsub -keo -N picrustspk -l pvmem=4gb -q memroute
Currently have too many samples to reasonably work with this many sequences. Need to filter the sequence collection on a per sample basis. Need the mapping file! If no mapping file, work with a different data set (which might be easier regardless, given the first issue noted here).