Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Created October 27, 2012 00:09
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gregcaporaso/3962344 to your computer and use it in GitHub Desktop.
Save gregcaporaso/3962344 to your computer and use it in GitHub Desktop.
A quick and dirty script for generating "positive control" data for the PICRUST project

Notes by Greg Caporaso (gregcaporaso@gmail.com)

Analysis goals

From email to picrust-developers on 4 Oct 2012:

  1. Filter the HMP (not HMP-mock) data set to ~50-100k sequences at random to form a filtered dataset (for decreased run time).

  2. Select ~12 of the most abundant IMG-defined OTUs from the HMP, and slice the reference sequence to the amplified region in that dataset. "IMG-defined" here means that we have an IMG genome attached to the OTU, opposed to the Greengenes-defined OTUs where we don't have a genome for that specific OTU. "most abundant" will be somewhat arbitrary - I'm thinking something like a random 12 IMG-defined from the 25% most abundant OTUs in the dataset.

  3. Spike a combined total of 50k of the sliced reads from Step 2 into the filtered dataset from Step 1.

  4. Pick OTUs with available methods (uclust-fast, uclust-strict, BWA-short, BLAT, usearch, bowtie2 (when available)).

  5. Share OTU tables for PICRUST prediction accuracy analysis "positive control" (i.e., we expect these to achieve higher accuracy as we have enriched for genomes that we suspect are very likely to be present in the samples).

Analysis notes

Run on compy.colorado.edu on 26 Oct 2012 by Greg Caporaso (gregcaporaso@gmail.com)

python ../spike_control/picrust_spike_control.py -f /scratch/caporaso/picrust/otus/mammals/seqs.fna -i /scratch/caporaso/picrust/otus/mammals/ucrC88.strict/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/mammals/spike_control/ -p 0.3 -g /scratch/caporaso/gg_otus_4feb2011/taxonomies/greengenes_tax.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta -c /scratch/caporaso/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta

# w spiked sequences
echo "pick_reference_otus_through_otu_table.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/seqs.subsampled0.300000.spiked.fna -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000_plus_spike/ -p /scratch/caporaso/picrust/otus/otu_picking_params_97.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta" | qsub -keo -N picrustspk -l pvmem=4gb -q memroute

echo "beta_diversity_through_plots.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000_plus_spike/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000_plus_spike/bdiv_even488 -e 488 -t /Users/caporaso/data/img_gg_otus_18may2012/trees/97_otus_img_gg_18may2012.tre -m /scratch/caporaso/picrust/otus/mammals/map.txt" | qsub -keo -N spkbdiv -l pvmem=4gb -q memroute

# w/o spiked sequences
echo "pick_reference_otus_through_otu_table.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/seqs.subsampled0.300000.fna -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000/ -p /scratch/caporaso/picrust/otus/otu_picking_params_97.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta" | qsub -keo -N picrustspk -l pvmem=4gb -q memroute

echo "beta_diversity_through_plots.py -i /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/mammals/spike_control/ucrC97_subsampled0.300000/bdiv_even298 -e 298 -t /Users/caporaso/data/img_gg_otus_18may2012/trees/97_otus_img_gg_18may2012.tre -m /scratch/caporaso/picrust/otus/mammals/map.txt" | qsub -keo -N spkbdiv -l pvmem=4gb -q memroute

Input md5 sums

7eae6b90a80419c402930e16ba790f48  /scratch/caporaso/picrust/otus/mammals/ucrC88.strict/uclust_ref_picked_otus/otu_table.biom
a8f087f46afcfc0ec48dd9a69d4faffa  /scratch/caporaso/gg_otus_4feb2011/taxonomies/greengenes_tax.txt
81b2a0624bac27eee6a59ab3e3d55bbf  /scratch/caporaso/picrust/otus/mammals/seqs.fna
7efa9b27e4eeb3985ca5614269cb7cb8  /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta
afea7f81e67b77d3eb4917ede7b7f914  /scratch/caporaso/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta
4c6fa9441fac31f4df448469e3b44506  /Users/caporaso/data/img_gg_otus_18may2012/trees/97_otus_img_gg_18may2012.tre
f52d162d44591c9cd8d81379a89ad240  /scratch/caporaso/picrust/otus/mammals/map.txt

Initial investigations

Initial investigation was performed using HMP, but that data was problematic for several reasons. Those notes, which formed the basis of this script, are below.

Step 1. Will work with the HMP v35 mock community data as we seem to be getting a higher fraction of read assignments with that data relative to the v13 data.

compy:/scratch/caporaso/picrust/otus/hmp/seqs_v35.fna : 710e2aa798629d2c3e591a83dd3db009

count_seqs.py -i seqs_v35.fna
30336944  : seqs_v35.fna (Sequence lengths (mean +/- std): 528.0545 +/- 16.3430)

echo "subsample_fasta.py -i /scratch/caporaso/picrust/otus/hmp/seqs_v35.fna -o /scratch/caporaso/picrust/otus/hmp/spiked_control/seqs_v35.subsample003.fna -p 0.003" | qsub -keo -N subsamplehmp

count_seqs.py -i seqs_v35.subsample003.fna 
91006  : seqs_v35.subsample003.fna (Sequence lengths (mean +/- std): 528.1065 +/- 16.4102)

Step 2. Select only the IMG OTUs

echo "filter_otus_from_otu_table.py -i /scratch/caporaso/picrust/otus/hmp/ucrC88_v35_new/uclust_ref_picked_otus/otu_table.biom -o /scratch/caporaso/picrust/otus/hmp/spiked_control/otu_table_img_only.biom -e /scratch/caporaso/gg_otus_4feb2011/taxonomies/greengenes_tax.txt" | qsub -keo -N filter_hmp -l pvmem=32gb -q memroute

#Filter the OTU table to include samples with at least 3000 observations (we’ll drop the rest from the final table).

filter_samples_from_otu_table.py -i otu_table_img_only.biom -o otu_table_img_only.mc3000.biom -n 3000

#Identify the most abundant IMG OTUs on a per-sample basis.

In [1]: from biom.parse import parse_biom_table

In [2]: t = parse_biom_table(open('./otu_table_img_only.mc3000.biom'))

In [16]: def get_abundant_oids(v):
   ....:     x = [(e,i) for i,e in enumerate(v)]
   ....:     x.sort()
   ....:     x.reverse()
   ....:     result = [(t.ObservationIds[i],e) for e,i in x if e > 0]
   ....:     return result

In [22]: abundant_otus = {}

In [23]: for sv, si, sm in t.iterSamples():
    abundant_otus[si] = get_abundant_oids(sv)

Write the set of identifiers of abundant OTUs to file:
In [26]: most_abundant = set()

In [27]: for sid, a in abundant_otus.items():
   ....:     most_abundant.update(set([e[0] for e in a]))
   ....:     

In [28]: len(most_abundant)
Out[28]: 332

In [30]: f = open('most_abundant_seq_ids.txt','w')

In [31]: f.write('\n'.join(most_abundant))

In [32]: f.close()

#Filter the abundant OTUs from the reference set: 
img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta 
(md5: 7efa9b27e4eeb3985ca5614269cb7cb8)

filter_fasta.py -f /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta -o /scratch/caporaso/picrust/otus/hmp/spiked_control/abundant_img_sequences.fasta -s /scratch/caporaso/picrust/otus/hmp/spiked_control/most_abundant_seq_ids.txt

#Align those sequences
align_seqs.py -i /scratch/caporaso/picrust/otus/hmp/spiked_control/abundant_img_sequences.fasta -t /scratch/caporaso/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta 

#Trim to the v35 region using the HMP primers (slice_aligned_region.py is a PrimerProspector script). The primers appear to be 357F (CCTACGGGAGGCAGCAG) and 926R (CCGTCAATTCMTTTRAGT) (found these here (see page 11))
slice_aligned_region.py -f abundant_img_sequences_aligned.fasta -x CTGCTGCCTCCCGTAGG -y CCGTCAATTCMTTTRAG

Step 3. Generate fasta of sequences to spike into the subsample

In [1]: from cogent.parse.fasta import MinimalFastaParser

In [2]: sequences_to_spike = {}

In [3]: for seq_id, seq in MinimalFastaParser(open('pynast_aligned/abundant_img_sequences_aligned_alignment_region.fasta','U')):
    sequences_to_spike[seq_id.split()[0]] = seq

In [5]: f = open('spike.fasta','w')

In [6]: i = 30336945

In [12]: for sid, o in abundant_otus.items():                                                                                    
    for e in o[:2]:
        for j in range(50):
            try:
                f.write('>%s_%d %s\n%s\n' % (sid,i,e[0],sequences_to_spike[e[0]]))
            except KeyError:
                pass
            i += 1
   ....:             

In [13]: f.close()

In [14]: !count_seqs.py -i spike.fasta

97750  : spike.fasta (Sequence lengths (mean +/- std): 545.1565 +/- 8.8109)
97750  : Total

Step 4: Spike the sequences into the subsample

cat seqs_v35.subsample003.fna spike.fasta >> seqs_v35.subsample003_plus_spike.fna
count_seqs.py -i seqs_v35.subsample003_plus_spike.fna

188756  : seqs_v35.subsample003_plus_spike.fna (Sequence lengths (mean +/- std): 536.9361 +/- 15.5764)
188756  : Total

Step 5: Pick OTUs and generate OTU table on spiked data

Ran this in serial as the queue was full at the time
echo "pick_reference_otus_through_otu_table.py -i /scratch/caporaso/picrust/otus/hmp/spiked_control/seqs_v35.subsample003_plus_spike.fna -o /scratch/caporaso/picrust/otus/hmp/spiked_control/ucrC97_v35.subsample003_plus_spike/ -p /scratch/caporaso/picrust/otus/otu_picking_params_97.txt -r /Users/caporaso/data/img_gg_otus_18may2012/rep_set/97_otus_img_gg_18may2012.fasta" | qsub -keo -N picrustspk -l pvmem=4gb -q memroute

Currently have too many samples to reasonably work with this many sequences. Need to filter the sequence collection on a per sample basis. Need the mapping file! If no mapping file, work with a different data set (which might be easier regardless, given the first issue noted here).

#!/usr/bin/env python
# File created on 26 Oct 2012
from __future__ import division
__author__ = "Greg Caporaso"
__copyright__ = "Copyright 2011, The QIIME project"
__credits__ = ["Greg Caporaso"]
__license__ = "GPL"
__version__ = "1.5.0-dev"
__maintainer__ = "Greg Caporaso"
__email__ = "gregcaporaso@gmail.com"
__status__ = "Development"
from os.path import split, splitext
from cogent.parse.fasta import MinimalFastaParser
from qiime.util import parse_command_line_parameters, make_option
from qiime.util import qiime_system_call, create_dir
from qiime.filter import filter_fasta
from biom.parse import parse_biom_table
script_info = {}
script_info['brief_description'] = ""
script_info['script_description'] = ""
script_info['script_usage'] = [("","","")]
script_info['output_description']= ""
script_info['required_options'] = [
make_option('-f','--input_fasta_fp',type="existing_filepath",help='the input filepath'),
make_option('-i','--input_otu_table_fp',type="existing_filepath",help='the input filepath'),
make_option('-o','--output_dir',type="new_dirpath",help='the output directory'),
make_option('-p','--percent_subsample',action='store',type='float',
help='Specify the percentage of sequences to subsample'),
make_option('-g','--greengenes_tax_fp',type="existing_filepath",
help='Path to Greengenes OTU taxonomy file'),
make_option('-r','--picrust_reference_seqs_fp',type="existing_filepath",
help='Path to PICRUST reference sequences file'),
make_option('-c','--reference_alignment_fp',type="existing_filepath",
help='Path to reference alignment file'),
]
script_info['optional_options'] = [\
# Example optional option
#make_option('-o','--output_dir',type="new_dirpath",help='the output directory [default: %default]'),\
]
script_info['version'] = __version__
def command_call(cmd):
stdout, stderr, return_code = qiime_system_call(cmd)
if return_code != 0:
raise ValueError, "STDOUT:\n%s\nSTDERR:\n%s\n" % (stdout, stderr)
def get_abundant_oids(t,v):
x = [(e,i) for i,e in enumerate(v)]
x.sort()
x.reverse()
result = [(t.ObservationIds[i],e) for e,i in x if e > 0]
return result
def picrust_spike_control(input_fasta_fp,
input_otu_table_fp,
output_dir,
percent_subsample,
greengenes_tax_fp,
picrust_reference_seqs_fp,
reference_alignment_fp):
# subsample the full fasta file for reduced run time
input_fasta_basename, input_fasta_ext = splitext(split(input_fasta_fp)[1])
subsampled_fasta_fp = "%s/%s.subsampled%f%s" % (output_dir,
input_fasta_basename,
percent_subsample,
input_fasta_ext
)
subsample_cmd = "subsample_fasta.py -i %s -o %s -p %f" % \
(input_fasta_fp,subsampled_fasta_fp,percent_subsample)
command_call(subsample_cmd)
# Select only the IMG OTUs
img_otu_table_fp = '%s/otu_table_img_only.biom' % output_dir
filter_otus_from_otu_table_cmd =\
"filter_otus_from_otu_table.py -i %s -o %s -e %s" % (input_otu_table_fp,
img_otu_table_fp,
greengenes_tax_fp)
# Identify the most abundant IMG OTUs on a per-sample basis.
command_call(filter_otus_from_otu_table_cmd)
t = parse_biom_table(open(img_otu_table_fp))
abundant_otus = {}
for sv, si, sm in t.iterSamples():
abundant_otus[si] = get_abundant_oids(t,sv)
# compile the set of most abundant OTU ids
most_abundant = set()
for sid, a in abundant_otus.items():
most_abundant.update(set([e[0] for e in a]))
# Filter the abundant OTUs from the reference set
abundant_img_sequences_fp = '%s/abundant_img_sequences.fasta' % output_dir
filter_fasta(MinimalFastaParser(open(picrust_reference_seqs_fp)),
open(abundant_img_sequences_fp,'w'),
most_abundant)
# Align those sequences
pynast_output_dir = "%s/pynast_aligned/" % output_dir
align_seqs_cmd = \
"align_seqs.py -i %s -t %s -o %s" % (abundant_img_sequences_fp,
reference_alignment_fp,
pynast_output_dir)
command_call(align_seqs_cmd)
# Trim to the amplified region (these primers are 8F/338R, as listed in
# Muegge et al supplementary material:
# http://www.sciencemag.org/content/suppl/2011/05/18/332.6032.970.DC1/Muegge.SOM.pdf
# with the forward primer reverse complemented
abundant_img_sequences_aligned_fp =\
"%s/pynast_aligned/abundant_img_sequences_aligned.fasta" % output_dir
slice_aligned_region_cmd = "slice_aligned_region.py -f %s -x CTGAGCCAGGATCAAACTCT -y TGCTGCCTCCCGTAGGAGT -o %s" % (abundant_img_sequences_aligned_fp,output_dir)
abundant_img_sequences_aligned_trimmed_fp = \
"%s/abundant_img_sequences_aligned_alignment_region.fasta" % output_dir
command_call(slice_aligned_region_cmd)
# Generate fasta of sequences to spike into the subsample
sequences_to_spike = {}
for seq_id, seq in MinimalFastaParser(open(abundant_img_sequences_aligned_trimmed_fp,'U')):
sequences_to_spike[seq_id.split()[0]] = seq
i = 2000000
spike_fasta_fp = '%s/spike.%s' % (output_dir,input_fasta_ext)
f = open(spike_fasta_fp,'w')
for sid, o in abundant_otus.items():
for e in o[:2]:
for j in range(200):
try:
f.write('>%s_%d %s\n%s\n' % (sid,i,e[0],sequences_to_spike[e[0]]))
except KeyError:
pass
i += 1
f.close()
# Spike the sequences into the subsample
spiked_output_fasta_fp =\
"%s/%s.subsampled%f.spiked%s" % (output_dir,
input_fasta_basename,
percent_subsample,
input_fasta_ext
)
cat_cmd = "cat %s %s >> %s" % (subsampled_fasta_fp,
spike_fasta_fp,
spiked_output_fasta_fp)
command_call(cat_cmd)
def main():
option_parser, opts, args =\
parse_command_line_parameters(**script_info)
create_dir(opts.output_dir,fail_on_exist=True)
picrust_spike_control(opts.input_fasta_fp,
opts.input_otu_table_fp,
opts.output_dir,
opts.percent_subsample,
opts.greengenes_tax_fp,
opts.picrust_reference_seqs_fp,
opts.reference_alignment_fp)
if __name__ == "__main__":
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment