This depends on biom version >= 2.1.5, < 2.2.0 and vsearch >= 1.7.0.
Please note that all of this is highly experimental. I'm keeping these notes as I work with this approach. For published work, I still recommend using standard pipelines, such as those in QIIME 1.9.1.
$ biom --version biom, version 2.1.5 $ vsearch --version vsearch v1.7.0_osx_x86_64, 16.0GB RAM, 8 cores https://github.com/torognes/vsearch
First, dereplicate your sequences with
--relabel_sha1 --relabel_keep parameters are option, but map your original sequence identifiers onto sequence sha1 values. We'll come back to that in a minute. This takes about a minute on a MiSeq run.
$ vsearch --derep_fulllength seqs --output rep-set.fna --uc vsearch-derep.uc --relabel_sha1 --relabel_keep vsearch v1.7.0_osx_x86_64, 16.0GB RAM, 8 cores https://github.com/torognes/vsearch Reading file seqs 100% 26469 nt in 200 seqs, min 116, max 152, avg 132 Dereplicating 100% Sorting 100% 43 unique sequences, avg cluster 4.7, median 2, max 67 Writing output file 100% Writing uc file, first part 100% Writing uc file, second part 100%
biom from-uc to generate a biom file.
$ biom from-uc -i vsearch-derep.uc -o vsearch-derep.biom --rep-set-fp rep-set.fna
Then, we can
biom head the table to get a quick look at the data.
$ biom head -i vsearch-derep.biom # Constructed from biom file #OTU ID f2 f1 f3 f4 p1 e84fcf85a6a4065231dcf343bb862f1cb32abae6 13.0 16.0 16.0 22.0 0.0 5525fb6dab7b6577960147574465990c6df070ad 0.0 0.0 0.0 0.0 19.0 eb3564a35320b53cef22a77288838c7446357327 0.0 0.0 0.0 0.0 0.0 418f1d469f08c99976b313028cf6d3f18f61dd55 0.0 0.0 0.0 0.0 0.0 2e3b2c075901640c4de739473f9246385430b1ed 0.0 3.0 3.0 0.0 0.0
Notice that the OTU identifiers (first field in each row after the first one) are different than what you're probably used to seeing. These are sha1 hashs of the representative sequences for each OTU. Because this is dereplication, all sequences in the OTU are 100% identical, so have the same sha1. By naming your OTUs this way, your OTU identifiers will be consistent across runs of
vesearch --derep_fulllength, so you can safely merge tables (provided that your sample identifiers don't overlap unless they're supposed to.
You now have a biom file
vsearch-derep.biom and the corresponding represenative sequences file
rep-set.fna. These can be plugged into QIIME, or other programs that support biom. From here, we generally also want a tree for phylogenetic diversity calculations and taxonomic assignments for each dereplicated sequence.
Filtering OTUs that are present in only a single sample, and then dropping those representative sequences
This is important for the efficiency of the next steps. In my few experiences, this will reduce the number of OTUs (and therefore representative sequences) by 1-2 orders of magnitude. So, that obviously results in much faster alignment, tree building, and taxonomy assignment.
I do these steps with QIIME 1.9.1 as follows (I use
ms2 as an abbreviation for minimum samples = 2):
filter_otus_from_otu_table.py -i vsearch-derep.biom -o vsearch-derep.ms2.biom -s 2 filter_fasta.py -f rep-set.fna -o rep-set.ms2.fna -b vsearch-derep.ms2.biom
Tree building for phylogenetic diversity calculations
To build a tree, I've been aligning sequences and masking the alignment with
ssu-align, and then building a tree from the resulting bacterial sequences with
FastTreeMP. Here are my steps:
ssu-align rep-set.ms2.fna ssu-align-ms2/ # mask the alignment and output as fasta # as far as I can tell, ssu-mask can only write to current directory, hence my # pair of cd commands here to keep results of ssu-* contained cd ssu-align-ms2 ; ssu-mask --afa -a ssu-align-ms2.bacteria.stk ; cd .. # build the tree export OMP_NUM_THREADS=29; FastTreeMP -nosupport -fastest -nt ssu-align-ms2/ssu-align-ms2.bacteria.mask.afa > rep-set.ms2.tre
With about 700,000 sequences in
ssu-align step took about 16 hours, and the ssu-mask about 30 seconds. FastTreeMP with 32 threads took 2.5 hours.
I then use the resulting
rep-set.ms2.tre for diversity analyses.
I use QIIME's RDP Classifier wrapper for this right now, training against the 99% Greengenes OTUs, and then add those assignments to my biom file.
parallel_assign_taxonomy_rdp.py -r /home/gregcaporaso/gg_13_8_otus/rep_set/99_otus.fasta -t /home/gregcaporaso/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt -i rep-set.ms2.fna -o rdp-assigned-tax-99/ -O 20 biom add-metadata -i vsearch-derep.ms2.biom -o vsearch-derep.ms2.rdp-tax-99.biom --observation-metadata-fp rdp-assigned-tax-99/rep-set.ms2_tax_assignments.txt --observation-header ID,taxonomy --sc-separated taxonomy
I then use the resulting
vsearch-derep.ms2.rdp-tax-99.biom for diversity analyses.
I tried to convert uc to biom from the below command
biom from-uc -i vsearch-derep.uc -o vsearch-derep.biom --rep-set-fp rep-set.fna
and got this error ValueError: A query sequence was encountered that does not have an underscore. An underscore is required in all query sequence identifiers to indicate the sample identifier.
Any idea as to how to go about it/solve? Thanks for your assistance