Skip to content

Instantly share code, notes, and snippets.

@grabear
Last active September 18, 2020 13:01
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save grabear/018e86413b19b62a6bb8e72a9adba349 to your computer and use it in GitHub Desktop.
Save grabear/018e86413b19b62a6bb8e72a9adba349 to your computer and use it in GitHub Desktop.
Parsing .biom files with SILVA formatted annotations using phyloseq
#' @examples
#' taxvec1 = c("Root", "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Staphylococcaceae")
#' parse_taxonomy_default(taxvec1)
parse_taxonomy_default = function(char.vec){
# Remove any leading empty space
char.vec = gsub("^[[:space:]]{1,}", "", char.vec)
# Remove any trailing space
char.vec = gsub("[[:space:]]{1,}$", "", char.vec)
if( length(char.vec) > 0 ){
# Add dummy element (rank) name
names(char.vec) = paste("Rank", 1:length(char.vec), sep="")
} else {
warning("Empty taxonomy vector encountered.")
}
return(char.vec)
}
#' @examples
#' taxvec3 = c("D_0__Bacteria", "D_1__Firmicutes", "D_2__Bacilli", "D_3__Staphylococcaceae")
#' parse_taxonomy_silva(taxvec3)
parse_taxonomy_silva_128 <- function(char.vec){
# Use default to assign names to elements in case problem with greengenes prefix
char.vec = parse_taxonomy_default(char.vec)
# Check for unassigned taxa
if (char.vec["Rank1"] == "Unassigned") {
char.vec <- c(Rank1="D_0__Unassigned", Rank2="D_1__Unassigned", Rank3="D_2__Unassigned", Rank4="D_3__Unassigned",
Rank5="D_4__Unassigned", Rank6="D_5__Unassigned", Rank7="D_6__Unassigned")
}
# Define the meaning of each prefix according to SILVA taxonomy
Tranks = c(D_0="Kingdom", D_1="Phylum", D_2="Class", D_3="Order", D_4="Family", D_5="Genus", D_6="Species")
# Check for prefix using regexp, warn if there were none. trim indices, ti
ti = grep("[[:alpha:]]\\_[[:digit:]]{1}\\_\\_", char.vec)
if( length(ti) == 0L ){
warning(
"No silva prefixes were found. \n",
"Consider using parse_taxonomy_delfault() instead if true for all OTUs. \n",
"Dummy ranks may be included among taxonomic ranks now."
)
# Will want to return without further modifying char.vec
taxvec = char.vec
# Replace names of taxvec according to prefix, if any present...
} else {
# Format character vectors for Ambiguous taxa
if( length(ti) < 7 ){
for (key in names(char.vec)) {
if ( char.vec[key] == "Ambiguous_taxa" ) {
tax_no <- (as.numeric(substr(key, 5, 5)) - 1)
char.vec[key] = sprintf("D_%s__Ambiguous_taxa", tax_no)
}
}
# Reset the trimmed indicies if Ambiguous taxa
ti = grep("[[:alpha:]]\\_[[:digit:]]{1}\\_\\_", char.vec)
}
# Remove prefix using sub-"" regexp, call result taxvec
taxvec = gsub("[[:alpha:]]\\_[[:digit:]]{1}\\_\\_", "", char.vec)
# Define the ranks that will be replaced
repranks = Tranks[substr(char.vec[ti], 1, 3)]
# Replace, being sure to avoid prefixes notK present in Tranks
names(taxvec)[ti[!is.na(repranks)]] = repranks[!is.na(repranks)]
}
return(taxvec)
}
# This functionality has been tested and a PR has been pulled with phyloseq here:
# https://github.com/joey711/phyloseq/pull/854
#
# While the function has been vetted, the maintainers are very busy and the PR has not
# yet been added to the main package. Below i've added some detail to explain how to parse
# your silva data. It's quite easy....
source("parse_silva_taxonomy_128")
# import biom data
silva_biom <- system.file("extdata", "SILVA_OTU_table.biom", package="phyloseq")
# Create phyloseq object using silva parseing function
silva_phyloseq <- import_biom(BIOMfilename = silva_biom, parseFunction = parse_taxonomy_silva_128)
@grabear
Copy link
Author

grabear commented Apr 27, 2018

@grabear
Copy link
Author

grabear commented Apr 27, 2018

@tarunaaggarwal If you used this could you give me a +1 on the Pull Request?

@samd1993
Copy link

samd1993 commented Dec 7, 2019

Does this work with Silva132? And also I am curious what the analytical implications are for this in terms of the standard microbiome type analyses? I understand that this sort of conglomerates taxa in a more sensible way (correct me if I am wrong), so my guess is this has significant effects on downstream analyses. I am also guessing this is why I see 32,000 taxa in my phyloseq object when really it is more like 500 taxa.

Any info would much be appreciated!
Thanks,
Sam

@gcagle1
Copy link

gcagle1 commented Dec 12, 2019

This worked beautifully with my Silva132 classifications done in QIIME2 vsearch, thanks a lot!! I did get extra ranks added at the end so the col names in my otu table were
"Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species" "Rank5" "Rank6" "Rank7" "Rank4" "Rank3" "Rank2"

I've had extra ranks show up before with parse_taxonomy_greengenes depending on how I classify my sequences. It doesn't matter really since they are easily removed tax_table(physeq) = tax_table(physeq)[,1:7]

@grabear
Copy link
Author

grabear commented Dec 12, 2019

Nice. I'm not sure what the Silva132 data looks like as I haven't done any work with it yet. These may be relevant ranks that the Silva132 database uses, but my function just doesn't parse it out. I'd be interested to see what your Silva132 data looks like. Feel free to post a sample of your data here (or a link to it) @gcagle1 @samd1993. You can also find a ton of great functions (including this one) in a package that I authored a while back https://github.com/vallenderlab/MicrobiomeR. The documentation website is currently down, but once I get it working again I will let you guys know here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment