So, a few people liked this example
of using rentrez
to investigate the taxonomic distribution of sequences in
Genbank. I though it might be fun to extend it a little. Specifically:
- I can't just look at boring old chordates, so let's look at all animals.
- The real power of the Treemap plot I used in the earlier example is to show nested relationships (like taxonomic ones), so let's break teh counts out by Phylum > Class > Order
##Every animal order in NCBI taxonomy
Let's start with all the orders
library(rentrez)
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]")
animal_orders
## Entrez search result with 516 hits (object contains 20 IDs and no cookie)
That's.. alot. Let's save the NCBI some sever-load as us some time by using the
WebEnv
feature of Entrez to store IDs on the NCBI's-end.
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]", usehistory="y")
animal_orders
## Entrez search result with 516 hits (object contains 20 IDs and a cookie)
animal_orders$WebEnv
## [1] "NCID_1_58283008_130.14.22.215_9001_1425786158_320233684_0MetA0_S_MegaStore_F_1"
The WebEnv
attribute is our ticket to these stored IDs
##Fetch taxonomy records
The summary records we used last time don't have the complete taxonomic lineage
of each of our animal order, for that we will need the "full" xml records that
entrez_fetch
can grab:
tax_recs <- entrez_fetch(db="taxonomy",
WebEnv=animal_orders$WebEnv,
query_key=animal_orders$QueryKey,
rettype="xml")
typeof(tax_recs)
## [1] "character"
nchar(tax_recs)
## [1] 1651519
As you can see, at least for
now entrez_fetch
gives us a
big character containing a raw xml file. We can parse it with the XML
package:
library(XML)
parsed_recs <- xmlTreeParse(tax_recs, useInternalNodes=TRUE)
class(parsed_recs)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
#Extract Name, Phylum, Order for each class
Now the fun bit! XML
let's you use xpath
to get around XML records. Here's what I came up with to extract this
information
get_taxon <- function(rec, rank){
xp <-paste0("LineageEx/Taxon/Rank[.='", rank,"']/../ScientificName")
res <- xpathSApply(rec, xp, xmlValue)
if(is.null(res)){
return(NA)
}
res
}
extract_parents <- function(rec){
phy <- get_taxon(rec, "phylum")
cls <- get_taxon(rec, "class")
ord <- xpathSApply(rec, "ScientificName", xmlValue)
structure( c(phy, cls, ord), names=c("Phylum", "Class", "Order"))
}
taxa <- parsed_recs["/TaxaSet/Taxon"]
tax_df <-as.data.frame(t(sapply(taxa,extract_parents)))
head(tax_df)
## Phylum Class Order
## 1 Mollusca Bivalvia Lucinoida
## 2 Chordata Mammalia Litopterna
## 3 Chordata Actinopteri Chaetodontiformes
## 4 Chordata Actinopteri Holocentriformes
## 5 Chordata Actinopteri Centrarchiformes
## 6 Chordata Actinopteri Pempheriformes
#Count the sequences
We can use the names to build queries for entrez_search
, the actual search
will take some time as rentrez
enforces NCBI's limit of 3 requests per second:
queries <- paste0(tax_df$Order, "[ORGN]")
nseq <-sapply(queries, function(q) entrez_search(db="nuccore", term=q)$count)
tax_df$nseq <- as.numeric(nseq)
head(tax_df)
## Phylum Class Order nseq
## 1 Mollusca Bivalvia Lucinoida 594
## 2 Chordata Mammalia Litopterna 0
## 3 Chordata Actinopteri Chaetodontiformes 3274
## 4 Chordata Actinopteri Holocentriformes 1206
## 5 Chordata Actinopteri Centrarchiformes 520562
## 6 Chordata Actinopteri Pempheriformes 1769
#make the pretty plot!
library(treemap)
plot_title <- paste0("Genbank sequences by Phylum > Class > Order (n = ",
round(sum(tax_df$nseq) / 1e6, 1),
"million)")
treemap(tax_df,
index=c("Phylum", "Class", "Order"),
vSize="nseq", vColor="Phylum",
type='categorical',
position.legend="none",
title=plot_title,
border.col=c("white","white","black")
)
As of 2017-04-06