Skip to content

Instantly share code, notes, and snippets.

@rossmounce
Forked from dwinter/nested_taxa.Rmd
Last active April 6, 2017 10:53
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rossmounce/ace4cad73654fc7d2e35fcf940f6032b to your computer and use it in GitHub Desktop.
Save rossmounce/ace4cad73654fc7d2e35fcf940f6032b to your computer and use it in GitHub Desktop.

and now, all animals and with nested taxonomic ranks

So, a few people liked this example of using rentrez to investigate the taxonomic distribution of sequences in Genbank. I though it might be fun to extend it a little. Specifically:

  • I can't just look at boring old chordates, so let's look at all animals.
  • The real power of the Treemap plot I used in the earlier example is to show nested relationships (like taxonomic ones), so let's break teh counts out by Phylum > Class > Order

##Every animal order in NCBI taxonomy

Let's start with all the orders

library(rentrez)
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]")
animal_orders
## Entrez search result with 516 hits (object contains 20 IDs and no cookie)

That's.. alot. Let's save the NCBI some sever-load as us some time by using the WebEnv feature of Entrez to store IDs on the NCBI's-end.

animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]", use_history=TRUE)
animal_orders
## Entrez search result with 516 hits (object contains 20 IDs and a cookie)
animal_orders$web_history$WebEnv
## [1] "NCID_1_58283008_130.14.22.215_9001_1425786158_320233684_0MetA0_S_MegaStore_F_1"

The WebEnv attribute is our ticket to these stored IDs

##Fetch taxonomy records

The summary records we used last time don't have the complete taxonomic lineage of each of our animal order, for that we will need the "full" xml records that entrez_fetch can grab:

tax_recs <- entrez_fetch(db="taxonomy", 
                         web_history = animal_orders$web_history,
                         rettype="xml")
typeof(tax_recs)
## [1] "character"
nchar(tax_recs)
## [1] 1651519

As you can see, at least for now entrez_fetch gives us a big character containing a raw xml file. We can parse it with the XML package:

library(XML)
parsed_recs <- xmlTreeParse(tax_recs, useInternalNodes=TRUE)
class(parsed_recs)
## [1] "XMLInternalDocument" "XMLAbstractDocument"

#Extract Name, Phylum, Order for each class

Now the fun bit! XML let's you use xpath to get around XML records. Here's what I came up with to extract this information

get_taxon <- function(rec, rank){
    xp  <-paste0("LineageEx/Taxon/Rank[.='", rank,"']/../ScientificName")
    res <- xpathSApply(rec, xp, xmlValue)
    if(is.null(res)){
        return(NA)
    }
    res
}

extract_parents <- function(rec){
    phy <- get_taxon(rec, "phylum")
    cls <- get_taxon(rec, "class")
    ord <-  xpathSApply(rec, "ScientificName", xmlValue)
    structure( c(phy, cls, ord), names=c("Phylum", "Class", "Order"))
}

taxa <- parsed_recs["/TaxaSet/Taxon"]
tax_df <-as.data.frame(t(sapply(taxa,extract_parents)))
head(tax_df)
##     Phylum       Class             Order
## 1 Mollusca    Bivalvia         Lucinoida
## 2 Chordata    Mammalia        Litopterna
## 3 Chordata Actinopteri Chaetodontiformes
## 4 Chordata Actinopteri  Holocentriformes
## 5 Chordata Actinopteri  Centrarchiformes
## 6 Chordata Actinopteri    Pempheriformes

#Count the sequences

We can use the names to build queries for entrez_search, the actual search will take some time as rentrez enforces NCBI's limit of 3 requests per second:

queries <- paste0(tax_df$Order, "[ORGN]")
nseq <-sapply(queries, function(q) entrez_search(db="nuccore", term=q)$count)
tax_df$nseq <- as.numeric(nseq)
head(tax_df)
##     Phylum       Class             Order   nseq
## 1 Mollusca    Bivalvia         Lucinoida    594
## 2 Chordata    Mammalia        Litopterna      0
## 3 Chordata Actinopteri Chaetodontiformes   3274
## 4 Chordata Actinopteri  Holocentriformes   1206
## 5 Chordata Actinopteri  Centrarchiformes 520562
## 6 Chordata Actinopteri    Pempheriformes   1769

#make the pretty plot!

library(treemap)
plot_title <- paste0("Genbank sequences by Phylum > Class > Order (n = ", 
                    round(sum(tax_df$nseq) / 1e6, 1),
                    "million)")
treemap(tax_df, 
        index=c("Phylum", "Class", "Order"),
        vSize="nseq", vColor="Phylum", 
        type='categorical',
        position.legend="none", 
        title=plot_title,
        border.col=c("white","white","black")
)

plot of chunk unnamed-chunk-8

```{r, echo=FALSE}
knitr::opts_knit$set(upload.fun = knitr::imgur_upload, base.url = NULL)
```
# and now, all animals and with nested taxonomic ranks
So, a few people liked [this example](https://gist.github.com/dwinter/8d7bde0579daf7466508)
of using `rentrez` to investigate the taxonomic distribution of sequences in
Genbank. I though it might be fun to extend it a little. Specifically:
* I can't just look at boring old chordates, so let's look at all animals.
* The real power of the Treemap plot I used in the earlier example is to
show nested relationships (like taxonomic ones), so let's break teh counts
out by Phylum > Class > Order
##Every animal order in NCBI taxonomy
Let's start with all the orders
```{r}
library(rentrez)
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]")
animal_orders
```
That's.. alot. Let's save the NCBI some sever-load as us some time by using the
`WebEnv` feature of Entrez to store IDs on the NCBI's-end.
```{r}
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]", use_history=TRUE)
animal_orders
animal_orders$web_history$WebEnv
```
The `WebEnv` attribute is our ticket to these stored IDs
##Fetch taxonomy records
The summary records we used last time don't have the complete taxonomic lineage
of each of our animal order, for that we will need the "full" xml records that
`entrez_fetch` can grab:
```{r, cache=TRUE}
tax_recs <- entrez_fetch(db="taxonomy",
web_history = animal_orders$web_history,
rettype="xml")
typeof(tax_recs)
nchar(tax_recs)
```
As you can see, [at least for
now](https://github.com/ropensci/rentrez/issues/35) `entrez_fetch` gives us a
big character containing a _raw_ xml file. We can parse it with the `XML`
package:
```{r}
library(XML)
parsed_recs <- xmlTreeParse(tax_recs, useInternalNodes=TRUE)
class(parsed_recs)
```
#Extract Name, Phylum, Order for each class
Now the fun bit! `XML` let's you use [xpath](http://en.wikipedia.org/wiki/XPath)
to get around XML records. Here's what I came up with to extract this
information
```{r}
get_taxon <- function(rec, rank){
xp <-paste0("LineageEx/Taxon/Rank[.='", rank,"']/../ScientificName")
res <- xpathSApply(rec, xp, xmlValue)
if(is.null(res)){
return(NA)
}
res
}
extract_parents <- function(rec){
phy <- get_taxon(rec, "phylum")
cls <- get_taxon(rec, "class")
ord <- xpathSApply(rec, "ScientificName", xmlValue)
structure( c(phy, cls, ord), names=c("Phylum", "Class", "Order"))
}
taxa <- parsed_recs["/TaxaSet/Taxon"]
tax_df <-as.data.frame(t(sapply(taxa,extract_parents)))
head(tax_df)
```
#Count the sequences
We can use the names to build queries for `entrez_search`, the actual search
will take some time as `rentrez` enforces NCBI's limit of 3 requests per second:
```{r, cache=TRUE}
queries <- paste0(tax_df$Order, "[ORGN]")
nseq <-sapply(queries, function(q) entrez_search(db="nuccore", term=q)$count)
tax_df$nseq <- as.numeric(nseq)
head(tax_df)
```
#make the pretty plot!
```{r,cache=TRUE}
library(treemap)
plot_title <- paste0("Genbank sequences by Phylum > Class > Order (n = ",
round(sum(tax_df$nseq) / 1e6, 1),
"million)")
treemap(tax_df,
index=c("Phylum", "Class", "Order"),
vSize="nseq", vColor="Phylum",
type='categorical',
position.legend="none",
title=plot_title,
border.col=c("white","white","black")
)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment