dwinter/nested_taxa.Rmd

## nested_taxa.md

      
    Raw
  

              nested_taxa.md
            
          
    and now, all animals and with nested taxonomic ranks

So, a few people liked this example
of using rentrez to investigate the taxonomic distribution of sequences in
Genbank. I though it might be fun to extend it a little. Specifically:

I can't just look at boring old chordates, so let's look at all animals.
The real power of the Treemap plot I used in the earlier example is to
show nested relationships (like taxonomic ones), so let's break teh counts
out by Phylum > Class > Order

##Every animal order in NCBI taxonomy
Let's start with all the orders
library(rentrez)
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]")
animal_orders
## Entrez search result with 516 hits (object contains 20 IDs and no cookie)

That's.. alot. Let's save the NCBI some sever-load as us some time by using the
WebEnv feature of Entrez to store IDs on the NCBI's-end.
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]", usehistory="y")
animal_orders
## Entrez search result with 516 hits (object contains 20 IDs and a cookie)

animal_orders$WebEnv
## [1] "NCID_1_58283008_130.14.22.215_9001_1425786158_320233684_0MetA0_S_MegaStore_F_1"

The WebEnv attribute is our ticket to these stored IDs
##Fetch taxonomy records
The summary records we used last time don't have the complete taxonomic lineage
of each of our animal order, for that we will need the "full" xml records that
entrez_fetch can grab:
tax_recs <- entrez_fetch(db="taxonomy", 
                         WebEnv=animal_orders$WebEnv, 
                         query_key=animal_orders$QueryKey, 
                         rettype="xml")
typeof(tax_recs)
## [1] "character"

nchar(tax_recs)
## [1] 1651519

As you can see, at least for
now entrez_fetch gives us a
big character containing a raw xml file. We can parse it with the XML
package:
library(XML)
parsed_recs <- xmlTreeParse(tax_recs, useInternalNodes=TRUE)
class(parsed_recs)
## [1] "XMLInternalDocument" "XMLAbstractDocument"

#Extract Name, Phylum, Order for each class
Now the fun bit! XML let's you use xpath
to get around XML records. Here's what I came up with to extract this
information
get_taxon <- function(rec, rank){
    xp  <-paste0("LineageEx/Taxon/Rank[.='", rank,"']/../ScientificName")
    res <- xpathSApply(rec, xp, xmlValue)
    if(is.null(res)){
        return(NA)
    }
    res
}

extract_parents <- function(rec){
    phy <- get_taxon(rec, "phylum")
    cls <- get_taxon(rec, "class")
    ord <-  xpathSApply(rec, "ScientificName", xmlValue)
    structure( c(phy, cls, ord), names=c("Phylum", "Class", "Order"))
}

taxa <- parsed_recs["/TaxaSet/Taxon"]
tax_df <-as.data.frame(t(sapply(taxa,extract_parents)))
head(tax_df)
##     Phylum       Class             Order
## 1 Mollusca    Bivalvia         Lucinoida
## 2 Chordata    Mammalia        Litopterna
## 3 Chordata Actinopteri Chaetodontiformes
## 4 Chordata Actinopteri  Holocentriformes
## 5 Chordata Actinopteri  Centrarchiformes
## 6 Chordata Actinopteri    Pempheriformes

#Count the sequences
We can use the names to build queries for entrez_search, the actual search
will take some time as rentrez enforces NCBI's limit of 3 requests per second:
queries <- paste0(tax_df$Order, "[ORGN]")
nseq <-sapply(queries, function(q) entrez_search(db="nuccore", term=q)$count)
tax_df$nseq <- as.numeric(nseq)
head(tax_df)
##     Phylum       Class             Order   nseq
## 1 Mollusca    Bivalvia         Lucinoida    594
## 2 Chordata    Mammalia        Litopterna      0
## 3 Chordata Actinopteri Chaetodontiformes   3274
## 4 Chordata Actinopteri  Holocentriformes   1206
## 5 Chordata Actinopteri  Centrarchiformes 520562
## 6 Chordata Actinopteri    Pempheriformes   1769

#make the pretty plot!
library(treemap)
plot_title <- paste0("Genbank sequences by Phylum > Class > Order (n = ", 
                    round(sum(tax_df$nseq) / 1e6, 1),
                    "million)")
treemap(tax_df, 
        index=c("Phylum", "Class", "Order"),
        vSize="nseq", vColor="Phylum", 
        type='categorical',
        position.legend="none", 
        title=plot_title,
        border.col=c("white","white","black")
)


## nested_taxa.Rmd
```{r, echo=FALSE}
knitr::opts_knit$set(upload.fun = knitr::imgur_upload, base.url = NULL)
```


# and now, all animals and with nested taxonomic ranks

So, a few people liked [this example](https://gist.github.com/dwinter/8d7bde0579daf7466508)
of using `rentrez` to investigate the taxonomic distribution of sequences in
Genbank. I though it might be fun to extend it a little. Specifically:

* I can't just look at boring old chordates, so let's look at all animals.
* The real power of the Treemap plot I used in the earlier example is to
      show nested relationships (like taxonomic ones), so let's break teh counts
      out by Phylum > Class > Order

##Every animal order in NCBI taxonomy

Let's start with all the orders

```{r}
library(rentrez)
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]")
animal_orders
```

That's.. alot. Let's save the NCBI some sever-load as us some time by using the
`WebEnv` feature of Entrez to store IDs on the NCBI's-end.

```{r}
animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]", usehistory="y")
animal_orders
animal_orders$WebEnv
```

The `WebEnv` attribute is our ticket to these stored IDs

##Fetch taxonomy records

The summary records we used last time don't have the complete taxonomic lineage
of each of our animal order, for that we will need the "full" xml records that
`entrez_fetch` can grab:


```{r, cache=TRUE}
tax_recs <- entrez_fetch(db="taxonomy",
                         WebEnv=animal_orders$WebEnv,
                         query_key=animal_orders$QueryKey,
                         rettype="xml")
typeof(tax_recs)
nchar(tax_recs)
```

As you can see, [at least for
now](https://github.com/ropensci/rentrez/issues/35) `entrez_fetch` gives us a
big character containing a _raw_ xml file. We can parse it with the `XML`
package:

```{r}
library(XML)
parsed_recs <- xmlTreeParse(tax_recs, useInternalNodes=TRUE)
class(parsed_recs)
```

#Extract Name, Phylum, Order for each class

Now the fun bit! `XML` let's you use [xpath](http://en.wikipedia.org/wiki/XPath)
to get around XML records. Here's what I came up with to extract this
information


```{r}
get_taxon <- function(rec, rank){
    xp  <-paste0("LineageEx/Taxon/Rank[.='", rank,"']/../ScientificName")
    res <- xpathSApply(rec, xp, xmlValue)
    if(is.null(res)){
        return(NA)
    }
    res
}

extract_parents <- function(rec){
    phy <- get_taxon(rec, "phylum")
    cls <- get_taxon(rec, "class")
    ord <-  xpathSApply(rec, "ScientificName", xmlValue)
    structure( c(phy, cls, ord), names=c("Phylum", "Class", "Order"))
}

taxa <- parsed_recs["/TaxaSet/Taxon"]
tax_df <-as.data.frame(t(sapply(taxa,extract_parents)))
head(tax_df)
```

#Count the sequences

We can use the names to build queries for `entrez_search`, the actual search
will take some time as `rentrez` enforces NCBI's limit of 3 requests per second:


```{r, cache=TRUE}
queries <- paste0(tax_df$Order, "[ORGN]")
nseq <-sapply(queries, function(q) entrez_search(db="nuccore", term=q)$count)
tax_df$nseq <- as.numeric(nseq)
head(tax_df)
```
#make the pretty plot!

```{r,cache=TRUE}
library(treemap)
plot_title <- paste0("Genbank sequences by Phylum > Class > Order (n = ",
                    round(sum(tax_df$nseq) / 1e6, 1),
                    "million)")
treemap(tax_df,
        index=c("Phylum", "Class", "Order"),
        vSize="nseq", vColor="Phylum",
        type='categorical',
        position.legend="none",
        title=plot_title,
        border.col=c("white","white","black")
)

```
	```{r, echo=FALSE}
	knitr::opts_knit$set(upload.fun = knitr::imgur_upload, base.url = NULL)
	```


	# and now, all animals and with nested taxonomic ranks

	So, a few people liked [this example](https://gist.github.com/dwinter/8d7bde0579daf7466508)
	of using `rentrez` to investigate the taxonomic distribution of sequences in
	Genbank. I though it might be fun to extend it a little. Specifically:

	* I can't just look at boring old chordates, so let's look at all animals.
	* The real power of the Treemap plot I used in the earlier example is to
	show nested relationships (like taxonomic ones), so let's break teh counts
	out by Phylum > Class > Order

	##Every animal order in NCBI taxonomy

	Let's start with all the orders

	```{r}
	library(rentrez)
	animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]")
	animal_orders
	```

	That's.. alot. Let's save the NCBI some sever-load as us some time by using the
	`WebEnv` feature of Entrez to store IDs on the NCBI's-end.

	```{r}
	animal_orders <- entrez_search(db="taxonomy", term="Animals[SBTR] AND Order[RANK]", usehistory="y")
	animal_orders
	animal_orders$WebEnv
	```

	The `WebEnv` attribute is our ticket to these stored IDs

	##Fetch taxonomy records

	The summary records we used last time don't have the complete taxonomic lineage
	of each of our animal order, for that we will need the "full" xml records that
	`entrez_fetch` can grab:


	```{r, cache=TRUE}
	tax_recs <- entrez_fetch(db="taxonomy",
	WebEnv=animal_orders$WebEnv,
	query_key=animal_orders$QueryKey,
	rettype="xml")
	typeof(tax_recs)
	nchar(tax_recs)
	```

	As you can see, [at least for
	now](https://github.com/ropensci/rentrez/issues/35) `entrez_fetch` gives us a
	big character containing a _raw_ xml file. We can parse it with the `XML`
	package:

	```{r}
	library(XML)
	parsed_recs <- xmlTreeParse(tax_recs, useInternalNodes=TRUE)
	class(parsed_recs)
	```

	#Extract Name, Phylum, Order for each class

	Now the fun bit! `XML` let's you use [xpath](http://en.wikipedia.org/wiki/XPath)
	to get around XML records. Here's what I came up with to extract this
	information


	```{r}
	get_taxon <- function(rec, rank){
	xp <-paste0("LineageEx/Taxon/Rank[.='", rank,"']/../ScientificName")
	res <- xpathSApply(rec, xp, xmlValue)
	if(is.null(res)){
	return(NA)
	}
	res
	}

	extract_parents <- function(rec){
	phy <- get_taxon(rec, "phylum")
	cls <- get_taxon(rec, "class")
	ord <- xpathSApply(rec, "ScientificName", xmlValue)
	structure( c(phy, cls, ord), names=c("Phylum", "Class", "Order"))
	}

	taxa <- parsed_recs["/TaxaSet/Taxon"]
	tax_df <-as.data.frame(t(sapply(taxa,extract_parents)))
	head(tax_df)
	```

	#Count the sequences

	We can use the names to build queries for `entrez_search`, the actual search
	will take some time as `rentrez` enforces NCBI's limit of 3 requests per second:


	```{r, cache=TRUE}
	queries <- paste0(tax_df$Order, "[ORGN]")
	nseq <-sapply(queries, function(q) entrez_search(db="nuccore", term=q)$count)
	tax_df$nseq <- as.numeric(nseq)
	head(tax_df)
	```
	#make the pretty plot!

	```{r,cache=TRUE}
	library(treemap)
	plot_title <- paste0("Genbank sequences by Phylum > Class > Order (n = ",
	round(sum(tax_df$nseq) / 1e6, 1),
	"million)")
	treemap(tax_df,
	index=c("Phylum", "Class", "Order"),
	vSize="nseq", vColor="Phylum",
	type='categorical',
	position.legend="none",
	title=plot_title,
	border.col=c("white","white","black")
	)

	```