#Finding the number of sequence records by taxonomic group
Ricardo wants to know how to find the number of sequence records associated with sub-groups within a given taxon . This example grew a bit too big to make into a comment, so here it is in gist form.
So, let's find out how many DNA sequences are present in genbank for each Order within the phylum Chordata (vertebrates and a few of their relatives).
(Note, this expample uses the development version on rentrez, which can be installed
via devtools
:
library(devtools)
install_github("ropensci/rentrez")
However, all the the code should work with the latest release on CRAN (the printed output will be different).
##Step one: How can we narrow a search of the taxonomy database?
Since version 0.4
rentrez has some helper functions to make it easier to
make the most out of the Eutils API. For instance, we can get some information
about the special terms available for the taxonomy database:
library(rentrez)
entrez_db_searchable("taxonomy")
## Searchable fields for database 'taxonomy'
## ALL All terms from all searchable fields
## UID Unique number assigned to publication
## FILT Limits the records
## SCIN Scientific name of organism
## COMN Common name of organism
## TXSY Synonym of organism name
## ALLN All aliases for organism
## NXLV Immediate parent in taxonomic hierarchy
## SBTR Any parent node in taxonomic hierarchy
## LNGE Lineage in taxonomic hierarchy
## GC Nuclear genetic code
## MGC Mitochondrial genetic code
## PGC Mitochondrial genetic code
## TXDV GenBank division
## RANK Hierarchical position (e.g., order, genus)
## EDAT Date record first accessible through Entrez
## MDAT Date of last update
## PROP Property defined on particular node (e.g., terminal node)
## WORD Free text associated with record
## NTOK Name tokens associated with organism names
So, SUBTR
(subtree) can help us limit a search to include only taxonomy
records that are part of a given group, and RANK
will let us get only those
records matching a given taxonomy rank.
##Step two: Make a query
We can use these terms, and the entrez query format, to find taxonomy records for each Order within Chordata.First, how many records are there....
entrez_search(db="taxonomy", term="Chordata[SBTR] AND Order[RANK]", retmax=0)
## Entrez search result with 157 hits (object contains 0 IDs and no cookie)
We are going to have to process all of these IDs with R, we have to download all of them:
orders <- entrez_search(db="taxonomy",
term="Chordata[SBTR] AND Order[RANK]",
retmax=157)
##Search the nucleotide database for each taxonomic class
We can now use the taxonomy IDs to limit our search of the nucleotide
(nuccore
) database, again, let's find out how to make the search.
entrez_db_searchable("nuccore")
## Searchable fields for database 'nuccore'
## ALL All terms from all searchable fields
## UID Unique number assigned to each sequence
## FILT Limits the records
## WORD Free text associated with record
## TITL Words in definition line
## KYWD Nonstandardized terms provided by submitter
## AUTH Author(s) of publication
## JOUR Journal abbreviation of publication
## VOL Volume number of publication
## ISS Issue number of publication
## PAGE Page number(s) of publication
## ORGN Scientific and common names of organism, and all higher levels of taxonomy
## ACCN Accession number of sequence
## PACC Does not include retired secondary accessions
## GENE Name of gene associated with sequence
## PROT Name of protein associated with sequence
## ECNO EC number for enzyme or CAS registry number
## PDAT Date sequence added to GenBank
## MDAT Date of last update
## SUBS CAS chemical name or MEDLINE Substance Name
## PROP Classification by source qualifiers and molecule type
## SQID String identifier for sequence
## GPRJ BioProject
## SLEN Length of sequence
## FKEY Feature annotated on sequence
## PORG Scientific and common names of primary organism, and all higher levels of taxonomy
## COMP Component accessions for an assembly
## ASSM Assembly
## DIV Division
## STRN Strain
## ISOL Isolate
## CULT Cultivar
## BRD Breed
## BIOS BioSample
ORGN
is the key one here. It takes a little bit of work to use a taxonomy ID
(rather than a name) to limit the searches,so we first have to build our queries
by adding "taxid" to the fron of them:
queries <- paste0("txid", orders$ids, "[ORGN]")
head(queries)
## [1] "txid1563124[ORGN]" "txid1545895[ORGN]" "txid1490028[ORGN]"
## [4] "txid1489940[ORGN]" "txid1489939[ORGN]" "txid1489937[ORGN]"
Now we can use the count
attribute of the object returned by entrez_search
to see how many sequences are available for each order:
seq_count <- sapply(queries,
function(q) entrez_search(db="nucleotide", term=q, retmax=0)$count)
head(seq_count)
## txid1563124[ORGN] txid1545895[ORGN] txid1490028[ORGN] txid1489940[ORGN]
## "0" "3274" "1206" "520562"
## txid1489939[ORGN] txid1489937[ORGN]
## "1769" "3428"
##Get the taxon names from summary records
Thats the numbers taken care of, but we can't yet match those numbers to a name. To do that we can fetch summary records for each order:
tax_summs <- entrez_summary(db="taxonomy", id=orders$ids)
tax_summs[[1]]
## esummary result with 12 items:
## [1] uid status rank division
## [5] scientificname commonname taxid akataxid
## [9] genus species subsp modificationdate
The summary records come back as a big list, simple records like those that come
back from taxonomy can be converted to data.frames
too (a general-purpose
data.frame transformation might make it into the next version of rentrez
):
tax_df <- do.call(rbind.data.frame, tax_summs)
head(tax_df)
## uid status rank division scientificname commonname
## 1563124 1563124 active order placentals Litopterna
## 1545895 1545895 active order bony fishes Chaetodontiformes
## 1490028 1490028 active order bony fishes Holocentriformes
## 1489940 1489940 active order bony fishes Centrarchiformes
## 1489939 1489939 active order bony fishes Pempheriformes
## 1489937 1489937 active order bony fishes Acanthuriformes
## taxid akataxid genus species subsp modificationdate
## 1563124 1563124 0 2014/11/03 00:00
## 1545895 1545895 0 2014/09/16 00:00
## 1490028 1490028 0 2014/04/18 00:00
## 1489940 1489940 0 2014/04/18 00:00
## 1489939 1489939 0 2014/04/18 00:00
## 1489937 1489937 0 2014/04/17 00:00
## genbankdivision
## 1563124 Mammals
## 1545895 Vertebrates
## 1490028 Vertebrates
## 1489940 Vertebrates
## 1489939 Vertebrates
## 1489937 Vertebrates
Now we can combine the names and the sequence counts to get our final result (both vectors are ordered by the vector of ids that was used to generate them, so tehy will align nicely):
seqs_by_class <- data.frame(taxon=tax_df$scientificname,
nseqs=as.numeric(seq_count)
)
head(seqs_by_class)
## taxon nseqs
## 1 Litopterna 0
## 2 Chaetodontiformes 3274
## 3 Holocentriformes 1206
## 4 Centrarchiformes 520562
## 5 Pempheriformes 1769
## 6 Acanthuriformes 3428
#make a pretty graph
We can't stop there! We also have to make a pretty graph. Using the very nice
treemap
package:
treemap::treemap(seqs_by_class, index= "taxon", vSize= "nseqs", border.lwds= 1 )