Skip to content

Instantly share code, notes, and snippets.

@dwinter
Last active August 29, 2015 14:16
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dwinter/8d7bde0579daf7466508 to your computer and use it in GitHub Desktop.
Save dwinter/8d7bde0579daf7466508 to your computer and use it in GitHub Desktop.

#Finding the number of sequence records by taxonomic group

Ricardo wants to know how to find the number of sequence records associated with sub-groups within a given taxon . This example grew a bit too big to make into a comment, so here it is in gist form.

So, let's find out how many DNA sequences are present in genbank for each Order within the phylum Chordata (vertebrates and a few of their relatives).

(Note, this expample uses the development version on rentrez, which can be installed via devtools:

library(devtools)
install_github("ropensci/rentrez")

However, all the the code should work with the latest release on CRAN (the printed output will be different).

##Step one: How can we narrow a search of the taxonomy database?

Since version 0.4 rentrez has some helper functions to make it easier to make the most out of the Eutils API. For instance, we can get some information about the special terms available for the taxonomy database:

library(rentrez)
entrez_db_searchable("taxonomy")
## Searchable fields for database 'taxonomy'
##   ALL 	 All terms from all searchable fields 
##   UID 	 Unique number assigned to publication 
##   FILT 	 Limits the records 
##   SCIN 	 Scientific name of organism 
##   COMN 	 Common name of organism 
##   TXSY 	 Synonym of organism name 
##   ALLN 	 All aliases for organism 
##   NXLV 	 Immediate parent in taxonomic hierarchy 
##   SBTR 	 Any parent node in taxonomic hierarchy 
##   LNGE 	 Lineage in taxonomic hierarchy 
##   GC 	 Nuclear genetic code 
##   MGC 	 Mitochondrial genetic code 
##   PGC 	 Mitochondrial genetic code 
##   TXDV 	 GenBank division 
##   RANK 	 Hierarchical position (e.g., order, genus) 
##   EDAT 	 Date record first accessible through Entrez 
##   MDAT 	 Date of last update 
##   PROP 	 Property defined on particular node (e.g., terminal node) 
##   WORD 	 Free text associated with record 
##   NTOK 	 Name tokens associated with organism names

So, SUBTR (subtree) can help us limit a search to include only taxonomy records that are part of a given group, and RANK will let us get only those records matching a given taxonomy rank.

##Step two: Make a query

We can use these terms, and the entrez query format, to find taxonomy records for each Order within Chordata.First, how many records are there....

entrez_search(db="taxonomy", term="Chordata[SBTR] AND Order[RANK]", retmax=0)
## Entrez search result with 157 hits (object contains 0 IDs and no cookie)

We are going to have to process all of these IDs with R, we have to download all of them:

orders <- entrez_search(db="taxonomy", 
                               term="Chordata[SBTR] AND Order[RANK]",
                               retmax=157) 

##Search the nucleotide database for each taxonomic class

We can now use the taxonomy IDs to limit our search of the nucleotide (nuccore) database, again, let's find out how to make the search.

entrez_db_searchable("nuccore")
## Searchable fields for database 'nuccore'
##   ALL 	 All terms from all searchable fields 
##   UID 	 Unique number assigned to each sequence 
##   FILT 	 Limits the records 
##   WORD 	 Free text associated with record 
##   TITL 	 Words in definition line 
##   KYWD 	 Nonstandardized terms provided by submitter 
##   AUTH 	 Author(s) of publication 
##   JOUR 	 Journal abbreviation of publication 
##   VOL 	 Volume number of publication 
##   ISS 	 Issue number of publication 
##   PAGE 	 Page number(s) of publication 
##   ORGN 	 Scientific and common names of organism, and all higher levels of taxonomy 
##   ACCN 	 Accession number of sequence 
##   PACC 	 Does not include retired secondary accessions 
##   GENE 	 Name of gene associated with sequence 
##   PROT 	 Name of protein associated with sequence 
##   ECNO 	 EC number for enzyme or CAS registry number 
##   PDAT 	 Date sequence added to GenBank 
##   MDAT 	 Date of last update 
##   SUBS 	 CAS chemical name or MEDLINE Substance Name 
##   PROP 	 Classification by source qualifiers and molecule type 
##   SQID 	 String identifier for sequence 
##   GPRJ 	 BioProject 
##   SLEN 	 Length of sequence 
##   FKEY 	 Feature annotated on sequence 
##   PORG 	 Scientific and common names of primary organism, and all higher levels of taxonomy 
##   COMP 	 Component accessions for an assembly 
##   ASSM 	 Assembly 
##   DIV 	 Division 
##   STRN 	 Strain 
##   ISOL 	 Isolate 
##   CULT 	 Cultivar 
##   BRD 	 Breed 
##   BIOS 	 BioSample

ORGN is the key one here. It takes a little bit of work to use a taxonomy ID (rather than a name) to limit the searches,so we first have to build our queries by adding "taxid" to the fron of them:

queries <- paste0("txid", orders$ids, "[ORGN]")
head(queries)
## [1] "txid1563124[ORGN]" "txid1545895[ORGN]" "txid1490028[ORGN]"
## [4] "txid1489940[ORGN]" "txid1489939[ORGN]" "txid1489937[ORGN]"

Now we can use the count attribute of the object returned by entrez_search to see how many sequences are available for each order:

seq_count <- sapply(queries, 
              function(q) entrez_search(db="nucleotide", term=q, retmax=0)$count)
head(seq_count)
## txid1563124[ORGN] txid1545895[ORGN] txid1490028[ORGN] txid1489940[ORGN] 
##               "0"            "3274"            "1206"          "520562" 
## txid1489939[ORGN] txid1489937[ORGN] 
##            "1769"            "3428"

##Get the taxon names from summary records

Thats the numbers taken care of, but we can't yet match those numbers to a name. To do that we can fetch summary records for each order:

tax_summs <- entrez_summary(db="taxonomy", id=orders$ids)
tax_summs[[1]]
## esummary result with 12 items:
##  [1] uid              status           rank             division        
##  [5] scientificname   commonname       taxid            akataxid        
##  [9] genus            species          subsp            modificationdate

The summary records come back as a big list, simple records like those that come back from taxonomy can be converted to data.frames too (a general-purpose data.frame transformation might make it into the next version of rentrez):

tax_df <- do.call(rbind.data.frame, tax_summs)
head(tax_df)
##             uid status  rank    division    scientificname commonname
## 1563124 1563124 active order  placentals        Litopterna           
## 1545895 1545895 active order bony fishes Chaetodontiformes           
## 1490028 1490028 active order bony fishes  Holocentriformes           
## 1489940 1489940 active order bony fishes  Centrarchiformes           
## 1489939 1489939 active order bony fishes    Pempheriformes           
## 1489937 1489937 active order bony fishes   Acanthuriformes           
##           taxid akataxid genus species subsp modificationdate
## 1563124 1563124        0                     2014/11/03 00:00
## 1545895 1545895        0                     2014/09/16 00:00
## 1490028 1490028        0                     2014/04/18 00:00
## 1489940 1489940        0                     2014/04/18 00:00
## 1489939 1489939        0                     2014/04/18 00:00
## 1489937 1489937        0                     2014/04/17 00:00
##         genbankdivision
## 1563124         Mammals
## 1545895     Vertebrates
## 1490028     Vertebrates
## 1489940     Vertebrates
## 1489939     Vertebrates
## 1489937     Vertebrates

Now we can combine the names and the sequence counts to get our final result (both vectors are ordered by the vector of ids that was used to generate them, so tehy will align nicely):

seqs_by_class <- data.frame(taxon=tax_df$scientificname,
                            nseqs=as.numeric(seq_count)
)
head(seqs_by_class)
##               taxon  nseqs
## 1        Litopterna      0
## 2 Chaetodontiformes   3274
## 3  Holocentriformes   1206
## 4  Centrarchiformes 520562
## 5    Pempheriformes   1769
## 6   Acanthuriformes   3428

#make a pretty graph

We can't stop there! We also have to make a pretty graph. Using the very nice treemap package:

treemap::treemap(seqs_by_class, index= "taxon", vSize= "nseqs", border.lwds= 1 )

plot of chunk unnamed-chunk-11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment