---
title: rentrez tutorial
layout: tutorial
package_version: 0.2.1
---
```{r, eval=TRUE, echo=FALSE}
opts_chunk$set(fig.path="../assets/tutorial-images/rentrez/")
```
rentrez tutorial
-------------
`rentrez` is an R package that helps users query the NCBI's databases to download genetic and bibliographic data. `rentrez` is now on CRAN, so it can be installed with `install.packages("rentrez")`. The source code is also available from the [rOpenSci GitHub repository](https://github.com/ropensci/rentrez).
What rentrez does
-----------------
At present the functions provided by `rentrez` cover the entire EUtils API. Broadly, the functions take arguments provided by the user, produce the URL needed to query the NCBI's API, and fetch the resulting data. In most cases the functions return lists whose items are the parts of the resulting file most likely to be useful. When the returned file is XML, the list also contains the XML document itself for those who want to dig deeper.
`rentrez` presumes you already know your way around the EUtils API, [which is well documented](http://www.ncbi.nlm.nih.gov/books/NBK25500/). Make sure you read the documentation, and in particular, be aware of the NCBI's usage policies and try to limit very large requests to off-peak (USA) times.
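To make that mapping concrete, here's a rough sketch of the correspondence between a `rentrez` call and the underlying EUtils URL (the exact URL the package builds may differ in its details; the endpoint and parameters below are from the EUtils documentation):
```{r, eval=FALSE}
# A search like this...
entrez_search(db = "pubmed", term = "R Language")
# ...queries an EUtils URL along these lines:
# http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=R+Language
```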
Examples
--------
Lately, I've been working on a little meta-analysis of phylogenies. In particular, we're interested in why different genes sometimes tell different stories about the relationships between the species from which they come. In terms of being able to get the individual gene trees I need for these analyses, there are good, rather less good and quite bad papers out there. In the best cases I can just download the trees as nice, parsable newick files from [TreeBase](http://treebase.org), which has already been [wrapped by rOpenSci](http://github.com/ropensci/treebase). Sometimes I need to print out the trees from a paper and work with pencil and paper, which I can handle. In a few cases people haven't actually published their individual gene trees; if I want to include these papers I need to replicate their work by downloading the gene sequences, aligning them and making new trees.
So, here's an example of how I've been using `rentrez` to automate some of that process. I'm going to use a slightly convoluted process to get all the data, but that's just so I can walk through a bunch of the `rentrez` functions. Let's get started.
<section id="installation">
## Installation
```{r install, eval=FALSE}
install.packages("rentrez")
```
<section id="usage">
## Usage
## Finding papers in Pubmed
Reece et al. (2010, [doi:10.1016/j.ympev.2010.07.013](http://dx.doi.org/10.1016/j.ympev.2010.07.013)) presented a phylogeny of moray eels using four different genes, but didn't publish the gene trees. I want to get the sequences underlying their analyses, which will be in the NCBI's databases, so I can reproduce their results. To get data associated with this paper from the NCBI I need its PMID (PubMed ID), which I can find using the `rentrez` function `entrez_search` to query the PubMed database with the paper's DOI:
```{r load, message=FALSE, warning=FALSE}
library("rentrez")
library("XML")
```
```{r pubmed_search, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
pubmed_search <- entrez_search(db = "pubmed", term = "10.1016/j.ympev.2010.07.013[doi]")
pubmed_search$ids
```
All the functions in `rentrez` create a URL to get data from the NCBI, then fetch the resulting document, usually as an XML file. In most cases the functions parse out the most relevant sections of the XML file and present them to you as items in a list (`ids` being one item of the `pubmed_search` list in this case).
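If you want to see everything the search returned, not just the IDs, you can inspect the whole list (a quick sketch; the exact elements, such as a total count of matching records, may vary between `rentrez` versions):
```{r, eval=FALSE}
# Everything the search object holds, not just the ids
str(pubmed_search)
```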
## Find data related to a particular record in an NCBI database
OK, now we have the PMID, what data does the NCBI have for this paper? The `entrez_link` function lets us find out. In this case the `db` argument can be used to limit the number of data sources to check, but I want to see every data source here so I'll set this parameter to “all”:
```{r entrez_link, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
NCBI_data <- entrez_link(dbfrom = "pubmed", id = pubmed_search$ids, db = "all")
str(NCBI_data)
```
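The `str()` output above shows one vector of record IDs per linked database. To pull out a particular set of links, index the list by name (the element name `pubmed_popset` here is taken from that output, and is used again below):
```{r, eval=FALSE}
# IDs of the popset records linked to this paper
NCBI_data$pubmed_popset
```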
## Fetch a summary of a record
The most relevant data here is from the [popset](http://www.ncbi.nlm.nih.gov/popset) database, which contains population and phylogenetic datasets. If I want to see what each of the four popset datasets associated with this paper is about, I can use `entrez_summary` to have a look. This function can collect summaries from a lot of different databases and, because the XML returned by those databases isn't consistent, it makes no attempt to deeply parse the resulting file; you may need to process the result further yourself (for instance with XPath queries via the `XML` library). In this case, extracting the `Title` element from each record's summary gives the name of each dataset:
```{r entrez_summary, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
data_summaries <- entrez_summary(db = "popset", id = NCBI_data$pubmed_popset)
sapply(data_summaries, "[[", "Title")
```
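If your version of `rentrez` hands back the raw esummary XML (an `XMLInternalDocument`) rather than a list, a little XPath does the same job. This sketch assumes the titles sit in `Item` nodes named "Title", as in the NCBI's esummary XML format, and uses the `XML` library loaded above:
```{r, eval=FALSE}
# Equivalent extraction with XPath, for a raw XMLInternalDocument result
xpathSApply(data_summaries, "//Item[@Name='Title']", xmlValue)
```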
## Fetch data
OK, since we might expect nuclear and mitochondrial genes to have different histories, let's get sequences from each genome (the COI and RAG1 datasets) using `entrez_fetch`. By specifying `file_format="fasta"` we will get character vectors in the FASTA format:
```{r entrez_fetch, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
coi <- entrez_fetch(db = "popset", rettype = 'fasta', ids = NCBI_data$pubmed_popset[1], file_format = "fasta")
rag1 <- entrez_fetch(db = "popset", rettype = 'fasta', ids = NCBI_data$pubmed_popset[3], file_format = "fasta")
write(coi, "~/moray_coi_raw.fasta")
write(rag1, "~/moray_rag1_raw.fasta")
```
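Because `entrez_fetch` returns each dataset as one long character vector, it's worth peeking at the start of the result to check the download looks sensible (a simple sanity check, nothing `rentrez`-specific):
```{r, eval=FALSE}
# The first few hundred characters; a FASTA file should open with a ">" header line
cat(substr(coi, 1, 300))
```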
So I've got the data on hand - that's all I need `rentrez` for, but I might as well align these sequences and make gene trees for each. I'll just do a quick and dirty neighbor-joining tree using `ape`, and we can clean up the long OTU names with the help of `stringr`. (I put the fussy work of cleaning the names and rooting the trees into a function, `clean_and_root`):
```{r muscle, eval=FALSE}
library(ape)
library(stringr)

# Shorten the tip labels to "Genus_species" and root the tree on the outgroup
clean_and_root <- function(tr, outgroup, resolved = TRUE) {
    tr$tip.label <- sapply(str_split(tr$tip.label, " "),
                           function(x) paste(x[2:3], collapse = "_"))
    return(root(tr, outgroup, resolve.root = resolved))
}

par(mfrow = c(1, 2))

coi_ali <- muscle(read.dna("~/moray_coi_raw.fasta", "fasta"))
coi_tr <- nj(dist.dna(coi_ali, "k81"))
clean_coi_tr <- clean_and_root(coi_tr, "Uropterygius_macrocephalus")
plot(clean_coi_tr, direction = "rightwards", cex = 0.5)

rag_ali <- muscle(read.dna("~/moray_rag1_raw.fasta", "fasta"))
rag_tr <- nj(dist.dna(rag_ali, "k81"))
clean_rag_tr <- clean_and_root(rag_tr, "Uropterygius_macrocephalus")
plot(clean_rag_tr, direction = "leftwards", cex = 0.5)
```
![](../assets/img/vignettes/rentrez/rentrez_tree.png)
Working with WebEnvs
--------------------
The NCBI provides search history features, which can be useful for dealing with large lists of IDs (which will not fit in a single URL) or repeated searches. As an example, we will go searching for COI sequences from all the land snail (Stylommatophora) species we can find in the nucleotide database:
```{r webenvs1, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
snail_search <- entrez_search(db = "nuccore", "Gastropoda[Organism] AND COI[Gene]", usehistory = "y")
```
Because we set `usehistory` to “y”, the `snail_search` object contains a unique ID for the search (`WebEnv`) and the particular query in that search history (`QueryKey`). Instead of using the hundreds of IDs we turned up to make a new URL, we can use the web history features to fetch the sequences.
```{r webenvs2, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
cookie <- snail_search$WebEnv
qk <- snail_search$QueryKey
snail_coi <- entrez_fetch(db = "nuccore", WebEnv = cookie, query_key = qk, file_format = "fasta", retmax = 10)
```
In that case we used `retmax` to limit the number of records we downloaded, but there are actually thousands. If we wanted to download all of them it would probably be a good idea to download them in batches (both to give the NCBI's servers a break and to make sure a corrupted download doesn't ruin your whole process). Using a for loop in conjunction with the arguments `retstart` and `retmax` we can download the sequences 50 at a time:
```{r webenvs3, message=FALSE, warning=FALSE, comment=NA, cache=FALSE}
for (start_rec in seq(0, 200, 50)) {
    fname <- paste("snail_coi_", start_rec, ".fasta", sep = "")
    recs <- entrez_fetch(db = "nuccore", WebEnv = cookie, query_key = qk,
                         file_format = "fasta", retstart = start_rec, retmax = 50)
    write(recs, fname)
    print(paste("wrote records to ", fname, sep = ""))
}
```
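Once the loop finishes, the batch files can be stitched back together into a single FASTA file with base R (a small sketch; the file names are just those generated by the loop above):
```{r, eval=FALSE}
# Concatenate the batch files into one FASTA file
batch_files <- paste("snail_coi_", seq(0, 200, 50), ".fasta", sep = "")
writeLines(unlist(lapply(batch_files, readLines)), "snail_coi_all.fasta")
```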
<section id="citing">
## Citing
To cite `rentrez` in publications use:
<br>
> David Winter (2014). rentrez: Entrez in R. R package version 0.2.3.
<section id="license_bugs">
## License and bugs
* License: [MIT](http://opensource.org/licenses/MIT)
* Report bugs at [our Github repo for rentrez](https://github.com/ropensci/rentrez/issues?state=open)