Create a gist now

Instantly share code, notes, and snippets.

@dwinter /rentrez.Rmd Secret
Created Aug 18, 2012

source for code section of Rentrez post


Lately, I've been working on a little meta-analysis of phylogenies. In particualr, we're interested in why sometimes different genes tell different stories about the relationships between species from which the come. In terms of being able to get the individual gene trees I need to do these analyses there are good, rather less good and quite bad papers out there. In the best cases I can just download the trees as nice, parsable newick files fromTreeBase (which has already been wrapped by ROpenSci). Sometimes I need to print out the trees from a paper and work with pencil and paper, which I can handle. In a few cases people haven't actually published their individual gene trees, if I want to included these papers I need to replicate their work by downloading the gene sequences, aligning them and making new trees.

So, here's an example of how I've been using rentrez to automate some of that process. I'm going to use a slightly convaluted process to get all the data, but that's just so I can walk though a bunch of the rentrez functions. Let's get started. Reece et al (2010, doi:10.1016/j.ympev.2010.07.013) presented a phylogeny of moray eels using four different genes, but didn't publish the gene trees. I want to get the sequences underlying their analyses, which will be in the NCBI's databases, so I can reproduce their results. To get data associated with this paper from the NCBI I need the PMID (pubmed ID), which I can find using the rentrez function entrez_search to query the pubmed database with the paper's doi:

pubmed_search <- entrez_search(db="pubmed", term="10.1016/j.ympev.2010.07.013[doi]")

All the functions in rentrez create a URL to get data from the NCBI, then fetch the resulting document, usually as an XML file. In most cases the functions will parse the most relevant sections of the XML file out and present them to you as items in a list (ids being one item of the pubmed_search list in this case).

OK, now we have the PMID, what data does NCBI have for this paper? The entrez_link function lets us find out. In this case the db argument can be used to limit the number of data sources to check, but I want to see every data source here so I'll set this paramater to "all":

NCBI_data <- entrez_link(dbfrom="pubmed", id=pubmed_search$ids, db="all")

The most relevant data here is the from the popset database, which containts population and phylogenetic datasets. If I want to see what each of the four popset datasets associated with this paper are about I can use entrez_summary to have a look. This function can collect summaries from a lot of different databases, and, because the XML return by those databases isn't conisitant doesn't make any attempt to parse information from the resulting file. Instead you get a XMLInternalDocument object from the XML library, which you have to further process yourself. In this case, a little xpath gets the name of each dataset:

data_summaries <- entrez_summary(db="popset", ids=NCBI_data$pubmed_popset)
xpathSApply(data_summaries, "//Item[@Name='Title']", xmlValue)

Ok, since we might expect nuclear and mitochondrial genes to hav different histories, let's get sequences from each genome (the the COI and RAG1 datasets) using entrez_fetch. By specifying file_format="fasta" we will get characater vectors in the fasta format:

coi <- entrez_fetch(db="popset", ids=NCBI_data$pubmed_popset[1], file_format="fasta" )
rag1 <- entrez_fetch(db="popset", ids=NCBI_data$pubmed_popset[3], file_format="fasta")
write(coi, "moray_coi_raw.fasta")
write(rag1, "moray_rag1_raw.fasta")

So I've got the data on hand - that's all the I need rentrez for, but I might as well align these sequences and make gene trees for each. I'll just do a quick and diry neighbor-joining tree using ape and we can clean up the long OTU names with the help of stingr. (I put the fussy work of cleaning the names and rooting the trees into a function clean_and_root):

clean_and_root <- function(tr, outgroup, resolved=TRUE){
  tr$tip.label <- sapply(str_split(tr$tip.label, " "), function(x) paste(x[2:3], collapse="_"))
  return(root(tr, outgroup, resolve.root=resolved))
coi_ali <- muscle(read.dna("moray_coi_raw.fasta", "fasta"))
coi_tr <- nj(dist.dna(coi_ali, "k81"))
clean_coi_tr <- clean_and_root(coi_tr, "Uropterygius_macrocephalus" )
plot(clean_coi_tr, direction="rightwards", cex=0.5)
rag_ali <- muscle(read.dna("moray_rag1_raw.fasta", "fasta"))
rag_tr <- nj(dist.dna(rag_ali, "k81"))
clean_rag_tr <- clean_and_root(rag_tr, "Uropterygius_macrocephalus" )
plot(clean_rag_tr, direction="leftward", cex=0.5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment