dwinter/rentrez.Rmd Secret

## rentrez.Rmd
##Example

Lately, I've been working on a little meta-analysis of phylogenies. In particualr,
we're interested in why sometimes different genes tell different stories about
the relationships between species from which the come. In terms of being able
to get the individual gene trees I need to do these analyses there are good,
rather less good and quite bad papers out there. In the best cases I can just
download the trees as nice, parsable newick files from[TreeBase](treebase.org)
(which has already been [wrapped by ROpenSci](github.com/ROpenSci/treebase)).
Sometimes I need to print out the trees from a paper and work with pencil and paper,
which I can handle. In a few cases people haven't actually published their individual
gene trees, if I want to included these papers I need to replicate their work by
downloading the gene sequences, aligning them and making new trees.

So, here's an example of how I've been using `rentrez` to automate some of that
process. I'm going to use a slightly convaluted process to get all the data, but
that's just so I can walk though a bunch of the `rentrez` functions. Let's get
started. Reece et al (2010, [doi:10.1016/j.ympev.2010.07.013](http://dx.doi.org/10.1016/j.ympev.2010.07.013))
presented a phylogeny of moray eels using four different genes, but didn't
publish the gene trees. I want to get the sequences underlying their analyses,
which will be in the NCBI's databases, so I can reproduce their results. To get
data associated with this paper from the NCBI I need the PMID (pubmed ID), which
I can find using the `rentrez` function `entrez_search` to query the pubmed
database with the paper's doi:

```{r get_pmid, message=FALSE}
library(rentrez)
pubmed_search <- entrez_search(db="pubmed", term="10.1016/j.ympev.2010.07.013[doi]")
pubmed_search$ids
```

All the functions in `rentrez` create a URL to get data from the NCBI, then fetch
the resulting document, usually as an XML file. In most cases the functions will
parse the most relevant sections of the XML file out and present them to you
as items in a list (`ids` being one item of the `pubmed_search` list in this case).

OK, now we have the PMID, what data does NCBI have for this paper? The
`entrez_link` function lets us find out. In this case the `db` argument can be
used to limit the number of data sources to check, but I want to see every data
source here so I'll set this paramater to "all":

```{r get_data}
NCBI_data <- entrez_link(dbfrom="pubmed", id=pubmed_search$ids, db="all")
str(NCBI_data)
```

The most relevant data here is the from the [popset](ncbi.nlm.nih.gov/popset)
database, which containts population and phylogenetic datasets. If I want to
see what each of the four popset datasets associated with this paper are about I
can use `entrez_summary` to have a look. This function can collect summaries
from a lot of different databases, and, because the XML return by those databases
isn't conisitant doesn't make any attempt to parse information from the resulting
file. Instead you get a `XMLInternalDocument` object from the `XML` library, which
you have to further process yourself. In this case, a little xpath gets the name
of each dataset:

```{r summary}
data_summaries <- entrez_summary(db="popset", ids=NCBI_data$pubmed_popset)
xpathSApply(data_summaries, "//Item[@Name='Title']", xmlValue)
```

Ok, since we might expect nuclear and mitochondrial genes to hav different
histories, let's get sequences from each genome (the the COI and RAG1 datasets)
using `entrez_fetch`. By specifying `file_format="fasta"` we will get characater
vectors in the fasta format:

``` {r fetch}
coi <- entrez_fetch(db="popset", ids=NCBI_data$pubmed_popset[1], file_format="fasta" )
rag1 <- entrez_fetch(db="popset", ids=NCBI_data$pubmed_popset[3], file_format="fasta")
write(coi, "moray_coi_raw.fasta")
write(rag1, "moray_rag1_raw.fasta")
```

So I've got the data on hand - that's all the I need `rentrez` for, but I might
as well align these  sequences and make gene trees for each. I'll just do a
quick and diry neighbor-joining tree using `ape` and we can clean up the long
OTU names with the help of `stingr`. (I put the fussy work of cleaning the names
 and rooting the trees into a function `clean_and_root`):

```{r trees}
library(ape)
library(stringr)
clean_and_root <- function(tr, outgroup, resolved=TRUE){
  tr$tip.label <- sapply(str_split(tr$tip.label, " "), function(x) paste(x[2:3], collapse="_"))
  return(root(tr, outgroup, resolve.root=resolved))
}
par(mfrow=c(1,2))
coi_ali <- muscle(read.dna("moray_coi_raw.fasta", "fasta"))
coi_tr <- nj(dist.dna(coi_ali, "k81"))
clean_coi_tr <- clean_and_root(coi_tr, "Uropterygius_macrocephalus" )
plot(clean_coi_tr, direction="rightwards", cex=0.5)
rag_ali <- muscle(read.dna("moray_rag1_raw.fasta", "fasta"))
rag_tr <- nj(dist.dna(rag_ali, "k81"))
clean_rag_tr <- clean_and_root(rag_tr, "Uropterygius_macrocephalus" )
plot(clean_rag_tr, direction="leftward", cex=0.5)
```
	##Example

	Lately, I've been working on a little meta-analysis of phylogenies. In particualr,
	we're interested in why sometimes different genes tell different stories about
	the relationships between species from which the come. In terms of being able
	to get the individual gene trees I need to do these analyses there are good,
	rather less good and quite bad papers out there. In the best cases I can just
	download the trees as nice, parsable newick files from[TreeBase](treebase.org)
	(which has already been [wrapped by ROpenSci](github.com/ROpenSci/treebase)).
	Sometimes I need to print out the trees from a paper and work with pencil and paper,
	which I can handle. In a few cases people haven't actually published their individual
	gene trees, if I want to included these papers I need to replicate their work by
	downloading the gene sequences, aligning them and making new trees.

	So, here's an example of how I've been using `rentrez` to automate some of that
	process. I'm going to use a slightly convaluted process to get all the data, but
	that's just so I can walk though a bunch of the `rentrez` functions. Let's get
	started. Reece et al (2010, [doi:10.1016/j.ympev.2010.07.013](http://dx.doi.org/10.1016/j.ympev.2010.07.013))
	presented a phylogeny of moray eels using four different genes, but didn't
	publish the gene trees. I want to get the sequences underlying their analyses,
	which will be in the NCBI's databases, so I can reproduce their results. To get
	data associated with this paper from the NCBI I need the PMID (pubmed ID), which
	I can find using the `rentrez` function `entrez_search` to query the pubmed
	database with the paper's doi:

	```{r get_pmid, message=FALSE}
	library(rentrez)
	pubmed_search <- entrez_search(db="pubmed", term="10.1016/j.ympev.2010.07.013[doi]")
	pubmed_search$ids
	```

	All the functions in `rentrez` create a URL to get data from the NCBI, then fetch
	the resulting document, usually as an XML file. In most cases the functions will
	parse the most relevant sections of the XML file out and present them to you
	as items in a list (`ids` being one item of the `pubmed_search` list in this case).

	OK, now we have the PMID, what data does NCBI have for this paper? The
	`entrez_link` function lets us find out. In this case the `db` argument can be
	used to limit the number of data sources to check, but I want to see every data
	source here so I'll set this paramater to "all":

	```{r get_data}
	NCBI_data <- entrez_link(dbfrom="pubmed", id=pubmed_search$ids, db="all")
	str(NCBI_data)
	```

	The most relevant data here is the from the [popset](ncbi.nlm.nih.gov/popset)
	database, which containts population and phylogenetic datasets. If I want to
	see what each of the four popset datasets associated with this paper are about I
	can use `entrez_summary` to have a look. This function can collect summaries
	from a lot of different databases, and, because the XML return by those databases
	isn't conisitant doesn't make any attempt to parse information from the resulting
	file. Instead you get a `XMLInternalDocument` object from the `XML` library, which
	you have to further process yourself. In this case, a little xpath gets the name
	of each dataset:

	```{r summary}
	data_summaries <- entrez_summary(db="popset", ids=NCBI_data$pubmed_popset)
	xpathSApply(data_summaries, "//Item[@Name='Title']", xmlValue)
	```

	Ok, since we might expect nuclear and mitochondrial genes to hav different
	histories, let's get sequences from each genome (the the COI and RAG1 datasets)
	using `entrez_fetch`. By specifying `file_format="fasta"` we will get characater
	vectors in the fasta format:

	``` {r fetch}
	coi <- entrez_fetch(db="popset", ids=NCBI_data$pubmed_popset[1], file_format="fasta" )
	rag1 <- entrez_fetch(db="popset", ids=NCBI_data$pubmed_popset[3], file_format="fasta")
	write(coi, "moray_coi_raw.fasta")
	write(rag1, "moray_rag1_raw.fasta")
	```

	So I've got the data on hand - that's all the I need `rentrez` for, but I might
	as well align these sequences and make gene trees for each. I'll just do a
	quick and diry neighbor-joining tree using `ape` and we can clean up the long
	OTU names with the help of `stingr`. (I put the fussy work of cleaning the names
	and rooting the trees into a function `clean_and_root`):

	```{r trees}
	library(ape)
	library(stringr)
	clean_and_root <- function(tr, outgroup, resolved=TRUE){
	tr$tip.label <- sapply(str_split(tr$tip.label, " "), function(x) paste(x[2:3], collapse="_"))
	return(root(tr, outgroup, resolve.root=resolved))
	}
	par(mfrow=c(1,2))
	coi_ali <- muscle(read.dna("moray_coi_raw.fasta", "fasta"))
	coi_tr <- nj(dist.dna(coi_ali, "k81"))
	clean_coi_tr <- clean_and_root(coi_tr, "Uropterygius_macrocephalus" )
	plot(clean_coi_tr, direction="rightwards", cex=0.5)
	rag_ali <- muscle(read.dna("moray_rag1_raw.fasta", "fasta"))
	rag_tr <- nj(dist.dna(rag_ali, "k81"))
	clean_rag_tr <- clean_and_root(rag_tr, "Uropterygius_macrocephalus" )
	plot(clean_rag_tr, direction="leftward", cex=0.5)
	```