|
--- |
|
title: Differential expression analysis |
|
author: John Blischak |
|
date: 2015-04-21 |
|
output: |
|
html_document: |
|
toc: true |
|
--- |
|
|
|
```{r settings, include = FALSE} |
|
library("knitcitations") |
|
library("knitr") |
|
opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center", |
|
cache = FALSE) |
|
``` |
|
|
|
In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor. |
|
Before class, please download the data set and install the software as explained in the following section. |
|
|
|
## Download data and install software |
|
|
|
For this tutorial, we will use the data set generated by the Sequencing Quality Control ([SEQC][]) project. |
|
Please download the data by running the line of R code below. |
|
If this fails, you can try changing the `method` parameter, e.g. `method = "auto"`, `method = "wget"` or `method = "curl"`. |
|
And as a last resort, you can download it by clicking on this [link][geo]. |
|
|
|
[SEQC]: http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/ |
|
[geo]: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz |
|
|
|
```{r download-data} |
|
fname <- "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz" |
|
download.file(fname, destfile = "GSE49712_HTSeq.txt.gz") |
|
``` |
|
|
|
Next, we need to download and install the two [Bioconductor][bioc] packages we will use to perform the analysis. |
|
|
|
[bioc]: http://www.bioconductor.org |
|
|
|
```{r install-packages, eval = FALSE} |
|
source("http://bioconductor.org/biocLite.R") |
|
biocLite("edgeR") |
|
``` |
|
|
|
## Introduction |
|
|
|
As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from `r citet(c(Oshlack2010="10.1038/nrg2484"))`). |
|
The number of reads that map to a gene is the proxy for its expression level. |
|
|
|
![RNA-seq experiment](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/bin/nihms229948f1.jpg) |
|
|
|
There are many steps between receiving the raw reads from the sequencer and gaining biological insight. |
|
In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes", as diagrammed in the RNA-seq analysis pipeline below (from `r citet(c(Wang2009="10.1186/gb-2010-11-12-220"))`). |
|
|
|
![RNA-seq analysis pipeline](http://genomebiology.com/content/figures/gb-2010-11-12-220-1.jpg) |
|
|
|
Unlike the continuous data that is generated by microarrays, RNA-seq generates counts. |
|
While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`). |
|
We will perform a simple analysis using one popular Bioconductor package for differential expression analysis, [edgeR][]. |
|
edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb]. |
|
|
|
[edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html |
|
[nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution |
|
|
|
The SEQC data set will we analyze contains two different groups for comparison. |
|
Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a]. |
|
It is an equal mixture of RNA from ten different human cell lines. |
|
Group B is five techincal replicates of [Ambion’s Human Brain Reference RNA][b]. |
|
It is RNA that was pooled from several donors from several different regions of the brain. |
|
|
|
[a]: http://www.chem.agilent.com/Library/usermanuals/Public/740000.pdf |
|
[b]: http://www.lifetechnologies.com/order/catalog/product/AM6050 |
|
|
|
## Setup |
|
|
|
First, we need to load the package into R. |
|
|
|
```{r load-package} |
|
library("edgeR") |
|
``` |
|
|
|
Second, we need to load the data into R. |
|
In order to use the code below, we need to ensure the data file is in R's working directory. |
|
We can check our current working directory with the command `getwd()`, and we can change it using `setwd("path/to/file")` or using RStudio (type Ctrl-Shift-K). |
|
|
|
```{r import-data} |
|
data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE) |
|
``` |
|
|
|
## Quality control |
|
|
|
Let's quickly explore the data. |
|
Each column corresponds to a sample, and each row corresponds to a gene. |
|
|
|
```{r view} |
|
dim(data_raw) |
|
head(data_raw) |
|
tail(data_raw) |
|
``` |
|
|
|
Notice that the last five lines contain summary statistics and thus need to be removed prior to testing. |
|
|
|
```{r remove-sum-lines} |
|
data_clean <- data_raw[1:(nrow(data_raw) - 5), ] |
|
``` |
|
|
|
We should also remove genes that are unexpressed or very lowly expressed in the samples. |
|
One simple method to do this is to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm). |
|
edgeR provides the function, `cpm`, to compute the counts-per-million. |
|
|
|
```{r expr-cutoff} |
|
cpm_log <- cpm(data_clean, log = TRUE) |
|
median_log2_cpm <- apply(cpm_log, 1, median) |
|
hist(median_log2_cpm) |
|
expr_cutoff <- -1 |
|
abline(v = expr_cutoff, col = "red", lwd = 3) |
|
sum(median_log2_cpm > expr_cutoff) |
|
data_clean <- data_clean[median_log2_cpm > expr_cutoff, ] |
|
``` |
|
|
|
After removing all genes with a median log2 cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining. |
|
A good rule of thumb when analyzing RNA-seq data from a single cell type is to expect 9-12 thousand expressed genes. |
|
In this case, we have many more because the samples include RNA collected from many different cell types, and thus it is not surprising that many more genes are robustly expressed. |
|
|
|
**Question:** |
|
What are some ways that we could improve this simple cutoff? |
|
What information are we ignoring? |
|
|
|
After filtering lowly expressed genes, we recalculate the log cpm. |
|
|
|
```{r} |
|
cpm_log <- cpm(data_clean, log = TRUE) |
|
``` |
|
|
|
We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix. |
|
The heatmap result corresponds to what we know about the data set. |
|
First, the samples in group A and B come from very different cell populations, so the two groups are very different. |
|
Second, since the samples in each group are technical replicates, the within group variance is very low. |
|
|
|
```{r heatmap} |
|
heatmap(cor(cpm_log)) |
|
``` |
|
|
|
Another method to view the relationships between samples is principal components analysis (PCA). |
|
|
|
```{r pca} |
|
pca <- prcomp(t(cpm_log), scale. = TRUE) |
|
plot(pca$x[, 1], pca$x[, 2], pch = ".", xlab = "PC1", ylab = "PC2") |
|
text(pca$x[, 1], pca$x[, 2], labels = colnames(cpm_log)) |
|
summary(pca) |
|
``` |
|
|
|
## Two group comparison |
|
|
|
Now we start preparing the data for the the test of differential expression. |
|
We create a vector called, `group`, that labels each of the columns as belonging to group A or B. |
|
We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differntial expression experiment. |
|
|
|
```{r make-groups-edgeR} |
|
group <- substr(colnames(data_clean), 1, 1) |
|
group |
|
y <- DGEList(counts = data_clean, group = group) |
|
y |
|
``` |
|
|
|
|
|
edgeR normalizes the genes counts using the method TMM (trimmed means of m values) developed by `r citet(c(Robinson2010="10.1186/gb-2010-11-3-r25"))`. |
|
Recall from lecture that the read counts for moderately to lowly expressed genes can be strongly influenced by small fluctuations in the expression level of highly expressed genes. |
|
In other words, small differences in expression of highly expressed genes between samples can give the appearance that many lower expressed genes are differentially expressed between conditions. |
|
TMM adjusts for this by removing the extremely lowly and highly expressed genes and also those genes that are very different across samples. |
|
It then compares the total counts for this subset of genes between the two samples to get the scaling factor (this is a simplification). |
|
Similar to normalization methods for microarray data, this method assumes the majority of genes are not differentially expressed between any two samples. |
|
|
|
```{r normalize-edgeR} |
|
y <- calcNormFactors(y) |
|
y$samples |
|
``` |
|
|
|
The next step is to model the variance of the read counts per gene. |
|
A natural method for modeling gene counts is the Poisson distribution. |
|
However, the Poisson assumes the mean and variance are identical, but it has been found emipirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed "overdispersion"). |
|
So instead the negative binomial distribution is used, which has a dispersion parameter for modeling the increase in variance from a Poisson process. |
|
edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biolocial coefficient of variation." |
|
Though it should be mentioned that any technical biases are also included in this estimate. |
|
edgeR shares information across genes to determine a common dispersion. |
|
It then calculates a dispersion esitmate per gene and shrinks it towards the common dispersion. |
|
The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression. |
|
|
|
**Note:** edgeR also has a trended dispersion to use in place of the common dispersion for more complicated experimental designs. |
|
|
|
```{r model-edgeR} |
|
y <- estimateCommonDisp(y, verbose = TRUE) |
|
y <- estimateTagwiseDisp(y) |
|
sqrt(y$common.dispersion) # biological coefficient of variation |
|
plotBCV(y) |
|
``` |
|
|
|
The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates. |
|
|
|
edgeR tests for differential expression between two classes using a modified version of the Fisher's Exact Test. |
|
|
|
**Note:** It also provides a generalized linear model frameworks for more complicated experimental designs. |
|
|
|
```{r test-edgeR} |
|
et <- exactTest(y) |
|
results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none") |
|
head(results_edgeR$table) |
|
``` |
|
|
|
How many genes are differentially expressed at an FDR of 10%? |
|
|
|
```{r count-de-edgeR} |
|
sum(results_edgeR$table$FDR < .1) |
|
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .01]) |
|
abline(h = c(-2, 2), col = "blue") |
|
``` |
|
|
|
As expected from the description of the samples and the heatmap, there are many differentially expressed genes. |
|
The [MA plot][ma] above plots the log2 fold change on the y-axis versus the average log2 counts-per-million on the x-axis. |
|
The red dots are genes with an FDR less than 10%. |
|
The blue lines represent a four-fold change in expression. |
|
|
|
[ma]: https://en.wikipedia.org/wiki/MA_plot |
|
|
|
## Adding covariates |
|
|
|
The previous example was a two group comparison, but if you have additional covariates, you'll need to use a generalized linear model (GLM) framework. |
|
Let's say we processed the samples in two batches, the samples were collected from males and females, and from a range of ages. |
|
|
|
```{r} |
|
set.seed(123) |
|
batch <- sample(c("one", "two"), 10, replace = TRUE) |
|
sex <- sample(c("M", "F"), 10, replace = TRUE) |
|
age <- sample(18:50, 10, replace = TRUE) |
|
``` |
|
|
|
We need to create a design matrix to describe the statistical model. |
|
|
|
```{r} |
|
y <- DGEList(data_clean) |
|
y <- calcNormFactors(y) |
|
design <- model.matrix(~group + batch + sex + age) |
|
design |
|
``` |
|
|
|
And now we test. |
|
|
|
```{r} |
|
y <- estimateGLMCommonDisp(y, design) |
|
y <- estimateGLMTrendedDisp(y, design) |
|
y <- estimateGLMTagwiseDisp(y, design) |
|
fit <- glmFit(y, design) |
|
lrt <- glmLRT(fit, coef = 2) |
|
topTags(lrt) |
|
``` |
|
|
|
And here is an example gene. |
|
|
|
```{r} |
|
boxplot(as.numeric(data_clean["HBB", ]) ~ group) |
|
``` |
|
|
|
|
|
## To learn more |
|
|
|
Check out the manual for [edgeR][erman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][]. |
|
For a rigorous comparison RNA-seq methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`. |
|
The [source code][gist] for this lesson is availabe as a GitHub Gist. |
|
|
|
[erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf |
|
[archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor |
|
[list]: http://www.bioconductor.org/help/mailing-list/ |
|
[gist]: https://gist.github.com/jdblischak/11384914 |
|
|
|
## References |
|
|
|
```{r refs, results = 'asis', echo = FALSE} |
|
bibliography() |
|
``` |
|
|
|
## Session information |
|
|
|
```{r info} |
|
sessionInfo() |
|
``` |
Great tutorial, you are teaching your students some great techniques!
Under 'How many genes are differentially expressed at an FDR of 10%?', do you mean to plot it as:
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1]) # Switched from <0.01