Predictive Modeling: Drug Response in Cancer Cell Lines
=======================================================
This is a demo of **Knitr**, an R package for authoring executable documents: documents that mix formatted text, source code and graphical output. I've used In Sock's demos of drug response in [CCLE][1] data, but I've probably gotten most of the analysis mixed up. Apologies to In Sock.
Analysis of cancer cell lines for drug sensitivity using:
* [Cancer Cell Line Encyclopedia][1]
* [Gene Set Enrichment Analysis][2]
Workflow
--------
The predictive modeling workflow is composed of three consecutive steps:
1. Applying predictive models
2. Bootstrapping feature selection
3. Analysis with selected features
Several predictive models (Lasso, Elastic Net, Ridge, SVM, PCR, PLS, SPLS, etc.) are trained and cross-validated. Bootstrapping selects features that are stable across resampled data and biologically relevant with respect to prior knowledge such as pathways. Gold-standard predictors and gene-set enrichment analysis then confirm the selected features against KEGG, BIOCARTA, REACTOME and GO term enrichment. A minimal sketch of the bootstrapped selection step follows the workflow figure below.
![predictive modeling workflow](figures/workflow.png)
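To make step 2 concrete, here is a minimal, hedged sketch of bootstrapped feature selection on synthetic data. It uses `cv.glmnet` from the **glmnet** package as a stand-in for the `predictiveModeling` code used later; `X`, `y` and `n_boot` are illustrative names, not part of the CCLE analysis.
```{r bootstrap-sketch, eval=FALSE}
# Illustrative only: fit a lasso on bootstrap resamples and tally how often
# each feature is selected. Stable features are selected in most runs.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)   # 100 samples x 50 features
y <- X[, 1] - 2 * X[, 2] + rnorm(100)      # response driven by two features
n_boot <- 50
selected <- matrix(0, nrow = ncol(X), ncol = n_boot)
for (b in seq_len(n_boot)) {
  idx <- sample(nrow(X), replace = TRUE)   # bootstrap resample
  fit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
  selected[, b] <- as.numeric(coef(fit, s = "lambda.min")[-1] != 0)
}
head(sort(rowMeans(selected), decreasing = TRUE))  # selection frequencies
```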
Load data from Synapse
----------------------
First, we load the [synapseClient][3] package and log in.
```{r results='hide', message=FALSE, warning=FALSE}
library(RJSONIO)
library(synapseClient)
```
```{r synapseLogin}
synapseLogin()  # prompts for Synapse credentials; don't hard-code them in a shared document
```
Next, we load several datasets from the Cancer Cell Line Project in Synapse (syn170143). This project holds datasets for gene expression, copy number variation, mutation and drug sensitivity.
```{r download-data-from-synapse}
# copy number variation
cnv_entity <- loadEntity('syn269019')
cnv <- cnv_entity$objects$eSet_copy
# mutation data
oncomap_entity <- loadEntity('syn269021')
oncomap <- oncomap_entity$objects$eSet_oncomap
# gene expression
expr_entity <- loadEntity('syn269056')
expr <- expr_entity$objects$eSet_expr
# drug
drug_entity <- loadEntity('syn269024')
adf_drug <- drug_entity$objects$adf_drug
```
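Before preprocessing, it's worth a quick look at what came back. This is a hedged sanity check, assuming the same `assayData(...)$exprs` accessor that the heatmap code at the end of this document uses:
```{r sanity-check, eval=FALSE}
# Quick sanity check on the loaded expression set (illustrative; assumes the
# assayData(...)$exprs accessor used in the heatmap section below).
dim(assayData(expr)$exprs)             # genes x cell lines
head(colnames(assayData(expr)$exprs))  # a few cell line names
```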
Preprocessing
-------------
```{r results='hide', message=FALSE}
require(predictiveModeling)
```
```{r source-insock-code, echo=FALSE, results='hide', message=FALSE}
source("~/COMPBIO/trunk/users/jang/R5/myEnetModel.R")
source("~/COMPBIO/trunk/users/jang/R5/bootstrapPredictiveModel.R")
```
Let's concatenate the molecular feature datasets (expr, cnv, and oncomap), filter out NAs and match the sample names with the drug data. Then we'll filter out redundant and low-variance features and scale the remainder.
```{r preprocess, tidy=FALSE, eval=FALSE}
# combine all features into one matrix
featureData <- createAggregateFeatureDataSet(list(expr=expr, copy=cnv, mut=oncomap))
# filter rows with NAs
featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows")
dataSets_ccle <- createFeatureAndResponseDataList(t(featureData_filtered),adf_drug)
# pick just the first drug
kk=1
# to make computation fast, we prefilter observation matrix with following thresholds
filteredData <- filterPredictiveModelData(dataSets_ccle$featureData,
                                          dataSets_ccle$responseData[, kk, drop=FALSE],
                                          featureVarianceThreshold = 0.01,
                                          corPValThresh = 0.1)
# CNV may contain redundant features that share the same genomic position
filteredFeatureData <- filteredData$featureData
filteredFeatureData <- unique(filteredFeatureData, MARGIN=2)
filteredResponseData <- filteredData$responseData
## scale these data
filteredFeatureDataScaled <- scale(filteredFeatureData)
filteredResponseDataScaled <- scale(filteredResponseData)
```
Predictive Modeling
-------------------
Here, we'll make use of [foreach and doMC][4] to spread the bootstrapping runs over multiple cores.
```{r results='hide', message=FALSE}
require(foreach)
require(doMC)
registerDoMC()
```
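Once a backend is registered, `%dopar%` spreads loop iterations across the available cores. A minimal illustration (not part of the analysis):
```{r foreach-sketch, eval=FALSE}
# Illustrative only: each iteration can run on its own core once
# registerDoMC() has been called; .combine = c collects results into a vector.
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}
squares
```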
We'll do 100 runs of elastic net (which takes a long time!). Elastic net estimates coefficients by minimizing a penalized least-squares objective:
$latex \hat{\beta} = \underset{\beta}{\operatorname{argmin}}(\left \| y-X\beta \right \|^2 + \lambda (\alpha \left \| \beta \right \|_2^2 + (1-\alpha) \left \| \beta \right \|_1))$
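As a hedged illustration of what the two parameters do, here is a toy comparison with the **glmnet** package (note that glmnet's convention puts alpha on the L1 term, so `alpha = 1` is the lasso and `alpha = 0` is ridge):
```{r enet-penalty-sketch, eval=FALSE}
# Illustrative only: the L1 penalty drives coefficients exactly to zero,
# while the L2 penalty only shrinks them toward zero.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)
y <- X[, 1] + rnorm(100)
lasso <- glmnet(X, y, alpha = 1, lambda = 0.1)
ridge <- glmnet(X, y, alpha = 0, lambda = 0.1)
sum(coef(lasso) != 0)  # few nonzero coefficients
sum(coef(ridge) != 0)  # all coefficients nonzero, but shrunk
```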
```{r bootstrap, eval=FALSE}
# For Elastic Net, we set two parameters: alpha and lambda
alphas <- unique(createENetTuneGrid()[,1])
lambdas <- createENetTuneGrid(alphas = 1)[,2]
# perform 100 bootstrapping runs
resultsModel <- bootstrapPredictiveModel(filteredFeatureDataScaled,
                                         filteredResponseDataScaled,
                                         model = myEnetModel$new(),
                                         numBootstrap = 100,
                                         alpha = alphas, lambda = lambdas)
```
Select predictors that are consistently highly ranked over the bootstrapping runs.
```{r totally-cheating, echo=FALSE, results='hide', message=FALSE}
e <- loadEntity('syn1437787')
resultsModel <- e$objects$resultsModel
```
```{r rank-predictors}
# Rank predictors within each run, then total the normalized ranks across runs
RankTotals <- c()
for (k in 1:length(resultsModel)) {
  RankTotals <- cbind(RankTotals,
                      rank(abs(as.numeric(resultsModel[[k]][-1])),
                           ties.method = "min") / length(resultsModel[[k]]))
}
rownames(RankTotals) <- rownames(resultsModel[[1]])[-1]  # drop the intercept
# select the top influences
reference <- apply(RankTotals, 1, sum)
X <- sort(reference, decreasing = TRUE, index.return = TRUE)
sigCoefName <- names(reference)[X$ix]
```
Let's look at the distribution of rankings among the predictors.
```{r histogram}
hist(reference, main="Ranking predictors", xlab="predictor rank summed over 100 bootstrapping runs", col="lightgrey")
```
Every analysis in which gene expression plays a part must contain at least one heatmap. This is a very important rule which we don't want to break. So, here it is:
```{r heatmap}
genes <- unique(gsub(pattern="_(expr|copy|mut)$", replacement="", x=sigCoefName[1:100]))
genes <- intersect(genes, rownames(assayData(expr)$exprs))
m <- (assayData(expr)$exprs)[genes,]
heatmap(m, col=topo.colors(100), cexRow=0.4)
```
Conclusion
----------
Knitr is cool!
* mixes formatted text, code and graphics
* outputs HTML or PDF
* produces documents programmatically, which means they can be shared, re-run and versioned
* works with markdown, which I'm hoping will mesh nicely with the wiki features coming to Synapse
* still has some kinks: I had issues with LaTeX equation rendering, and debugging is hard
* long-running analyses should be pre-computed and stored in Synapse
[1]: http://www.broadinstitute.org/software/cprg/?q=node/11 "CCLE"
[2]: http://www.broadinstitute.org/gsea/ "GSEA"
[3]: https://sagebionetworks.jira.com/wiki/display/SYNR/Home
[4]: http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf