Predictive Modeling: Drug Response in Cancer Cell Lines
=======================================================
This is a demo of **Knitr**, an R package for authoring executable documents: documents that mix formatted text, source code and graphical output. I've used In Sock's demos of drug response in [CCLE][1] data, but I've probably gotten most of the analysis mixed up. Apologies to In Sock.
Analysis of cancer cell lines for drug sensitivity using:
* [Cancer Cell Line Encyclopedia][1]
* [Gene Set Enrichment Analysis][2]
Workflow
--------
The predictive modeling workflow is composed of three consecutive steps:
1. Applying predictive models
2. Bootstrapping feature selection
3. Analysis with selected features
Several predictive models (Lasso, Elastic Net, Ridge, SVM, PCR, PLS, SPLS, etc.) are trained and cross-validated. Bootstrapping selects features that are stable across resampled data and biologically relevant with respect to prior knowledge such as pathways. Gold-standard predictors and gene-set enrichment analysis then confirm the selected features against KEGG, BIOCARTA, REACTOME and GO term enrichment. A minimal sketch of the bootstrapped selection step follows the workflow figure below.
![predictive modeling workflow](figures/workflow.png)
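To make step 2 concrete, here is a minimal, hedged sketch of bootstrapped feature selection on synthetic data. It uses `cv.glmnet` from the **glmnet** package as a stand-in for the `predictiveModeling` code used later; `X`, `y` and `n_boot` are illustrative names, not part of the CCLE analysis.
```{r bootstrap-sketch, eval=FALSE}
# Illustrative only: fit a lasso on bootstrap resamples and tally how often
# each feature is selected. Stable features are selected in most runs.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)   # 100 samples x 50 features
y <- X[, 1] - 2 * X[, 2] + rnorm(100)      # response driven by two features
n_boot <- 50
selected <- matrix(0, nrow = ncol(X), ncol = n_boot)
for (b in seq_len(n_boot)) {
  idx <- sample(nrow(X), replace = TRUE)   # bootstrap resample
  fit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
  selected[, b] <- as.numeric(coef(fit, s = "lambda.min")[-1] != 0)
}
head(sort(rowMeans(selected), decreasing = TRUE))  # selection frequencies
```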
Load data from Synapse
----------------------
First, we load the [synapseClient][3] package and log in.
```{r results='hide', message=FALSE, warning=FALSE}
library(RJSONIO)
library(synapseClient)
```
```{r synapseLogin}
synapseLogin()  # prompts for Synapse credentials; don't hard-code them in a shared document
```
Next, we load several datasets from the Cancer Cell Line Project in Synapse (syn170143). This project holds datasets for gene expression, copy number variation, mutation and drug sensitivity.
```{r download-data-from-synapse}
# copy number variation
cnv_entity <- loadEntity('syn269019')
cnv <- cnv_entity$objects$eSet_copy
# mutation data
oncomap_entity <- loadEntity('syn269021')
oncomap <- oncomap_entity$objects$eSet_oncomap
# gene expression
expr_entity <- loadEntity('syn269056')
expr <- expr_entity$objects$eSet_expr
# drug
drug_entity <- loadEntity('syn269024')
adf_drug <- drug_entity$objects$adf_drug
```
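Before preprocessing, it's worth a quick look at what came back. This is a hedged sanity check, assuming the same `assayData(...)$exprs` accessor that the heatmap code at the end of this document uses:
```{r sanity-check, eval=FALSE}
# Quick sanity check on the loaded expression set (illustrative; assumes the
# assayData(...)$exprs accessor used in the heatmap section below).
dim(assayData(expr)$exprs)             # genes x cell lines
head(colnames(assayData(expr)$exprs))  # a few cell line names
```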
Preprocessing
-------------
```{r results='hide', message=FALSE}
require(predictiveModeling)
```
```{r source-insock-code, echo=FALSE, results='hide', message=FALSE}
source("~/COMPBIO/trunk/users/jang/R5/myEnetModel.R")
source("~/COMPBIO/trunk/users/jang/R5/bootstrapPredictiveModel.R")
```
Let's concatenate the molecular feature datasets (expr, cnv, and oncomap), filter out NAs and match the sample names with the drug data. Then we'll filter out redundant and low-variance features and scale the remainder.
```{r preprocess, tidy=FALSE, eval=FALSE}
# combine all features into one matrix
featureData <- createAggregateFeatureDataSet(list(expr=expr, copy=cnv, mut=oncomap))
# filter rows with NAs
featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows")
dataSets_ccle <- createFeatureAndResponseDataList(t(featureData_filtered),adf_drug)
# pick just the first drug
kk=1
# to make computation fast, we prefilter observation matrix with following thresholds
filteredData <- filterPredictiveModelData(dataSets_ccle$featureData,
                                          dataSets_ccle$responseData[, kk, drop=FALSE],
                                          featureVarianceThreshold = 0.01,
                                          corPValThresh = 0.1)
# CNV may contain redundant features that share the same genomic position
filteredFeatureData <- filteredData$featureData
filteredFeatureData <- unique(filteredFeatureData, MARGIN=2)
filteredResponseData <- filteredData$responseData
## scale these data
filteredFeatureDataScaled <- scale(filteredFeatureData)
filteredResponseDataScaled <- scale(filteredResponseData)
```
Predictive Modeling
-------------------
Here, we'll make use of [foreach and doMC][4] to spread the bootstrapping runs over multiple cores.
```{r results='hide', message=FALSE}
require(foreach)
require(doMC)
registerDoMC()
```
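Once a backend is registered, `%dopar%` spreads loop iterations across the available cores. A minimal illustration (not part of the analysis):
```{r foreach-sketch, eval=FALSE}
# Illustrative only: each iteration can run on its own core once
# registerDoMC() has been called; .combine = c collects results into a vector.
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}
squares
```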
We'll do 100 runs of elastic net (which takes a long time!). Elastic net estimates coefficients by minimizing a penalized least-squares objective:
$latex \hat{\beta} = \underset{\beta}{\operatorname{argmin}}(\left \| y-X\beta \right \|^2 + \lambda (\alpha \left \| \beta \right \|_2^2 + (1-\alpha) \left \| \beta \right \|_1))$
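As a hedged illustration of what the two parameters do, here is a toy comparison with the **glmnet** package (note that glmnet's convention puts alpha on the L1 term, so `alpha = 1` is the lasso and `alpha = 0` is ridge):
```{r enet-penalty-sketch, eval=FALSE}
# Illustrative only: the L1 penalty drives coefficients exactly to zero,
# while the L2 penalty only shrinks them toward zero.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)
y <- X[, 1] + rnorm(100)
lasso <- glmnet(X, y, alpha = 1, lambda = 0.1)
ridge <- glmnet(X, y, alpha = 0, lambda = 0.1)
sum(coef(lasso) != 0)  # few nonzero coefficients
sum(coef(ridge) != 0)  # all coefficients nonzero, but shrunk
```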
```{r bootstrap, eval=FALSE}
# For Elastic Net, we set two parameters: alpha and lambda
alphas <- unique(createENetTuneGrid()[,1])
lambdas <- createENetTuneGrid(alphas = 1)[,2]
# perform 100 bootstrapping runs
resultsModel <- bootstrapPredictiveModel(filteredFeatureDataScaled,
                                         filteredResponseDataScaled,
                                         model = myEnetModel$new(),
                                         numBootstrap = 100,
                                         alpha = alphas, lambda = lambdas)
```
Select predictors that are consistently highly ranked over the bootstrapping runs.
```{r totally-cheating, echo=FALSE, results='hide', message=FALSE}
e <- loadEntity('syn1437787')
resultsModel <- e$objects$resultsModel
```
```{r rank-predictors}
# Rank predictors within each run, then total the normalized ranks across runs
RankTotals <- c()
for (k in 1:length(resultsModel)) {
  RankTotals <- cbind(RankTotals,
                      rank(abs(as.numeric(resultsModel[[k]][-1])),
                           ties.method = "min") / length(resultsModel[[k]]))
}
rownames(RankTotals) <- rownames(resultsModel[[1]])[-1]  # drop the intercept
# select the top influences
reference <- apply(RankTotals, 1, sum)
X <- sort(reference, decreasing = TRUE, index.return = TRUE)
sigCoefName <- names(reference)[X$ix]
```
Let's look at the distribution of rankings among the predictors.
```{r histogram}
hist(reference, main="Ranking predictors", xlab="predictor rank summed over 100 bootstrapping runs", col="lightgrey")
```
Every analysis in which gene expression plays a part must contain at least one heatmap. This is a very important rule which we don't want to break. So, here it is:
```{r heatmap}
genes <- unique(gsub(pattern="_(expr|copy|mut)$", replacement="", x=sigCoefName[1:100]))
genes <- intersect(genes, rownames(assayData(expr)$exprs))
m <- (assayData(expr)$exprs)[genes,]
heatmap(m, col=topo.colors(100), cexRow=0.4)
```
Conclusion
----------
Knitr is cool!
* mixes formatted text, code and graphics
* outputs HTML or PDF
* produces documents programmatically, which means they can be shared, re-run and versioned
* works with markdown, which I'm hoping will mesh nicely with the wiki features coming to Synapse
* still has some kinks: I had issues with LaTeX equation rendering, and debugging is hard
* long-running analyses should be pre-computed and stored in Synapse
[1]: http://www.broadinstitute.org/software/cprg/?q=node/11 "CCLE"
[2]: http://www.broadinstitute.org/gsea/ "GSEA"
[3]: https://sagebionetworks.jira.com/wiki/display/SYNR/Home
[4]: http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf