callahantiff/rare_disease_geo_analysis.Rmd

## rare_disease_geo_analysis.Rmd
---
title: 'MetaOmic: Transcriptomic Data Analysis'
author: 'Tiffany J. Callahan'
date: 'July 27, 2018'
output:
  html_notebook: default
  html_document: default
---

---

#### Contents:

* [Introduction](#introduction)
* [Accessing GEO Data](#data1)
* [Data Pre-Processing](#preprc)
* [Differential Expression Analysis](#dea)
* [Workflow Reproduciblity](#repo)


---

<a name="introduction"/>

# Introduction
This script is designed to download and process data from [GEO](https://www.ncbi.nlm.nih.gov/geo/). For each downloaded dataset, the raw data is processed and converted into an expression matrix (rows are genes and columns are samples). Additionally, using the [GEO Metadata Database](https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html) specific features for each dataset are collected and kept for downstream processing. This script was developed from and adapts code originally developed as part of the R [`crossmeta`](https://github.com/callahantiff/crossmeta) library.


```{r message=FALSE, warning=FALSE}
# load scripts to process GEO genome wide gene expression data
source("~/Dropbox/GraduateSchool/PhD/LabWork/MetaOmic/transcriptomic/R/crossmeta/R/load_data.R")
```

<a name="data1"/>

### Accessing GEO Data
For this analysis, we will process data for patients with Sickle Cell Disease ([GEO35007](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007)) and Cystic Fibrosis ([GSE71010](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71010)).
```{r warning=FALSE, message= FALSE, eval=TRUE, cache=TRUE}
# download Sickle Cell Disease data (GEO35007)
data_dir_scd <- file.path(getwd(), "GEO_data", "SCD")
scd_gse_names <- c("GSE35007")

# download raw data (uncomment if data has already been downloaded)
# get_raw(scd_gse_names, data_dir_scd)

# reloads if previously run
scd_esets <- load_raw(scd_gse_names, data_dir_scd)
```


```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
# download Cystic Fibrosis data (GSE71010)
data_dir_cf <- file.path(getwd(), "GEO_data", "CF")
cf_gse_names <- c("GSE71010")

# download raw data (uncomment if data has already been downloaded)
# get_raw(cf_gse_names, data_dir_cf)

# reloads if previously called
cf_esets <- load_raw(cf_gse_names, data_dir_cf)
```


<a name="preprc"/>

## Data Pre-Processing
Pre-processing of downloaded GEO data includes several important steps. The specific steps that are applied to each dataset include: removing missing data, filtering probes with expression values below the first quantile, and defining groups for differential expression analysis. In some cases, additional cleaning or organization of the data is needed prior to downstream analysis.
```{r warning=FALSE, eval=TRUE, cache=TRUE}
## Sickle Cell Disease
# view dataset information
scd_exprs_data_clean <- Biobase::exprs(scd_esets[[1]])
scd_genes <- Biobase::fData(scd_esets[[1]])

# write data to disk
write.table(scd_exprs_data_clean, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data.txt',
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# remove missing data
scd_exprs_data_clean <- na.omit(scd_exprs_data_clean)

## Filtering
# summarize average gene expression values
scd_avg <- rowMeans(scd_exprs_data_clean)
scd_sum <- summary(scd_avg)

# filter out genes that "aren't expressed" - will appear red in the plot
scd_cutoff <- scd_sum[2][[1]] # remove genes below first quantile
scd_exprs <- EMA::expFilter(scd_exprs_data_clean, threshold = round(scd_sum[2][[1]], 4))

# filter gene list to match reducecd expression
scd_genes <- scd_genes[scd_genes$PROBE %in% row.names(scd_exprs), ]

# write data to disk
write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

## Prep for DE analysis
scd_p <- Biobase::phenoData(scd_esets[[1]])@data

# fix error in data where data got erroneously shifted
which(scd_p$geo_accession == 'GSM860268')
which(colnames(scd_p) == 'characteristics_ch1')
scd_p[62,11:19] <- scd_p[62,10:18]
scd_p[62,10] <- NA

# DEG comparison group definition
scd_phen_dat = scd_p$`hb phenotype:ch1`

# combine A, AS, and AA groups to be 1 control group
scd_grp <- ifelse(scd_p$`hb phenotype:ch1` == 'A' | scd_p$`hb phenotype:ch1` == 'AA' | scd_p$`hb phenotype:ch1` == 'AS',
                'Control', scd_phen_dat)

# append labels to data frame + sort
scd_p <- cbind(scd_grp, scd_p)

# remove group set to 'NA'
scd_p <- na.omit(scd_p)

# filter gene list to match reducecd expression
scd_exprs <- scd_exprs[,colnames(scd_exprs) %in% scd_p$geo_accession]

# bind columns of gene symbols
scd_exprs <- cbind(scd_genes, scd_exprs)

#set group for DEG analysis
write.table(scd_p, 'GEO_data/SCD/GSE35007/GSE35007_phenodata.txt',
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# print results
scd_groups <- scd_p$scd_grp
table(scd_groups)
```

```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
## Cystic Fibrosis
# view dataset information
cf_exprs_data_clean <- Biobase::exprs(cf_esets[[1]])
cf_genes <- Biobase::fData(cf_esets[[1]])

# write data to disk
write.table(cf_exprs_data_clean, 'GEO_data/CF/GSE71010/GSE71010_exprs_data.txt',
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# remove missing data
cf_exprs_data_clean <- na.omit(cf_exprs_data_clean)

## Filtering
# summarize average gene expression values
cf_avg <- rowMeans(cf_exprs_data_clean)
cf_sum <- summary(cf_avg)

# filter out genes that "aren't expressed" - will appear red in the plot
cf_cutoff <- cf_sum[2][[1]] # remove genes below first quantile
cf_exprs <- EMA::expFilter(cf_exprs_data_clean, threshold = round(cf_sum[2][[1]], 4))

# filter gene list to match reducecd expression
cf_genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_exprs), ]

# write data to disk
write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

## Prep for DE analysis
# DEG comparison group definition
cf_p <- Biobase::phenoData(cf_esets[[1]])@data
cf_grp <- ifelse(cf_p$characteristics_ch1.1 == 'disease: Healthy control', 'Control', cf_p$characteristics_ch1.1)
cf_grp <- ifelse(cf_grp == "1", 'CF', cf_grp)
cf_grp <- ifelse(cf_grp == "3", NA, cf_grp)

# append labels to data frame + sort
cf_p <- cbind(cf_grp, cf_p)

# remove group set to 'NA'
cf_p <- na.omit(cf_p)
cf_exprs <- cf_exprs[, colnames(cf_exprs) %in% cf_p$geo_accession]

# bind columns of gene symbols
cf_exprs <- cbind(cf_genes, cf_exprs)
cf_exprs <- na.omit(cf_exprs)

# write data to disk
write.table(cf_p, 'GEO_data/CF/GSE71010/GSE71010_phenodata.txt',
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# print results
cf_groups <- cf_p$cf_grp
table(cf_groups)
```


<a name="dea"/>

## Differential Expression Analysis
Differential expression analysis is performed on the pre-processed microarray data using limma ([microarray](http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)).
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
## Sickle Cell Disease
# create design matrix
scd_design <- stats::model.matrix(~0+ scd_groups, data.frame(scd_exprs[,6:length(scd_exprs)]))
colnames(scd_design) <- gsub('groups', '', colnames(scd_design))

# create constrast (the one we subtract is control, then we interpret results for cases)
scd_contr.matrix <- limma::makeContrasts(
  SCvsControl = scd_SC-scd_Control,
  SSvsControl = scd_SS-scd_Control,
  SSvsSC = scd_SS-scd_SC,
  levels = colnames(scd_design))
scd_contr.matrix

# fit model
scd_fit <- limma::lmFit(scd_exprs[,6:length(scd_exprs)], scd_design)
scd_fit <- limma::contrasts.fit(scd_fit, contrasts=scd_contr.matrix)
scd_efit <- limma::eBayes(scd_fit)

# get results adjusting for multiple testing using FDR
scd_results = limma::topTable(scd_efit, coef=1, adjust='fdr', number=Inf)
knitr::kable(head(scd_results, n = 10))

# write out file
write.table(scd_results, 'GEO_data/SCD/GSE35007/GSE35007_results_data.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# add gene information to efit
scd_efit$genes <- scd_genes

# print number of up and down regulated genes between the samples
scd_top_dec_test <- limma::decideTests(scd_efit)
knitr::kable(summary(scd_top_dec_test))

## Process Results
limma::write.fit(scd_efit, scd_top_dec_test, adjust='fdr', file='GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt')

# re-read data to do some QC
scd_res = read.table('GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt', header=TRUE)
scd_res <- na.omit(scd_res)

# remove data with 0
scd_res$ss_grp <- ifelse(scd_res$Res.SSvsControl == -1, "HbSS_Down", "HbSS_Up")
scd_res$sc_grp <- ifelse(scd_res$Res.SCvsControl == -1, "HbSC_Down", "HbSC_Up")
scd_res$ss_sc <- ifelse(scd_res$Res.SSvsSC == -1, "HbSS_Down", "HbSS_Up")

# filter genes based on p-value
scd_results <- cbind(row.names(scd_res), scd_res)
names(scd_results)[1] <- 'PROBE'
scd_results_filtered <- scd_results
# scd_results_filtered <- scd_results[scd_results$p.value.adj.SSvsControl<0.001,]

# write out file
write.table(scd_results_filtered, 'GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# adjust exprs set
scd_exprs <- scd_exprs[scd_exprs$PROBE %in% scd_results_filtered$PROBE, ]

# write out file
write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered_sig.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")
```

```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
## Cystic Fibrosis
# create design matrix
cf_design <- stats::model.matrix(~0+ cf_groups, data.frame(cf_exprs[,6:length(cf_exprs)]))
colnames(cf_design) <- gsub('groups', '', colnames(cf_design))

# create constrast (the one we subtract is control, then we interpret results for cases)
cf_contr.matrix <- limma::makeContrasts(
   CFvsControl = cf_CF-cf_Control,
   levels = colnames(cf_design))
cf_contr.matrix

# fit model
cf_fit <- limma::lmFit(cf_exprs[,6:length(cf_exprs)], cf_design)
cf_fit <- limma::contrasts.fit(cf_fit, contrasts=cf_contr.matrix)
cf_efit <- limma::eBayes(cf_fit)

# get results adjusting for multiple testing using FDR
cf_results = limma::topTable(cf_efit, coef=1, adjust='fdr', number=Inf)
knitr::kable(head(cf_results, n = 10))

# write to disk
write.table(cf_results, 'GEO_data/CF/GSE71010/GSE71010_results_data.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# add gene information to efit
cf_efit$genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_results), ]

# print number of up and down regulated genes between the samples
cf_top_dec_test <- limma::decideTests(cf_efit)
knitr::kable(summary(cf_top_dec_test))

## Process Results
limma::write.fit(cf_efit, cf_top_dec_test, adjust='fdr', file='GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', row.names=TRUE)

# re-read data in to cut out na
cf_res = read.table('GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', header=TRUE)
cf_res <- na.omit(cf_res)

# rename columns
names(cf_res)[2] <- 'Coef.CFvsControl'
names(cf_res)[3] <- 't.CFvsControl'
names(cf_res)[4] <- 'p.value.CFvsControl'
names(cf_res)[5] <- 'p.value.adj.CFvsControl'
names(cf_res)[8] <- 'Res.CFvsControl'

# cut out rows that are not significant
# create grouping variable
cf_res$grp <- ifelse(cf_res$Res.CFvsControl == -1, "CF_Down", "CF_Up")

# filter genes based on p-value
cf_results <- cbind(row.names(cf_res), cf_res)
names(cf_results)[1] <- 'PROBE'
cf_results_filtered <- cf_results
# cf_results_filtered <- cf_results[cf_results$p.value.adj.CFvsControl<0.001,]

# write to disk
write.table(cf_results_filtered, 'GEO_data/CF/GSE71010/GSE71010_DEgenes.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")

# adjust exprs set
cf_exprs <- cf_exprs[cf_exprs$PROBE %in% cf_results_filtered$PROBE, ]

# write out file
write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered_sig.txt',
            row.names = TRUE,
            col.names = TRUE,
            quote = FALSE,
            sep="\t")
```


<a name="repo"/>

#### Session Information
```{r eval=TRUE}
sessionInfo()
```
	---
	title: 'MetaOmic: Transcriptomic Data Analysis'
	author: 'Tiffany J. Callahan'
	date: 'July 27, 2018'
	output:
	html_notebook: default
	html_document: default
	---

	---

	#### Contents:

	* [Introduction](#introduction)
	* [Accessing GEO Data](#data1)
	* [Data Pre-Processing](#preprc)
	* [Differential Expression Analysis](#dea)
	* [Workflow Reproduciblity](#repo)



	---

	<a name="introduction"/>

	# Introduction
	This script is designed to download and process data from [GEO](https://www.ncbi.nlm.nih.gov/geo/). For each downloaded dataset, the raw data is processed and converted into an expression matrix (rows are genes and columns are samples). Additionally, using the [GEO Metadata Database](https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html) specific features for each dataset are collected and kept for downstream processing. This script was developed from and adapts code originally developed as part of the R [`crossmeta`](https://github.com/callahantiff/crossmeta) library.


	```{r message=FALSE, warning=FALSE}
	# load scripts to process GEO genome wide gene expression data
	source("~/Dropbox/GraduateSchool/PhD/LabWork/MetaOmic/transcriptomic/R/crossmeta/R/load_data.R")
	```

	<a name="data1"/>

	### Accessing GEO Data
	For this analysis, we will process data for patients with Sickle Cell Disease ([GEO35007](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007)) and Cystic Fibrosis ([GSE71010](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71010)).
	```{r warning=FALSE, message= FALSE, eval=TRUE, cache=TRUE}
	# download Sickle Cell Disease data (GEO35007)
	data_dir_scd <- file.path(getwd(), "GEO_data", "SCD")
	scd_gse_names <- c("GSE35007")

	# download raw data (uncomment if data has already been downloaded)
	# get_raw(scd_gse_names, data_dir_scd)

	# reloads if previously run
	scd_esets <- load_raw(scd_gse_names, data_dir_scd)
	```


	```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
	# download Cystic Fibrosis data (GSE71010)
	data_dir_cf <- file.path(getwd(), "GEO_data", "CF")
	cf_gse_names <- c("GSE71010")

	# download raw data (uncomment if data has already been downloaded)
	# get_raw(cf_gse_names, data_dir_cf)

	# reloads if previously called
	cf_esets <- load_raw(cf_gse_names, data_dir_cf)
	```


	<a name="preprc"/>

	## Data Pre-Processing
	Pre-processing of downloaded GEO data includes several important steps. The specific steps that are applied to each dataset include: removing missing data, filtering probes with expression values below the first quantile, and defining groups for differential expression analysis. In some cases, additional cleaning or organization of the data is needed prior to downstream analysis.
	```{r warning=FALSE, eval=TRUE, cache=TRUE}
	## Sickle Cell Disease
	# view dataset information
	scd_exprs_data_clean <- Biobase::exprs(scd_esets[[1]])
	scd_genes <- Biobase::fData(scd_esets[[1]])

	# write data to disk
	write.table(scd_exprs_data_clean, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data.txt',
	row.names = FALSE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# remove missing data
	scd_exprs_data_clean <- na.omit(scd_exprs_data_clean)

	## Filtering
	# summarize average gene expression values
	scd_avg <- rowMeans(scd_exprs_data_clean)
	scd_sum <- summary(scd_avg)

	# filter out genes that "aren't expressed" - will appear red in the plot
	scd_cutoff <- scd_sum[2][[1]] # remove genes below first quantile
	scd_exprs <- EMA::expFilter(scd_exprs_data_clean, threshold = round(scd_sum[2][[1]], 4))

	# filter gene list to match reducecd expression
	scd_genes <- scd_genes[scd_genes$PROBE %in% row.names(scd_exprs), ]

	# write data to disk
	write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	## Prep for DE analysis
	scd_p <- Biobase::phenoData(scd_esets[[1]])@data

	# fix error in data where data got erroneously shifted
	which(scd_p$geo_accession == 'GSM860268')
	which(colnames(scd_p) == 'characteristics_ch1')
	scd_p[62,11:19] <- scd_p[62,10:18]
	scd_p[62,10] <- NA

	# DEG comparison group definition
	scd_phen_dat = scd_p$`hb phenotype:ch1`

	# combine A, AS, and AA groups to be 1 control group
	scd_grp <- ifelse(scd_p$`hb phenotype:ch1` == 'A' \| scd_p$`hb phenotype:ch1` == 'AA' \| scd_p$`hb phenotype:ch1` == 'AS',
	'Control', scd_phen_dat)

	# append labels to data frame + sort
	scd_p <- cbind(scd_grp, scd_p)

	# remove group set to 'NA'
	scd_p <- na.omit(scd_p)

	# filter gene list to match reducecd expression
	scd_exprs <- scd_exprs[,colnames(scd_exprs) %in% scd_p$geo_accession]

	# bind columns of gene symbols
	scd_exprs <- cbind(scd_genes, scd_exprs)

	#set group for DEG analysis
	write.table(scd_p, 'GEO_data/SCD/GSE35007/GSE35007_phenodata.txt',
	row.names = FALSE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# print results
	scd_groups <- scd_p$scd_grp
	table(scd_groups)
	```

	```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
	## Cystic Fibrosis
	# view dataset information
	cf_exprs_data_clean <- Biobase::exprs(cf_esets[[1]])
	cf_genes <- Biobase::fData(cf_esets[[1]])

	# write data to disk
	write.table(cf_exprs_data_clean, 'GEO_data/CF/GSE71010/GSE71010_exprs_data.txt',
	row.names = FALSE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# remove missing data
	cf_exprs_data_clean <- na.omit(cf_exprs_data_clean)

	## Filtering
	# summarize average gene expression values
	cf_avg <- rowMeans(cf_exprs_data_clean)
	cf_sum <- summary(cf_avg)

	# filter out genes that "aren't expressed" - will appear red in the plot
	cf_cutoff <- cf_sum[2][[1]] # remove genes below first quantile
	cf_exprs <- EMA::expFilter(cf_exprs_data_clean, threshold = round(cf_sum[2][[1]], 4))

	# filter gene list to match reducecd expression
	cf_genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_exprs), ]

	# write data to disk
	write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	## Prep for DE analysis
	# DEG comparison group definition
	cf_p <- Biobase::phenoData(cf_esets[[1]])@data
	cf_grp <- ifelse(cf_p$characteristics_ch1.1 == 'disease: Healthy control', 'Control', cf_p$characteristics_ch1.1)
	cf_grp <- ifelse(cf_grp == "1", 'CF', cf_grp)
	cf_grp <- ifelse(cf_grp == "3", NA, cf_grp)

	# append labels to data frame + sort
	cf_p <- cbind(cf_grp, cf_p)

	# remove group set to 'NA'
	cf_p <- na.omit(cf_p)
	cf_exprs <- cf_exprs[, colnames(cf_exprs) %in% cf_p$geo_accession]

	# bind columns of gene symbols
	cf_exprs <- cbind(cf_genes, cf_exprs)
	cf_exprs <- na.omit(cf_exprs)

	# write data to disk
	write.table(cf_p, 'GEO_data/CF/GSE71010/GSE71010_phenodata.txt',
	row.names = FALSE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# print results
	cf_groups <- cf_p$cf_grp
	table(cf_groups)
	```


	<a name="dea"/>

	## Differential Expression Analysis
	Differential expression analysis is performed on the pre-processed microarray data using limma ([microarray](http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)).
	```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
	## Sickle Cell Disease
	# create design matrix
	scd_design <- stats::model.matrix(~0+ scd_groups, data.frame(scd_exprs[,6:length(scd_exprs)]))
	colnames(scd_design) <- gsub('groups', '', colnames(scd_design))

	# create constrast (the one we subtract is control, then we interpret results for cases)
	scd_contr.matrix <- limma::makeContrasts(
	SCvsControl = scd_SC-scd_Control,
	SSvsControl = scd_SS-scd_Control,
	SSvsSC = scd_SS-scd_SC,
	levels = colnames(scd_design))
	scd_contr.matrix

	# fit model
	scd_fit <- limma::lmFit(scd_exprs[,6:length(scd_exprs)], scd_design)
	scd_fit <- limma::contrasts.fit(scd_fit, contrasts=scd_contr.matrix)
	scd_efit <- limma::eBayes(scd_fit)

	# get results adjusting for multiple testing using FDR
	scd_results = limma::topTable(scd_efit, coef=1, adjust='fdr', number=Inf)
	knitr::kable(head(scd_results, n = 10))

	# write out file
	write.table(scd_results, 'GEO_data/SCD/GSE35007/GSE35007_results_data.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# add gene information to efit
	scd_efit$genes <- scd_genes

	# print number of up and down regulated genes between the samples
	scd_top_dec_test <- limma::decideTests(scd_efit)
	knitr::kable(summary(scd_top_dec_test))

	## Process Results
	limma::write.fit(scd_efit, scd_top_dec_test, adjust='fdr', file='GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt')

	# re-read data to do some QC
	scd_res = read.table('GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt', header=TRUE)
	scd_res <- na.omit(scd_res)

	# remove data with 0
	scd_res$ss_grp <- ifelse(scd_res$Res.SSvsControl == -1, "HbSS_Down", "HbSS_Up")
	scd_res$sc_grp <- ifelse(scd_res$Res.SCvsControl == -1, "HbSC_Down", "HbSC_Up")
	scd_res$ss_sc <- ifelse(scd_res$Res.SSvsSC == -1, "HbSS_Down", "HbSS_Up")

	# filter genes based on p-value
	scd_results <- cbind(row.names(scd_res), scd_res)
	names(scd_results)[1] <- 'PROBE'
	scd_results_filtered <- scd_results
	# scd_results_filtered <- scd_results[scd_results$p.value.adj.SSvsControl<0.001,]

	# write out file
	write.table(scd_results_filtered, 'GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# adjust exprs set
	scd_exprs <- scd_exprs[scd_exprs$PROBE %in% scd_results_filtered$PROBE, ]

	# write out file
	write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered_sig.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")
	```

	```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
	## Cystic Fibrosis
	# create design matrix
	cf_design <- stats::model.matrix(~0+ cf_groups, data.frame(cf_exprs[,6:length(cf_exprs)]))
	colnames(cf_design) <- gsub('groups', '', colnames(cf_design))

	# create constrast (the one we subtract is control, then we interpret results for cases)
	cf_contr.matrix <- limma::makeContrasts(
	CFvsControl = cf_CF-cf_Control,
	levels = colnames(cf_design))
	cf_contr.matrix

	# fit model
	cf_fit <- limma::lmFit(cf_exprs[,6:length(cf_exprs)], cf_design)
	cf_fit <- limma::contrasts.fit(cf_fit, contrasts=cf_contr.matrix)
	cf_efit <- limma::eBayes(cf_fit)

	# get results adjusting for multiple testing using FDR
	cf_results = limma::topTable(cf_efit, coef=1, adjust='fdr', number=Inf)
	knitr::kable(head(cf_results, n = 10))

	# write to disk
	write.table(cf_results, 'GEO_data/CF/GSE71010/GSE71010_results_data.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# add gene information to efit
	cf_efit$genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_results), ]

	# print number of up and down regulated genes between the samples
	cf_top_dec_test <- limma::decideTests(cf_efit)
	knitr::kable(summary(cf_top_dec_test))

	## Process Results
	limma::write.fit(cf_efit, cf_top_dec_test, adjust='fdr', file='GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', row.names=TRUE)

	# re-read data in to cut out na
	cf_res = read.table('GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', header=TRUE)
	cf_res <- na.omit(cf_res)

	# rename columns
	names(cf_res)[2] <- 'Coef.CFvsControl'
	names(cf_res)[3] <- 't.CFvsControl'
	names(cf_res)[4] <- 'p.value.CFvsControl'
	names(cf_res)[5] <- 'p.value.adj.CFvsControl'
	names(cf_res)[8] <- 'Res.CFvsControl'

	# cut out rows that are not significant
	# create grouping variable
	cf_res$grp <- ifelse(cf_res$Res.CFvsControl == -1, "CF_Down", "CF_Up")

	# filter genes based on p-value
	cf_results <- cbind(row.names(cf_res), cf_res)
	names(cf_results)[1] <- 'PROBE'
	cf_results_filtered <- cf_results
	# cf_results_filtered <- cf_results[cf_results$p.value.adj.CFvsControl<0.001,]

	# write to disk
	write.table(cf_results_filtered, 'GEO_data/CF/GSE71010/GSE71010_DEgenes.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")

	# adjust exprs set
	cf_exprs <- cf_exprs[cf_exprs$PROBE %in% cf_results_filtered$PROBE, ]

	# write out file
	write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered_sig.txt',
	row.names = TRUE,
	col.names = TRUE,
	quote = FALSE,
	sep="\t")
	```


	<a name="repo"/>

	#### Session Information
	```{r eval=TRUE}
	sessionInfo()
	```