Last active
July 28, 2018 07:03
-
-
Save callahantiff/04b4d2a653e806b974be4efed41e484a to your computer and use it in GitHub Desktop.
Rare Disease GEO Analysis - R Notebook
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: 'MetaOmic: Transcriptomic Data Analysis' | |
author: 'Tiffany J. Callahan' | |
date: 'July 27, 2018' | |
output: | |
html_notebook: default | |
html_document: default | |
--- | |
--- | |
#### Contents: | |
* [Introduction](#introduction) | |
* [Accessing GEO Data](#data1) | |
* [Data Pre-Processing](#preprc) | |
* [Differential Expression Analysis](#dea) | |
* [Workflow Reproduciblity](#repo) | |
--- | |
<a name="introduction"/> | |
# Introduction | |
This script is designed to download and process data from [GEO](https://www.ncbi.nlm.nih.gov/geo/). For each downloaded dataset, the raw data is processed and converted into an expression matrix (rows are genes and columns are samples). Additionally, using the [GEO Metadata Database](https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html) specific features for each dataset are collected and kept for downstream processing. This script was developed from and adapts code originally developed as part of the R [`crossmeta`](https://github.com/callahantiff/crossmeta) library. | |
```{r message=FALSE, warning=FALSE} | |
# load scripts to process GEO genome wide gene expression data | |
source("~/Dropbox/GraduateSchool/PhD/LabWork/MetaOmic/transcriptomic/R/crossmeta/R/load_data.R") | |
``` | |
<a name="data1"/> | |
### Accessing GEO Data | |
For this analysis, we will process data for patients with Sickle Cell Disease ([GEO35007](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007)) and Cystic Fibrosis ([GSE71010](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71010)). | |
```{r warning=FALSE, message= FALSE, eval=TRUE, cache=TRUE} | |
# download Sickle Cell Disease data (GEO35007) | |
data_dir_scd <- file.path(getwd(), "GEO_data", "SCD") | |
scd_gse_names <- c("GSE35007") | |
# download raw data (uncomment if data has already been downloaded) | |
# get_raw(scd_gse_names, data_dir_scd) | |
# reloads if previously run | |
scd_esets <- load_raw(scd_gse_names, data_dir_scd) | |
``` | |
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE} | |
# download Cystic Fibrosis data (GSE71010) | |
data_dir_cf <- file.path(getwd(), "GEO_data", "CF") | |
cf_gse_names <- c("GSE71010") | |
# download raw data (uncomment if data has already been downloaded) | |
# get_raw(cf_gse_names, data_dir_cf) | |
# reloads if previously called | |
cf_esets <- load_raw(cf_gse_names, data_dir_cf) | |
``` | |
<a name="preprc"/> | |
## Data Pre-Processing | |
Pre-processing of downloaded GEO data includes several important steps. The specific steps that are applied to each dataset include: removing missing data, filtering probes with expression values below the first quantile, and defining groups for differential expression analysis. In some cases, additional cleaning or organization of the data is needed prior to downstream analysis. | |
```{r warning=FALSE, eval=TRUE, cache=TRUE} | |
## Sickle Cell Disease | |
# view dataset information | |
scd_exprs_data_clean <- Biobase::exprs(scd_esets[[1]]) | |
scd_genes <- Biobase::fData(scd_esets[[1]]) | |
# write data to disk | |
write.table(scd_exprs_data_clean, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data.txt', | |
row.names = FALSE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# remove missing data | |
scd_exprs_data_clean <- na.omit(scd_exprs_data_clean) | |
## Filtering | |
# summarize average gene expression values | |
scd_avg <- rowMeans(scd_exprs_data_clean) | |
scd_sum <- summary(scd_avg) | |
# filter out genes that "aren't expressed" - will appear red in the plot | |
scd_cutoff <- scd_sum[2][[1]] # remove genes below first quantile | |
scd_exprs <- EMA::expFilter(scd_exprs_data_clean, threshold = round(scd_sum[2][[1]], 4)) | |
# filter gene list to match reducecd expression | |
scd_genes <- scd_genes[scd_genes$PROBE %in% row.names(scd_exprs), ] | |
# write data to disk | |
write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
## Prep for DE analysis | |
scd_p <- Biobase::phenoData(scd_esets[[1]])@data | |
# fix error in data where data got erroneously shifted | |
which(scd_p$geo_accession == 'GSM860268') | |
which(colnames(scd_p) == 'characteristics_ch1') | |
scd_p[62,11:19] <- scd_p[62,10:18] | |
scd_p[62,10] <- NA | |
# DEG comparison group definition | |
scd_phen_dat = scd_p$`hb phenotype:ch1` | |
# combine A, AS, and AA groups to be 1 control group | |
scd_grp <- ifelse(scd_p$`hb phenotype:ch1` == 'A' | scd_p$`hb phenotype:ch1` == 'AA' | scd_p$`hb phenotype:ch1` == 'AS', | |
'Control', scd_phen_dat) | |
# append labels to data frame + sort | |
scd_p <- cbind(scd_grp, scd_p) | |
# remove group set to 'NA' | |
scd_p <- na.omit(scd_p) | |
# filter gene list to match reducecd expression | |
scd_exprs <- scd_exprs[,colnames(scd_exprs) %in% scd_p$geo_accession] | |
# bind columns of gene symbols | |
scd_exprs <- cbind(scd_genes, scd_exprs) | |
#set group for DEG analysis | |
write.table(scd_p, 'GEO_data/SCD/GSE35007/GSE35007_phenodata.txt', | |
row.names = FALSE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# print results | |
scd_groups <- scd_p$scd_grp | |
table(scd_groups) | |
``` | |
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE} | |
## Cystic Fibrosis | |
# view dataset information | |
cf_exprs_data_clean <- Biobase::exprs(cf_esets[[1]]) | |
cf_genes <- Biobase::fData(cf_esets[[1]]) | |
# write data to disk | |
write.table(cf_exprs_data_clean, 'GEO_data/CF/GSE71010/GSE71010_exprs_data.txt', | |
row.names = FALSE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# remove missing data | |
cf_exprs_data_clean <- na.omit(cf_exprs_data_clean) | |
## Filtering | |
# summarize average gene expression values | |
cf_avg <- rowMeans(cf_exprs_data_clean) | |
cf_sum <- summary(cf_avg) | |
# filter out genes that "aren't expressed" - will appear red in the plot | |
cf_cutoff <- cf_sum[2][[1]] # remove genes below first quantile | |
cf_exprs <- EMA::expFilter(cf_exprs_data_clean, threshold = round(cf_sum[2][[1]], 4)) | |
# filter gene list to match reducecd expression | |
cf_genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_exprs), ] | |
# write data to disk | |
write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
## Prep for DE analysis | |
# DEG comparison group definition | |
cf_p <- Biobase::phenoData(cf_esets[[1]])@data | |
cf_grp <- ifelse(cf_p$characteristics_ch1.1 == 'disease: Healthy control', 'Control', cf_p$characteristics_ch1.1) | |
cf_grp <- ifelse(cf_grp == "1", 'CF', cf_grp) | |
cf_grp <- ifelse(cf_grp == "3", NA, cf_grp) | |
# append labels to data frame + sort | |
cf_p <- cbind(cf_grp, cf_p) | |
# remove group set to 'NA' | |
cf_p <- na.omit(cf_p) | |
cf_exprs <- cf_exprs[, colnames(cf_exprs) %in% cf_p$geo_accession] | |
# bind columns of gene symbols | |
cf_exprs <- cbind(cf_genes, cf_exprs) | |
cf_exprs <- na.omit(cf_exprs) | |
# write data to disk | |
write.table(cf_p, 'GEO_data/CF/GSE71010/GSE71010_phenodata.txt', | |
row.names = FALSE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# print results | |
cf_groups <- cf_p$cf_grp | |
table(cf_groups) | |
``` | |
<a name="dea"/> | |
## Differential Expression Analysis | |
Differential expression analysis is performed on the pre-processed microarray data using limma ([microarray](http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)). | |
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE} | |
## Sickle Cell Disease | |
# create design matrix | |
scd_design <- stats::model.matrix(~0+ scd_groups, data.frame(scd_exprs[,6:length(scd_exprs)])) | |
colnames(scd_design) <- gsub('groups', '', colnames(scd_design)) | |
# create constrast (the one we subtract is control, then we interpret results for cases) | |
scd_contr.matrix <- limma::makeContrasts( | |
SCvsControl = scd_SC-scd_Control, | |
SSvsControl = scd_SS-scd_Control, | |
SSvsSC = scd_SS-scd_SC, | |
levels = colnames(scd_design)) | |
scd_contr.matrix | |
# fit model | |
scd_fit <- limma::lmFit(scd_exprs[,6:length(scd_exprs)], scd_design) | |
scd_fit <- limma::contrasts.fit(scd_fit, contrasts=scd_contr.matrix) | |
scd_efit <- limma::eBayes(scd_fit) | |
# get results adjusting for multiple testing using FDR | |
scd_results = limma::topTable(scd_efit, coef=1, adjust='fdr', number=Inf) | |
knitr::kable(head(scd_results, n = 10)) | |
# write out file | |
write.table(scd_results, 'GEO_data/SCD/GSE35007/GSE35007_results_data.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# add gene information to efit | |
scd_efit$genes <- scd_genes | |
# print number of up and down regulated genes between the samples | |
scd_top_dec_test <- limma::decideTests(scd_efit) | |
knitr::kable(summary(scd_top_dec_test)) | |
## Process Results | |
limma::write.fit(scd_efit, scd_top_dec_test, adjust='fdr', file='GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt') | |
# re-read data to do some QC | |
scd_res = read.table('GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt', header=TRUE) | |
scd_res <- na.omit(scd_res) | |
# remove data with 0 | |
scd_res$ss_grp <- ifelse(scd_res$Res.SSvsControl == -1, "HbSS_Down", "HbSS_Up") | |
scd_res$sc_grp <- ifelse(scd_res$Res.SCvsControl == -1, "HbSC_Down", "HbSC_Up") | |
scd_res$ss_sc <- ifelse(scd_res$Res.SSvsSC == -1, "HbSS_Down", "HbSS_Up") | |
# filter genes based on p-value | |
scd_results <- cbind(row.names(scd_res), scd_res) | |
names(scd_results)[1] <- 'PROBE' | |
scd_results_filtered <- scd_results | |
# scd_results_filtered <- scd_results[scd_results$p.value.adj.SSvsControl<0.001,] | |
# write out file | |
write.table(scd_results_filtered, 'GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# adjust exprs set | |
scd_exprs <- scd_exprs[scd_exprs$PROBE %in% scd_results_filtered$PROBE, ] | |
# write out file | |
write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered_sig.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
``` | |
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE} | |
## Cystic Fibrosis | |
# create design matrix | |
cf_design <- stats::model.matrix(~0+ cf_groups, data.frame(cf_exprs[,6:length(cf_exprs)])) | |
colnames(cf_design) <- gsub('groups', '', colnames(cf_design)) | |
# create constrast (the one we subtract is control, then we interpret results for cases) | |
cf_contr.matrix <- limma::makeContrasts( | |
CFvsControl = cf_CF-cf_Control, | |
levels = colnames(cf_design)) | |
cf_contr.matrix | |
# fit model | |
cf_fit <- limma::lmFit(cf_exprs[,6:length(cf_exprs)], cf_design) | |
cf_fit <- limma::contrasts.fit(cf_fit, contrasts=cf_contr.matrix) | |
cf_efit <- limma::eBayes(cf_fit) | |
# get results adjusting for multiple testing using FDR | |
cf_results = limma::topTable(cf_efit, coef=1, adjust='fdr', number=Inf) | |
knitr::kable(head(cf_results, n = 10)) | |
# write to disk | |
write.table(cf_results, 'GEO_data/CF/GSE71010/GSE71010_results_data.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# add gene information to efit | |
cf_efit$genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_results), ] | |
# print number of up and down regulated genes between the samples | |
cf_top_dec_test <- limma::decideTests(cf_efit) | |
knitr::kable(summary(cf_top_dec_test)) | |
## Process Results | |
limma::write.fit(cf_efit, cf_top_dec_test, adjust='fdr', file='GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', row.names=TRUE) | |
# re-read data in to cut out na | |
cf_res = read.table('GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', header=TRUE) | |
cf_res <- na.omit(cf_res) | |
# rename columns | |
names(cf_res)[2] <- 'Coef.CFvsControl' | |
names(cf_res)[3] <- 't.CFvsControl' | |
names(cf_res)[4] <- 'p.value.CFvsControl' | |
names(cf_res)[5] <- 'p.value.adj.CFvsControl' | |
names(cf_res)[8] <- 'Res.CFvsControl' | |
# cut out rows that are not significant | |
# create grouping variable | |
cf_res$grp <- ifelse(cf_res$Res.CFvsControl == -1, "CF_Down", "CF_Up") | |
# filter genes based on p-value | |
cf_results <- cbind(row.names(cf_res), cf_res) | |
names(cf_results)[1] <- 'PROBE' | |
cf_results_filtered <- cf_results | |
# cf_results_filtered <- cf_results[cf_results$p.value.adj.CFvsControl<0.001,] | |
# write to disk | |
write.table(cf_results_filtered, 'GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
# adjust exprs set | |
cf_exprs <- cf_exprs[cf_exprs$PROBE %in% cf_results_filtered$PROBE, ] | |
# write out file | |
write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered_sig.txt', | |
row.names = TRUE, | |
col.names = TRUE, | |
quote = FALSE, | |
sep="\t") | |
``` | |
<a name="repo"/> | |
#### Session Information | |
```{r eval=TRUE} | |
sessionInfo() | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment