Skip to content

Instantly share code, notes, and snippets.

@callahantiff
Last active July 28, 2018 07:03
Show Gist options
  • Save callahantiff/04b4d2a653e806b974be4efed41e484a to your computer and use it in GitHub Desktop.
Save callahantiff/04b4d2a653e806b974be4efed41e484a to your computer and use it in GitHub Desktop.
Rare Disease GEO Analysis - R Notebook
---
title: 'MetaOmic: Transcriptomic Data Analysis'
author: 'Tiffany J. Callahan'
date: 'July 27, 2018'
output:
html_notebook: default
html_document: default
---
---
#### Contents:
* [Introduction](#introduction)
* [Accessing GEO Data](#data1)
* [Data Pre-Processing](#preprc)
* [Differential Expression Analysis](#dea)
* [Workflow Reproduciblity](#repo)
---
<a name="introduction"/>
# Introduction
This script is designed to download and process data from [GEO](https://www.ncbi.nlm.nih.gov/geo/). For each downloaded dataset, the raw data is processed and converted into an expression matrix (rows are genes and columns are samples). Additionally, using the [GEO Metadata Database](https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html) specific features for each dataset are collected and kept for downstream processing. This script was developed from and adapts code originally developed as part of the R [`crossmeta`](https://github.com/callahantiff/crossmeta) library.
```{r message=FALSE, warning=FALSE}
# load scripts to process GEO genome wide gene expression data
source("~/Dropbox/GraduateSchool/PhD/LabWork/MetaOmic/transcriptomic/R/crossmeta/R/load_data.R")
```
<a name="data1"/>
### Accessing GEO Data
For this analysis, we will process data for patients with Sickle Cell Disease ([GEO35007](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007)) and Cystic Fibrosis ([GSE71010](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71010)).
```{r warning=FALSE, message= FALSE, eval=TRUE, cache=TRUE}
# download Sickle Cell Disease data (GEO35007)
data_dir_scd <- file.path(getwd(), "GEO_data", "SCD")
scd_gse_names <- c("GSE35007")
# download raw data (uncomment if data has already been downloaded)
# get_raw(scd_gse_names, data_dir_scd)
# reloads if previously run
scd_esets <- load_raw(scd_gse_names, data_dir_scd)
```
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
# download Cystic Fibrosis data (GSE71010)
data_dir_cf <- file.path(getwd(), "GEO_data", "CF")
cf_gse_names <- c("GSE71010")
# download raw data (uncomment if data has already been downloaded)
# get_raw(cf_gse_names, data_dir_cf)
# reloads if previously called
cf_esets <- load_raw(cf_gse_names, data_dir_cf)
```
<a name="preprc"/>
## Data Pre-Processing
Pre-processing of downloaded GEO data includes several important steps. The specific steps that are applied to each dataset include: removing missing data, filtering probes with expression values below the first quantile, and defining groups for differential expression analysis. In some cases, additional cleaning or organization of the data is needed prior to downstream analysis.
```{r warning=FALSE, eval=TRUE, cache=TRUE}
## Sickle Cell Disease
# view dataset information
scd_exprs_data_clean <- Biobase::exprs(scd_esets[[1]])
scd_genes <- Biobase::fData(scd_esets[[1]])
# write data to disk
write.table(scd_exprs_data_clean, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data.txt',
row.names = FALSE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# remove missing data
scd_exprs_data_clean <- na.omit(scd_exprs_data_clean)
## Filtering
# summarize average gene expression values
scd_avg <- rowMeans(scd_exprs_data_clean)
scd_sum <- summary(scd_avg)
# filter out genes that "aren't expressed" - will appear red in the plot
scd_cutoff <- scd_sum[2][[1]] # remove genes below first quantile
scd_exprs <- EMA::expFilter(scd_exprs_data_clean, threshold = round(scd_sum[2][[1]], 4))
# filter gene list to match reducecd expression
scd_genes <- scd_genes[scd_genes$PROBE %in% row.names(scd_exprs), ]
# write data to disk
write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
## Prep for DE analysis
scd_p <- Biobase::phenoData(scd_esets[[1]])@data
# fix error in data where data got erroneously shifted
which(scd_p$geo_accession == 'GSM860268')
which(colnames(scd_p) == 'characteristics_ch1')
scd_p[62,11:19] <- scd_p[62,10:18]
scd_p[62,10] <- NA
# DEG comparison group definition
scd_phen_dat = scd_p$`hb phenotype:ch1`
# combine A, AS, and AA groups to be 1 control group
scd_grp <- ifelse(scd_p$`hb phenotype:ch1` == 'A' | scd_p$`hb phenotype:ch1` == 'AA' | scd_p$`hb phenotype:ch1` == 'AS',
'Control', scd_phen_dat)
# append labels to data frame + sort
scd_p <- cbind(scd_grp, scd_p)
# remove group set to 'NA'
scd_p <- na.omit(scd_p)
# filter gene list to match reducecd expression
scd_exprs <- scd_exprs[,colnames(scd_exprs) %in% scd_p$geo_accession]
# bind columns of gene symbols
scd_exprs <- cbind(scd_genes, scd_exprs)
#set group for DEG analysis
write.table(scd_p, 'GEO_data/SCD/GSE35007/GSE35007_phenodata.txt',
row.names = FALSE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# print results
scd_groups <- scd_p$scd_grp
table(scd_groups)
```
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
## Cystic Fibrosis
# view dataset information
cf_exprs_data_clean <- Biobase::exprs(cf_esets[[1]])
cf_genes <- Biobase::fData(cf_esets[[1]])
# write data to disk
write.table(cf_exprs_data_clean, 'GEO_data/CF/GSE71010/GSE71010_exprs_data.txt',
row.names = FALSE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# remove missing data
cf_exprs_data_clean <- na.omit(cf_exprs_data_clean)
## Filtering
# summarize average gene expression values
cf_avg <- rowMeans(cf_exprs_data_clean)
cf_sum <- summary(cf_avg)
# filter out genes that "aren't expressed" - will appear red in the plot
cf_cutoff <- cf_sum[2][[1]] # remove genes below first quantile
cf_exprs <- EMA::expFilter(cf_exprs_data_clean, threshold = round(cf_sum[2][[1]], 4))
# filter gene list to match reducecd expression
cf_genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_exprs), ]
# write data to disk
write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
## Prep for DE analysis
# DEG comparison group definition
cf_p <- Biobase::phenoData(cf_esets[[1]])@data
cf_grp <- ifelse(cf_p$characteristics_ch1.1 == 'disease: Healthy control', 'Control', cf_p$characteristics_ch1.1)
cf_grp <- ifelse(cf_grp == "1", 'CF', cf_grp)
cf_grp <- ifelse(cf_grp == "3", NA, cf_grp)
# append labels to data frame + sort
cf_p <- cbind(cf_grp, cf_p)
# remove group set to 'NA'
cf_p <- na.omit(cf_p)
cf_exprs <- cf_exprs[, colnames(cf_exprs) %in% cf_p$geo_accession]
# bind columns of gene symbols
cf_exprs <- cbind(cf_genes, cf_exprs)
cf_exprs <- na.omit(cf_exprs)
# write data to disk
write.table(cf_p, 'GEO_data/CF/GSE71010/GSE71010_phenodata.txt',
row.names = FALSE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# print results
cf_groups <- cf_p$cf_grp
table(cf_groups)
```
<a name="dea"/>
## Differential Expression Analysis
Differential expression analysis is performed on the pre-processed microarray data using limma ([microarray](http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)).
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
## Sickle Cell Disease
# create design matrix
scd_design <- stats::model.matrix(~0+ scd_groups, data.frame(scd_exprs[,6:length(scd_exprs)]))
colnames(scd_design) <- gsub('groups', '', colnames(scd_design))
# create constrast (the one we subtract is control, then we interpret results for cases)
scd_contr.matrix <- limma::makeContrasts(
SCvsControl = scd_SC-scd_Control,
SSvsControl = scd_SS-scd_Control,
SSvsSC = scd_SS-scd_SC,
levels = colnames(scd_design))
scd_contr.matrix
# fit model
scd_fit <- limma::lmFit(scd_exprs[,6:length(scd_exprs)], scd_design)
scd_fit <- limma::contrasts.fit(scd_fit, contrasts=scd_contr.matrix)
scd_efit <- limma::eBayes(scd_fit)
# get results adjusting for multiple testing using FDR
scd_results = limma::topTable(scd_efit, coef=1, adjust='fdr', number=Inf)
knitr::kable(head(scd_results, n = 10))
# write out file
write.table(scd_results, 'GEO_data/SCD/GSE35007/GSE35007_results_data.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# add gene information to efit
scd_efit$genes <- scd_genes
# print number of up and down regulated genes between the samples
scd_top_dec_test <- limma::decideTests(scd_efit)
knitr::kable(summary(scd_top_dec_test))
## Process Results
limma::write.fit(scd_efit, scd_top_dec_test, adjust='fdr', file='GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt')
# re-read data to do some QC
scd_res = read.table('GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt', header=TRUE)
scd_res <- na.omit(scd_res)
# remove data with 0
scd_res$ss_grp <- ifelse(scd_res$Res.SSvsControl == -1, "HbSS_Down", "HbSS_Up")
scd_res$sc_grp <- ifelse(scd_res$Res.SCvsControl == -1, "HbSC_Down", "HbSC_Up")
scd_res$ss_sc <- ifelse(scd_res$Res.SSvsSC == -1, "HbSS_Down", "HbSS_Up")
# filter genes based on p-value
scd_results <- cbind(row.names(scd_res), scd_res)
names(scd_results)[1] <- 'PROBE'
scd_results_filtered <- scd_results
# scd_results_filtered <- scd_results[scd_results$p.value.adj.SSvsControl<0.001,]
# write out file
write.table(scd_results_filtered, 'GEO_data/SCD/GSE35007/GSE35007_DEgenes.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# adjust exprs set
scd_exprs <- scd_exprs[scd_exprs$PROBE %in% scd_results_filtered$PROBE, ]
# write out file
write.table(scd_exprs, 'GEO_data/SCD/GSE35007/GSE35007_exprs_data_noNA_filtered_sig.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
```
```{r cache=TRUE, echo=TRUE, error = FALSE, vmessage=FALSE, warning=FALSE, eval=TRUE}
## Cystic Fibrosis
# create design matrix
cf_design <- stats::model.matrix(~0+ cf_groups, data.frame(cf_exprs[,6:length(cf_exprs)]))
colnames(cf_design) <- gsub('groups', '', colnames(cf_design))
# create constrast (the one we subtract is control, then we interpret results for cases)
cf_contr.matrix <- limma::makeContrasts(
CFvsControl = cf_CF-cf_Control,
levels = colnames(cf_design))
cf_contr.matrix
# fit model
cf_fit <- limma::lmFit(cf_exprs[,6:length(cf_exprs)], cf_design)
cf_fit <- limma::contrasts.fit(cf_fit, contrasts=cf_contr.matrix)
cf_efit <- limma::eBayes(cf_fit)
# get results adjusting for multiple testing using FDR
cf_results = limma::topTable(cf_efit, coef=1, adjust='fdr', number=Inf)
knitr::kable(head(cf_results, n = 10))
# write to disk
write.table(cf_results, 'GEO_data/CF/GSE71010/GSE71010_results_data.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# add gene information to efit
cf_efit$genes <- cf_genes[cf_genes$PROBE %in% row.names(cf_results), ]
# print number of up and down regulated genes between the samples
cf_top_dec_test <- limma::decideTests(cf_efit)
knitr::kable(summary(cf_top_dec_test))
## Process Results
limma::write.fit(cf_efit, cf_top_dec_test, adjust='fdr', file='GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', row.names=TRUE)
# re-read data in to cut out na
cf_res = read.table('GEO_data/CF/GSE71010/GSE71010_DEgenes.txt', header=TRUE)
cf_res <- na.omit(cf_res)
# rename columns
names(cf_res)[2] <- 'Coef.CFvsControl'
names(cf_res)[3] <- 't.CFvsControl'
names(cf_res)[4] <- 'p.value.CFvsControl'
names(cf_res)[5] <- 'p.value.adj.CFvsControl'
names(cf_res)[8] <- 'Res.CFvsControl'
# cut out rows that are not significant
# create grouping variable
cf_res$grp <- ifelse(cf_res$Res.CFvsControl == -1, "CF_Down", "CF_Up")
# filter genes based on p-value
cf_results <- cbind(row.names(cf_res), cf_res)
names(cf_results)[1] <- 'PROBE'
cf_results_filtered <- cf_results
# cf_results_filtered <- cf_results[cf_results$p.value.adj.CFvsControl<0.001,]
# write to disk
write.table(cf_results_filtered, 'GEO_data/CF/GSE71010/GSE71010_DEgenes.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
# adjust exprs set
cf_exprs <- cf_exprs[cf_exprs$PROBE %in% cf_results_filtered$PROBE, ]
# write out file
write.table(cf_exprs, 'GEO_data/CF/GSE71010/GSE71010_exprs_data_noNA_filtered_sig.txt',
row.names = TRUE,
col.names = TRUE,
quote = FALSE,
sep="\t")
```
<a name="repo"/>
#### Session Information
```{r eval=TRUE}
sessionInfo()
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment