Skip to content

Instantly share code, notes, and snippets.

@jvwong
Last active November 11, 2016 01:06
Show Gist options
  • Save jvwong/32c23ac64138c59b1a150987b023d57d to your computer and use it in GitHub Desktop.
Save jvwong/32c23ac64138c59b1a150987b023d57d to your computer and use it in GitHub Desktop.
Pathway Commons Guide: Workflow -- Process data

This code is referenced in the pathway enrichment workflow step Process Data.

Use the edgeR R/Bioconductor package to analyze RNA sequencing data for high-grade serous ovarian cancer (HGS-OvCa) from The Cancer Genome Atlas (TCGA) for differentially expressed genes.

Software requirements

  • R: version 3.3.1
    • Bioconductor: version 3.3
      • edgeR: version 3.16.0

Details

  • process_data.R
    • Modify BASE_DIR variable to point to the directory where you wish to download files
    • You will need the file TCGAOV_data.rda that defines a DGEList named TCGAOV_data that contains the HGS-OvCa RNA sequencing data and sample subtype assignments. This was generated in the previous step Get Data
    • The resulting ranked gene list MesenchymalvsImmunoreactive_edger_ranks.rnk will be saved to file as a compressed tab-delimited file.
rm(list=ls(all=TRUE))
### ============ 0. Installation and setup =========
### ============ Install packages from Bioconductor ========
# source("https://bioconductor.org/biocLite.R")
# biocLite(c("edgeR"))
library("edgeR")
### ============ Declare directories =========
BASE_DIR = "/Documents/data/TCGA/" # Change to suit your directory
TCGAOV_data_FILE = file.path(BASE_DIR, "TCGAOV_data.rda")
### ============ Declare samples =========
CATEGORY_BASELINE = "Immunoreactive"
CATEGORY_TEST = "Mesenchymal"
TCGAOV_ranks_FILE = file.path(BASE_DIR,
"MesenchymalvsImmunoreactive_edger_ranks.rnk")
load(TCGAOV_data_FILE)
### ============ 1. Filter ===============
index_BASELINE = TCGAOV_data$samples$group == CATEGORY_BASELINE
index_TEST = TCGAOV_data$samples$group == CATEGORY_TEST
index_BASELINE_TEST = which(index_BASELINE | index_TEST)
N_BASELINE = sum(index_BASELINE)
N_TEST = sum(index_TEST)
TCGAOV_data_BT = TCGAOV_data[,index_BASELINE_TEST]
row_with_mincount = rowSums(cpm(TCGAOV_data_BT) > 10) >= min(N_BASELINE,
N_TEST)
TCGAOV_filtered = TCGAOV_data_BT[row_with_mincount, , keep.lib.sizes=FALSE]
### ============ 2. Normalize ===============
TCGAOV_normalized_TMM = calcNormFactors(TCGAOV_filtered, method="TMM")
### ============ 3. Fit ===============
TCGAOV_fitted_commondisp = estimateCommonDisp(TCGAOV_normalized_TMM)
TCGAOV_fitted_tagwise = estimateTagwiseDisp(TCGAOV_fitted_commondisp)
### ============ 4. Test ===============
TCGAOV_DE = exactTest(TCGAOV_fitted_tagwise,
pair=c(CATEGORY_BASELINE, CATEGORY_TEST))
### ============ 5. Adjust ===============
TCGAOV_TT = topTags(TCGAOV_DE,
n=nrow(TCGAOV_filtered),
adjust.method="BH",
sort.by="PValue")
### ============ 6. Rank ===============
TCGAOV_ranks <- sign(TCGAOV_TT$table$logFC) * (-1) * log10(TCGAOV_TT$table$PValue)
genenames <- TCGAOV_TT$table$external_gene_name
TCGAOV_ranks <- data.frame(gene=genenames, rank=TCGAOV_ranks)
TCGAOV_ordered_ranks <- TCGAOV_ranks[order(TCGAOV_ranks[,2], decreasing = TRUE),]
## Save to file as tab-delimited text
write.table(TCGAOV_ordered_ranks,
TCGAOV_ranks_FILE,
col.name=TRUE,
sep="\t",
row.names=FALSE,
quote=FALSE)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment