Skip to content

Instantly share code, notes, and snippets.

@jdblischak
Last active September 6, 2022 15:34
Show Gist options
  • Star 15 You must be signed in to star a gist
  • Fork 12 You must be signed in to fork a gist
  • Save jdblischak/fdb1745612927252a7633751e5e60bcb to your computer and use it in GitHub Desktop.
Save jdblischak/fdb1745612927252a7633751e5e60bcb to your computer and use it in GitHub Desktop.
RNA-seq analysis with R/Bioconductor
# HGEN 473 - Genomics
# Spring 2017
# Tuesday, May 9 & Thursday, May 11
#
# RNA-seq analysis with R/Bioconductor
#
# John Blischak
#
# Last updated: 2020-04-08
# Introduction -------------------------------------------------------
# The goal of this tutorial is to introduce you to the analysis of
# RNA-seq data using some of the powerful, open source software
# packages provides by R, and specifically the Bioconductor project.
#
# Before the lecture, please make sure you have R and RStudio
# installed, and also run the code in the sections "Installation" and
# "Download data" below.
#
# The code below is adapted from the paper "RNA-seq analysis is easy
# as 1-2-3 with limma, Glimma and edgeR" by Charity et al., 2017. The
# original paper and this derivative work are freely available under
# the CC-BY license.
#
# Publication: https://f1000research.com/articles/5-1408/v2
#
# Source code: https://bioconductor.org/packages/release/workflows/html/RNAseq123.html
#
# CC-BY: https://creativecommons.org/licenses/by/4.0/
#
# Full citation:
#
# Law CW, Alhamdoosh M, Su S et al. RNA-seq analysis is easy as 1-2-3
# with limma, Glimma and edgeR [version 2; referees: 3 approved].
# F1000Research 2016, 5:1408 (doi: 10.12688/f1000research.9005.2)
# Installation -------------------------------------------------------
# To install the Bioconductor and CRAN packages used in this tutorial, run the
# lines below to install the RNAseq123 workflow package. If it asks if you would
# like to install the packages in a personal directory, confirm yes (Y).
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("RNAseq123")
# Download data ------------------------------------------------------
# The example data used in this tutorial is RNA-seq of 3 cell
# populations in the mouse mammary gland collected by Sheridan et al.,
# 2015. The code below downloads the count data from the Gene
# Expression Omnibus (GEO), an NCBI repository of genomics data.
#
# If you receive an error message from the function `download.file`,
# try setting the argument `method`. For example try `method =
# "wget"`` or `method = "curl"`.
url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63310&format=file"
utils::download.file(url, destfile = "GSE63310_RAW.tar", mode = "wb")
utils::untar("GSE63310_RAW.tar", exdir = ".")
files_gz <- Sys.glob("GSM*txt.gz")
for(f in files_gz)
R.utils::gunzip(f, overwrite = TRUE)
# Citation for data set:
#
# Sheridan JM, Ritchie ME, Best SA, et al.: A pooled shRNA screen for
# regulators of primary mammary stem and progenitor cells identifies
# roles for Asap1 and Prox1. BMC Cancer. 2015; 15(1): 221.
# Setup --------------------------------------------------------------
# Load the packages we will use.
library("limma") # Linear models for differential expression
library("Glimma") # Interactive plots for exploration
library("edgeR") # Process count data from NGS experiments
library("Mus.musculus") # Gene annotations for the Mus musculus genome
# Import the data.
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt", "GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt", "GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt", "GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
read.delim(files[1], nrow = 5)
x <- readDGE(files, columns = c(1, 3))
class(x)
dim(x)
names(x)
str(x)
# Annotate the samples.
x$samples
samplenames <- substring(colnames(x), 12, nchar(colnames(x)))
samplenames
colnames(x) <- samplenames
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP",
"Basal", "ML", "LP"))
x$samples$group <- group
lane <- as.factor(rep(c("L004", "L006", "L008"), c(3, 4, 2)))
x$samples$lane <- lane
x$samples
# Annotate the genes.
head(x$counts)
dim(x$counts)
geneid <- rownames(x)
genes <- select(Mus.musculus, keys = geneid,
columns = c("SYMBOL", "TXCHROM"),
keytype = "ENTREZID")
head(genes)
genes <- genes[!duplicated(genes$ENTREZID), ]
x$genes <- genes
x
# Filter genes -------------------------------------------------------
# Calculate (log) counts-per-million (cpm)
cpm <- cpm(x)
lcpm <- cpm(x, log = TRUE)
# Detect genes with zero counts across all 9 samples
table(rowSums(x$counts == 0) == 9)
# Visualize distribution of gene expression levels
plotDensities(lcpm, legend = FALSE, main = "Before filtering")
abline(v = 0, lty = 3)
# Only keep genes which have cpm greater than 1 in at least 3 samples.
keep.exprs <- rowSums(cpm > 1) >= 3
x <- x[keep.exprs, , keep.lib.sizes = FALSE]
dim(x)
lcpm <- cpm(x, log=TRUE)
# Visualize distribution of gene expression levels after filtering
plotDensities(lcpm, legend = FALSE, main = "After filtering")
abline(v = 0, lty = 3)
# Discuss: What information are we ignoring when we use
# counts-per-million? Is it a valid measurement for performing a
# differential expression analysis?
# Normalization ------------------------------------------------------
# The default normalization provided by edgeR is TMM (trimmed mean of
# M-values), which prevents differences in highly expressed genes from
# biasing the entire distribution of gene expression. It often has a
# modest effect, as observed here.
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors
# But here is a extreme toy example that demonstrates it will work if
# necessary.
x2 <- x
x2$samples$norm.factors <- 1
x2$counts[,1] <- ceiling(x2$counts[, 1] * 0.05)
x2$counts[,2] <- x2$counts[, 2] * 5
lcpm2 <- cpm(x2, log = TRUE)
boxplot(lcpm2, las = 2, main = "Before normalization")
x2 <- calcNormFactors(x2)
x2$samples$norm.factors
lcpm2 <- cpm(x2, log = TRUE)
boxplot(lcpm2, las=2, main = "After normalization")
# Exploration --------------------------------------------------------
# Visualize sample relationships with multidimensional scaling (MDS).
library("RColorBrewer")
group
col.group <- group
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
lane
col.lane <- lane
levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels = group, col = col.group,
main = "group")
plotMDS(lcpm, labels = lane, col = col.lane, dim = c(3, 4),
main = "lane")
# Interactive MDS plot
glMDSPlot(lcpm, labels = paste(group, lane, sep = "_"),
groups = x$samples[, c(2, 5)], launch = TRUE)
# Construct linear model ---------------------------------------------
# The linear model we construct will not have an intercept. This is
# referred to as the group-means parametrization in the limma User's
# Guide. The advantage of having each coefficient (beta) model the
# mean expression level for that group is that it makes it more
# straightforward to test specific hypotheses.
design <- model.matrix(~0 + group + lane)
colnames(design) <- gsub("group", "", colnames(design))
design
# Here is the model specified in LaTeX, where Y is the expression
# level of a particular gene in a particular sample:
#
# Y=\beta_{basal}+\beta_{LP}+\beta_{ML}+\beta_{L06}+\beta_{L08}+\epsilon
# We create the following 3 contrasts to test the following 3
# hypotheses:
#
# BasalvsLP: Which genes are DE between Basal and LP cells?
#
# BasalvsML: Which genes are DE between Basal and ML cells?
#
# LPvsML: Which genes are DE between LP and ML cells?
contr.matrix <- makeContrasts(BasalvsLP = Basal - LP,
BasalvsML = Basal - ML,
LPvsML = LP - ML,
levels = colnames(design))
contr.matrix
# Convert counts to be used in linear model --------------------------
# The RNA-seq counts cannot be used directly in the linear model
# because they violoate its assumptions, specifically that the
# variance should not depend on the mean. One option is to perform a
# test that directly models the counts (e.g. such models are provided
# by edgeR and DESeq2). However, the linear modelling framework is
# generally more flexible, and limma has many nice downstream
# functions for further testing the data (e.g. testing for enrichment
# of functional categories).
#
# Even after converting to log-cpm, the RNA-seq data still has the
# mean-variance relationship. Thus the function `voom` calculates
# weights to offset this relationship.
v <- voom(x, design, plot = TRUE)
v
# Note that for convenience `voom` also calculates the log-cpm and
# optionally applies additional normalization. These side-effects
# should not be mistaken as the primary purpose of `voom` since
# standardization to log-cpm and normalization can easily be done by
# other functions (which is what `voom` does under the hood). The
# primary purpose of `voom` is to calculate the weights to be used in
# the linear model.
# Test for differential expression (DE) ------------------------------
# Fit a linear model per gene.
vfit <- lmFit(v, design)
# Calculate the statistics for our specific contrasts of interest.
vfit <- contrasts.fit(vfit, contrasts = contr.matrix)
# Calculate levels of significance. Uses an empirical Bayes algorithm
# to shrink the gene-specific variances towards the average variance
# across all genes.
efit <- eBayes(vfit)
# As seen in this diagnostic plot of the residual variation versus the
# mean gene expression level, `voom` successfully removed the
# relationship between the mean and the variance.
plotSA(efit, main = "Final model: Mean−variance trend")
# Discuss: What is the value of sharing information across genes? In
# other words, why not just use the original variance calculated per
# gene?
# Explore the results ------------------------------------------------
# Tabulate the results
summary(decideTests(efit))
# If the magnitude of the effect size is important for your downstream
# analysis (e.g. perhaps you are trying to prioritize genes for futher
# experiments), you can specify a minimal log-fold-change with the
# function `treat` isntead of using `eBayes`.
tfit <- treat(vfit, lfc = 1)
dt <- decideTests(tfit)
summary(dt)
# Create a venn diagram of the results.
head(dt)
de.common <- which(dt[, 1] != 0 & dt[, 2] != 0)
length(de.common)
head(tfit$genes$SYMBOL[de.common], n = 20)
vennDiagram(dt[, 1:2], circle.col = c("turquoise", "salmon"))
write.fit(tfit, dt, file = "results.txt")
# Identify top DE genes.
basal.vs.lp <- topTreat(tfit, coef = 1, n = Inf)
basal.vs.ml <- topTreat(tfit, coef = 2, n = Inf)
head(basal.vs.lp)
head(basal.vs.ml)
# Visualize DE genes.
plotMD(tfit, column = 1, status = dt[, 1], main = colnames(tfit)[1],
xlim = c(-8, 13))
glMDPlot(tfit, coef = 1, status = dt, main = colnames(tfit)[1],
side.main = "ENTREZID", counts = x$counts, groups = group,
launch = TRUE)
# View heatmap of top 100 DE genes between Basal and LP cells.
library("gplots")
basal.vs.lp.topgenes <- basal.vs.lp$ENTREZID[1:100]
i <- which(v$genes$ENTREZID %in% basal.vs.lp.topgenes)
mycol <- colorpanel(1000, "blue", "white", "red")
heatmap.2(v$E[i, ], scale = "row",
labRow = v$genes$SYMBOL[i], labCol = group,
col = mycol, trace = "none", density.info = "none")
# Discuss: Why did we scale by row (per gene) in the heatmap?
# Test for enrichment of gene sets -----------------------------------
# A common first-step post-DE testing is to test for enrichment of
# specific gene sets, e.g. Gene Ontology (GO) categories or KEGG
# pathways. Here we will use the gene set collection MSigDB c2 from
# the Broad Institute.
load(system.file("extdata", "mouse_c2_v5p1.rda", package = "RNAseq123"))
class(Mm.c2)
Mm.c2$KEGG_GLYCOLYSIS_GLUCONEOGENESIS
idx <- ids2indices(Mm.c2, id = rownames(v))
# The limma function `camera` tests for enrichment for a specific
# contrast of interest by comparing all the gene sets against one
# another.
cam.BasalvsLP <- camera(v, idx, design, contrast = contr.matrix[, 1])
head(cam.BasalvsLP, 5)
cam.BasalvsML <- camera(v, idx, design, contrast = contr.matrix[, 2])
head(cam.BasalvsML, 5)
cam.LPvsML <- camera(v, idx, design, contrast = contr.matrix[, 3])
head(cam.LPvsML, 5)
# Visualize gene set enrichment with a barcode plot.
barcodeplot(efit$t[, 3], index = idx$LIM_MAMMARY_LUMINAL_MATURE_UP,
index2 = idx$LIM_MAMMARY_LUMINAL_MATURE_DN,
main = "LPvsML")
# Note that the directions of the effect are reversed from the results
# in the original study because we tested the contrast LP - ML,
# whereas the original analysis tested ML - LP.
# Report session information -----------------------------------------
# It's always a good idea to report the versions of the software you
# used to generate results.
sessionInfo()
# The code in this tutorial was tested with the following setup:
# > sessionInfo()
# R version 3.6.3 (2020-02-29)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 18363)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C LC_TIME=English_United States.1252
#
# attached base packages:
# [1] parallel stats4 stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] gplots_3.0.3 RColorBrewer_1.1-2
# [3] Mus.musculus_1.3.1 TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
# [5] org.Mm.eg.db_3.10.0 GO.db_3.10.0
# [7] OrganismDbi_1.28.0 GenomicFeatures_1.38.2
# [9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
# [11] AnnotationDbi_1.48.0 IRanges_2.20.2
# [13] S4Vectors_0.24.3 Biobase_2.46.0
# [15] BiocGenerics_0.32.0 edgeR_3.28.0
# [17] Glimma_1.14.0 limma_3.42.0
#
# loaded via a namespace (and not attached):
# [1] httr_1.4.1 bit64_0.9-7 jsonlite_1.6.1 R.utils_2.9.2
# [5] gtools_3.8.2 assertthat_0.2.1 askpass_1.1 BiocManager_1.30.10
# [9] BiocFileCache_1.10.2 RBGL_1.62.1 blob_1.2.1 GenomeInfoDbData_1.2.2
# [13] Rsamtools_2.2.3 progress_1.2.2 pillar_1.4.3 RSQLite_2.2.0
# [17] lattice_0.20-40 glue_1.3.1 digest_0.6.25 XVector_0.26.0
# [21] Matrix_1.2-18 R.oo_1.23.0 XML_3.99-0.3 pkgconfig_2.0.3
# [25] biomaRt_2.42.1 zlibbioc_1.32.0 purrr_0.3.3 gdata_2.18.0
# [29] BiocParallel_1.20.1 tibble_2.1.3 openssl_1.4.1 SummarizedExperiment_1.16.1
# [33] magrittr_1.5 crayon_1.3.4 memoise_1.1.0 R.methodsS3_1.8.0
# [37] graph_1.64.0 tools_3.6.3 prettyunits_1.1.1 hms_0.5.3
# [41] matrixStats_0.56.0 stringr_1.4.0 locfit_1.5-9.1 DelayedArray_0.12.2
# [45] Biostrings_2.54.0 compiler_3.6.3 caTools_1.18.0 rlang_0.4.5
# [49] grid_3.6.3 RCurl_1.98-1.1 rstudioapi_0.11 rappdirs_0.3.1
# [53] bitops_1.0-6 DBI_1.1.0 curl_4.3 R6_2.4.1
# [57] GenomicAlignments_1.22.1 dplyr_0.8.5 rtracklayer_1.46.0 bit_1.1-15.2
# [61] KernSmooth_2.23-16 stringi_1.4.6 Rcpp_1.0.3 vctrs_0.2.4
# [65] dbplyr_1.4.2 tidyselect_1.0.0
# Further reading ----------------------------------------------------
# The limma User's Guide (run `limmaUsersGuide()` in the R console) is
# very useful. Especially see Section 2.1 for citations to the primary
# publications that describe the methods and Chapter 9 for how to
# contruct the model for different types of study designs.
# The Bioconductor support site (https://support.bioconductor.org/)
# has years worth of questions and answers about RNA-seq analysis and
# other topics in bioinformatics.
# Overview of RNA-seq analysis
#
# A. Oshlack, M. D. Robinson and M. D. Young. “From RNA-seq reads to
# differential expression results”. In: _Genome Biology_ 11.12 (2010),
# p. 220. DOI: 10.1186/gb-2010-11-12-220.
# R/Bioconductor tutorial starting from fastq files
#
# Chen Y, Lun ATL and Smyth GK. From reads to genes to pathways:
# differential expression analysis of RNA-Seq experiments using
# Rsubread and the edgeR quasi-likelihood pipeline [version 2;
# referees: 5 approved]. F1000Research 2016, 5:1438 (doi:
# 10.12688/f1000research.8987.2)
# Comparisons of RNA-seq methods for differential expression testing
#
# F. Rapaport, R. Khanin, Y. Liang, et al. “Comprehensive evaluation
# of differential gene expression analysis methods for RNA-seq data”.
# In: _Genome Biology_ 14.9 (2013), p. R95. DOI:
# 10.1186/gb-2013-14-9-r95.
#
# C. Soneson and M. Delorenzi. “A comparison of methods for
# differential expression analysis of RNA-seq data”. In: _BMC
# Bioinformatics_ 14.1 (2013), p. 91. DOI: 10.1186/1471-2105-14-91.
# limma
#
# Ritchie ME, Phipson B, Wu D, et al.: limma powers differential
# expression analyses for RNA-sequencing and microarray studies.
# Nucleic Acids Res. 2015; 43(7): e47.
# Glimma
#
# Su S, Ritchie ME: Glimma: Interactive HTML graphics for RNA-seq
# data. 2016; R package version 1.1.1.
# edgeR
#
# Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package
# for differential expression analysis of digital gene expression
# data. Bioinformatics. 2010; 26(1): 139–140.
# Bioconductor project
#
# Huber W, Carey VJ, Gentleman R, et al.: Orchestrating
# high-throughput genomic analysis with Bioconductor. Nat Methods.
# 2015; 12(2): 115–121.
# TMM normalization
#
# Robinson MD, Oshlack A: A scaling normalization method for
# differential expression analysis of RNA-seq data. Genome Biol. 2010;
# 11(3): R25.
# limma+voom
#
# Law CW, Chen Y, Shi W, et al.: voom: Precision weights unlock linear
# model analysis tools for RNA-seq read counts. Genome Biol. 2014;
# 15(2): R29
# Empirical Bayes to estimate gene expression variance
#
# Smyth GK: Linear models and empirical bayes methods for assessing
# differential expression in microarray experiments. Stat Appl Genet
# Mol Biol. 2004; 3(1): Article3.
@Iroda-0809
Copy link

Dear, John Blischak
When I try to run readDGE function, it is showing Error
Error in readDGE(files, columns = c(1, 3)) :
could not find function "readDGE"

Can u advise me something about this function? Right now I'm using the R 4.0.0 version.

@jdblischak
Copy link
Author

Hi @Iroda-0809. The function readDGE() is in the package edgeR. You need to load the package in your R session prior to running readDGE():

library("edgeR")

@loisvdpluijm
Copy link

Hi,

Thanks for sharing this code, very helpful! However, somehow I cannot even get past the gene annotation, since it seems to be impossible for me to get the mus.musculus data. Any clue?

3: In install.packages(...) :
installation of package ‘TxDb.Mmusculus.UCSC.mm10.knownGene’ had non-zero exit status
4: In install.packages(...) :
installation of package ‘Mus.musculus’ had non-zero exit status

also when i try to get the mus.musculus from bioconductor seperately, the same problem appears to happen.. Any idea?

Thanks so much :)

@jdblischak
Copy link
Author

Hi @loisvdpluijm, what command did you run when you tried to install the package? The Bioconductor installation instructions have changed since this tutorial was written. Starting in 2018, the package BiocManager was released for installing Bioconductor packages. You can read more at the Bioconductor installation instructions. For this tutorial, you'll want to run the below to install the RNAseq123 workflow:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("RNAseq123")

If that still fails, please copy-paste the command you entered and the full output in order for me to better understand how it failed. Also, I wanted to let you know that Bioconductor has a Support Site. If I can't figure out what is going wrong, then you could post there.

@loisvdpluijm
Copy link

loisvdpluijm commented Apr 8, 2020

Heya,

Awesome that you are willing to answer and help! What you suggest is indeed what I runned! The rest of the packages like limma and glimma are perfectly fine and i am able to load those using the library function without any problems :)

Here is the entire thing that I get: I am sorry for this huge blob of text

`> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
> BiocManager::install("RNAseq123")
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.3 (2020-02-29)
Installing package(s) 'RNAseq123'
also installing the dependencies ‘GO.db’, ‘org.Mm.eg.db’, ‘TxDb.Mmusculus.UCSC.mm10.knownGene’, ‘Mus.musculus’

installing the source packages ‘GO.db’, ‘org.Mm.eg.db’, ‘TxDb.Mmusculus.UCSC.mm10.knownGene’, ‘Mus.musculus’, ‘RNAseq123’

trying URL 'https://bioconductor.org/packages/3.10/data/annotation/src/contrib/GO.db_3.10.0.tar.gz'
Content type 'application/x-gzip' length 31820873 bytes (30.3 MB)
downloaded 30.3 MB

trying URL 'https://bioconductor.org/packages/3.10/data/annotation/src/contrib/org.Mm.eg.db_3.10.0.tar.gz'
Content type 'application/x-gzip' length 72221575 bytes (68.9 MB)
downloaded 68.9 MB

trying URL 'https://bioconductor.org/packages/3.10/data/annotation/src/contrib/TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0.tar.gz'
Content type 'application/x-gzip' length 24858207 bytes (23.7 MB)
downloaded 23.7 MB

trying URL 'https://bioconductor.org/packages/3.10/data/annotation/src/contrib/Mus.musculus_1.3.1.tar.gz'
Content type 'application/x-gzip' length 1618 bytes
downloaded 1618 bytes

trying URL 'https://bioconductor.org/packages/3.10/workflows/src/contrib/RNAseq123_1.10.0.tar.gz'
Content type 'application/x-gzip' length 4796432 bytes (4.6 MB)
downloaded 4.6 MB

* installing *source* package 'GO.db' ...
** using staged installation
Warning in file(file, if (append) "a" else "w") :
  cannot open file 'C:/Users/Loos/Documents/R/win-library/3.6/00LOCK-GO.db/00new/GO.db/DESCRIPTION': No such file or directory
Error in file(file, if (append) "a" else "w") : 
  cannot open the connection
ERROR: installing package DESCRIPTION failed for package 'GO.db'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/GO.db'
* installing *source* package 'org.Mm.eg.db' ...
** using staged installation
Warning in file(file, if (append) "a" else "w") :
  cannot open file 'C:/Users/Loos/Documents/R/win-library/3.6/00LOCK-org.Mm.eg.db/00new/org.Mm.eg.db/DESCRIPTION': No such file or directory
Error in file(file, if (append) "a" else "w") : 
  cannot open the connection
ERROR: installing package DESCRIPTION failed for package 'org.Mm.eg.db'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/org.Mm.eg.db'
* installing *source* package 'TxDb.Mmusculus.UCSC.mm10.knownGene' ...
** using staged installation
Warning in file(file, if (append) "a" else "w") :
  cannot open file 'C:/Users/Loos/Documents/R/win-library/3.6/00LOCK-TxDb.Mmusculus.UCSC.mm10.knownGene/00new/TxDb.Mmusculus.UCSC.mm10.knownGene/DESCRIPTION': No such file or directory
Error in file(file, if (append) "a" else "w") : 
  cannot open the connection
ERROR: installing package DESCRIPTION failed for package 'TxDb.Mmusculus.UCSC.mm10.knownGene'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/TxDb.Mmusculus.UCSC.mm10.knownGene'
ERROR: dependencies 'GO.db', 'org.Mm.eg.db', 'TxDb.Mmusculus.UCSC.mm10.knownGene' are not available for package 'Mus.musculus'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/Mus.musculus'
ERROR: dependency 'Mus.musculus' is not available for package 'RNAseq123'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/RNAseq123'

The downloaded source packages are in
	‘C:\Users\Loïs\AppData\Local\Temp\RtmpaeQJ0x\downloaded_packages’
Installation path not writeable, unable to update packages: class, foreign, lattice, nlme, nnet, survival
Old packages: 'backports', 'fs', 'glue', 'lubridate'
Update all/some/none? [a/s/n]: 
a

  There are binary versions available but the source versions are later:
          binary source needs_compilation
backports  1.1.5  1.1.6              TRUE
fs         1.4.0  1.4.1              TRUE
glue       1.3.2  1.4.0              TRUE
lubridate  1.7.4  1.7.8              TRUE

  Binaries will be installed
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/backports_1.1.5.zip'
Content type 'application/zip' length 68649 bytes (67 KB)
downloaded 67 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/fs_1.4.0.zip'
Content type 'application/zip' length 774540 bytes (756 KB)
downloaded 756 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/glue_1.3.2.zip'
Content type 'application/zip' length 153968 bytes (150 KB)
downloaded 150 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/lubridate_1.7.4.zip'
Content type 'application/zip' length 1581320 bytes (1.5 MB)
downloaded 1.5 MB

package ‘backports’ successfully unpacked and MD5 sums checked
package ‘fs’ successfully unpacked and MD5 sums checked
package ‘glue’ successfully unpacked and MD5 sums checked
package ‘lubridate’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Loïs\AppData\Local\Temp\RtmpaeQJ0x\downloaded_packages
Warning messages:
1: In install.packages(...) :
  installation of package ‘GO.db’ had non-zero exit status
2: In install.packages(...) :
  installation of package ‘org.Mm.eg.db’ had non-zero exit status
3: In install.packages(...) :
  installation of package ‘TxDb.Mmusculus.UCSC.mm10.knownGene’ had non-zero exit status
4: In install.packages(...) :
  installation of package ‘Mus.musculus’ had non-zero exit status
5: In install.packages(...) :
  installation of package ‘RNAseq123’ had non-zero exit status`

So there are 2 things that seem to be off. First of all it sometimes refers to my folder as "Loos" instead of "Loïs". I thought that maybe it did not comprehend the "i" with two dots, so I changed the folder's name. This did not seem to be the problem. Second thing is that it is not able to update certain packages. It doesnt seem to matter if I then choose to try and update them anyway or leave them like that.

Appreciate your help! Thanks a lot!

@jdblischak
Copy link
Author

OK. I haven't seen that particular error before. Can you try the following:

Sys.setenv(R_INSTALL_STAGED = FALSE)
BiocManager::install("GO.db")

Also, could you please share the results of sessionInfo()?

@loisvdpluijm
Copy link

Running that left me with kind of the same thing:

> Sys.setenv(R_INSTALL_STAGED = FALSE)
> BiocManager::install("GO.db")
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.3 (2020-02-29)
Installing package(s) 'GO.db'
installing the source package ‘GO.db’

trying URL 'https://bioconductor.org/packages/3.10/data/annotation/src/contrib/GO.db_3.10.0.tar.gz'
Content type 'application/x-gzip' length 31820873 bytes (30.3 MB)
downloaded 30.3 MB

* installing *source* package 'GO.db' ...
** using non-staged installation
Warning in file(file, if (append) "a" else "w") :
  cannot open file 'C:/Users/Loos/Documents/R/win-library/3.6/GO.db/DESCRIPTION': No such file or directory
Error in file(file, if (append) "a" else "w") : 
  cannot open the connection
ERROR: installing package DESCRIPTION failed for package 'GO.db'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/GO.db'

The downloaded source packages are in
	‘C:\Users\Loïs\AppData\Local\Temp\RtmpaeQJ0x\downloaded_packages’
Installation path not writeable, unable to update packages: class, foreign, lattice, nlme, nnet, survival
Old packages: 'backports', 'fs', 'glue', 'lubridate'

sessionInfo as follows:

> > sessionInfo
> function (package = NULL) 
> {
>     z <- list()
>     z$R.version <- R.Version()
>     z$platform <- z$R.version$platform
>     if (nzchar(.Platform$r_arch)) 
>         z$platform <- paste(z$platform, .Platform$r_arch, sep = "/")
>     z$platform <- paste0(z$platform, " (", 8 * .Machine$sizeof.pointer, 
>         "-bit)")
>     z$locale <- Sys.getlocale()
>     z$running <- osVersion
>     z$RNGkind <- RNGkind()
>     if (is.null(package)) {
>         package <- grep("^package:", search(), value = TRUE)
>         keep <- vapply(package, function(x) x == "package:base" || 
>             !is.null(attr(as.environment(x), "path")), 
>             NA)
>         package <- .rmpkg(package[keep])
>     }
>     pkgDesc <- lapply(package, packageDescription, encoding = NA)
>     if (length(package) == 0) 
>         stop("no valid packages were specified")
>     basePkgs <- sapply(pkgDesc, function(x) !is.null(x$Priority) && 
>         x$Priority == "base")
>     z$basePkgs <- package[basePkgs]
>     if (any(!basePkgs)) {
>         z$otherPkgs <- pkgDesc[!basePkgs]
>         names(z$otherPkgs) <- package[!basePkgs]
>     }
>     loadedOnly <- loadedNamespaces()
>     loadedOnly <- loadedOnly[!(loadedOnly %in% package)]
>     if (length(loadedOnly)) {
>         names(loadedOnly) <- loadedOnly
>         pkgDesc <- c(pkgDesc, lapply(loadedOnly, packageDescription))
>         z$loadedOnly <- pkgDesc[loadedOnly]
>     }
>     z$matprod <- as.character(options("matprod"))
>     es <- extSoftVersion()
>     z$BLAS <- as.character(es["BLAS"])
>     z$LAPACK <- La_library()
>     class(z) <- "sessionInfo"
>     z
> }
> <bytecode: 0x000000001da6db28>
> <environment: namespace:utils>
> >

@jdblischak
Copy link
Author

For sessionInfo(), you need to include the parentheses to execute the function. You sent the function definition.

@loisvdpluijm
Copy link

God, that was super dumb. I am so sorry.

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.28.1  Glimma_1.14.0 limma_3.42.2 

loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.6.3      tools_3.6.3         Rcpp_1.0.4          grid_3.6.3         
[6] locfit_1.5-9.4      jsonlite_1.6.1      lattice_0.20-38    

@jdblischak
Copy link
Author

No worries!

I'm able to install GO.db on Windows:

> BiocManager::install("GO.db")
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.3 (2020-02-29)
Installing package(s) 'GO.db'
installing the source package ‘GO.db’

trying URL 'https://bioconductor.org/packages/3.10/data/annotation/src/contrib/GO.db_3.10.0.tar.gz'
Content type 'application/x-gzip' length 31820873 bytes (30.3 MB)
downloaded 30.3 MB

* installing *source* package 'GO.db' ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'GO.db'
    finding HTML links ... done
    GOBASE                                  html  
    GOBPANCESTOR                            html  
    GOBPCHILDREN                            html  
    GOBPOFFSPRING                           html  
    GOBPPARENTS                             html  
    GOCCANCESTOR                            html  
    GOCCCHILDREN                            html  
    GOCCOFFSPRING                           html  
    GOCCPARENTS                             html  
    GOMAPCOUNTS                             html  
    GOMFANCESTOR                            html  
    GOMFCHILDREN                            html  
    GOMFOFFSPRING                           html  
    GOMFPARENTS                             html  
    GOOBSOLETE                              html  
    GOSYNONYM                               html  
    GOTERM                                  html  
    GO_dbconn                               html  
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (GO.db)

The downloaded source packages are in
	‘C:\Users\john\AppData\Local\Temp\RtmpuCnRoE\downloaded_packages’
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.6.3      tools_3.6.3  

@jdblischak
Copy link
Author

From searching your issue, it looks like it is likely due to your username:

https://stat.ethz.ch/pipermail/r-help/2014-February/371262.html

Note that even though you changed your username, R still recognizes both versions. The first message says Loos and the second says Loïs.

Warning in file(file, if (append) "a" else "w") :
  cannot open file 'C:/Users/Loos/Documents/R/win-library/3.6/GO.db/DESCRIPTION': No such file or directory
Error in file(file, if (append) "a" else "w") : 
  cannot open the connection
ERROR: installing package DESCRIPTION failed for package 'GO.db'
* removing 'C:/Users/Loïs/Documents/R/win-library/3.6/GO.db'

Thus I'd recommend restarting R (or even better, restart your computer), and trying again. You can confirm via .libPaths() to see the path that R is looking for packages.

@loisvdpluijm
Copy link

Missed your last comment. I am going to try again, but I did already tried this cause this was also the only thing I could find in the errors that made sense. I even created a new user on my computer, since it is hard to change the name of user maps (lots of other programs depend on it ofcourse). Then the names seemed to be the same in both messages. Maybe I should even redownload R and place it in another folder? I will check it out later today. Thanks John!

@nsamanta
Copy link

nsamanta commented Dec 2, 2021

Hi @jdblischak , Thank you for sharing this tutorial. I have a qq about your multivariate model. Could you have ran a glmLRT() for each contrast group too?

@jdblischak
Copy link
Author

Could you have ran a glmLRT() for each contrast group too?

Sure. You can use the edgeR testing functions instead. You'd replace voom/lmFit/contrasts.fit/eBayes with estimateDisp/glmFit/glmLRT. See section 3.2.3 of the edgeR User's Guide. It shows how to use makeContrasts with edgeR functions. It uses the alternative glmQLFit and glmQLFTest. I'll let you evaluate and decide which would be better for your use case.

Though I have to ask, how many samples do you have per group in your experiment? If you have >=7 samples per group, I'd recommend using limma+voom. See Soneson & Delorenzi , 2013.

@nsamanta
Copy link

nsamanta commented Dec 3, 2021

Would any of this allow me to get top DE genes based on "adjusted p values"? Does toptags() consider adj-p values?

@jdblischak
Copy link
Author

Yes. topTags(), topTreat(), and topTable() all run p.adjust() to generate adjusted p-values. You can set the argument adjust.method to choose the multiple testing correction method. I recommend reading ?topTags and ?p.adjust to learn about the options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment