Skip to content

Instantly share code, notes, and snippets.

@stevekm
Last active June 6, 2020 03:31
Show Gist options
  • Save stevekm/f8f552dff7b900e09e94fc6f1be9afe8 to your computer and use it in GitHub Desktop.
Save stevekm/f8f552dff7b900e09e94fc6f1be9afe8 to your computer and use it in GitHub Desktop.
How to make a volcano plot of differentially bound genes using R and Plot.ly

Plot.ly Volcano Plot Example

Stephen Kelly
9/24/2016

In this example, I will demonstrate how to use gene differential binding data to create a volcano plot using R and Plot.ly. This dataset was generated by DiffBind during the analysis of a ChIP-Seq experiment. Each entry represents a bound peak that was differentially expressed between groups of samples.

First, install any libraries you might not have, load the Plot.ly library, and load the file containing your differential binding data.

# dependencies:
# install.packages("ggplot2")
# install.packages("gridExtra")
# install.packages("plotly")
suppressPackageStartupMessages(library("plotly"))

# path to the DiffBind table with genes
input_file <- "/Users/steve/Bioinformatics/DiffBind_scripts_reports/DiffBind_Volcano_Plot_report/input/diff_bind.Treatment4-ChIPSeq-vs-Control-ChIPSeq.p100.csv"

# read the file into a dataframe
diff_df <- read.delim(file = input_file,header = TRUE,sep = ',')

# check some attributes of the data
colnames(diff_df)
##  [1] "seqnames"                   "start"                     
##  [3] "end"                        "width"                     
##  [5] "strand"                     "Conc"                      
##  [7] "Conc_Treatment4.ChIPSeq"    "Conc_Control.ChIPSeq"      
##  [9] "Fold"                       "p.value"                   
## [11] "FDR"                        "Sample1.Treatment4.ChIPSeq"
## [13] "Sample2.Treatment4.ChIPSeq" "Sample3.Treatment4.ChIPSeq"
## [15] "Sample7.Control.ChIPSeq"    "Sample8.Control.ChIPSeq"   
## [17] "Sample9.Control.ChIPSeq"    "feature"                   
## [19] "external_gene_name"         "gene_biotype"              
## [21] "start_position"             "end_position"              
## [23] "insideFeature"              "distancetoFeature"         
## [25] "shortestDistance"           "fromOverlappingOrNearest"
dim(diff_df)
## [1] 3192   26

This table has a lot of information, but for this plot we only need the following:

  • Gene names
  • Fold change value
  • Statistical significance (e.g. p value, adjusted p value, etc.)

In this case we are using the False Discovery Rate as our signficance value.

# keep only the fields needed for the plot
# FDR = false discovery rate = adjusted p value = significance 
diff_df <- diff_df[c("external_gene_name", "Fold", "FDR")]

# preview the dataset; data required for the plot
head(diff_df)
##   external_gene_name  Fold      FDR
## 1      RP11-431K24.1 -4.13 1.04e-05
## 2              UBE4B  2.42 8.71e-06
## 3             UBIAD1  4.27 5.50e-06
## 4             UBIAD1  1.89 2.16e-04
## 5             UBIAD1  2.74 8.59e-05
## 6             UBIAD1  3.42 1.39e-08

In order to color the points on the plot easier, we will add an extra column to the data to mark whether each entry passes our cutoffs for Fold change and Significance. Our criteria for this experiment were:

  • Fold change > 1.5x change
  • FDR < 0.05 (similar to a p value of <0.05)

We would also like to label the top 10 genes; the ones with the highest positive or negative fold changes and FDR values.

# add a grouping column; default value is "not significant"
diff_df["group"] <- "NotSignificant"

# for our plot, we want to highlight 
# FDR < 0.05 (significance level)
# Fold Change > 1.5

# change the grouping for the entries with significance but not a large enough Fold change
diff_df[which(diff_df['FDR'] < 0.05 & abs(diff_df['Fold']) < 1.5 ),"group"] <- "Significant"

# change the grouping for the entries a large enough Fold change but not a low enough p value
diff_df[which(diff_df['FDR'] > 0.05 & abs(diff_df['Fold']) > 1.5 ),"group"] <- "FoldChange"

# change the grouping for the entries with both significance and large enough fold change
diff_df[which(diff_df['FDR'] < 0.05 & abs(diff_df['Fold']) > 1.5 ),"group"] <- "Significant&FoldChange"


# Find and label the top peaks..
top_peaks <- diff_df[with(diff_df, order(Fold, FDR)),][1:5,]
top_peaks <- rbind(top_peaks, diff_df[with(diff_df, order(-Fold, FDR)),][1:5,])


# Add gene labels to the plot
# Single Gene Annotation example
# m <- diff_df[with(diff_df, order(Fold, FDR)),][1,]
# a <- list(
#   x = m[["Fold"]],
#   y = -log10(m[["FDR"]]),
#   text = m[["external_gene_name"]],
#   xref = "x",
#   yref = "y",
#   showarrow = TRUE,
#   arrowhead = 7,
#   ax = 20,
#   ay = -40
# )

# Add gene labels for all of the top genes we found
# here we are creating an empty list, and filling it with entries for each row in the dataframe
# each list entry is another list with named items that will be used by Plot.ly
a <- list()
for (i in seq_len(nrow(top_peaks))) {
  m <- top_peaks[i, ]
  a[[i]] <- list(
    x = m[["Fold"]],
    y = -log10(m[["FDR"]]),
    text = m[["external_gene_name"]],
    xref = "x",
    yref = "y",
    showarrow = TRUE,
    arrowhead = 0.5,
    ax = 20,
    ay = -40
  )
}


# make the Plot.ly plot
p <- plot_ly(data = diff_df, x = Fold, y = -log10(FDR), text = external_gene_name, mode = "markers", color = group) %>% 
  layout(title ="Volcano Plot") %>%
  layout(annotations = a)
p

Here is a snapshot of what the plot will look like (the real Plot.ly output does not render here; see this page instead)

Image preview

# to save plot to a HTML file:
htmlwidgets::saveWidget(as.widget(p), "graph.html")

# System Information
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plotly_3.6.0  ggplot2_2.1.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7        knitr_1.14         magrittr_1.5      
##  [4] munsell_0.4.3      colorspace_1.2-6   R6_2.1.3          
##  [7] stringr_1.1.0      httr_1.2.1         plyr_1.8.4        
## [10] tools_3.3.1        grid_3.3.1         gtable_0.2.0      
## [13] htmltools_0.3.5    assertthat_0.1     yaml_2.1.13       
## [16] digest_0.6.10      tibble_1.2         gridExtra_2.2.1   
## [19] RColorBrewer_1.1-2 formatR_1.4        tidyr_0.6.0       
## [22] viridis_0.3.4      base64enc_0.1-3    htmlwidgets_0.7   
## [25] evaluate_0.9       rmarkdown_1.0      stringi_1.1.1     
## [28] scales_0.4.0       jsonlite_1.1

References:

https://plot.ly/r/offline/
https://plot.ly/r/
https://plot.ly/r/text-and-annotations/
https://plot.ly/r/reference/#Layout_and_layout_style_objects
http://www.gettinggeneticsdone.com/2014/05/r-volcano-plots-to-visualize-rnaseq-microarray.html

@shehneela
Copy link

hello,
i tried this code but it always gives this error

Error in plot_ly(data = diff_df, x = Fold, y = -log10(FDR), text = external_gene_name, :
object 'Fold' not found

@CharuKMidha
Copy link

@shehneela try diff_df$Fold, diff_df$FDR and diff_df$external_gene_name to omit the error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment