Skip to content

Instantly share code, notes, and snippets.

@markziemann
Created February 15, 2024 02:18
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save markziemann/6126570ace44bee6db5aeff5313324a9 to your computer and use it in GitHub Desktop.
Save markziemann/6126570ace44bee6db5aeff5313324a9 to your computer and use it in GitHub Desktop.
Analysis of the coverage and depth of pathway databases included in MSigDB including KEGG, Reactome and GO Biological Process
---
title: "Why use KEGG?"
author: "Mark Ziemann"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_float: true
fig_width: 7
fig_height: 7
theme: cosmo
---
## Introduction
Here we will look at the contents of MSigDB, and compare the coverage of KEGG and other
databases.
```{r,libs}
library("mitch")
msigdb <- gmt_import("msigdb.v2023.2.Hs.symbols.gmt")
```
There are two types of KEGG pathways in the MSigDB dataset, KEGG Legacy and KEGG Medicus.
Let's take a look at these side by side with Reactome and GO Biological Process.
```{r,kegg1}
kegg_all <- msigdb[grep("KEGG_",names(msigdb))]
kegg_med <- msigdb[grep("KEGG_MED",names(msigdb))]
kegg_leg <- setdiff(kegg_all,kegg_med)
message("number of pathways in KEGG Legacy:")
length(kegg_leg)
message("number of pathways in KEGG Medicus:")
length(kegg_med)
reactome <- msigdb[grep("REACTOME_",names(msigdb))]
message("number of pathways in Reactome:")
length(reactome)
gobp <- msigdb[grep("GOBP_",names(msigdb))]
message("number of pathways in GOBP:")
length(gobp)
```
So KEGG legacy has 186 pathways and the most up to date KEGG Medicus has 619, but this is
dwarfed by Reactome with 1692 pathways and GO BP with 7649.
Granted there is some redundancy in Reactome and GO BP, but the difference is stark.
Let's dig a bit deeper, and find out the coverage, that is, what proportion of human genes
are mentioned in these three databases.
```{r,coverage1}
message("Total number of unique gene symbols in MSigDB 2023.2:")
length(unique(unlist(msigdb)))
message("Total number of unique gene symbols in KEGG Legacy:")
length(unique(unlist(kegg_leg)))
message("Proportion of genes covered with KEGG Legacy:")
paste(signif(length(unique(unlist(kegg_leg))) / length(unique(unlist(msigdb))) * 100, 3),"%")
message("Total number of unique gene symbols in KEGG Medicus:")
length(unique(unlist(kegg_med)))
message("Proportion of genes covered with KEGG Medicus:")
paste(signif(length(unique(unlist(kegg_med))) / length(unique(unlist(msigdb))) * 100, 3), "%")
message("Total number of unique gene symbols in Reactome:")
length(unique(unlist(reactome)))
message("Proportion of genes covered with Reactome:")
paste(signif(length(unique(unlist(reactome))) / length(unique(unlist(msigdb))) * 100, 3), "%")
message("Total number of unique gene symbols in GO BP:")
length(unique(unlist(gobp)))
message("Proportion of genes covered with GO BP:")
paste(signif(length(unique(unlist(gobp))) / length(unique(unlist(msigdb))) * 100, 3), "%")
```
A very interesting result.
Firstly, GO BP and Reactome dominate coverage, with 42.3% and 26.3% respectively which is waaaay better than KEGG legacy with 12.4%.
What surprised me was the paucity of KEGG Medicus data, which covers only 6.4% of genes, which is very disappointing.
Next, I wanted to look at the total number of annotations that are included in these databases.
```{r,numlinks}
message("Total number of functional annotations in MsigDB:")
length(unlist(msigdb))
message("Total number of functional annotations in KEGG Legacy:")
length(unlist(kegg_leg))
message("Total number of functional annotations in KEGG Medicus:")
length(unlist(kegg_med))
message("Total number of functional annotations in Reactome:")
length(unlist(reactome))
message("Total number of functional annotations in GO BP:")
length(unlist(gobp))
```
In conclusion, KEGG coverage is very poor as compared to the alternatives.
Reactome is a better source for canonical pathways but Gene Ontology Biological Process appears to be the most
comprehensive.
## Session information
```{r,sessionInfo}
sessionInfo()
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment