Created
February 15, 2024 02:18
-
-
Save markziemann/6126570ace44bee6db5aeff5313324a9 to your computer and use it in GitHub Desktop.
Analysis of the coverage and depth of pathway databases included in MSigDB including KEGG, Reactome and GO Biological Process
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Why use KEGG?" | |
author: "Mark Ziemann" | |
date: "`r Sys.Date()`" | |
output: | |
html_document: | |
toc: true | |
toc_float: true | |
fig_width: 7 | |
fig_height: 7 | |
theme: cosmo | |
--- | |
## Introduction | |
Here we will look at the contents of MSigDB, and compare the coverage of KEGG and other | |
databases. | |
```{r,libs} | |
library("mitch") | |
msigdb <- gmt_import("msigdb.v2023.2.Hs.symbols.gmt") | |
``` | |
There are two types of KEGG pathways in the MSigDB dataset, KEGG Legacy and KEGG Medicus. | |
Let's take a look at these side by side with Reactome and GO Biological Process. | |
```{r,kegg1} | |
kegg_all <- msigdb[grep("KEGG_",names(msigdb))] | |
kegg_med <- msigdb[grep("KEGG_MED",names(msigdb))] | |
kegg_leg <- setdiff(kegg_all,kegg_med) | |
message("number of pathways in KEGG Legacy:") | |
length(kegg_leg) | |
message("number of pathways in KEGG Medicus:") | |
length(kegg_med) | |
reactome <- msigdb[grep("REACTOME_",names(msigdb))] | |
message("number of pathways in Reactome:") | |
length(reactome) | |
gobp <- msigdb[grep("GOBP_",names(msigdb))] | |
message("number of pathways in GOBP:") | |
length(gobp) | |
``` | |
So KEGG legacy has 186 pathways and the most up to date KEGG Medicus has 619, but this is | |
dwarfed by Reactome with 1692 pathways and GO BP with 7649. | |
Granted there is some redundancy in Reactome and GO BP, but the difference is stark. | |
Let's dig a bit deeper, and find out the coverage, that is, what proportion of human genes | |
are mentioned in these three databases. | |
```{r,coverage1} | |
message("Total number of unique gene symbols in MSigDB 2023.2:") | |
length(unique(unlist(msigdb))) | |
message("Total number of unique gene symbols in KEGG Legacy:") | |
length(unique(unlist(kegg_leg))) | |
message("Proportion of genes covered with KEGG Legacy:") | |
paste(signif(length(unique(unlist(kegg_leg))) / length(unique(unlist(msigdb))) * 100, 3),"%") | |
message("Total number of unique gene symbols in KEGG Medicus:") | |
length(unique(unlist(kegg_med))) | |
message("Proportion of genes covered with KEGG Medicus:") | |
paste(signif(length(unique(unlist(kegg_med))) / length(unique(unlist(msigdb))) * 100, 3), "%") | |
message("Total number of unique gene symbols in Reactome:") | |
length(unique(unlist(reactome))) | |
message("Proportion of genes covered with Reactome:") | |
paste(signif(length(unique(unlist(reactome))) / length(unique(unlist(msigdb))) * 100, 3), "%") | |
message("Total number of unique gene symbols in GO BP:") | |
length(unique(unlist(gobp))) | |
message("Proportion of genes covered with GO BP:") | |
paste(signif(length(unique(unlist(gobp))) / length(unique(unlist(msigdb))) * 100, 3), "%") | |
``` | |
A very interesting result. | |
Firstly, GO BP and Reactome dominate coverage, with 42.3% and 26.3% respectively which is waaaay better than KEGG legacy with 12.4%. | |
What surprised me was the paucity of KEGG Medicus data, which covers only 6.4% of genes, which is very disappointing. | |
Next, I wanted to look at the total number of annotations that are included in these databases. | |
```{r,numlinks} | |
message("Total number of functional annotations in MsigDB:") | |
length(unlist(msigdb)) | |
message("Total number of functional annotations in KEGG Legacy:") | |
length(unlist(kegg_leg)) | |
message("Total number of functional annotations in KEGG Medicus:") | |
length(unlist(kegg_med)) | |
message("Total number of functional annotations in Reactome:") | |
length(unlist(reactome)) | |
message("Total number of functional annotations in GO BP:") | |
length(unlist(gobp)) | |
``` | |
In conclusion, KEGG coverage is very poor as compared to the alternatives. | |
Reactome is a better source for canonical pathways but Gene Ontology Biological Process appears to be the most | |
comprehensive. | |
## Session information | |
```{r,sessionInfo} | |
sessionInfo() | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment