crazyhottommy/convert_msigdb_human_gmt_to_mouse.md

## convert_msigdb_human_gmt_to_mouse.md

      
    Raw
  

              convert_msigdb_human_gmt_to_mouse.md
            
          
    A function to convert

The script was copied from http://genomespot.blogspot.com/2016/12/msigdb-gene-sets-for-mouse.html by Mark Ziemann.
save the below script to convert_gmt_perline.sh:
This function translate the human gene symbol to mouse gene symbol for each line.
#!/bin/bash

# author: Mark Ziemann
# modified by: Ming Tang

set -eu
set -o pipefail


line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u | tr '\n' '\t' \
  | sed 's/\t$/\n/'`

  echo $NAME_DESC $GENES | tr ' ' '\t'

A mouse to human gene mapping file

Downloaded from the same site. https://gist.github.com/crazyhottommy/dde3c10cc367df715ef5c7c5f353b5f5
less -S mouse2hum_biomart_ens87.txt  | head |  csvtk csv2md -t


Gene ID
Transcript ID
Human associated gene name
Human gene stable ID
Associated Gene Name


ENSMUSG00000064336
ENSMUST00000082387


mt-Tf


ENSMUSG00000064337
ENSMUST00000082388


mt-Rnr1


ENSMUSG00000064338
ENSMUST00000082389


mt-Tv


ENSMUSG00000064339
ENSMUST00000082390


mt-Rnr2


ENSMUSG00000064340
ENSMUST00000082391


mt-Tl1


ENSMUSG00000064341
ENSMUST00000082392
MT-ND1
ENSG00000198888
mt-Nd1


ENSMUSG00000064342
ENSMUST00000082393


mt-Ti


ENSMUSG00000064343
ENSMUST00000082394


mt-Tq


ENSMUSG00000064344
ENSMUST00000082395


mt-Tm


Download the human  msigDB sets and convert

download from http://software.broadinstitute.org/gsea/msigdb/collections.jsp#H
less -S h.all.v6.1.symbols.gmt
cat h.all.v6.1.symbols.gmt | parallel -k -j 5 './convert_gmt_perline.sh {}' > h.all.v6.1.symbols_mouse.gmt
use GSEA pre-rank

As I mentioned in my previous blog post, one can run GSEA in two modes:

input the raw expression level of all samples.
pre-rank the genes and use this ranked gene list for GSEA.

Even if you go with option 1, GSEA internally rank the gene list by some metrics. The default is signal to noise ratio: (uA - uB)/sigmaA + sigmaB.
As GSEA was designed for array data, for RNAseq data, what value (RPKM, rlog counts, or TPM?) to input to GSEA is still not well understood. see Q&A from GSEA. They recommend using the pre-rank method for RNAseq data.
To generate a pre-ranked gene list(for RNAseq, go from raw counts to DESeq2):
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = countmat,
                              colData = coldata,
                              design = ~ condition)
dds
dds <- DESeq(dds)


## I have 20 samples for each condition: knockout gene X, and WT gene X
## KO vs WT
dds$condition <- factor(dds$condition, levels = c("KO","WT"))

res <- results(dds, contrast=c("condition", "KO", "WT"))
resultsNames(dds)

res$ENSEMBL<- rownames(res)

# I used ENSEMBL gene id for the matrix, and annotate with gene symbols.
res_anno<- as.data.frame(res) %>% left_join(gene_symbols, c("ENSEMBL" = "gene_id"))

res_pre_rank<- sign(res_anno$log2FoldChange) * -log10(res_anno$pvalue)

# you will need the gene symbol, and suffix the file with rnk for GSEA to recognize
write_tsv(data.frame(Name = res_anno$gene_name, metric = res_pre_rank), "KO_vs_WT_pre_rank.rnk")
Now, you can load the rnk file into GSEA and do the analysis.
Intepretation of the GSEA figure

We rank the genes by sign of the fold change times the p-value (we told DESeq2 to compare KO vs WT, if a fold change is positive, that means the gene is up-regulated in KO), so the genes on the top (or left) is the genes with higher expression value in the KO group, while the genes on the bottom (or right) are the genes with lower expression value in KO group.
How to read the figure ?
The X-axis is all your genes in the expriment (~ 20,000 in this case) pre-ranked by your metric. each black bar is the gene in this gene set(pathway). You have an idea where are the genes located in the pre-ranked list.
Enrichement Score is calcuated by some metric that ES is positive if the gene set is located in the top of the pre-ranked gene list. ES is negative if the gene set is located in the bottom of the pre-ranked gene list.

We see glycolisis is positively co-related with KO.

IL6_JAK_SATA3 signaling is positively co-related with WT.
Gene ID	Transcript ID	Human associated gene name	Human gene stable ID	Associated Gene Name
ENSMUSG00000064336	ENSMUST00000082387			mt-Tf
ENSMUSG00000064337	ENSMUST00000082388			mt-Rnr1
ENSMUSG00000064338	ENSMUST00000082389			mt-Tv
ENSMUSG00000064339	ENSMUST00000082390			mt-Rnr2
ENSMUSG00000064340	ENSMUST00000082391			mt-Tl1
ENSMUSG00000064341	ENSMUST00000082392	MT-ND1	ENSG00000198888	mt-Nd1
ENSMUSG00000064342	ENSMUST00000082393			mt-Ti
ENSMUSG00000064343	ENSMUST00000082394			mt-Tq
ENSMUSG00000064344	ENSMUST00000082395			mt-Tm