Skip to content

Instantly share code, notes, and snippets.

@crazyhottommy
Last active April 29, 2019 07:44
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save crazyhottommy/40bd107710f311f4098c1f64d9f494ac to your computer and use it in GitHub Desktop.
Save crazyhottommy/40bd107710f311f4098c1f64d9f494ac to your computer and use it in GitHub Desktop.

A function to convert

The script was copied from http://genomespot.blogspot.com/2016/12/msigdb-gene-sets-for-mouse.html by Mark Ziemann.

save the below script to convert_gmt_perline.sh:

This function translate the human gene symbol to mouse gene symbol for each line.

#!/bin/bash

# author: Mark Ziemann
# modified by: Ming Tang

set -eu
set -o pipefail


line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u | tr '\n' '\t' \
  | sed 's/\t$/\n/'`

  echo $NAME_DESC $GENES | tr ' ' '\t'

A mouse to human gene mapping file

Downloaded from the same site. https://gist.github.com/crazyhottommy/dde3c10cc367df715ef5c7c5f353b5f5

less -S mouse2hum_biomart_ens87.txt | head | csvtk csv2md -t

Gene ID Transcript ID Human associated gene name Human gene stable ID Associated Gene Name
ENSMUSG00000064336 ENSMUST00000082387 mt-Tf
ENSMUSG00000064337 ENSMUST00000082388 mt-Rnr1
ENSMUSG00000064338 ENSMUST00000082389 mt-Tv
ENSMUSG00000064339 ENSMUST00000082390 mt-Rnr2
ENSMUSG00000064340 ENSMUST00000082391 mt-Tl1
ENSMUSG00000064341 ENSMUST00000082392 MT-ND1 ENSG00000198888 mt-Nd1
ENSMUSG00000064342 ENSMUST00000082393 mt-Ti
ENSMUSG00000064343 ENSMUST00000082394 mt-Tq
ENSMUSG00000064344 ENSMUST00000082395 mt-Tm

Download the human msigDB sets and convert

download from http://software.broadinstitute.org/gsea/msigdb/collections.jsp#H

less -S h.all.v6.1.symbols.gmt
cat h.all.v6.1.symbols.gmt | parallel -k -j 5 './convert_gmt_perline.sh {}' > h.all.v6.1.symbols_mouse.gmt

use GSEA pre-rank

As I mentioned in my previous blog post, one can run GSEA in two modes:

  1. input the raw expression level of all samples.
  2. pre-rank the genes and use this ranked gene list for GSEA.

Even if you go with option 1, GSEA internally rank the gene list by some metrics. The default is signal to noise ratio: (uA - uB)/sigmaA + sigmaB.

As GSEA was designed for array data, for RNAseq data, what value (RPKM, rlog counts, or TPM?) to input to GSEA is still not well understood. see Q&A from GSEA. They recommend using the pre-rank method for RNAseq data.

To generate a pre-ranked gene list(for RNAseq, go from raw counts to DESeq2):

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = countmat,
                              colData = coldata,
                              design = ~ condition)
dds
dds <- DESeq(dds)


## I have 20 samples for each condition: knockout gene X, and WT gene X
## KO vs WT
dds$condition <- factor(dds$condition, levels = c("KO","WT"))

res <- results(dds, contrast=c("condition", "KO", "WT"))
resultsNames(dds)

res$ENSEMBL<- rownames(res)

# I used ENSEMBL gene id for the matrix, and annotate with gene symbols.
res_anno<- as.data.frame(res) %>% left_join(gene_symbols, c("ENSEMBL" = "gene_id"))

res_pre_rank<- sign(res_anno$log2FoldChange) * -log10(res_anno$pvalue)

# you will need the gene symbol, and suffix the file with rnk for GSEA to recognize
write_tsv(data.frame(Name = res_anno$gene_name, metric = res_pre_rank), "KO_vs_WT_pre_rank.rnk")

Now, you can load the rnk file into GSEA and do the analysis.

Intepretation of the GSEA figure

We rank the genes by sign of the fold change times the p-value (we told DESeq2 to compare KO vs WT, if a fold change is positive, that means the gene is up-regulated in KO), so the genes on the top (or left) is the genes with higher expression value in the KO group, while the genes on the bottom (or right) are the genes with lower expression value in KO group.

How to read the figure ?

The X-axis is all your genes in the expriment (~ 20,000 in this case) pre-ranked by your metric. each black bar is the gene in this gene set(pathway). You have an idea where are the genes located in the pre-ranked list.

Enrichement Score is calcuated by some metric that ES is positive if the gene set is located in the top of the pre-ranked gene list. ES is negative if the gene set is located in the bottom of the pre-ranked gene list.

We see glycolisis is positively co-related with KO.

IL6_JAK_SATA3 signaling is positively co-related with WT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment