@markziemann
Created September 3, 2019 00:25
Create a library of gene sets based on protein domains
#!/bin/bash
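# This script creates a GMT file of gene sets, one set per InterPro protein domain.
# First, obtain the gene-to-domain table from Ensembl BioMart:
# Go to https://www.ensembl.org/biomart/martview/
# Select the human genes dataset
# Select the following attributes, in this order (the script relies on the column positions):
# - Gene stable ID
# - Interpro ID
# - Interpro Short Description
# - Interpro Description
# - HGNC symbol
# Export the results as a tab-separated file named mart_export.txt in the working directory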
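DAT=mart_export.txt
# Loop over every unique InterPro ID (column 2), skipping the header row and blank IDs.
# Add "| head -5" after "sort -u" to test on only the first five domains.
for IPR in $(cut -f2 "$DAT" | sed 1d | grep -v '^$' | sort -u) ; do
  # Full InterPro description (column 4), taken from the first row with this ID
  NAME=$(awk -F'\t' -v id="$IPR" '$2==id {print $4; exit}' "$DAT")
  # Unique, non-empty HGNC symbols (column 5) for this domain, joined with tabs
  GENES=$(awk -F'\t' -v id="$IPR" '$2==id && $5!="" {print $5}' "$DAT" | sort -u | paste -s -)
  # GMT row: description, InterPro ID, then the member genes (skip domains with no symbols)
  [ -z "$GENES" ] || printf '%s\t%s\t%s\n' "$NAME" "$IPR" "$GENES"
done > ipr.gmt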
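
The loop above rescans the export once or twice per InterPro ID, which can be slow on a full human export. As a rough alternative, the same grouping can be sketched in a single awk pass. This is only a sketch under assumptions: the function name `build_gmt` is made up for illustration, the input is assumed to be tab-separated with the five columns in the order listed in the script comments, and the demo rows use invented IDs and symbols rather than real BioMart data. Unlike the loop, genes are listed in order of first appearance instead of alphabetically.

```shell
#!/bin/bash
# Single-pass sketch: group HGNC symbols by InterPro ID with awk.
# Assumed columns: 1 gene ID, 2 InterPro ID, 3 short description,
#                  4 description, 5 HGNC symbol (tab-separated, with a header row)
build_gmt() {
  awk -F'\t' '
    NR == 1 { next }                    # skip the header row
    $2 == "" || $5 == "" { next }       # skip rows lacking a domain or a symbol
    !(($2, $5) in seen) {               # first time this domain/gene pair is seen
      seen[$2, $5] = 1
      name[$2] = $4                     # description used as the set name
      genes[$2] = genes[$2] "\t" $5     # append the symbol to the set
    }
    END {
      for (id in genes) printf "%s\t%s%s\n", name[id], id, genes[id]
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2   # order rows by InterPro ID
}

# Tiny made-up input (hypothetical IDs and symbols) to show the output shape
demo=$(mktemp)
{
  printf 'Gene stable ID\tInterpro ID\tInterpro Short Description\tInterpro Description\tHGNC symbol\n'
  printf 'ENSG_A\tIPR900001\tDomA\tExample domain A\tGENEA\n'
  printf 'ENSG_A\tIPR900002\tDomB\tExample domain B\tGENEA\n'
  printf 'ENSG_B\tIPR900001\tDomA\tExample domain A\tGENEB\n'
  printf 'ENSG_B\tIPR900001\tDomA\tExample domain A\tGENEB\n'   # duplicate row
  printf 'ENSG_C\t\t\t\tGENEC\n'                                # gene with no domain
  printf 'ENSG_D\tIPR900002\tDomB\tExample domain B\t\n'        # gene with no symbol
} > "$demo"

build_gmt "$demo"
```

To run it on the real export, `build_gmt mart_export.txt > ipr.gmt` should produce the same kind of file as the loop, one tab-separated row per domain.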