Skip to content

Instantly share code, notes, and snippets.

@markziemann
Last active November 22, 2019 01:39
Show Gist options
  • Save markziemann/9876adbb887778955bb20e52dd42ae5e to your computer and use it in GitHub Desktop.
Save markziemann/9876adbb887778955bb20e52dd42ae5e to your computer and use it in GitHub Desktop.
Create a gmt file for the Remap2020 dataset http://remap.univ-amu.fr/target_page/AATF:9606
#!/bin/bash
# dependancies: parallel, bedtools, pigz
# promoters
wget -N ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz
GTFZ=Homo_sapiens.GRCh38.98.gtf.gz
TSS=tss.bed
zcat $GTFZ | grep 'exon_number "1"' | cut -f1,4,5,7,9 \
| cut -d '"' -f1,12 | sed 's/gene_id "//' | awk '$4=="+"' \
| awk '{OFS="\t"}{print $1,$2-1000,$2+1000,$5}' > $TSS
zcat $GTFZ | grep -w 'exon_number "1"' | cut -f1,4,5,7,9 \
| cut -d '"' -f1,12 | sed 's/gene_id "//' | awk '$4=="-"' \
| awk '{OFS="\t"}{print $1,$3-1000,$3+1000,$5}' >> $TSS
grep -v ^MT $TSS | sed 's/^/chr/'| sortBed > $TSS.tmp && mv $TSS.tmp $TSS
# remap
wget -N http://remap.univ-amu.fr/storage/remap2020/hg38/MACS2/remap2020_all_macs2_hg38_v1_0.bed.gz
mkdir peaks && cd peaks
ulimit -n 1000000
zcat ../remap2020_all_macs2_hg38_v1_0.bed.gz | awk '{print > $4".bed"}'
pk2gene(){
PEAK=$1
GENES=$2
bedtools intersect -wo -a $GENES -b $PEAK \
| sort -k9gr | awk '!arr[$4]++ {print $4}' \
| paste -s -d '\t' \
| sed "s/^/${PEAK}\tFromRemap2020\t/" > $PEAK.gmt
}
export -f pk2gene
parallel pk2gene ::: *bed ::: ../tss.bed
cat *gmt | pigz >../remap2020.gmt.gz
@markziemann
Copy link
Author

The gmt file is available from
http://dee2.io/data/tmp/

@markziemann
Copy link
Author

Could have saved a bunch of time if I used GTFtools to extract TSS coordinates!!

gtftools.py -t Homo_sapiens.GRCh38.94.gtf.tss.bed -w 1000  Homo_sapiens.GRCh38.94.gtf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment