Skip to content

Instantly share code, notes, and snippets.

@dinovski
Last active October 17, 2018 12:16
Show Gist options
  • Save dinovski/cc6a539db6872de01f68d835bc4245a2 to your computer and use it in GitHub Desktop.
Save dinovski/cc6a539db6872de01f68d835bc4245a2 to your computer and use it in GitHub Desktop.
TSS +/- N bp from GTF > BED6
#!/bin/bash
BEDTOOLS=/usr/bin/bedtools/bin
ODIR=/data/dbase/
GENOME=hg19
SLOP=3000
SLOP_OPTS='-l 3000 -r 3000'
GENE_BED=${IDIR}/${GENOME}_3kb.bed
## NOTE: does not account for multiple transcripts; use uniqueGenes.R to collapse to min start/max end
####
if [ ${GENOME} == "hg19" ];then
GTF=/data/dbase/human/hg19/gencode.v19.annotation.gtf
CHR_LEN=/data/dbase/human/hg19/hg19_chr.len
elif [ ${GENOME} == "mm10" ];then
GTF=/data/dbase/mouse/mm10/Mus_musculus.GRCm38.90.gtf
CHR_LEN=/data/dbase/mouse/mm10/mm10.chrom.sizes
else
die "genome not supported"
fi
## Extract TSS coordinates for fwd strand genes
grep -w transcript ${GTF} | awk '$7=="+"' |\
cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' |\
tr '"' '_' | awk '{OFS="\t"} {print $1,$2,$2,$4,$5,$6}' > ${DIR}/tmp+
## Extract TSS coordinates for reverse strand genes
grep -w transcript ${GTF} | awk '$7!="+"' |\
cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' |\
tr '"' '_' | awk '{OFS="\t"} {print $1,$3,$3,$4,$5,$6}' > ${DIR}/tmp-
## $4=ensembl id, $5=name
cat ${DIR}/tmp+ ${DIR}/tmp- | ${BEDTOOLS} slop -i - -g ${CHR_LEN} ${SLOP_OPTS} | sort -k1,1 -k2n,2 |\
awk '{OFS="\t"} {print $1,$2,$3,$4,$5,$6}' | sed 's/_/\t/' | cut -f 4 --complement > ${DIR}/${BED}
rm ${DIR}/tmp*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment