Skip to content

Instantly share code, notes, and snippets.

@hiraksarkar
Last active December 14, 2018 21:53
Show Gist options
  • Save hiraksarkar/ce8a71a6953cb4e9823d868c283bf99d to your computer and use it in GitHub Desktop.
Save hiraksarkar/ce8a71a6953cb4e9823d868c283bf99d to your computer and use it in GitHub Desktop.
3 line script to extract intron boundaries per transcript
## requirement bed tools
BIN='/home/hirak/bedtools2/bin'
## Gencode
## gencode.v29.chr_patch_hapl_scaff.annotation.gtf
GTF_FILE="gencode.v29.chr_patch_hapl_scaff.annotation.gtf"
# extract transcript boundaries
cat $GTF_FILE | awk 'BEGIN{OFS="\t";} $3=="transcript" {print $1,$4-1,$5,$12}' | tr -d "\"" | tr -d ";" | $BIN/sortBed > gencode_transcript_intervals.bed
# merge exon boundaris
cat $GTF_FILE | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$12}' | tr -d "\"" | tr -d ";" | $BIN/sortBed | $BIN/mergeBed -i - -c 4 -o collapse > gencode_exon_merged.bed
# extract introns per transcript
$BIN/subtractBed -a gencode_transcript_intervals.bed -b gencode_exon_merged.bed -nonamecheck > intron_transcript.bed
# Group by by awk
#awk -F "\t" '{a[$1"\t"$2"\t"$3]=a[$1"\t"$2"\t"$3]?a[$1"\t"$2"\t"$3] OFS $4:$4 } END {for (i in a) print i FS a[i]}' OFS=, intron_transcript.bed | /mnt/scratch1/hirak/bedtools2/bin/sortBed -i -
# Or by bedtools groupby
sort -k1,1 -k2,2 -k3,3 intron_transcript.bed | $BIN/groupBy -i - -g 1,2,3 -c 4 -o collapse > hg_intron_transcript_collapsed.bed
#
# make fasta by using this bed
$BIN/bedtools getfasta -fi hg_genome.fasta -bed intron_transcript.bed -fo hg_intron_transcript.fasta -name+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment