Created August 26, 2011 14:59
Get the first intron and the first coding exon of each refGene transcript from the UCSC database.
mysql -A -D "$ORG" -e "SELECT chrom, txStart, txEnd, cdsStart, cdsEnd, name2, name, \
strand, exonStarts, exonEnds from refGene;" \
| awk 'BEGIN {FS=OFS="\t"}
{
    name = $7; strand = $8 # column order follows the SELECT above
    # exonStarts/exonEnds end with a trailing comma, so split() returns one extra, empty element
    n = split($9, cstarts, ",") - 1
    split($10, cends, ",")
    if(n < 2){ next } # single-exon transcript: no intron
    if(strand == "+"){
        # 1 based indexing...
        print $1,cends[1],cstarts[2],name,strand
    }
    else if(strand == "-"){
        # on the minus strand the first intron precedes the last listed exon
        print $1,cends[n-1],cstarts[n],name,strand
    }
}' > first.introns.bed
mysql -A -D "$ORG" -e "SELECT chrom, txStart, txEnd, cdsStart, cdsEnd, name2, name, \
strand, exonStarts, exonEnds from refGene;" \
| awk 'BEGIN {FS=OFS="\t"}
{
    if($4==$5){ next; } # noncoding
    name = $7; strand = $8 # column order follows the SELECT above
    n = split($9, cstarts, ",") - 1 # account for trailing comma
    split($10, cends, ",")
    if(strand == "+"){
        for(i=1; i <= n; i++){
            # first exon whose end lies past cdsStart (coordinates are 0-based, half-open)
            if(cends[i] > $4){
                # account for UTR? this just prints entire exon...
                # could use cdsStart instead of cstarts[i]
                print $1,cstarts[i],cends[i],name,strand
                break
            }
        }
    }
    else if(strand == "-"){
        for(i=n; i > 0; i--){
            # walking backwards, first exon that starts before cdsEnd
            if(cstarts[i] < $5){
                # could use cdsEnd instead of cends[i]
                print $1,cstarts[i],cends[i],name,strand
                break
            }
        }
    }
}' > first.coding.exon.bed
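
To check the first-intron logic without a database connection, here is a minimal sketch that runs the same idea on a made-up row. The helper names `first_intron` and `row`, and the transcript `NM_TEST` with its coordinates, are hypothetical and only for illustration. The input uses the same ten tab-separated columns as the refGene queries. Skipping single-exon transcripts is an assumption about the desired behaviour.

```shell
# Hypothetical helper: reads refGene-style rows on stdin and prints the
# first intron (in transcription order) of each multi-exon transcript.
first_intron() {
    awk 'BEGIN {FS=OFS="\t"}
    {
        n = split($9, starts, ",") - 1   # trailing comma adds one empty element
        split($10, ends, ",")
        if (n < 2) next                  # assumption: single exon means no intron
        if ($8 == "+")      print $1, ends[1], starts[2], $7, $8
        else if ($8 == "-") print $1, ends[n-1], starts[n], $7, $8
    }'
}

# Made-up transcript with three exons: 100-200, 300-400, 500-600.
row() {
    printf 'chr1\t100\t600\t150\t550\tGENE\tNM_TEST\t%s\t100,300,500,\t200,400,600,\n' "$1"
}

row + | first_intron   # chr1  200  300  NM_TEST  +
row - | first_intron   # chr1  400  500  NM_TEST  -
```

On the plus strand the first intron lies between the first two listed exons. On the minus strand it lies between the last two, because refGene lists exons in ascending genomic order regardless of strand.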

Farhat commented Aug 28, 2011

You can make it a tiny bit more efficient by using else instead of a second if in the first query processing.

brentp commented Aug 29, 2011

@Farhat I changed it, but, yeah, I think it will be unnoticeable.
