Skip to content

Instantly share code, notes, and snippets.

@malcook
Last active August 27, 2019 22:08
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save malcook/61f91b2178b0cac909f594af5628fead to your computer and use it in GitHub Desktop.
Save malcook/61f91b2178b0cac909f594af5628fead to your computer and use it in GitHub Desktop.
In service of analyzing intron retention of introns not otherwise annotated as retained, augment GTF formatted transcriptome with imputed transcripts, one for each such intronic region.
imputeIR<-function(inPath
,outPath
,source='imputeIR'
,txNameFmt='ITR%07d'
,IGVHideGene=FALSE
,index=FALSE
,...
) {
## PURPOSE: Write GTF file to <outPath> as a version of GTF file
## <inPath> containing additional transcripts, one for every region
## not otherwise appearing as transcribed in any other annotated
## transcript (also known (by me) as "locally constitutively
## intronic" in that they are intronic in every transcript which
## spans the region). Annotate their 'source' attribute as
## <source>, and name them sequentially using sprintf format string
## <txNameFmt>. <IGVHideGene>, if TRUE, provides to write a header
## line to outPath that Invisibly return the dataframe
## exported. <...> is passed to export
##
## Tested with Ensemble formatted GTF for fly (dm6 Ensemble 84).
##
## AUTHOR: malcolm.cook@{stowers.org,gmail.com}
##
## REQUIRES:
## library(GenomicRanges)
## library(rtracklayer)
if (index && IGVHideGene) {
message(
'WARNING: in imputeIR: IGVHideGene=TRUE is clobbered when index=TRUE (apparently due to tabix processing).')
}
x<-import.gff(inPath,format='gtf')
xl<-split(x,x$gene_id)
xg<-x[x$type=='gene']
xe<-x[x$type=='exon']
xt<-x[x$type=='transcript']
xgl<-split(xg,xg$gene_id)
xel<-split(xe,xe$gene_id)
xtl<-split(xt,xt$gene_id)
xtl1<-
## find the first transcript for each gene.
xtl[as.list(rep(1,length(xtl)))]
xelr<-
## NB: A namespace issue I don't fully understand requires
## prefixing `reduce` with `GenomicRanges`.
GenomicRanges::reduce(xel)
grlGaps<-function(grl) {
## for discussion, c.f: [gaps does not work on a
## GRangesList](https://github.com/Bioconductor/GenomicRanges/issues/17)
psetdiff(unlist(range(grl),use.names=FALSE),grl)
}
xil<-
## The intron fragments which are nowhere exonic, grouped by gene.
grlGaps(xelr)
## sanity check:
## stopifnot(all(names(xel)==names(xgl)))
xilt<-xil+1 # Impute transcripts from the gaps by extending them to
# 'overlap' by 1-bp with up/downstream exonic fragents.
## Contrive to give the imputed transcripts metadata (mcols)
## initially borrowed from the corresponding transcript granges
## list, and further editted to provide transcript-level
## identifiers:
xiltu<-unlist(xilt,use.names=FALSE)
mcols(xiltu)<-mcols(unlist(rep(xtl1[names(xilt)],lengths(xilt))))
mcols(xiltu)<-mcols(xiltu)[,grep('^gene',names(mcols(xiltu))),drop=F]
mcols(xiltu)<-within(mcols(xiltu),{
type='transcript'
source=source
transcript_id=sprintf(txNameFmt,seq_along(xiltu))
transcript_name=sprintf(txNameFmt,seq_along(xiltu)) ## TODO: name it based on gene_name?
transcript_source=source
})
xilt<-relist(xiltu,xilt)
## Create a single imputed (retained) exon for each (imputed)
## transcript.
xile<-copy(xilt) # TODO: copy overkill?
xileu<-unlist(xile,use.names=FALSE)
mcols(xileu)<-within(mcols(xileu),{
type='exon'
exon_number=1 # each imputed transcript has only a single exon.
exon_id=sprintf('%s-E1',transcript_id)
})
xile<-relist(xileu,xile)
res<-sort(unlist(pc(xl,xilt,xile),use.names=FALSE)) # sort is gratuitous, especially if indexed by tabix?
if(IGVHideGene) {
con<-file(outPath,open='w')
export.gff(GRanges(),con,format='gtf',append=FALSE,index=FALSE) ## write just a gtf header comments.
writeLines('#hide=gene',con)
close(con)
}
export.gff(res,outPath,format='gtf',append=IGVHideGene,index=index)#,...)
invisible(res)
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment