Skip to content

Instantly share code, notes, and snippets.

@brentp
Created March 25, 2011 15:51
Show Gist options
  • Save brentp/887065 to your computer and use it in GitHub Desktop.
Save brentp/887065 to your computer and use it in GitHub Desktop.
example of pybedtools use i'd like to make easier.
"""
%prog [options] region_bed hu_gtf cpg_bed = args
"""
import optparse
import sys
import os
os.environ["PATH"] += ":/home/brentp/src/bedtools/bin/"
from pybedtools import BedTool
import collections
def intersect_all(region_bed, hu_gtf, cpg_bed):
r = BedTool(region_bed)
h = BedTool(hu_gtf)
c = BedTool(cpg_bed)
rcount = r.field_count()
# get the all the columns from the first
# plus the gene_name attribute
# and the last (distance) column.
rh = r.closest(h, d=True).cut(range(rcount) + ['gene_name', -1])
# it can overlap multiple exons, just find the nearest.
names = collections.defaultdict(list)
for feature in rh:
gtf_name = feature.other[1]
dist = int(feature.other[2])
names[gtf_name].append((dist, feature))
collapsed_fh = open('collapse.bed', 'w')
for name, hit_list in names.iteritems():
# get closest one.
dist, feat = sorted(hit_list)[0]
print >>collapsed_fh, feat
collapsed_fh.close()
rh = BedTool(collapsed_fh.name)
rcount = rh.field_count()
# and name and distance.
rhc = rh.closest(c, d=True).cut(range(rcount) + [rcount + 3, -1])
rhc.saveas('done.bed')
def main():
p = optparse.OptionParser(__doc__)
opts, args = p.parse_args()
if (len(args) == 0):
sys.exit(not p.print_help())
region_bed, hu_gtf, cpg_bed = args
intersect_all(region_bed, hu_gtf, cpg_bed)
if __name__ == "__main__":
main()
@daler
Copy link

daler commented Mar 25, 2011

see if I have this right . . . after refactoring, that first loop would be something like this (since intersect would know to add the extra dist field):

for bed, gtf in rh.features():
    dist = int(bed.dist)
    names[gtf.name].append( (dist, bed, gtf))

then you'd probably have to manually set:

rhc._feature_objects = [BedFeature, BedFeature]

then do something like:

for bed1,bed2 in rhc.features():
    str(bed2) + '\t'.join( [bed2.name, bed2.dist] )

(not sure if I have the specifics right, but hopefully the general idea?)

@brentp
Copy link
Author

brentp commented Mar 25, 2011

yeah, that looks good.
hmm, it could also be incorporated into closest (and others):

rh = r.closest(h, d=True, keep_new=('name', 'dist'))

then you get the original bed + only 2 new columns. then have:

class Bed2(BedFeature):
    __slots__ = ('chr', 'name', 'start', 'stop', 'name2', 'dist')

and that class could be automatically generated inside closest. but that might be getting too complicated.

@daler
Copy link

daler commented Mar 25, 2011

whoa . . . pretty cool, and you're right it might get complicated, but there would be nothing forcing users to use it. One nice thing is that such a created-on-the-fly class is auto-documented in a sense, because you can always just check __slots__ to see what it has for attributes. I added an issue for this to return to later ( https://github.com/daler/pybedtools/issues#issue/5 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment