Created
March 25, 2011 15:51
-
-
Save brentp/887065 to your computer and use it in GitHub Desktop.
example of pybedtools use i'd like to make easier.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
%prog [options] region_bed hu_gtf cpg_bed = args | |
""" | |
import optparse | |
import sys | |
import os | |
os.environ["PATH"] += ":/home/brentp/src/bedtools/bin/" | |
from pybedtools import BedTool | |
import collections | |
def intersect_all(region_bed, hu_gtf, cpg_bed): | |
r = BedTool(region_bed) | |
h = BedTool(hu_gtf) | |
c = BedTool(cpg_bed) | |
rcount = r.field_count() | |
# get the all the columns from the first | |
# plus the gene_name attribute | |
# and the last (distance) column. | |
rh = r.closest(h, d=True).cut(range(rcount) + ['gene_name', -1]) | |
# it can overlap multiple exons, just find the nearest. | |
names = collections.defaultdict(list) | |
for feature in rh: | |
gtf_name = feature.other[1] | |
dist = int(feature.other[2]) | |
names[gtf_name].append((dist, feature)) | |
collapsed_fh = open('collapse.bed', 'w') | |
for name, hit_list in names.iteritems(): | |
# get closest one. | |
dist, feat = sorted(hit_list)[0] | |
print >>collapsed_fh, feat | |
collapsed_fh.close() | |
rh = BedTool(collapsed_fh.name) | |
rcount = rh.field_count() | |
# and name and distance. | |
rhc = rh.closest(c, d=True).cut(range(rcount) + [rcount + 3, -1]) | |
rhc.saveas('done.bed') | |
def main(): | |
p = optparse.OptionParser(__doc__) | |
opts, args = p.parse_args() | |
if (len(args) == 0): | |
sys.exit(not p.print_help()) | |
region_bed, hu_gtf, cpg_bed = args | |
intersect_all(region_bed, hu_gtf, cpg_bed) | |
if __name__ == "__main__": | |
main() |
yeah, that looks good.
hmm, it could also be incorporated into closest (and others):
rh = r.closest(h, d=True, keep_new=('name', 'dist'))
then you get the original bed + only 2 new columns. then have:
class Bed2(BedFeature):
__slots__ = ('chr', 'name', 'start', 'stop', 'name2', 'dist')
and that class could be automatically generated inside closest. but that might be getting too complicated.
whoa . . . pretty cool, and you're right it might get complicated, but there would be nothing forcing users to use it. One nice thing is that such a created-on-the-fly class is auto-documented in a sense, because you can always just check __slots__
to see what it has for attributes. I added an issue for this to return to later ( https://github.com/daler/pybedtools/issues#issue/5 )
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
see if I have this right . . . after refactoring, that first loop would be something like this (since intersect would know to add the extra dist field):
then you'd probably have to manually set:
then do something like:
(not sure if I have the specifics right, but hopefully the general idea?)