Skip to content

Instantly share code, notes, and snippets.

@daler
Last active Sep 27, 2016
Embed
What would you like to do?
binary heatmap demo

Setup

Using bioconda, you can install everything you need for this example with

conda install --channel bioconda pybedtools bedtools htslib matplotlib 

Run

Run get-data.sh to download data from ENCODE.

bash get-data.sh

Input files have the following format (UCSC broadPeak and narrowPeak formats, which ar variants of BED format):

chr1    569797  570055  .       1000    .       38.118451       16.0    -1
chr1    724125  2647713 .       258     .       1.259053        11.2    -1
chr1    752542  752779  .       658     .       10.178273       1.9     -1

Then run binary_heatmaps.py to generate the plot, a summary file, and a directory of interval files for each class.

python binary_heatmaps.py

Output

binary_heatmap.png

Rows are genomic intervals (as output by bedtools multiinter); columns are input BED files; black indicates that factor was found in that genomic interval.

heatmap

class_counts.txt

Summary of how many genomic intervals for each combinatorial class:

             LSD1: 16181
        LSD1,TAL1: 15120
             TAL1: 7989
  GATA1,LSD1,TAL1: 3009
       GATA1,LSD1: 654
            GATA1: 231
       GATA1,TAL1: 214

intervals/*.bed

For each of the above classes, a BED file of the indicated intervals. For example,

track name="LSD1_and_TAL1"
chr1    778211  778487
chr1    854053  854329
chr1    948500  948776
...
import os
from matplotlib import pyplot as plt
from pybedtools.contrib import plotting
import pybedtools
import numpy as np
# set up the order in which to plot the columns of the binary heatmap
names, bts = zip(*[
('GATA1', 'wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz'),
('LSD1', 'wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz'),
('TAL1', 'wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz'),
])
bts = [pybedtools.BedTool(i).sort() for i in bts]
# set up the object by giving it a list of pybedtool.BedTool objects and a list
# of names to use.
b = plotting.BinaryHeatmap(
bts=bts,
names=names)
# plot it
b.plot()
# write out how many genomic location of each class were identified
with open('class_counts.txt', 'w') as fout:
for cls, cnt in sorted(b.class_counts.items(), key=lambda x: x[1], reverse=True):
fout.write('{0:>25}: {1:<15}\n'.format(cls, cnt))
# write out the actual intervals from each class
out_dir = 'intervals'
if not os.path.exists(out_dir):
os.makedirs(out_dir)
for k, v in b.classified_intervals.items():
label = k.replace(',', '_and_')
v.cut([0, 1, 2]).saveas(os.path.join(out_dir, label + '.bed'), trackline='track name="%s"' % label)
# save the figure
fig = plt.gcf()
fig.tight_layout()
fig.savefig('binary_heatmap.png')
plt.show()
#!/bin/bash
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment