Skip to content

Instantly share code, notes, and snippets.

@daler
Last active September 27, 2016 17:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save daler/07eb1a95f1e4639f22bd to your computer and use it in GitHub Desktop.
Save daler/07eb1a95f1e4639f22bd to your computer and use it in GitHub Desktop.
binary heatmap demo

Setup

Using bioconda, you can install everything you need for this example with

conda install --channel bioconda pybedtools bedtools htslib matplotlib 

Run

Run get-data.sh to download data from ENCODE.

bash get-data.sh

Input files have the following format (UCSC broadPeak and narrowPeak formats, which ar variants of BED format):

chr1    569797  570055  .       1000    .       38.118451       16.0    -1
chr1    724125  2647713 .       258     .       1.259053        11.2    -1
chr1    752542  752779  .       658     .       10.178273       1.9     -1

Then run binary_heatmaps.py to generate the plot, a summary file, and a directory of interval files for each class.

python binary_heatmaps.py

Output

binary_heatmap.png

Rows are genomic intervals (as output by bedtools multiinter); columns are input BED files; black indicates that factor was found in that genomic interval.

heatmap

class_counts.txt

Summary of how many genomic intervals for each combinatorial class:

             LSD1: 16181
        LSD1,TAL1: 15120
             TAL1: 7989
  GATA1,LSD1,TAL1: 3009
       GATA1,LSD1: 654
            GATA1: 231
       GATA1,TAL1: 214

intervals/*.bed

For each of the above classes, a BED file of the indicated intervals. For example,

track name="LSD1_and_TAL1"
chr1    778211  778487
chr1    854053  854329
chr1    948500  948776
...
import os
from matplotlib import pyplot as plt
from pybedtools.contrib import plotting
import pybedtools
import numpy as np
# set up the order in which to plot the columns of the binary heatmap
names, bts = zip(*[
('GATA1', 'wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz'),
('LSD1', 'wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz'),
('TAL1', 'wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz'),
])
bts = [pybedtools.BedTool(i).sort() for i in bts]
# set up the object by giving it a list of pybedtool.BedTool objects and a list
# of names to use.
b = plotting.BinaryHeatmap(
bts=bts,
names=names)
# plot it
b.plot()
# write out how many genomic location of each class were identified
with open('class_counts.txt', 'w') as fout:
for cls, cnt in sorted(b.class_counts.items(), key=lambda x: x[1], reverse=True):
fout.write('{0:>25}: {1:<15}\n'.format(cls, cnt))
# write out the actual intervals from each class
out_dir = 'intervals'
if not os.path.exists(out_dir):
os.makedirs(out_dir)
for k, v in b.classified_intervals.items():
label = k.replace(',', '_and_')
v.cut([0, 1, 2]).saveas(os.path.join(out_dir, label + '.bed'), trackline='track name="%s"' % label)
# save the figure
fig = plt.gcf()
fig.tight_layout()
fig.savefig('binary_heatmap.png')
plt.show()
#!/bin/bash
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment