daler/README.md

## README.md

      
Setup

Using bioconda, you can install everything you need for this example with:
conda install --channel bioconda pybedtools bedtools htslib matplotlib

Run

Run get-data.sh to download data from ENCODE.
bash get-data.sh
Input files have the following format (UCSC broadPeak and narrowPeak formats, which are variants of BED format):
chr1    569797  570055  .       1000    .       38.118451       16.0    -1
chr1    724125  2647713 .       258     .       1.259053        11.2    -1
chr1    752542  752779  .       658     .       10.178273       1.9     -1
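
To make the column layout concrete, here is a minimal sketch of a parser for the nine columns that broadPeak and narrowPeak share (narrowPeak adds a tenth column, the peak summit offset, which this sketch ignores). The `Peak` type and `parse_peak_line` helper are illustrative names, not part of pybedtools, and the example does not touch the downloaded files.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """The nine columns common to UCSC broadPeak and narrowPeak (hypothetical helper type)."""
    chrom: str
    start: int           # 0-based start
    end: int             # end, exclusive
    name: str            # "." when unnamed
    score: int           # 0-1000
    strand: str          # "+", "-", or "."
    signal_value: float  # enrichment measure
    p_value: float       # -log10 p-value, -1 if not assigned
    q_value: float       # -log10 q-value, -1 if not assigned


def parse_peak_line(line: str) -> Peak:
    """Split one whitespace-delimited peak line and convert the numeric fields."""
    f = line.split()
    return Peak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                float(f[6]), float(f[7]), float(f[8]))


# First example line from above
peak = parse_peak_line(
    "chr1    569797  570055  .       1000    .       38.118451       16.0    -1"
)
print(peak.chrom, peak.end - peak.start, peak.signal_value)
# prints: chr1 258 38.118451
```

In the actual workflow pybedtools handles this parsing itself; the sketch only shows what each column means.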

Then run binary_heatmaps.py to generate the plot, a summary file, and a
directory of interval files, one per class.
python binary_heatmaps.py
Output

binary_heatmap.png

Rows are genomic intervals (as output by bedtools multiinter); columns are
input BED files; a black cell indicates that the factor was found in that
genomic interval.

class_counts.txt

Summary of how many genomic intervals fall into each combinatorial class:
             LSD1: 16181
        LSD1,TAL1: 15120
             TAL1: 7989
  GATA1,LSD1,TAL1: 3009
       GATA1,LSD1: 654
            GATA1: 231
       GATA1,TAL1: 214
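
The idea of a combinatorial class can be sketched as follows: each interval has one 0/1 flag per input file, and its class is the comma-joined list of factors whose flag is set. This is only an illustration with made-up rows, and it is not necessarily how pybedtools computes `class_counts` internally; `class_label` is a hypothetical helper.

```python
from collections import Counter

names = ["GATA1", "LSD1", "TAL1"]

# Hypothetical data: one row per genomic interval, one 0/1 flag per input file,
# in the same order as `names`.
rows = [
    (0, 1, 0),
    (0, 1, 1),
    (1, 1, 1),
    (0, 1, 1),
    (0, 0, 1),
]


def class_label(row, names):
    """Join the names of all factors present in this interval."""
    return ",".join(name for name, flag in zip(names, row) if flag)


# Count how many intervals fall into each class, most frequent first.
counts = Counter(class_label(row, names) for row in rows)
for cls, cnt in counts.most_common():
    print("{0:>17}: {1}".format(cls, cnt))
```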

intervals/*.bed

For each of the above classes, a BED file containing the intervals in that class. For example, intervals/LSD1_and_TAL1.bed begins:
track name="LSD1_and_TAL1"
chr1    778211  778487
chr1    854053  854329
chr1    948500  948776
...
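
Because each file starts with a track line, a downstream reader has to skip it before parsing coordinates. A minimal sketch using only the standard library, with the example lines above held in memory rather than read from disk; `read_bed3` is a hypothetical helper, not a pybedtools function.

```python
import io


def read_bed3(handle):
    """Yield (chrom, start, end) tuples, skipping track, browser, and comment lines."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


# The first lines of the LSD1_and_TAL1 example shown above
example = io.StringIO(
    'track name="LSD1_and_TAL1"\n'
    "chr1\t778211\t778487\n"
    "chr1\t854053\t854329\n"
    "chr1\t948500\t948776\n"
)

intervals = list(read_bed3(example))
total_bp = sum(end - start for _, start, end in intervals)
print(len(intervals), total_bp)
# prints: 3 828
```

To read a real output file, pass an open file handle instead of the `io.StringIO` object.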


## binary_heatmap.png

(Image: the heatmap written by binary_heatmaps.py; see the description under Output in README.md.)

## binary_heatmaps.py
import os

from matplotlib import pyplot as plt
import pybedtools
from pybedtools.contrib import plotting


# set up the order in which to plot the columns of the binary heatmap
names, bts = zip(*[
    ('GATA1', 'wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz'),
    ('LSD1', 'wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz'),
    ('TAL1', 'wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz'),
])
bts = [pybedtools.BedTool(i).sort() for i in bts]

# set up the object by giving it a list of pybedtools.BedTool objects and a
# list of names to use.
b = plotting.BinaryHeatmap(
    bts=bts,
    names=names)

# plot it
b.plot()

# write out how many genomic locations of each class were identified
with open('class_counts.txt', 'w') as fout:
    for cls, cnt in sorted(b.class_counts.items(), key=lambda x: x[1], reverse=True):
        fout.write('{0:>25}: {1:<15}\n'.format(cls, cnt))

# write out the actual intervals from each class, creating the output
# directory if it does not already exist
out_dir = 'intervals'
os.makedirs(out_dir, exist_ok=True)

for k, v in b.classified_intervals.items():
    label = k.replace(',', '_and_')
    v.cut([0, 1, 2]).saveas(os.path.join(out_dir, label + '.bed'), trackline='track name="%s"' % label)

# save the figure
fig = plt.gcf()
fig.tight_layout()
fig.savefig('binary_heatmap.png')
plt.show()

## get-data.sh
#!/bin/bash
wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz

wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz

wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz