Skip to content

Instantly share code, notes, and snippets.

@jxtx
Last active December 4, 2017 19:16
Show Gist options
  • Save jxtx/3e02265bba9cb369e65c87ba8fe10594 to your computer and use it in GitHub Desktop.
Save jxtx/3e02265bba9cb369e65c87ba8fe10594 to your computer and use it in GitHub Desktop.
Using HiFive on restriction digest bulk Hi-C

Working through running HiFive on a Hi-C datasets.

First, a note on memory an performance: bin size influences everything. Starting with a bin size of 40kb, loading data in hg38 seems to stay under ~16GB. At fend level resolution memory requirements approach ~32GB and running time increases several fold.

Dealing with restriction fragment details

HiFive stores a fend file with information on the locations of restriction fragments in the genome. We need to get the locations of the RE sites into a BED file to create this. The UCSC findMotifs tool can do this.

We are also going to bin at 40k right from the beginning. This will help performance and we know the data is sparse. If QuASAR suggests we can go lower we can always reprocess.

# Install findMotifs and hifive from bioconda
conda install ucsc-findmotifs
conda install hifive
# Create a BED file with locations of HindIII sites
findMotifs -strand=+ -motif=AAGCTT $(SOMEWHERE)/hg38.fa > hg38.HindIII.bed
# Import into HiFive optimized format
hifive fends -B hg38.HindIII.bed --binned 40000 -R HindIII -G hg38 hg38.HindIII.40k.bed

Alternative: Binning without RE (e.g. DNAse Hi-C)

Here we just bin without regard to where RE site locations are. We need to get the chromosome sizes from UCSC so HiFive knows how long the chromosomes are. Then we just cut into 40kb chunks.

wget https://genome.ucsc.edu/goldenpath/help/hg38.chrom.sizes
hifive fends --binned 40000 -L hg38.chrom.sizes -g hg38 hg38.NA.40k.fends

Then we would use hg38.NA.40k.fends in place of hg38.HindIII.40k.fends below.

Aligning the reads

Just for demonstration let's get some reads for the Dixon et al. 2012 paper from SRA:

fastq-dump --accession SRR442158 --split-files

Now we will align with bwa mem. You will need an index first. I always name mine with the corresponding .fa file as the base name. Importantly, we align the ends independently.

bwa mem -t 16 $(SOMEWHERE)/hg38.fa SRR442158_1.fastq | samtools view -b > SRR442158_1.bam
bwa mem -t 16 $(SOMEWHERE)/hg38.fa SRR442158_2.fastq | samtools view -b > SRR442158_2.bam

Now we can load the aligned ends into a HiFive "Hi-C Data" file. We simply provide the two aligned ends and the fend file, plus a filename to write to (SRR442158.hcd). Then we will create a "Hi-C Project". This step will

hifive hic-data -S SRR442158_1.bam SRR442158_2.bam hg38.HindIII.40k.fends SRR442158.hcd
hifive hic-project -f 5 SRR442158.hcd SRR442158.hcp

with 40kb bins things look good:

Read 66055585 validly-mapped read pairs.
66055585 total validly-mapped read pairs loaded. 38346128 valid fend pairs
Parsing fend pairs... Done  17459539 cis reads, 21188745 trans reads

and then:

Filtering fends... Removed 6933 of 78722 bins
Finding distance curve... Done

Even if you get an error here (can't fit the distance function), we can still go ahead with QuASAR and get a resolution estimate.

hifive quasar -p SRR442158.hcp -r 1000000,200000,40000 -d 0 -o report.txt SRR442158.quasar

After QuASAR runs, we can check its report in report.txt:

Quality Score Results
  
Resolution      Coverage        All     chr1    chr10   chr10_GL383545v1_alt    chr10_GL383546v1_alt    chr10_KI270824v1_alt    chr10_KI270825v1_alt    chr11   chr11_GL383547v1_alt    chr11_JH159136v1_alt    chr11_JH159137v1_alt    chr11_
40 Kb   17,459,539      0.019164        0.019010        0.023605        nan     0.046658        nan     -0.108543       0.019584        nan     0.001550        nan     nan     -0.012705       nan     0.009512        -0.016825       -0.007
200 Kb  17,459,539      0.083343        0.085528        0.102154        nan     nan     nan     nan     0.097350        nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     0.087015
1 Mb    17,459,539      0.120813        0.127911        0.133518        nan     nan     nan     nan     0.142377        nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     nan     0.115370

Strict quality maximum resolution: 105,888 bp
Loose quality maximum resolution: 73,549 bp

So, this data supports ~80-100k bins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment