@opplatek
Last active March 3, 2024 20:28
#!/bin/bash
#
# Speed up deepTools computeMatrix by splitting the references into smaller chunks and then merging the matrices together
#
positions=5000
threads=12
rnd=$RANDOM
# Split the reference into chunks of $positions lines each
split -l "$positions" ref.bed "ref.chunks${rnd}"
for chunk in ref.chunks"${rnd}"*; do
    # Overwrite the name column (4) of the BED file with the chunk name to avoid deepTools naming problems that might occur if the reference position names are not unique
    name=$(basename "$chunk")
    name=${name##*.}
    awk -v name="$name" 'BEGIN {FS = "\t"; OFS = "\t"} {print $1,$2,$3,name,$5,$6}' "$chunk" > "tmp.$rnd" && mv "tmp.$rnd" "$chunk"
done
# calculate matrix for each chunk
for chunk in ref.chunks"${rnd}"*; do
    computeMatrix reference-point \
        --referencePoint TSS \
        -R "$chunk" \
        -S input.bw \
        -b 500 -a 500 \
        --skipZeros \
        --missingDataAsZero \
        --binSize 10 \
        --averageTypeBins median \
        --numberOfProcessors "$threads" \
        --outFileName "${chunk}.gz"
done
# merge the chunks back to one file
computeMatrixOperations rbind -m ref.chunks${rnd}*.gz -o ref.matrix.gz && rm ref.chunks${rnd}*.gz
# make heatmaps
plotHeatmap \
    -m ref.matrix.gz \
    --sortUsing mean \
    --averageTypeSummaryPlot mean \
    --missingDataColor "#440154" \
    --colorMap viridis \
    --zMax 100 \
    --linesAtTickMarks \
    --refPointLabel "TSS" \
    --heatmapHeight 20 \
    --heatmapWidth 10 \
    --dpi 300 \
    --outFileName ref.png
rm ref.chunks${rnd}*
@oligomyeggo

Hi @opplatek! This is a very handy trick, thank you for sharing! I was curious whether you had done any tests to compare splitting the reference file like this with not splitting it? I tried both ways with the same input files and ended up with different matrices (same number of lines, but differing contents). Any ideas why this might be happening?

@opplatek
Author

Hey @oligomyeggo, it has been a while since I last used deepTools. When I was working on this, I think I only compared the number of lines and the final plots, not the actual content of the matrices. The final plots looked the same.
I don't have any data (or time 😭) to check it now. But just trying to think of something: is it possible that the matrices (lines) are just ordered differently?

@oligomyeggo

Hi @opplatek, thanks for getting back to me! I did try sorting the matrices to see if it was an ordering issue, but they still ended up different. I have not had time to dig into this more thoroughly either, and I am hoping to revisit it this week. I didn't plot the matrices to see if they look the same, so I will try that as well. Thank you!
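
For anyone repeating this comparison, here is a minimal sketch of one way to do it. It assumes the usual computeMatrix output layout: a gzipped, tab-separated file whose first line is a JSON header starting with `@`, followed by one row per region with the region name in column 4. Because the script above overwrites column 4 with the chunk name, the sketch drops that column before sorting. The function names `normalize_matrix` and `matrices_match` are made up for illustration and are not part of deepTools.

```shell
#!/bin/bash
# Hypothetical helper: print the rows of a computeMatrix file in a canonical form.
# Assumption: line 1 is a JSON header starting with '@' and column 4 is the region name.
normalize_matrix() {
    # Drop the header line, drop the name column, then sort byte-wise
    gzip -dc "$1" | grep -v '^@' | cut -f1-3,5- | LC_ALL=C sort
}

# Hypothetical helper: succeed (exit 0) if both matrices contain the same rows,
# ignoring row order, region names, and the header line.
matrices_match() {
    a=$(mktemp)
    b=$(mktemp)
    normalize_matrix "$1" > "$a"
    normalize_matrix "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Usage (file names are placeholders):
# matrices_match unsplit.matrix.gz ref.matrix.gz && echo "same rows" || echo "rows differ"
```

If the rows still differ after this normalization, replacing `cmp -s` with `diff` on the two temporary files would show which regions or bins disagree.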

@mbassalbioinformatics

Hi @opplatek, thanks for this useful post. When running it, though, I keep getting a list of errors such as

Skipping chunks9399am_r1505, due to being absent in the computeMatrix output.

Any ideas what this is referring to? The plotted output shows nothing, so I'm guessing it is dropping all the data for some reason. Thanks!

@opplatek
Author

opplatek commented Mar 3, 2024

Hi @mbassalbioinformatics. Not sure why you're getting the error message. This post seems to explain some of the reasons why this is happening.
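
One thing that might be worth trying, purely as an untested assumption: the script gives every region in a chunk the same name, and the `_r1505` suffix in the message looks like something added afterwards to tell those duplicates apart. Giving each region its own name up front would rule duplicate names out as a factor. The sketch below is a variant of the awk renaming step; the function name `uniquify_bed_names` and the `prefix_linenumber` naming scheme are hypothetical, and whether this makes the message go away has not been checked.

```shell
#!/bin/bash
# Hypothetical helper: rewrite column 4 of a BED file so that every region
# gets a unique name of the form <prefix>_<line number>.
# Usage: uniquify_bed_names <bed file> <prefix>
uniquify_bed_names() {
    awk -v prefix="$2" 'BEGIN {FS = OFS = "\t"} {$4 = prefix "_" NR; print}' "$1"
}

# Possible use inside the renaming loop of the script (untested assumption):
# uniquify_bed_names "$chunk" "$name" > "tmp.$rnd" && mv "tmp.$rnd" "$chunk"
```

Unlike the original awk line, this keeps any columns beyond the sixth instead of dropping them.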
