Skip to content

Instantly share code, notes, and snippets.

@claczny
Created September 5, 2016 08:22
Show Gist options
  • Save claczny/adefc4d6be84c4d42a6498d1b69ac31a to your computer and use it in GitHub Desktop.
Save claczny/adefc4d6be84c4d42a6498d1b69ac31a to your computer and use it in GitHub Desktop.
A Makefile to compute the average genome coverage, coverage distribution, and other things from an input BAM-file. N.B. Secondary expansion (.SECONDEXPANSION) is used to create and populate a dedicated directory per sample.
SHELL=/bin/bash
SAMPLE?=<YOUR_SAMPLE>
DOUBLED_SAMPLE = $(SAMPLE)/$(SAMPLE)
RDIR?=results
DDIR?=data
#####
# BEAUTY TARGETS
#####
.PHONY: all extract_fastq bam2sam sort_bam index_bam genomecov
all: extract_fastq genomecov
extract_fastq: $(RDIR)/$(DOUBLED_SAMPLE).fq
bam2sam: $(RDIR)/$(DOUBLED_SAMPLE).sam
sort_bam: $(RDIR)/$(DOUBLED_SAMPLE).srtd.bam
index_bam: $(RDIR)/$(DOUBLED_SAMPLE).srtd.bai
genomecov: $(RDIR)/$(DOUBLED_SAMPLE).srtd.cov_avg.txt
# SOME BASIC STATISTICS
get_unique_seq_count: $(RDIR)/$(DOUBLED_SAMPLE).sam
awk '{print $$1}' $^ | sort |uniq -c | wc -l
get_mapq_distribution: $(RDIR)/ $(DOUBLED_SAMPLE).sam
awk -F"\t" '{print $$5}' $^ | sort |uniq -c
get_cigar_distribution: $(RDIR)/$(DOUBLED_SAMPLE).sam
awk -F"\t" '{print $$6}' $^ | sort |uniq -c
#####
# ACTUAL TARGETS
#####
.SECONDARY:
.SECONDEXPANSION:
$(RDIR)/%.fq: $(DDIR)/$$(notdir $$*).bam
mkdir -p $(dir $@)
@date
time bedtools bamtofastq -i $^ -fq $@
@date
$(RDIR)/%.sam: $(DDIR)/$$(notdir $$*).bam
@date
time samtools view $^ > $@
@date
$(RDIR)/%.srtd.bam: $(DDIR)/$$(notdir $$*).bam
@date
time samtools sort $^ $(@:.bam=)
@date
%.bai: %.bam
@date
time samtools index $^ $@
@date
%.srtd.cov_hist.txt: %.srtd.bam %.srtd.bai
@date
bedtools genomecov -ibam $(word 1,$^) > $@
@date
%.cov_avg.txt: %.cov_hist.txt
@date
awk -F"\t" 'BEGIN {pc=""} \
{\
c=$$1;\
if (c == pc) {\
cov=cov+$$2*$$5;\
} else {\
print pc,cov;\
cov=$$2*$$5;\
pc=c}\
} END {print pc,cov}' $^ | tail -n +2 > $@
@date
#####
# CLEAN-UP
#####
clean:
echo "TODO: clean"
@claczny
Copy link
Author

claczny commented Sep 5, 2016

DESCRIPTION

This Makefile can be used to compute the average coverage, coverage histogram, and other results given a BAM-file.
Secondary expansion is used to create a separate directory which is named after the sample.
This should make the organization easier as all the results will be grouped by sample.
There may be other ways to achieve a similar effect using make, however, this is the only one that I stumbled upon so far.

SETUP

SAMPLE?=<YOUR_SAMPLE>
DOUBLED_SAMPLE = $(SAMPLE)/$(SAMPLE)

RDIR?=results
DDIR?=data

You must provide the name of your sample in the SAMPLE variable. Moreover, please set your results (output) and data (input) directories accordingly.
While they can be specified within the Makefile, they may be specified in the make call, e.g., make SAMPLE=FOO RDIR=/path/to/my/results DDIR=/path/to/my/bam/files, thus, with the input BAM-file being /path/to/my/bam/files/FOO.bam.
DOUBLED_SAMPLE is used to create a dedicated results directory for each input BAM-file, e.g. extract_fastq: $(RDIR)/$(DOUBLED_SAMPLE).fq aims at extracting the mapped sequences for the sample of interest into the respective results-directory.

SECONDARY EXPANSION

.SECONDEXPANSION:
$(RDIR)/%.fq: $(DDIR)/$$(notdir $$*).bam
        mkdir -p $(dir $@)
        @date
        time bedtools bamtofastq -i $^ -fq $@
@date

.SECONDEXPANSION is used to name the dependencies according to the target.
The aim is to make the whole approach more generic and make sure that all the newly "made" files are put into their respective, sample-specific folder.
$(RDIR)/%.fq is the typical way of creating a general make-target (s.a. pattern rules).

$(DDIR)/$$(notdir $$*).bam is where the secondary expansion is used. The $$ (double dollar sign) highlights this.
More precisely, on the first expansion, the $$* will evaluate to $*. In this way, upon the secondary expansion, the $* can be evaluated to the value of the place holder %, e.g., FOO, which is known/set after the first expansion.
In a way, it is similar to creating a dedicated Makefile for FOO.bam with the rule $(RDIR)/FOO/FOO.fq: $(DDIR)/FOO.bam`, but much more generic :)

CARRYING ON

Once the files that are dependent on the input BAM-file are created in the sample-specific results folder, the make-rules can be written as usual, e.g., %.bai: %.bam. This target-line means: For a given BAM-file, create a respective BAI-file. For example, it the target is /path/to/your/results/FOO.srtd.bam, the dependency will be /path/to/your/results/FOO.srtd.bam. It represents a pattern-rule, s. above.

WHAT'S THE POINT OF THIS?

The main point of this is that I often encounter the case where I have multiple individual files and a set of independent computations has to performed on each of them.
This results in multiple files being generated per input file - think, Single Instruction Multi Data (SIMD) or even Multiple Instruction Multiple Data (MIMD).
Using pattern rules, this is extremely convenient to achieve with make, yet the organization of the results suffers as all the results would typically be created in the same results folder, without any subdirectories.
This Makefile should help alleviate this cluttering while preserving all the nice functionalities of make, e.g., running multiple jobs concurrently using the -j option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment