For the most part the pipeline below follows the GATK's "best pratctice" advice for calling variants from short reads, which provides more documentation for the process.
For the Tetrahymena and Plasmodium projects these steps have been controlled by a set of shell scripts (see /home/david/malaria/scripts
for the latest iteration) that automate various steps along the way. I usually run the scripts with all output redirected to a log file so anythign interesting is recorded:./script.sh &> logs/script.log
##Align reads (Bowtie 2
)
Prior to alignment, you need to create an index of the reference genome:
$ bowtie2-build -f [path to ref] [index_file_stem]
Alignment with bowtie-2 is easy:
bowtie2-align -p [NUMBER OF PROCESSORS] -x [STEM OF INDEX] -1 [PATH TO R1 READS] -2 [PATH TO R2 READS]
In practice, we produce one .bam
file per sample, so want to automate that process. This script presumes you have $sample_tags
${ref_file}
and $sample_dir
defined as variables (possibly imported form an include
script)
for file_stem in $sample_tags; do
file=${sample_dir}/fastq/${file_stem}
echo === Aligning ${file} ===
bowtie2-align -p 32 -x ${ref_file} \
-1 ${file}_R1.fastq \
-2 ${file}_R2.fastq | gzip > bam/${file_stem}.sam.gz
done;
#sort and index. 17179.... = 2Gb of memory allocated for the sort
for file_stem in $pilot_tags; do
( samtools view -uS bam/${file_stem}.sam.gz | samtools sort -m 17179869184\
- bam/${file_stem};
samtools index bam/${file_stem}.bam ) &
done;
Bowtie is meant to work with gzipped fastq files, but it's always failed for me. An alternative to unzipping the files on disk to to created "named pipes" for each one with <(zcat reads.fastq.gz)
. (Note this syntax is a BASH-ism, so won't work with other shells)
##Merge per-sample BAMs
Once all those reads are aligned, we want to create a new .bam
with all the samples. It is important to create new readgroup tags for the merged BAM file which (a) maintain the existing information about platform, lane import size etc and (b) include sample information. With that information in a file called rg.txt you can use samtools
to merge the files:
in_bams=""
for fstem in $sample_tags; do
in_bams="${in_bams} bam/${fstem}.bam"
done;
samtools merge -rh rg.txt data-merged.bam ${in_bams}
samtools index data-merged.bam
###Quality assurance
We know have a file with mapped reads for all the samples, but we need to do some QA steps to clean up the alignments before we analyse them.
Mark PCR/optical duplicates
Picard
provides a tool to identify and mask duplicate reads (most downstream applications should respected the marking). Note, there is a version of Picard
on the $PATH
@ Herschel, but this suite updates frequently, and it might make more sense to keep a local install in your user path or within the working directory of the project (the code below presumes the Picard
tools are located in a bin
directory in the project's working directory):
java -jar bin/MarkDuplicates.jar INPUT=data-merged.bam OUTPUT=data-marked.bam METRICS_FILE=dup_metrics.txt ASSUME_SORTED=true
Local realignment around indels
GATK
procides a tool for local-realignment around what at first appear to be indels (many are not). Like picard
, GATK
is frequently updated and the version on the global $PATH is unlikely to be the latest. Here's a helper script to make calling the GATK
tools a little easier
#!/bin/sh
/usr/local/openjdk7/bin/java -Xmx8g -jar [path to GenomeAnalysisTK.jar] $@
With which you can make a set of putative indels, and re-align around them:
bin/gatk -T RealignerTargetCreator -I data_marked.bam -R ${ref_file}.fasta -nt 32 -o indels.intervals && \
bin/gatk -T IndelRealigner -I data_marked.bam -R ${ref_file}.fasta -targetIntervals indels.intervals -o data_realign.bam