dwinter/worfflow.md

## worfflow.md

      
    Raw
  

              worfflow.md
            
          
    For the most part the pipeline below follows the GATK's "best pratctice" advice for calling variants from short reads, which provides more documentation for the process.
For the Tetrahymena and Plasmodium projects these steps have been controlled by a set of shell scripts (see /home/david/malaria/scripts for the latest iteration) that automate various steps along the way. I usually run the scripts with all output redirected to a log file so anythign interesting is recorded:./script.sh &> logs/script.log
##Align reads (Bowtie 2)
Prior to alignment, you need to create an index of the reference genome:
$ bowtie2-build -f [path to ref] [index_file_stem]

Alignment with bowtie-2 is easy:
bowtie2-align -p [NUMBER OF PROCESSORS] -x [STEM OF INDEX] -1 [PATH TO R1 READS] -2 [PATH TO R2 READS]

In practice, we produce one .bam file per sample, so want to automate that process. This script presumes you have $sample_tags ${ref_file}and $sample_dir defined as variables (possibly imported form an include script)
for file_stem in $sample_tags; do
    file=${sample_dir}/fastq/${file_stem}
    echo === Aligning ${file} ===
    bowtie2-align -p 32 -x ${ref_file} \
                  -1 ${file}_R1.fastq  \
                  -2 ${file}_R2.fastq | gzip > bam/${file_stem}.sam.gz
done;

#sort and index. 17179.... = 2Gb of memory allocated for the sort
for file_stem in $pilot_tags; do
        ( samtools view -uS bam/${file_stem}.sam.gz | samtools sort -m 17179869184\
        - bam/${file_stem};
    samtools index bam/${file_stem}.bam ) &
done;

Bowtie is meant to work with gzipped fastq files, but it's always failed for me. An alternative to unzipping the files on disk to to created "named pipes" for each one with <(zcat reads.fastq.gz). (Note this syntax is a BASH-ism, so won't work with other shells)
##Merge per-sample BAMs
Once all those reads are aligned, we want to create a new .bam with all the samples. It is important to create new readgroup tags for the merged BAM file which (a) maintain the existing information about platform, lane import size etc and (b) include sample information. With that information in a file called rg.txt you can use samtools to merge the files:
in_bams=""
for fstem in $sample_tags; do
    in_bams="${in_bams} bam/${fstem}.bam"
done;

samtools merge -rh rg.txt data-merged.bam ${in_bams}
samtools index data-merged.bam
###Quality assurance
We know have a file with mapped reads for all the samples, but we need to do some QA steps to clean up the alignments before we analyse them.
Mark PCR/optical duplicates
Picard provides a tool to identify and mask duplicate reads (most downstream applications should respected the marking). Note, there is a version of Picard on the $PATH @ Herschel, but this suite updates frequently, and it might make more sense to keep a local install in your user path or within the working directory of the project (the code below presumes the Picard tools are located in a bin directory in the project's working directory):
java -jar bin/MarkDuplicates.jar INPUT=data-merged.bam OUTPUT=data-marked.bam METRICS_FILE=dup_metrics.txt ASSUME_SORTED=true

Local realignment around indels
GATK procides a tool for local-realignment around what at first appear to be indels (many are not). Like picard, GATK is frequently updated and the version on the global $PATH is unlikely to be the latest. Here's a helper script to make calling the GATK tools a little easier
#!/bin/sh

/usr/local/openjdk7/bin/java -Xmx8g -jar [path to GenomeAnalysisTK.jar] $@
With which you can make a set of putative indels, and re-align around them:
bin/gatk -T RealignerTargetCreator -I data_marked.bam -R ${ref_file}.fasta -nt 32 -o indels.intervals && \
bin/gatk -T IndelRealigner -I data_marked.bam -R ${ref_file}.fasta -targetIntervals indels.intervals -o data_realign.bam