
Calling SNPs with GATK in non-model taxa

These notes build from several excellent sources:

and assume you're working with GATK 2.2-16. These notes also assume you're working on a decent machine (the one used here has 12 physical cores, 48 GB RAM, and NAS storage).

The notes differ somewhat from the "Simple Fool's Guide" in that they are updated for GATK 2.2-16 and use Picard in place of bwa for several operations, namely adding the @RG header information to the sample BAMs. IMO, Picard is simply better at doing this correctly.

All errors are my own.

Requirements

First steps

Be sure to clean adapter contamination and remove low-quality bases from your reads using appropriate tools (a hedged trimming sketch appears at the end of this section). After that, you need to assemble some/all of your reads into reference contigs using a tool like Velvet or ABySS. In Velvet, something like the following will work (remove the backslashes or Velvet won't be happy):

VelvetOptimiser -s 87 -e 95 -t 1 -c ncon -a -f '-fastq.gz -shortPaired \
critter1-READ1and2-interleaved.fastq.gz \
critter2-READ1and2-interleaved.fastq.gz \
critter3-READ1and2-interleaved.fastq.gz \
-short critter1-READ-singleton.fastq.gz \
critter2-READ-singleton.fastq.gz \
critter3-READ-singleton.fastq.gz'

If the assembly is "optimal", VelvetOptimiser will choose a kmer value somewhere in the middle of the range given. If the chosen kmer sits at either end of the range, it's best to start again with a shifted range to see if you can find a true optimum.

The assembled contigs will be in a directory named "auto_data_X", where "X" is the "optimal" kmer value. Generally, I symlink these contigs to another location. The above also outputs an AMOS-formatted read-tracking file that you can use to view the assembly.
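
For the read-cleaning step mentioned at the start of this section, one option is Trimmomatic. This is only a sketch following Trimmomatic's standard PE usage; the jar path, adapter file, and thresholds are illustrative, and other trimming tools work just as well:

java -jar ~/bin/trimmomatic.jar PE -threads 12 -phred33 \
critter1-READ1.fastq.gz critter1-READ2.fastq.gz \
critter1-READ1-paired.fastq.gz critter1-READ1-unpaired.fastq.gz \
critter1-READ2-paired.fastq.gz critter1-READ2-unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36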

Do this once

  • index the reference fasta (assembled using Velvet, ABySS, etc.):

    bwa index -a is <reference>
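
  • GATK also expects a samtools index and a Picard sequence dictionary for the reference, so it's worth creating those now (the jar path and output name below follow the conventions used elsewhere in these notes; the .dict file should share the reference's basename):

    samtools faidx <reference>
    java -jar ~/bin/CreateSequenceDictionary.jar \
    REFERENCE=<reference> \
    OUTPUT=<reference-basename>.dict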

Do this for each set of reads in population/group

  • align PE reads to reference (skipping SE reads since they're small potatoes):

    bwa aln -t 12 <reference> <fastq-R1.gz> > <name>-r1.sai
    bwa aln -t 12 <reference> <fastq-R2.gz> > <name>-r2.sai
    bwa sampe -a 500 -P <reference> <name>-r1.sai <name>-r2.sai <fastq-R1.gz> <fastq-R2.gz> | samtools view -bS - > <name>.bam
  • go ahead and add in @RG header info. This will also sort the BAM:

    java -Xmx20g -jar ~/bin/AddOrReplaceReadGroups.jar \
    I=<name>.bam \
    O=<name>-RG.bam \
    SORT_ORDER=coordinate \
    RGPL=illumina \
    RGPU=D109LACXX \
    RGLB=Lib1 \
    RGID=<name> \
    RGSM=<name> \
    VALIDATION_STRINGENCY=LENIENT
  • you can view the BAM with something like:

    samtools view <name>-RG.bam | less -S
  • you can view the header info with something like:

    samtools view -H <name>-RG.bam | grep @RG
  • mark and remove duplicates:

    java -Xmx20g -jar ~/bin/MarkDuplicates.jar \
    INPUT=<name>-RG.bam \
    OUTPUT=<name>-DD.bam \
    METRICS_FILE=<name>-metricsfile \
    MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=250 \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=SILENT \
    REMOVE_DUPLICATES=True

Do this once (sort of)

  • merge the individual bam files into a "group" BAM. The information in the @RG headers will keep straight who is who:

    java -jar ~/bin/MergeSamFiles.jar \
    I=<name1>-DD.bam \
    I=<name2>-DD.bam \
    I=<name3>-DD.bam \
    SO=coordinate \
    AS=true \
    VALIDATION_STRINGENCY=SILENT \
    O=merged.bam
  • index the merged BAM:

    samtools index merged.bam
  • we need to re-align around indels, so create targets for indel realignment:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R <reference> \
    -I merged.bam \
    --minReadsAtLocus 4 \
    -o merged.intervals
  • realign the BAM based on indel intervals:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R <reference> \
    -I merged.bam \
    -targetIntervals merged.intervals \
    -LOD 3.0 \
    -o merged-realigned.bam
  • call SNPs:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R <reference> \
    -I merged-realigned.bam \
    -gt_mode DISCOVERY \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    -o rawSNPS-Q30.vcf
  • annotate SNPs:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R <reference> \
    -I merged-realigned.bam \
    -G StandardAnnotation \
    -V:variant,VCF rawSNPS-Q30.vcf \
    -XA SnpEff \
    -o rawSNPS-Q30-annotated.vcf
  • call indels:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R <reference> \
    -I merged-realigned.bam \
    -gt_mode DISCOVERY \
    -glm INDEL \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    -o inDels-Q30.vcf
  • filter SNP calls around indels (note, this is pretty conservative):

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R <reference> \
    -V rawSNPS-Q30-annotated.vcf \
    --mask inDels-Q30.vcf \
    --maskExtension 5 \
    --maskName InDel \
    --clusterWindowSize 10 \
    --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
    --filterName "BadValidation" \
    --filterExpression "QUAL < 30.0" \
    --filterName "LowQual" \
    --filterExpression "QD < 5.0" \
    --filterName "LowVQCBD" \
    --filterExpression "FS > 60" \
    --filterName "FisherStrand" \
    -o Q30-SNPs.vcf
  • output only the passing SNPs into a file:

    cat Q30-SNPs.vcf | grep 'PASS\|^#' > only-PASS-Q30-SNPs.vcf
  • if you have the correct data (parent-offspring trios), check for mendelian inheritance:

    java -Xmx2g -jar ~/bin/GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R <reference> \
    --variant only-PASS-Q30-SNPs.vcf \
    -mv \
    -ped family.ped \
    -mvq 30 \
    -o mendelian-violations.vcf
  • a ped file looks like this (family ID, individual ID, father, mother, sex, phenotype):

    F1  wbb-149143361   wbb-149143140   wbb-149143346   2   -9
    F1  wbb-149143140   -9  -9  1   -9
    F1  wbb-149143346   -9  -9  2   -9
  • you can filter on SNPs that don't follow Mendelian inheritance by running VariantFiltration again and adding them as a masking option.
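
  • as a sketch of what that second pass might look like (untested here; the mask name and output file name are illustrative):

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R <reference> \
    -V only-PASS-Q30-SNPs.vcf \
    --mask mendelian-violations.vcf \
    --maskName MendelianViolation \
    -o Q30-SNPs-mendel-masked.vcf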

Rinse, Repeat

According to GATK best practices, you probably want to take this set of high-quality SNPs and start the process over, this time being sure to run the base quality score recalibrator. See the following page for more info:

http://www.broadinstitute.org/gatk/guide/article?id=44
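
As a rough sketch (file names are illustrative), the recalibration pass in GATK 2.x looks something like the following, using the passing SNPs from the first round as the known-sites set:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R <reference> \
    -I merged-realigned.bam \
    -knownSites only-PASS-Q30-SNPs.vcf \
    -o recal-data.grp

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R <reference> \
    -I merged-realigned.bam \
    -BQSR recal-data.grp \
    -o merged-realigned-recal.bam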

@willgilks commented:

I know these guidelines are a bit old now, but I was wondering:

  1. Why does the filter for BadValidation include the expression (MQ0 / (1.0 * DP)), when surely (1.0 * DP) = DP?
  2. Does extracting the passing variants with grep instead of GATK cause downstream problems, because no VCF index file is produced?
