Calling SNPs with GATK in non-model taxa

These notes build from several excellent sources and assume you're working with GATK 2.2-16. They also assume a reasonably capable machine (the one used here has 12 physical cores, 48 GB RAM, and NAS storage).

The notes differ somewhat from the "Simple Fool's Guide" in that they are updated for GATK 2.2-16 and use Picard in place of bwa for several operations, namely adding @RG header information to samples. In my opinion, Picard simply does this more reliably.

All errors are my own.


First steps

Be sure to clean adapter contamination and trim low-quality bases from your reads using appropriate tools. After that, assemble some/all of your reads into reference contigs using a tool like Velvet or ABySS. In Velvet, something like the following will work (remove the trailing backslashes if you put everything on one line, or Velvet won't be happy):

VelvetOptimiser -s 87 -e 95 -t 1 -c ncon -a -f '-fastq.gz -shortPaired \
critter1-READ1and2-interleaved.fastq.gz \
critter2-READ1and2-interleaved.fastq.gz \
critter3-READ1and2-interleaved.fastq.gz \
-short critter1-READ-singleton.fastq.gz \
critter2-READ-singleton.fastq.gz'

If the assembly is "optimal", VelvetOptimiser will choose a kmer value in the middle of the range given. If the chosen kmer sits at either end of the range, it's best to rerun with a shifted or wider range to see if you can truly "optimize".

The assembled contigs will be in a directory named "auto_data_X", where "X" is the "optimal" kmer value. Generally, I symlink these contigs to another location. The run also outputs an AMOS-formatted read-tracking file that you can use to view the assembly.

Do this once

  • index reference fasta (assembled using velvet, bwa, etc.):

    bwa index -a is <reference>
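  • GATK also needs a samtools .fai index and a Picard sequence dictionary next to the reference FASTA; bwa's index alone is not enough. A minimal sketch, assuming the reference is named reference.fasta and the Picard jars live in ~/bin as elsewhere in these notes:

```shell
# GATK looks for reference.dict and reference.fasta.fai alongside the
# reference FASTA. Filenames here are assumptions -- substitute your own.
REF=reference.fasta
DICT="${REF%.fasta}.dict"      # e.g. reference.fasta -> reference.dict

samtools faidx "$REF"          # writes reference.fasta.fai
java -Xmx2g -jar ~/bin/CreateSequenceDictionary.jar R="$REF" O="$DICT"
```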

Do this for each set of reads in population/group

  • align PE reads to reference (skipping SE reads since they're small potatoes):

    bwa aln -t 12 <reference> <fastq-R1.gz> > <name>-r1.sai
    bwa aln -t 12 <reference> <fastq-R2.gz> > <name>-r2.sai
    bwa sampe -a 500 -P <reference> <name>-r1.sai <name>-r2.sai <fastq-R1.gz> <fastq-R2.gz> | samtools view -bS - > <name>.bam
  • go ahead and add in @RG header info. This will also sort the BAM:

    java -Xmx20g -jar ~/bin/AddOrReplaceReadGroups.jar \
    I=<name>.bam \
    O=<name>-RG.bam \
    SORT_ORDER=coordinate \
    RGPL=illumina \
    RGPU=D109LACXX \
    RGLB=Lib1 \
    RGID=<name> \
    RGSM=<name>
  • you can view the BAM with something like:

    samtools view <name>-RG.bam | less -S
  • you can view the header info with something like:

    samtools view -H <name>-RG.bam | grep @RG
  • mark and remove duplicates:

    java -Xmx20g -jar ~/bin/MarkDuplicates.jar \
    INPUT=<name>-RG.bam \
    OUTPUT=<name>-DD.bam \
    METRICS_FILE=<name>-metricsfile \
    ASSUME_SORTED=true \
    REMOVE_DUPLICATES=true
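  • with more than a couple of samples, the per-sample steps above are easier to script. A hypothetical wrapper, assuming reads are named like critter1_R1.fastq.gz / critter1_R2.fastq.gz in the current directory and the reference is already bwa-indexed (the naming convention and reference path are assumptions, not from the original notes):

```shell
# Sketch of the align -> add @RG -> dedup loop for every sample in a
# directory; filename pattern and reference name are assumptions.
REF=reference.fasta
for R1 in *_R1.fastq.gz; do
    NAME=${R1%_R1.fastq.gz}            # strip suffix to get the sample name
    R2="${NAME}_R2.fastq.gz"

    # align each read of the pair, then make a BAM
    bwa aln -t 12 "$REF" "$R1" > "${NAME}-r1.sai"
    bwa aln -t 12 "$REF" "$R2" > "${NAME}-r2.sai"
    bwa sampe -a 500 -P "$REF" "${NAME}-r1.sai" "${NAME}-r2.sai" "$R1" "$R2" \
        | samtools view -bS - > "${NAME}.bam"

    # add @RG info and coordinate-sort
    java -Xmx20g -jar ~/bin/AddOrReplaceReadGroups.jar \
        I="${NAME}.bam" O="${NAME}-RG.bam" SORT_ORDER=coordinate \
        RGPL=illumina RGPU=D109LACXX RGLB=Lib1 RGID="$NAME" RGSM="$NAME"

    # mark duplicates
    java -Xmx20g -jar ~/bin/MarkDuplicates.jar \
        INPUT="${NAME}-RG.bam" OUTPUT="${NAME}-DD.bam" \
        METRICS_FILE="${NAME}-metricsfile" ASSUME_SORTED=true
done
```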

Do this once (sort of)

  • merge the individual bam files into a "group" BAM. The information in the @RG headers will keep straight who is who:

    java -jar ~/bin/MergeSamFiles.jar \
    I=<name1>-DD.bam \
    I=<name2>-DD.bam \
    I=<name3>-DD.bam \
    O=merged.bam \
    SO=coordinate \
    AS=true
  • index the merged BAM:

    samtools index merged.bam
  • we need to re-align around indels, so create targets for indel realignment:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R <reference> \
    -I merged.bam \
    --minReadsAtLocus 4 \
    -o merged.intervals
  • realign the BAM based on indel intervals:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R <reference> \
    -I merged.bam \
    -targetIntervals merged.intervals \
    -LOD 3.0 \
    -o merged-realigned.bam
  • call SNPs:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R <reference> \
    -I merged-realigned.bam \
    -gt_mode DISCOVERY \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    -o rawSNPS-Q30.vcf
  • annotate SNPs:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R <reference> \
    -I merged-realigned.bam \
    -G StandardAnnotation \
    -V:variant,VCF rawSNPS-Q30.vcf \
    -XA SnpEff \
    -o rawSNPS-Q30-annotated.vcf
  • call indels:

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R <reference> \
    -I merged-realigned.bam \
    -gt_mode DISCOVERY \
    -glm INDEL \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    -o inDels-Q30.vcf
  • filter SNP calls around indels (note, this is pretty conservative):

    java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R <reference> \
    -V rawSNPS-Q30-annotated.vcf \
    --mask inDels-Q30.vcf \
    --maskExtension 5 \
    --maskName InDel \
    --clusterWindowSize 10 \
    --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
    --filterName "BadValidation" \
    --filterExpression "QUAL < 30.0" \
    --filterName "LowQual" \
    --filterExpression "QD < 5.0" \
    --filterName "LowVQCBD" \
    --filterExpression "FS > 60" \
    --filterName "FisherStrand" \
    -o Q30-SNPs.vcf
  • output only the passing SNPs into a file:

    grep 'PASS\|^#' Q30-SNPs.vcf > only-PASS-Q30-SNPs.vcf
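  • a couple of quick sanity checks on the result (filenames from the steps above): count the passing SNPs, and tally how many sites each filter caught:

```shell
# Number of passing SNPs = non-header lines in the PASS-only VCF
grep -vc '^#' only-PASS-Q30-SNPs.vcf

# Sites per FILTER label (column 7) in the full filtered VCF
grep -v '^#' Q30-SNPs.vcf | cut -f 7 | sort | uniq -c
```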
  • if you have the correct data (parent-offspring trios), check for mendelian inheritance:

    java -Xmx2g -jar ~/bin/GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R <reference> \
    --variant only-PASS-Q30-SNPs.vcf \
    -mv \
    -ped family.ped \
    -mvq 30 \
    -o mendelian-violations.vcf
  • a ped file looks like this (family, individual, father, mother, sex, phenotype):

    F1      wbb-149143361   wbb-149143140   wbb-149143346   2       -9
    F1      wbb-149143140   -9      -9      1       -9
    F1      wbb-149143346   -9      -9      2       -9
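  • malformed ped files fail in confusing ways, so it can be worth a quick structural check that every row has the six expected columns (a one-liner sketch; family.ped as above):

```shell
# Print the line number of any row that does not have exactly six
# whitespace-separated PED columns; exit non-zero if any are found.
awk 'NF != 6 { print "bad line " NR; bad=1 } END { exit bad }' family.ped
```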
  • you can filter on SNPs that don't follow Mendelian inheritance by running VariantFiltration again and adding them as a masking option.

Rinse, Repeat

According to GATK best practices, you probably want to take this set of high-quality SNPs and start the process over, this time being sure to run the base quality score recalibrator. See the GATK documentation on base quality score recalibration for more info.
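For non-model taxa without a published known-sites resource, one common bootstrap is to feed the high-confidence SNPs from round one to the recalibrator as the "known sites". A sketch using the GATK 2.x walkers, with placeholder names matching the steps above (the output filenames recal.grp and merged-recal.bam are my own choices):

```shell
# Build the recalibration table from the first-round PASS SNPs ...
java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R <reference> \
    -I merged-realigned.bam \
    -knownSites only-PASS-Q30-SNPs.vcf \
    -o recal.grp

# ... then write a recalibrated BAM and rerun the calling steps on it.
java -Xmx20g -jar ~/bin/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R <reference> \
    -I merged-realigned.bam \
    -BQSR recal.grp \
    -o merged-recal.bam
```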


willgilks commented Sep 24, 2015

I know these guidelines are a bit old now, but I was wondering:

  1. Why does the filter for BadValidation include the expression (MQ0 / (1.0 * DP)), as surely (1.0*DP) = DP?
  2. Does filtering out the passed variants using bash-grep instead of GATK cause downstream problems because no vcf index file is produced?