This is an example from pichuan@ to run `vg giraffe`, so we can go from FASTQs to a BAM.
To test this, I got a machine using the command listed here: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform. I added more disk because 300G is not enough for the example below; for now I used `--boot-disk-size "1000"`. (Note: the disk issue might be because I didn't run docker with `--rm`.)
```bash
DATA_DIR=${PWD}/data
mkdir -p ${DATA_DIR}
gcloud storage cp gs://brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/35x/HG003.novaseq.pcr-free.35x.R?.fastq.gz ${DATA_DIR}/
gcloud storage cp gs://hprc-pangenomes/hprc-v1.0-mc-chm13-minaf.0.1.gbz ${DATA_DIR}/
```
I used `aria2c` to download these files. You can use other approaches as well:

```bash
aria2c -c -x10 -s10 -d "${DATA_DIR}" https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered/hprc-v1.0-mc-chm13-minaf.0.1.min
aria2c -c -x10 -s10 -d "${DATA_DIR}" https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered/hprc-v1.0-mc-chm13-minaf.0.1.dist
wget -P ${DATA_DIR} https://storage.googleapis.com/hprc-pangenomes/GRCh38.path_list.txt
```
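Since the mapping run below takes several hours, it can save time to verify up front that every input from the downloads above exists and is non-empty. This is my own small helper, not part of the pipeline:

```shell
# Report any missing or empty input files; returns non-zero if any are bad.
check_inputs() {
  local dir=$1; shift
  local missing=0 f
  for f in "$@"; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "MISSING: ${dir}/${f}"
      missing=1
    fi
  done
  return $missing
}

check_inputs "${DATA_DIR:-data}" \
  HG003.novaseq.pcr-free.35x.R1.fastq.gz \
  HG003.novaseq.pcr-free.35x.R2.fastq.gz \
  hprc-v1.0-mc-chm13-minaf.0.1.gbz \
  hprc-v1.0-mc-chm13-minaf.0.1.min \
  hprc-v1.0-mc-chm13-minaf.0.1.dist \
  GRCh38.path_list.txt || echo "Fix the missing files before running vg giraffe."
```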
```bash
SAMP=HG003
READS1=${DATA_DIR}/HG003.novaseq.pcr-free.35x.R1.fastq.gz
READS2=${DATA_DIR}/HG003.novaseq.pcr-free.35x.R2.fastq.gz

docker pull quay.io/vgteam/vg:ci-684-bc9aa5dfc4b0d14519ea47333075906a4ec74656

time docker run -v ${DATA_DIR}:${DATA_DIR} \
  quay.io/vgteam/vg:ci-684-bc9aa5dfc4b0d14519ea47333075906a4ec74656 \
  vg giraffe --progress \
  --read-group "ID:1 LB:lib1 SM:${SAMP} PL:illumina PU:unit1" \
  --sample "${SAMP}" \
  -o BAM --ref-paths ${DATA_DIR}/GRCh38.path_list.txt \
  -P -L 3000 \
  -f $READS1 -f $READS2 \
  -Z ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.gbz \
  -d ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.dist \
  -m ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.min \
  -t $(nproc) > reads.unsorted.bam
```
NOTE: There is no need to sort this output yet, because the next step sorts it anyway.
```
Mapping speed: 721.705 reads per second per thread
Used 1.15553e+06 CPU-seconds (including output).
Achieved 725.539 reads per CPU-second (including output)
Memory footprint: 60.2669 GB

real    307m59.952s
user    0m42.707s
sys     3m41.616s
```
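As a quick cross-check of the log above (my own arithmetic, not vg output), multiplying the reported CPU-seconds by reads per CPU-second gives the approximate total number of reads mapped, on the order of 8 × 10^8, which is plausible for a 35x WGS run with ~150 bp reads:

```shell
awk 'BEGIN { printf "approx. reads mapped: %.3g\n", 1.15553e6 * 725.539 }'
# approx. reads mapped: 8.38e+08
```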
File size:

```
$ ls -lh reads.unsorted.bam
-rw-r--r-- 1 root root 68G Mar 28 21:36 reads.unsorted.bam
```
```bash
INBAM=reads.unsorted.bam
BAM=reads.sorted.chrfixed.bam

# Download the reference fasta, index, and dict (although for this step you
# only need the .dict).
wget https://storage.googleapis.com/hprc-pangenomes/hg38.fa
wget https://storage.googleapis.com/hprc-pangenomes/hg38.fa.fai
wget https://storage.googleapis.com/hprc-pangenomes/hg38.dict

# Prepare the new header.
samtools view -H $INBAM | grep ^@HD > new_header.sam
grep ^@SQ hg38.dict | awk '{print $1 "\t" $2 "\t" $3}' >> new_header.sam
samtools view -H $INBAM | grep -v ^@HD | grep -v ^@SQ >> new_header.sam

# Write the new BAM, removing the "GRCh38." prefixes, then sort it.
# (The dot is escaped so sed only matches a literal period.)
time cat new_header.sam <(samtools view $INBAM) \
  | sed -e 's/GRCh38\.//g' \
  | samtools sort --threads 8 -m 2G -O BAM > ${BAM}

# Index the BAM.
samtools index -@10 ${BAM}
```
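To make the contig rename above concrete, here is a toy demonstration (not part of the pipeline) on a single fabricated SAM line: vg writes path names such as `GRCh38.chr20`, while the GRCh38 reference DeepVariant uses expects plain `chr20`.

```shell
# The sed expression strips the "GRCh38." prefix from the RNAME field
# (and anywhere else it appears, e.g. the RNEXT field).
printf 'read1\t0\tGRCh38.chr20\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n' \
  | sed -e 's/GRCh38\.//g'
```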
This took:

```
real    77m1.554s
user    156m4.586s
sys     25m6.986s
```
File size:

```
$ ls -lh reads.sorted.chrfixed.bam
-rw-r--r-- 1 root root 40G Mar 28 22:53 reads.sorted.chrfixed.bam
```
Get the same reference we used for https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md:

```bash
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```
```bash
BIN_VERSION="1.5.0"

time docker run \
  -v "${DATA_DIR}":"${DATA_DIR}" \
  -v "${PWD}:${PWD}" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --reads=${PWD}/${BAM} \
  --output_vcf=${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.vcf.gz \
  --output_gvcf=${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.g.vcf.gz \
  --make_examples_extra_args="min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true" \
  --num_shards=$(nproc)
```
Stage | Time (minutes) |
---|---|
make_examples | 106m25.298s |
call_variants | 154m28.314s |
postprocess_variants (with gVCF) | 39m33.041s |
```bash
mkdir -p benchmark

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
```
```bash
mkdir -p happy

docker pull jmcdani20/hap.py:v0.3.12

docker run \
  -v "${DATA_DIR}":"${DATA_DIR}" \
  -v "${PWD}:${PWD}" \
  jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  ${PWD}/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  ${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.vcf.gz \
  -f ${PWD}/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  -r ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -o ${PWD}/happy/happy.output \
  --engine=vcfeval \
  --pass-only
```
hap.py output for `min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true`:
Type | TRUTH.TP | TRUTH.FN | QUERY.FP | METRIC.Recall | METRIC.Precision | METRIC.F1_Score |
---|---|---|---|---|---|---|
INDEL | 502166 | 2335 | 1335 | 0.995372 | 0.997456 | 0.996413 |
SNP | 3314019 | 13477 | 4127 | 0.99595 | 0.998757 | 0.997351 |
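As a sanity check (my own arithmetic, not hap.py output), the recall and F1 values in the table can be reproduced from the reported counts and precision, using recall = TP/(TP+FN) and F1 = 2PR/(P+R). (QUERY.FP alone is not enough to recompute precision, since hap.py counts it against query-side TPs.)

```shell
awk 'BEGIN {
  # INDEL row: TRUTH.TP, TRUTH.FN, METRIC.Precision from the table above.
  tp = 502166; fn = 2335; p = 0.997456
  r = tp / (tp + fn)
  printf "INDEL recall=%.6f F1=%.6f\n", r, 2 * p * r / (p + r)
  # SNP row.
  tp = 3314019; fn = 13477; p = 0.998757
  r = tp / (tp + fn)
  printf "SNP   recall=%.6f F1=%.6f\n", r, 2 * p * r / (p + r)
}'
# INDEL recall=0.995372 F1=0.996413
# SNP   recall=0.995950 F1=0.997351
```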
This can be compared with https://github.com/google/deepvariant/blob/r1.5/docs/metrics.md#accuracy, which shows that `vg giraffe` improves F1:
- Indel F1: 0.996103 --> 0.996413
- SNP F1: 0.996247 --> 0.997351