Skip to content

Instantly share code, notes, and snippets.

@pichuan
Last active October 30, 2023 21:03
Show Gist options
  • Save pichuan/eedab4cf2e06fa7ceb2fad0f9b3f8066 to your computer and use it in GitHub Desktop.
Save pichuan/eedab4cf2e06fa7ceb2fad0f9b3f8066 to your computer and use it in GitHub Desktop.
Pi-Chuan's notes on running `vg giraffe`

Pi-Chuan's notes on running vg giraffe


This is an example from pichuan@ to run vg giraffe, so we can go from FASTQs --> BAM.

To test this, I got a machine using the command listed here: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform. I added more disks because 300G is not enough for the example below. I used --boot-disk-size "1000" for now.

(Note: the disk issue might be because I didn't run with --rm in docker.)

Your input FASTQ files

DATA_DIR=${PWD}/data
mkdir -p ${DATA_DIR}
gcloud storage cp gs://brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/35x/HG003.novaseq.pcr-free.35x.R?.fastq.gz ${DATA_DIR}/

Download other files that VG needs

gcloud storage cp gs://hprc-pangenomes/hprc-v1.0-mc-chm13-minaf.0.1.gbz ${DATA_DIR}/

I used aria2c to download these files. You can use other approaches as well:

aria2c -c -x10 -s10 -d "${DATA_DIR}" https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered/hprc-v1.0-mc-chm13-minaf.0.1.min
aria2c -c -x10 -s10 -d "${DATA_DIR}" https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered/hprc-v1.0-mc-chm13-minaf.0.1.dist

Run vg giraffe with one command to get from FASTQs to BAM.

wget -P ${DATA_DIR} https://storage.googleapis.com/hprc-pangenomes/GRCh38.path_list.txt
SAMP=HG003
READS1=${DATA_DIR}/HG003.novaseq.pcr-free.35x.R1.fastq.gz
READS2=${DATA_DIR}/HG003.novaseq.pcr-free.35x.R2.fastq.gz

docker pull quay.io/vgteam/vg:ci-684-bc9aa5dfc4b0d14519ea47333075906a4ec74656

time docker run -v ${DATA_DIR}:${DATA_DIR}  \
  quay.io/vgteam/vg:ci-684-bc9aa5dfc4b0d14519ea47333075906a4ec74656 \
  vg giraffe --progress \
  --read-group "ID:1 LB:lib1 SM:${SAMP} PL:illumina PU:unit1" \
  --sample "${SAMP}" \
  -o BAM --ref-paths ${DATA_DIR}/GRCh38.path_list.txt \
  -P -L 3000 \
  -f $READS1 -f $READS2 \
  -Z ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.gbz \
  -d ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.dist \
  -m ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.min \
  -t $(nproc) > reads.unsorted.bam

NOTE: No need to sort this yet, because we'll need to sort it in the next step.

Runtime

Mapping speed: 721.705 reads per second per thread
Used 1.15553e+06 CPU-seconds (including output).
Achieved 725.539 reads per CPU-second (including output)
Memory footprint: 60.2669 GB

real    307m59.952s
user    0m42.707s
sys     3m41.616s

File size:

$ ls -lh reads.unsorted.bam
-rw-r--r-- 1 root root 68G Mar 28 21:36 reads.unsorted.bam

Fix the BAM's contig name

INBAM=reads.unsorted.bam
BAM=reads.sorted.chrfixed.bam

## Download the reference fasta, index, and dict (although for this you only need the .dict).
wget https://storage.googleapis.com/hprc-pangenomes/hg38.fa
wget https://storage.googleapis.com/hprc-pangenomes/hg38.fa.fai
wget https://storage.googleapis.com/hprc-pangenomes/hg38.dict

# Prepare the new header.
samtools view -H $INBAM | grep ^@HD > new_header.sam
grep ^@SQ hg38.dict | awk '{print $1 "\t" $2 "\t" $3}' >> new_header.sam
samtools view -H $INBAM  | grep -v ^@HD | grep -v ^@SQ >> new_header.sam
## write new BAM, removing the "GRCh38." prefixes
time cat <(cat new_header.sam) <(samtools view $INBAM) | sed -e "s/GRCh38.//g" | samtools sort --threads 8 -m 2G -O BAM > ${BAM}
# Index the BAM.
samtools index -@10 ${BAM}

This took:

real    77m1.554s
user    156m4.586s
sys     25m6.986s

File size:

$ ls -lh reads.sorted.chrfixed.bam 
-rw-r--r-- 1 root root 40G Mar 28 22:53 reads.sorted.chrfixed.bam

Sanity check: Run DeepVariant on the BAM file

Get the same reference we used for https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md

FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids

curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

With min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true

BIN_VERSION="1.5.0"
time docker run \
  -v "${DATA_DIR}":"${DATA_DIR}" \
  -v "${PWD}:${PWD}" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --reads=${PWD}/${BAM} \
  --output_vcf=${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.vcf.gz \
  --output_gvcf=${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.g.vcf.gz \
  --make_examples_extra_args="min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true" \
  --num_shards=$(nproc)
Stage Time (minutes)
make_examples 106m25.298s
call_variants 154m28.314s
postprocess_variants (with gVCF) 39m33.041s

Run hap.py for min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true

mkdir -p benchmark

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38

curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
mkdir -p happy

docker pull jmcdani20/hap.py:v0.3.12

docker run \
  -v "${DATA_DIR}":"${DATA_DIR}" \
  -v "${PWD}:${PWD}" \
  jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  ${PWD}/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  ${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.vcf.gz \
  -f ${PWD}/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  -r ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -o ${PWD}/happy/happy.output \
  --engine=vcfeval \
  --pass-only

hap.py output for min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true

Type TRUTH.TP TRUTH.FN QUERY.FP METRIC.Recall METRIC.Precision METRIC.F1_Score
INDEL 502166 2335 1335 0.995372 0.997456 0.996413
SNP 3314019 13477 4127 0.99595 0.998757 0.997351

This can be compared with https://github.com/google/deepvariant/blob/r1.5/docs/metrics.md#accuracy.

Which shows that vg giraffe improves F1:

  • Indel F1: 0.996103 --> 0.996413
  • SNP F1: 0.996247 --> 0.997351
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment