This is an example from pichuan@ to run `vg giraffe`, so we can go from FASTQs to a BAM.
To test this, I got a machine using the command listed here: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform. I added more disk because 300G is not enough for the example below; for now I used `--boot-disk-size "1000"`. (Note: the disk issue might be because I didn't run docker with `--rm`.)
```bash
DATA_DIR=${PWD}/data
mkdir -p ${DATA_DIR}
gcloud storage cp gs://brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/35x/HG003.novaseq.pcr-free.35x.R?.fastq.gz ${DATA_DIR}/
gcloud storage cp gs://hprc-pangenomes/hprc-v1.0-mc-chm13-minaf.0.1.gbz ${DATA_DIR}/
```
I used `aria2c` to download these files. You can use other approaches as well:

```bash
aria2c -c -x10 -s10 -d "${DATA_DIR}" https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered/hprc-v1.0-mc-chm13-minaf.0.1.min
aria2c -c -x10 -s10 -d "${DATA_DIR}" https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered/hprc-v1.0-mc-chm13-minaf.0.1.dist
wget -P ${DATA_DIR} https://storage.googleapis.com/hprc-pangenomes/GRCh38.path_list.txt
```
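Since the mapping run below takes several hours, it can save time to verify up front that every input from the downloads above exists and is non-empty. This is my own small helper, not part of the pipeline:

```shell
# Report any missing or empty input files; returns non-zero if any are bad.
check_inputs() {
  local dir=$1; shift
  local missing=0 f
  for f in "$@"; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "MISSING: ${dir}/${f}"
      missing=1
    fi
  done
  return $missing
}

check_inputs "${DATA_DIR:-data}" \
  HG003.novaseq.pcr-free.35x.R1.fastq.gz \
  HG003.novaseq.pcr-free.35x.R2.fastq.gz \
  hprc-v1.0-mc-chm13-minaf.0.1.gbz \
  hprc-v1.0-mc-chm13-minaf.0.1.min \
  hprc-v1.0-mc-chm13-minaf.0.1.dist \
  GRCh38.path_list.txt || echo "Fix the missing files before running vg giraffe."
```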
```bash
SAMP=HG003
READS1=${DATA_DIR}/HG003.novaseq.pcr-free.35x.R1.fastq.gz
READS2=${DATA_DIR}/HG003.novaseq.pcr-free.35x.R2.fastq.gz

docker pull quay.io/vgteam/vg:ci-684-bc9aa5dfc4b0d14519ea47333075906a4ec74656

time docker run -v ${DATA_DIR}:${DATA_DIR} \
  quay.io/vgteam/vg:ci-684-bc9aa5dfc4b0d14519ea47333075906a4ec74656 \
  vg giraffe --progress \
  --read-group "ID:1 LB:lib1 SM:${SAMP} PL:illumina PU:unit1" \
  --sample "${SAMP}" \
  -o BAM --ref-paths ${DATA_DIR}/GRCh38.path_list.txt \
  -P -L 3000 \
  -f $READS1 -f $READS2 \
  -Z ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.gbz \
  -d ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.dist \
  -m ${DATA_DIR}/hprc-v1.0-mc-chm13-minaf.0.1.min \
  -t $(nproc) > reads.unsorted.bam
```
NOTE: There is no need to sort this output yet, because the next step sorts it anyway.
```
Mapping speed: 721.705 reads per second per thread
Used 1.15553e+06 CPU-seconds (including output).
Achieved 725.539 reads per CPU-second (including output)
Memory footprint: 60.2669 GB

real    307m59.952s
user    0m42.707s
sys     3m41.616s
```
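As a quick cross-check of the log above (my own arithmetic, not vg output), multiplying the reported CPU-seconds by reads per CPU-second gives the approximate total number of reads mapped, on the order of 8 × 10^8, which is plausible for a 35x WGS run with ~150 bp reads:

```shell
awk 'BEGIN { printf "approx. reads mapped: %.3g\n", 1.15553e6 * 725.539 }'
# approx. reads mapped: 8.38e+08
```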
File size:

```
$ ls -lh reads.unsorted.bam
-rw-r--r-- 1 root root 68G Mar 28 21:36 reads.unsorted.bam
```
```bash
INBAM=reads.unsorted.bam
BAM=reads.sorted.chrfixed.bam

# Download the reference fasta, index, and dict (although for this step you
# only need the .dict).
wget https://storage.googleapis.com/hprc-pangenomes/hg38.fa
wget https://storage.googleapis.com/hprc-pangenomes/hg38.fa.fai
wget https://storage.googleapis.com/hprc-pangenomes/hg38.dict

# Prepare the new header.
samtools view -H $INBAM | grep ^@HD > new_header.sam
grep ^@SQ hg38.dict | awk '{print $1 "\t" $2 "\t" $3}' >> new_header.sam
samtools view -H $INBAM | grep -v ^@HD | grep -v ^@SQ >> new_header.sam

# Write the new BAM, removing the "GRCh38." prefixes, then sort it.
# (The dot is escaped so sed only matches a literal period.)
time cat new_header.sam <(samtools view $INBAM) \
  | sed -e 's/GRCh38\.//g' \
  | samtools sort --threads 8 -m 2G -O BAM > ${BAM}

# Index the BAM.
samtools index -@10 ${BAM}
```
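To make the contig rename above concrete, here is a toy demonstration (not part of the pipeline) on a single fabricated SAM line: vg writes path names such as `GRCh38.chr20`, while the GRCh38 reference DeepVariant uses expects plain `chr20`.

```shell
# The sed expression strips the "GRCh38." prefix from the RNAME field
# (and anywhere else it appears, e.g. the RNEXT field).
printf 'read1\t0\tGRCh38.chr20\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n' \
  | sed -e 's/GRCh38\.//g'
```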
This took:

```
real    77m1.554s
user    156m4.586s
sys     25m6.986s
```
File size:

```
$ ls -lh reads.sorted.chrfixed.bam
-rw-r--r-- 1 root root 40G Mar 28 22:53 reads.sorted.chrfixed.bam
```
Get the same reference we used for https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md:

```bash
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```
```bash
BIN_VERSION="1.5.0"

time docker run \
  -v "${DATA_DIR}":"${DATA_DIR}" \
  -v "${PWD}:${PWD}" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --reads=${PWD}/${BAM} \
  --output_vcf=${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.vcf.gz \
  --output_gvcf=${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.g.vcf.gz \
  --make_examples_extra_args="min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true" \
  --num_shards=$(nproc)
```
Stage | Time (minutes) |
---|---|
make_examples | 106m25.298s |
call_variants | 154m28.314s |
postprocess_variants (with gVCF) | 39m33.041s |
```bash
mkdir -p benchmark

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
```
```bash
mkdir -p happy

docker pull jmcdani20/hap.py:v0.3.12

docker run \
  -v "${DATA_DIR}":"${DATA_DIR}" \
  -v "${PWD}:${PWD}" \
  jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  ${PWD}/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  ${PWD}/min_mapping_quality-keep_legacy_allele_counter_behavior-normalize_reads-vg.vcf.gz \
  -f ${PWD}/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  -r ${DATA_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -o ${PWD}/happy/happy.output \
  --engine=vcfeval \
  --pass-only
```
hap.py output for `min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true`:
Type | TRUTH.TP | TRUTH.FN | QUERY.FP | METRIC.Recall | METRIC.Precision | METRIC.F1_Score |
---|---|---|---|---|---|---|
INDEL | 502166 | 2335 | 1335 | 0.995372 | 0.997456 | 0.996413 |
SNP | 3314019 | 13477 | 4127 | 0.99595 | 0.998757 | 0.997351 |
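As a sanity check (my own arithmetic, not hap.py output), the recall and F1 values in the table can be reproduced from the reported counts and precision, using recall = TP/(TP+FN) and F1 = 2PR/(P+R). (QUERY.FP alone is not enough to recompute precision, since hap.py counts it against query-side TPs.)

```shell
awk 'BEGIN {
  # INDEL row: TRUTH.TP, TRUTH.FN, METRIC.Precision from the table above.
  tp = 502166; fn = 2335; p = 0.997456
  r = tp / (tp + fn)
  printf "INDEL recall=%.6f F1=%.6f\n", r, 2 * p * r / (p + r)
  # SNP row.
  tp = 3314019; fn = 13477; p = 0.998757
  r = tp / (tp + fn)
  printf "SNP   recall=%.6f F1=%.6f\n", r, 2 * p * r / (p + r)
}'
# INDEL recall=0.995372 F1=0.996413
# SNP   recall=0.995950 F1=0.997351
```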
This can be compared with https://github.com/google/deepvariant/blob/r1.5/docs/metrics.md#accuracy, which shows that `vg giraffe` improves F1:
- Indel F1: 0.996103 --> 0.996413
- SNP F1: 0.996247 --> 0.997351