Skip to content

Instantly share code, notes, and snippets.

@pichuan
Last active September 19, 2022 22:22
Show Gist options
  • Save pichuan/64d73bc965300645470eb29a66116593 to your computer and use it in GitHub Desktop.
Save pichuan/64d73bc965300645470eb29a66116593 to your computer and use it in GitHub Desktop.
DeepVariant WGS case study - using population data

https://gist.github.com/pichuan/64d73bc965300645470eb29a66116593


Here is an example of how to apply the model described in Improving variant calling using population data and deep learning . The example is similar to DeepVariant WGS case study.

I got a machine using a command like this: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform

Then, follow the instructions in https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-case-study.md to download the data.

And, because we are going to test the allele frequency feature, we need to download the file used in --population_vcfs flag, and also the trained model files:

The full pVCFs can be found in:

$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.vcf.gz
...
TOTAL: 24 objects, 1002670628843 bytes (933.81 GiB)

We created a set of smaller set of files that contain just the necessary allele frequency information here:

$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz
...
TOTAL: 24 objects, 2353577310 bytes (2.19 GiB)

In this tutorial, we'll just run on chr20. So we can download just chr20:

AF_DIR=https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref

curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi

Model files:

MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS

curl ${MODEL_DIR}/model.ckpt-2520200.data-00000-of-00001 > input/model.ckpt-2520200.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt-2520200.index > input/model.ckpt-2520200.index
curl ${MODEL_DIR}/model.ckpt-2520200.meta > input/model.ckpt-2520200.meta

And then in this step: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-case-study.md#running-on-a-cpu-only-machine, use this command instead:

mkdir -p output
mkdir -p output/intermediate_results_dir

BIN_VERSION="1.1.0"

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type WGS \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards $(nproc) \
  --regions chr20 \
  --intermediate_results_dir /output/intermediate_results_dir \
  --make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz" \
  --customized_model="/input/model.ckpt-2520200"

Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-case-study.md to finish hap.py evaluation:

Output:

Benchmarking Summary:
  Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
 INDEL    ALL        10634     10581        53        21070        26      10007     19       0.995016          0.997650        0.474941         0.996331                     NaN                     NaN                   1.749861                   2.304642
 INDEL   PASS        10634     10581        53        21070        26      10007     19       0.995016          0.997650        0.474941         0.996331                     NaN                     NaN                   1.749861                   2.304642
   SNP    ALL        70209     69956       253        86127        73      16068     15       0.996396          0.998958        0.186562         0.997676                2.297347                2.068096                   1.884533                   1.955580
   SNP   PASS        70209     69956       253        86127        73      16068     15       0.996396          0.998958        0.186562         0.997676                2.297347                2.068096                   1.884533                   1.955580

An example to run on the whole genome

To run on the whole genome (instead of just chr20), the steps will be similar to the ones above. A few changes:

Download the whole genome:

mkdir -p input
gsutil -m cp gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam{,.bai} input/

Download all of the population VCF files:

gsutil -m cp gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz{,.tbi} input/

When run DeepVariant, use this command instead:

mkdir -p output
mkdir -p output/intermediate_results_dir

BIN_VERSION="1.1.0"

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type WGS \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards $(nproc) \
  --intermediate_results_dir /output/intermediate_results_dir \
  --make_examples_extra_args='use_allele_frequency=true,population_vcfs="/input/cohort-chr1.release_missing2ref.no_calls.vcf.gz /input/cohort-chr10.release_missing2ref.no_calls.vcf.gz /input/cohort-chr11.release_missing2ref.no_calls.vcf.gz /input/cohort-chr12.release_missing2ref.no_calls.vcf.gz /input/cohort-chr13.release_missing2ref.no_calls.vcf.gz /input/cohort-chr14.release_missing2ref.no_calls.vcf.gz /input/cohort-chr15.release_missing2ref.no_calls.vcf.gz /input/cohort-chr16.release_missing2ref.no_calls.vcf.gz /input/cohort-chr17.release_missing2ref.no_calls.vcf.gz /input/cohort-chr18.release_missing2ref.no_calls.vcf.gz /input/cohort-chr19.release_missing2ref.no_calls.vcf.gz /input/cohort-chr2.release_missing2ref.no_calls.vcf.gz /input/cohort-chr20.release_missing2ref.no_calls.vcf.gz /input/cohort-chr21.release_missing2ref.no_calls.vcf.gz /input/cohort-chr22.release_missing2ref.no_calls.vcf.gz /input/cohort-chr3.release_missing2ref.no_calls.vcf.gz /input/cohort-chr4.release_missing2ref.no_calls.vcf.gz /input/cohort-chr5.release_missing2ref.no_calls.vcf.gz /input/cohort-chr6.release_missing2ref.no_calls.vcf.gz /input/cohort-chr7.release_missing2ref.no_calls.vcf.gz /input/cohort-chr8.release_missing2ref.no_calls.vcf.gz /input/cohort-chr9.release_missing2ref.no_calls.vcf.gz /input/cohort-chrX.release_missing2ref.no_calls.vcf.gz /input/cohort-chrY.release_missing2ref.no_calls.vcf.gz"' \
  --customized_model="/input/model.ckpt-2520200"

Update: If you're using our v1.4.0 Docker codebase

If you're using BIN_VERSION="1.4.0", and stil using the model ckpt above (which was trained without the new "insert_size" channel), please add an extra channel='' to your make_examples_extra_args. So it looks like:

--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz,channels=''"

The semantics here is: "Do not add any extra channel." This will overwrite this flag: https://github.com/google/deepvariant/blob/r1.4/scripts/run_deepvariant.py#L238 which was set to "insert_size" for WGS default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment