Skip to content

Instantly share code, notes, and snippets.

@arq5x
Last active November 17, 2022 23:03
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save arq5x/5f649e226421e515007cd74787dbb960 to your computer and use it in GitHub Desktop.
Save arq5x/5f649e226421e515007cd74787dbb960 to your computer and use it in GitHub Desktop.
Analysis of rare disease VCF with bcftools

Analysis of rare disease VCF with bcftools

Big Picture - You are the genome detective.

Here's the deal.

  • We have sequenced the exomes of an affected child, and their unaffected parents. The child has a rare skin condition.
  • We aligned their exome data to the human reference genome.
  • We called variants using GATK.
  • The resulting VCF file is called trio.trim.vep.vcf.gz.
  • Make a new directory called raredisease, cd into it, and download the VCF into the directory with:

wget https://home.chpc.utah.edu/~u1007787/trio.trim.vep.vcf.gz

  • Sample 1805 (genotype offset 0 in the file) is the father
  • Sample 1847 (genotype offset 1 in the file) is the mother
  • Sample 4805 (genotype offset 2 in the file) is the affected child
  • Given the phenotypes of the members of the family, we hypothesize that the child's phenotype is caused by a de novo mutation (DNM).

Your challenge

  • Your challenge is to use the concepts we discussed in this lecture, combined with bcftools, to nominate a causal variant for this child's phenotype.

  • Hints:

    • bcftools cheat sheet - useful!
    • You need to use the bcftools view command with the -i option to filter variants
    • You can filter the variants down to a reasonable set by:
        1. requiring each member of the family to have a specific genotype (e.g., GT[0]="0/1" && GT[1]="0/1" requires mom and dad to be homozygous for the reference allele)
        1. requiring each member of the family to have a minimum depth of 10 aligned sequence reads (e.g., FORMAT/DP[1]>=10 && FORMAT/DP[2]>=10 requires mom and kid to have at least 10 aligned reads)
        1. requiring the variant to be a missense change
        1. You might care to exclude variants that were labelled as potentially "dodgy" by the variant called (i.e., "PASS")
        1. Your VCF coordinates are with respect to GRCh37 (build 37 of the human reference genome)
  • Applying these hints using bcftools should filter the variants down to less than fifteen candidate variants.

  • Using bcftools, your powerful brain, and possible external resources (e.g., gnomad.broadinstitute.org), make your best guess as to the variant and gene that you think causes the kid's phenotype.

bcftools examples to get you rolling

  1. Report the header and every variant in the file:

    bcftools view trio.trim.vep.vcf.gz

  2. Report the header and every variant where mom and dad are homozygous for the reference allele:

    bcftools view trio.trim.vep.vcf.gz -i 'GT[0]="0/0" && GT[1]="0/0"'

Enter Your Prediction

https://docs.google.com/forms/d/e/1FAIpQLScqKUbaOCabhtV_uE9uHCqb6hOaVg39CXL-kOm8ysN6n1VBeQ/viewform?usp=sf_link

@arq5x
Copy link
Author

arq5x commented Nov 17, 2022

bcftools view trio.trim.vep.vcf.gz -i 'GT[0]="0/0" && GT[1]="0/0" && GT[2]="0/1" && FORMAT/DP[0]>=10 && FORMAT/DP[1]>=10 && FORMAT/DP[2]>=10' | grep missense | grep PASS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment