@Yiran-Guo
Last active February 17, 2022 19:06

D3b Bioinformatics Scientist Exercises

There are three sections: the first covers bioinformatics pipeline development and automation, the second covers database queries and result visualization, and the third covers tertiary analysis involving some basic human genetics.

Pipeline Section

Write a workflow that performs the following steps:

  • Input FastQ/BAM file
  • Use BWA MEM to re-align to hg38
  • Output CRAM file

When creating this workflow, please consider:

  • Efficiency in both space and time
  • Edge cases and error handling
  • Tests

We prefer Common Workflow Language (CWL); however, you may also use Nextflow, Snakemake, WDL, shell, or any other scripting approach. If you do, please explain why.
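As a rough illustration of the three steps above, the following Python sketch only assembles the shell pipeline that a workflow step (in CWL or any other engine) might run. It is one possible shape under stated assumptions, not a reference solution: the function name `build_realign_command`, the default sample name, and the read-group fields are hypothetical, and the reference is assumed to be already indexed with `bwa index`. Streaming from `bwa mem` straight into `samtools sort` avoids writing an intermediate SAM file, which addresses the space consideration.

```python
import shlex

FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

def build_realign_command(input_paths, reference, output_cram, threads=4, sample="sample1"):
    """Return a shell pipeline that re-aligns FASTQ or BAM input and writes CRAM.

    Assumptions (hypothetical, for illustration): one BAM, or one or two FASTQ
    files; `reference` is an hg38 FASTA already indexed for BWA.
    """
    if not input_paths:
        raise ValueError("at least one input file is required")
    q = shlex.quote
    # bwa expects a literal backslash-t between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    first = input_paths[0].lower()
    if first.endswith(".bam"):
        if len(input_paths) != 1:
            raise ValueError("expected exactly one BAM file")
        # Group mates together, convert to interleaved FASTQ, and stream it
        # into bwa, which reads interleaved pairs from stdin with -p.
        source = f"samtools collate -u -O {q(input_paths[0])} | samtools fastq - | "
        bwa_inputs = f"-p {q(reference)} -"
    elif first.endswith(FASTQ_SUFFIXES):
        if len(input_paths) > 2:
            raise ValueError("expected one (single-end) or two (paired-end) FASTQ files")
        source = ""
        bwa_inputs = " ".join(q(p) for p in [reference, *input_paths])
    else:
        raise ValueError(f"unsupported input type: {input_paths[0]}")
    return (
        f"{source}bwa mem -t {threads} -R {q(read_group)} {bwa_inputs} | "
        f"samtools sort -@ {threads} -O cram --reference {q(reference)} "
        f"-o {q(output_cram)} -"
    )

if __name__ == "__main__":
    print(build_realign_command(["reads_1.fq.gz", "reads_2.fq.gz"], "hg38.fa", "out.cram"))
```

Because the function only builds a string, its edge cases (empty input, unsupported extensions, too many files) can be unit-tested without BWA or samtools installed. To execute the result, something like `subprocess.run(["bash", "-o", "pipefail", "-c", cmd], check=True)` would make a failure in any stage of the pipe surface as an error.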

Data Section

This section consists of short exercises analyzing a health-related relational database in SQLite, downloadable from: https://www.dropbox.com/s/mgu1s93kpjsoyhh/openmrs.db?dl=0

This database is a processed version of the public data set for a specific instance of a query tool. The original OpenMRS data model can be found here: https://wiki.openmrs.org/display/docs/Data+Model. Solutions should use the data in the provided SQLite database. You may convert it to a database flavor of your choice; if you do, explain your choice.

Key tables include:

  • patient
  • encounter
  • encounter_diagnosis
  • diagnosis

Please provide both the resulting data and any code that was run to obtain it.

Data Exercise 1

Provide a list of male patients in the database and the counts of patients by gender.
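A minimal sketch of how this exercise might be approached with Python's built-in `sqlite3` module. The column names `patient_id` and `gender`, and the coding of male as `'M'`, are assumptions about the provided schema and should be checked first (for example with `PRAGMA table_info(patient)`).

```python
import sqlite3

def male_patients_and_gender_counts(conn):
    """Return (list of male patient ids, {gender: count}).

    Assumes a `patient` table with `patient_id` and `gender` columns,
    where male patients are coded as 'M' (not confirmed by the source).
    """
    males = conn.execute(
        "SELECT patient_id FROM patient WHERE gender = 'M' ORDER BY patient_id"
    ).fetchall()
    counts = conn.execute(
        "SELECT gender, COUNT(*) FROM patient GROUP BY gender ORDER BY gender"
    ).fetchall()
    return [row[0] for row in males], dict(counts)

if __name__ == "__main__":
    # Tiny in-memory stand-in for openmrs.db, for illustration only.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, gender TEXT)")
    demo.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "M"), (2, "F"), (3, "M")])
    print(male_patients_and_gender_counts(demo))
```

Against the real file, the connection would be opened with `sqlite3.connect("openmrs.db")` instead of the in-memory stand-in.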

Data Exercise 2

Create a summary data visualization for each of the following patient attributes: diagnosis, gender, and age.
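One way to prepare such a summary is to aggregate the counts first and plot them afterwards. The sketch below uses only the standard library and renders plain-text bars; the counts could equally be passed to a plotting library. It assumes the rows have already been fetched as `(patient_id, diagnosis, gender, age)` tuples, for example by joining the four key tables; that row shape and the helper names are hypothetical. Gender and age are counted once per patient, while diagnoses are counted once per row.

```python
from collections import Counter

def age_group(age, width=10):
    """Map an age in years to a label such as '30-39'; None becomes 'unknown'."""
    if age is None:
        return "unknown"
    low = (int(age) // width) * width
    return f"{low}-{low + width - 1}"

def summarize(rows):
    """Count diagnoses per row, and gender and age group once per patient."""
    summary = {"diagnosis": Counter(), "gender": Counter(), "age_group": Counter()}
    seen = set()
    for patient_id, diagnosis, gender, age in rows:
        summary["diagnosis"][diagnosis] += 1
        if patient_id not in seen:
            seen.add(patient_id)
            summary["gender"][gender] += 1
            summary["age_group"][age_group(age)] += 1
    return summary

def text_bar_chart(counts, width=40):
    """Render counts as text bars, largest first, scaled to `width` characters."""
    if not counts:
        return ""
    peak = max(counts.values())
    label_width = max(len(str(label)) for label in counts)
    lines = []
    for label, n in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        bar = "#" * max(1, round(width * n / peak))
        lines.append(f"{str(label):<{label_width}} {bar} {n}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [(1, "Malaria", "M", 34), (1, "Asthma", "M", 34), (2, "Malaria", "F", 8)]
    for name, counts in summarize(rows).items():
        print(name)
        print(text_bar_chart(counts, width=20))
```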

Tertiary Analysis Section

This section uses the Ashkenazi Jewish family trio from Genome in a Bottle (GIAB; https://www.nist.gov/programs-projects/genome-bottle); the relevant files can be downloaded from there. You are encouraged to automate as much as possible with a workflow.

  1. Download the latest variant call files (VCF) for hg38

  2. Combine the three VCFs into one joint (multi-sample) family VCF. Which tool will you use?

  3. Annotate this multi-sample VCF using ANNOVAR (https://annovar.openbioinformatics.org/en/latest/) with these databases: refGene, gnomad211_exome, clinvar_20210501

  4. Filter the annotated VCF, retaining variants that satisfy both of the following:

    a. have MAF < 0.0001 in, or are absent from, gnomAD exome 2.1.1; and

    b. are annotated as splicing or exonic nonsynonymous/stopgain by refGene, or as pathogenic or likely pathogenic by clinvar_20210501

  5. Subset the filtered VCF by each of the following criteria, producing a separate result VCF for each:

    a. paternally inherited variants in the son

    b. maternally inherited variants in the son

    c. variants in genes that contain variants from both a. and b. above

    d. de novo variants in the son

    e. hemizygous variants in the son following the X-linked recessive inheritance model
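The logic of steps 4 and 5 can be sketched in plain Python, independent of any VCF library. This is a simplified illustration under several assumptions, not a definitive implementation: the annotation key names (`AF`, `Func.refGene`, `ExonicFunc.refGene`, `CLNSIG`) and value spellings are assumed to match the ANNOVAR output and should be verified against the actual annotated file; genotypes are assumed to be tuples of allele indices from sites already split to biallelic; and pseudoautosomal regions on chrX are not handled. All function names are hypothetical.

```python
import re

PATHOGENIC_TERMS = {"pathogenic", "likely_pathogenic"}
DAMAGING_EXONIC = {"nonsynonymous_SNV", "stopgain"}

def passes_filter(info, max_af=0.0001):
    """Step 4: rare (or absent) in gnomAD AND (damaging by refGene OR P/LP in ClinVar)."""
    af = info.get("AF", ".")
    rare = af in (".", "", None) or float(af) < max_af
    func = set(info.get("Func.refGene", "").split(";"))
    damaging = "splicing" in func or (
        "exonic" in func and info.get("ExonicFunc.refGene") in DAMAGING_EXONIC
    )
    # Split on separators so that 'Conflicting_interpretations_of_pathogenicity'
    # does not match by substring.
    clnsig_terms = set(re.split(r"[/,|]", info.get("CLNSIG", "").lower()))
    return rare and (damaging or bool(clnsig_terms & PATHOGENIC_TERMS))

def carries_alt(gt):
    """gt is a tuple of allele indices, e.g. (0, 1); None marks a missing allele."""
    return any(a is not None and a > 0 for a in gt)

def classify_son_variant(son, father, mother):
    """Steps 5a, 5b, 5d: label the origin of the son's alternate allele."""
    if not carries_alt(son):
        return "not_in_son"
    if None in son + father + mother:
        return "uncertain"
    dad, mom = carries_alt(father), carries_alt(mother)
    if dad and mom:
        return "both_or_ambiguous"
    if dad:
        return "paternal"
    if mom:
        return "maternal"
    return "de_novo"

def is_xlr_hemizygous(chrom, son, father, mother):
    """Step 5e: son carries only alt on chrX, mother is a het carrier, father is ref."""
    son_all_alt = len(son) > 0 and all(a is not None and a > 0 for a in son)
    mother_het = carries_alt(mother) and 0 in mother
    return chrom in ("chrX", "X") and son_all_alt and mother_het and not carries_alt(father)

if __name__ == "__main__":
    print(classify_son_variant((0, 1), (0, 1), (0, 0)))
    print(is_xlr_hemizygous("chrX", (1,), (0,), (0, 1)))
```

Under this scheme, criterion c. would reduce to intersecting the set of genes that have at least one "paternal" variant with the set that have at least one "maternal" variant, then keeping the variants in those genes. Apparent de novo calls are sensitive to missed or low-confidence parental genotypes, so a real workflow would likely add depth and genotype-quality checks before reporting them.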