Yiran-Guo/d3b-bixsci-exercises.md

## d3b-bixsci-exercises.md

      
D3b Bioinformatics Scientist Exercises

There are three sections: the first covers bioinformatics pipeline development and automation, the second covers database queries and result visualization, and the third covers tertiary analysis involving some basic human genetics.
Pipeline Section

Write a workflow for the functions described below:

Input FastQ/BAM file
Use BWA MEM to re-align to hg38
Output CRAM file

When creating this workflow, please consider:

Efficiency in both space and time
Edge cases and error handling
Tests

We prefer Common Workflow Language (CWL); however, you may also use Nextflow, Snakemake, WDL, shell, or any other scripting approach. Whichever you choose, please explain why.
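As one possible starting point for a script-based answer, the sketch below only assembles the re-alignment pipeline as a shell string and does not run anything. The function name `build_realign_command`, the file names, and the default thread count are hypothetical. It assumes paired-end reads, and the tool options it uses (`samtools collate -u -O`, `samtools fastq`, `bwa mem -t` and `-p`, `samtools sort -O cram --reference`) should be checked against the installed versions. BAM input is collated by read name and streamed into `bwa mem` as interleaved FASTQ, so no intermediate files are written.

```python
import shlex
from pathlib import Path


def build_realign_command(inputs, reference, out_cram, threads=4):
    """Return a shell pipeline (as a string) that re-aligns reads and writes CRAM.

    inputs: one BAM path, or one or two FASTQ paths (hypothetical file names).
    Nothing is executed here; the caller decides how to run the string.
    """
    if not inputs or len(inputs) > 2:
        raise ValueError("expected one BAM or one/two FASTQ files")
    q = shlex.quote
    first = Path(inputs[0]).name.lower()
    if first.endswith(".bam"):
        if len(inputs) != 1:
            raise ValueError("only one BAM input is supported")
        # Group mates together, then stream interleaved FASTQ into bwa (-p).
        source = (
            f"samtools collate -u -O {q(inputs[0])} | samtools fastq - | "
            f"bwa mem -t {threads} -p {q(reference)} -"
        )
    elif first.endswith((".fastq", ".fq", ".fastq.gz", ".fq.gz")):
        reads = " ".join(q(p) for p in inputs)
        source = f"bwa mem -t {threads} {q(reference)} {reads}"
    else:
        raise ValueError(f"unrecognised input type: {inputs[0]}")
    # Coordinate-sort and write CRAM directly, avoiding an intermediate BAM.
    sink = (
        f"samtools sort -@ {threads} -O cram --reference {q(reference)} "
        f"-o {q(out_cram)} -"
    )
    # pipefail makes a failure in any stage fail the whole pipeline (bash only).
    return f"set -euo pipefail; {source} | {sink}"
```

The returned string could be run with `subprocess.run(cmd, shell=True, executable="/bin/bash", check=True)`. Read-group handling, input validation beyond the file extension, and tests on a small fixture are left out of this sketch.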
Data Section

This section consists of short exercises analyzing a health-related relational database in SQLite, which can be downloaded from: https://www.dropbox.com/s/mgu1s93kpjsoyhh/openmrs.db?dl=0
This database is a processed version of the public data set for a specific instance of a query tool. The original OpenMRS data model can be found here: https://wiki.openmrs.org/display/docs/Data+Model. Solutions should be run against the data in the provided SQLite database. You may convert it to a database flavor of your choice; if you do, please explain your choice.
Key tables include:

patient
encounter
encounter_diagnosis
diagnosis

Please provide both the resulting data and any code that was run to obtain it.
Data Exercise 1

Provide a list of male patients in the database and the counts of patients by gender.
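A minimal sketch of how such a query might look with Python's built-in `sqlite3` module. The column names `patient_id` and `gender` and the code `'M'` for male are assumptions, not taken from the actual schema; running `PRAGMA table_info(patient)` against the real database would confirm them. The demo at the bottom uses a tiny in-memory table as a stand-in for `openmrs.db`.

```python
import sqlite3


def gender_summary(conn, male_code="M"):
    """Return (list of male patient ids, {gender: count}).

    Assumes a `patient` table with `patient_id` and `gender` columns;
    these names and the 'M' code are guesses, not read from the schema.
    """
    males = [
        row[0]
        for row in conn.execute(
            "SELECT patient_id FROM patient WHERE gender = ? ORDER BY patient_id",
            (male_code,),
        )
    ]
    # Each row is a (gender, count) pair, so the cursor converts directly to a dict.
    counts = dict(
        conn.execute("SELECT gender, COUNT(*) FROM patient GROUP BY gender")
    )
    return males, counts


if __name__ == "__main__":
    # Tiny in-memory stand-in so the sketch runs without the real database.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, gender TEXT)")
    demo.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "M"), (2, "F"), (3, "M")])
    print(gender_summary(demo))
```

Against the provided file, the connection would instead be opened with `sqlite3.connect("openmrs.db")`.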
Data Exercise 2

Create a summary visualization for each of the following: patient diagnosis, gender, and age.
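Before choosing a plotting library, it can help to settle how the values are summarised. The sketch below bins ages into fixed-width groups and renders any label-to-count mapping as text bars, using only the standard library. It assumes ages have already been computed in whole years (how the database stores birth dates is not stated here), and the function names `bin_ages` and `text_bars` are hypothetical.

```python
from collections import Counter


def bin_ages(ages, width=10):
    """Count ages (in whole years) in fixed-width bins labelled like '30-39'."""
    if width <= 0:
        raise ValueError("width must be positive")
    counts = Counter()
    for age in ages:
        if age is None or age < 0:
            continue  # skip missing or implausible values instead of guessing
        low = int(age) // width * width
        counts[low] += 1
    return {f"{low}-{low + width - 1}": counts[low] for low in sorted(counts)}


def text_bars(counts, symbol="#"):
    """Render a {label: count} mapping as one text bar per line."""
    label_width = max((len(label) for label in counts), default=0)
    return "\n".join(
        f"{label.ljust(label_width)} | {symbol * n} {n}"
        for label, n in counts.items()
    )
```

The same `text_bars` helper works for gender or diagnosis counts, and the binned dictionary can be handed to a real plotting library for the final figures.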
Tertiary Analysis Section

Consider the Ashkenazi Jewish family trio from Genome in a Bottle (GIAB, https://www.nist.gov/programs-projects/genome-bottle); the relevant files can be downloaded from there. You are encouraged to automate as much of the analysis as possible in a workflow.


1. Download the latest variant call format (VCF) files for hg38.

2. Combine the three VCFs into one joint/multi-sample family VCF (which tool will you use?).

3. Annotate this multi-sample VCF using ANNOVAR (https://annovar.openbioinformatics.org/en/latest/) with these databases: refGene, gnomad211_exome, clinvar_20210501.

4. Filter the annotated VCF, retaining variants that are
   a. absent from gnomAD exome 2.1.1 or present there with MAF < 0.0001, and
   b. annotated as splicing or exonic nonsynonymous/stopgain by refGene, or as pathogenic or likely pathogenic by clinvar_20210501.

5. Restrict the filtered VCF according to each of the following criteria, producing a separate result VCF for each:
   a. paternally inherited variants in the son
   b. maternally inherited variants in the son
   c. variants in genes that appear in both a. and b. above
   d. de novo variants in the son
   e. hemizygous variants in the son following the X-linked recessive inheritance model
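The filtering and inheritance criteria above reduce to per-variant predicates, which the following sketch illustrates. It assumes genotypes have already been parsed into tuples of allele indices (with `None` for a missing allele) and annotations into plain values. All function names are hypothetical, and the annotation labels (`"splicing"`, `"nonsynonymous SNV"`, `"Likely_pathogenic"`, and so on) are illustrative placeholders that should be checked against the actual ANNOVAR output. The inheritance logic is deliberately simplified: it uses genotypes only, with no phasing, depth, or quality, and treats every X-chromosome site as non-pseudoautosomal.

```python
def alt_alleles(gt):
    """Set of non-reference allele indices in a genotype tuple such as (0, 1)."""
    return {a for a in gt if a is not None and a > 0}


def is_called(gt):
    """True if the genotype has no missing allele (None stands for '.')."""
    return len(gt) > 0 and all(a is not None for a in gt)


def passes_filter(maf, func, exonic_func, clinsig, max_maf=0.0001):
    """Rare (or absent from gnomAD, maf=None) AND functionally relevant."""
    rare = maf is None or maf < max_maf
    damaging = func == "splicing" or (
        func == "exonic" and exonic_func in {"nonsynonymous SNV", "stopgain"}
    )
    pathogenic = clinsig in {"Pathogenic", "Likely_pathogenic"}
    return rare and (damaging or pathogenic)


def classify_inheritance(son, father, mother):
    """Label the son's alternate allele(s) by which parent also carries them."""
    son_alts = alt_alleles(son)
    if not son_alts:
        return "not_in_son"
    if not (is_called(father) and is_called(mother)):
        return "unknown"  # a missing parental call cannot support any model
    from_father = bool(son_alts & alt_alleles(father))
    from_mother = bool(son_alts & alt_alleles(mother))
    if from_father and from_mother:
        return "both_or_ambiguous"
    if from_father:
        return "paternal"
    if from_mother:
        return "maternal"
    return "de_novo"


def is_x_linked_recessive(chrom, son, father, mother):
    """Son has no reference allele on X, mother is a heterozygous carrier,
    and father does not carry the allele."""
    if chrom not in {"X", "chrX"}:
        return False
    son_alts = alt_alleles(son)
    if not son_alts or not is_called(son) or 0 in son:
        return False
    mother_het = (
        is_called(mother) and 0 in mother and bool(son_alts & alt_alleles(mother))
    )
    father_ref = is_called(father) and not (son_alts & alt_alleles(father))
    return mother_het and father_ref


def genes_in_both(paternal_genes, maternal_genes):
    """Genes that carry at least one paternal and one maternal variant."""
    return set(paternal_genes) & set(maternal_genes)
```

For example, `classify_inheritance((0, 1), (0, 1), (0, 0))` gives `"paternal"`. In a real workflow these predicates would be applied per record by a VCF reader, with each label written to its own output file.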