There are three sections: the first covers bioinformatics pipeline development and automation, the second covers database queries and result visualization, and the third covers tertiary analysis involving some basic human genetics.
Write a workflow for the functions described below:
- Input FastQ/BAM file
- Use BWA MEM to re-align to hg38
- Output CRAM file
When creating this workflow, please consider:
- Efficiency in both space and time
- Edge cases and error handling
- Tests
We prefer Common Workflow Language (CWL); however, you may also use Nextflow, Snakemake, WDL, shell, or any other scripting approach. Whichever you choose, please explain why.
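As a starting point for any of these workflow languages, the command line that each step wraps can be sketched first. The Python helper below is only an illustration, not a reference solution: the function name `realign_command`, the default read group, and the file names are hypothetical. It assumes `bwa` and `samtools` are on the PATH, that the hg38 FASTA is already indexed with `bwa index`, and that the returned string is run under bash. Streaming through pipes avoids intermediate files (space and time), and rejecting unexpected inputs up front covers one class of edge cases.

```python
import shlex

# Suffixes treated as FASTQ input (an assumption; extend as needed).
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def realign_command(inputs, reference, out_cram, threads=4, sample="sample"):
    """Build a bash pipeline that re-aligns FASTQ or BAM input and writes CRAM."""
    if len(inputs) == 1 and inputs[0].endswith(".bam"):
        kind = "bam"
    elif 1 <= len(inputs) <= 2 and all(p.endswith(FASTQ_SUFFIXES) for p in inputs):
        kind = "fastq"
    else:
        raise ValueError("expected one BAM file, or one or two FASTQ files")

    q = shlex.quote
    # bwa expects a literal backslash-t between read-group fields.
    read_group = q(f"@RG\\tID:{sample}\\tSM:{sample}")
    # Sort by coordinate and write CRAM directly, so no intermediate BAM is kept.
    sort = (f"samtools sort -@ {threads} -O cram "
            f"--reference {q(reference)} -o {q(out_cram)} -")

    if kind == "bam":
        # Group mates together, convert to interleaved FASTQ, align with -p.
        source = f"samtools collate -u -O {q(inputs[0])} | samtools fastq - | "
        align = f"bwa mem -p -t {threads} -R {read_group} {q(reference)} -"
    else:
        source = ""
        reads = " ".join(q(p) for p in inputs)
        align = f"bwa mem -t {threads} -R {read_group} {q(reference)} {reads}"

    # pipefail makes a failure in any stage fail the whole pipeline.
    return f"set -euo pipefail; {source}{align} | {sort}"


if __name__ == "__main__":
    print(realign_command(["in.bam"], "hg38.fa", "out.cram"))
    print(realign_command(["r1.fq.gz", "r2.fq.gz"], "hg38.fa", "out.cram", threads=8))
```

Because the helper only builds a string, it can be unit-tested without installing the tools, and the same command can then be pasted into a CWL `CommandLineTool`, a Nextflow process, or a Snakemake rule.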
This section consists of short exercises in the analysis of a relational health-related database in SQLite, downloadable from here: https://www.dropbox.com/s/mgu1s93kpjsoyhh/openmrs.db?dl=0
This database is a processed version of the public data set for a specific instance of a query tool. The original OpenMRS data model can be found here: https://wiki.openmrs.org/display/docs/Data+Model. Solutions should be run against the data in the provided SQLite database. You may convert it to a database flavor of your choice; if you do, explain your choice.
Key tables include:
- patient
- encounter
- encounter_diagnosis
- diagnosis
Please provide both the resulting data and the code that was run to obtain it.
Provide a list of male patients in the database and the counts of patients by gender.
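One way to approach this query is sketched below using Python's built-in `sqlite3` module. The schema here is an assumption, since the brief does not list columns: it supposes the `patient` table has `patient_id` and `gender` columns and that males are coded as `'M'`. The in-memory table is stand-in data only; check the real schema (for example with `PRAGMA table_info(patient)`) before reusing the queries.

```python
import sqlite3


def gender_summary(conn):
    """Return (list of male patient ids, {gender: count}) from the patient table."""
    males = [row[0] for row in conn.execute(
        "SELECT patient_id FROM patient WHERE gender = 'M' ORDER BY patient_id")]
    counts = dict(conn.execute(
        "SELECT gender, COUNT(*) FROM patient GROUP BY gender"))
    return males, counts


if __name__ == "__main__":
    # Stand-in data; the real run would use sqlite3.connect("openmrs.db").
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, gender TEXT)")
    conn.executemany("INSERT INTO patient VALUES (?, ?)",
                     [(1, "M"), (2, "F"), (3, "M")])
    males, counts = gender_summary(conn)
    print(males)
    print(counts)
```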
Create a summary data visualization for each of the following patient attributes: diagnosis, gender, and age.
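Before plotting, the data usually has to be reduced to counts per category. The sketch below shows that aggregation for gender and age, under the assumption that each patient row provides a gender code and an ISO-formatted birthdate (the column names and the reference date are hypothetical). It prints a plain-text bar chart so that it runs with the standard library alone; the same counts could be passed to any plotting library, and diagnosis counts would follow the same pattern.

```python
from collections import Counter
from datetime import date


def age_in_years(birthdate, today):
    """Completed years between birthdate and today."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1  # birthday has not happened yet this year
    return years


def summarize(rows, today, width=10):
    """Count patients per (gender, lower bound of age band)."""
    counts = Counter()
    for gender, birthdate_iso in rows:
        age = age_in_years(date.fromisoformat(birthdate_iso), today)
        counts[(gender, (age // width) * width)] += 1
    return counts


def text_bars(counts, width=10):
    """Render the counts as one text bar per (gender, age band)."""
    return [f"{gender} {low:>3}-{low + width - 1:<3} {'#' * n}"
            for (gender, low), n in sorted(counts.items())]


if __name__ == "__main__":
    rows = [("M", "1980-06-15"), ("F", "1975-01-01"), ("M", "1985-12-31")]
    for line in text_bars(summarize(rows, today=date(2021, 6, 14))):
        print(line)
```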
This section uses the Ashkenazi Jewish family trio in GIAB (https://www.nist.gov/programs-projects/genome-bottle); you should be able to find the relevant files for download there. You are encouraged to automate as much as possible with a workflow.
- Download the latest variant calling files (vcf) in hg38
- Combine three vcfs into one family joint/multi-sample vcf (which tool are you going to use?)
- Annotate this multi-sample vcf using Annovar (https://annovar.openbioinformatics.org/en/latest/) with these databases: refGene, gnomad211_exome, clinvar_20210501
- Filter the annotated vcf, retaining variants that are
a. (with MAF < 0.0001 in) or (absent from) gnomAD exome 2.1.1 and
b. annotated to be (splicing or exonic nonsynonymous/stopgain by refGene) or (pathogenic or likely pathogenic by clinvar_20210501)
- Restrict the filtered vcf according to the following criteria (a separate result vcf for each of them)
a. paternally inherited variants in the son
b. maternally inherited variants in the son
c. variants in the same genes as included in both a. and b. above
d. de novo variants in the son
e. hemizygous variants in the son following the X-linked recessive inheritance model
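The filtering and inheritance criteria above can be expressed as small predicates over one variant record. The sketch below is an illustration under stated assumptions, not a validated implementation. It assumes the annotation fields are named `AF`, `Func.refGene`, `ExonicFunc.refGene`, and `CLNSIG`, that multiple values are separated by `;` or `/`, and that genotypes are already parsed into tuples of allele indices (0 = reference, `None` = missing). All function names are hypothetical, and the code ignores genotype quality, phasing, and the pseudoautosomal regions of chrX.

```python
import re

PATHOGENIC = {"pathogenic", "likely_pathogenic"}
DAMAGING_EXONIC = {"nonsynonymous SNV", "nonsynonymous_SNV", "stopgain"}


def passes_filter(info, max_af=0.0001):
    """Rare or absent in gnomAD AND (damaging by refGene OR (likely) pathogenic)."""
    af = info.get("AF", ".")
    rare = af in (".", "") or float(af) < max_af
    funcs = set(info.get("Func.refGene", "").split(";"))
    exonic_hit = ("exonic" in funcs
                  and info.get("ExonicFunc.refGene") in DAMAGING_EXONIC)
    # Exact term match, so "Conflicting_interpretations_of_pathogenicity" is excluded.
    terms = re.split(r"[/,|]", info.get("CLNSIG", "").lower())
    pathogenic = any(term in PATHOGENIC for term in terms)
    return rare and ("splicing" in funcs or exonic_hit or pathogenic)


def allele_origin(son, father, mother, allele=1):
    """Classify where the son's copy of `allele` most plausibly came from."""
    if None in son + father + mother:
        return "unknown"
    if allele not in son:
        return "absent"
    in_father, in_mother = allele in father, allele in mother
    if in_father and in_mother:
        # Homozygous son: one copy from each parent; heterozygous: cannot tell.
        return "both" if all(a == allele for a in son) else "ambiguous"
    if in_father:
        return "paternal"
    if in_mother:
        return "maternal"
    return "de_novo"


def is_x_linked_recessive(chrom, son, father, mother, allele=1):
    """Son carries only `allele` on chrX, mother is a het carrier, father lacks it."""
    if chrom not in ("chrX", "X") or None in son + father + mother:
        return False
    son_hemizygous = all(a == allele for a in son)
    mother_carrier = mother.count(allele) == 1
    return son_hemizygous and mother_carrier and allele not in father


if __name__ == "__main__":
    info = {"AF": "0.00005", "Func.refGene": "exonic",
            "ExonicFunc.refGene": "stopgain", "CLNSIG": "."}
    print(passes_filter(info))
    print(allele_origin((0, 1), (0, 1), (0, 0)))
    print(is_x_linked_recessive("chrX", (1,), (0,), (0, 1)))
```

Criterion c can then be derived by collecting the gene names of variants labelled paternal and maternal and intersecting the two sets.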