JustinAronson/GSoC2023FinalSubmission.md

## GSoC2023FinalSubmission.md

      
Google Summer of Code 2023 Final Report

Project Information


Project Name: Implementing Phenopackets in a Variant Discovery Pipeline
Organization: Global Alliance for Genomics and Health (GA4GH)
Mentors: Bob Dolin, Srikar Chamala
Contributor: Justin Aronson
Repository: GitHub

Project Overview

Phenopacket Injection:

Enabled users to import phenopackets from a Fast Healthcare Interoperability Resources (FHIR) server and receive variant information for genes relevant to the patient's phenotype. Human Phenotype Ontology (HPO) terms are retrieved from the FHIR server using FHIR search. A tool called Phen2Gene is then queried to translate the HPO terms into candidate genes. These genes are displayed to the user, ranked by their likelihood of causing the phenotypes seen in the patient.
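
The flow described above (HPO terms in, ranked genes out) can be sketched roughly as follows. This is a minimal illustration and not the project's code: the bundle layout, the HPO coding system URI, the `GeneScore` type, and the `lookup` callable that stands in for the Phen2Gene query are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumed coding system URI for HPO terms; a given FHIR server may use another.
HPO_SYSTEM = "http://purl.obolibrary.org/obo/hp.owl"


@dataclass(frozen=True)
class GeneScore:
    """Hypothetical container for one gene returned by the HPO-to-gene lookup."""
    gene: str
    score: float  # higher means more likely to explain the phenotypes


def extract_hpo_terms(bundle: dict) -> list[str]:
    """Collect distinct HPO codes from the resources in a FHIR search Bundle."""
    terms: list[str] = []
    for entry in bundle.get("entry", []):
        codings = entry.get("resource", {}).get("code", {}).get("coding", [])
        for coding in codings:
            code = coding.get("code", "")
            if code and coding.get("system") == HPO_SYSTEM and code not in terms:
                terms.append(code)
    return terms


def rank_genes(
    hpo_terms: list[str],
    lookup: Callable[[list[str]], Iterable[GeneScore]],
) -> list[GeneScore]:
    """Translate HPO terms to genes via `lookup` and sort best candidates first.

    `lookup` is a placeholder for whatever performs the Phen2Gene query.
    """
    return sorted(lookup(hpo_terms), key=lambda g: g.score, reverse=True)
```

Passing the lookup in as a callable keeps the sketch independent of any particular Phen2Gene client, so it can be exercised with canned data.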
Improved Query Strategy:

This app is very data-heavy: typical use involves querying 10 or more genes from the FHIR reference implementation. A new query strategy was introduced for this app, which submits queries to the reference implementation in parallel. When querying 3 genes from the reference implementation, an average speed improvement of 45% was observed over the old call strategy. This improvement grows for queries involving more genes.
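
As an illustration of the general idea of issuing one query per gene concurrently, here is a minimal Python sketch using a thread pool. It is not the app's implementation (the app itself is written in TypeScript): `query_genes_in_parallel`, the `fetch_variants` callable, and the `max_workers` default are hypothetical names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def query_genes_in_parallel(
    genes: Sequence[str],
    fetch_variants: Callable[[str], Any],
    max_workers: int = 8,
) -> dict[str, Any]:
    """Run one `fetch_variants` call per gene concurrently, keyed by gene.

    `fetch_variants` stands in for a single request to the FHIR server.
    """
    if not genes:
        return {}
    workers = min(max_workers, len(genes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order and re-raises the first
        # exception raised by any worker when its result is consumed.
        results = list(pool.map(fetch_variants, genes))
    return dict(zip(genes, results))
```

Because the requests are I/O-bound, threads spend most of their time waiting on the server, which is why overlapping them can shorten the total wait as the number of genes grows.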
Improved UI/UX:

Translated the app into React.js and TypeScript to improve the user experience. Improved multinucleotide variant reporting by including each variant's component single nucleotide variants.
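
To illustrate what "component single nucleotide variants" means, the sketch below splits a multinucleotide variant into the single-base changes it contains. The report does not describe how the app obtains these components, so this is only an assumption-laden example: the `Snv` type, the `component_snvs` function, and the position/ref/alt representation are hypothetical.

```python
from typing import NamedTuple


class Snv(NamedTuple):
    """Hypothetical record for one single nucleotide variant."""
    position: int
    ref: str
    alt: str


def component_snvs(position: int, ref: str, alt: str) -> list[Snv]:
    """Split a multinucleotide variant into the single-base changes it contains."""
    if len(ref) != len(alt):
        raise ValueError("an MNV needs reference and alternate alleles of equal length")
    return [
        Snv(position + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # skip bases the MNV leaves unchanged
    ]
```

For example, an MNV at position 100 with reference `ACG` and alternate `GCT` yields two components, one at position 100 (A to G) and one at position 102 (G to T); the unchanged middle base is skipped.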
Multiple Gene Loading:

Enabled users to load multiple genes at once, which makes the phenopacket processing pipeline possible.

These changes can be found at this pull request.

Next Steps

AI variant prioritization solution:

Rare disease variant prioritization is time-consuming for clinicians. We aim to determine whether the FHIR server data contains enough signal to power an AI algorithm designed to 'bubble up' variants that potentially cause rare disease. Public datasets, including ClinVar and OMIM, will be used to train the algorithm.