junaidnz97/JUNAID_NZ_GSOC2019.md

## JUNAID_NZ_GSOC2019.md

      
            
          
RESTful APIs for Genomic Variation Search - GSoC 2019

About the organization

The Global Alliance for Genomics and Health (GA4GH)


The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting, and interoperable sharing of genomic data.
European Variation Archive


The European Variation Archive is an open-access database of all types of genetic variation data from all species.
All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser or the API.
About the project

The European Variation Archive (EVA) would like to improve the alignment of its genomic variation search APIs with the corresponding standards defined by GA4GH.
The Beacon project was launched in 2014 to show the willingness of researchers to enable the secure sharing of genomic data from participants of genomic studies. Beacons are web servers that answer questions such as "Does your dataset include a genome that has a specific nucleotide (e.g. G) at a specific genomic coordinate (e.g. Chromosome: 1, Start: 111, End: 111)?", to which the Beacon must respond with yes or no, without referring to a specific individual.
The previous Beacon implementation in EVA, however, was based on an outdated version of the Beacon specification. One goal of this project was therefore to upgrade the implementation in “eva-server” so that it matches the latest Beacon specification.
Another goal was to redesign the existing API for searching genomic variations according to REST principles. Previously, this API returned a very large amount of data even for a small region, and the information was cluttered and difficult to analyze. The redesigned API should return only the information the user asks for.
The last goal was to make searching for variants in a gene more efficient. Searching by gene directly is not efficient, so the idea was to implement a data pipeline that loads mappings from a gene ID to its coordinates in a genome. This mapping can then be used to retrieve the coordinates of a particular gene and to search for variants using those coordinates.
Team


Junaid N Z
Cristina Yenyxe Gonzalez - Mentor
Andres Felipe Silva Valencia - Mentor
Jose Miguel Mut Lopez - Mentor

Technologies


Java 8: As the implementation language.
MongoDB: As the datastore.
Maven: As the build automation tool.
Spring Web MVC: As the server infrastructure.
HATEOAS: To provide information dynamically through hypermedia.
Spring Batch: As the framework for batch processing tasks.

My Pull Requests

My contributions to the project are as follows:

PRs in eva-ws repository
PRs in variation-commons repository
PRs in eva-pipeline repository

Completed Tasks

I have completed the following tasks during the GSoC coding period.
Updated Beacon API to follow the latest specification

This was the first task of my project and it was really interesting. The task was quite big and had to be divided into several pull requests. New classes were needed to represent the different entities described in the Beacon specification. I initially wrote these classes by hand but, at the mentors' suggestion, replaced them with code autogenerated by the "swagger codegen" tool. Several functions that this endpoint uses to access the database were implemented in the variation-commons repository. Feedback on the issues we found in the specification was submitted as well.
PULL REQUESTS


Add required functions and classes to variation-commons
Refactored Beacon Response classes to variation-commons
Replace the new classes by auto-generated code
Minor bug fix to retrieve all the required fields from the database
Add the code for beacon endpoint to eva-ws
Minor Bug fix
Bug Fix and optimization for beacon query
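
The kind of yes/no question a Beacon answers can be illustrated with a minimal in-memory sketch. This is not the eva-ws code: the class `BeaconSketch`, the nested `Variant` type, and the method `exists` are hypothetical names, and the real endpoint queries MongoDB and uses the request and response classes generated from the Beacon specification.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for a variant dataset (assumption: the real service queries MongoDB).
class BeaconSketch {

    // Minimal variant record: chromosome, start position, and alternate allele.
    static final class Variant {
        final String chromosome;
        final long start;
        final String alternate;

        Variant(String chromosome, long start, String alternate) {
            this.chromosome = chromosome;
            this.start = start;
            this.alternate = alternate;
        }
    }

    private final List<Variant> variants;

    BeaconSketch(List<Variant> variants) {
        this.variants = variants;
    }

    // Answers only yes or no; nothing about the individuals carrying the allele is returned.
    boolean exists(String chromosome, long start, String alternate) {
        for (Variant variant : variants) {
            if (variant.chromosome.equals(chromosome)
                    && variant.start == start
                    && variant.alternate.equalsIgnoreCase(alternate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BeaconSketch beacon = new BeaconSketch(Arrays.asList(
                new Variant("1", 111, "G"),
                new Variant("2", 500, "T")));
        System.out.println(beacon.exists("1", 111, "G")); // prints true
        System.out.println(beacon.exists("1", 111, "A")); // prints false
    }
}
```

The boolean return type mirrors the constraint that a Beacon reveals only whether an allele has been observed, not in whom.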

Redesigning the endpoint for searching for variant information using variant coordinates and alleles

This endpoint makes it possible to search for variant information using variant coordinates and alleles. The main endpoint was divided further into three endpoints: one that retrieves only the core variant information, one that retrieves only the annotation information, and one that retrieves only the sources information for the given variantId. Links were added to the response using HATEOAS so that a user can easily traverse these sub-endpoints.
PULL REQUESTS


Redesign search for variant information using variantId to follow RESTful principles
Remove the use of QueryResponse class in the response
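
The idea of attaching hypermedia links for the sub-endpoints can be sketched in plain Java, without Spring HATEOAS. The base URL, the path segments (`/variants`, `/annotations`, `/sources`), the link relation names, and the variant ID format used here are assumptions for illustration only and are not taken from eva-ws.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of HATEOAS-style links; the project itself uses Spring HATEOAS.
class VariantLinksSketch {

    // Builds relation -> URL pairs for one variant (all paths here are hypothetical).
    static Map<String, String> linksFor(String baseUrl, String variantId) {
        String self = baseUrl + "/variants/" + variantId;
        Map<String, String> links = new LinkedHashMap<>();
        links.put("self", self);                         // core variant information
        links.put("annotations", self + "/annotations"); // annotation information only
        links.put("sources", self + "/sources");         // sources information only
        return links;
    }

    public static void main(String[] args) {
        Map<String, String> links = linksFor("http://localhost:8080/v2", "1:3000017:C:T");
        for (Map.Entry<String, String> link : links.entrySet()) {
            System.out.println(link.getKey() + " -> " + link.getValue());
        }
    }
}
```

A client that receives these links alongside the core variant information can follow them without knowing the URL layout in advance, which is the point of adding hypermedia to the response.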

Redesigning the endpoint for searching variants by region to follow REST principles

This API makes it possible to search for variants in one or more regions at once, together with various other search parameters. Links were also added to the response for each variant using HATEOAS, pointing to the "VariantServer" endpoint that provides detailed information on each individual variant. Pagination was added to this endpoint as well.
PULL REQUESTS


Redesign search for variants by region endpoint to follow RESTful principles
Add pagination feature to region endpoint
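
A multi-region query with pagination can be sketched roughly as follows. The `chromosome:start-end` region syntax, the comma separator, and the zero-based page numbering are assumptions made for this sketch; the names `RegionPagingSketch`, `parseRegions`, and `page` are hypothetical and do not come from eva-ws.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RegionPagingSketch {

    // A genomic region on one chromosome with inclusive start and end coordinates.
    static final class Region {
        final String chromosome;
        final long start;
        final long end;

        Region(String chromosome, long start, long end) {
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
        }
    }

    // Parses a comma-separated list such as "1:1000-2000,2:500-900" (assumed syntax).
    static List<Region> parseRegions(String text) {
        List<Region> regions = new ArrayList<>();
        for (String part : text.split(",")) {
            String[] chromosomeAndRange = part.trim().split(":");
            String[] range = chromosomeAndRange.length == 2
                    ? chromosomeAndRange[1].split("-")
                    : new String[0];
            if (range.length != 2) {
                throw new IllegalArgumentException("Invalid region: " + part);
            }
            long start = Long.parseLong(range[0]);
            long end = Long.parseLong(range[1]);
            if (start > end) {
                throw new IllegalArgumentException("Start is after end in region: " + part);
            }
            regions.add(new Region(chromosomeAndRange[0], start, end));
        }
        return regions;
    }

    // Returns one zero-based page of results, or an empty list when the page is past the end.
    static <T> List<T> page(List<T> items, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        long from = (long) pageNumber * pageSize;
        if (from >= items.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, items.size());
        return new ArrayList<>(items.subList((int) from, to));
    }
}
```

In a real service the page would normally be applied in the database query rather than on an in-memory list; the slicing here only shows the arithmetic.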

Adding an endpoint to search for variants using an identifier

This API makes it possible to search for variants using an identifier. Links pointing to the annotation information and sources information for each variant were also added using HATEOAS.
PULL REQUESTS


Add search for variants by an Identifier endpoint
Add HATEOAS feature to Identifier endpoint

Verify if the data pipeline to load mapping from geneId to its coordinates works

Since the code for this pipeline was already written, I only had to verify that it loads the mappings from a ".gtf" file into the database. I was able to verify this after making a minor fix to an example properties file.
PULL REQUESTS


Added feature property to initialize-database.properties
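
To make the gene ID to coordinates mapping concrete, here is a minimal sketch of how such a mapping could be read from GTF lines. It relies only on the standard GTF layout (nine tab-separated columns, with `gene_id "..."` in the attributes column). It is not the eva-pipeline code, which uses Spring Batch and writes to MongoDB; the names `GtfGeneSketch` and `Coordinates`, and the choice to keep only rows whose feature column is `gene`, are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GtfGeneSketch {

    // Chromosome plus start and end coordinates of a gene, as given in the GTF file.
    static final class Coordinates {
        final String chromosome;
        final long start;
        final long end;

        Coordinates(String chromosome, long start, long end) {
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
        }
    }

    // Extracts the value of the gene_id attribute from the ninth GTF column.
    private static final Pattern GENE_ID = Pattern.compile("gene_id \"([^\"]+)\"");

    // Returns gene ID -> coordinates for "gene" rows; comments, blank lines, and other features are skipped.
    static Map<String, Coordinates> parse(Iterable<String> lines) {
        Map<String, Coordinates> genes = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.startsWith("#") || line.trim().isEmpty()) {
                continue;
            }
            // GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes.
            String[] columns = line.split("\t");
            if (columns.length < 9 || !"gene".equals(columns[2])) {
                continue;
            }
            Matcher matcher = GENE_ID.matcher(columns[8]);
            if (matcher.find()) {
                genes.put(matcher.group(1), new Coordinates(
                        columns[0], Long.parseLong(columns[3]), Long.parseLong(columns[4])));
            }
        }
        return genes;
    }
}
```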

Redesign search for variants by geneId to follow RESTful principles and to use the geneId to coordinates mapping

This API makes it possible to search for variants using a list of geneIds. It uses the geneId to coordinates mapping to look up the coordinates of each gene, then calls the region endpoint with them to retrieve the respective variants. HATEOAS and pagination were added to this endpoint as well.
PULL REQUESTS


Added functions needed to retrieve data from the database
Redesign search for variants by a list of geneIds
Add additional search parameters to gene endpoint
Minor Bug Fix
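
The step from a list of geneIds to a region query can be sketched as a simple lookup. This is an illustration under assumptions: the mapping is held in a plain `Map` rather than in MongoDB, regions are written as `chromosome:start-end` strings, and unknown geneIds are silently skipped. The names `GeneRegionLookupSketch` and `regionQueryFor` are hypothetical, and the gene IDs and coordinates in `main` are example values only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GeneRegionLookupSketch {

    // Translates geneIds into a comma-separated region list; genes without a mapping are skipped (assumption).
    static String regionQueryFor(List<String> geneIds, Map<String, String> geneToRegion) {
        List<String> regions = new ArrayList<>();
        for (String geneId : geneIds) {
            String region = geneToRegion.get(geneId);
            if (region != null) {
                regions.add(region);
            }
        }
        return String.join(",", regions);
    }

    public static void main(String[] args) {
        // Stand-in for the geneId -> coordinates collection loaded by the pipeline.
        Map<String, String> geneToRegion = new HashMap<>();
        geneToRegion.put("ENSG00000223972", "1:11869-14409");
        geneToRegion.put("ENSG00000227232", "1:14404-29570");

        String query = regionQueryFor(
                Arrays.asList("ENSG00000223972", "ENSG00000227232"), geneToRegion);
        System.out.println(query); // prints 1:11869-14409,1:14404-29570
    }
}
```

The resulting string is what would then be handed to the region search, so the gene endpoint reuses the region logic instead of scanning variants by gene.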

Add descriptions for the parameters

Descriptions were added for the request parameters so that they are displayed on the Swagger page, making it easier for users to understand what each parameter means.
PULL REQUESTS


Add parameter description

Add an endpoint to retrieve all the studies for a given species and assembly (Additional Goal)

This endpoint makes it possible to retrieve all studies for a given species and assembly. Pagination was added to this endpoint as well.
PULL REQUESTS


Added functions needed to retrieve required data from the database
Added the endpoint to retrieve all the studies

Added test for DatabaseInitializationJob (Additional Goal)

The task was to add unit tests for the DatabaseInitializationJob in the eva-pipeline repository.
PULL REQUESTS


Added test for DatabaseInitializationJob

Added parameters validator for DatabaseInitializationJob (Additional Goal)

The task was to add a parameters validator to the DatabaseInitializationJob in the eva-pipeline repository.
PULL REQUESTS


Added parameters validator for DatabaseInitializationJob

Conclusion

My project was mostly about designing APIs based on RESTful principles and implementing them. Coding the Beacon API was challenging as well as interesting. The mentors were very helpful, and the code reviews were very detailed, so I was able to learn a lot from them. Implementing HATEOAS was challenging and caused many issues at first, but with the mentors' help I was able to fix them gradually.
The work is complete, and I am really looking forward to contributing a lot more to the project.