@junaidnz97
Last active August 18, 2020 13:54

RESTful APIs for Genomic Variation Search - GSoC 2019

About the organization

The Global Alliance for Genomics and Health (GA4GH)

The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting and interoperable sharing of genomic data.

European Variation Archive

The European Variation Archive is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser or the API.

About the project

The European Variation Archive (EVA) would like to improve the alignment of its genomic variation search APIs with the corresponding standards defined by GA4GH.

The Beacon project was launched in 2014 to show the willingness of researchers to enable the secure sharing of genomic data from participants of genomic studies. Beacons are web servers that answer questions such as "Does your dataset include a genome that has a specific nucleotide (e.g. G) at a specific genomic coordinate (e.g. Chromosome: 1, Start: 111, End: 111)?", to which the Beacon must respond with yes or no, without referring to a specific individual.
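
To make the yes/no contract concrete, here is a minimal sketch in plain Java. It is not the eva-server code: the `BeaconSketch` class, the `Variant` type and the in-memory dataset are all hypothetical and only illustrate that a Beacon answers with a boolean and reveals nothing about individuals.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a Beacon lookup; not the eva-server implementation.
class BeaconSketch {

    // One observed variant: chromosome, 1-based start position and alternate base(s).
    static final class Variant {
        final String chromosome;
        final long start;
        final String alternate;

        Variant(String chromosome, long start, String alternate) {
            this.chromosome = chromosome;
            this.start = start;
            this.alternate = alternate;
        }
    }

    // Returns true if any variant in the dataset matches the query, false otherwise.
    // Only the yes/no answer leaves this method; no sample or individual is exposed.
    static boolean exists(List<Variant> dataset, String chromosome, long start, String alternate) {
        for (Variant v : dataset) {
            if (v.chromosome.equals(chromosome)
                    && v.start == start
                    && v.alternate.equalsIgnoreCase(alternate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up example data, used only to exercise the method.
        List<Variant> dataset = Arrays.asList(
                new Variant("1", 111, "G"),
                new Variant("2", 5000, "T"));
        System.out.println(exists(dataset, "1", 111, "G")); // prints true
        System.out.println(exists(dataset, "1", 111, "A")); // prints false
    }
}
```

A real Beacon would query the datastore instead of scanning a list, but the shape of the answer stays the same.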

However, the previous implementation in EVA was based on an outdated version of the Beacon specification. One goal of this project was therefore to upgrade the implementation in “eva-server” so that it matches the latest Beacon specification.

Another goal of the project was to redesign the previous API for searching genomic variations according to REST principles. Previously, this API returned a very large amount of data even for a small region, and the information was cluttered and difficult to analyze. The redesign aims to return only the information that the user actually queried.

The last goal of this project was to make searching for variants in a gene more efficient. Searching directly by gene is not efficient, so the idea was to implement a data pipeline that loads mappings from a gene ID to its coordinates in a genome. This mapping can then be used to retrieve the coordinates of a particular gene and to search for variants using those coordinates.

Team

  • Junaid N Z
  • Cristina Yenyxe Gonzalez - Mentor
  • Andres Felipe Silva Valencia - Mentor
  • Jose Miguel Mut Lopez - Mentor

Technologies

  • Java 8: As the implementation language.
  • MongoDB: As the datastore.
  • Maven: As the build automation tool.
  • Spring Web MVC: As the server infrastructure.
  • HATEOAS: To provide information dynamically through hypermedia.
  • Spring Batch: As the framework for batch processing tasks.

My Pull Requests

My contributions to the project are as follows:

Completed Tasks

I have completed the following tasks during the GSoC coding period.

Updated Beacon API to follow the latest specification

This was the first task of my project and it was really interesting. The task was quite big and had to be divided into several pull requests. Several new classes were needed to represent the entities described in the Beacon specification. I had initially written these classes by hand, but following the mentors' suggestion they were replaced by code autogenerated with the "swagger codegen" tool. In addition, several functions that this endpoint uses to access the database were implemented in the variation-commons repository. Feedback on the issues we faced with the specification was submitted as well.

PULL REQUESTS

Redesigning the endpoint for searching for variant information using variant coordinates and alleles

This endpoint makes it possible to search for variant information using variant coordinates and alleles. The main endpoint was divided into three sub-endpoints: one that retrieves only the core variant information, one that retrieves only the annotation information, and one that retrieves only the sources information for the given variantId. Links were added to the response using HATEOAS so that a user can easily navigate between these sub-endpoints.
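
The link structure can be pictured with a small plain-Java sketch. The paths, the relation names and the variantId format below are assumptions made for illustration only; the real endpoint builds its links with the HATEOAS support of the web framework rather than by concatenating strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the hypermedia links attached to one variant.
class VariantLinksSketch {

    // Builds relation -> URL pairs for the core, annotation and sources resources.
    // The "/variants/{id}/..." layout is an assumed example, not the actual EVA path.
    static Map<String, String> linksFor(String baseUrl, String variantId) {
        String self = baseUrl + "/variants/" + variantId;
        Map<String, String> links = new LinkedHashMap<>(); // keeps insertion order
        links.put("self", self);                           // core variant information
        links.put("annotations", self + "/annotations");   // annotation information only
        links.put("sources", self + "/sources");           // sources information only
        return links;
    }

    public static void main(String[] args) {
        // "1:3000:A:T" is a made-up identifier (chromosome:start:reference:alternate).
        Map<String, String> links = linksFor("https://example.org/api", "1:3000:A:T");
        for (Map.Entry<String, String> link : links.entrySet()) {
            System.out.println(link.getKey() + " -> " + link.getValue());
        }
    }
}
```

A client that receives these links never has to construct URLs itself; it follows the relation it needs.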

PULL REQUESTS

Redesigning the endpoint for searching variants by region to follow REST principles

This API makes it possible to search for variants in one or more regions at once, combined with various other search parameters. Using HATEOAS, links were added to the response for each variant, pointing to the "VariantServer" endpoint that provides detailed information on that individual variant. Pagination was also added to this endpoint.
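
As a rough illustration of the two ideas in this endpoint, the sketch below parses a comma-separated list of regions and cuts a result list into pages. The `chromosome:start-end` syntax, the zero-based page numbering and all names are assumptions for the example, not the actual request format of the API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of multi-region parsing and pagination.
class RegionSearchSketch {

    static final class Region {
        final String chromosome;
        final long start;
        final long end;

        Region(String chromosome, long start, long end) {
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
        }
    }

    // Parses e.g. "1:1000-2000,2:500-900" into a list of regions.
    static List<Region> parseRegions(String regions) {
        List<Region> result = new ArrayList<>();
        for (String token : regions.split(",")) {
            String[] chrAndRange = token.trim().split(":");
            String[] range = chrAndRange.length == 2 ? chrAndRange[1].split("-") : new String[0];
            if (range.length != 2) {
                throw new IllegalArgumentException("Expected chromosome:start-end but got: " + token);
            }
            long start = Long.parseLong(range[0]);
            long end = Long.parseLong(range[1]);
            if (start > end) {
                throw new IllegalArgumentException("Start is after end in: " + token);
            }
            result.add(new Region(chrAndRange[0], start, end));
        }
        return result;
    }

    // Returns the requested page (zero-based), or an empty list past the last page.
    static <T> List<T> page(List<T> items, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        long from = (long) pageNumber * pageSize; // long avoids int overflow
        if (from >= items.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, items.size());
        return items.subList((int) from, to);
    }

    public static void main(String[] args) {
        System.out.println(parseRegions("1:1000-2000,2:500-900").size());  // prints 2
        System.out.println(page(Arrays.asList("a", "b", "c", "d", "e"), 1, 2)); // prints [c, d]
    }
}
```

In the real service the paging would be pushed down to the database query so that only one page of variants is ever loaded.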

PULL REQUESTS

Adding an endpoint to search for variants using an identifier

This API makes it possible to search for variants using an identifier. Links that point to the annotation information and sources information for each variant were also added using HATEOAS.

PULL REQUESTS

Verify if the data pipeline to load mapping from geneId to its coordinates works

Since the code for this pipeline was already written, I only had to verify that the mappings get loaded into the database from a ".gtf" file using that code. I was able to verify this after making a minor fix in an example properties file.
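
For readers unfamiliar with the input, a GTF file is tab-separated with nine columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The sketch below shows how one gene line could be turned into a geneId-to-coordinates mapping. It is an assumption-laden illustration, not the eva-pipeline code: the class names are hypothetical and the example line is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract a geneId -> coordinates mapping from one GTF line.
class GtfGeneLineSketch {

    static final class GeneCoordinates {
        final String geneId;
        final String chromosome;
        final long start;
        final long end;

        GeneCoordinates(String geneId, String chromosome, long start, long end) {
            this.geneId = geneId;
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
        }
    }

    // Matches the gene_id attribute in the ninth column, e.g. gene_id "GENE_A";
    private static final Pattern GENE_ID = Pattern.compile("gene_id \"([^\"]+)\"");

    // Returns the mapping for a "gene" feature line, or null for comments,
    // other feature types and lines without a gene_id attribute.
    static GeneCoordinates parse(String line) {
        if (line.startsWith("#")) {
            return null;
        }
        String[] columns = line.split("\t");
        if (columns.length < 9 || !"gene".equals(columns[2])) {
            return null;
        }
        Matcher matcher = GENE_ID.matcher(columns[8]);
        if (!matcher.find()) {
            return null;
        }
        return new GeneCoordinates(matcher.group(1), columns[0],
                Long.parseLong(columns[3]), Long.parseLong(columns[4]));
    }

    public static void main(String[] args) {
        // Made-up line; real files carry more attributes in the last column.
        String line = "1\texample\tgene\t100\t200\t.\t+\t.\tgene_id \"GENE_A\"; gene_name \"A\";";
        GeneCoordinates gene = parse(line);
        System.out.println(gene.geneId + " -> " + gene.chromosome + ":" + gene.start + "-" + gene.end);
    }
}
```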

PULL REQUESTS

Redesign search for variants by geneId to follow RESTful principles and to use the geneId to coordinates mapping

This API makes it possible to search for variants using a list of geneIds. It uses the geneId-to-coordinates mapping to look up the coordinates of each gene and then calls the region endpoint to retrieve the corresponding variants. HATEOAS links and pagination were added to this endpoint as well.
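
The lookup step can be sketched as follows. This is only an illustration under stated assumptions: the mapping is held in an in-memory map instead of the database, the region string format is assumed, and unknown geneIds are silently skipped, which may differ from what the real endpoint does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: resolve geneIds to regions before delegating to a region search.
class GeneToRegionSketch {

    // Looks up each geneId in the mapping and keeps the regions that were found,
    // in the order the geneIds were requested.
    static List<String> regionsFor(Map<String, String> geneToRegion, List<String> geneIds) {
        List<String> regions = new ArrayList<>();
        for (String geneId : geneIds) {
            String region = geneToRegion.get(geneId);
            if (region != null) { // assumption: unknown genes are ignored
                regions.add(region);
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Made-up mapping, standing in for what the pipeline loads from the GTF file.
        Map<String, String> geneToRegion = new HashMap<>();
        geneToRegion.put("GENE_A", "1:100-200");
        geneToRegion.put("GENE_B", "2:500-900");

        List<String> regions = regionsFor(geneToRegion, Arrays.asList("GENE_B", "GENE_A"));
        // The joined string is what would be handed to the region search.
        System.out.println(String.join(",", regions)); // prints 2:500-900,1:100-200
    }
}
```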

PULL REQUESTS

Add descriptions for the parameters

Descriptions were added for the request parameters so that they are displayed on the Swagger page, making it easier for users to understand what each parameter means.

PULL REQUESTS

Add an endpoint to retrieve all the studies for a given species and assembly (Additional Goal)

This endpoint makes it possible to retrieve all studies for a given species and assembly. Pagination was added to this endpoint as well.

PULL REQUESTS

Added test for DatabaseInitializationJob (Additional Goal)

The task was to add unit tests for the DatabaseInitialization job in the eva-pipeline repository.

PULL REQUESTS

Added parameters validator for DatabaseInitializationJob (Additional Goal)

The task was to add parameter validation to the DatabaseInitialization job in the eva-pipeline repository.
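
The general idea of such a validator is to fail fast, before the job starts, when a required parameter is missing. The sketch below shows that idea in plain Java. The parameter names are invented for the example and are not the ones used by the job; in a Spring Batch application this kind of check would normally live in a class implementing the framework's `JobParametersValidator` interface.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of fail-fast job parameter validation.
class JobParametersCheckSketch {

    // Invented parameter names, used only for illustration.
    static final List<String> REQUIRED = Arrays.asList("db.name", "input.gtf");

    // Throws IllegalArgumentException naming the first missing or blank parameter.
    static void validate(Map<String, String> parameters) {
        for (String name : REQUIRED) {
            String value = parameters.get(name);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException("Missing required job parameter: " + name);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new HashMap<>();
        parameters.put("db.name", "eva_test");
        parameters.put("input.gtf", "genes.gtf");
        validate(parameters); // passes silently

        parameters.remove("input.gtf");
        try {
            validate(parameters);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Missing required job parameter: input.gtf
        }
    }
}
```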

PULL REQUESTS

Conclusion

My project was mostly about designing APIs based on RESTful principles and implementing them. Coding the Beacon API was both challenging and interesting. The mentors were very helpful, and their code reviews were very detailed, so I was able to learn a lot from them. Implementing the HATEOAS features was challenging and caused a lot of issues at first, but with the help of the mentors I was able to fix them gradually.

The work is complete and I am really looking forward to contributing a lot more to the project.
