The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting, and interoperable sharing of genomic data.
The European Variation Archive is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser or the API.
The European Variation Archive (EVA) would like to improve the alignment of its genomic variation search APIs with the corresponding standards defined by GA4GH.
The Beacon project was launched in 2014 to show the willingness of researchers to enable the secure sharing of genomic data from participants of genomic studies. Beacons are web servers that answer questions like "Does your dataset include a genome that has a specific nucleotide (e.g. G) at a specific genomic coordinate (e.g. Chromosome:1, Start:111, End:111)?", to which the Beacon must respond with yes or no, without referring to a specific individual.
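To make the yes/no contract concrete, here is a minimal sketch of a beacon-style lookup. It is only an illustration: the class `BeaconSketch`, its method names, and the in-memory set are assumptions made for this example, and they are not the classes defined by the Beacon specification or used in eva-server, which queries MongoDB.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical in-memory beacon, for illustration only.
class BeaconSketch {
    // Each known allele is stored under the key "chromosome:start:alternateBases".
    private final Set<String> knownAlleles = new HashSet<>();

    void addAllele(String chromosome, long start, String alternateBases) {
        knownAlleles.add(key(chromosome, start, alternateBases));
    }

    // Answers only whether the allele exists; nothing about samples or individuals is exposed.
    boolean exists(String chromosome, long start, String alternateBases) {
        return knownAlleles.contains(key(chromosome, start, alternateBases));
    }

    // Bases are upper-cased so that "g" and "G" refer to the same allele.
    private static String key(String chromosome, long start, String alternateBases) {
        return chromosome + ":" + start + ":" + alternateBases.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        BeaconSketch beacon = new BeaconSketch();
        beacon.addAllele("1", 111, "G");
        System.out.println(beacon.exists("1", 111, "G")); // prints true
        System.out.println(beacon.exists("1", 111, "T")); // prints false
    }
}
```

The response is deliberately a bare boolean, which mirrors the idea that a beacon confirms presence or absence without revealing anything else about the dataset.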
The previous Beacon implementation in “eva-server” was based on an outdated version of the Beacon specification, so one goal of this project was to upgrade it to match the latest specification.
Another goal was to redesign the existing API for searching genomic variations according to REST principles. Previously, this API returned a very large amount of data even for a small region, and the response was cluttered and difficult to analyze. The redesign ensures that only the information requested by the user is returned.
The last goal was to make searching for variants in a gene more efficient. Searching directly by gene is not efficient, so the idea was to implement a data pipeline that loads a mapping from each gene ID to its coordinates in a genome. This mapping can then be used to retrieve the coordinates of a particular gene and to search for variants using those coordinates.
- Junaid N Z
- Cristina Yenyxe Gonzalez - Mentor
- Andres Felipe Silva Valencia - Mentor
- Jose Miguel Mut Lopez - Mentor
- Java 8: As the implementation language.
- MongoDB: As the datastore.
- Maven: As the build automation tool.
- Spring Web MVC: As the server infrastructure.
- HATEOAS: To provide information dynamically through hypermedia.
- Spring Batch: As the framework for batch processing tasks.
My contribution to the Project is as follows:
I have completed the following tasks during the GSoC coding period.
This was the first task of my project, and it was really interesting. The task was quite big and had to be split across several pull requests. Several new classes were needed to represent the entities described in the Beacon specification. Although I initially wrote all of these classes by hand, on the mentors' suggestion they were replaced by code autogenerated with the "swagger codegen" tool. In addition, several functions that this endpoint uses to access the database were implemented in the variation-commons repository. Feedback on the issues we faced with the specification was submitted as well.
- Add required functions and classes to variation-commons
- Refactored Beacon Response classes to variation-commons
- Replace the new classes by auto-generated code
- Minor bug fix to retrieve all the required fields from the database
- Add the code for beacon endpoint to eva-ws
- Minor Bug fix
- Bug Fix and optimization for beacon query
Redesigning the endpoint for searching for variant information using variant coordinates and alleles
This endpoint makes it possible to search for variant information using variant coordinates and alleles. The main endpoint was further divided into three endpoints: one that retrieves only the core variant information, one that retrieves only the annotation information, and one that retrieves only the sources information for the given variantId. Links were added to the response using HATEOAS so that a user can easily traverse these sub-endpoints.
- Redesign search for variant information using variantId to follow RESTful principles
- Remove the use of QueryResponse class in the response
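The idea of linking the core variant resource to its sub-endpoints can be sketched without any framework. Everything in the example below is an assumption made for illustration: the class `VariantLinksSketch`, the URL paths, the link names, and the `chromosome:position:reference:alternate` identifier format are hypothetical and may differ from what eva-ws actually exposes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that builds "relation -> URL" links for one variant.
class VariantLinksSketch {

    // The path layout (/variants/{id}/annotations, /variants/{id}/sources) is an assumption.
    static Map<String, String> linksFor(String baseUrl, String variantId) {
        String self = baseUrl + "/variants/" + variantId;
        Map<String, String> links = new LinkedHashMap<>();
        links.put("self", self);                         // core variant information
        links.put("annotation", self + "/annotations");  // annotation information only
        links.put("sources", self + "/sources");         // sources information only
        return links;
    }

    public static void main(String[] args) {
        // "1:111:A:G" is an illustrative identifier built from coordinates and alleles.
        Map<String, String> links = linksFor("https://example.org/api", "1:111:A:G");
        for (Map.Entry<String, String> link : links.entrySet()) {
            System.out.println(link.getKey() + " -> " + link.getValue());
        }
    }
}
```

A client that receives these links alongside the core variant information can follow them instead of constructing URLs itself, which is the point of the hypermedia approach.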
This API makes it possible to search for variants in one or more regions at once, along with various other search parameters. Links were also added to the response for each variant, pointing to the "VariantServer" endpoint that provides detailed information on each individual variant; HATEOAS was used to achieve this. A pagination feature was also added to this endpoint.
- Redesign search for variants by region endpoint to follow RESTful principles
- Add pagination feature to region endpoint
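As a rough illustration of what pagination involves, the sketch below slices a result list into zero-based pages and builds navigation links for the current page. The class `PageSketch` and the query parameter names `pageNumber` and `pageSize` are assumptions for this example, not necessarily the names used by the region endpoint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pagination helper, for illustration only.
class PageSketch {

    // Returns the items on the given zero-based page, or an empty list past the end.
    static <T> List<T> slice(List<T> items, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        long from = (long) pageNumber * pageSize;
        if (from >= items.size()) {
            return new ArrayList<>();
        }
        int to = (int) Math.min(from + pageSize, items.size());
        return new ArrayList<>(items.subList((int) from, to));
    }

    // Builds "self", "prev" and "next" links; "prev" and "next" appear only when such a page exists.
    static Map<String, String> links(String baseUrl, int pageNumber, int pageSize, long totalItems) {
        long totalPages = (totalItems + pageSize - 1) / pageSize; // ceiling division
        Map<String, String> links = new LinkedHashMap<>();
        links.put("self", url(baseUrl, pageNumber, pageSize));
        if (pageNumber > 0) {
            links.put("prev", url(baseUrl, pageNumber - 1, pageSize));
        }
        if (pageNumber + 1 < totalPages) {
            links.put("next", url(baseUrl, pageNumber + 1, pageSize));
        }
        return links;
    }

    private static String url(String baseUrl, int pageNumber, int pageSize) {
        return baseUrl + "?pageNumber=" + pageNumber + "&pageSize=" + pageSize;
    }

    public static void main(String[] args) {
        List<Integer> variants = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(slice(variants, 1, 2)); // prints [3, 4]
        System.out.println(links("/regions", 1, 2, variants.size()).keySet()); // prints [self, prev, next]
    }
}
```

Returning the navigation links with each page means a client can walk through a large result set without knowing how page numbers are encoded.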
This API makes it possible to search for variants using an identifier. Links pointing to the annotation information and sources information for each variant were also added using HATEOAS.
Since the code for this pipeline was already written, I only had to verify that the mappings get loaded into the database from a ".gtf" file. I was able to verify this after making a minor fix in an example properties file.
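For readers unfamiliar with the format, a GTF file has nine tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes), and the gene ID is stored in the attributes column as `gene_id "..."`. The sketch below shows how a gene-to-coordinates mapping could be extracted from one such line. The class `GtfGeneLine` is hypothetical and is not the existing pipeline code, and the sample line uses a made-up gene ID.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value object holding one gene-to-coordinates mapping read from a GTF line.
class GtfGeneLine {
    private static final Pattern GENE_ID = Pattern.compile("gene_id \"([^\"]+)\"");

    final String geneId;
    final String chromosome;
    final long start;
    final long end;

    GtfGeneLine(String geneId, String chromosome, long start, long end) {
        this.geneId = geneId;
        this.chromosome = chromosome;
        this.start = start;
        this.end = end;
    }

    // Returns null for comment lines, malformed lines, and features other than "gene".
    static GtfGeneLine parse(String line) {
        if (line.startsWith("#")) {
            return null;
        }
        String[] columns = line.split("\t");
        if (columns.length < 9 || !"gene".equals(columns[2])) {
            return null;
        }
        Matcher matcher = GENE_ID.matcher(columns[8]);
        if (!matcher.find()) {
            return null;
        }
        return new GtfGeneLine(matcher.group(1), columns[0],
                Long.parseLong(columns[3]), Long.parseLong(columns[4]));
    }

    public static void main(String[] args) {
        // Illustrative line with a made-up gene ID.
        String line = "1\ttest\tgene\t100\t200\t.\t+\t.\tgene_id \"GENE0001\"; gene_name \"demo\";";
        GtfGeneLine gene = parse(line);
        System.out.println(gene.geneId + " -> " + gene.chromosome + ":" + gene.start + "-" + gene.end);
        // prints GENE0001 -> 1:100-200
    }
}
```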
Redesign search for variants by geneId to follow RESTful principles and to use the geneId to coordinates mapping
This API makes it possible to search for variants using a list of geneIds. It uses the geneId-to-coordinates mapping to look up each gene's coordinates and then calls the region endpoint to retrieve the corresponding variants. HATEOAS and pagination features were added to this endpoint as well.
- Added functions needed to retrieve data from the database
- Redesign search for variants by a list of geneIds
- Add additional search parameters to gene endpoint
- Minor Bug Fix
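The flow from gene IDs to a region search can be sketched as follows. This is a simplified illustration under stated assumptions: the class `GeneRegionSketch`, the in-memory map standing in for the database mapping, and the `chromosome:start-end` region notation are all hypothetical, and silently skipping unknown gene IDs is a choice made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical lookup from gene IDs to region strings, for illustration only.
class GeneRegionSketch {
    // Stands in for the geneId-to-coordinates mapping stored in the database.
    private final Map<String, String> regionByGeneId = new HashMap<>();

    void addMapping(String geneId, String chromosome, long start, long end) {
        regionByGeneId.put(geneId, chromosome + ":" + start + "-" + end);
    }

    // Translates gene IDs into regions, keeping input order and skipping unknown IDs.
    List<String> regionsFor(List<String> geneIds) {
        List<String> regions = new ArrayList<>();
        for (String geneId : geneIds) {
            String region = regionByGeneId.get(geneId);
            if (region != null) {
                regions.add(region);
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        GeneRegionSketch mapping = new GeneRegionSketch();
        mapping.addMapping("GENE0001", "1", 100, 200);
        mapping.addMapping("GENE0002", "2", 500, 900);
        List<String> regions = mapping.regionsFor(Arrays.asList("GENE0001", "UNKNOWN", "GENE0002"));
        // The joined string is what a region search could then receive.
        System.out.println(String.join(",", regions)); // prints 1:100-200,2:500-900
    }
}
```

Resolving genes to coordinates first means the variant search itself only ever deals with regions, so the gene endpoint can reuse the region search rather than duplicating it.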
Descriptions for the request parameters were added so that they are displayed on the Swagger page, making it easier for users to understand what the parameters mean.
This endpoint makes it possible to retrieve all studies for a given species and assembly. A pagination feature was added to this endpoint as well.
- Added functions needed to retrieve required data from the database
- Added the endpoint to retrieve all the studies
The task was to add unit tests for the DatabaseInitialization job in the eva-pipeline repository.
The task was to add parameter validation to the DatabaseInitialization job in the eva-pipeline repository.
My project was mostly about designing APIs based on RESTful principles and coding them. Coding the Beacon API was very challenging as well as interesting. The mentors were very helpful, the code reviews were very detailed, and I was able to learn a lot from them. Implementing the HATEOAS feature was challenging and caused a lot of issues at first, but with the help of the mentors I was able to fix them gradually.
The work is complete, and I am really looking forward to contributing a lot more to the project.