VCF Streaming API: Server and Specification - GSoC 2017

About the organization

The Global Alliance for Genomics and Health (GA4GH)

The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting and interoperable sharing of genomic data.

European Variation Archive


The European Variation Archive is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser.

About the project

The htsget retrieval API is a GA4GH specification for bulk transfer of genomic data. It is a streaming API that lets users query remote services for regions of interest, instead of downloading gigabytes of data and filtering them afterwards. The results are retrieved in formats that can be consumed by widely used bioinformatics tools.

The specification currently supports the BAM and CRAM formats. Variant Call Format (VCF) is the standard for describing genomic variation in bioinformatics. This project extends the specification to support the VCF format and provides an implementation.

Team

  • Amila Manoj Silva
  • Cristina Yenyxe Gonzalez - Mentor
  • Pablo Arce Garcia - Mentor

Design

The client starts the interaction by calling the /variants/<id> GET method of the API.

The response body (also called the JSON ticket) contains a set of URLs for retrieving data blocks in VCF format. The URLs returned by this first GET call are self-describing, for example: <domain>/variants?chr=20:1000-2000&studies=6,7. Because each URL encodes the full query, the server can serve it without storing any hash-like lookup structure internally.

The client then fetches the data chunks either sequentially or in parallel. For each chunk, the server retrieves the corresponding variation data from the datastore, converts it to VCF and streams it to the client over HTTP GET.
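As a rough illustration of this flow, here is a minimal client sketch (hypothetical code, not part of the project). It requests the ticket, pulls the URLs out with a naive regular expression instead of a proper JSON parser, and downloads the header and blocks in order, concatenating them into a complete VCF.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal htsget client sketch (hypothetical, for illustration only).
public class HtsgetClientSketch {

    public static void main(String[] args) throws Exception {
        // Ask for the JSON ticket describing the requested region.
        String ticket = get("http://localhost:8080/variants/7"
                + "?format=VCF&species=hsapiens&referenceName=22&start=66000&end=67600");

        // Extract the "url" entries from the ticket (a real client would use a JSON parser).
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"").matcher(ticket);
        while (m.find()) {
            urls.add(m.group(1));
        }

        // Fetch the header and the body blocks sequentially; concatenating the
        // responses yields a complete VCF for the requested region. The ticket
        // URLs returned by this server omit the scheme, so it is prefixed here.
        StringBuilder vcf = new StringBuilder();
        for (String url : urls) {
            vcf.append(get("http://" + url));
        }
        System.out.println(vcf);
    }

    private static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}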

Technologies

  • Java 8: As the implementation language.
  • Springfox with Swagger: For generating the API specification.
  • HTSJDK: An implementation of a unified Java library for accessing common file formats, such as SAM and VCF, used for high-throughput sequencing data.
  • OpenCGA Storage: For querying genomic data from the datastore.
  • Spring Web MVC: As the server infrastructure including streaming support.
  • MongoDB: As the datastore.
  • Maven: As the build automation tool.

Message flow

[Message flow diagram]

Code

Pull requests

HtsGet Spec

samtools/hts-specs#233

This PR contains the modifications to the htsget specification that add support for the VCF format. It extends the spec with a new method: Get Variants by ID.

HtsGet Endpoint

EBIvariation/eva-tools#27

This is the PR that contains the main changes to the server. It adds new endpoints that extend the REST API with support for VCF streaming: splitting the requested region into blocks and responding with the JSON ticket, plus the endpoints that stream the VCF header and the blocks of the VCF body. An initial query looks like, for example:

http://localhost:8080/variants/7?format=VCF&species=hsapiens&referenceName=22&start=66000&end=67600

And if the block size is configured as 250, the response JSON ticket would be:

{
	"htsget": {
		"urls": [{
			"url": "localhost:8080/variants/headers?species=hsapiens&studies=7"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:66000-66249"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:66250-66499"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:66500-66749"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:66750-66999"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:67000-67249"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:67250-67499"
		}, {
			"url": "localhost:8080/variants/block?studies=7&species=hsapiens&region=22:67500-67600"
		}]
	}
}

The first URL returns the VCF header, and the following URLs return blocks of the VCF body as streaming responses.
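The block-splitting behaviour can be sketched as follows (illustrative code only; the class and method names are hypothetical and do not come from the endpoint implementation). For the region 22:66000-67600 with a block size of 250 it reproduces the list of URLs in the ticket above.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the block-splitting behaviour described above.
public class TicketBuilderSketch {

    static List<String> buildUrls(String host, String species, String studies,
                                  String chromosome, long start, long end, long blockSize) {
        List<String> urls = new ArrayList<>();
        // The first URL points to the VCF header for the requested studies.
        urls.add(host + "/variants/headers?species=" + species + "&studies=" + studies);

        // Split [start, end] into consecutive, inclusive blocks of blockSize positions.
        for (long blockStart = start; blockStart <= end; blockStart += blockSize) {
            long blockEnd = Math.min(blockStart + blockSize - 1, end);
            urls.add(host + "/variants/block?studies=" + studies
                    + "&species=" + species
                    + "&region=" + chromosome + ":" + blockStart + "-" + blockEnd);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints the same URLs as the JSON ticket shown above.
        buildUrls("localhost:8080", "hsapiens", "7", "22", 66000, 67600, 250)
                .forEach(System.out::println);
    }
}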

Support for partial VCF writing

samtools/htsjdk#904

Writing blocks of the VCF body without the header is necessary for streaming consecutive data blocks. We use HTSJDK, a widely used Java library in bioinformatics, to convert the variant data into VCF format and write it to the stream. However, the library lacked support for partial writing: to write a VCF body, the header had to be written first.

The PR adds this support to HTSJDK and completes that prerequisite. The feature is also useful for other use cases where writing individual data blocks is necessary, since the support was added for other formats such as BCF as well, not just VCF.
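A minimal sketch of header-less block writing is shown below. It assumes the new method on VariantContextWriter is called setHeader, as discussed in the pull request; check the merged HTSJDK API for the exact name and signature.

import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.variantcontext.VariantContextBuilder;
import htsjdk.variant.variantcontext.writer.Options;
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
import htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder;
import htsjdk.variant.vcf.VCFHeader;

import java.io.OutputStream;
import java.util.Arrays;
import java.util.Collections;

// Sketch of writing a block of the VCF body without a header (assumed setHeader API).
public class PartialVcfWriterSketch {

    // Writes one block of variant records to the stream without emitting the header,
    // so consecutive block responses can be concatenated after a single header response.
    static void writeBlock(OutputStream out, VCFHeader header, Iterable<VariantContext> variants) {
        VariantContextWriter writer = new VariantContextWriterBuilder()
                .setOutputVCFStream(out)
                .unsetOption(Options.INDEX_ON_THE_FLY) // a streamed response cannot be indexed
                .build();
        // Assumed new method: the writer still needs the header to encode records,
        // but setHeader(...) does not write the header lines to the stream
        // (writeHeader(...) would both set and write them).
        writer.setHeader(header);
        for (VariantContext vc : variants) {
            writer.add(vc);
        }
        writer.close();
    }

    public static void main(String[] args) {
        VariantContext vc = new VariantContextBuilder("sketch", "22", 66370, 66370,
                Arrays.asList(Allele.create("A", true), Allele.create("G", false)))
                .make();
        // Writes a single data line (no "##" meta lines, no "#CHROM" line) to stdout.
        writeBlock(System.out, new VCFHeader(), Collections.singletonList(vc));
    }
}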

Annotations

EBIvariation/eva-tools#25

This PR adds support for including variant annotations, such as consequence types, in the VCF output.

Variant data can also include consequence annotations from Ensembl VEP, such as Allele, Consequence, SYMBOL, Gene, Feature, BIOTYPE, cDNA_position and CDS_position. Consequences are added to the INFO field of the VCF file under the key "CSQ". The data fields are separated by "|", and the order of the fields is declared in the VCF header.

This document lists how VEP fields map to EVA fields.

The PR iterates through the consequences annotated on each variant, constructs the "CSQ" attribute and adds it to the INFO field. It also adds the INFO line to the header, specifying the order of the fields.

Header info-line example:

Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Gene|Feature|BIOTYPE|cDNA_position|CDS_position

CSQ attribute example:

A|regulatory_region_ablation&3_prime_UTR_variant|gene|ensembleGeneId|EnsembleTransId|bioType|10|10,A|feature_elongation|gene2||EnsembleTransId2||20|20
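A rough sketch of how such a value can be assembled (hypothetical helper code, not the EVA classes): the fields of one consequence entry are joined with "|", missing fields are left empty, and multiple entries for the same variant are separated by commas.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of assembling a CSQ attribute value.
public class CsqBuilderSketch {

    // One consequence entry; null fields are written as empty strings,
    // as in the second entry of the example above.
    static String encodeEntry(List<String> fields) {
        return fields.stream()
                .map(f -> f == null ? "" : f)
                .collect(Collectors.joining("|"));
    }

    // Multiple consequence entries for the same variant are separated by commas.
    static String encodeCsq(List<List<String>> entries) {
        return entries.stream()
                .map(CsqBuilderSketch::encodeEntry)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Fields in the header-declared order:
        // Allele|Consequence|SYMBOL|Gene|Feature|BIOTYPE|cDNA_position|CDS_position
        List<String> first = Arrays.asList("A", "regulatory_region_ablation&3_prime_UTR_variant",
                "gene", "ensembleGeneId", "EnsembleTransId", "bioType", "10", "10");
        List<String> second = Arrays.asList("A", "feature_elongation",
                "gene2", null, "EnsembleTransId2", null, "20", "20");
        // Prints the CSQ value shown in the example above.
        System.out.println(encodeCsq(Arrays.asList(first, second)));
    }
}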

Related issues

Conclusion

Challenges

Working with third-party libraries

We use HTSJDK, a widely used Java library for bioinformatics, to convert the variant data into VCF format. As mentioned above, the library lacked support for writing data partially without first writing the header, so we had to add it. This was not straightforward: the library is somewhat complex, and there were several possible approaches (via the builder class, by modifying the constructor, or by introducing a new public method in the writer API). After discussing with the HTSJDK developers, we decided to add a new method that sets the VCF header without writing it, so that blocks can be written to the stream on their own. The code went through multiple review rounds to make sure it does not break any existing functionality, and several test cases were added to verify that it works correctly.

Working with test data

The test data available with the server is sometimes incomplete. For example, only a few variants in the test data had annotations, and those annotations had very few fields. We addressed this by writing unit tests that verify the constructed "CSQ" attribute against known fields, and then verifying the output with real data before merging the pull request.

Work left

EDIT 27.09.2017: The HtsGet Endpoint PR has been merged. All work from this project is now included in the official repositories.

EDIT 12.09.2017: The HTSJDK PR has been merged. The HtsGet Endpoint PR will be merged soon!

The HTSJDK pull request is still to be merged: it has gone through multiple review rounds and we are waiting for further feedback from the developers, who have been slow to respond. This PR must be merged before the HtsGet Endpoint PR can be merged.
