This document is the final report for my GSoC 2018 project and covers all the work I have done for the GA4GH project. A detailed description of the project is available on the project page.
The goal of the project was to create an API for discovering BioSamples records using the GA4GH metadata schema and for streaming sequencing data back from ENA via the htsget protocol. There was also an additional objective: providing Phenopackets export.
The Phenopackets serialisation is complete and merged into BioSamples. You can see all the code I've produced in the pull request. More details on the specific additions are available in the links provided below.
What I did:
- Phenopacket serialisation endpoint
- Services to support phenopacket serialisation
- GA4GH services (the same ones used for the GA4GH search)
How to use it:
- Fetch the code from the phenopacket_integration branch
- Run it using docker-webapp.sh and docker-agents.sh
- If you have no samples in your local BioSamples instance, submit some according to the instructions.
- Run example: http://localhost:8081/biosamples/samples/SAMEA100000.phenopacket
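The same request can also be issued from a script. The following is only a minimal Python sketch, assuming the webapp listens on localhost:8081 as in the example URL above. The helper names `phenopacket_url` and `fetch_phenopacket` are hypothetical and not part of BioSamples, and the response body is returned as plain text without assuming anything about its structure.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL taken from the example request above (local Docker setup).
BASE_URL = "http://localhost:8081/biosamples"


def phenopacket_url(accession, base_url=BASE_URL):
    """Build the phenopacket export URL for a sample accession."""
    return f"{base_url.rstrip('/')}/samples/{quote(accession, safe='')}.phenopacket"


def fetch_phenopacket(accession, base_url=BASE_URL, timeout=10):
    """Fetch the exported phenopacket as text (needs a running local instance)."""
    with urlopen(phenopacket_url(accession, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the Docker containers running, `print(fetch_phenopacket("SAMEA100000"))` should print the export for that sample, provided the sample exists locally.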
I was able to complete the task of building an API to query BioSamples using GA4GH metadata. However, this API relies on the ENA htsget service, which is not deployed in production yet. For this reason, it is not possible to merge my code into the BioSamples repository at the moment. You can see all the code I've produced in the pull request. More details on the specific additions are available in the links provided below.
What I did:
- GA4GH services: I'm providing this module as part of the pull request mentioned above.
- GA4GH resource assembler
- GA4GH searching controller
- Htsget services
- ENA htsget service
- Models for htsget service
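For context, an htsget client works in two steps: it first requests a ticket (a JSON document listing one or more URLs), then downloads those URLs in order and concatenates the bodies into the final file. The sketch below illustrates only the second step in Python. It is not the code from the pull request, and `download_from_ticket` is a hypothetical helper name. It assumes the ticket follows the htsget layout: a top-level `htsget` object with `urls`, optional per-URL `headers`, and an optional `format` that defaults to BAM.

```python
import json
from urllib.request import Request, urlopen


def download_from_ticket(ticket_json):
    """Fetch every URL listed in an htsget ticket and join the blocks in order.

    Returns a (format, data) tuple. Both http(s) URLs and inline data: URIs
    are handled by urllib's default opener.
    """
    ticket = json.loads(ticket_json)["htsget"]
    chunks = []
    for entry in ticket["urls"]:
        # Headers given in the ticket must be sent with the request for that URL.
        request = Request(entry["url"], headers=entry.get("headers", {}))
        with urlopen(request) as response:
            chunks.append(response.read())
    # The format field is optional; BAM is the default when it is absent.
    return ticket.get("format", "BAM"), b"".join(chunks)
```

Joining the blocks strictly in ticket order matters, because a server may split one file across several URLs.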
What remains to do:
- Merge the pull request into BioSamples
- Deploy the htsget service and replace the dummy link, which currently points to a localhost test instance, with a link to the real host (marked by TODO comments)
Link to the original repository
This piece of the project is the implementation of the htsget protocol for ENA. The protocol specifications are available here. I've completed the ENA htsget service but, as mentioned above, it is not yet merged into the EGA-data project or deployed in production. You can see all the code I've produced in the pull request. More details on the specific additions are available in the links provided below.
What I did:
- Ticket controller - returns tickets by accession
- File controller - streams BAM or CRAM files
- Download service - streams data from FTP servers of ENA
- Ticket service - gets the link to the FASTQ files and some additional data (file size and md5 hash) for the provided accession
- FASTQ converter - converts FASTQ files to BAM or CRAM format
- Ticket serializer - serialises ticket according to htsget specifications
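To make the ticket serialiser's job more concrete, here is a minimal Python sketch of the JSON layout the htsget specification prescribes for a ticket. It is an illustration under stated assumptions, not the serialiser from the pull request: `build_ticket` and `serialise_ticket` are hypothetical names, and the sketch only accepts the two formats mentioned in this report (BAM and CRAM).

```python
import json

# Assumption: only the read formats discussed in this report are supported.
SUPPORTED_FORMATS = ("BAM", "CRAM")


def build_ticket(file_format, urls, md5=None):
    """Build an htsget ticket as a plain dictionary.

    `urls` is an ordered list of download URLs and `md5` is the optional
    checksum of the complete file.
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {file_format}")
    ticket = {"format": file_format, "urls": [{"url": url} for url in urls]}
    if md5 is not None:
        ticket["md5"] = md5
    # Everything is wrapped in a top-level "htsget" object.
    return {"htsget": ticket}


def serialise_ticket(file_format, urls, md5=None):
    """Serialise the ticket to a JSON string."""
    return json.dumps(build_ticket(file_format, urls, md5), indent=2)
```

For example, `serialise_ticket("BAM", ["http://localhost:8080/placeholder"])` produces a ticket with a single URL. The host and path here are placeholders, not the real service address.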
What remains to do:
- Merge my pull request into the EGA-dataedge repository
- Deploy the service in production
- Update all hosts in the ENA htsget service to the real service hosts (marked by TODO comments)