Dilschat/final_submission_description.md

## final_submission_description.md

      
GSOC 2018 GA4GH project submission description

This document is the final report for my GSOC 2018 project and covers all the work I did for the GA4GH project.
A detailed description of the project is available on the project page.
Biosamples

The goal of the project was to create an API for searching BioSamples using the GA4GH metadata schema and for streaming sequencing data back from ENA via the htsget protocol. An additional objective was to provide Phenopackets export.
Phenopackets serialisation

The Phenopackets serialisation is complete and merged into BioSamples.
You can see all the code I've produced in the pull request.
More details on the specific additions are available in the links provided below.
What I did:

Phenopacket serialisation endpoint
Services to support phenopacket serialisation

Phenopacket exporter
OLS data retriever


GA4GH services (the same ones used for the GA4GH search)

How to use it:

Fetch the code from the phenopacket_integration branch
Run it using docker-webapp.sh and docker-agents.sh
If your local BioSamples instance has no samples, submit some according to the instructions.
Run example:
http://localhost:8081/biosamples/samples/SAMEA100000.phenopacket
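The same request can be made from a script. The following is a minimal sketch, assuming the webapp is reachable at `http://localhost:8081/biosamples` as in the example URL above. The helper names `phenopacket_url` and `fetch_phenopacket` are hypothetical and are not part of BioSamples; the response body is returned as plain text because its exact structure is not described here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of a locally running BioSamples webapp,
# taken from the example URL in this document.
BASE_URL = "http://localhost:8081/biosamples"


def phenopacket_url(accession: str, base_url: str = BASE_URL) -> str:
    """Build the phenopacket export URL for a sample accession."""
    return f"{base_url.rstrip('/')}/samples/{quote(accession, safe='')}.phenopacket"


def fetch_phenopacket(accession: str, base_url: str = BASE_URL) -> str:
    """Request the phenopacket for an accession and return the body as text.

    Requires the Docker services to be running and the sample to exist locally.
    """
    with urlopen(phenopacket_url(accession, base_url), timeout=30) as response:
        return response.read().decode("utf-8")
```

With the services running, `print(fetch_phenopacket("SAMEA100000"))` would print the exported phenopacket for that sample.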

GA4GH searching

I completed the task of building an API to query BioSamples using GA4GH metadata.
However, this API relies on the ENA htsget service, which is not deployed in production yet.
For this reason, it is currently not possible to merge my code into the BioSamples repository.
You can see all the code I've produced in the pull request. More details on the specific additions are available in the links provided below.
What I did:

GA4GH services: I'm providing this module as part of the pull request mentioned above.
GA4GH resource assembler
GA4GH searching controller
Htsget services

ENA htsget service
Models for htsget service

ENA ticket
Ticket deserializer
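To illustrate what a ticket model and deserializer have to handle, here is a minimal Python sketch (the project code itself is not shown here). It assumes the ticket layout defined by the htsget specification: a top-level `htsget` object with an optional `format` (BAM when absent), a list of `urls` with optional `headers`, and an optional `md5`. The class and function names `TicketUrl`, `Ticket` and `parse_ticket` are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TicketUrl:
    """One data block of a ticket: a URL plus optional request headers."""
    url: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Ticket:
    """Hypothetical model of an htsget ticket."""
    format: str
    urls: List[TicketUrl]
    md5: Optional[str] = None


def parse_ticket(text: str) -> Ticket:
    """Deserialize the JSON text of an htsget ticket into a Ticket."""
    body = json.loads(text)["htsget"]
    urls = [TicketUrl(u["url"], u.get("headers", {})) for u in body["urls"]]
    # The specification treats a missing format as BAM.
    return Ticket(format=body.get("format", "BAM"), urls=urls, md5=body.get("md5"))
```

A client would then request each entry in `urls` in order, sending the listed headers, and concatenate the responses to obtain the file.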


What remains to do:

Merge the pull request into BioSamples
Deploy the htsget service and replace the dummy localhost link used for testing with a link to the real host (marked by TODO comments)

Link to the original repository
Link to my fork
Htsget service (EGA-dataedge)

This part of the project is the implementation of the htsget protocol for ENA. The protocol specifications are available here.
I've completed the ENA htsget service but, as mentioned above, it is not yet merged into the EGA-data project or deployed in production.
You can see all the code I've produced in the pull request. More details on the specific additions are available in the links provided below.
What I did:

Ticket controller - returns tickets by accession
File controller - streams BAM or CRAM files
Download service - streams data from FTP servers of ENA
Ticket service - gets the link to the FASTQ files and some additional data (file size and md5 hash) for the provided accession
FASTQ converter - converts FASTQ files to BAM or CRAM format
Ticket serializer - serialises ticket according to htsget specifications
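The ticket side of these components can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service code: it assumes the ticket JSON layout from the htsget specification (`htsget` object with `format`, `urls` and optional `md5`), and the function names `md5_of_stream` and `build_ticket` are hypothetical. The first function computes an MD5 checksum over a stream in chunks, the second serialises a single-URL ticket.

```python
import hashlib
import io
import json
from typing import BinaryIO, Optional


def md5_of_stream(source: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Compute the MD5 hex digest of a stream without loading it into memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: source.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def build_ticket(data_url: str, fmt: str = "BAM", md5: Optional[str] = None) -> str:
    """Serialise a ticket that points at a single data URL."""
    # Only the two formats mentioned in this document are accepted here.
    if fmt not in ("BAM", "CRAM"):
        raise ValueError(f"unsupported format: {fmt}")
    body = {"format": fmt, "urls": [{"url": data_url}]}
    if md5 is not None:
        body["md5"] = md5
    return json.dumps({"htsget": body})


# Usage with in-memory data; the URL is a placeholder, not a real service host.
checksum = md5_of_stream(io.BytesIO(b"example data"))
ticket_json = build_ticket("http://localhost/example.cram", "CRAM", checksum)
```

A client that receives such a ticket can recompute the checksum over the downloaded bytes and compare it with the `md5` field to detect a corrupted transfer.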

What remains to do:

Merge my pull request into the EGA-dataedge repository
Deploy the service in production

Replace all placeholder hosts in the ENA htsget services with the real service hosts (marked by TODO comments)


Link to the original repository
Link to my fork