@ekeilty17
Last active August 13, 2018 21:05
Global Filesystem and Search Engine for Genomics Data

An implementation of the DOS Server, a standard API created by GA4GH.

Developed as part of Google Summer of Code 2018.

Introduction

The Global Alliance for Genomics and Health (GA4GH) is an international, nonprofit alliance formed to accelerate the potential of research and medicine to advance human health. It has developed the Data Object Service (DOS), an emerging standard for specifying the location of data across different cloud environments. The goal of DOS is to create a generic API on top of existing object storage systems so that workflow systems can access data in a single, standard way regardless of where it is stored. The standard API is split into two parts: data object management and data object querying. The former is handled by a DOS Server, while the latter is handled by a DOS Registry (a service registry).

View the DOS Registry schemas in Swagger UI

View the DOS Server schemas in Swagger UI

My Projects

As part of Google Summer of Code 2018, I developed three projects from scratch: an implementation of a DOS Server, a wrapper that loads data from PGP Canada into a DOS Server database, and a wrapper that loads data from a public GCP bucket into a DOS Server database. These projects can be found at the following links:

In case these repositories are updated in the future, the commits intended for the GSoC 2018 final evaluation are labeled "Final GSoC Commit". Documentation on how to use each project can be found in the README.md of the respective GitHub repository.

Current State

The DOS Server uses the Spring Boot JPA framework connected to a MySQL database, with Keycloak authentication. My implementation has the following functionality (unless otherwise specified, anything implemented for a Data Object is also implemented for a Data Bundle):

  • GET all Data Objects
  • GET Data Object by id
  • GET all Data Objects by alias
  • Versioning of Data Objects
  • GET all versions of a Data Object
  • GET previous version of a Data Object by id
  • POST, PUT, DELETE Data Objects
  • Custom Pagination
  • Data Object endpoints require admin authentication
  • Data Bundle endpoints require user or admin authentication
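The "Custom Pagination" item above can be illustrated with a minimal sketch. This is not the code from the DOS Server repository: the class name `Paginator`, the `paginate` method, and the choice of an offset encoded as a string token are all assumptions made for illustration, loosely modeled on the page-size and page-token style of listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical offset-based paginator; names and token format are assumptions. */
final class Paginator {

    /** One page of results plus the token for the next page (null on the last page). */
    static final class Page<T> {
        public final List<T> items;
        public final String nextPageToken;

        Page(List<T> items, String nextPageToken) {
            this.items = items;
            this.nextPageToken = nextPageToken;
        }
    }

    /**
     * Returns up to pageSize items starting at the offset encoded in pageToken.
     * A null or empty token means "start from the first item".
     */
    static <T> Page<T> paginate(List<T> all, int pageSize, String pageToken) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        // Assumption: the token is simply the start offset written as a decimal string.
        int start = (pageToken == null || pageToken.isEmpty()) ? 0 : Integer.parseInt(pageToken);
        if (start < 0 || start > all.size()) {
            throw new IllegalArgumentException("pageToken out of range");
        }
        int end = Math.min(start + pageSize, all.size());
        // Only hand out a token when more items remain after this page.
        String next = end < all.size() ? Integer.toString(end) : null;
        List<T> items = Collections.unmodifiableList(new ArrayList<>(all.subList(start, end)));
        return new Page<>(items, next);
    }
}
```

A caller would request the first page with a null token and keep passing back `nextPageToken` until it is null. In a real Spring Boot JPA service the slicing would more likely be pushed down to the database query instead of being done on an in-memory list.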

The PGP Wrapper and GCP Wrapper are both functional and both successfully load data from their respective cloud environments into a DOS Server.

Future Development

| TODO | Current State |
| --- | --- |
| Keycloak authorization using access tokens | Not supported |
| Create a Docker image that automatically configures Keycloak and MySQL and deploys the DOS Server | A develop branch attempts this but contains bugs |
| Support other versioning schemes | The version number of a Data Object or Data Bundle must take the form x.x.x |
| system_metadata and user_metadata fields support key-value pairs with any arbitrary string as the key and any arbitrary object as the value | The key must be a string, and the value is serialized to a string regardless of its type |
| GET Data Object by checksum | Not supported |
| GET Data Bundle by checksum | Not supported |
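The x.x.x version constraint listed above could be checked along the following lines. This is a sketch only, not the server's actual validation: the class `VersionFormat` and its methods are hypothetical, and it assumes each x is a run of one or more decimal digits, which the description above does not spell out.

```java
import java.math.BigInteger;
import java.util.regex.Pattern;

/** Hypothetical helper for versions of the form x.x.x (assumed: x is one or more digits). */
final class VersionFormat {

    // Three dot-separated groups of ASCII digits, for example "1.0.12".
    private static final Pattern X_X_X = Pattern.compile("\\d+\\.\\d+\\.\\d+");

    /** Returns true only when the whole string matches x.x.x. */
    static boolean isValid(String version) {
        return version != null && X_X_X.matcher(version).matches();
    }

    /**
     * Compares two x.x.x versions numerically, component by component,
     * so that "1.10.0" sorts after "1.9.0".
     */
    static int compare(String a, String b) {
        if (!isValid(a) || !isValid(b)) {
            throw new IllegalArgumentException("versions must take the form x.x.x");
        }
        String[] partsA = a.split("\\.");
        String[] partsB = b.split("\\.");
        for (int i = 0; i < 3; i++) {
            // BigInteger avoids overflow on very long digit runs.
            int c = new BigInteger(partsA[i]).compareTo(new BigInteger(partsB[i]));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }
}
```

Comparing components numerically instead of as strings matters when listing all versions of a Data Object in order. Supporting other versioning schemes, as the TODO suggests, would mean replacing both the pattern and the comparison.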

Conclusion

Working on this project was an amazing experience. It was a great introduction to the tools currently used in the tech industry and taught me a lot about the inner workings of a startup company. I would like to thank the members of GA4GH and GSoC for giving me this opportunity. I would also like to thank Miro Cupak, Marc Fiume, and the rest of the DNAstack team for being so welcoming and for their help in completing this project.