Task Distribution Logic for GA4GH Cloud APIs

This gist is a summary of the work undertaken by my mentor akanitz and me during the 2019 Google Summer of Code under the Global Alliance for Genomics and Health.

Background

The Global Alliance for Genomics and Health aims to provide frameworks and standards for working with genomic and other health-related data. Thousands of healthcare practitioners and researchers access, process and analyse the data stored in various databases, and increasingly base prescriptions and treatment plans on the obtained results. As the actual data, as well as the tools and workflows used to generate them, are valuable commodities, it is important that they can be found, accessed and reused in an interoperable manner. The GA4GH standards promote and ensure these FAIR data principles throughout the field of biomedical data exchange and analysis.

The GA4GH organises its efforts into work streams that improve and assist the work of its driver projects. Of these, the Cloud Work Stream is the one most relevant to this project: it maintains a suite of API service specifications designed for the portable execution of computational workflows and the sharing of data and tools.

This project aims to provide a proof of concept for a task distribution logic based on the schemas developed by the Cloud Work Stream.

The Idea

Implementations based on this set of schemas and specifications provide the community with a means to standardise genomic data analysis: they offer APIs that let users select data and workflows, and execute these workflows and their individual tasks, via the Data Repository Service (DRS), Tool Registry Service (TRS), Workflow Execution Service (WES) and Task Execution Service (TES) schemas, respectively. This led to the idea of developing a middleware to optimise the current state of workflow execution.

As shown in the figure below, TEStribute receives a task and ranks the DRS and TES instances that are candidates for its execution based on time or cost.

[Figure: TEStribute]

Current State of Execution of Workflows

The figure below depicts the current state of workflow execution: the user can interact with any of the client-facing applications, such as the Data Repository Service (DRS), the Workflow Execution Service (WES) or the Tool Registry Service (TRS), to define workflows, tools and the required data objects, and let the workflow engine decide how the tasks defined in the workflow are to be executed. In this case, the WES breaks down the input from the client and distributes it over TES instances. The TES in turn interacts with the DRS to read and write data whenever necessary.

[Figure: Current state of workflow execution]

Proposed Concept of TEStribute

The Task Distribution Logic is envisioned as a middleware that helps users utilise their resources optimally by providing them with the best possible option (in terms of cost and time) for running each task. Its aim is to find the best combination of TES and DRS instances for each task. As a proof of concept for this idea, TEStribute has been developed.

Current State of TEStribute

This release (v0.1.0) of TEStribute was developed entirely during the 2019 Google Summer of Code. Mock services (mock-DRS and mock-TES) and their corresponding clients (DRS-cli and TES-cli) were developed alongside it, with the minimum functionality required for developing the Task Distribution Logic.

Objectives Achieved

As of today, the project consists of the following components:

  • mock-TES: Implementing the envisioned functionality of the TES required certain changes to the specifications; both the modified and the original specs are present in the mock-TES repository. The changes consist of two new endpoints: an info endpoint and an update endpoint that adds functionality for testing the service (the latter is NOT an addition to the spec and exists purely for benchmarking TEStribute). The two endpoints are listed below, followed by a usage sketch:

    • /tasks/task-info: relies on two models, tesResources as the input and tesTaskInfo as the response; i.e., it takes the task requirements as input and returns estimates for cost and time.
    • /update-config: updates the service configuration by modifying the values of variables in the config; i.e., it updates the unit costs and currency.
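
    A minimal sketch of how the task-info endpoint might be queried with Python's requests library (the HTTP method, the full URL and the request field names are assumptions inferred from the tesResources model described above, not confirmed against the mock-TES spec):

      import requests

      # hypothetical call to the /tasks/task-info endpoint of a mock-TES
      # instance; the payload mirrors the tesResources model (task
      # requirements as input)
      response = requests.post(
          "https://some.tes.service/tasks/task-info",
          json={
              "cpu_cores": 2,
              "ram_gb": 8,
              "disk_gb": 10,
              "execution_time_min": 300,
          },
      )

      # the response is expected to follow the tesTaskInfo model, i.e., to
      # contain the cost and time estimates for the task
      print(response.json())
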
  • mock-DRS: No modifications to the existing specs were required for this service. However, similar to mock-TES, an endpoint has been provided for updates:

    • /update-db: This endpoint updates the config, i.e., it populates the database with new data objects; examples can be found in the README.
  • Clients for DRS & TES: TES-cli and DRS-cli both use bravado to generate requests for mock-TES and mock-DRS, respectively. Both packages can be found on PyPI; the versions completed during the project period can be installed with the commands below, after which a hypothetical usage sketch follows:

    • Get the TES client release from PyPI:

      pip install tes_client==0.1.0
    • Get the DRS client release from PyPI:

      pip install drs_client==0.1.0
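
    A purely hypothetical usage sketch (the Client class and the method name are assumptions for illustration; the actual interface is documented in the DRS-cli repository):

      from drs_client import Client  # assumed entry point of the drs_client package

      # point the client at a (mock-)DRS instance and retrieve a data
      # object record by its identifier
      client = Client("https://some.drs.service/")
      data_object = client.get_object("id_input_file_1")  # hypothetical method name
      print(data_object)
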
  • TEStribute: TEStribute is the main repository. It exposes a single function, rank_services(), which requires a config file or explicitly defined inputs and returns an ordered list of TES and DRS instances according to the user-defined weighting of cost and time. An example call looks like this:

  rank_services(
    drs_ids=[
        "id_input_file_1",
        "id_input_file_2"
    ],
    resource_requirements={
        "cpu_cores": 2,
        "ram_gb": 8,
        "disk_gb": 10,
        "execution_time_min": 300
    },
    tes_uris=[
        "https://some.tes.service/",
        "https://another.tes.service/"
    ],
    drs_uris=[
        "https://some.drs.service/",
        "https://another.drs.service/"
    ],
    mode="cost",
    auth_header=None
)

It is not necessary to pass arguments for all parameters. Omitting an argument (or, equivalently, passing None) leads to the use of the corresponding default value defined in the config file. The response object will look like this:

{
    "rank": "integer",
    "TES": "TES_URL",
    drs_id_1: "DRS_URL",
    drs_id_2: "DRS_URL",
    ...
    "output_files": "DRS_URL",
    "drs_costs": "integer"
    "tes_costs": "integer"
}
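
Assuming rank_services() returns a list of such objects ordered by rank (an assumption for illustration, not confirmed by the spec), picking the preferred combination of services could look like this:

  # hedged sketch: assumes rank_services() returns a list of ranked
  # combinations shaped like the response object above
  ranked = rank_services(mode="cost")  # omitted parameters fall back to config defaults

  best = ranked[0]  # assumed to be the rank-1 entry
  print(best["TES"], best["tes_costs"], best["drs_costs"])
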

Scope for Improvement

There are several aspects that could be incorporated to improve the project's current state. Issues have been raised and are planned to be resolved soon. A few links to such issues can be found here:

Future Work

The European life-science infrastructure organisation ELIXIR is developing WES-ELIXIR, a language-agnostic, Python Flask-/Gunicorn-based WES microservice that wraps TES-compatible workflow engines behind a uniform WES API. Pluggable workflow engine support is planned and is to be implemented by wrapping the different engines so that one can be assigned to each workflow language and version. Support for task distribution is planned by pointing the workflow engine at a second microservice, proTES, which acts as a mock or "proxy" TES and allows the injection of task distribution logic (and other middleware) before it relays the TES request received from the workflow engine to the most suitable TES instance. Once benchmarks have been added and it has been verified that TEStribute indeed saves time and/or cost, the task distribution logic module could be integrated into proTES, where it would rank-order the TES instances available in a federated network and relay the original TES request to the most advantageous one via a built-in TES client, as shown in the image below.

[Figure: Task distribution middleware]
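
A purely conceptual sketch of the planned relay step described above (none of this is implemented yet; relay_task and forward_to_tes are invented names for illustration):

  # conceptual sketch of the planned proTES relay (future work, not implemented)
  def relay_task(tes_request, tes_uris, drs_uris):
      # rank the available TES/DRS combinations for the incoming task
      ranked = rank_services(
          resource_requirements=tes_request["resources"],
          tes_uris=tes_uris,
          drs_uris=drs_uris,
          mode="cost",
      )
      best = ranked[0]  # assumed to be the most advantageous combination
      # relay the original TES request to the selected TES instance
      return forward_to_tes(best["TES"], tes_request)  # forward_to_tes is hypothetical
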

My Journey

When I first began my journey as a Google Summer of Code applicant, I was inspired by the Global Alliance for Genomics and Health and its efforts towards the attainment of FAIR data practices. As I read the use cases and schema specifications, I found myself longing to contribute in any way possible toward the advancement of the goal, i.e., to provide users (anyone who wishes to run biological workflows) with a seamless experience.

While I was writing the project proposal for TEStribute, my mentor helped me understand the use case and the concept, and assisted me greatly not just in writing the proposal but also in recognising the importance of practicality during ideation. Another important takeaway from this period was the need for clarity when proposing an idea. I now understand that the initial commitment to simplicity, and to sticking to the basics, is what helped us bring this project to its current state and helped me maintain clarity about what the project is envisioned to achieve.

During the coding period, I worked on the creation of mock services based on the Task Execution Service and Data Repository Service schemas, as well as on clients to interact with these services. Both the services and the clients were built to test and develop TEStribute. My contributions to the same can be found at:

The time I spent working on TEStribute has helped me learn a lot. It has made me realise that, whether it be code or documentation, both must be written with the user or reader in mind. I have learnt to keep in mind that at every stage there are norms and standardisation practices, and that conforming to them adds quality to one's work. This project gave me the opportunity to immerse myself in the world of open source and to recognise the efforts of the many developers who work tirelessly to foster an open, welcoming and productive environment. TEStribute has been developed almost from scratch, and the experience of working on it has helped me realise my love for building code (and hopefully get better at it). I am grateful to have a mentor who not only guided me but also took out time to help me complete the project; his effort, patience and initiative have truly inspired me and made this project valuable. In this short period of time his mentoring has helped me learn more than I could have ever imagined. I will forever be grateful for this opportunity and will look back at this project fondly.

Acknowledgements

The project is a collaborative effort under the umbrella of the ELIXIR Cloud and AAI group. It was started during the 2019 Google Summer of Code as part of the Global Alliance for Genomics and Health.
