
Task Distribution Logic for GA4GH Cloud APIs

This gist is a summary of the work undertaken by my mentor akanitz and me during the 2019 Google Summer of Code under the Global Alliance for Genomics and Health.

Background

The Global Alliance for Genomics and Health aims to provide frameworks and standards for working with genomic and other health-related data. Thousands of healthcare practitioners and researchers access, process and analyse the data stored in various databases, and increasingly base prescriptions and treatment plans on the obtained results. As the actual data, as well as the tools and workflows used to generate them, are valuable commodities, it is important that they can be found, accessed and reused in an interoperable manner. The GA4GH standards promote and ensure these FAIR data principles throughout the field of biomedical data exchange and analysis.

The GA4GH organises its efforts into work streams that support its driver projects. Of these, the Cloud Work Stream is the most relevant to this project: it develops a suite of API service specifications designed for the portable execution of computational workflows and the sharing of data and tools.

This project provides a proof of concept for a task distribution logic built on the schemas developed by the Cloud Work Stream.

The Idea

Implementations based on this set of schemas and specifications give the community a means to standardise genomic data analysis: they provide APIs through which users can select data (Data Repository Service, DRS) and tools (Tool Registry Service, TRS), and execute workflows (Workflow Execution Service, WES) and individual tasks (Task Execution Service, TES). This led to the idea of developing a middleware to optimise the current state of workflow execution.

As shown in the image below, TEStribute receives a task and ranks the DRS and TES instances available for its execution based on estimated time and/or cost.

[Figure: TEStribute overview]

Current State of Execution of Workflows

The figure below depicts the current state of workflow execution: the user can interact with any of the client-facing applications, such as the Data Repository Service (DRS), the Workflow Execution Service (WES), or the Tool Registry Service (TRS), to define workflows, tools, and the required data objects, and let the workflow engine decide how the tasks defined in the workflow are executed. In this case, the WES breaks down the input from the client and distributes it over TES instances. The TES instances, in turn, interact with the DRS (to read as well as write data whenever necessary).

[Figure: current state of workflow execution]

Proposed Concept of TEStribute

The task distribution logic is envisioned as a middleware that helps users utilise their resources optimally by presenting them with the best possible option (in terms of cost and time) for running each task. Its aim is to find the best combination of TES and DRS instances for each task. As a proof of concept for this idea, TEStribute has been developed.
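
To make the ranking idea concrete, below is a minimal sketch of such a weighted ranking. The function name, weights, and input layout are assumptions made for this illustration and do not reflect TEStribute's actual internals.

    # Minimal sketch of ranking (TES, DRS) combinations by a weighted sum of
    # estimated cost and time. All names and the input layout are
    # illustrative assumptions, not TEStribute internals.
    def rank_combinations(estimates, weight_cost=0.5, weight_time=0.5):
        """Sort candidate combinations, best (lowest weighted score) first.

        estimates: list of dicts with keys 'tes', 'drs', 'cost' and 'time',
        e.g. assembled from the estimates that TES instances report.
        """
        return sorted(
            estimates,
            key=lambda e: weight_cost * e["cost"] + weight_time * e["time"],
        )

    ranked = rank_combinations(
        [
            {"tes": "https://some.tes.service/",
             "drs": "https://some.drs.service/", "cost": 12.0, "time": 45},
            {"tes": "https://another.tes.service/",
             "drs": "https://some.drs.service/", "cost": 9.5, "time": 80},
        ],
        weight_cost=1.0,  # rank by cost only (cf. mode="cost" below)
        weight_time=0.0,
    )
    print(ranked[0]["tes"])  # -> https://another.tes.service/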

Current State of TEStribute

This release (v0.1.0) of TEStribute was developed entirely during the 2019 Google Summer of Code. Mock services (mock-DRS and mock-TES) and their corresponding clients (DRS-cli and TES-cli) were developed as well, with the minimum functionality required for developing the task distribution logic.

Objectives Achieved

As of today, the project consists of the following components:

  • mock-TES : To implement the envisioned functionality of the TES, we required certain changes to the specifications. Both the modified and the original specs are present in the mock-TES repository. The changes include two new endpoints: an info endpoint and an update endpoint that adds functionality for testing the service (the latter is NOT an addition to the spec and exists purely for benchmarking TEStribute). The two endpoints are:

    • /tasks/task-info : relies on two models, tesResources as the input and tesTaskInfo as the response; i.e., it takes the task requirements as input and returns estimates for cost and time (a request sketch is given after the response example below).
    • /update-config : updates the service configuration by modifying the values of variables in the config, i.e., it updates the unit costs and currency.
  • mock-DRS : No modifications to the existing specs were required for this service. However, similar to mock-TES, an endpoint has been added for updates:

    • /update-db : updates the config, i.e., it populates the database with new data objects; examples can be found in the README.
  • Clients for DRS & TES : TES-cli and DRS-cli both use bravado to generate requests for mock-TES and mock-DRS, respectively. Both packages can be found on PyPI, and the versions completed during the project period can be installed with:

    • get the tes_client release from PyPI:

      pip install tes_client==0.1.0
    • get the drs_client release from PyPI:

      pip install drs_client==0.1.0
  • TEStribute : TEStribute is the main repository. It exposes a single function, rank_services(), which requires a config file or explicitly defined inputs and returns an ordered list of TES and DRS instances ranked according to the user-defined weights for cost and time. An example call looks like this:

  rank_services(
    drs_ids=[
        "id_input_file_1",
        "id_input_file_2"
    ],
    resource_requirements={
        "cpu_cores": 2,
        "ram_gb": 8,
        "disk_gb": 10,
        "execution_time_min": 300
    },
    tes_uris=[
        "https://some.tes.service/",
        "https://another.tes.service/"
    ],
    drs_uris=[
        "https://some.drs.service/",
        "https://another.drs.service/"
    ],
    mode="cost",
    auth_header=None
  )

It is not necessary to pass arguments for all parameters. Omitting any argument(s) leads to the use of the corresponding default values defined in the config file; passing None for a parameter has the same effect. The response object looks like this:

{
    "rank": "integer",
    "TES": "TES_URL",
    drs_id_1: "DRS_URL",
    drs_id_2: "DRS_URL",
    ...
    "output_files": "DRS_URL",
    "drs_costs": "integer"
    "tes_costs": "integer"
}
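
For reference, the task-info endpoint of mock-TES described above can be exercised directly. The snippet below is a minimal sketch: the base URL, port, and the use of a POST request are assumptions for a locally running mock-TES, while the request fields mirror the resource requirements from the rank_services example.

    import requests  # third-party HTTP client

    # Ask a (mock-)TES instance for cost/time estimates for a task
    # (tesResources as input, tesTaskInfo as response). The base URL and
    # HTTP method are illustrative assumptions for a local mock-TES.
    response = requests.post(
        "http://localhost:8080/ga4gh/tes/v1/tasks/task-info",
        json={
            "cpu_cores": 2,
            "ram_gb": 8,
            "disk_gb": 10,
            "execution_time_min": 300,
        },
    )
    response.raise_for_status()
    print(response.json())  # estimated costs and times for the task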

Scope for improvement

The project still lacks certain aspects that could be incorporated to improve its current state. Issues have been raised, and we plan to resolve them soon; these issues can be found in the respective project repositories.

Future Work

The European life-science infrastructure organisation ELIXIR is developing WES-ELIXIR, a language-agnostic, Python Flask-/Gunicorn-based WES microservice that wraps TES-compatible workflow engines behind a uniform WES API. Pluggable workflow engine support is planned, to be implemented by wrapping different engines that can be assigned to each workflow language and version. Support for task distribution is planned by pointing the workflow engine to a second microservice, proTES, which acts as a mock or "proxy" TES and allows the injection of task distribution logic (and other middleware) before it relays the TES request received from the workflow engine to the most suitable TES instance. After adding benchmarks and ensuring that TEStribute indeed saves time and/or cost, the task distribution logic module could be added to proTES, where it would rank the TES instances available in a federated network and relay the original TES request to the most advantageous one via a built-in TES client, as shown in the image below.

[Figure: task distribution middleware]
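
To sketch this relay step, the snippet below gives a conceptual illustration only: the import path, the assumption that rank_services returns a list of response objects ordered by rank, and the task-creation URL are all illustrative, and none of this is actual proTES code.

    import requests

    # Hypothetical import path; TEStribute's real module layout may differ.
    from TEStribute import rank_services

    def relay_task(task_request, drs_ids, tes_uris, drs_uris):
        """Conceptual sketch: rank candidate services, then relay the TES
        request to the highest-ranked TES instance."""
        ranking = rank_services(
            drs_ids=drs_ids,
            resource_requirements=task_request["resources"],
            tes_uris=tes_uris,
            drs_uris=drs_uris,
            mode="cost",
        )
        best_tes = ranking[0]["TES"]  # assumes the list is ordered by rank
        # Relay the original request via the standard TES task-creation
        # endpoint of the selected instance.
        return requests.post(best_tes + "ga4gh/tes/v1/tasks", json=task_request)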

My Journey

When I first began my journey as a Google Summer of Code applicant, I was inspired by the Global Alliance for Genomics and Health and its efforts towards the attainment of FAIR data practices. As I read the use cases and schema specifications, I found myself longing to contribute in any way possible toward the advancement of the goal, i.e., to provide users (anyone who wishes to run biological workflows) with a seamless experience.

While I was writing the project proposal for TEStribute, my mentor helped me understand the use case as well as the concept, and helped me greatly not just in writing the proposal but also in recognising the importance of practicality during ideation. Another important takeaway from this period was the need for clarity while proposing an idea. I now understand that the initial commitment to simplicity, as well as sticking to the basics, is what helped us bring this project to its current state and helped me maintain clarity about what the project is envisioned to achieve.

During the coding period, I worked on the creation of mock services based on the Task Execution Service and Data Repository Service schemas, as well as clients to interact with those services. Both the services and the clients were built to test and develop TEStribute. My contributions can be found in the repositories mentioned above.

The time I spent working on TEStribute has helped me learn a lot. It has made me realise that code and documentation alike must be written with the user or reader in mind. I have learnt that at every stage there are norms and standardisation practices, and that conforming to them adds quality to one's work. This project gave me the opportunity to immerse myself in the world of open source and recognise the efforts of the many developers who work tirelessly to foster an open, welcoming and productive environment. TEStribute has been developed almost from scratch, and the experience of working on it has helped me realise my love for building code (and hopefully get better at it). I am grateful to have a mentor who not only guided me but also took the time to help me complete the project; his effort, patience and initiative have truly inspired me and made this project valuable. In this short period, his mentoring has helped me learn more than I could have ever imagined. I will forever be grateful for this opportunity and will look back at this project fondly.

Acknowledgements

The project is a collaborative effort under the umbrella of the ELIXIR Cloud and AAI group. It was started during the 2019 Google Summer of Code as part of the Global Alliance for Genomics and Health.

