Google Summer of Code '21 Final Report
- Name: Vipul Chhabra
- Project: Implement GA4GH TES in Galaxy/Pulsar
- Organization: [Global Alliance for Genomics and Health][gsoc-org]
- Mentors: Alex Kanitz, Kyle Ellrott, Björn Grüning, Marius
This Gist summarises the work I did during Google Summer of Code 2021, working primarily on the Galaxy ecosystem for the Global Alliance for Genomics and Health organization, under the guidance of my mentors Alex Kanitz, Kyle Ellrott, Björn Grüning, and Marius.
"Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational research."
Ref: Galaxy Website
Galaxy aims to make the tools used in bioinformatics research easy to access: uploading data, running complex tools and workflows, and visualizing results require no programming experience.
Galaxy provides other features such as reproducibility and transparency. It captures information so that any user can understand and repeat a complete computational analysis and share or publish their analysis (histories, workflows, visualizations).
The Idea 💡
Galaxy is a free and open-source data analysis platform that provides a home to more than 8,100 ready-to-use tools. The platform also currently offers distributed execution of data analysis workflows in the cloud using executors such as Pulsar and Kubernetes. To increase the interoperability of Galaxy with GA4GH Cloud-compliant solutions, the goal of this project is to add support for the TES API to Galaxy and to modify the Pulsar REST API according to the TES specification.
The runner is designed to break down the job into smaller components. Galaxy passes all the job information to the runner directly using job_wrapper. Then, the job is passed to TES and managed until completion.
The runner first finds the list and paths of the input, output, and tool files, gets the Docker image to be used for executing the job, and builds a command for it. All the environment variables defined for the job or used in the configuration are converted into the desired format. The runner then creates the job script in the format required by the job specification and sends it to the TES instance for execution.
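The preparation described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names (`env_to_tes`, `file_entries`, `build_task`), the `file://` URL prefix, and the assumption that environment variables arrive as a list of `name`/`value` entries are all hypothetical. Only the field names of the resulting dictionary (`inputs`, `outputs`, `executors`, `image`, `command`, `env`, `url`, `path`, `type`) come from the TES task schema.

```python
from typing import Dict, List


def env_to_tes(env_vars: List[Dict[str, object]]) -> Dict[str, str]:
    """Flatten [{'name': ..., 'value': ...}] entries into the string map TES expects."""
    return {str(item["name"]): str(item["value"]) for item in env_vars}


def file_entries(paths: List[str], base_url: str) -> List[Dict[str, str]]:
    """Describe each file by where it is stored (url) and where it is mounted (path)."""
    return [{"url": f"{base_url}{path}", "path": path, "type": "FILE"} for path in paths]


def build_task(
    name: str,
    image: str,
    command: List[str],
    env_vars: List[Dict[str, object]],
    inputs: List[str],
    outputs: List[str],
    base_url: str = "file://",  # hypothetical storage prefix, for illustration only
) -> Dict[str, object]:
    """Assemble a TES task body with a single executor."""
    return {
        "name": name,
        "inputs": file_entries(inputs, base_url),
        "outputs": file_entries(outputs, base_url),
        "executors": [
            {"image": image, "command": command, "env": env_to_tes(env_vars)}
        ],
    }
```

The resulting dictionary would be serialized to JSON and posted to the TES instance; how the real runner derives each value from job_wrapper is not shown here.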
The execution of the job in TES is divided into three steps:
Step 1 - All the output files are created in the desired locations and directories.
Step 2 - The job is executed using the defined/default Docker image, environment variables, and the command built by the runner.
Step 3 - All the output files and the job directory are staged back to Galaxy.
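One way to picture the three steps is as an ordered list of TES executors, which a TES server runs one after another. The sketch below is an assumption about how such a list could be assembled, not the runner's actual implementation: `staged_executors`, `helper_image`, and `stage_out_command` are hypothetical names, and the stage-out command is left to the caller because its real form is not described in this report.

```python
import posixpath
import shlex
from typing import Dict, List


def staged_executors(
    tool_image: str,
    tool_command: List[str],
    env: Dict[str, str],
    output_paths: List[str],
    helper_image: str,             # hypothetical image used for steps 1 and 3
    stage_out_command: List[str],  # hypothetical command that sends results back
) -> List[Dict[str, object]]:
    """Return three executors: prepare outputs, run the tool, stage results out."""
    if not output_paths:
        raise ValueError("at least one output path is required")

    # Step 1: create the output directories and empty output files.
    directories = sorted({posixpath.dirname(path) for path in output_paths})
    prepare = "mkdir -p {} && touch {}".format(
        " ".join(shlex.quote(d) for d in directories),
        " ".join(shlex.quote(p) for p in output_paths),
    )

    return [
        {"image": helper_image, "command": ["/bin/sh", "-c", prepare]},
        # Step 2: run the tool command in the job's own image and environment.
        {"image": tool_image, "command": tool_command, "env": env},
        # Step 3: stage the outputs and job directory back.
        {"image": helper_image, "command": stage_out_command},
    ]
```

For files written in one step to be visible in the next, the task would also need to share the job directory between executors, for example through the task's `volumes` field.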
Galaxy monitors the job while it runs. If the job fails because of errors or resource problems, the stdout/stderr received from TES is passed to the user, and the job status is updated accordingly. If the job completes successfully, the output files received are passed to the user.
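The monitoring loop described above can be sketched as a simple poller. This is an illustration under stated assumptions: `wait_for_task`, `collect_logs`, and the `fetch_task` callable are hypothetical names, and the `"ok"`/`"error"` status strings are a simplification of Galaxy's job states. The state names and the nested `logs` layout follow the TES task schema, where stdout and stderr are returned in the full view of a task.

```python
import time
from typing import Callable, Dict

# TES task states that end a task.
SUCCESS_STATES = {"COMPLETE"}
FAILURE_STATES = {"EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def collect_logs(task: Dict) -> Dict[str, str]:
    """Join stdout/stderr from every executor log of a full-view TES task."""
    stdout, stderr = [], []
    for attempt in task.get("logs") or []:
        for executor_log in attempt.get("logs") or []:
            stdout.append(executor_log.get("stdout") or "")
            stderr.append(executor_log.get("stderr") or "")
    return {"stdout": "".join(stdout), "stderr": "".join(stderr)}


def wait_for_task(
    fetch_task: Callable[[], Dict],  # hypothetical: returns the task as a dict
    poll_interval: float = 2.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until the task reaches a terminal state, then report status and logs."""
    for _ in range(max_polls):
        task = fetch_task()
        state = task.get("state", "UNKNOWN")
        if state in SUCCESS_STATES or state in FAILURE_STATES:
            result = collect_logs(task)
            result["status"] = "ok" if state in SUCCESS_STATES else "error"
            result["tes_state"] = state
            return result
        sleep(poll_interval)
    raise TimeoutError("TES task did not reach a terminal state")
```

In practice `fetch_task` would wrap an HTTP request to the TES "get task" endpoint; it is passed in as a callable here so the loop can be exercised without a running TES server.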
What did I achieve? 🎉
Over the 2 months, I contributed to building the TES Runner and integrating it into Galaxy:
- Built a runner capable of performing the actions required by Galaxy.
- Built a Docker image for creating and staging out the output files.
A lot of emphasis was put on following good coding practices, and so I am proud to say that the project is well documented and that integration tests were added to cover part of its functionality.
The PR containing the required code, scripts, and tests can be found here.
What is left?
A few things that have not been covered or that need improvement are:
- Integration Tests: A few more integration tests can be added to cover all the functionality.
- Modifying Pulsar according to the TES Specification: Pulsar needs to be modified according to the TES specification for uniformity across all the services. This was part of the stretch goals but could not be addressed in the given time frame.
One of the major goals of the Galaxy and ELIXIR Cloud & AAI ecosystem is to provide convenient access to scientific data analysis tools with high computation power and minimal programming experience. The runner built here brings us one step closer to that goal.
The above diagram shows an overall flow of the execution using the TES Runner.
My Journey 🚴
In my freshman year of university, I was introduced to open source through a workshop organized by one of my seniors. Initially, I found it a bit hard to understand how I could contribute to open source. My seniors then guided me, I started learning version control, and slowly I began contributing to open-source applications. I have been contributing since 2018. In early 2021 I was searching for a new organization where I could contribute and learn more about the technologies I was interested in. After visiting the pages and project ideas of multiple organizations, I came across the Global Alliance for Genomics and Health and found that its ideas could have a significant impact in accelerating bioinformatics research.
In our first conversation, I was welcomed by my mentor Alex Kanitz, and at that moment I decided that this was the community I would be happy to contribute to. My first PR added integration tests to Foca.
Later I got selected for the project Implementing TES in Galaxy/Pulsar. 🥳
GSoC'21 has been one of the best experiences of my life, and I will remember it for a long time to come. Over the last few months, apart from writing quality code, I have learned to take ownership of a project. I enjoyed working with my mentors and the community in general, and I thank them for giving me a fantastic experience. Unlike most of the students who participate in GSoC, my journey with my organization is not over yet, and I feel there is still a long way to go.