Project: Data streaming in scientific workflows, implementation for Toil
At the start of GSoC, I published a blog post in the CWL community with my proposed design for the project and a short introduction to Toil. It was cross-posted on the Open Bioinformatics Foundation (OBF) blog:
Working on a CWL-Toil project with the Open Bioinformatics Foundation
I accomplished most of my goals. The main software artifact is the implementation of input data streaming in toil-cwl-runner. I also achieved a bonus goal of the project: allowing data streaming for both AWS and Google Cloud buckets by making use of the existing cloud connectors in Toil. The feature was merged.
Pull Request: DataBiosphere/toil#3694
Issue: DataBiosphere/toil#3469
I also created tests for this feature:
CWL workflow for testing input streaming
Test the workflow in a Toil way
The feature required support for named pipes in cwltool, a CWL project used by toil-cwl-runner. I implemented this support, and it was merged.
Pull Request: common-workflow-language/cwltool#1469
Issue: common-workflow-language/cwltool#1468
Test for this feature:
Test streaming in cwltool
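To illustrate why named pipes matter for streaming, here is a minimal Python sketch of the general idea. It is not the code from the pull requests above: the function name `stream_through_fifo` and the file name `input.fifo` are hypothetical, the producer thread merely stands in for a download from a bucket, and the reader stands in for the tool that consumes the input. It assumes a POSIX system where `os.mkfifo` is available.

```python
import os
import tempfile
import threading


def stream_through_fifo(chunks):
    """Send byte chunks through a named pipe and return what the reader received."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "input.fifo")  # hypothetical file name
    os.mkfifo(fifo_path)  # POSIX only

    def producer():
        # Opening a FIFO for writing blocks until a reader opens the other end.
        with open(fifo_path, "wb") as pipe:
            for chunk in chunks:
                pipe.write(chunk)

    writer = threading.Thread(target=producer)
    writer.start()
    try:
        # The consumer opens the FIFO like an ordinary file path and reads
        # until the producer closes its end.
        with open(fifo_path, "rb") as pipe:
            received = pipe.read()
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return received


if __name__ == "__main__":
    print(stream_through_fifo([b"chr1\t100\n", b"chr2\t200\n"]))
```

Because the consumer only sees a file path, a tool can start processing data as it arrives instead of waiting for the whole file to be written to disk first.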
I discovered a bug in toil-cwl-runner which caused files to be downloaded twice. I fixed it and it was merged:
Pull Request: DataBiosphere/toil#3670
Issue: DataBiosphere/toil#3665
I discovered a bug in a Toil tutorial and proposed a solution:
Tutorial issue
To familiarise myself with CWL and Toil, I tested different streaming examples in Toil with different setups and tools:
https://github.com/mhpopescu/toil-gsoc-tests
As additional work, I started implementing streaming of outputs. This is not finished yet and has not been merged.
https://github.com/mhpopescu/cwltool/tree/stream-outputs
https://github.com/mhpopescu/toil/tree/stream-outputs