Libraries:
- Dask distributed is a lightweight library for distributed computing
- Ray - is a distributed execution framework
- Celery - Distributed Task Queue
- Redis Queue - is a simple Python library for queueing jobs
- Apache Flink - Stateful Computations over Data Streams
- Apache Spark - Lightning-fast unified analytics engine
- JobLib
- dramatiq - A fast and reliable distributed task processing
- Thespian - Python Actor Model library
Languages:
usefull links:
- Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS
- Comparing Dask vs. Ray
- ray vs dask
- ray-graph
- Thoughts on Distributed Computing Frameworks
- OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing
- Spark, Dask, and Ray: Choosing the Right Framework
- Benchmarking Python Distributed AI Backends with Wordbatch
- Apache Airflow and Ray Orchestrating ML at Scale / Video
- github stars - 1 336
- first release - May 20, 2018
Developed by Nicholas Schrock ex-Facebook engineer and GraphQL co-creator.
Example
from dagster import (
execute_pipeline,
pipeline,
solid
)
@solid
def hello(context):
context.log.info('Hello dagster')
return list(range(10))
@pipeline
def hello_pipeline():
hello()
if __name__ == '__main__':
result = execute_pipeline(hello_pipeline)
assert result.success
- github stars - 1 688
- first release - Mar 19, 2017
Developed by Jeremiah Lowin member of The Apache Software Foundation and ex-commiter Apache Airflow. Currently Founder & CEO at Prefect.
Example
from prefect import (
task,
Flow
)
from prefect.engine.executors import DaskExecutor
@task(
tags=[
# we want that worker has at least 2 cores and 2Gb of memory
"dask-resource:CPU=2",
"dask-resource:MEM=1e9"])
def hello():
logger = prefect.context.get("logger")
logger.info('Hello prefect')
def main():
with Flow("Dummy") as flow:
hello()
flow.run(
executor=DaskExecutor(address='127.0.0.1:8786')
)
if __name__ == '__main__':
main()
- github stars - 2 988
- first release - Dec 1, 2019
Developed in Netflix by Savin Goyal tech lead for the ML Infra team.
- github stars - 15 835
- first release - Oct 5, 2014
Articles:
- github stars - 8 433
- first release - Nov 26, 2017
The Machine Learning Toolkit for Kubernetes developed in Google
Criterias of comparison
- submit - Distributed execution
- map - Dynamicaly distributed execution
- composition - Ability to combine tasks (like SubDag or similar)
- storage - ability store data in external storage like S3 or GCP
- cache - ability to share stored data across different run
- monitoring - ability to monitor or introspect running graph (throgh UI or similar)
submit | map | composition | storage | cache | monitoring | |
---|---|---|---|---|---|---|
Dagster | Y |
N | Y |
Y |
N |
Y |
Prefect | Y |
Y |
N | Y |
Cloud | Cloud |
Metaflow | Y |
Y |
N | S3 only | ? | ? |
KubeFlow | Y |
N | ? | ? | ? | ? |
Airflow | Y |
N | Y |
Y |
N |
Y |
Luigi | N | N | Y |
Y |
Y |
Y |
Y
- yes
N
- no
list of issues in different frameworks about dynamic map