habibutsu/notes.md

## notes.md

      
    Raw
  

              notes.md
            
          
    Distributed computation

Libraries:

Dask distributed is a lightweight library for distributed computing
Ray - is a distributed execution framework
Celery - Distributed Task Queue
Redis Queue - is a simple Python library for queueing jobs
Apache Flink - Stateful Computations over Data Streams
Apache Spark - Lightning-fast unified analytics engine
JobLib
dramatiq - A fast and reliable distributed task processing
Thespian - Python Actor Model library

Languages:

http://www.emeraldprogramminglanguage.org/
http://river.cs.usfca.edu/home

usefull links:

Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS
Comparing Dask vs. Ray
ray vs dask
ray-graph
Thoughts on Distributed Computing Frameworks
OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing
Spark, Dask, and Ray: Choosing the Right Framework
Benchmarking Python Distributed AI Backends with Wordbatch
Apache Airflow and Ray Orchestrating ML at Scale / Video

ETL frameworks


ETL - Extract Transform Load
DAG - Directed Acyclic Graph

Dagster


github stars - 1 336
first release - May 20, 2018

Developed by Nicholas Schrock ex-Facebook engineer and GraphQL co-creator.
Example
from dagster import (
    execute_pipeline,
    pipeline,
    solid
)

@solid
def hello(context):
    context.log.info('Hello dagster')
    return list(range(10))

@pipeline
def hello_pipeline():
    hello()

if __name__ == '__main__':
    result = execute_pipeline(hello_pipeline)
    assert result.success
Prefect


github stars - 1 688
first release - Mar 19, 2017

Developed by Jeremiah Lowin member of The Apache Software Foundation and ex-commiter Apache Airflow. Currently Founder & CEO at Prefect.
Why not Airflow
Example
from prefect import (
    task,
    Flow
)
from prefect.engine.executors import DaskExecutor

@task(
    tags=[
        # we want that worker has at least 2 cores and 2Gb of memory
        "dask-resource:CPU=2",
        "dask-resource:MEM=1e9"])
def hello():
    logger = prefect.context.get("logger")
    logger.info('Hello prefect')

def main():
    with Flow("Dummy") as flow:
        hello()

    flow.run(
        executor=DaskExecutor(address='127.0.0.1:8786')
    )

if __name__ == '__main__':
    main()
Metaflow


github stars - 2 988
first release - Dec 1, 2019

Developed in Netflix by Savin Goyal tech lead for the ML Infra team.
Apache Airflow


github stars - 15 835
first release - Oct 5, 2014

Articles:

XComs
XCom pull and push under the hood
When Airflow isn’t fast enough

KubeFlow


github stars - 8 433
first release - Nov 26, 2017

The Machine Learning Toolkit for Kubernetes developed in Google
Celery based


LightFlow
Selinon

Frameworks comparison

Criterias of comparison

submit - Distributed execution
map - Dynamicaly distributed execution
composition - Ability to combine tasks (like SubDag or similar)
storage - ability store data in external storage like S3 or GCP
cache - ability to share stored data across different run
monitoring - ability to monitor or introspect running graph (throgh UI or similar)


submit
map
composition
storage
cache
monitoring


Dagster
Y
N
Y
Y
N
Y


Prefect
Y
Y
N
Y
Cloud
Cloud


Metaflow
Y
Y
N
S3 only
?
?


KubeFlow
Y
N
?
?
?
?


Airflow
Y
N
Y
Y
N
Y


Luigi
N
N
Y
Y
Y
Y


Y - yes
N - no
issues

list of issues in different frameworks  about dynamic map

kubeflow/pipelines#3412
dagster-io/dagster#462
	submit	map	composition	storage	cache	monitoring
Dagster	`Y`	N	`Y`	`Y`	`N`	`Y`
Prefect	`Y`	`Y`	N	`Y`	Cloud	Cloud
Metaflow	`Y`	`Y`	N	S3 only	?	?
KubeFlow	`Y`	N	?	?	?	?
Airflow	`Y`	N	`Y`	`Y`	`N`	`Y`
Luigi	N	N	`Y`	`Y`	`Y`	`Y`