Skip to content

Instantly share code, notes, and snippets.

@habibutsu
Last active November 13, 2024 09:57
Show Gist options
  • Save habibutsu/904f1b65cb8c8df72cfb266573353752 to your computer and use it in GitHub Desktop.
Save habibutsu/904f1b65cb8c8df72cfb266573353752 to your computer and use it in GitHub Desktop.
ETL frameworks

Distributed computation

Libraries:

  • Dask distributed is a lightweight library for distributed computing
  • Ray - is a distributed execution framework
  • Celery - Distributed Task Queue
  • Redis Queue - is a simple Python library for queueing jobs
  • Apache Flink - Stateful Computations over Data Streams
  • Apache Spark - Lightning-fast unified analytics engine
  • JobLib
  • dramatiq - A fast and reliable distributed task processing
  • Thespian - Python Actor Model library

Languages:

usefull links:

ETL frameworks

  • ETL - Extract Transform Load
  • DAG - Directed Acyclic Graph
  • github stars - 1 336
  • first release - May 20, 2018

Developed by Nicholas Schrock ex-Facebook engineer and GraphQL co-creator.

Example

from dagster import (
    execute_pipeline,
    pipeline,
    solid
)

@solid
def hello(context):
    context.log.info('Hello dagster')
    return list(range(10))

@pipeline
def hello_pipeline():
    hello()

if __name__ == '__main__':
    result = execute_pipeline(hello_pipeline)
    assert result.success
  • github stars - 1 688
  • first release - Mar 19, 2017

Developed by Jeremiah Lowin member of The Apache Software Foundation and ex-commiter Apache Airflow. Currently Founder & CEO at Prefect.

Why not Airflow

Example

from prefect import (
    task,
    Flow
)
from prefect.engine.executors import DaskExecutor

@task(
    tags=[
        # we want that worker has at least 2 cores and 2Gb of memory
        "dask-resource:CPU=2",
        "dask-resource:MEM=1e9"])
def hello():
    logger = prefect.context.get("logger")
    logger.info('Hello prefect')

def main():
    with Flow("Dummy") as flow:
        hello()

    flow.run(
        executor=DaskExecutor(address='127.0.0.1:8786')
    )

if __name__ == '__main__':
    main()
  • github stars - 2 988
  • first release - Dec 1, 2019

Developed in Netflix by Savin Goyal tech lead for the ML Infra team.

  • github stars - 15 835
  • first release - Oct 5, 2014

Articles:

  • github stars - 8 433
  • first release - Nov 26, 2017

The Machine Learning Toolkit for Kubernetes developed in Google

Celery based

Frameworks comparison

Criterias of comparison

  • submit - Distributed execution
  • map - Dynamicaly distributed execution
  • composition - Ability to combine tasks (like SubDag or similar)
  • storage - ability store data in external storage like S3 or GCP
  • cache - ability to share stored data across different run
  • monitoring - ability to monitor or introspect running graph (throgh UI or similar)
submit map composition storage cache monitoring
Dagster Y N Y Y N Y
Prefect Y Y N Y Cloud Cloud
Metaflow Y Y N S3 only ? ?
KubeFlow Y N ? ? ? ?
Airflow Y N Y Y N Y
Luigi N N Y Y Y Y

Y - yes N - no

issues

list of issues in different frameworks about dynamic map

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment