Airflow vs Prefect

(This was written after I implemented a simple, distributed pipeline on Airflow, and after spending hours reading the Prefect docs looking for equivalent functionality. The comparison might be slightly biased towards Airflow because I am more familiar with it, but best effort was made to ensure the information about Prefect is accurate.)

DAG and Tasks-related

Dynamic/Mapped Tasks

This refers to tasks that require a number of parallel runs that is unknown before execution. For example, to query the database and then, for each row, execute another task. This is important for parallelizing large tasks.

Prefect (+2) - Supported via a dedicated Mapped Task feature (see the sketch below)

Airflow (-2) - No relevant feature
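
For illustration, a minimal sketch of the mapping pattern under the Prefect 1.x API (task names and values are hypothetical):

from prefect import task, Flow

@task
def get_rows():
    # Pretend this queries the database for the rows to process
    return [1, 2, 3]

@task
def process_row(row):
    return row * 2

with Flow("mapped-example") as flow:
    rows = get_rows()
    process_row.map(rows)  # one parallel task run per returned row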

Versioned Task

Whether the system can store multiple versions of the same workflow, and execute old versions on demand.

Prefect (+1) - Full support for the above, but since Flows are registered directly to the server, source control becomes unenforced

Airflow (-0) - No support, but the same can be achieved with good old _v1 naming

On-demand Task / Task requiring Approval

Certain tasks should only be executed after manual intervention or approval; does the system support this?

Prefect (+1) - Supports tasks with a manual trigger. However, the feature is heavily designed around "approval": every manually triggered task is treated as an approval task and shows up in the approval dashboard.

Airflow (-1) - No support, but achievable with some convoluted design

Waiting for External Event

Can a task act as a listener, waiting for some external event to happen? For example, waiting for a file to be uploaded and present on S3.

Prefect (-2) - Not supported

Airflow (+2) - This functionality is provided by Sensors (see the sketch below)
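
To illustrate the S3 example above, a minimal sketch using the S3KeySensor from the Amazon provider package (the import path varies across Airflow versions; bucket, key, and connection id are hypothetical):

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_upload",
    bucket_name="my-bucket",         # hypothetical bucket
    bucket_key="incoming/data.csv",  # hypothetical key to wait for
    aws_conn_id="my_s3_connection",  # hypothetical stored connection
    poke_interval=60,                # re-check every minute
    timeout=60 * 60,                 # fail after an hour of waiting
)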

Architecture

DB Connection Credentials/Secrets

We need to store connection strings and distribute them to executors, so that the executors can connect to the databases and perform data operations.

Prefect (-2) - There is no support for storing credentials centrally. The best that can be done (without the Cloud version) is to define the secret on each agent as a local secret, and then use the "PrefectSecret" task to forward the secret to executors, as described in the Secrets doc.
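
A minimal sketch of that workaround under the Prefect 1.x API (the secret name and query task are hypothetical):

from prefect import task, Flow
from prefect.tasks.secrets import PrefectSecret

@task
def run_query(conn_string):
    # The executor receives the secret value as a normal task input
    ...

with Flow("local-secret-example") as flow:
    conn_string = PrefectSecret("DB_CONN_STRING")  # resolved from the agent's local secrets
    run_query(conn_string)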

Airflow (+2) - Connection credential storage is a built-in feature of Airflow, which allows you to store the hostname, username, password, and even custom parameters for each connection (encrypted with Fernet as advertised, though I have not tested this), and the Airflow "SDK" also comes with corresponding connectors that let the developer create data connections as simply as S3Hook("my_s3_connection_identifier").
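
For example, a minimal sketch of reading a file through a stored connection (the connection id follows the example above; bucket and key are hypothetical):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def fetch_report():
    # Hostname and credentials are resolved from the connection stored in Airflow
    hook = S3Hook(aws_conn_id="my_s3_connection_identifier")
    return hook.read_key(key="reports/latest.csv", bucket_name="my-bucket")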

Deploying the architecture

How complicated is it to deploy a functional environment with a remote executor, and code your flow to use it?

Prefect (-1) - Quite complicated; one needs to understand the concepts of and relationships between Server, Agent, and Executor. The agent and executor have to be defined on each Flow, otherwise it goes to a random agent and thus a random executor, and one flow must be executed entirely on a single executor.

Airflow (+1) - Easy to run: the server can be started with the official Docker Compose file, and to start a worker only a single command has to be run. Each worker listens on a queue name, and each individual task can select a worker by specifying the queue name (see the sketch below). If no queue name is specified, the task goes to the "default" queue and will be processed by workers not configured for a specific queue.
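
A minimal sketch of that queue routing with the Celery executor (the queue name is hypothetical):

# Start a worker that serves only the "reporting" queue:
#   airflow celery worker --queues reporting

from airflow.operators.bash import BashOperator

report_task = BashOperator(
    task_id="build_report",
    bash_command="echo building report",
    queue="reporting",  # routed to workers listening on this queue
)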

Opinionated Stuff

User Interface

I personally like Airflow more because the interface looks more designed for developers and engineers, and it makes a lot of the underlying technical details available.

Prefect on the other hand appears to be designed for less technical users, and the technical details are buried deeper.

Programming Style

Prefect highlights itself as being "less-DSL" when defining the Flow, i.e. the flow is "programmed" as you would without Prefect. As an example:

from prefect import task, Flow
from prefect.triggers import manual_only

@task
def return_data():
    return [1, 2, 3]

@task(trigger=manual_only)
def process_data(xs):
    d = [i + 2 for i in xs]
    return d

with Flow("Pause:Resume") as flow:
    process_data(return_data)

While this may be easy to read as Python code, by looking at the flow definition (process_data(return_data)) it is not clear that there are two tasks in the flow. In my opinion this is bad for building pipelines, as it is essential to let programmers know what tasks are in the pipeline, and linking up the tasks should be explicit work for the programmer. This only gets worse as the flow expands and lengthens. No form of modularization is encouraged either.

On the other hand, the Operator approach from Airflow by nature strongly encourages decoupling, as in the following:

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    'tutorial',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
) as dag:

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )

    t2 = BashOperator(
        task_id='sleep',
        depends_on_past=False,
        bash_command='sleep 5',
        retries=3,
    )

    t1 >> t2

Comparatively, it is obvious that there are two tasks in this DAG, and it makes the programmer conscious of the direct relationship between them.

(The above examples demonstrate the coding style as a whole. They might lead you to think about how data is passed between steps, and how Python code would be pickled and executed on a remote executor, but that is not the point of this discussion.)

Misc

Popularity

                          Airflow   Prefect
GitHub Stars              21.7k     6.4k
StackOverflow Questions   6k        53

License

Airflow - Apache License 2.0

Prefect - Core: Apache License 2.0 / Server, UI: Prefect Community License

According to this blog post from Prefect, Prefect Community License "(still) allows their use for any purpose, including modification or distribution, other than directly competing with Prefect’s own products."

One uncertainty: private companies can change the license they use as they wish, as happened with MongoDB, so it is hard to say anything about the future.

kaxil commented Jul 19, 2022

@ericwong3 - Airflow supports Dynamic/Mapped Tasks natively as of Airflow 2.3 (released in Apr 2022)
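
For reference, a minimal sketch of that feature using dynamic task mapping's expand() (Airflow 2.3+; DAG id and values are hypothetical):

import pendulum
from airflow import DAG
from airflow.decorators import task

with DAG("mapped_dag", start_date=pendulum.datetime(2022, 1, 1, tz="UTC"), schedule_interval=None) as dag:

    @task
    def get_rows():
        return [1, 2, 3]

    @task
    def process_row(row):
        return row * 2

    process_row.expand(row=get_rows())  # one mapped task instance per row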

kaxil commented Jul 19, 2022

Airflow has the TaskFlow API, which supports a more Pythonic way of writing DAGs for users who prefer that over the conventional way.

The following is a valid Python DAG

import pendulum

from airflow import DAG
from airflow.decorators import task


@task
def return_data():
    return [1, 2, 3]

@task
def process_data(xs):
    d = [i + 2 for i in xs]
    return d

with DAG("my_airflow_dag", start_date=pendulum.datetime(2021, 1, 1, tz="UTC")) as dag:
    process_data(return_data())
