(This was written after I implemented a simple, distributed pipeline on Airflow, and after spending hours reading the Prefect docs and looking for equivalent functionality. The comparison might be slightly biased towards Airflow since I am more familiar with it, but best effort has been made to ensure the information about Prefect is accurate.)
DAG and Task-related
Dynamic / Mapped Tasks
This refers to tasks whose number of parallel runs is unknown before execution. For example, querying the database and then, for each row, executing another task. This is important for parallelizing large tasks.
Prefect (+2) - Supported, it has dedicated Mapped Task feature
Airflow (-2) - No relevant feature
Workflow Versioning
Whether the system can store multiple versions of the same workflow, and execute old versions on demand.
Prefect (+1) - Full support for the above; however, since Flows are registered to the server directly, source control becomes unenforced
Airflow (-0) - No support, but the same can be achieved with good old source control
On-demand Task / Task requiring Approval
There could be certain tasks that should only be executed after manual intervention or approval; does the system support this?
Prefect (+1) - Supports tasks with a manual trigger. However, the feature is heavily designed around "approval": every manually triggered task is treated as an approval task and shows up in the approval dashboard.
Airflow (-1) - No support, but achievable with some convoluted design
Waiting for External Event
Can a task act as a listener waiting for some external event to happen? For example, waiting for a file to be uploaded and present on S3.
Prefect (-2) - Not supported
Airflow (+2) - This functionality is provided by the "Sensors"
DB Connection Credentials/Secrets
We will need to store connection string and distribute them to executors, so that the executors can connect to the databases and perform data operations.
Prefect (-2) - There is no support for storing credentials. The best that can be done (without the Cloud version) is to define the secret on each agent as a local secret, and then use the "PrefectSecret" task to forward the secret to executors, as described in the Secrets doc.
Airflow (+2) - Connection credentials storage is a built-in feature of Airflow, which allows you to store the hostname, username, password and even custom parameters for each connection (encryption by Fernet as advertised, though not tested), and the Airflow "SDK" also comes with corresponding hooks that let developers create data connections from just a connection ID.
Deploying the architecture
How complicated is it to deploy a functional environment with a remote executor, and code your flow to use it?
Prefect (-1) - Quite complicated: one needs to understand the concepts of, and relationships between, Server, Agent, and Executor. The agent and executor have to be defined on each Flow, otherwise it goes to a random agent and thus a random executor, and one flow must be executed entirely on a single executor.
Airflow (+1) - Easy to run: the server can be started with the official Docker Compose file, and to start a worker only a single command has to be run. Each worker listens on a queue name, and each individual task can select its executor by specifying the queue name. If no queue name is specified, the task goes to the "default" queue and will be processed by workers not configured to serve a specific queue.
I personally like Airflow more because the interface looks more designed for developers and engineers, and it makes a lot of the underlying technical details available.
Prefect on the other hand appears to be designed for less technical users, and the technical details are buried deeper.
Prefect highlights itself as being "less-DSL" at defining the Flow, i.e. the flow is "programmed" as you would do without Prefect. As an example:
```python
from prefect import task, Flow
from prefect.triggers import manual_only

@task
def return_data():
    return [1, 2, 3]

@task(trigger=manual_only)
def process_data(xs):
    d = [i + 2 for i in xs]
    return d

with Flow("Pause:Resume") as flow:
    process_data(return_data)
```
While this may be easy to read as Python code, looking at the flow definition (process_data(return_data)) does not make it clear that there are two tasks in the flow. In my opinion, this is bad for building pipelines, as it is essential to let the programmers know what tasks are in the pipeline, and linking up the tasks should be explicit work for the programmer. This is only going to get worse as the flow expands and lengthens. There is also no form of modularization encouraged.
On the other hand, the Operator approach from Airflow by nature strongly encourages decoupling, as in the following:
```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    'tutorial',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    t2 = BashOperator(
        task_id='sleep',
        depends_on_past=False,
        bash_command='sleep 5',
        retries=3,
    )
    t1 >> t2
```
Comparatively, it is obvious that there are two tasks in this DAG, and the style makes the programmer conscious that the two tasks have a direct relationship.
(The above examples demonstrate the coding style as a whole. They might lead you to think about how data is passed between steps, and how Python code would be pickled and executed on a remote executor, but that is not the point of this discussion.)
Licensing
Airflow - Apache License 2.0
Prefect - Core: Apache License 2.0 / Server, UI: Prefect Community License
According to this blog post from Prefect, Prefect Community License "(still) allows their use for any purpose, including modification or distribution, other than directly competing with Prefect’s own products."
The uncertainty is that private companies can change their license as they wish, as happened with MongoDB, so it is hard to say what the future holds.