@slopp
Last active June 20, 2023 21:49
Dagster Hot Takes

Less On-Call Pages: Retries and Alerts

https://youtu.be/A6WtkMwe4VQ

Getting an on-call page is the worst. Unfortunately, most task-based orchestrators page the team every time a job fails. With Dagster you can reduce this alert fatigue by using retry strategies and only getting notified when SLAs are violated.
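
To make the retry idea concrete, here is a minimal plain-Python sketch. It is not Dagster's API: `run_with_retries`, `notify`, and the injectable `sleep` are hypothetical names chosen for illustration. The task is retried with exponential backoff, and the alert fires only once every attempt has failed.

```python
import time
from typing import Callable, List, Optional


def run_with_retries(
    task: Callable[[], str],
    notify: Callable[[str], None],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `task`, retrying with exponential backoff; alert only when retries run out."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure counts as a failed attempt
            if attempt == max_retries:
                notify(f"task failed after {attempt + 1} attempts: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None


# Hypothetical flaky task: fails twice, then succeeds.
calls = {"n": 0}


def flaky_task() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient upstream error")
    return "ok"


alerts: List[str] = []
delays: List[float] = []  # record the backoff delays instead of actually sleeping
result = run_with_retries(flaky_task, alerts.append, sleep=delays.append)
```

Two transient failures are absorbed by the retries, so nobody gets paged; `alerts` only receives a message when the final attempt fails too.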

Resources:

Airflow Lost the Battle

https://youtu.be/PnrNGbL8W8k

I hear time and time again that teams are sticking with Airflow because it's "battle tested". This is true: Airflow has been tested, and it's been found lacking. Teams struggle to scale Airflow, they fight against package dependencies, and they are torn between mega-DAGs and maintenance nightmares.

The new year is a great time to try a new orchestrator. Dagster is making it easier than ever to migrate existing Airflow DAGs into Dagster to help teams get started.

Resources:

Breakpoints or Print Statements

https://youtu.be/KVVU6kwZjNM

Developing data pipelines has historically been challenging because, unlike in software engineering, it can be hard to write, test, and debug code locally. As a result, data engineering teams often have to rely heavily on logging to troubleshoot production errors.

While logs are still critical, in Dagster it is also possible to iterate locally using best-in-class tools like IDE debuggers.

Resources:

When did GA4 Sync?

https://youtu.be/kTjRKjSC-MY

Most orchestrators schedule jobs by time, but unfortunately this can mean data engineering teams have to guess when things should run. Guessing leads to flakiness. If upstream data, often outside the team's control, lands at a different time, processing can fail or create stale and misleading results. A common culprit is Google Analytics data, which syncs daily, but not at a specific time.

In Dagster, there are three ways to automate when things run: scheduling, event-driven scheduling (via sensors), and freshness SLAs. Sensors allow teams to write code that responds to external events to trigger jobs. No more guessing!
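
The sensor idea can be sketched in plain Python as a function that gets polled repeatedly and only requests a run when it sees a sync it has not reacted to yet. This is not Dagster's sensor API; `ga4_sensor`, `last_sync`, and the cursor handling are hypothetical stand-ins, and the timestamps are made up.

```python
from typing import List, Optional, Tuple


def ga4_sensor(last_sync: str, cursor: Optional[str]) -> Tuple[bool, str]:
    """Decide whether to trigger a run.

    `last_sync` is the ISO timestamp of the newest upstream sync (hypothetical input);
    `cursor` is the last sync this sensor already reacted to.
    Returns (should_run, new_cursor).
    """
    # ISO-8601 strings in a single timezone sort chronologically.
    if cursor is None or last_sync > cursor:
        return True, last_sync
    return False, cursor


# Poll three times: new data lands between the first and second poll.
cursor = "2023-06-19T04:12:00"
decisions: List[bool] = []
for observed in ["2023-06-19T04:12:00", "2023-06-20T07:45:00", "2023-06-20T07:45:00"]:
    should_run, cursor = ga4_sensor(observed, cursor)
    decisions.append(should_run)
```

The run is triggered exactly once, when the data actually lands, regardless of what time that happens to be.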

Resources:

We don't talk about upgrades

https://youtu.be/zLhq5lg86c8

Upgrades, whether it's a change to a Python package or a code update, can be a scary experience. Unfortunately, in Airflow a simple dependency change like a pandas update can trigger dependency hell, a situation where production fails because of conflicts between your code's dependencies and Airflow's.

Dagster was built to avoid this situation by isolating the orchestrator from the code it runs. This same isolation also supports large organizations that might have different teams with different project dependencies. Most importantly, isolating environments does not mean isolating assets: lineage is still tracked globally across environments.

  • Different Python environments ✅
  • Global asset lineage across environments ✅

Resources:

Scheduling Notebooks Sucks

https://youtu.be/yVqOgEr1SO0

Notebooks are sort of like Excel: they are very important, commonly used, and often disconnected from the rest of the data platform. If a business process relies on data arriving at 8am, it's not uncommon to find someone manually running a notebook at 8:30, or scheduling a notebook to run at 9am. Invariably one day the data load will be late, and the notebook will fail or show stale data. Efforts to bring notebooks into the data platform usually fail because orchestrators introduce too much complexity to local development.

In Dagster, notebooks are first-class citizens. Dagster's local development tools make it possible to work with notebooks without worrying about production resources, and Dagster's asset-first approach makes it easy to define a notebook's inputs without guessing where the notebook should live in a pipeline.

Once you're in production, notebooks can be run when their inputs are updated, no guesstimating required!

Resources:

PR to Prod ...YOLO

https://youtu.be/1ZILUXijvS8

One reason data engineering has historically been challenging is that reviewing a change requires a mystical ability to read the code AND ALSO imagine the impact it will have on your messy production system. Software engineers have solved this problem through CI/CD, SREs use reconciliation plans, web designers use Vercel/Netlify previews, and data engineers... pray?

In Dagster, the code in your data pipelines is split into the logical code responsible for creating assets and the IO code responsible for interacting with systems. This makes it possible to 1) develop locally very quickly and 2) create full copies of your data platform on each PR.

Gone are the days of finding syntax errors after 12 minutes of Databricks initialization, or realizing you got the type wrong when dbt fails in production, or just feeling anxious when you approve a PR.
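
Here is a rough plain-Python sketch of that split between asset logic and IO code. It is not Dagster's API: `InMemoryIO`, `big_orders`, and `materialize_big_orders` are hypothetical names, and a production setup would swap in a warehouse-backed class with the same two methods.

```python
from typing import Dict, List, Protocol


class IO(Protocol):
    """Anything that can load and save a named table."""

    def load(self, name: str) -> List[dict]: ...

    def save(self, name: str, rows: List[dict]) -> None: ...


class InMemoryIO:
    """Stand-in for local development or a per-PR copy of the platform."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[dict]] = {}

    def load(self, name: str) -> List[dict]:
        return self.tables[name]

    def save(self, name: str, rows: List[dict]) -> None:
        self.tables[name] = rows


def big_orders(orders: List[dict]) -> List[dict]:
    """Pure asset logic: no knowledge of where the data lives."""
    return [o for o in orders if o["amount"] >= 100]


def materialize_big_orders(io: IO) -> None:
    """Glue: read inputs, run the logic, write the result through whichever IO is supplied."""
    io.save("big_orders", big_orders(io.load("orders")))


io = InMemoryIO()
io.save("orders", [{"id": 1, "amount": 250}, {"id": 2, "amount": 40}])
materialize_big_orders(io)
```

Because `big_orders` never touches a real system, it runs in milliseconds on a laptop, and the same function can run unchanged against a throwaway copy of the data for each PR.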

Resources:

Can you add this column?

https://youtu.be/8X9TCQLQy0U

A bad way to start your day as a data engineer is with a ticket or email like this. Adding a column usually means writing ad-hoc code to backfill that column on historical data, which is often an error-prone process that is hard to test and can interfere with tasks running in production (which must also be updated to add the new column going forward).

With Dagster, partitions are a first-class abstraction. New columns can be added as a single change to your asset code, and that same exact code can be scheduled and run for future partitions and for backfills. Separate ad-hoc code isn't needed. Plus, with Dagster branch deployments, testing the new code on current and historical partitions is possible without interrupting production.
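
A small plain-Python sketch of the idea, assuming daily partitions: one function builds one partition, and both the backfill and the scheduled run call it. This is not Dagster's partitions API; `daily_revenue`, `partition_keys`, the `order_count` column, and the sample data are all hypothetical.

```python
from datetime import date, timedelta
from typing import List


def daily_revenue(partition: date, orders: List[dict]) -> dict:
    """Asset logic for one daily partition, including the newly requested column."""
    rows = [o for o in orders if o["day"] == partition]
    return {
        "day": partition.isoformat(),
        "revenue": sum(o["amount"] for o in rows),
        "order_count": len(rows),  # the new column
    }


def partition_keys(start: date, end: date) -> List[date]:
    """All daily partitions from `start` to `end`, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


orders = [
    {"day": date(2023, 6, 18), "amount": 20},
    {"day": date(2023, 6, 19), "amount": 30},
    {"day": date(2023, 6, 19), "amount": 5},
]

# Backfill: run the same function over the historical partitions.
backfill = [daily_revenue(d, orders) for d in partition_keys(date(2023, 6, 18), date(2023, 6, 19))]

# Scheduled run: the same function, called for the newest partition only.
latest = daily_revenue(date(2023, 6, 20), orders)
```

Adding the column is a one-line change inside `daily_revenue`; nothing about the backfill or the schedule needs its own code path.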

Resources posted below, or star us on GitHub to get started.

The users table doesn't look right?

https://www.youtube.com/watch?v=kZpCX5Muhxk

A bad way to start your day as a data engineer is with a ticket or email from marketing like this, because there are 100 different tools and possible reasons the users table might be wrong.

With a regular orchestrator, you have to go spelunking through tasks to figure out which ones touch the users table.

In contrast, Dagster speaks the same language as your stakeholders. Datasets are represented as a first-class abstraction called assets.

I can easily see the lineage, and quickly identify that the users table is out of date with the upstream orders table. I can click on the logs for the latest run, I can view additional metadata about the users table, and I can launch a new run to fix the problem, or investigate adding a sensor or modifying a schedule for a permanent fix.
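
The "out of date with upstream" check is easy to picture once lineage is data rather than something buried in task code. A toy plain-Python sketch, with a hypothetical lineage map and made-up timestamps (this is not how Dagster stores or computes staleness):

```python
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets it is built from.
upstream: Dict[str, List[str]] = {"orders": [], "users": ["orders"]}

# Hypothetical last-materialization times (ISO strings sort chronologically).
last_materialized: Dict[str, str] = {
    "orders": "2023-06-20T06:00:00",
    "users": "2023-06-19T06:00:00",
}


def stale_assets(upstream: Dict[str, List[str]], last_materialized: Dict[str, str]) -> List[str]:
    """Return assets that were last built before one of their direct upstream assets."""
    return sorted(
        asset
        for asset, parents in upstream.items()
        if any(last_materialized[p] > last_materialized[asset] for p in parents)
    )


stale = stale_assets(upstream, last_materialized)
```

Here `users` is flagged because `orders` was rebuilt after it, which is the question the marketing ticket was really asking.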

Resources posted below, or star us on GitHub to get started.
