Yoann YoannAbriel


Problem

On multi-scheduler Airflow deployments (e.g. AWS MWAA with 4+ schedulers), tasks intermittently fail in bulk with no apparent cause, but succeed on manual retry. The root issue is in _executable_task_instances_to_queued: when the scheduler checks task concurrency limits and cannot find the serialized DAG, it immediately bulk-fails all SCHEDULED task instances for that DAG via a raw SQL UPDATE. This is an overly aggressive response to what is often a transient race condition during DAG file parsing or serialization refresh cycles.

Root Cause

In scheduler_job_runner.py, the _executable_task_instances_to_queued method loads the serialized DAG only when checking per-task or per-dagrun concurrency limits. If scheduler_dag_bag.get_dag_for_run() returns None (serialized DAG transiently absent), the code executes a raw bulk UPDATE task_instance SET state='failed' for every SCHEDULED task of that DAG, instead of treating the miss as transient.

Fix

Treat a None result from scheduler_dag_bag.get_dag_for_run() as a transient miss: skip the affected DAG for the current scheduling loop so its SCHEDULED task instances are re-examined on the next pass, rather than bulk-failing them.

fix: warn about hardcoded 24h visibility_timeout that kills long-running Celery tasks

Problem

When using the Celery executor with Redis/SQS brokers and no explicit visibility_timeout configured, Airflow silently applies a default of 86400 seconds (24 hours) in default_celery.py. For tasks running longer than 24 hours, the broker redelivers the message to another worker, and the redelivered attempt fails with ServerResponseError('Invalid auth token: Signature has expired'). Users get no indication that this limit exists or how to change it.

The task_acks_late configuration description also incorrectly states it "effectively overrides the visibility timeout", which is not true for Redis/SQS brokers — the broker-level redelivery happens regardless of acknowledgment settings.
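One mitigation, assuming Airflow's celery_broker_transport_options config section (the value shown is an example, not a recommendation): set visibility_timeout explicitly so it exceeds the runtime of your longest task.

```ini
# airflow.cfg — make the broker visibility timeout explicit
[celery_broker_transport_options]
# must exceed your longest-running task; 172800 s = 48 h
visibility_timeout = 172800
```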

Root Cause

default_celery.py hardcodes a fallback visibility_timeout of 86400 seconds whenever the user does not set one, and neither the logs nor the configuration reference surface this default.

fix: stop parameterizing SQL keywords in MySQL bulk_load_custom

Problem

MySqlHook.bulk_load_custom() passes duplicate_key_handling (e.g. IGNORE, REPLACE) and extra_options as parameterized query values via cursor.execute(sql, parameters). The MySQL driver treats parameterized values as data and quotes them as string literals, producing invalid SQL like:

LOAD DATA LOCAL INFILE '/tmp/file' 'IGNORE' INTO TABLE `my_table` 'FIELDS TERMINATED BY ...'
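A sketch of the corrected construction (the helper build_load_sql and its allow-list are hypothetical, simplified from what MySqlHook.bulk_load_custom would need): SQL keywords are validated against an allow-list and interpolated into the statement text, while only the data value remains a bind parameter.

```python
# Hypothetical helper illustrating the fix; not the actual hook code.
ALLOWED_DUPLICATE_KEY_HANDLING = {"IGNORE", "REPLACE"}

def build_load_sql(table: str,
                   duplicate_key_handling: str = "IGNORE",
                   extra_options: str = "") -> str:
    if duplicate_key_handling not in ALLOWED_DUPLICATE_KEY_HANDLING:
        raise ValueError(
            f"unsupported duplicate_key_handling: {duplicate_key_handling!r}"
        )
    # Only the file path is left as a %s placeholder for the driver to
    # bind; keywords are allow-listed and placed directly in the SQL.
    return (
        f"LOAD DATA LOCAL INFILE %s {duplicate_key_handling} "
        f"INTO TABLE `{table}` {extra_options}"
    ).strip()

sql = build_load_sql("my_table", "IGNORE", "FIELDS TERMINATED BY ','")
assert "'IGNORE'" not in sql  # keyword is no longer quoted as a string
```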

Problem

When using .airflowignore with negation patterns to selectively un-ignore deeply nested directories, directory-only patterns (ending with /) incorrectly un-ignore files inside those directories.

For example, with this .airflowignore:

*
!abc/
!abc/def/
!abc/def/xyz/
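The intended semantics can be sketched with a tiny matcher (pattern_matches is a hypothetical helper, not Airflow's implementation): a negation pattern ending in / should match the directory itself, never the files beneath it.

```python
import fnmatch

def pattern_matches(pattern: str, rel_path: str, is_dir: bool) -> bool:
    # Hypothetical sketch of directory-only pattern semantics.
    if pattern.endswith("/"):
        # Directory-only pattern: match the directory path itself,
        # not files inside it.
        return is_dir and fnmatch.fnmatch(rel_path.rstrip("/"),
                                          pattern.rstrip("/"))
    return fnmatch.fnmatch(rel_path, pattern)

# "!abc/" should un-ignore the directory abc itself ...
assert pattern_matches("abc/", "abc", is_dir=True)
# ... but must NOT un-ignore files inside it (the reported bug):
assert not pattern_matches("abc/", "abc/dag.py", is_dir=False)
```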


fix: clear next_method and next_kwargs on task retry

Problem

When a deferrable operator fails during trigger resumption and enters retry, next_method and next_kwargs are not cleared in Airflow 3.x. This means the retry attempt skips execute() entirely and jumps straight to the stale next_method(**next_kwargs) callback from the previous attempt.

This causes two failure modes:

  1. The task processes "zombie" trigger events from the previous attempt
  2. The task fails because initial setup logic in execute() was bypassed


fix: clear next_method and next_kwargs on task retry

When a deferred task fails during trigger resumption and is eligible for retry, the next_method and next_kwargs fields must be cleared so the retry starts fresh from execute() rather than resuming from the stale deferral callback.

This is the Airflow 3.x equivalent of the fix in #18146 / #18210 for Airflow 2.x. The transition to the Task SDK and Internal API introduced prepare_db_for_next_try() as the central method for preparing a TaskInstance for its next attempt, but it was missing the clear_next_method_args() call.

Changes

  • Added self.clear_next_method_args() call in TaskInstance.prepare_db_for_next_try() to ensure next_method and next_kwargs are reset before the new attempt ID is generated
  • Added tests verifying that prepare_db_for_next_try() clears deferred fields
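The change can be illustrated with a stripped-down model (the class below is a toy, not the real TaskInstance; the method names follow the PR description above):

```python
# Toy model of the fix; not the real TaskInstance.
class TaskInstance:
    def __init__(self):
        # stale deferral state left over from the failed attempt
        self.next_method = "execute_complete"
        self.next_kwargs = {"event": "stale-trigger-event"}
        self.try_number = 1

    def clear_next_method_args(self):
        # reset deferral state so the retry starts from execute()
        self.next_method = None
        self.next_kwargs = None

    def prepare_db_for_next_try(self):
        self.clear_next_method_args()  # the call this fix adds
        self.try_number += 1           # then bump the attempt

ti = TaskInstance()
ti.prepare_db_for_next_try()
assert ti.next_method is None and ti.next_kwargs is None  # fresh retry
```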

Apache Airflow — 10 Medium-Difficulty Open Issues (No Competing PRs)

Generated: 2026-03-04


1. #62845 — Deferred TI's next_method and next_kwargs not cleared on retries in Airflow 3.x

URL: apache/airflow#62845

Summary: When a deferrable operator fails during trigger resumption and enters retry, the next attempt incorrectly persists next_method and next_kwargs from the previous attempt. Instead of starting fresh with execute(), the worker resumes with stale data, causing failures or processing zombie events. This was fixed in Airflow 2.x (#18146) but the reset logic is missing in Airflow 3.x's new Task SDK.

Apache Airflow — Issue Candidates for Contribution

Generated: 2026-03-04
Repo: apache/airflow
Criteria: Backend/logic bugs, small-medium scope, clear problem, no active PR, testable


🏆 Ranked Candidates