<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: De' Clerke</title>
    <description>The latest articles on Forem by De' Clerke (@de_clerke).</description>
    <link>https://forem.com/de_clerke</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3506183%2F46467aed-cbcf-426d-95d8-160e51bc66f9.jpg</url>
      <title>Forem: De' Clerke</title>
      <link>https://forem.com/de_clerke</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/de_clerke"/>
    <language>en</language>
    <item>
      <title>Apache Airflow 2 vs 3: A Deep Technical Comparison for Data Engineers</title>
      <dc:creator>De' Clerke</dc:creator>
      <pubDate>Wed, 15 Apr 2026 09:10:31 +0000</pubDate>
      <link>https://forem.com/de_clerke/apache-airflow-2-vs-3-a-deep-technical-comparison-for-data-engineers-2on5</link>
      <guid>https://forem.com/de_clerke/apache-airflow-2-vs-3-a-deep-technical-comparison-for-data-engineers-2on5</guid>
      <description>&lt;h2&gt;
  
  
  Apache Airflow 2 vs 3: A Deep Technical Comparison for Data Engineers 🚀
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Airflow 3 dissolves the monolithic webserver into three independent&lt;br&gt;
services, strips direct database access from task code, ships a fully stable&lt;br&gt;
Task SDK, and rewrites the entire UI in React. If you are running Airflow 2 in&lt;br&gt;
production, this article will tell you exactly what breaks, what improves, and&lt;br&gt;
how to migrate without losing a night's sleep. 😴&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Why This Comparison Matters ⚖️
&lt;/h3&gt;

&lt;p&gt;Every major Airflow release has nudged the architecture forward. Airflow 2 gave us&lt;br&gt;
the TaskFlow API, the Scheduler high-availability refactor, and provider packages.&lt;br&gt;
Airflow 3 is different in kind, not just degree.&lt;/p&gt;

&lt;p&gt;While migrating a production Docker Compose stack for a healthcare ML&lt;br&gt;
retraining pipeline from Airflow 2 patterns to Airflow 3, I hit every single one of&lt;br&gt;
the following in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU spike to &lt;strong&gt;600%&lt;/strong&gt; caused by a silent breaking change in JWT key management 📈&lt;/li&gt;
&lt;li&gt;Tasks silently failing with &lt;code&gt;Connection refused&lt;/code&gt; because &lt;code&gt;localhost&lt;/code&gt; no longer
means what it used to 🔌&lt;/li&gt;
&lt;li&gt;A healthcheck that always reported &lt;em&gt;unhealthy&lt;/em&gt; because port 8974 no longer exists ❌&lt;/li&gt;
&lt;li&gt;A user creation step that silently did nothing because FAB is gone 👤&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these failures traces back to a deliberate, principled architectural&lt;br&gt;
decision in Airflow 3. Once you understand &lt;em&gt;why&lt;/em&gt; the changes were made, the fixes&lt;br&gt;
are obvious — but without that context, Airflow 3 can feel like it is actively&lt;br&gt;
working against you.&lt;/p&gt;

&lt;p&gt;This article is that context. 💡&lt;/p&gt;


&lt;h3&gt;
  
  
  The 30-Second Summary ⏱️
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Airflow 2&lt;/th&gt;
&lt;th&gt;Airflow 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flask-AppBuilder (FAB)&lt;/td&gt;
&lt;td&gt;React (FastAPI backend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webserver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;airflow webserver&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;airflow api-server&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DAG Processor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Embedded in scheduler&lt;/td&gt;
&lt;td&gt;Mandatory separate service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct fork/subprocess&lt;/td&gt;
&lt;td&gt;Task Execution API (AIP-72)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata DB access from tasks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Allowed&lt;/td&gt;
&lt;td&gt;Prohibited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth manager default&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FAB (full RBAC)&lt;/td&gt;
&lt;td&gt;SimpleAuthManager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;REST API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;v1 (Flask)&lt;/td&gt;
&lt;td&gt;v2 (FastAPI, stable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default schedule&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@daily&lt;/code&gt; (cron)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;None&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;catchup&lt;/code&gt; default&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SequentialExecutor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;td&gt;Removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SubDAGs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;td&gt;Removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SLAs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;td&gt;Removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Import path for &lt;code&gt;@dag&lt;/code&gt;/&lt;code&gt;@task&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;airflow.decorators&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;airflow.sdk&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XCom pickling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enabled by default&lt;/td&gt;
&lt;td&gt;Disabled by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Python minimum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.8&lt;/td&gt;
&lt;td&gt;3.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostgreSQL minimum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  🏗️ Part 1 — The Architectural Paradigm Shift
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Airflow 2: One Webserver to Rule Them All 🏛️
&lt;/h3&gt;

&lt;p&gt;In Airflow 2, the mental model for a self-hosted deployment is relatively&lt;br&gt;
straightforward. You run four processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow webserver       &lt;span class="c"&gt;# Flask-AppBuilder UI + REST API v1 + auth&lt;/span&gt;
airflow scheduler       &lt;span class="c"&gt;# parses DAGs + triggers task instances&lt;/span&gt;
airflow worker          &lt;span class="c"&gt;# (CeleryExecutor) executes tasks&lt;/span&gt;
postgres/mysql          &lt;span class="c"&gt;# metadata database&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The webserver does double duty — it serves the browser UI &lt;em&gt;and&lt;/em&gt; exposes the REST&lt;br&gt;
API &lt;em&gt;and&lt;/em&gt; handles authentication, all from a single Flask application. The&lt;br&gt;
scheduler parses your &lt;code&gt;dags/&lt;/code&gt; directory inline, as part of its own main loop.&lt;/p&gt;

&lt;p&gt;This is simple to reason about. It is also a single point of failure for three&lt;br&gt;
completely separate concerns.🏚️&lt;/p&gt;
&lt;h3&gt;
  
  
  Airflow 3: Separation of Concerns as a First-Class Constraint
&lt;/h3&gt;

&lt;p&gt;Airflow 3 decomposes the monolith into discrete, independently scalable services:🧩&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow api-server          &lt;span class="c"&gt;# FastAPI: UI + REST API v2 + auth (replaces webserver)&lt;/span&gt;
airflow scheduler           &lt;span class="c"&gt;# triggers task instances only; NO DAG parsing&lt;/span&gt;
airflow dag-processor       &lt;span class="c"&gt;# mandatory: parses DAGs, writes to serialized_dag table&lt;/span&gt;
airflow triggerer           &lt;span class="c"&gt;# manages deferrable operators&lt;/span&gt;
postgres/mysql              &lt;span class="c"&gt;# metadata database&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the scheduler in Airflow 3 &lt;strong&gt;does not parse DAGs&lt;/strong&gt;. It reads the&lt;br&gt;
&lt;code&gt;serialized_dag&lt;/code&gt; table, which is populated exclusively by the dag-processor service.&lt;br&gt;
If you start a scheduler without a dag-processor, it will start cleanly — and then&lt;br&gt;
do nothing, because it has no serialized DAGs to schedule.🏜️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Airflow 2: single scheduler did everything
[Scheduler process]
  ├── Parses dags/ directory
  ├── Updates serialized_dag table
  ├── Checks heartbeats
  └── Triggers TaskInstances

# Airflow 3: responsibilities split
[dag-processor]               [scheduler]
  └── Parses dags/                 ├── Reads serialized_dag
      Updates serialized_dag       ├── Checks heartbeats
                                   └── Triggers TaskInstances via Execution API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This split unlocks horizontal scalability. The dag-processor can be scaled&lt;br&gt;
independently on compute-heavy deployments with thousands of DAG files, without&lt;br&gt;
touching the scheduler's scheduling loop latency.⚡&lt;/p&gt;
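&lt;p&gt;In Docker Compose terms, the decomposition can be sketched as the following skeleton (illustrative only: volumes, healthchecks, and the database service are omitted, and you should pin whichever 3.x image tag you actually run):&lt;br&gt;
&lt;/p&gt;

```yaml
# Illustrative skeleton only: volumes, healthchecks, and the Postgres
# service are omitted, and the image tag is a stand-in for your own.
x-airflow-common: &airflow-common
  image: apache/airflow:3.0.0
  environment:
    AIRFLOW__CORE__EXECUTOR: LocalExecutor

services:
  airflow-api-server:
    <<: *airflow-common
    command: api-server
    ports:
      - "8080:8080"
  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor
  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
```

&lt;p&gt;All four Airflow services share one config block; only the &lt;code&gt;command&lt;/code&gt; differs.&lt;/p&gt;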


&lt;h2&gt;
  
  
  Part 2 — The Task Execution API (AIP-72): The Biggest Change You Haven't Heard Of 🤫
&lt;/h2&gt;
&lt;h3&gt;
  
  
  How Airflow 2 Ran Tasks
&lt;/h3&gt;

&lt;p&gt;In Airflow 2 with &lt;code&gt;LocalExecutor&lt;/code&gt;, task execution worked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scheduler identifies a TaskInstance ready to run&lt;/li&gt;
&lt;li&gt;Scheduler forks a subprocess&lt;/li&gt;
&lt;li&gt;Subprocess imports your DAG file directly&lt;/li&gt;
&lt;li&gt;Subprocess calls &lt;code&gt;task.execute(context)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Task code has unrestricted access to &lt;code&gt;settings.Session&lt;/code&gt;, &lt;code&gt;DagRun&lt;/code&gt;, &lt;code&gt;TaskInstance&lt;/code&gt;
models — the entire Airflow metadata database 🗄️&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 5 is a footgun. Task code could accidentally (or intentionally) query, modify,&lt;br&gt;
or drop metadata. It tightly coupled your business logic to Airflow internals.💣&lt;/p&gt;
&lt;h3&gt;
  
  
  How Airflow 3 Runs Tasks
&lt;/h3&gt;

&lt;p&gt;Airflow 3 introduces a &lt;strong&gt;Task Execution API&lt;/strong&gt; — a lightweight HTTP interface that&lt;br&gt;
sits between the task subprocess and the metadata database:🛡️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Scheduler] ──triggers──► [Task Subprocess]
                                 │
                                 │ HTTP (JWT-authenticated)
                                 ▼
                          [API Server /execution/]
                                 │
                                 ▼
                          [Metadata Database]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Task code no longer talks to the database. It talks to the Execution API, which&lt;br&gt;
enforces a controlled, auditable surface for every metadata operation. Direct&lt;br&gt;
metadata access through imports like &lt;code&gt;from airflow.models import DagRun&lt;/code&gt; inside&lt;br&gt;
task code fails at runtime in Airflow 3.🚫&lt;/p&gt;
&lt;h3&gt;
  
  
  The JWT Problem (and Why It Caused a 600% CPU Spike)💥
&lt;/h3&gt;

&lt;p&gt;The Execution API authenticates requests with JWT tokens. The scheduler &lt;em&gt;signs&lt;/em&gt; each&lt;br&gt;
task's token; the api-server &lt;em&gt;verifies&lt;/em&gt; it. Both must use the &lt;strong&gt;same secret key&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In Airflow 3, if &lt;code&gt;AIRFLOW__API_AUTH__JWT_SECRET&lt;/code&gt; is not explicitly set, each service&lt;br&gt;
calls &lt;code&gt;get_signing_key()&lt;/code&gt; and generates a &lt;strong&gt;random in-memory key&lt;/strong&gt;. The scheduler's&lt;br&gt;
random key ≠ the api-server's random key. Every task fails immediately with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Invalid auth token: Signature verification failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
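&lt;p&gt;The mismatch is easy to reproduce in miniature with plain HMAC signing, a simplified stand-in for the real JWT machinery rather than Airflow's actual implementation:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib
import hmac
import secrets

def sign(payload: bytes, key: bytes) -> bytes:
    # Scheduler side: sign the task's token with its signing key
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(payload: bytes, signature: bytes, key: bytes) -> bool:
    # API-server side: recompute the signature and compare
    return hmac.compare_digest(sign(payload, key), signature)

payload = b'{"task_id": "extract"}'

# Shared static key: verification succeeds
shared = secrets.token_bytes(32)
assert verify(payload, sign(payload, shared), shared)

# Each service minting its own random key: verification always fails
scheduler_key = secrets.token_bytes(32)
api_server_key = secrets.token_bytes(32)
assert not verify(payload, sign(payload, scheduler_key), api_server_key)
```

&lt;p&gt;Two independently generated keys never agree, so every token from the scheduler is rejected by the api-server.&lt;/p&gt;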



&lt;p&gt;The fix is one environment variable, shared across all containers:🛠️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml — x-airflow-common environment block&lt;/span&gt;
&lt;span class="na"&gt;AIRFLOW__API_AUTH__JWT_SECRET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-static-secret-change-in-prod"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
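&lt;p&gt;Any sufficiently long random string works as the secret; one way to mint one:&lt;br&gt;
&lt;/p&gt;

```python
import secrets

# Generate a URL-safe random string to share across all Airflow containers
jwt_secret = secrets.token_urlsafe(32)  # 32 bytes of entropy, ~43 characters
print(jwt_secret)
```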



&lt;p&gt;The 600% CPU spike came from a related issue: the api-server, when launched with&lt;br&gt;
&lt;code&gt;--workers&lt;/code&gt; greater than 1, spawns worker processes via&lt;br&gt;
&lt;code&gt;multiprocessing.spawn&lt;/code&gt;. Each spawned process re-initialises its own random JWT key&lt;br&gt;
and immediately crashes when it receives a token signed by the master process. The&lt;br&gt;
crash loop runs at full speed:🏎️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[api-server] Waiting for child process [12]...
[api-server] Child process [12] died unexpectedly
[api-server] Waiting for child process [13]...
[api-server] Child process [13] died unexpectedly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix: enforce a single worker until this is resolved upstream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-server --workers &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The &lt;code&gt;EXECUTION_API_SERVER_URL&lt;/code&gt; Problem📍
&lt;/h3&gt;

&lt;p&gt;Every scheduler container needs to know where the Execution API lives. The default&lt;br&gt;
is &lt;code&gt;http://localhost:8080/execution/&lt;/code&gt;. In a Docker Compose deployment, &lt;code&gt;localhost&lt;/code&gt;&lt;br&gt;
inside the scheduler container is &lt;em&gt;the scheduler container's own loopback interface&lt;/em&gt;.&lt;br&gt;
The api-server is a different container on a different network namespace.🌐&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Airflow 2: localhost was fine (single process model)
# Airflow 3 Docker: localhost = wrong container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: every task fails with &lt;code&gt;httpx.ConnectError: [Errno 111] Connection refused&lt;/code&gt;,&lt;br&gt;
even when the api-server is perfectly healthy.🛑&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;AIRFLOW__CORE__EXECUTION_API_SERVER_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://airflow-api-server:8080/execution/"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
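&lt;p&gt;A tiny preflight check, for example in CI, can catch the &lt;code&gt;localhost&lt;/code&gt; default before it reaches a multi-container deployment. The environment variable name matches the snippet above; everything else in this sketch is illustrative:&lt;br&gt;
&lt;/p&gt;

```python
from urllib.parse import urlparse

DEFAULT_URL = "http://localhost:8080/execution/"

def check_execution_api_url(env) -> str:
    """Return the configured Execution API URL, rejecting loopback
    hosts that would point a containerized scheduler at itself."""
    url = env.get("AIRFLOW__CORE__EXECUTION_API_SERVER_URL", DEFAULT_URL)
    host = urlparse(url).hostname
    if host in ("localhost", "127.0.0.1"):
        raise ValueError(
            f"Execution API host {host!r} is loopback; "
            "use the api-server's service name instead"
        )
    return url

# The implicit default falls back to localhost and is rejected
try:
    check_execution_api_url({})
    raise AssertionError("expected ValueError")
except ValueError:
    pass

# An explicit service-name URL passes through unchanged
ok = check_execution_api_url({
    "AIRFLOW__CORE__EXECUTION_API_SERVER_URL":
        "http://airflow-api-server:8080/execution/",
})
```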






&lt;h2&gt;
  
  
  Part 3 — Authentication: FAB Out, SimpleAuthManager In🔐
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flask-AppBuilder in Airflow 2
&lt;/h3&gt;

&lt;p&gt;Airflow 2 used Flask-AppBuilder (FAB) for authentication. FAB gave you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full RBAC with built-in roles (Admin, Op, User, Viewer, Public)&lt;/li&gt;
&lt;li&gt;OAuth integrations (Google, GitHub, LDAP, etc.)&lt;/li&gt;
&lt;li&gt;A complete user management UI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_AIRFLOW_WWW_USER_CREATE&lt;/code&gt; environment variable for bootstrapping admin users🛠️
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2: works as expected&lt;/span&gt;
&lt;span class="na"&gt;_AIRFLOW_WWW_USER_CREATE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;_AIRFLOW_WWW_USER_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
&lt;span class="na"&gt;_AIRFLOW_WWW_USER_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
&lt;span class="na"&gt;_AIRFLOW_WWW_USER_ROLE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SimpleAuthManager in Airflow 3
&lt;/h3&gt;

&lt;p&gt;Airflow 3 ships &lt;code&gt;SimpleAuthManager&lt;/code&gt; as the default. It stores users and passwords in a plain-text JSON file:📁&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my_secure_password"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FAB is not gone — it is available as an explicit provider — but it is no longer the default. The &lt;code&gt;_AIRFLOW_WWW_USER_CREATE&lt;/code&gt; variable is silently ignored when &lt;code&gt;SimpleAuthManager&lt;/code&gt; is active. You will see this in your init logs:📝&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Skipping user creation as auth manager different from Fab is used&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;There is no warning that your carefully configured user variables did nothing.⚠️&lt;/p&gt;

&lt;p&gt;To bootstrap a user with SimpleAuthManager in Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: configure the users list and passwords file location&lt;/span&gt;
&lt;span class="na"&gt;AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_USERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin:Admin"&lt;/span&gt;
&lt;span class="na"&gt;AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_PASSWORDS_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/airflow/project/simple_auth_manager_passwords.json"&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: write the passwords file in your init container&lt;/span&gt;
&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;python3 -c "&lt;/span&gt;
    &lt;span class="s"&gt;import json&lt;/span&gt;
    &lt;span class="s"&gt;open('/opt/airflow/project/simple_auth_manager_passwords.json','w').write(&lt;/span&gt;
        &lt;span class="s"&gt;json.dumps({'admin': 'your_password'})&lt;/span&gt;
    &lt;span class="s"&gt;)"&lt;/span&gt;
    &lt;span class="s"&gt;exec /entrypoint airflow version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The passwords file must be accessible to all containers — use a shared bind mount.🔗&lt;/p&gt;
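&lt;p&gt;If you would rather not inline Python in the compose command, the same bootstrap fits in a small script. This sketch writes to a local path (substitute the mounted path from the snippet above) and renames atomically, so other containers never observe a half-written file:&lt;br&gt;
&lt;/p&gt;

```python
import json
import os
import tempfile

def write_passwords_file(path: str, users: dict) -> None:
    """Write the SimpleAuthManager passwords file atomically: dump to a
    temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(users, f)
    os.replace(tmp, path)  # atomic on POSIX filesystems

write_passwords_file("simple_auth_manager_passwords.json",
                     {"admin": "your_password"})
```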

&lt;h3&gt;
  
  
  Choosing Between SimpleAuthManager and FAB
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local dev / CI / demos&lt;/td&gt;
&lt;td&gt;SimpleAuthManager — fast, zero config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small team, basic username/password&lt;/td&gt;
&lt;td&gt;SimpleAuthManager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise SSO (LDAP, OAuth, SAML)&lt;/td&gt;
&lt;td&gt;FAB provider (&lt;code&gt;apache-airflow-providers-fab&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-team RBAC with fine-grained permissions&lt;/td&gt;
&lt;td&gt;FAB provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes deployments&lt;/td&gt;
&lt;td&gt;FAB provider or custom &lt;code&gt;AuthManager&lt;/code&gt; implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Part 4 — Breaking Changes Catalogue📑
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 SubDAGs → TaskGroups and Assets📦
&lt;/h3&gt;

&lt;p&gt;SubDAGs are removed in Airflow 3. They were always problematic — they introduced deadlock risks with pool management, made the graph view confusing, and performed poorly at scale.📉&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2 (SubDAG pattern — do not migrate this verbatim)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.subdag&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SubDagOperator&lt;/span&gt;

&lt;span class="n"&gt;process_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SubDagOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subdag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;create_subdag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 3 migration: TaskGroups for visual grouping
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.task_group&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TaskGroup&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_pipeline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;process_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nd"&gt;@task&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

        &lt;span class="nd"&gt;@task&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

        &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For cross-DAG dependencies that SubDAGs were sometimes used for, the preferred Airflow 3 pattern is &lt;strong&gt;Asset-based scheduling&lt;/strong&gt;:💎&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Asset&lt;/span&gt;

&lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# this DAG runs when raw_data is updated
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;downstream_pipeline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 SequentialExecutor Removed🚫
&lt;/h3&gt;

&lt;p&gt;SequentialExecutor (runs one task at a time, no parallelism) is gone. The replacement for local development is &lt;code&gt;LocalExecutor&lt;/code&gt; with a PostgreSQL or SQLite backend.🗃️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2: SequentialExecutor was the default for fresh installs
&lt;/span&gt;&lt;span class="n"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SequentialExecutor&lt;/span&gt;

&lt;span class="c1"&gt;# Airflow 3: use LocalExecutor
&lt;/span&gt;&lt;span class="n"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LocalExecutor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: &lt;code&gt;LocalExecutor&lt;/code&gt; requires a real database backend (PostgreSQL recommended). SQLite with &lt;code&gt;LocalExecutor&lt;/code&gt; is technically functional but unsupported for production.⚠️&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 SLA Misses Removed⏰
&lt;/h3&gt;

&lt;p&gt;The SLA miss feature is gone. It was notoriously unreliable — callbacks fired inconsistently depending on scheduler restart timing, and the implementation was tightly coupled to the old execution model.🏚️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2 (no longer works in Airflow 3)
&lt;/span&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sla_miss_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_sla_handler&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;slow_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slow_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_slow_thing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sla&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# removed
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Migration options:🛠️&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Airflow 3.2+&lt;/strong&gt;: Use Deadline Alerts (scheduler-native, much more reliable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External monitoring&lt;/strong&gt;: Instrument task duration in your observability stack (Prometheus, Datadog, etc.) and alert from there&lt;/li&gt;
&lt;/ul&gt;
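&lt;p&gt;The external-monitoring route can be as lightweight as a timing wrapper around the callable. &lt;code&gt;emit_duration&lt;/code&gt; below is a placeholder for your real metrics client, not an Airflow or Prometheus API:&lt;br&gt;
&lt;/p&gt;

```python
import functools
import time

def emit_duration(metric: str, seconds: float) -> None:
    # Placeholder: swap in your Prometheus/Datadog/StatsD client here
    print(f"{metric}: {seconds:.3f}s")

def timed(metric: str):
    """Decorator that reports how long the wrapped callable took, so the
    observability stack can alert on slow tasks instead of Airflow SLAs."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                emit_duration(metric, time.monotonic() - start)
        return inner
    return wrap

@timed("task_duration.slow_task")
def run_slow_thing():
    time.sleep(0.01)
    return "done"
```

&lt;p&gt;The wrapped callable behaves exactly as before; only the duration metric is added, and alerting thresholds live in the monitoring system rather than the DAG.&lt;/p&gt;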

&lt;h3&gt;
  
  
  4.4 REST API v1 Removed → FastAPI v2🔌
&lt;/h3&gt;

&lt;p&gt;The REST API v1 (Flask-based, under &lt;code&gt;/api/v1/&lt;/code&gt;) is completely removed. Airflow 3 ships a stable, FastAPI-backed REST API under &lt;code&gt;/api/v2/&lt;/code&gt;.🚀&lt;/p&gt;

&lt;p&gt;The v2 API is not backward-compatible. Common breakage points:🧨&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# v1 endpoint (broken in Airflow 3)
GET /api/v1/dags/{dag_id}/dagRuns

# v2 endpoint (Airflow 3)
GET /api/v2/dags/{dag_id}/dagRuns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beyond the URL prefix change, the response schemas have also changed. Any custom integrations, CI scripts, or tooling that hit the Airflow API directly will require updates.🛠️&lt;/p&gt;
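&lt;p&gt;If client code builds its own endpoint strings, centralising the prefix makes the cut-over a one-line change (the helper and its names are illustrative, not part of any Airflow SDK):&lt;br&gt;
&lt;/p&gt;

```python
API_PREFIX = "/api/v2"  # was "/api/v1" against Airflow 2

def dag_runs_url(base: str, dag_id: str) -> str:
    """Build the dagRuns listing URL for a DAG against the given base URL."""
    return f"{base.rstrip('/')}{API_PREFIX}/dags/{dag_id}/dagRuns"

url = dag_runs_url("http://airflow-api-server:8080", "my_pipeline")
```

&lt;p&gt;Remember that the prefix is only half the migration; the changed response schemas still have to be handled per endpoint.&lt;/p&gt;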

&lt;p&gt;The new health endpoint is:🩺&lt;br&gt;
&lt;code&gt;GET /api/v2/monitor/health&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadatabase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scheduler"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"triggerer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dag_processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;dag_processor&lt;/code&gt; is a new key — it did not exist in Airflow 2 health responses.📝&lt;/p&gt;
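&lt;p&gt;A readiness probe that iterates over whatever components the payload reports keeps working across both versions, including the new &lt;code&gt;dag_processor&lt;/code&gt; key. A small illustrative check (not an official Airflow utility):&lt;/p&gt;

```python
import json

def all_healthy(payload):
    """Return True only if every reported component is healthy.

    Iterating over the payload rather than a fixed key list keeps the
    check working across Airflow 2 and 3, where the set of components
    differs (e.g. the new dag_processor key).
    """
    return all(
        component.get("status") == "healthy"
        for component in payload.values()
    )

sample = json.loads("""
{
  "metadatabase": {"status": "healthy"},
  "scheduler": {"status": "healthy"},
  "triggerer": {"status": "healthy"},
  "dag_processor": {"status": "healthy"}
}
""")
print(all_healthy(sample))
# True
```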

&lt;h3&gt;
  
  
  4.5 Removed Context Variables🏷️
&lt;/h3&gt;

&lt;p&gt;Several context variables that were previously available in the task context (the &lt;code&gt;**context&lt;/code&gt; kwargs) have been removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# These no longer exist in Airflow 3 task context
&lt;/span&gt;&lt;span class="n"&gt;execution_date&lt;/span&gt;      &lt;span class="c1"&gt;# use logical_date
&lt;/span&gt;&lt;span class="n"&gt;tomorrow_ds&lt;/span&gt;         &lt;span class="c1"&gt;# compute manually
&lt;/span&gt;&lt;span class="n"&gt;yesterday_ds&lt;/span&gt;        &lt;span class="c1"&gt;# compute manually
&lt;/span&gt;&lt;span class="n"&gt;prev_ds&lt;/span&gt;             &lt;span class="c1"&gt;# compute manually
&lt;/span&gt;&lt;span class="n"&gt;prev_execution_date&lt;/span&gt; &lt;span class="c1"&gt;# removed
&lt;/span&gt;&lt;span class="n"&gt;next_execution_date&lt;/span&gt; &lt;span class="c1"&gt;# removed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;execution_date&lt;/code&gt; rename to &lt;code&gt;logical_date&lt;/code&gt; reflects a deeper semantic change: in Airflow 3, &lt;code&gt;logical_date&lt;/code&gt; represents &lt;code&gt;run_after&lt;/code&gt; (when the DAG should run) rather than &lt;code&gt;data_interval_start&lt;/code&gt; (the start of the data window). For event-driven and manual DAGs, this distinction matters.🧐&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;run_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# deprecated
&lt;/span&gt;
&lt;span class="c1"&gt;# Airflow 3
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;run_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logical_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# correct
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
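&lt;p&gt;The "compute manually" replacements are plain &lt;code&gt;datetime&lt;/code&gt; arithmetic on &lt;code&gt;logical_date&lt;/code&gt;. A sketch, assuming &lt;code&gt;logical_date&lt;/code&gt; is the timezone-aware datetime from the task context:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def ds(dt):
    """Format a datetime the way Airflow's ds macros did (YYYY-MM-DD)."""
    return dt.strftime("%Y-%m-%d")

# In a real task, logical_date comes from the task context:
#     logical_date = context["logical_date"]
logical_date = datetime(2026, 4, 15, tzinfo=timezone.utc)

yesterday_ds = ds(logical_date - timedelta(days=1))
tomorrow_ds = ds(logical_date + timedelta(days=1))
print(yesterday_ds, tomorrow_ds)
# 2026-04-14 2026-04-16
```

&lt;p&gt;The same approach covers &lt;code&gt;prev_ds&lt;/code&gt; and the other removed convenience variables.&lt;/p&gt;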



&lt;h3&gt;
  
  
  4.6 XCom Pickling Disabled🥒
&lt;/h3&gt;

&lt;p&gt;XCom pickling is disabled by default in Airflow 3. In Airflow 2, Python objects were serialized via &lt;code&gt;pickle&lt;/code&gt; and stored in the metadata database. This allowed arbitrary Python objects to flow between tasks, but it introduced security risks (arbitrary code execution on deserialization) and ran into database payload size limits.🛡️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2: this worked silently
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;some_sklearn_model&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# pickled into XCom
&lt;/span&gt;
&lt;span class="c1"&gt;# Airflow 3: raises an error with default XCom backend
# Use JSON-serializable return values or a custom XCom backend
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/output.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# safe
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For large artifacts (models, DataFrames), the recommended pattern is to write to external storage (S3, GCS, local filesystem) and pass only the path as XCom.💾&lt;/p&gt;
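&lt;p&gt;That pattern looks like this in practice. A local temp file stands in for S3/GCS here, and the &lt;code&gt;@task&lt;/code&gt; decorator is omitted so the sketch runs standalone:&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

def extract():
    """Write a large artifact to external storage; return only metadata.

    A local temp file stands in for S3/GCS here; in production you would
    upload the artifact and return the object-store URI instead.
    """
    out_dir = Path(tempfile.mkdtemp())
    artifact = out_dir / "output.parquet"
    artifact.write_bytes(b"...parquet bytes...")

    result = {"rows": 1000, "path": str(artifact)}
    json.dumps(result)  # fails fast if the value is not XCom-safe in Airflow 3
    return result

print(extract()["rows"])
# 1000
```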




&lt;h2&gt;
  
  
  Part 5 — What's New in Airflow 3✨
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 The &lt;code&gt;airflow.sdk&lt;/code&gt; Namespace🏗️
&lt;/h3&gt;

&lt;p&gt;Airflow 3 ships a stable, versioned Task SDK. All DAG authoring primitives now live under &lt;code&gt;airflow.sdk&lt;/code&gt;:📦&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2 import paths (still work in early Airflow 3, will be removed)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models.dag&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sensors.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseSensorOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Airflow 3 canonical imports
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Asset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseSensorOperator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK is designed to have a stable interface across minor versions. The intent is that DAGs written against &lt;code&gt;airflow.sdk&lt;/code&gt; should be forward-compatible with future Airflow releases without import-path churn.🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important for Docker deployments&lt;/strong&gt;: The &lt;code&gt;airflow.sdk&lt;/code&gt; import chain triggers a connection attempt to the Task Execution API at import time. If the api-server is unavailable or CPU-starved, the dag-processor will hang on this import and eventually be SIGKILL'd by its own parse timeout. Fix the api-server first; everything else follows.🚨&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 DAG Versioning (AIP-66)📑
&lt;/h3&gt;

&lt;p&gt;Airflow 3 introduces first-class DAG versioning. Multiple versions of the same DAG can exist simultaneously in the &lt;code&gt;serialized_dag&lt;/code&gt; table, and running DagRuns execute against the DAG version they were triggered with — not the latest version.🕰️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dag_id: "healthcare_retrain"
├── version 1: train → validate (runs triggered before 2026-04-10)
└── version 2: load_data → train → validate (runs triggered after 2026-04-10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solves a long-standing pain point: in Airflow 2, modifying a DAG while runs were in-flight could corrupt active DagRuns if the task structure changed.✅&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Asset-Based Scheduling (AIP-74, AIP-75)💎
&lt;/h3&gt;

&lt;p&gt;The Airflow 2 &lt;code&gt;Dataset&lt;/code&gt; concept has been renamed to &lt;code&gt;Asset&lt;/code&gt; and significantly expanded. Assets replace cron-based scheduling for data-driven pipelines:🔄&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Asset&lt;/span&gt;

&lt;span class="c1"&gt;# Producer DAG
&lt;/span&gt;&lt;span class="n"&gt;raw_asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-datalake/raw/events.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@hourly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outlets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;raw_asset&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_and_write&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# ... write to S3
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_and_write&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Consumer DAG — runs when raw_asset is updated, not on a clock
&lt;/span&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;raw_asset&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assets enable a &lt;strong&gt;push-driven&lt;/strong&gt; scheduling model where downstream DAGs run when their data dependencies are satisfied, not when a clock fires.🌊&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 Edge Executor (AIP-69)🌐
&lt;/h3&gt;

&lt;p&gt;The Edge Executor allows Airflow tasks to run on lightweight remote workers without CeleryExecutor's operational overhead. Workers register with the api-server via HTTP polling and execute tasks locally, making it viable for:🦾&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT / edge compute deployments&lt;/li&gt;
&lt;li&gt;Low-resource VMs that cannot run a Celery broker&lt;/li&gt;
&lt;li&gt;Multi-cloud task distribution without VPN tunnels
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# airflow.cfg / env var&lt;/span&gt;
&lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EdgeExecutor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.5 Scheduler-Managed Backfills (AIP-78)🔙
&lt;/h3&gt;

&lt;p&gt;Backfills in Airflow 2 were CLI-driven one-shot operations. Airflow 3 makes backfills first-class scheduler concepts:🗓️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Airflow 3: create a scheduler-managed backfill&lt;/span&gt;
airflow dags backfill create &lt;span class="nt"&gt;--dag-id&lt;/span&gt; my_dag &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--from-date&lt;/span&gt; 2024-01-01 &lt;span class="nt"&gt;--to-date&lt;/span&gt; 2024-12-31

&lt;span class="c"&gt;# Inspect backfill state&lt;/span&gt;
airflow dags backfill list &lt;span class="nt"&gt;--dag-id&lt;/span&gt; my_dag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scheduler-managed backfills respect pool limits, run in parallel with live DagRuns, and are visible in the UI — eliminating the "backfill is a black box" experience from Airflow 2.🖤&lt;/p&gt;

&lt;h3&gt;
  
  
  5.6 React UI (AIP-38, AIP-84)🎨
&lt;/h3&gt;

&lt;p&gt;The Airflow 3 UI is a full rewrite in React, backed by the FastAPI REST API v2. Practical implications:🖱️&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significantly faster rendering for DAGs with hundreds of tasks⚡&lt;/li&gt;
&lt;li&gt;Grid view replaces the old Tree view as the primary timeline view📊&lt;/li&gt;
&lt;li&gt;The legacy Graph view (force-directed) is replaced with a cleaner task-level dependency graph🔗&lt;/li&gt;
&lt;li&gt;The UI now works correctly in all modern browsers without Flask session issues🌐&lt;/li&gt;
&lt;li&gt;Dark mode is available natively🌙&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 6 — Import Path Migration Guide🗺️
&lt;/h2&gt;

&lt;p&gt;This is the table you want bookmarked during a migration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Airflow 2 import&lt;/th&gt;
&lt;th&gt;Airflow 3 import&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.decorators import dag, task&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from airflow.sdk import dag, task&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.models.dag import DAG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from airflow.sdk import DAG&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.sensors.base import BaseSensorOperator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from airflow.sdk import BaseSensorOperator&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.datasets import Dataset&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from airflow.sdk import Asset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.models import Variable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from airflow.sdk import Variable&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.models import Connection&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from airflow.sdk import Connection&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.operators.python import PythonOperator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;apache-airflow-providers-standard&lt;/code&gt; package&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.operators.bash import BashOperator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;apache-airflow-providers-standard&lt;/code&gt; package&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;from airflow.sensors.filesystem import FileSensor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;apache-airflow-providers-standard&lt;/code&gt; package&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Many common operators (Python, Bash, File sensors) have moved to &lt;code&gt;apache-airflow-providers-standard&lt;/code&gt;. Install this package explicitly:🛠️&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;apache-airflow-providers-standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
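&lt;p&gt;For a quick audit of how many DAG files are affected, the old-to-new module mapping can be scripted. The new paths below reflect my reading of the provider package layout; verify them against the &lt;code&gt;apache-airflow-providers-standard&lt;/code&gt; docs for your installed version:&lt;/p&gt;

```python
# Old core-module path mapped to its assumed providers-standard home.
STANDARD_PROVIDER_MOVES = {
    "airflow.operators.python": "airflow.providers.standard.operators.python",
    "airflow.operators.bash": "airflow.providers.standard.operators.bash",
    "airflow.sensors.filesystem": "airflow.providers.standard.sensors.filesystem",
}

def rewrite_import(line):
    """Rewrite a 'from X import Y' line if X moved to providers-standard."""
    for old, new in STANDARD_PROVIDER_MOVES.items():
        if line.startswith(f"from {old} import"):
            return line.replace(old, new, 1)
    return line

print(rewrite_import("from airflow.operators.bash import BashOperator"))
# from airflow.providers.standard.operators.bash import BashOperator
```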



&lt;h3&gt;
  
  
  Automated Migration with Ruff🐶
&lt;/h3&gt;

&lt;p&gt;Airflow 3 ships with &lt;a href="https://docs.astral.sh/ruff/" rel="noopener noreferrer"&gt;Ruff&lt;/a&gt; lint rules specifically for migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"ruff&amp;gt;=0.13.1"&lt;/span&gt;

&lt;span class="c"&gt;# Check for mandatory breaking changes (AIR301)&lt;/span&gt;
ruff check dags/ &lt;span class="nt"&gt;--select&lt;/span&gt; AIR301 &lt;span class="nt"&gt;--preview&lt;/span&gt;

&lt;span class="c"&gt;# Auto-fix safe renames&lt;/span&gt;
ruff check dags/ &lt;span class="nt"&gt;--select&lt;/span&gt; AIR301 &lt;span class="nt"&gt;--fix&lt;/span&gt; &lt;span class="nt"&gt;--unsafe-fixes&lt;/span&gt; &lt;span class="nt"&gt;--preview&lt;/span&gt;

&lt;span class="c"&gt;# Check for recommended updates (AIR302: deprecated-but-not-yet-removed)&lt;/span&gt;
ruff check dags/ &lt;span class="nt"&gt;--select&lt;/span&gt; AIR302 &lt;span class="nt"&gt;--preview&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:📝&lt;br&gt;
&lt;code&gt;dags/retrain_dag.py:3:1: AIR301 airflow.decorators.dag is removed in Airflow 3.0. Use airflow.sdk.dag instead.&lt;/code&gt;&lt;br&gt;
&lt;code&gt;[*] AIR301 auto-fix available&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7 — Docker Compose: What Breaks, What to Add🐳
&lt;/h2&gt;

&lt;p&gt;If you are running Airflow 2 via Docker Compose, here is a precise list of changes required for Airflow 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Services to Add➕
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;airflow-dag-processor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;*airflow-common&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dag-processor&lt;/span&gt;
  &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;airflow"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--job-type"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DagProcessorJob"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--local"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
    &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;
  &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;airflow-init&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_completed_successfully&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Services to Rename✏️
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2&lt;/span&gt;
&lt;span class="na"&gt;airflow-webserver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webserver&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;

&lt;span class="c1"&gt;# Airflow 3&lt;/span&gt;
&lt;span class="na"&gt;airflow-api-server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-server --workers &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# --workers 1 is critical (see Part 2)&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment Variables to Add🌍
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;x-airflow-common&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="nl"&gt;&amp;amp;airflow-common&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Critical: prevents Connection refused in scheduler&lt;/span&gt;
    &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTION_API_SERVER_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://airflow-api-server:8080/execution/"&lt;/span&gt;

    &lt;span class="c1"&gt;# Critical: prevents JWT Signature verification failed&lt;/span&gt;
    &lt;span class="na"&gt;AIRFLOW__API_AUTH__JWT_SECRET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;change-this-in-production"&lt;/span&gt;

    &lt;span class="c1"&gt;# Required for SimpleAuthManager user configuration&lt;/span&gt;
    &lt;span class="na"&gt;AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_USERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin:Admin"&lt;/span&gt;
    &lt;span class="na"&gt;AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_PASSWORDS_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/airflow/project/simple_auth_manager_passwords.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Healthcheck Changes🩺
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Airflow 2 scheduler healthcheck (port 8974 no longer exists in Airflow 3)&lt;/span&gt;
&lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--fail"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8974/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Airflow 3 scheduler healthcheck&lt;/span&gt;
&lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;airflow"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--job-type"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SchedulerJob"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--local"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt; &lt;span class="c1"&gt;# airflow jobs check takes ~42s (full Python + DB round-trip)&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt; &lt;span class="c1"&gt;# covers pip install time on first start&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--local&lt;/code&gt; flag is essential. &lt;code&gt;--hostname $(hostname)&lt;/code&gt; compares the container's &lt;code&gt;$HOSTNAME&lt;/code&gt; env var against the hostname Airflow registered in the database — these often differ (&lt;code&gt;9811c4ea8dec&lt;/code&gt; vs &lt;code&gt;airflow-scheduler.internal&lt;/code&gt;), causing perpetual &lt;em&gt;unhealthy&lt;/em&gt; status even when the service is running correctly.🔍&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 8 — Configuration Migration⚙️
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Changed Defaults That Will Surprise You😮
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="err"&gt;catchup_by_default:&lt;/span&gt; &lt;span class="err"&gt;was&lt;/span&gt; &lt;span class="err"&gt;True&lt;/span&gt; &lt;span class="err"&gt;in&lt;/span&gt; &lt;span class="err"&gt;Airflow&lt;/span&gt; &lt;span class="err"&gt;2,&lt;/span&gt; &lt;span class="err"&gt;False&lt;/span&gt; &lt;span class="err"&gt;in&lt;/span&gt; &lt;span class="err"&gt;Airflow&lt;/span&gt; &lt;span class="err"&gt;3&lt;/span&gt;
&lt;span class="c"&gt;# If you have DAGs with start_date in the past and no explicit catchup=True,
# they will NOT backfill on first deploy — this is usually what you want,
# but verify before deploying
&lt;/span&gt;&lt;span class="nn"&gt;[scheduler]&lt;/span&gt;
&lt;span class="py"&gt;catchup_by_default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;False # Airflow 3 default&lt;/span&gt;

&lt;span class="c"&gt;# Default schedule: was @daily implicit in some contexts, now None
# DAGs with no schedule parameter will not run automatically
&lt;/span&gt;&lt;span class="nn"&gt;[scheduler]&lt;/span&gt;
&lt;span class="c"&gt;# Use schedule=None explicitly if that's your intent
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
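&lt;p&gt;The practical effect of the flipped default can be illustrated with a toy model of run creation (greatly simplified; the real scheduler also respects data intervals, pools, and timetables):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def runs_to_create(start_date, now, interval, catchup):
    """Toy model of which DagRuns get created on first deploy.

    Simplified, but it captures the catchup_by_default flip: with
    catchup=True every missed interval gets a run; with catchup=False
    (the Airflow 3 default) only the most recent one does.
    """
    n_missed = int((now - start_date) / interval)  # completed intervals
    missed = [start_date + i * interval for i in range(n_missed)]
    if catchup:
        return missed
    return missed[-1:]  # latest completed interval only

start = datetime(2026, 4, 10, tzinfo=timezone.utc)
now = datetime(2026, 4, 15, tzinfo=timezone.utc)
day = timedelta(days=1)

print(len(runs_to_create(start, now, day, catchup=True)))   # 5
print(len(runs_to_create(start, now, day, catchup=False)))  # 1
```

&lt;p&gt;If a DAG relies on backfilling historical intervals, set &lt;code&gt;catchup=True&lt;/code&gt; explicitly on the DAG before upgrading.&lt;/p&gt;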



&lt;h3&gt;
  
  
  Renamed Configuration Keys✏️
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# Airflow 2 → Airflow 3 config key mapping
&lt;/span&gt;&lt;span class="nn"&gt;[webserver]&lt;/span&gt;
&lt;span class="py"&gt;web_server_host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0 → [api]&lt;/span&gt;
&lt;span class="py"&gt;host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;

&lt;span class="nn"&gt;[webserver]&lt;/span&gt;
&lt;span class="py"&gt;error_logfile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;... → REMOVED (no replacement)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Automated Config Migration🛠️
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your airflow.cfg for deprecated/invalid keys&lt;/span&gt;
airflow config lint

&lt;span class="c"&gt;# Apply automatic fixes&lt;/span&gt;
airflow config update &lt;span class="nt"&gt;--fix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 9 — Migration Path🛤️
&lt;/h2&gt;

&lt;p&gt;If you are upgrading a production Airflow 2 deployment, follow this sequence:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — Prepare (Still on Airflow 2)🏗️
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade to Airflow 2.7+&lt;/strong&gt; — the schema migration from earlier versions significantly increases &lt;code&gt;airflow db migrate&lt;/code&gt; time; get that done first.⏳&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean the metadata database&lt;/strong&gt; — &lt;code&gt;airflow db clean&lt;/code&gt; removes old DagRun/TaskInstance records and dramatically speeds up the schema migration.🧹&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run Ruff AIR301 checks&lt;/strong&gt; — &lt;code&gt;ruff check dags/ --select AIR301 --preview&lt;/code&gt;.🐶&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix all deprecation warnings&lt;/strong&gt; — zero warnings in Airflow 2.9 means fewer surprises in Airflow 3.⚠️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit direct database access&lt;/strong&gt; — grep your task code for &lt;code&gt;from airflow.models&lt;/code&gt; imports; these will break.🔍
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find tasks using direct metadata DB access&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"from airflow.models"&lt;/span&gt; dags/ &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"settings.Session"&lt;/span&gt; dags/ &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"DagRun|TaskInstance|Variable"&lt;/span&gt; dags/ &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"import"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 2 — Upgrade ⬆️
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Back up your metadata database&lt;/strong&gt; — non-negotiable.💾&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update your Docker image&lt;/strong&gt; to &lt;code&gt;apache/airflow:3.0.0&lt;/code&gt;.🐳&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add dag-processor service&lt;/strong&gt; to your Compose/Kubernetes manifests.🧩&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rename webserver → api-server&lt;/strong&gt; in service definitions.✏️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the three critical env vars&lt;/strong&gt;:🌍

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AIRFLOW__CORE__EXECUTION_API_SERVER_URL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AIRFLOW__API_AUTH__JWT_SECRET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;SimpleAuthManager password config&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;airflow db migrate&lt;/code&gt;&lt;/strong&gt;.🔄&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update all import paths&lt;/strong&gt; (use Ruff auto-fix first, then manual review).🛠️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update healthchecks&lt;/strong&gt; to &lt;code&gt;airflow jobs check --local&lt;/code&gt;.🩺&lt;/li&gt;
&lt;/ol&gt;
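&lt;p&gt;Steps 2–5 translate into a Compose change roughly like the following. This is a sketch, not a drop-in file: the service layout, the &lt;code&gt;command&lt;/code&gt; values, and the URL path are assumptions to verify against your own deployment and the official upgrade guide:&lt;/p&gt;

```yaml
services:
  api-server:                 # renamed from "webserver"
    image: apache/airflow:3.0.0
    command: api-server
    environment:
      AIRFLOW__API_AUTH__JWT_SECRET: ${JWT_SECRET}   # must match on every service
      # URL path is an assumption; confirm against your deployment:
      AIRFLOW__CORE__EXECUTION_API_SERVER_URL: http://api-server:8080/execution/
      # plus your SimpleAuthManager password configuration (step 5)
  scheduler:
    image: apache/airflow:3.0.0
    command: scheduler
    # ...same environment block as api-server...
  dag-processor:              # new standalone service in Airflow 3
    image: apache/airflow:3.0.0
    command: dag-processor
    # ...same environment block as api-server...
```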

&lt;h3&gt;
  
  
  Phase 3 — Validate ✅
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check all services healthy&lt;/span&gt;
curl http://localhost:8080/api/v2/monitor/health | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool

&lt;span class="c"&gt;# Expected output&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"metadatabase"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;: &lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"scheduler"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;: &lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"triggerer"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;: &lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"dag_processor"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;: &lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Trigger a test DAG&lt;/span&gt;
airflow dags trigger your_test_dag

&lt;span class="c"&gt;# Check task state&lt;/span&gt;
airflow tasks states-for-dag-run your_test_dag &amp;lt;run_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 10 — Should You Upgrade? 🤔
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Upgrade Now If: 🚀
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You are &lt;strong&gt;starting a new project&lt;/strong&gt; — there is no reason to build on Airflow 2.✨&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;simple DAGs&lt;/strong&gt; (PythonOperator, BashOperator, standard providers) — the migration is mostly find-and-replace on import paths.🛠️&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;DAG versioning&lt;/strong&gt; — this solves real operational pain.🕰️&lt;/li&gt;
&lt;li&gt;You are running on &lt;strong&gt;Kubernetes&lt;/strong&gt; — the separation of concerns maps cleanly to individual pod scaling.🏗️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wait If: 🛑
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You depend heavily on &lt;strong&gt;FAB's OAuth/LDAP integrations&lt;/strong&gt; and have not tested the FAB provider on Airflow 3.🔐&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;extensive SLA miss callback logic&lt;/strong&gt; and no monitoring alternative ready.⏰&lt;/li&gt;
&lt;li&gt;Your codebase has &lt;strong&gt;heavy direct metadata database access&lt;/strong&gt; in task code — refactoring that to the Python Client is non-trivial.🗃️&lt;/li&gt;
&lt;li&gt;You use &lt;strong&gt;CeleryKubernetesExecutor or LocalKubernetesExecutor&lt;/strong&gt; — both are removed; you need to evaluate the Multiple Executor Configuration feature instead.🧩&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;custom Flask-AppBuilder views or blueprints&lt;/strong&gt; — these require porting to FastAPI.🎨&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Honest Assessment ⚖️
&lt;/h3&gt;

&lt;p&gt;Airflow 3 is the version the project should have been architecturally from the beginning. The separation of the dag-processor, the Task Execution API, and the prohibition on direct metadata access are the right engineering decisions. They make Airflow significantly more secure, more scalable, and more maintainable at the cost of a one-time migration investment.🦾&lt;/p&gt;

&lt;p&gt;The upgrade complexity is proportional to how much your codebase relied on Airflow 2's leaky abstractions: direct database access, FAB internals, SLA callbacks, and SubDAGs. If you followed Airflow 2 best practices (TaskFlow API, provider operators, no direct DB access), the migration is a half-day of import path updates and Docker Compose additions.🛠️&lt;/p&gt;

&lt;p&gt;If you did not, this upgrade is the forcing function to do it properly.🚀&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion 🏁
&lt;/h2&gt;

&lt;p&gt;The jump from Airflow 2 to Airflow 3 is the most significant change in the project's history. The webserver is gone. The scheduler no longer parses DAGs. Tasks no longer touch the metadata database. The JWT-authenticated Execution API connects them all.🔗&lt;/p&gt;

&lt;p&gt;Each of these changes surfaces as a concrete failure mode in the first deployment: CPU spikes from JWT key divergence, &lt;code&gt;Connection refused&lt;/code&gt; from wrong service URLs, silent healthcheck failures from removed ports, and user creation that silently no-ops under the replacement auth manager.🧨&lt;/p&gt;

&lt;p&gt;Understanding the &lt;em&gt;why&lt;/em&gt; behind the architecture — isolation, security, scalability — converts each failure from mysterious to obvious. The fixes are not workarounds; they are the intended configuration patterns for a distributed, multi-service orchestration system.🧠&lt;/p&gt;

&lt;p&gt;Airflow 3 is what a modern data orchestrator should look like. Migrate when you are ready, migrate properly, and you will not look back.🚀&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html" rel="noopener noreferrer"&gt;Apache Airflow 3.0 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/installation/upgrading_to_airflow3.html" rel="noopener noreferrer"&gt;Official Upgrade to Airflow 3 Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.astronomer.io/docs/learn/airflow-upgrade-2-3" rel="noopener noreferrer"&gt;Astronomer: Upgrading from Airflow 2 to 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-72+Task+Execution+Interface" rel="noopener noreferrer"&gt;AIP-72: Task Execution Interface&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-66+DAG+Versioning" rel="noopener noreferrer"&gt;AIP-66: DAG Versioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.astral.sh/ruff/rules/#airflow-air" rel="noopener noreferrer"&gt;Ruff AIR linting rules&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Written from direct production experience migrating a healthcare ML retraining pipeline from Airflow 2 patterns to Airflow 3.0.0 on Docker Compose, April 2026 📝.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>⚡ High-Performance Warehousing: Partitioning &amp; Clustering</title>
      <dc:creator>De' Clerke</dc:creator>
      <pubDate>Wed, 04 Feb 2026 16:52:03 +0000</pubDate>
      <link>https://forem.com/de_clerke/high-performance-warehousing-partitioning-clustering-4om3</link>
      <guid>https://forem.com/de_clerke/high-performance-warehousing-partitioning-clustering-4om3</guid>
      <description>&lt;p&gt;In my previous posts, we discussed how to structure a Data Warehouse. But as your data grows from thousands to billions of rows, even a perfect Star Schema can become slow. To keep queries lightning-fast, we use two primary optimization techniques: &lt;strong&gt;Partitioning&lt;/strong&gt; and &lt;strong&gt;Clustering&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Partitioning: Divide and Conquer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Partitioning&lt;/strong&gt; is a technique that divides large tables into smaller, more manageable segments based on a specific column, like a date or a region.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analogy
&lt;/h3&gt;

&lt;p&gt;Imagine a library with millions of exam papers. If they are all in one giant pile, finding a specific paper is impossible. But if you divide them into separate boxes by &lt;strong&gt;Subject&lt;/strong&gt;, you only need to search the "Math" box to find a math paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Partitioning&lt;/strong&gt;: Divides tables based on row values (e.g., separating sales by month).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical Partitioning&lt;/strong&gt;: Divides tables based on columns, separating frequently accessed data from rarely used or sensitive information (like moving Social Security Numbers to a separate, restricted segment).&lt;/li&gt;
&lt;/ul&gt;
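&lt;p&gt;To make horizontal partitioning concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt;. SQLite has no native partitioning, so the per-month tables and the routing function are a hand-rolled illustration of the idea, not a production pattern; all table and column names are invented:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One "partition" table per month -- the engine only scans the table we name.
for month in ("2024_01", "2024_02"):
    conn.execute(f"CREATE TABLE sales_{month} (sale_date TEXT, amount REAL)")

def insert_sale(sale_date: str, amount: float) -> None:
    # Route each row to its partition based on the date column.
    partition = sale_date[:7].replace("-", "_")
    conn.execute(f"INSERT INTO sales_{partition} VALUES (?, ?)", (sale_date, amount))

insert_sale("2024-01-15", 100.0)
insert_sale("2024-02-03", 250.0)

# A query for January touches only the January partition.
jan_total = conn.execute("SELECT SUM(amount) FROM sales_2024_01").fetchone()[0]
print(jan_total)  # 100.0
```

&lt;p&gt;Real warehouses do this routing automatically from a declared partition key; the win is the same: irrelevant segments are never read.&lt;/p&gt;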

&lt;h3&gt;
  
  
  Why use it?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query Performance&lt;/strong&gt;: The database engine only scans the relevant partitions, which significantly reduces I/O operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance Efficiency&lt;/strong&gt;: You can back up or archive specific partitions without touching the entire table.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Clustering: Keeping Neighbors Close
&lt;/h2&gt;

&lt;p&gt;While partitioning splits data into "boxes," &lt;strong&gt;Clustering&lt;/strong&gt; organizes how the data is physically stored on the disk within those boxes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analogy
&lt;/h3&gt;

&lt;p&gt;Think of a library again. Inside the "History" box, you group books by &lt;strong&gt;Author&lt;/strong&gt;. If a student wants all books by a specific author, they are all sitting right next to each other on the shelf, so the student doesn't have to walk back and forth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I/O Reduction&lt;/strong&gt;: Related records are read in a single disk operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Efficiency&lt;/strong&gt;: Accessing one record automatically brings its "neighbors" into the cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compression&lt;/strong&gt;: Similar values cluster together, which allows the database to compress the data more effectively.&lt;/li&gt;
&lt;/ul&gt;
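&lt;p&gt;The compression benefit is easy to demonstrate in plain Python. The sketch below compresses the same low-cardinality column twice: once physically clustered (sorted, so similar values sit together) and once scattered. The clustered layout compresses to a small fraction of the size:&lt;/p&gt;

```python
import random
import zlib

random.seed(42)
# A low-cardinality column, as found in a typical fact table.
categories = [random.choice(["books", "games", "music", "tools"]) for _ in range(10_000)]

scattered = ",".join(categories).encode()           # values interleaved on "disk"
clustered = ",".join(sorted(categories)).encode()   # similar values stored adjacently

print(len(zlib.compress(scattered)))  # thousands of bytes
print(len(zlib.compress(clustered)))  # far smaller: long runs of identical values
```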




&lt;h2&gt;
  
  
  Partitioning vs. Clustering: When to Use Which?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Partitioning&lt;/th&gt;
&lt;th&gt;Clustering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logical division into segments&lt;/td&gt;
&lt;td&gt;Physical organization on disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Common Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Date, Year, or Region&lt;/td&gt;
&lt;td&gt;ID, Category, or frequent filter keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Great for "skipping" huge amounts of data&lt;/td&gt;
&lt;td&gt;Great for speeding up searches within a dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Building a successful data warehouse isn't just about storing data; it's about making it accessible. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP vs OLAP&lt;/strong&gt;: Separate your "doing" from your "thinking".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star vs Snowflake&lt;/strong&gt;: Choose a schema that balances speed and storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning &amp;amp; Clustering&lt;/strong&gt;: Use these to ensure your warehouse scales as your business grows.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>partitioning</category>
      <category>clustering</category>
      <category>datawarehouse</category>
      <category>database</category>
    </item>
    <item>
      <title>⭐ Star vs. ❄️ Snowflake: Designing the Data Warehouse</title>
      <dc:creator>De' Clerke</dc:creator>
      <pubDate>Wed, 04 Feb 2026 16:33:02 +0000</pubDate>
      <link>https://forem.com/de_clerke/star-vs-snowflake-designing-the-data-warehouse-22ad</link>
      <guid>https://forem.com/de_clerke/star-vs-snowflake-designing-the-data-warehouse-22ad</guid>
      <description>&lt;p&gt;In my previous post, we explored why we use OLAP systems (Data Warehouses) for analytics. But once you have a warehouse, how do you organize the data inside it? This is where &lt;strong&gt;Data Modeling&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;To make data easy to query, we use &lt;strong&gt;Dimensional Modeling&lt;/strong&gt;, which organizes data into two types of tables: &lt;strong&gt;Facts&lt;/strong&gt; and &lt;strong&gt;Dimensions&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Building Blocks: Facts &amp;amp; Dimensions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fact Tables
&lt;/h3&gt;

&lt;p&gt;These are the central repositories for measurable business metrics. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What they store&lt;/strong&gt;: Quantitative measurements (facts) like sales amounts, quantities, or durations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt;: Usually the largest tables, containing foreign keys that link to related dimension tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example&lt;/strong&gt;: A &lt;code&gt;Sales_Fact&lt;/code&gt; table containing &lt;code&gt;revenue&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, and &lt;code&gt;discount&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Dimension Tables
&lt;/h3&gt;

&lt;p&gt;These provide the descriptive context that makes fact table measurements meaningful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What they store&lt;/strong&gt;: Attributes used for filtering and grouping, such as product names, customer demographics, or dates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt;: Typically smaller in terms of row count and often denormalized for speed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Star Schema
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Star Schema&lt;/strong&gt; is the most fundamental and widely used pattern. It looks like a star because the central fact table is surrounded by a single layer of dimension tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use a Star Schema?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query Simplicity&lt;/strong&gt;: It requires fewer joins, making it easier for business users to understand and query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Because there are fewer joins, queries generally execute faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Compatibility&lt;/strong&gt;: Most BI tools (like Tableau or Power BI) are optimized for this structure.&lt;/li&gt;
&lt;/ul&gt;
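&lt;p&gt;A tiny runnable example using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (table and column names are made up) shows the "one join per dimension" property in action:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales_fact  (product_id INTEGER, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO sales_fact  VALUES (1, 1200.0), (1, 800.0), (2, 600.0);
""")

# One join from the central fact table to the dimension answers the question.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.revenue)
    FROM sales_fact f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(rows)  # [('Laptop', 2000.0), ('Phone', 600.0)]
```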

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2qptgwqw1k186sbvs5j.jpg" alt=" " width="798" height="518"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Snowflake Schema&lt;/strong&gt; is an extension of the star schema. In this model, the dimension tables are &lt;strong&gt;normalized&lt;/strong&gt; into multiple related tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use a Snowflake Schema?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage Efficiency&lt;/strong&gt;: Normalization reduces data redundancy, which is helpful if your dimension tables are massive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity&lt;/strong&gt;: It reduces the risk of inconsistencies because attributes are updated in only one place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt;: Changes to hierarchical data (like a product category) are easier to manage.&lt;/li&gt;
&lt;/ul&gt;
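&lt;p&gt;Snowflaking a product dimension moves its category into a separate normalized table, so answering the same kind of question now costs two joins instead of one. A minimal &lt;code&gt;sqlite3&lt;/code&gt; sketch with invented names:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY,
                               product_name TEXT, category_id INTEGER);
    CREATE TABLE sales_fact   (product_id INTEGER, revenue REAL);
    INSERT INTO dim_category VALUES (10, 'Electronics');
    INSERT INTO dim_product  VALUES (1, 'Laptop', 10), (2, 'Phone', 10);
    INSERT INTO sales_fact   VALUES (1, 1200.0), (2, 600.0);
""")

# The category name now lives one join further away from the fact table.
rows = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM sales_fact f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
""").fetchall()
print(rows)  # [('Electronics', 1800.0)]
```

&lt;p&gt;The category name is stored exactly once, at the cost of the extra hop.&lt;/p&gt;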

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8hkar89bxbcyq08zz0c.png" alt=" " width="800" height="565"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Star Schema&lt;/th&gt;
&lt;th&gt;Snowflake Schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple (1 join per dimension)&lt;/td&gt;
&lt;td&gt;Complex (multiple joins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Redundancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (Denormalized)&lt;/td&gt;
&lt;td&gt;Lower (Normalized)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generally faster&lt;/td&gt;
&lt;td&gt;Potentially slower due to joins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Experience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Intuitive for business users&lt;/td&gt;
&lt;td&gt;Less intuitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Which one should you choose?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose Star Schema&lt;/strong&gt; if you prioritize query speed and want to make it easy for non-technical users to build their own reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose Snowflake Schema&lt;/strong&gt; if you have very large dimension tables where storage costs are a concern or if you need to strictly enforce data integrity.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt; The Star Schema is built for &lt;strong&gt;speed and simplicity&lt;/strong&gt;, while the Snowflake Schema is built for &lt;strong&gt;storage efficiency and organization&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>datamodeling</category>
      <category>starchema</category>
      <category>snowflakeschema</category>
      <category>dimensionalmodeling</category>
    </item>
    <item>
      <title>🏦 OLTP vs. OLAP: Why One Database Isn't Enough</title>
      <dc:creator>De' Clerke</dc:creator>
      <pubDate>Wed, 04 Feb 2026 10:57:08 +0000</pubDate>
      <link>https://forem.com/de_clerke/oltp-vs-olap-why-one-database-isnt-enough-4op0</link>
      <guid>https://forem.com/de_clerke/oltp-vs-olap-why-one-database-isnt-enough-4op0</guid>
      <description>&lt;p&gt;If you’ve ever wondered why companies don't just run their big data reports directly on their production database, you’re asking the right question. In the world of data engineering, we solve this by separating systems into two categories: &lt;strong&gt;OLTP&lt;/strong&gt; and &lt;strong&gt;OLAP&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Concept
&lt;/h2&gt;

&lt;p&gt;To understand the difference, think of a &lt;strong&gt;supermarket manager’s office&lt;/strong&gt;. The checkout counters handle individual transactions as they happen—that’s &lt;strong&gt;OLTP&lt;/strong&gt;. The manager’s office, however, stores years of sales records to analyze trends and plan for the future—that’s &lt;strong&gt;OLAP&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. OLTP (Online Transaction Processing)
&lt;/h2&gt;

&lt;p&gt;OLTP systems are the "workhorses" of day-to-day business. They are designed to handle real-time operations where data changes frequently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary Purpose&lt;/strong&gt;: Process individual transactions in real-time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common Operations&lt;/strong&gt;: &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Structure&lt;/strong&gt;: Highly &lt;strong&gt;normalized&lt;/strong&gt; (many small tables) to reduce redundancy and ensure fast writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Banking systems, e-commerce checkouts, and inventory tracking.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. OLAP (Online Analytical Processing)
&lt;/h2&gt;

&lt;p&gt;OLAP systems (Data Warehouses) are built for the "big picture". They are optimized for complex analysis and reporting rather than processing single transactions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary Purpose&lt;/strong&gt;: Enable complex data mining and business intelligence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common Operations&lt;/strong&gt;: Complex &lt;code&gt;SELECT&lt;/code&gt; statements with heavy &lt;code&gt;GROUP BY&lt;/code&gt; and aggregations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Structure&lt;/strong&gt;: &lt;strong&gt;Denormalized&lt;/strong&gt; (fewer, larger tables) to reduce the need for joins during analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Business intelligence dashboards and market trend analysis tools.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OLTP (The "Doer")&lt;/th&gt;
&lt;th&gt;OLAP (The "Thinker")&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Current, live data&lt;/td&gt;
&lt;td&gt;Historical data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast write operations&lt;/td&gt;
&lt;td&gt;Fast read operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Pattern&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple queries on few records&lt;/td&gt;
&lt;td&gt;Complex queries on massive datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operational staff &amp;amp; Customers&lt;/td&gt;
&lt;td&gt;Analysts, Data Scientists, &amp;amp; Execs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqw061lhne6whftyq2rl.png" alt=" " width="800" height="329"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Why the Separation Matters
&lt;/h2&gt;

&lt;p&gt;Separating these workloads is critical for &lt;strong&gt;Performance&lt;/strong&gt; and &lt;strong&gt;Stability&lt;/strong&gt;. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Workload Isolation&lt;/strong&gt;: You don't want a heavy "Year-over-Year Sales" report slowing down the checkout counter for a customer.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data Quality&lt;/strong&gt;: Data warehouses use ETL/ELT processes to cleanse and standardize data from multiple sources before it reaches the OLAP system.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Example: Query Patterns
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;OLTP&lt;/strong&gt; query is surgical and fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;98765&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;strong&gt;OLAP&lt;/strong&gt; query is broad and resource-intensive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_transaction&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2023-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2023-12-31'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt; OLTP is about &lt;strong&gt;accuracy and speed&lt;/strong&gt; in the moment; OLAP is about &lt;strong&gt;insight and context&lt;/strong&gt; over time.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>datawarehouse</category>
      <category>oltp</category>
      <category>olap</category>
      <category>database</category>
    </item>
    <item>
      <title>🔄 ETL vs. ELT: The Evolution of Data Integration</title>
      <dc:creator>De' Clerke</dc:creator>
      <pubDate>Wed, 04 Feb 2026 10:15:35 +0000</pubDate>
      <link>https://forem.com/de_clerke/etl-vs-elt-the-evolution-of-data-integration-1ep</link>
      <guid>https://forem.com/de_clerke/etl-vs-elt-the-evolution-of-data-integration-1ep</guid>
      <description>&lt;p&gt;In my last post, we looked at how databases store information. But how does that data actually get there? As a data engineer, most of your time is spent designing the "pipelines" that move data from source to destination. &lt;/p&gt;

&lt;p&gt;Two main methodologies dominate this space: &lt;strong&gt;ETL&lt;/strong&gt; and &lt;strong&gt;ELT&lt;/strong&gt;. These define whether data is transformed before or after it hits your target system. Let's break down the evolution.&lt;/p&gt;




&lt;h2&gt;
  
  
  What are ETL and ELT?
&lt;/h2&gt;

&lt;p&gt;Both acronyms represent three core steps: &lt;strong&gt;Extract&lt;/strong&gt;, &lt;strong&gt;Transform&lt;/strong&gt;, and &lt;strong&gt;Load&lt;/strong&gt;. The difference lies entirely in the &lt;strong&gt;sequence&lt;/strong&gt; and &lt;strong&gt;where&lt;/strong&gt; the heavy lifting happens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2998dlhz5f8lvlt3u590.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2998dlhz5f8lvlt3u590.png" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. ETL (Extract, Transform, Load)
&lt;/h3&gt;

&lt;p&gt;This is the traditional approach. Data is extracted, transformed in a separate processing layer, and then loaded into the target system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow:&lt;/strong&gt; Data moves from Source → Transformation Engine → Target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Ensures high data quality and security (masking) before the data is stored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Complex transformations or when target systems have limited resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. ELT (Extract, Load, Transform)
&lt;/h3&gt;

&lt;p&gt;ELT is the modern, cloud-native approach. Raw data is loaded directly into the target system, and transformations are performed using the target's own computational power.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow:&lt;/strong&gt; Data moves from Source → Target → Transformation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Faster loading times and high scalability using cloud warehouses like Snowflake or BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Big Data scenarios and agile analytics where requirements change rapidly.&lt;/li&gt;
&lt;/ul&gt;
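&lt;p&gt;To make "transform inside the target" concrete, here is a toy ELT flow with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; standing in for the cloud warehouse (table and column names are invented): the raw rows are loaded untouched, and the transformation is a SQL statement the target executes itself:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the target exactly as they arrived.
conn.execute("CREATE TABLE raw_sales (quantity INTEGER, unit_price REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", [(2, 9.99), (1, 24.50)])

# Transform: runs inside the target, using the target's own compute.
conn.execute("""
    CREATE TABLE sales AS
    SELECT quantity, unit_price, quantity * unit_price AS total_price
    FROM raw_sales
""")

totals = conn.execute("SELECT total_price FROM sales").fetchall()
print(totals)  # [(19.98,), (24.5,)]
```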

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2p7m3rak2hbl2u7fxum.png" alt=" " width="800" height="457"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Comparison Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Processing Location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External transformation engine&lt;/td&gt;
&lt;td&gt;Within target system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (pre-loading validation)&lt;/td&gt;
&lt;td&gt;Variable (post-loading validation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lower (rigid schemas)&lt;/td&gt;
&lt;td&gt;Higher (on-demand views)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex schema management&lt;/td&gt;
&lt;td&gt;Easier to adapt to changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Modern Architectures: The Medallion Approach
&lt;/h2&gt;

&lt;p&gt;Many modern data teams use a "Hybrid" or &lt;strong&gt;Medallion Architecture&lt;/strong&gt; to balance both worlds. This organizes data into layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze (Raw):&lt;/strong&gt; The ELT starting point. Raw data is dumped here exactly as it came from the source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver (Filtered):&lt;/strong&gt; Data is cleaned, standardized, and joined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold (Business-Ready):&lt;/strong&gt; Highly transformed and aggregated data ready for analytics.&lt;/li&gt;
&lt;/ul&gt;
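&lt;p&gt;A minimal sketch of those three layers using pandas, with hypothetical order data (column names are illustrative, not from any real pipeline):&lt;/p&gt;

```python
import pandas as pd

# Bronze: raw data exactly as it arrived (built in-memory here for illustration)
bronze = pd.DataFrame({
    "order_id": [1, 2, 2, None],
    "quantity": [2, 1, 1, 3],
    "unit_price": [10.0, 99.5, 99.5, 5.0],
})

# Silver: cleaned and standardized -- drop incomplete rows and duplicates
silver = bronze.dropna(subset=["order_id"]).drop_duplicates().copy()

# Gold: business-ready aggregate
silver["total_price"] = silver["quantity"] * silver["unit_price"]
gold_revenue = silver["total_price"].sum()
print(gold_revenue)  # → 119.5
```

In a real lakehouse each layer would be persisted as its own table, but the shape of the flow is the same: raw in, progressively refined out.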




&lt;h2&gt;
  
  
  Example: Transformation in Action
&lt;/h2&gt;

&lt;p&gt;In an &lt;strong&gt;ETL&lt;/strong&gt; workflow, you might use &lt;strong&gt;Python&lt;/strong&gt; to clean data before loading it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_sales_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df_cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df_cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In an &lt;strong&gt;ELT&lt;/strong&gt; workflow, you load the raw data first and then use &lt;strong&gt;SQL&lt;/strong&gt; (often managed by tools like dbt) inside your warehouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_staging&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
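&lt;p&gt;The load-then-transform order is easy to demonstrate end to end. In this sketch, Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; stands in for the warehouse, and the schema-qualified names from the SQL above are flattened to plain table names:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Load: raw rows land in a staging table with no transformation at all
con.execute(
    "CREATE TABLE sales_staging "
    "(order_id INT, quantity INT, unit_price REAL, order_date TEXT)"
)
con.executemany(
    "INSERT INTO sales_staging VALUES (?, ?, ?, ?)",
    [(1, 2, 10.0, "2025-01-01"), (None, 1, 99.5, "2025-01-02")],
)

# Transform: the database does the work, after loading
con.execute("""
    CREATE TABLE fact_sales AS
    SELECT order_id,
           quantity * unit_price AS total_price,
           DATE(order_date)      AS order_date
    FROM sales_staging
    WHERE order_id IS NOT NULL
""")
rows = con.execute("SELECT order_id, total_price FROM fact_sales").fetchall()
print(rows)  # → [(1, 20.0)]
```

Note that the dirty row (missing `order_id`) still sits untouched in `sales_staging`; in ELT you keep the raw data and can always re-run or change the transformation later.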






&lt;h2&gt;
  
  
  Conclusion: Which should you choose?
&lt;/h2&gt;

&lt;p&gt;The choice depends on your infrastructure and speed requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use ETL&lt;/strong&gt; if you have strict regulatory compliance, need to mask data before storage, or have limited target resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use ELT&lt;/strong&gt; if you are working with cloud-native architectures (BigQuery, Redshift) and need to provide near real-time insights for big data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;Summary:&lt;/strong&gt; ETL = &lt;em&gt;Cleanliness at the gate.&lt;/em&gt; ELT = &lt;em&gt;Agility at scale.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>etl</category>
      <category>elt</category>
    </item>
    <item>
      <title>Understanding Databases: SQL, NoSQL, Schemas, DDL, and DML</title>
      <dc:creator>De' Clerke</dc:creator>
      <pubDate>Mon, 29 Sep 2025 11:17:34 +0000</pubDate>
      <link>https://forem.com/de_clerke/understanding-databases-sql-nosql-ddl-and-dml-mok</link>
      <guid>https://forem.com/de_clerke/understanding-databases-sql-nosql-ddl-and-dml-mok</guid>
      <description>&lt;h1&gt;
  
  
  🗄️ Database Essentials
&lt;/h1&gt;

&lt;p&gt;Databases sit at the core of every modern application. Whether you're building a social media platform, an online store, or a data pipeline, you need a reliable way to store, organize, and access information. Let’s break down the essentials.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a Database?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;database&lt;/strong&gt; is an organized collection of data that can be stored, managed, and retrieved efficiently. Instead of scattering data across files or spreadsheets, databases provide a structured system where data can be queried, updated, and maintained consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of Databases
&lt;/h2&gt;

&lt;p&gt;Databases come in many flavors, but the two most common categories are &lt;strong&gt;SQL (relational)&lt;/strong&gt; and &lt;strong&gt;NoSQL (non-relational)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SQL Databases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure:&lt;/strong&gt; Tables with rows and columns.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples:&lt;/strong&gt; MySQL, PostgreSQL, Oracle, SQL Server.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Strong consistency and reliability.&lt;/li&gt;
&lt;li&gt;Support for complex queries and relationships (joins).&lt;/li&gt;
&lt;li&gt;Schema-based design ensures data integrity.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. NoSQL Databases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure:&lt;/strong&gt; Can be document-based, key-value pairs, wide-column stores, or graph databases.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples:&lt;/strong&gt; MongoDB (document), Redis (key-value), Cassandra (wide-column), Neo4j (graph).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Flexible schema (data doesn’t need to fit into fixed tables).&lt;/li&gt;
&lt;li&gt;Handles unstructured or semi-structured data.&lt;/li&gt;
&lt;li&gt;Scales horizontally with ease; often preferred for big data and real-time apps.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Schemas
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;schema&lt;/strong&gt; is the blueprint of a database. It defines how data is organized, what data types are allowed, and how different entities relate to each other.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In SQL databases, schemas are strict and must be defined before data is added.&lt;/li&gt;
&lt;li&gt;In NoSQL databases, schemas are often flexible, allowing each document or record to have different fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Schema Example:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  NoSQL Schema Example:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alice@mail.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"orders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"product"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Laptop"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"product"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Smartphone"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Think of a schema as the rules of the game: SQL enforces strict rules, while NoSQL gives you room to improvise.&lt;/p&gt;
&lt;/blockquote&gt;
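&lt;p&gt;To make the flexible-schema point concrete, here is a small sketch with hypothetical documents: records with different shapes coexist in the same collection, and queries simply tolerate missing fields:&lt;/p&gt;

```python
# Two "documents" with different shapes -- fine in a document store,
# but they would not both fit a single fixed SQL row layout.
users = [
    {"id": "user_1", "name": "Alice", "orders": [{"product": "Laptop"}]},
    {"id": "user_2", "name": "Bob", "loyalty_tier": "gold"},  # extra field, no orders
]

# Queries must tolerate absent fields, e.g. via dict.get():
names_with_orders = [u["name"] for u in users if u.get("orders")]
print(names_with_orders)  # → ['Alice']
```

A relational table would force a decision up front (nullable `loyalty_tier` column, separate `orders` table); the document model defers that decision to query time.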




&lt;h2&gt;
  
  
  When to Use SQL vs NoSQL
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use SQL when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is highly structured with clear relationships.&lt;/li&gt;
&lt;li&gt;You need ACID transactions (e.g., banking, e-commerce checkout).&lt;/li&gt;
&lt;li&gt;Queries involve complex joins and aggregations.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Use NoSQL when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is semi-structured, rapidly changing, or unstructured.&lt;/li&gt;
&lt;li&gt;Applications need high scalability and performance at massive scale.&lt;/li&gt;
&lt;li&gt;You’re working with big data, caching, or real-time analytics.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
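&lt;p&gt;The ACID point deserves a concrete sketch. Using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for any transactional engine, a failed transfer rolls back as a unit, so balances never end up half-updated:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INT PRIMARY KEY, balance INT CHECK (balance >= 0))")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
con.commit()

# Transfer 200 out of account 1: the CHECK constraint fails,
# so the whole transfer is rolled back -- not just one leg of it.
try:
    with con:  # the connection context manager wraps this block in a transaction
        con.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        con.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # transaction rolled back

balances = dict(con.execute("SELECT id, balance FROM accounts"))
print(balances)  # → {1: 100, 2: 50}
```

This all-or-nothing behavior is exactly what checkout and banking flows rely on, and it is the strongest argument for SQL in those domains.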

&lt;p&gt;👉 &lt;strong&gt;Summary:&lt;/strong&gt; SQL = &lt;em&gt;consistency and structure.&lt;/em&gt; NoSQL = &lt;em&gt;flexibility and speed.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  DDL vs DML
&lt;/h2&gt;

&lt;p&gt;Within databases, two key categories of SQL commands are &lt;strong&gt;DDL&lt;/strong&gt; and &lt;strong&gt;DML&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DDL (Data Definition Language):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defines and manages database structures like tables, schemas, indexes.&lt;/li&gt;
&lt;li&gt;Examples: &lt;code&gt;CREATE&lt;/code&gt;, &lt;code&gt;ALTER&lt;/code&gt;, &lt;code&gt;DROP&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Blueprint design.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;DML (Data Manipulation Language):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works with the actual data inside the structures.&lt;/li&gt;
&lt;li&gt;Examples: &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;SELECT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Content management.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example: DDL and DML in Action
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
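&lt;p&gt;The same DDL and DML statements can be exercised end to end from Python, with the built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in engine:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")

# DDL: define the structure
con.execute("""
    CREATE TABLE Users (
        id INT PRIMARY KEY,
        name VARCHAR(100),
        email VARCHAR(100) UNIQUE
    )
""")

# DML: work with the data inside that structure
con.execute(
    "INSERT INTO Users (id, name, email) VALUES (?, ?, ?)",
    (1, "Alice", "alice@example.com"),
)
rows = con.execute("SELECT * FROM Users").fetchall()
print(rows)  # → [(1, 'Alice', 'alice@example.com')]
```

Note the separation: the `CREATE TABLE` shapes the container once, while `INSERT`/`SELECT` can run against it any number of times afterwards.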



</description>
      <category>database</category>
      <category>backend</category>
      <category>beginners</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
