Scaling ML training with Airflow Dynamic Task Mapping and Kubernetes

Orchestrating machine learning pipelines is where classic DevOps tooling meets model-driven complexity. If you need many parallel training runs (hyperparameter sweeps, per-shard training, or large-scale feature engineering), combining Apache Airflow’s dynamic-task features with Kubernetes’ per-task isolation gives you a predictable, scalable pattern. This article walks through the pattern, the why, a compact code example, and practical best practices for production-ready ML workloads.

Why this pattern matters

Two problems recur in ML pipelines:

Airflow’s Dynamic Task Mapping lets you create tasks at runtime based on data discovered when the DAG runs, instead of baking a fixed number of tasks into the DAG file. Running those tasks as Kubernetes pods—either via the KubernetesExecutor (task-per-pod) or KubernetesPodOperator—gives each task its own container image, resource requests, and node selection. Together this pattern scales parallel ML workloads without brittle DAG generation and with fine-grained resource control. (airflow.apache.org)

The architecture at a glance

This model separates orchestration (Airflow) from compute execution (Kubernetes), which aligns with cloud-native best practices for ML workloads. (astronomer.io)

A compact example

Below is a minimal TaskFlow example that demonstrates discovering a set of experiments and mapping them to per-experiment pods. The focus is on the mapping pattern; production code should externalize config, image names, and secrets.

from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

@dag(start_date=days_ago(1), schedule_interval=None, catchup=False)
def hp_sweep():

    @task
    def discover_experiments():
        # Return a list of dicts, each dict becomes mapped kwargs
        return [
            {"exp_id": "a1", "lr": 0.01},
            {"exp_id": "a2", "lr": 0.001},
            # ... dynamically generated
        ]

    def make_kube_task(exp):
        return KubernetesPodOperator(
            task_id=f"train__{exp['exp_id']}",
            name=f"train-{exp['exp_id']}",
            image="myorg/ml-train:latest",
            cmds=["python","train.py"],
            arguments=[
                "--exp-id", exp["exp_id"],
                "--lr", str(exp["lr"]),
                "--output", "/tmp/results.json"
            ],
            resources={
                "limit_memory": "16Gi",
                "limit_cpu": "4",
                "limit_gpus": "1"   # some providers use 'nvidia.com/gpu'
            },
            get_logs=True,
            is_delete_operator_pod=True,
        )

    experiments = discover_experiments()
    (experiments
        .map(lambda exp: make_kube_task(exp))  # dynamic mapping: one pod per exp
    )

dag = hp_sweep()

Notes:

Practical considerations and best practices

Choosing between KubernetesExecutor and KubernetesPodOperator

Both approaches run tasks on Kubernetes, but they target different operational models:

Observability, testing and CI/CD

When this pattern isn’t the right fit

Closing checklist

Orchestrating large ML sweeps this way gives you a repeatable, observable pattern that leverages Airflow for control and Kubernetes for execution. The result: faster iteration, controlled costs, and clearer separation between the orchestration layer and the heavy lifting of model training.