Ephemeral GPU workloads: running scalable ML training with Airflow on Kubernetes

Bridging DevOps and MLOps means giving model training the same repeatable, observable, and automated workflows we expect from data pipelines. A practical, widely adopted pattern is to run ML training and experiments as ephemeral Kubernetes pods launched by Apache Airflow. This lets platform engineers keep cluster resources isolated and cost-aware, while data scientists get reproducible runs and parallel hyperparameter sweeps. In this article I’ll walk through the architecture and concrete patterns you can apply today: choosing the right execution model in Airflow, configuring pod-level resources (including GPUs), using dynamic parallelism for experiments, and handling operational concerns like storage, secrets, and cleanup.

Why this pattern matters

Airflow + Kubernetes: the building blocks

Airflow integrates with Kubernetes in two popular ways: the KubernetesExecutor (where each task execution runs in its own pod) and the KubernetesPodOperator (which lets a task explicitly create a pod). Both let Airflow use Kubernetes for isolated task runs; the KubernetesPodOperator is the more flexible of the two when you want to control pod specs from the DAG. (airflow.apache.org)

Kubernetes provides explicit support for scheduling GPUs through vendor device plugins (for example, NVIDIA’s device plugin), which expose GPUs as schedulable resources (e.g., nvidia.com/gpu). That lets Pods request GPUs in their resource limits so the Kubernetes scheduler places them on GPU nodes. (kubernetes.io)
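To make the resource request concrete, here is a minimal sketch of what such a Pod manifest looks like, built as a plain Python dict. The function name, pod name, and container name are illustrative assumptions, not part of any library; the only Kubernetes-specific detail it relies on is that the GPU count goes under the container's resource limits using the nvidia.com/gpu key.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs.

    Hypothetical helper for illustration only.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A training pod should run once and exit, not restart.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "train",  # assumed container name
                    "image": image,
                    # GPUs are declared under limits; when no request is
                    # given, Kubernetes uses the limit as the request.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("gpu-smoke-test", "myrepo/ml-train:2026-06-01")
    print(json.dumps(manifest, indent=2))
```

Dumping the dict as JSON gives a document that kubectl accepts in place of YAML, which can be a quick way to check that the device plugin is installed before wiring anything into Airflow.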

A recommended scenario: when a task needs special resources (GPUs, drivers, accelerated runtimes) use KubernetesPodOperator to launch a short-lived pod that runs the training job, writes model artifacts to remote storage, and then terminates. This pattern isolates the heavy job and keeps the Airflow control plane lightweight. (astronomer.io)

Pattern decisions: KubernetesExecutor vs KubernetesPodOperator

How to declare a GPU-enabled training pod (core example)

Below is a minimal example showing a KubernetesPodOperator that requests a GPU, mounts a PVC for intermediate data, and uses a node selector and toleration so it lands on a GPU node pool. The same pattern works with cloud-managed Kubernetes (GKE, EKS, AKS) once the vendor device plugin / runtime is installed.

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# schedule=None (Airflow 2.4+) means the DAG only runs when triggered manually
with DAG('train_on_gpu', start_date=datetime(2024, 1, 1), schedule=None) as dag:
    train = KubernetesPodOperator(
        task_id='train_model',
        name='train-model-pod',
        namespace='ml-jobs',
        image='myrepo/ml-train:2026-06-01',
        cmds=["python", "/app/train.py"],
        # Request 1 NVIDIA GPU and some CPU/memory. Recent provider versions
        # take a V1ResourceRequirements object instead of a resources dict.
        container_resources=k8s.V1ResourceRequirements(
            limits={"memory": "32Gi", "cpu": "8", "nvidia.com/gpu": "1"},
        ),
        # Make sure the pod is scheduled on GPU nodes
        node_selector={"cloud.google.com/gke-accelerator": "nvidia-tesla-a100"},
        tolerations=[
            k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule"),
        ],
        # Mount a PVC where training writes artifacts
        volumes=[
            k8s.V1Volume(
                name="model-pvc",
                persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                    claim_name="model-pvc"
                ),
            ),
        ],
        volume_mounts=[k8s.V1VolumeMount(name="model-pvc", mount_path="/models")],
        service_account_name="ml-job-runner",
        get_logs=True,
        # Delete the pod when the task finishes (replaces the deprecated
        # is_delete_operator_pod flag in recent provider versions)
        on_finish_action="delete_pod",
    )

Notes:

Parallel experiments: dynamic task mapping

Airflow’s dynamic task mapping lets you create many similar tasks at runtime from a list of parameter sets. This is ideal for hyperparameter sweeps: one task generates a list of parameter configs, then the KubernetesPodOperator (or a mapped TaskFlow task) is expanded so Kubernetes runs one pod per experiment in parallel. Airflow supports mapping through the TaskFlow API or through .expand() / .expand_kwargs() on operators. Use mapped pods to parallelize training while preserving a single DAG definition. (airflow.apache.org)

Example sketch (conceptual):
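The following is a minimal sketch of a mapped sweep, not a definitive implementation. It assumes Airflow 2.4 or later with the cncf.kubernetes provider installed, and it assumes the training script accepts --lr and --batch-size flags, which are hypothetical. The function names, the DAG id, and the grid values are likewise made up for illustration. Airflow is imported inside build_sweep_dag() so that the grid helper can be exercised without Airflow present.

```python
from itertools import product


def build_sweep_arguments(learning_rates, batch_sizes):
    """Return one argument list per experiment (hypothetical CLI flags)."""
    return [
        ["--lr", str(lr), "--batch-size", str(bs)]
        for lr, bs in product(learning_rates, batch_sizes)
    ]


def build_sweep_dag():
    """Build a DAG that launches one training pod per parameter combination."""
    # Lazy imports: only needed when the DAG is actually constructed.
    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    with DAG(
        "hyperparameter_sweep",      # assumed DAG id
        start_date=datetime(2024, 1, 1),
        schedule=None,
        max_active_tasks=4,          # cap on concurrently running pods for this DAG
    ) as dag:

        @task
        def make_arguments():
            # Runs at execution time, so the grid could also be loaded
            # from a config file or an upstream task.
            return build_sweep_arguments([0.1, 0.01], [32, 64])

        # partial() holds the settings shared by every experiment;
        # expand() creates one mapped task (and pod) per argument list.
        KubernetesPodOperator.partial(
            task_id="train_model",
            name="train-model-pod",
            namespace="ml-jobs",
            image="myrepo/ml-train:2026-06-01",
            cmds=["python", "/app/train.py"],
            get_logs=True,
        ).expand(arguments=make_arguments())

    return dag
```

In a real DAG file you would call build_sweep_dag() at module level and keep the result in a global variable so the scheduler can discover it. The GPU limits, node selector, and tolerations from the single-pod example would go into partial(), since they are identical for every experiment.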

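Be mindful of limits: max_active_tasks on the DAG caps how many task instances run at once, and Airflow’s max_map_length setting caps how many mapped task instances a single expansion may create. Set both to sensible values so a sweep cannot launch thousands of pods unless your cluster can handle it. (onprema.com)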

Scheduling GPUs and node pools: operational tips

Storage, secrets, and networking

Logging, retries, and cleanup

Monitoring and cost control

Common gotchas

A short checklist before you run your first GPU experiments

Closing thoughts

Running ephemeral, GPU-enabled pods from Airflow combines the reproducibility and structure of workflow orchestration with the elasticity and resource isolation of Kubernetes. Using KubernetesPodOperator for training workloads, together with dynamic task mapping for parallel experiments, gives teams a flexible, maintainable MLOps pattern that fits naturally into DevOps-managed infrastructure. Be deliberate about pod specs (GPU limits, node selectors, tolerations), checkpointing for preemptible resources, and rightsizing: these operational details make the difference between a fragile setup and a scalable MLOps platform. (airflow.apache.org)

References