Ephemeral GPU workloads: running scalable ML training with Airflow on Kubernetes
Bridging DevOps and MLOps means giving model training the same repeatable, observable, and automated workflows we expect from data pipelines. A practical, widely adopted pattern is to run ML training and experiments as ephemeral Kubernetes pods launched by Apache Airflow. This lets platform engineers keep cluster resources isolated and cost-aware, while data scientists get reproducible runs and parallel hyperparameter sweeps. In this article I’ll walk through the architecture and concrete patterns you can apply today: choosing the right execution model in Airflow, pod-level resource configuration (including GPUs), dynamic parallelism for experiments, and operational considerations like storage, secrets, and cleanup.
Why this pattern matters
- Work isolation: ML training often needs different libs, drivers, GPUs, and strict resource boundaries. Spinning an isolated pod per task prevents “dependency soup.”
- Elasticity and cost control: Kubernetes node pools (including GPU nodes) can scale independently so expensive resources are used only when a job runs.
- Reproducibility: A container image and pod spec capture the exact environment for each training job, improving reproducible experiments.
Airflow + Kubernetes: the building blocks
Airflow integrates with Kubernetes in two popular ways: the KubernetesExecutor (where each task execution runs in its own pod) and the KubernetesPodOperator (which lets a task explicitly create a pod). Both give Airflow the ability to leverage Kubernetes for isolated task runs; the KubernetesPodOperator is the most flexible when you want to control pod specs from the DAG. (airflow.apache.org)
Kubernetes provides explicit support for scheduling GPUs through vendor device plugins (for example, NVIDIA’s device plugin), which expose GPUs as schedulable resources (e.g., nvidia.com/gpu). That lets Pods request GPUs in their resource limits so the Kubernetes scheduler places them on GPU nodes. (kubernetes.io)
A recommended scenario: when a task needs special resources (GPUs, drivers, accelerated runtimes) use KubernetesPodOperator to launch a short-lived pod that runs the training job, writes model artifacts to remote storage, and then terminates. This pattern isolates the heavy job and keeps the Airflow control plane lightweight. (astronomer.io)
Pattern decisions: KubernetesExecutor vs KubernetesPodOperator
- KubernetesExecutor: good when most of your tasks are homogeneous and you want Airflow to manage pod lifecycle for all scheduled tasks. It replaces worker processes with one pod per task automatically. (airflow.apache.org)
- KubernetesPodOperator: better when tasks need custom pod specs (GPU limits, node selectors, init containers, large ephemeral volumes). Use it for ML training, hyperparameter jobs, and anything needing device plugins or special runtimes. (airflow.apache.org)
How to declare a GPU-enabled training pod (core example)
Below is a minimal example showing a KubernetesPodOperator that requests a GPU, mounts a PVC for intermediate data, and uses a node selector and toleration so it lands on a GPU node pool. The same pattern works with cloud-managed Kubernetes (GKE, EKS, AKS) once the vendor device plugin / runtime is installed.

    # Assumes a recent cncf.kubernetes provider, where pod settings are passed as
    # kubernetes.client model objects rather than plain dicts.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
    from kubernetes.client import models as k8s

    with DAG("train_on_gpu", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        train = KubernetesPodOperator(
            task_id="train_model",
            name="train-model-pod",
            namespace="ml-jobs",
            image="myrepo/ml-train:2026-06-01",
            cmds=["python", "/app/train.py"],
            # Request 1 NVIDIA GPU and some CPU/memory; GPUs are declared under limits.
            container_resources=k8s.V1ResourceRequirements(
                limits={"memory": "32Gi", "cpu": "8", "nvidia.com/gpu": "1"},
            ),
            # Make sure the pod is scheduled on GPU nodes (this label is GKE-specific).
            node_selector={"cloud.google.com/gke-accelerator": "nvidia-tesla-a100"},
            tolerations=[
                k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
            ],
            # Mount a PVC where training writes artifacts.
            volumes=[
                k8s.V1Volume(
                    name="model-pvc",
                    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                        claim_name="model-pvc"
                    ),
                )
            ],
            volume_mounts=[k8s.V1VolumeMount(name="model-pvc", mount_path="/models")],
            service_account_name="ml-job-runner",
            get_logs=True,
            # Delete the pod when the task finishes; older provider versions
            # use is_delete_operator_pod=True instead.
            on_finish_action="delete_pod",
        )

Notes:
- Exact parameter names vary slightly by Airflow provider version; the important pieces are the GPU resource request (nvidia.com/gpu), node selectors and tolerations, and an image that includes the correct CUDA/cuDNN runtime. Kubernetes treats GPUs as allocatable resources; they are normally declared in pod limits. (kubernetes.io)
Parallel experiments: dynamic task mapping
Airflow’s dynamic task mapping lets you create many similar tasks at runtime from a list of parameter sets. This is ideal for hyperparameter sweeps: one task generates a list of parameter configs, then the KubernetesPodOperator (or a mapped TaskFlow task) is expanded so Kubernetes runs one pod per experiment in parallel. Airflow supports mapping with the TaskFlow API or with .expand() / .expand_kwargs() on operators. Use mapped pods to parallelize training while preserving a single DAG definition. (airflow.apache.org)
Example sketch (conceptual):
- upstream task: gather dataset versions or generate experiment configs
- downstream mapped operator: KubernetesPodOperator.partial(…).expand_kwargs(experiment_configs)
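Be mindful of limits. Airflow’s max_map_length caps how many instances a mapped task may expand into (the expansion fails if the list is longer), while max_active_tasks on the DAG, max_active_tis_per_dag on the task, or an Airflow pool cap how many instances run at once. Set these so you do not create thousands of simultaneous pods unless your cluster can handle it. (onprema.com)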
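The conceptual sketch above hinges on producing a list of keyword-argument dictionaries, one per experiment. Here is a minimal sketch in plain Python, assuming a hypothetical grid of `lr` and `batch_size` values, a hypothetical `MAX_EXPERIMENTS` cap that mirrors the intent of max_map_length, and `--lr`-style flags that a train.py might accept:

```python
from itertools import product

# Hypothetical cap, chosen to stay well under Airflow's max_map_length.
MAX_EXPERIMENTS = 64


def build_experiment_configs(grid, max_experiments=MAX_EXPERIMENTS):
    """Expand a hyperparameter grid into one kwargs dict per experiment.

    Each dict uses KubernetesPodOperator parameter names (`name`, `arguments`),
    so the list has the shape that `.expand_kwargs()` expects.
    """
    keys = sorted(grid)  # stable ordering so reruns produce the same configs
    combos = list(product(*(grid[k] for k in keys)))
    if len(combos) > max_experiments:
        raise ValueError(
            f"{len(combos)} experiments exceeds the cap of {max_experiments}"
        )
    configs = []
    for index, values in enumerate(combos):
        arguments = []
        for key, value in zip(keys, values):
            arguments += [f"--{key}", str(value)]
        configs.append({"name": f"train-exp-{index}", "arguments": arguments})
    return configs


configs = build_experiment_configs({"lr": [0.01, 0.001], "batch_size": [32, 64]})
```

In a DAG, a list like this would typically be returned by the upstream task and passed to `.expand_kwargs()`, with the shared settings (image, namespace, resources) kept in `.partial()`.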
Scheduling GPUs and node pools: operational tips
- Install the vendor device plugin and container runtime on GPU nodes (for NVIDIA, deploy the k8s-device-plugin and NVIDIA Container Toolkit). That ensures GPUs appear to the scheduler as resources like nvidia.com/gpu. (github.com)
- Use dedicated GPU node pools (or tainted nodes) so ordinary workloads can’t accidentally consume GPUs; add nodeSelector and tolerations in the pod spec so only intended pods are scheduled on GPU nodes. kOps and cloud providers provide documented ways to configure GPU node pools and runtime classes. (kops.sigs.k8s.io)
- Match GPU type to the workload: clusters often mix GPU types (A100, T4) with very different memory and throughput, so a large model may only fit on some of them. If a job needs a particular GPU model, label nodes by GPU model and select on that label (or use a third-party scheduler) to ensure correct placement.
Storage, secrets, and networking
- Use remote object storage (S3, GCS, Azure Blob) for datasets and artifacts. Have the pod pull training data at startup and push models/artifacts at the end; this avoids tying pods to node-local disks.
- Mounting a PVC can make sense for intermediate large files or when your training library expects POSIX storage, but be careful: shared PVCs on GPU nodes can become a contention point.
- Inject credentials securely via Kubernetes Secrets or service accounts rather than baking keys into images. The KubernetesPodOperator supports mounting secrets into the launched pod. (astronomer.io)
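Inside the training container, injected credentials usually arrive either as files under a mounted Secret volume (one file per key) or as environment variables. Here is a minimal sketch of reading them, assuming a hypothetical mount path `/etc/ml-secrets` and hypothetical key names; neither comes from the setup described above:

```python
import os
from pathlib import Path

# Hypothetical mount path for a Secret volume; adjust to your pod spec.
SECRET_DIR = "/etc/ml-secrets"


def read_secret(key, secret_dir=SECRET_DIR, environ=None):
    """Return a secret value from a mounted file, falling back to an env var.

    Kubernetes exposes each key of a mounted Secret as a file named after the key.
    The env var fallback uses the upper-cased key, e.g. `api_token` -> `API_TOKEN`.
    """
    environ = os.environ if environ is None else environ
    secret_file = Path(secret_dir) / key
    if secret_file.is_file():
        return secret_file.read_text().strip()
    env_name = key.upper().replace("-", "_")
    if env_name in environ:
        return environ[env_name]
    raise KeyError(f"secret {key!r} not found in {secret_dir} or ${env_name}")
```

Failing loudly when a secret is missing keeps a misconfigured pod from training for hours before it discovers it cannot upload its artifacts.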
Logging, retries, and cleanup
- Set get_logs=True on KubernetesPodOperator to stream container logs back to Airflow’s UI; this centralizes troubleshooting. (airflow.apache.org)
- Handle evictions and preemptions: cloud spot/preemptible GPUs are cheaper but can be reclaimed. Build checkpoints into training and make your jobs resilient to pod restarts or retries. Note: there have historically been edge cases around pod re-use and evicted pods; verify behavior against your Airflow provider version. (github.com)
- Delete pods after completion unless you explicitly need to keep them for debugging. KubernetesPodOperator exposes this as a cleanup setting (is_delete_operator_pod, or on_finish_action in newer provider versions).
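Checkpointing is what makes a retry cheap after a preemption. Here is a minimal sketch of a resumable loop, assuming a hypothetical checkpoint path (in practice a mounted volume or a directory synced to object storage) and a placeholder `train_step` standing in for real training:

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def save_checkpoint(path, state):
    """Write to a temp file and rename, so a kill mid-write leaves the old file intact."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def train_step(step):
    """Placeholder for one real training step; returns a fake loss."""
    return 1.0 / (step + 1)


def train(path, total_steps, checkpoint_every=10):
    """Run up to total_steps, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path)  # resume where the previous pod stopped
    for step in range(state["step"], total_steps):
        state = {"step": step + 1, "loss": train_step(step)}
        if state["step"] % checkpoint_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state
```

With this shape, an Airflow retry that launches a fresh pod simply calls `train` again with the same path and continues from the last saved step.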
Monitoring and cost control
- Rightsize training containers and use node autoscaling for GPU pools so you don’t pay for idle GPUs. Platform teams should set reasonable resource requests/limits and configure cluster autoscaler policies to add/remove GPU nodes. Astronomer and other practitioners emphasize rightsizing Airflow components and ensuring the executor and scheduler don’t become resource bottlenecks when you scale pod creation. (astronomer.io)
- Track per-job GPU time and chargeback to teams where appropriate. Combine Kubernetes metrics (node/pod GPU usage) with Airflow task metadata for billing and optimization.
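Chargeback boils down to GPU-hours per team. Here is a minimal sketch of that arithmetic, assuming hypothetical job records (team, GPU count, start, end) that have already been joined from Airflow task metadata and Kubernetes metrics, and a made-up hourly rate:

```python
from collections import defaultdict
from datetime import datetime


def gpu_hours_by_team(jobs):
    """Sum GPU-hours per team: GPU count times wall-clock hours for each job."""
    totals = defaultdict(float)
    for job in jobs:
        hours = (job["end"] - job["start"]).total_seconds() / 3600
        totals[job["team"]] += job["gpus"] * hours
    return dict(totals)


# Hypothetical records; real ones would come from your metrics pipeline.
jobs = [
    {"team": "vision", "gpus": 1,
     "start": datetime(2024, 1, 1, 9), "end": datetime(2024, 1, 1, 11)},
    {"team": "vision", "gpus": 4,
     "start": datetime(2024, 1, 1, 12), "end": datetime(2024, 1, 1, 12, 30)},
    {"team": "nlp", "gpus": 2,
     "start": datetime(2024, 1, 2, 0), "end": datetime(2024, 1, 2, 3)},
]

usage = gpu_hours_by_team(jobs)
RATE_PER_GPU_HOUR = 2.50  # made-up rate, for illustration only
costs = {team: round(hours * RATE_PER_GPU_HOUR, 2) for team, hours in usage.items()}
```

This counts allocated GPU time rather than utilization, which is usually what a chargeback model wants: a reserved but idle GPU still costs money.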
Common gotchas
- Missing device plugin or container runtime on GPU nodes: pods requesting GPUs will fail to schedule (they stay Pending because no node advertises the GPU resource). Always validate the GPU device plugin daemonset and node allocatable resources. (kubernetes.io)
- Large numbers of mapped tasks can overwhelm the scheduler or the cluster. Use limits, staggered launches, or a queueing mechanism if you expect thousands of simultaneous experiments. (onprema.com)
- Image size and start-up time: large ML images slow down pod startup. Consider smaller images, multi-stage builds, or an init step that pulls large model weights from object storage only when needed.
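To check for the first gotcha, you can inspect each node's allocatable resources. Here is a minimal sketch that takes the parsed JSON from `kubectl get nodes -o json` and lists the nodes advertising nvidia.com/gpu; the sample payload is a hand-written, trimmed stand-in for real output:

```python
import json


def gpu_nodes(node_list, resource="nvidia.com/gpu"):
    """Map node name -> allocatable GPU count for nodes that advertise the resource."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get(resource, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Trimmed, hand-written sample in the shape of `kubectl get nodes -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "cpu-node-1"}, "status": {"allocatable": {"cpu": "8"}}},
  {"metadata": {"name": "gpu-node-1"},
   "status": {"allocatable": {"cpu": "16", "nvidia.com/gpu": "2"}}}
]}
""")
print(gpu_nodes(sample))
```

An empty result on a cluster that has GPU nodes usually means the device plugin is not running on them.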
A short checklist before you run your first GPU experiments
- Confirm GPU device plugin + runtime is installed and GPUs appear as resources (kubectl describe nodes). (kubernetes.io)
- Build a small test pod that requests a GPU and runs nvidia-smi to validate drivers and scheduling.
- Use a dedicated service account for ML jobs with least privilege for storage and logging buckets.
- Start with one or a few parallel runs to validate stability, then increase mapped parallelism while monitoring scheduler and cluster load. (airflow.apache.org)
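For the smoke-test step in the checklist, here is a minimal sketch that builds a one-shot pod manifest as JSON, which `kubectl apply -f` accepts alongside YAML. The pod name, namespace, and CUDA image tag are assumptions; pick an image that matches your driver version:

```python
import json


def gpu_smoke_test_pod(name="gpu-smoke-test", namespace="ml-jobs",
                       image="nvidia/cuda:12.4.1-base-ubuntu22.04"):
    """Build a pod manifest that requests one GPU and runs nvidia-smi once."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",  # run once, do not restart on exit
            "containers": [{
                "name": "nvidia-smi",
                "image": image,  # assumed tag; any CUDA base image should do
                "command": ["nvidia-smi"],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
            # Tolerate the same GPU taint as the training pod example.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
        },
    }


manifest_json = json.dumps(gpu_smoke_test_pod(), indent=2)
```

Write `manifest_json` to a file, apply it, and read the pod's logs: a table listing the GPU confirms that scheduling, the driver, and the container runtime all line up.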
Closing thoughts
Running ephemeral, GPU-enabled pods from Airflow combines the reproducibility and structure of workflow orchestration with the elasticity and resource isolation of Kubernetes. Using KubernetesPodOperator for training workloads, together with dynamic task mapping for parallel experiments, gives teams a flexible, maintainable MLOps pattern that fits naturally into DevOps-managed infrastructure. Be deliberate about pod specs (GPU limits, node selectors, tolerations), checkpointing for preemptible resources, and rightsizing: these operational details make the difference between a fragile setup and a scalable MLOps platform. (airflow.apache.org)
References
- Airflow KubernetesExecutor and provider docs (Kubernetes integration). (airflow.apache.org)
- Kubernetes scheduling GPUs / device plugins documentation. (kubernetes.io)
- NVIDIA k8s-device-plugin (official repo / tooling). (github.com)
- KubernetesPodOperator reference and usage (Airflow provider docs). (airflow.apache.org)
- Astronomer best-practices on rightsizing Airflow resources and dynamic tasks. (astronomer.io)