GitOps-driven model rollouts: automating ML deployments with Argo CD and KServe

Deploying machine learning models reliably is harder than shipping regular services: models are data-dependent, versioned artifacts, and they can silently degrade in production. GitOps — using Git as the single source of truth and an automated reconciler to drive cluster state — gives teams a repeatable, auditable path from “trained model” to “serving in production.” In this article I walk through a practical, modern pattern for automating ML model deployment with GitOps, using Argo CD for continuous delivery and KServe for model serving, and show how registries, signature checks, and observability tie together to make rollouts safe and reversible. (argoproj.github.io)

Why GitOps helps for ML

Models change more often than the services that wrap them, and what changes is usually a versioned artifact rather than code. Keeping serving manifests in Git means every model version that reached production maps to a commit, promotion happens through a reviewable pull request rather than an ad-hoc kubectl apply, and rollback is a git revert that the reconciler carries out for you. Pinning each manifest to an immutable artifact URI also answers the perennial "which model is actually running?" question with a single git log.

A practical stack to get to production

The pattern in this article uses a deliberately small, widely adopted stack: MLflow (or any model registry) to track artifacts and metadata, a CI system for validation and signing, a Git repository of Kubernetes manifests as the source of truth, Argo CD as the reconciler, and KServe for serving. Cosign plus an admission controller such as Kyverno harden the supply chain, and a metrics backend such as Prometheus feeds promotion decisions.

End-to-end workflow (concrete steps)

  1. Train and register
    • A training job produces a model artifact and registers it in MLflow (or your registry) with metadata: dataset hash, training run id, validation metrics, and an artifact URI (s3://…). (mlflow.org)
  2. Candidate promotion triggers a pipeline
    • When a model is marked “candidate” or “ready for staging,” a CI job runs smoke tests, computes additional metrics (e.g., fairness checks, latency profiles), and, if those pass, writes a KServe InferenceService manifest pointing at the model’s storageUri to the Git repo (or opens a PR). The CI job can also sign the model artifact or produce attestations. This is the registry→Git handoff; a sketch of it follows this list. (uplatz.com)
  3. Argo CD reconciles
    • Argo CD sees the change and applies the manifest to the cluster. Because Argo CD is declarative, the entire deployment (service, ingress, ConfigMaps, KServe InferenceService) is versioned and auditable in Git. (argoproj.github.io)
  4. Canary / progressive rollout
    • KServe supports canary rollouts: the InferenceService can route a configured percentage of traffic to the new revision while the previous revision keeps serving the rest, and a revision that never becomes ready never receives traffic. For metric-driven automatic promotion or rollback, teams pair this with Argo Rollouts (or another progressive-delivery controller) integrated with mesh/ingress metrics. Either way, you increase exposure gradually and revert automatically on problems. (kserve.github.io)
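
To make the registry→Git handoff in step 2 concrete, here is a minimal sketch in Python, assuming an MLflow 2.x registry where the training pipeline marks candidates with a "candidate" alias; the model name, manifest path, and branch naming are illustrative, and the smoke tests and PR creation are elided.

import subprocess
import textwrap

from mlflow.tracking import MlflowClient

MODEL_NAME = "my-model"                                   # hypothetical registered model
MANIFEST_PATH = "clusters/prod/my-model/inferenceservice.yaml"

client = MlflowClient()

# Look up the version the training pipeline marked as a promotion candidate.
candidate = client.get_model_version_by_alias(MODEL_NAME, "candidate")

# Render a KServe InferenceService pinned to the immutable artifact URI,
# starting as a 10% canary (see the manifest example below).
manifest = textwrap.dedent(f"""\
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: {MODEL_NAME}
    spec:
      predictor:
        canaryTrafficPercent: 10
        model:
          modelFormat:
            name: sklearn
          storageUri: {candidate.source}
    """)

with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
    f.write(manifest)

# Commit on a branch; opening the PR (and running smoke tests) happens in CI.
branch = f"promote-{MODEL_NAME}-v{candidate.version}"
subprocess.run(["git", "checkout", "-b", branch], check=True)
subprocess.run(["git", "add", MANIFEST_PATH], check=True)
subprocess.run(
    ["git", "commit", "-m", f"Promote {MODEL_NAME} v{candidate.version} (run {candidate.run_id})"],
    check=True,
)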

Example snippets

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    canaryTrafficPercent: 10            # route 10% of traffic to the newest revision
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/my-model/sha256-abc123   # pinned, immutable artifact

With this spec, KServe routes 10% of traffic to the latest revision once it reports ready and keeps the remaining 90% on the previous revision; a revision that fails its health checks never receives traffic. Note that canaryTrafficPercent lives under the predictor, not a separate rollout field. Promotion is itself a Git change: raise canaryTrafficPercent, or remove it to send all traffic to the latest revision. (kserve.github.io)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-serving-my-model
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ml/serving-manifests.git
    path: clusters/prod/my-model        # one directory per model and environment
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true      # delete cluster resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state

Argo CD will reconcile this path to the cluster automatically. (argo-cd.readthedocs.io)

Safety first: signing, admission, and attestations

Treat models as supply-chain artifacts: store them (and optionally packaged ModelKits) in an OCI-backed registry or artifact store and sign them with Cosign, which makes artifact provenance verifiable. Use an admission-controller policy (Kyverno or Gatekeeper) to block pods that reference unsigned images or model artifacts; this closes the path around CI/GitOps review. Timestamps, signatures, and transparency logs (Rekor) give you traceability for audits. (github.com)
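
On the admission side, here is a minimal sketch of a Kyverno policy that rejects Pods in the serving namespace whose images are not signed with your Cosign key; the registry pattern and public key are placeholders, and the same approach extends to model artifacts packaged as OCI images.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-model-images
spec:
  validationFailureAction: Enforce        # reject non-compliant Pods, don't just audit
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - ml-serving
      verifyImages:
        - imageReferences:
            - "registry.example.com/models/*"   # placeholder registry pattern
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <your Cosign public key>
                      -----END PUBLIC KEY-----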

Automated promotion and rollback: metrics that matter

If you want the pipeline to promote a canary to 100% automatically, define clear, measurable KPIs and let the rollout controller act on them:

  • Serving health: error rate (e.g., HTTP 5xx), p95/p99 latency, and resource saturation of the new revision relative to the stable one.
  • Model quality proxies: prediction-distribution drift, confidence scores, or any online business metric you can attribute to the canary slice.
  • Guardrails: a minimum observation window and request count before any automatic decision, so a quiet canary is never promoted on too little data.
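
If Argo Rollouts drives the progressive half, those KPIs translate into an AnalysisTemplate. The sketch below assumes a Prometheus instance at the given address and request metrics labelled with the service name; both the address and the metric names are placeholders for your own telemetry.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-canary-analysis
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      failureLimit: 2                    # two failing samples abort and roll back
      successCondition: result[0] < 0.5  # seconds; tune to your SLO
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # placeholder address
          query: |
            histogram_quantile(0.99,
              sum(rate(request_duration_seconds_bucket{service="my-model"}[5m])) by (le))
    - name: error-rate
      interval: 1m
      failureLimit: 2
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="my-model",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="my-model"}[5m]))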

Operational tips — what trips teams up

Real-world integrations and tools

When not to use full GitOps for a model

Wrapping up: a repeatable pattern

A reliable GitOps path for ML looks like this: train & register → CI validates & writes a manifest (with pinned artifact) → PR and code review → Argo CD reconciles → KServe runs a canary rollout and exposes metrics → automated or human-driven promotion/rollback. Add artifact signing and admission enforcement to harden the supply chain, and instrument the system so rollouts are observability-driven rather than guesswork. That pattern gives teams an auditable, low-friction path to get models into production without losing control. (uplatz.com)
