A GitOps Blueprint to Unite DevOps and MLOps for LLM and ML Services
If your software delivery rhythm feels like two bands playing different tempos—one for app code (DevOps) and one for models (MLOps)—you’re not alone. The good news: a handful of recent building blocks make it practical to conduct both with one baton. In the last year, KServe added OpenAI‑compatible endpoints for LLMs, Evidently shipped a ready‑to‑use GitHub Action to gate AI quality in CI, and model registries like MLflow matured aliasing and promotion flows. Put together, you can ship AI the same way you ship microservices: declarative, test‑gated, and versioned from Git. (kserve.github.io)
This article gives you a practical, vendor‑neutral blueprint to unify DevOps and MLOps using:
- KServe for standardized, Kubernetes‑native model serving (predictive and generative).
- Argo CD for GitOps‑style continuous delivery.
- MLflow Model Registry for versioning and promotion.
- Evidently’s GitHub Action for automated AI quality checks in CI.
- Optional: managed training with Vertex AI Pipelines + Cloud Build if you’d rather not run your own continuous training (CT) stack. (kserve.github.io)
Why now? Because the interfaces and tooling finally line up. KServe speaks familiar OpenAI endpoints for LLM workloads, which means your application code—and even SDKs—can point at your own cluster the same way they would at a hosted LLM API. That removes a huge integration wrinkle. Meanwhile, GitOps tools like Argo CD handle the same “desired state in Git, reconciled to clusters” pattern you already use for microservices. (kserve.github.io)
The target architecture at a glance
- One “app + model” monorepo (or two repos—one for training, one for manifests). App code, prompts, tests, and pipeline configs live next to each other.
- CI runs unit tests, then AI quality checks on a fixed dataset before anything ships.
- Passing builds register a candidate model in MLflow (with aliases like candidate/champion), then update a KServe manifest via pull request to your “manifests” repo.
- Argo CD syncs the approved manifest to your cluster. KServe exposes an OpenAI‑compatible endpoint that your app can call without code changes. (mlflow.org)
Under the hood, KServe also supports the Open Inference Protocol (V2) for classic predictive models and can front different runtimes (Hugging Face, vLLM, Triton, etc.). If you need many models per cluster with high cache‑efficiency, ModelMesh adds the “router + on‑demand loading” layer. (github.com)
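For classic predictive models, the V2 data plane is just an HTTP/JSON contract. As a rough sketch (the hostname, model name, and tensor shape below are illustrative placeholders, not values from this article's deployment), a client call looks like this:

# Minimal Open Inference Protocol (V2) call to a predictive model.
# Hostname, model name, and tensor shape are illustrative placeholders.
import requests

url = "http://sklearn-iris.ml-inference.example.com/v2/models/sklearn-iris/infer"
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[5.1, 3.5, 1.4, 0.2]],
        }
    ]
}
resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["outputs"])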
Step‑by‑step blueprint
1) Treat model quality like code quality (gates in CI)
Stop merging prompt or model changes without testing their behavior. Evidently’s new GitHub Action runs evaluation suites on every PR or commit and fails the build if your thresholds are missed. It wraps the Evidently CLI and can store artifacts locally or in Evidently Cloud for trending. Think of it as unit tests for model behavior, from classification metrics to “LLM‑as‑a‑judge” checks. (github.com)
Example GitHub workflow snippet to run a drift or LLM quality suite:
name: ci-ai-quality
on:
  pull_request:
  push:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      statuses: write
    steps:
      - uses: actions/checkout@v4
      # Run Evidently report/tests; fail CI if tests fail
      - name: Run AI quality checks
        uses: evidentlyai/evidently-report-action@v1
        with:
          config_path: "evidently_config.json"
          input_path: "data/current.csv"
          reference_path: "data/reference.csv"
          output: "reports/run-$"
          test_summary: "true"
          upload_artifacts: "true"
This one small gate builds the habit: no PR merges unless the model’s measured quality is at least as good as yesterday. (github.com)
2) Register models and promote by alias, not by guesswork
When tests pass, register the model and use MLflow aliases like candidate and champion. Your serving layer can point at an alias and you “promote” by moving the pointer—no YAML rewrites needed if your serving runtime loads by model URI. Aliases are ideal for gated promotions and fast rollbacks. (mlflow.org)
A tiny Python sketch:
import mlflow
from mlflow import MlflowClient

mlflow.set_tracking_uri("http://mlflow.yourdomain")
client = MlflowClient()

name = "demand-forecaster"
# Register the model from the winning run, then point the candidate alias at it.
registered = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name=name,
)
client.set_registered_model_alias(name, "candidate", registered.version)
If staging goes well, repoint the champion alias to the new version and your production traffic follows. (mlflow.org)
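A sketch of that flip, reusing the client, name, and registered version from the snippet above; the models:/<name>@<alias> URI is standard MLflow syntax for loading by alias:

# Promote: repoint the champion alias at the validated version.
client.set_registered_model_alias(name, "champion", registered.version)

# Anything that loads by alias now resolves to the new version, for example:
model = mlflow.pyfunc.load_model(f"models:/{name}@champion")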
3) Deliver like you do everything else: GitOps with Argo CD
Keep your KServe InferenceService manifests under version control. Argo CD watches that repo and keeps clusters reconciled. Rollbacks become “git revert,” drift is visible, and promotion is a PR and a human approval, not a kubectl command at midnight. (argo-cd.readthedocs.io)
Minimal Argo CD Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-qa
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/ai-manifests
    targetRevision: main
    path: kserve/llm-qa
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-inference
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
4) Serve via a standard API: KServe with OpenAI‑compatible endpoints
Here’s a bare‑bones KServe service for an LLM using the Hugging Face runtime; it exposes familiar OpenAI endpoints like /v1/chat/completions:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
  namespace: ml-inference
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
      resources:
        limits:
          cpu: "6"
          memory: "24Gi"
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: "24Gi"
          nvidia.com/gpu: "1"
Once deployed, your app can use the normal OpenAI SDK—just change base_url:
from openai import OpenAI

# SERVICE_HOSTNAME is the InferenceService's external host (see `kubectl get inferenceservice`).
client = OpenAI(base_url=f"http://{SERVICE_HOSTNAME}/openai/v1", api_key="empty")
resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Quick sanity check: 2+2?"}],
)
print(resp.choices[0].message.content)
KServe’s data plane supports these OpenAI‑style endpoints alongside V1/V2 predictive protocols, so the same platform can serve both classic ML and LLMs. (kserve.github.io)
Tip: For high‑throughput LLM serving, consider the vLLM backend; KServe supports that path too. (kserve.github.io)
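Streaming flows through the same OpenAI-compatible surface. Continuing the client example above (a sketch; whether token streaming is available depends on the backend you chose, such as the vLLM path just mentioned):

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize GitOps in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)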
5) Optional: managed CT with Vertex AI Pipelines + Cloud Build
Prefer not to host your own training orchestrator? Vertex AI Pipelines provides multi‑step training workflows (preprocess → train → eval → deploy) and can be triggered by Cloud Build for CI/CD and continuous training. You can still log to MLflow and feed the same GitOps loop on the serving side. (cloud.google.com)
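If you go this route, the pipeline itself is ordinary KFP v2 code compiled and submitted to Vertex AI. A minimal sketch, assuming a placeholder project, region, and bucket, plus a trivial component body (real steps would train, evaluate, and log to MLflow):

from kfp import compiler, dsl
from google.cloud import aiplatform

@dsl.component(base_image="python:3.11")
def train(epochs: int) -> str:
    # Placeholder: real code would train, evaluate, log to MLflow, and return a model URI.
    return f"gs://your-bucket/models/epochs-{epochs}"

@dsl.pipeline(name="train-and-register")
def training_pipeline(epochs: int = 10):
    train(epochs=epochs)

if __name__ == "__main__":
    # Compile locally, then submit to Vertex AI Pipelines.
    compiler.Compiler().compile(training_pipeline, "pipeline.json")
    aiplatform.init(project="your-project", location="us-central1")
    aiplatform.PipelineJob(
        display_name="train-and-register",
        template_path="pipeline.json",
        pipeline_root="gs://your-bucket/pipeline-root",
    ).run()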
Putting it together: a practical workflow
- Developer opens a PR changing a prompt or model code. CI runs regular tests and an Evidently suite. If any AI test fails, the PR stays red. (github.com)
- On merge, training runs (self‑hosted or Vertex). The best run is registered to MLflow as a new version and given the candidate alias. (mlflow.org)
- A promotion job updates the manifest repo (if you hard‑pin versions) or simply re‑points the champion alias to the new version; a sketch of the manifest bump follows this list. Argo CD reconciles the change. (argo-cd.readthedocs.io)
- KServe exposes a stable OpenAI‑style endpoint. Your app code keeps calling the same path while you swap model versions behind the scenes. (kserve.github.io)
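For the hard-pinned variant, the promotion job is little more than a script that rewrites the model reference in the manifest repo and lets your existing CI open the pull request. A rough sketch, assuming the repo layout shown later and an S3-style storageUri; the bucket layout, file path, and version number are placeholders:

import yaml

MANIFEST = "k8s/kserve/inferenceservice.yaml"
NEW_VERSION = "14"  # e.g. the MLflow model version that just passed the gates

# Rewrite the pinned model location; commit and open a PR with your usual CI tooling.
with open(MANIFEST) as f:
    isvc = yaml.safe_load(f)

isvc["spec"]["predictor"]["model"]["storageUri"] = (
    f"s3://your-model-bucket/demand-forecaster/{NEW_VERSION}/"
)

with open(MANIFEST, "w") as f:
    yaml.safe_dump(isvc, f, sort_keys=False)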
Progressive delivery and scale
- Canary rollouts: KServe supports canaries and traffic‑splitting. Use this to shift a small percentage of traffic to a candidate model and compare. Then repoint the alias. (github.com)
- Extreme scale: If you host thousands of models, add ModelMesh to load models on demand and route requests like a distributed cache. It integrates with KServe runtimes (e.g., Triton, MLServer). (ibm.com)
- Interoperability: The Open Inference Protocol provides a common data plane across Triton, MLServer, TorchServe, and KServe. This helps you avoid lock‑in at the serving layer. (github.com)
What about foundation models and vendor runtimes?
If you’re packaging vendor‑optimized inference services, treat them the same way: Helm charts in Git, Argo CD for reconciliation. NVIDIA’s NIM microservices ship official Helm charts and an Operator for Kubernetes. You can manage those charts declaratively with Argo CD, and in some scenarios even front them through KServe for a uniform API surface. (docs.nvidia.com)
Guardrails: the boring but important bits
- Test data stability: keep a fixed test set for CI (and rotate it on a schedule) so you’re measuring model changes, not data randomness; a bare‑bones regression gate is sketched after this list. Evidently supports data drift tests when you need to watch the inputs. (docs.evidentlyai.com)
- Secrets and registry access: if your runtime pulls models from Hugging Face or a private registry, store tokens as Kubernetes Secrets and mount via the KServe spec. (kserve.github.io)
- Cost awareness: KServe supports scale‑to‑zero and GPU autoscaling; combine with GitOps policies so idle models don’t burn budget. (github.com)
- Compliance and traceability: MLflow’s lineage, tags, and descriptions give you an audit trail. Keep promotion rules in code so your approvals are reviewable. (mlflow.org)
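To make the fixed-test-set bullet concrete, here is a bare-bones regression gate you could run in CI alongside an Evidently suite. It is only a sketch: it assumes the registered model name and aliases from earlier, a pinned CSV with a placeholder "demand" target column, and models loadable via MLflow's pyfunc flavor:

import mlflow
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Pinned test set: measure model changes, not data randomness.
test = pd.read_csv("quality/data/reference.csv")
X, y = test.drop(columns=["demand"]), test["demand"]  # "demand" is a placeholder target column

champion = mlflow.pyfunc.load_model("models:/demand-forecaster@champion")
candidate = mlflow.pyfunc.load_model("models:/demand-forecaster@candidate")

champ_mae = mean_absolute_error(y, champion.predict(X))
cand_mae = mean_absolute_error(y, candidate.predict(X))

# Fail the job (and the PR) if the candidate is worse than what is already serving.
assert cand_mae <= champ_mae, f"Regression: candidate MAE {cand_mae:.3f} > champion {champ_mae:.3f}"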
A minimal starter repo layout
- app/
  - src/ (service code or client calling the OpenAI endpoint)
  - tests/
- models/
  - train.py, pipeline.yaml (or Vertex components)
  - mlflow_utils.py
- quality/
  - evidently_config.json
  - data/reference.csv
  - data/current_sample.csv
- k8s/
  - kserve/inferenceservice.yaml
  - argo/application.yaml
With a couple hundred lines of YAML and a small test dataset, you’ll have a pipeline where:
- Quality gates run before promotion.
- Promotion is a PR and an alias flip.
- Deployments are drift‑free and reversible via Git.
- Your app calls the same endpoint regardless of which model you serve today.
That’s Unified DevOps + MLOps in practice—not a buzzword, just the same disciplined software delivery you already use, applied to AI.
Further reading and docs used here:
- KServe data plane and OpenAI‑compatible endpoints; KServe Hugging Face/vLLM examples. (kserve.github.io)
- Argo CD GitOps model and reconciliation features. (argo-cd.readthedocs.io)
- MLflow Model Registry concepts and aliasing. (mlflow.org)
- Evidently GitHub Action for CI gating of AI quality. (github.com)
- Vertex AI Pipelines + Cloud Build reference architecture for CI/CD/CT. (cloud.google.com)
- ModelMesh for large‑scale multi‑model routing. (ibm.com)
- Open Inference Protocol for cross‑runtime interoperability. (github.com)
- NVIDIA NIM Helm/operator for inference microservices (if you prefer that runtime). (docs.nvidia.com)
With these pieces in place, your delivery soundtrack goes from chaotic jam session to a steady groove—one pipeline, one Git history, and clear gates from idea to inference.