Unifying DevOps and MLOps: CI Quality Gates and Live Monitoring for GenAI
Shipping AI features safely now demands one pipeline that treats models like code: tested on every pull request, deployed with progressive delivery, and monitored in production with the same rigor as app services. This post outlines a practical workflow built on current tooling that blends DevOps and MLOps, using GitHub Actions for CI/CD, MLflow for automated LLM evaluations, and Databricks Model Serving with Inference Tables for post-deploy observability and auditability.
Why now: beyond the business urgency to ship GenAI, compliance pressure is real. The EU AI Act’s obligations for general‑purpose AI (GPAI) became applicable on August 2, 2025, with official tools (Code of Practice, guidance, and a training‑data summary template) to help providers operationalize transparency, safety, and copyright practices. Even if you don’t sell in the EU, these expectations are quickly becoming global good practice. (digital-strategy.ec.europa.eu)
What’s new that makes this easier:
- MLflow’s built‑in LLM evaluation API lets you score models (and prompts) with heuristic and “LLM‑as‑a‑judge” metrics, so you can enforce quality gates in CI before anything reaches prod. (mlflow.org)
- Databricks Model Serving provides a unified, OpenAI‑compatible interface across open and proprietary models and now supports Inference Tables for all endpoint types (including externally hosted models), making it straightforward to capture inputs/outputs for monitoring and debugging. (databricks.com)
- You can turn on Inference Tables with a single API field (auto_capture_config) and even log augmented feature lookups, which is handy for ML governance and post‑hoc analysis. (docs.databricks.com)
- Secure CI authentication is cleaner with GitHub OIDC “workload identity federation,” so your pipeline avoids long‑lived tokens entirely. (learn.microsoft.com)
A reference pipeline you can copy
- On pull request against main:
  - Train or load your candidate model/prompt.
  - Run MLflow evaluations and fail fast if metrics don’t meet thresholds.
- On merge to main:
  - Deploy a “challenger” version to production behind traffic splitting.
  - Enable Inference Tables to capture requests/responses (and, optionally, joined feature values).
  - Schedule dashboards/alerts over these logs; roll back or retrain automatically if quality drifts.
- Optional: If you deploy on Kubernetes, KServe supports canary traffic percentages as a native rollout strategy. (kserve.github.io)
Step 1 — Add a CI quality gate with MLflow evaluate
Create ci/eval.py:
# ci/eval.py
import sys

import mlflow
import pandas as pd
from mlflow.metrics import latency
from mlflow.metrics.genai import answer_correctness

# Example eval data (replace with your task-specific set)
eval_data = pd.DataFrame({
    "inputs": [
        "Summarize: The quick brown fox jumps over the lazy dog.",
        "Explain what feature stores do in ML.",
    ],
    "ground_truth": [
        "A concise summary of a pangram.",
        "They centralize and version features for training and serving.",
    ],
})

# Any callable or MLflow pyfunc model (or model URI) works; here a stub
# function that returns one string per input row.
def my_model(data):
    return [
        "A concise summary of a pangram.",
        "They manage features for ML pipelines.",
    ]

results = mlflow.evaluate(
    model=my_model,
    data=eval_data,
    targets="ground_truth",
    model_type="question-answering",
    # answer_correctness is LLM-as-a-judge (scored 1-5); the default judge is an
    # OpenAI model, so the CI job needs judge credentials (e.g. OPENAI_API_KEY).
    extra_metrics=[answer_correctness(), latency()],
)

# Gate on aggregate metrics. Keys follow MLflow's "<metric>/<version>/<aggregation>"
# pattern and can vary by MLflow version; print results.metrics to confirm yours.
correctness = results.metrics.get("answer_correctness/v1/mean", 0.0)
latency_p90_s = results.metrics.get("latency/p90", 1e9)
print(f"answer_correctness/mean={correctness:.2f} latency/p90={latency_p90_s:.2f}s")
if correctness < 4.0 or latency_p90_s > 0.8:
    sys.exit("Quality gate failed")
MLflow’s evaluation interface supports both traditional metrics (e.g., ROUGE, BLEU) and LLM‑as‑a‑judge metrics such as answer correctness. You can point the model argument to an MLflow model URI or an external endpoint and still use the same evaluate() call, making the gate reusable across providers. (mlflow.org)
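For example, the same gate can evaluate a registered model version instead of the local callable. The sketch below reuses eval_data and the metric imports from ci/eval.py; the Unity Catalog model name and version are placeholders:

# Hedged sketch: evaluate a registered Unity Catalog model version instead of a callable.
import mlflow
from mlflow.metrics import latency
from mlflow.metrics.genai import answer_correctness

mlflow.set_registry_uri("databricks-uc")  # resolve models:/ URIs from Unity Catalog

results = mlflow.evaluate(
    model="models:/main.default.my_model/13",  # placeholder model URI; newer MLflow
                                               # versions also accept "endpoints:/<name>"
    data=eval_data,                            # same eval set as in ci/eval.py
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[answer_correctness(), latency()],
)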
Step 2 — Wire the gate into GitHub Actions and deploy on pass
Add .github/workflows/ci.yml:
name: ci-mlops

on:
  pull_request:
  push:
    branches: [ main ]

permissions:
  id-token: write   # needed for OIDC
  contents: read

jobs:
  test_and_deploy:
    runs-on: ubuntu-latest
    environment: prod   # included in the OIDC 'sub' claim
    env:
      DATABRICKS_AUTH_TYPE: github-oidc
      DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}            # workspace URL (repo/environment variable)
      DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_CLIENT_ID }}  # service principal client ID
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install deps
        run: pip install mlflow pandas
      - name: Run MLflow evaluate gate
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # judge credentials if the eval uses an LLM judge
        run: python ci/eval.py
      - name: Install Databricks CLI
        if: github.ref == 'refs/heads/main'
        uses: databricks/setup-cli@main
      - name: Canary deploy with inference logging enabled
        if: github.ref == 'refs/heads/main'
        run: |
          # Update endpoint to 10% challenger, enable Inference Tables
          databricks serving-endpoints update-config my-genai \
            --json @config/update.json
GitHub OIDC eliminates the need for storing long‑lived PATs in secrets; the CLI exchanges the short‑lived identity token for a Databricks OAuth token at run time. This approach is now the recommended way to authenticate automated workflows. (learn.microsoft.com)
Step 3 — Enable canary + Inference Tables in your endpoint config
Create config/update.json:
{
  "served_entities": [
    {
      "name": "current",
      "entity_name": "main.default.my_model",
      "entity_version": "12",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    },
    {
      "name": "challenger",
      "entity_name": "main.default.my_model",
      "entity_version": "13",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }
  ],
  "traffic_config": {
    "routes": [
      { "served_model_name": "current", "traffic_percentage": 90 },
      { "served_model_name": "challenger", "traffic_percentage": 10 }
    ]
  },
  "auto_capture_config": {
    "catalog_name": "governed_ai",
    "schema_name": "inference_logs",
    "table_name_prefix": "genai_endpoint"
  }
}
The auto_capture_config block turns on Inference Tables, logging requests and responses into a Unity Catalog table (for example, governed_ai.inference_logs.genai_endpoint_payload). You can enable this at creation or update time via the API/CLI; Databricks also documents how to toggle it in the UI. (docs.databricks.com)
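If you prefer to drive the same update from Python rather than the CLI, a rough sketch with the Databricks Python SDK (databricks-sdk) follows; the dataclass and parameter names mirror the JSON fields above but should be double-checked against your SDK version:

# Hedged sketch: the same canary + Inference Tables update via the Databricks SDK.
# Assumes `pip install databricks-sdk` and the same auth env vars as the workflow.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    AutoCaptureConfigInput, Route, ServedEntityInput, TrafficConfig,
)

w = WorkspaceClient()  # picks up host and credentials from the environment
w.serving_endpoints.update_config(
    name="my-genai",
    served_entities=[
        ServedEntityInput(name="current", entity_name="main.default.my_model",
                          entity_version="12", workload_size="Small",
                          scale_to_zero_enabled=True),
        ServedEntityInput(name="challenger", entity_name="main.default.my_model",
                          entity_version="13", workload_size="Small",
                          scale_to_zero_enabled=True),
    ],
    traffic_config=TrafficConfig(routes=[
        Route(served_model_name="current", traffic_percentage=90),
        Route(served_model_name="challenger", traffic_percentage=10),
    ]),
    auto_capture_config=AutoCaptureConfigInput(
        catalog_name="governed_ai", schema_name="inference_logs",
        table_name_prefix="genai_endpoint",
    ),
)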
Step 4 — Monitor, alert, and (optionally) log joined features
With Inference Tables enabled, you can query the logs directly or feed them into Lakehouse Monitoring notebooks for LLM quality metrics (readability, toxicity, correctness) and drift dashboards with alerts and retraining hooks. If your endpoint performs feature lookups, Databricks supports saving the augmented DataFrame to the inference table for endpoints created from February 2025 onward, which strengthens debugging and governance. (databricks.com)
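As a concrete starting point, here is a rough monitoring rollup you might schedule as a Databricks job over the payload table created above; the column names (timestamp_ms, status_code, execution_time_ms) reflect the documented Inference Table schema but are worth verifying against your own table:

# Hedged sketch: daily error-rate and latency rollup over the Inference Table.
# Assumes it runs where a SparkSession named `spark` exists (Databricks notebook/job).
from pyspark.sql import functions as F

logs = spark.table("governed_ai.inference_logs.genai_endpoint_payload")

daily = (
    logs
    .withColumn("ts", F.to_timestamp(F.col("timestamp_ms") / 1000))
    .groupBy(F.window("ts", "1 day").alias("day"))
    .agg(
        F.count("*").alias("requests"),
        F.avg(F.when(F.col("status_code") != 200, 1).otherwise(0)).alias("error_rate"),
        F.expr("percentile_approx(execution_time_ms, 0.9)").alias("latency_p90_ms"),
    )
)

# Feed the same aggregation into a SQL alert or a retraining trigger when
# error_rate or latency_p90_ms crosses your threshold.
daily.show()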
Notes on portability and progressive delivery
- The canary idea works outside Databricks too. If you’re on Kubernetes, KServe has first‑class canary rollout controls via canaryTrafficPercent, and recent releases added strong support for LLM runtimes and OpenAI‑style endpoints—handy if your app already speaks that protocol. (kserve.github.io)
- Databricks Model Serving exposes a unified API (including OpenAI‑compatible) across internal and external models, which simplifies side‑by‑side testing when your org is standardizing on one interface. (databricks.com)
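For reference, calling a Databricks serving endpoint through that OpenAI-compatible interface can look like the sketch below; the workspace URL, endpoint name, and token variable are placeholders:

# Hedged sketch: hit the serving endpoint with the standard OpenAI client, pointed at Databricks.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],  # workspace OAuth token or PAT (placeholder name)
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",  # placeholder URL
)

resp = client.chat.completions.create(
    model="my-genai",  # the serving endpoint name from the steps above
    messages=[{"role": "user", "content": "Summarize our returns policy in two sentences."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)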
Compliance is a team sport
The EU AI Act doesn’t mandate specific tools, but it does expect evidence: data provenance, testing, safety, and transparency. The Code of Practice and templates published by the European Commission are concrete references your teams can bake into the pipeline alongside your CI gates and logs. Capturing requests/responses and evaluation scores gives you a durable audit trail, while your documentation process can pull from the same artifacts. (digital-strategy.ec.europa.eu)
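One inexpensive way to make that audit trail concrete is to run the CI gate inside an MLflow run tagged with the commit that produced it, so every deployment traces back to its evaluation scores. A sketch, reusing my_model and eval_data from ci/eval.py (GITHUB_SHA is set automatically in Actions):

# Hedged sketch: wrap the CI evaluation in an MLflow run tagged with the commit SHA,
# so metrics, the eval dataset, and the code version are recorded together.
import os
import mlflow

with mlflow.start_run(run_name="ci-quality-gate"):
    mlflow.set_tag("git_sha", os.environ.get("GITHUB_SHA", "local"))
    mlflow.set_tag("pipeline", "ci-mlops")
    results = mlflow.evaluate(
        model=my_model,              # same candidate as in ci/eval.py
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_correctness(), latency()],
    )
    # mlflow.evaluate logs its metrics and an evaluation-results artifact to this run.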
Wrap‑up
- Put model quality in CI using MLflow evaluate and fail fast on PRs.
- Use OIDC in GitHub Actions so your pipeline is secure and token‑free.
- Deploy challengers behind traffic splitting and turn on Inference Tables.
- Monitor continuously; trigger alerts and retraining from the same data.
This is what “Unified DevOps + MLOps” looks like in practice: one pipeline, clear gates, and production telemetry that closes the loop from code to model to user—and back again. (mlflow.org)