SLOs for the age of LLMs: practical SLIs, SLOs, and SLAs when "quality" is a moving target

Generative AI has changed what we mean by “service quality.” For traditional web APIs you measured uptime and latency; for large language model (LLM) services you must also measure correctness, hallucination risk, and per-token cost behaviour — all while keeping an eye on throughput and latency. This article walks through how to think about SLIs, SLOs, and SLAs for LLM-backed services, with concrete examples and an operational mindset that maps SRE fundamentals to the messy reality of model-driven systems.

Why this matters now

Quick definitions (the shared vocabulary)

  • SLI (Service Level Indicator): a quantitative measurement of service behaviour, for example the fraction of requests served successfully within a latency threshold.
  • SLO (Service Level Objective): a target value for an SLI over a window, e.g. "99.9% of requests succeed within 10s over 30 days."
  • SLA (Service Level Agreement): an external contract, usually with financial consequences, for missing agreed service levels.

These definitions are the scaffolding — the real work is choosing which SLIs matter for an LLM pipeline and making them practical to measure in production.

What’s different about LLM services

Choosing SLIs for LLM-backed services (practical list)

Think in layers: system, model-serving, and application-level quality.

  1. System-level SLIs (classic, still essential)
    • Availability: fraction of requests that receive a timely, successful response (e.g., HTTP 200 within 10s). (sreschool.com)
    • Latency: p50/p95/p99 of end-to-end request time (including retrieval, decoding, token streaming). (sreschool.com)
    • Error rate: proportion of requests returning 5xx or timeouts. (sre.google)
  2. Serving-level SLIs (model- and cost-aware)
    • Throughput (tokens/sec or requests/sec) under target latency.
    • Resource-efficiency SLI: average cost per successful request (useful for correlating reliability vs spend).
  3. Quality SLIs (application-specific, critical for LLMs)
    • Factuality/hallucination rate: percent of sampled responses flagged as hallucinations by automated checks or human review.
    • Relevance/accuracy: percent of responses meeting a minimum relevance score from a retriever+ranker or a secondary evaluator model.
    • Safety/abuse indicator: percent of responses containing policy-violating content (as measured by detectors).
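
The quality SLIs above can be computed from a batch of sampled, labeled responses. A minimal sketch follows; the `SampledResponse` schema, field names, and the 0.7 relevance threshold are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class SampledResponse:
    # Labels produced by automated checks or human review (illustrative schema).
    hallucinated: bool      # flagged as a hallucination
    relevance: float        # 0.0-1.0 score from a secondary evaluator
    policy_violation: bool  # flagged by a safety detector

def quality_slis(samples: list[SampledResponse], relevance_min: float = 0.7) -> dict[str, float]:
    """Return the three quality SLIs as fractions over the audited sample."""
    n = len(samples)
    if n == 0:
        return {"hallucination_rate": 0.0, "relevance_pass_rate": 0.0, "violation_rate": 0.0}
    return {
        "hallucination_rate": sum(s.hallucinated for s in samples) / n,
        "relevance_pass_rate": sum(s.relevance >= relevance_min for s in samples) / n,
        "violation_rate": sum(s.policy_violation for s in samples) / n,
    }
```

Because these rates come from a sample rather than from every request, they are estimates; the validation section below discusses how to keep them trustworthy.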

A few rules of thumb for choosing SLIs

Sample SLOs (concrete examples)

Implementing SLIs in code (example: PromQL-style)

Practitioners building SLO pipelines typically compute system-level SLIs with PromQL-style queries over request counters and latency histograms. (backendbytes.com)
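
For example, availability and p95 latency SLIs for an LLM gateway might look like the following. The metric names `llm_requests_total` and `llm_request_duration_seconds_bucket` are assumptions about your instrumentation, not standard names:

```promql
# Availability: fraction of requests answered without a 5xx, over a 30-day window
sum(rate(llm_requests_total{code!~"5.."}[30d]))
  /
sum(rate(llm_requests_total[30d]))

# p95 end-to-end latency (including retrieval and token streaming), 5m window
histogram_quantile(
  0.95,
  sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))
)
```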

Error budgets for LLMs: what changes

Error budgets are still the right tool: they convert SLOs into operational leeway and a governance mechanism for releases. With LLMs, though, the budget must become multi-dimensional, covering quality SLIs as well as availability.
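
One way to operate a multi-dimensional budget is to track the remaining budget per SLO and gate releases on the tightest dimension. A minimal sketch under illustrative targets and counts (the 10% freeze threshold is an assumption, tune it to taste):

```python
def remaining_budget(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left for one SLO over a window.

    slo_target: e.g. 0.999 for 99.9%; good/total: event counts in the window.
    Returns 1.0 (untouched) down to <= 0.0 (exhausted or overspent).
    """
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total   # events the SLO lets us get wrong
    actual_bad = total - good
    if allowed_bad <= 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad

def release_gate(budgets: dict[str, float], freeze_below: float = 0.1) -> bool:
    """Allow a release only if every dimension retains some budget."""
    return min(budgets.values()) > freeze_below

budgets = {
    "availability": remaining_budget(0.999, good=999_200, total=1_000_000),
    "latency_p95": remaining_budget(0.95, good=47_000, total=50_000),
    "hallucination": remaining_budget(0.98, good=1_940, total=2_000),
}
```

In this example the availability budget still has 20% left, but the latency and hallucination budgets are overspent, so the gate blocks the release.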

Validation: how to make quality SLIs trustworthy
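
Because quality SLIs are estimated from a small audited sample rather than every request, report them with a confidence interval so a genuine SLO breach is distinguishable from sampling noise. A sketch using the standard Wilson score interval (the sample sizes are illustrative):

```python
import math

def wilson_interval(flagged: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a sampled rate (e.g. hallucination rate)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p = flagged / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# 12 hallucinations flagged in 400 audited responses:
lo, hi = wilson_interval(12, 400)
```

Here the point estimate is 3%, but the interval is roughly 1.7% to 5.2%; an SLO of "under 2% hallucinations" should only be treated as confidently breached once `lo` exceeds 2%, which argues for a larger audit sample.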

Operational playbook snippets
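
One snippet worth keeping in the playbook is a multiwindow burn-rate check: page only when the budget is burning fast over both a long and a short window, which filters out transient blips. A minimal sketch; the 1h/5m window pair and the 14.4x threshold follow common SRE practice for 30-day SLO windows but are tunable assumptions:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.

    A burn rate of 1.0 exhausts exactly the whole budget by the end
    of the SLO window; higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target
    return error_rate / budget if budget > 0 else float("inf")

def should_page(err_long: float, err_short: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only if both the long (e.g. 1h) and short (e.g. 5m) windows burn fast.

    14.4x sustained for 1h spends ~2% of a 30-day budget, a common paging point.
    """
    return (burn_rate(err_long, slo_target) >= threshold
            and burn_rate(err_short, slo_target) >= threshold)
```

The same check applies unchanged to a sampled quality SLI such as hallucination rate: substitute the flagged-response rate for the error rate and a quality SLO target for `slo_target`.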

Case studies and research signals

A closing analogy: music and SLOs

Think of your service like a live concert. Traditional SLIs measure whether the lights are on and the speakers work (availability, latency). With LLMs you must also judge whether the band played the right song, in tune, and within set length (correctness, relevance, cost). You need sound engineers (observability), stage managers (error budgets and release controls), and critics (audits) — each role maps to an SRE practice that keeps the show reliable and enjoyable.

Final thoughts

SLIs, SLOs, and SLAs remain the core language for reliability, but LLMs force us to expand what we measure and how we tolerate failure. Adopt a layered approach: keep the classical system SLIs, add serving-level and quality SLIs, and treat error budgets as multi-dimensional. Instrument early, automate sampling and human audits, and don’t treat model outputs as binary — reliability for generative systems is about acceptable risk and graceful trade-offs, not perfection.

Selected references and further reading

Keep the dialogue between modelers and SREs open: the more you instrument and measure, the smarter your reliability trade-offs will be — and the better your users’ experience will sound.