SLOs for the age of LLMs: practical SLIs, SLOs, and SLAs when "quality" is a moving target
Generative AI has changed what we mean by “service quality.” For traditional web APIs you measured uptime and latency; for large language model (LLM) services you must also measure correctness, hallucination risk, and per-token cost behavior — all while keeping an eye on throughput and latency. This article walks through how to think about SLIs, SLOs, and SLAs for LLM-backed services, with concrete examples and an operational mindset that maps SRE fundamentals to the messy reality of model-driven systems.
Why this matters now
- Organizations are rapidly adopting generative AI, so SRE teams are being asked to apply reliability guardrails to systems whose outputs are probabilistic, stateful, and expensive to produce. (deloitte.com)
- The classical SRE triad — SLIs, SLOs, SLAs — still applies, but the choice of indicators and how you measure them must change to capture model quality, cost, and safety trade-offs. (sre.google)
Quick definitions (the shared vocabulary)
- SLI (Service Level Indicator): a quantitative measurement of some observable aspect of the service (latency, error rate, correctness, etc.). (sre.google)
- SLO (Service Level Objective): a target on an SLI (e.g., 99.9% of requests < 300 ms). (sre.google)
- SLA (Service Level Agreement): a contract with customers that may include financial remedies for missed SLOs.
These definitions are the scaffolding — the real work is choosing which SLIs matter for an LLM pipeline and making them practical to measure in production.
What’s different about LLM services
- Outputs are not binary correct/incorrect. “Correctness” is often fuzzy: relevance, factuality, and hallucination rate all live on different scales. Academic and engineering work has already begun to formalize per-stage SLOs for multi-stage LLM serving (e.g., retrieval, generation, rerank), which means SLOs may be stage-specific rather than monolithic. (arxiv.org)
- Latency and cost interplay. Meeting a strict latency SLO might force you to use smaller models or more caching, which affects quality. Conversely, keeping a high-quality SLO may require larger models and higher compute, affecting your error budget and economics. Research shows inference engines and serving stacks can be tuned to optimize for SLOs. (arxiv.org)
- Failure modes are different. A model can “succeed” technically (return a 200) while producing an unsafe or incorrect answer. Error budgets and SRE practices need to account for these semantic failures, not just HTTP-level errors. (dzone.com)
Choosing SLIs for LLM-backed services (practical list)
Think in layers: system, model-serving, and application-level quality.
- System-level SLIs (classic, still essential)
  - Availability: fraction of requests that receive a timely response (e.g., an HTTP 200 within 10s). (sreschool.com)
  - Latency: p50/p95/p99 of end-to-end request time (including retrieval, decoding, token streaming). (sreschool.com)
  - Error rate: proportion of requests returning 5xx or timeouts. (sre.google)
- Serving-level SLIs (model- and cost-aware)
  - Throughput (tokens/sec or requests/sec) under target latency.
  - Resource-efficiency SLI: average cost per successful request (useful for correlating reliability vs spend).
- Quality SLIs (application-specific, critical for LLMs)
  - Factuality/hallucination rate: percent of sampled responses flagged as hallucinations by automated checks or human review.
  - Relevance/accuracy: percent of responses meeting a minimum relevance score from a retriever+ranker or a secondary evaluator model.
  - Safety/abuse indicator: percent of responses containing policy-violating content (as measured by detectors).
A few rules of thumb for choosing SLIs
- Start with the simplest useful metric. If relevance is the core user need, measure end-to-end task success rather than internal model perplexity. (sre.google)
- Prefer indicators you can compute automatically in production with reasonable fidelity — even if imperfect — and periodically reconcile them with human audits.
- Different user journeys need different SLOs. Interactive chat has tight latency SLOs; batch summarization cares more about correctness and cost per request.
Sample SLOs (concrete examples)
- Availability SLO: 99.9% of requests receive an HTTP 200 within 3 seconds (30-day rolling window).
- Latency SLO (interactive): 95% of responses < 500 ms end-to-end.
- Quality SLO: On 1,000 sampled responses per week, at least 95% score >= 0.7 on a relevance-checker model and hallucination rate < 2%.
- Cost-aware SLO: Average inference cost per request must remain below $0.02 for the core inference tier. (A minimal compliance check over a weekly sample is sketched below.)
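To make the quality and cost SLOs above concrete, here is a minimal Python sketch of the weekly compliance check. It assumes each sampled response already carries a relevance score from your checker model, a hallucination flag from your detector, and an attributed cost; the thresholds simply mirror the example SLOs above and are illustrative, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class SampledResponse:
    relevance: float     # score from a relevance-checker model, 0..1 (assumed to exist)
    hallucinated: bool   # flag from an automated hallucination detector (assumed to exist)
    cost_usd: float      # inference cost attributed to this request

def check_quality_slos(samples: list[SampledResponse]) -> dict:
    """Evaluate a weekly sample against the illustrative SLOs above:
    >= 95% of samples with relevance >= 0.7, hallucination rate < 2%,
    and average cost per request below $0.02."""
    if not samples:
        raise ValueError("need a non-empty sample")
    n = len(samples)
    relevance_rate = sum(1 for s in samples if s.relevance >= 0.7) / n
    hallucination_rate = sum(1 for s in samples if s.hallucinated) / n
    avg_cost = sum(s.cost_usd for s in samples) / n
    return {
        "relevance_ok": relevance_rate >= 0.95,
        "hallucination_ok": hallucination_rate < 0.02,
        "cost_ok": avg_cost < 0.02,
        "relevance_rate": relevance_rate,
        "hallucination_rate": hallucination_rate,
        "avg_cost_usd": avg_cost,
    }
```

In practice this runs as a scheduled job over the weekly sample; a failing check should consume error budget and trigger the release controls discussed below.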
Implementing SLIs in code (example: PromQL-style)
- Availability SLI (percent successful requests over 30 days), assuming a conventional http_requests_total counter with a status-code label:
  - sum(rate(http_requests_total{code="200"}[30d])) / sum(rate(http_requests_total[30d]))
- Latency SLI (p95):
  - histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))
Practitioners building SLO pipelines routinely compute SLIs from PromQL-style queries like these; a minimal instrumentation sketch that feeds such queries follows. (backendbytes.com)
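The queries above only work if the serving layer exports the underlying metrics. Below is a minimal sketch of that instrumentation using the Python prometheus_client library; the request and latency metric names match the queries above, while the token and cost counters (and the call_model stub) are assumptions to adapt to your own stack.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Names chosen to match the PromQL examples above.
REQUESTS = Counter("http_requests_total", "LLM requests served", ["code"])
LATENCY = Histogram("http_request_duration_seconds", "End-to-end request latency in seconds")
# Assumed extra counters for serving-level SLIs (tokens and cost).
TOKENS = Counter("llm_tokens_generated_total", "Tokens generated")
COST = Counter("llm_inference_cost_usd_total", "Accumulated inference cost in USD")

def call_model(prompt: str):
    """Hypothetical stand-in for your model-serving call: returns (answer, tokens, cost)."""
    return "stub answer", 42, 0.001

def handle_request(prompt: str) -> str:
    start = time.monotonic()
    try:
        answer, tokens, cost = call_model(prompt)
        REQUESTS.labels(code="200").inc()
        TOKENS.inc(tokens)
        COST.inc(cost)
        return answer
    except Exception:
        REQUESTS.labels(code="500").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9000)  # expose /metrics for Prometheus to scrape
    handle_request("hello")
```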
Error budgets for LLMs: what changes
Error budgets are still the right tool: they convert SLOs into operational leeway and a governance mechanism for releases. But with LLMs:
- You must budget for semantic failures (e.g., allowed hallucination rate) in addition to uptime.
- Spend your error budget along different axes: latency spikes, correctness regressions, and safety incidents. Each axis might have its own budget so a lapse in one area doesn’t mask health in another. Practitioners writing about ML-focused SRE recommend separate or multi-dimensional error budgets for these workloads; a minimal per-axis tracking sketch follows this list. (dzone.com)
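As a sketch of what multi-dimensional budgets can look like in code, the snippet below tracks a separate budget per failure axis over a rolling window and reports whether any axis is exhausted; the axis names and budget sizes are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class ErrorBudget:
    """One budget per failure axis, so a latency incident cannot hide a quality regression."""
    axis: str
    allowed_bad_events: int  # budget for the current window (illustrative)
    consumed: int = 0

    def record(self, bad_events: int = 1) -> None:
        self.consumed += bad_events

    @property
    def exhausted(self) -> bool:
        return self.consumed >= self.allowed_bad_events

# Illustrative per-axis budgets for a 30-day window.
budgets = {
    "latency": ErrorBudget("latency", allowed_bad_events=500),
    "correctness": ErrorBudget("correctness", allowed_bad_events=200),
    "safety": ErrorBudget("safety", allowed_bad_events=20),
}

def release_gate() -> bool:
    """Allow risky releases only while every axis still has budget left."""
    return all(not b.exhausted for b in budgets.values())
```

Wiring each detector (latency monitor, quality evaluator, safety classifier) to record() against its own axis keeps the budgets independent.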
Validation: how to make quality SLIs trustworthy
- Shadow evaluation: run a new model in shadow and compare its responses on a sample to a golden baseline; measure the delta on your quality SLIs (see the sketch after this list).
- Human-in-the-loop audits: schedule weekly human reviews for a random slice; use these to calibrate automated detectors.
- Canary + multi-stage SLOs: enforce SLOs at each stage (retrieval, generation, rerank) so that regressions are easier to localize. This multi-stage approach is emerging in research for multi-SLO LLM serving. (arxiv.org)
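A minimal sketch of that shadow-evaluation step: replay the same sampled prompts through the baseline and the candidate, score both with the same evaluator, and look at the delta on your quality SLI. The generate and evaluate callables are hypothetical stand-ins for your serving endpoints and evaluator model.

```python
from statistics import mean
from typing import Callable

def shadow_delta(
    prompts: list[str],
    baseline: Callable[[str], str],         # current production model (hypothetical stand-in)
    candidate: Callable[[str], str],        # model running in shadow (hypothetical stand-in)
    evaluate: Callable[[str, str], float],  # evaluator: (prompt, answer) -> quality score
) -> float:
    """Mean quality-score delta of the candidate versus the baseline on the same prompts."""
    baseline_scores = [evaluate(p, baseline(p)) for p in prompts]
    candidate_scores = [evaluate(p, candidate(p)) for p in prompts]
    return mean(candidate_scores) - mean(baseline_scores)

# Illustrative gate: hold the rollout if the candidate regresses mean quality
# by more than 0.02 on the sampled traffic (the threshold is an assumption).
# if shadow_delta(sampled_prompts, prod_model, shadow_model, relevance_checker) < -0.02: hold_rollout()
```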
Operational playbook snippets
- Instrument early. Add counters/timers for tokens produced, candidate reranks, detector results (flagged hallucination, safety hits), and cost.
- Automate sampling and evaluation. Build a pipeline that samples N responses per day, runs them through an evaluator model, and pushes indicators to dashboards (a minimal sketch follows this list).
- Treat quality alerts like reliability alerts. If your quality SLO’s error budget is spent, slow down releases, increase human review, and allocate engineering time to fix the cause.
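As one way to wire that up, the sketch below samples recent responses, scores them, and pushes the resulting quality SLIs to a Prometheus Pushgateway so they land on the same dashboards as your system SLIs. The fetch_recent_responses and score_response helpers are hypothetical stand-ins for your request log and your evaluator/detector models.

```python
import random
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def fetch_recent_responses(limit: int) -> list[dict]:
    """Hypothetical stand-in: pull recent prompt/response records from your request log."""
    return [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(limit)]

def score_response(record: dict) -> dict:
    """Hypothetical stand-in: run the evaluator and hallucination detector over one record."""
    return {"relevance": 0.8, "hallucinated": False}

def daily_quality_job(sample_size: int = 500) -> None:
    sample = random.sample(fetch_recent_responses(limit=10_000), sample_size)
    scored = [score_response(r) for r in sample]
    relevance_rate = sum(s["relevance"] >= 0.7 for s in scored) / len(scored)
    hallucination_rate = sum(s["hallucinated"] for s in scored) / len(scored)

    registry = CollectorRegistry()
    Gauge("llm_sampled_relevance_rate",
          "Share of sampled responses meeting the relevance bar", registry=registry).set(relevance_rate)
    Gauge("llm_sampled_hallucination_rate",
          "Share of sampled responses flagged as hallucinations", registry=registry).set(hallucination_rate)
    push_to_gateway("pushgateway:9091", job="llm_quality_sampling", registry=registry)
```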
Case studies and research signals
- Several recent research efforts and system proposals tackle SLO-aware LLM serving: from per-stage SLO frameworks to inference engine tuning systems that optimize serving for SLO constraints. These demonstrate both the feasibility and the need to treat quality SLOs as first-class operational metrics. (arxiv.org)
A closing analogy: music and SLOs
Think of your service like a live concert. Traditional SLIs measure whether the lights are on and the speakers work (availability, latency). With LLMs you must also judge whether the band played the right song, in tune, and within the planned set length (correctness, relevance, cost). You need sound engineers (observability), stage managers (error budgets and release controls), and critics (audits) — each role maps to an SRE practice that keeps the show reliable and enjoyable.
Final thoughts
SLIs, SLOs, and SLAs remain the core language for reliability, but LLMs force us to expand what we measure and how we tolerate failure. Adopt a layered approach: keep the classical system SLIs, add serving-level and quality SLIs, and treat error budgets as multi-dimensional. Instrument early, automate sampling and human audits, and don’t treat model outputs as binary — reliability for generative systems is about acceptable risk and graceful trade-offs, not perfection.
Selected references and further reading
- Google SRE book — service level objectives and indicators for traditional services. (sre.google)
- Deloitte: enterprise adoption patterns for generative AI and considerations for operational governance. (deloitte.com)
- SLOs-Serve and related academic work on multi-SLO LLM serving (per-stage objectives). (arxiv.org)
- SCOOT and other research on tuning inference engines to meet SLOs. (arxiv.org)
- “Building SRE Error Budgets for AI/ML Workloads” — practical guidance on unique ML failure modes and error budgets. (dzone.com)
Keep the dialogue between modelers and SREs open: the more you instrument and measure, the smarter your reliability trade-offs will be — and the better your users’ experience will sound.