Measuring the Unseen: SLIs, SLOs, and SLAs for Generative AI Services
Generative AI — chatbots, multimodal assistants, code generators — behaves less like a traditional request/response API and more like a live performance: every call has rhythm (tokens per second), tempo (time-to-first-token), and occasionally a wrong note (hallucination). That shift means the classic SRE trio — SLIs, SLOs, and SLAs — still matters, but what you measure and how you set targets need to change. This article explains the fundamentals and shows how to apply them to modern generative AI services.
Quick refresher: SLIs, SLOs, SLAs — short and practical
- Service Level Indicator (SLI): a measured value that represents some aspect of user experience (e.g., request latency, error rate, availability). Think of SLIs as the knobs and meters on your mixing console. (en.wikipedia.org)
- Service Level Objective (SLO): a target or goal for an SLI over a time window (e.g., p95 request latency stays under 300 ms over a rolling 30-day window). SLOs are your setlist — they tell the team which songs you’ll play and how tight the performance should be. (sre.google)
- Service Level Agreement (SLA): a contractual promise to customers that typically includes penalties if targets aren’t met. SLAs are the headline act: the legally binding commitment that follows the setlist. (en.wikipedia.org)
Why generative AI changes what you should measure
Generative systems often break a single request into stages (prompt processing, model scheduling/loading, token generation). Two consequences matter for SREs:
- Latency is multidimensional: users care about time-to-first-token (TTFT), and for streaming responses the pace between tokens (time-between-tokens) matters as much as end-to-end completion time. Papers and operational work on serving LLMs highlight these stage-level SLOs. (arxiv.org)
- Correctness is subjective and domain-dependent: “error rate” isn’t just 5xx HTTP codes — it can be hallucination rate, factuality score, or safety-violation incidents, which require different instruments and often human-in-the-loop verification. Google Cloud and other providers recommend adapting SRE practices specifically for AI/ML reliability. (cloud.google.com)
Put simply: measuring only HTTP 5xx and average latency is like grading a live concert by whether the lights worked — necessary but not sufficient.
Practical SLIs for generative AI (what to measure)
Pick SLIs that map to user journeys. Common, useful SLIs for generative workloads:
- Time-to-first-token (TTFT): the wall-clock time from request arrival to the first token streamed back. Critical for perceived responsiveness. (arxiv.org)
- Token throughput / tokens-per-second: how many tokens the model emits per second during streaming responses. Useful for capacity planning and QoS tiers.
- Tail latency (p95, p99) of TTFT and completion latency: capture extreme latency behavior, which is what users really notice.
- Availability (request success rate): percentage of requests that complete without infrastructure errors (5xx) or timeouts.
- Model correctness or hallucination rate: percent of responses failing an automated classifier or human review for factuality/safety. This often requires periodic labeling or proxy metrics (e.g., contradictions detected by a verifier model).
- Cost-per-request or compute-utilization SLI: for internal economics and error-budget trade-offs in autoscaling decisions.
When possible, instrument at the stage level: queue time, model load time, inference time, and streaming pace. That granularity helps you tie a spike in TTFT to a GPU warm-up or model swap, not just “the model is slow.”
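Stage-level instrumentation ultimately reduces to a few timestamps captured at each boundary. A minimal sketch of deriving the SLIs above from those timestamps (the dataclass and field names are illustrative, not tied to any particular serving framework):

```python
from dataclasses import dataclass

@dataclass
class RequestTimeline:
    """Wall-clock timestamps (seconds) captured at each stage boundary."""
    arrival: float      # request hits the server
    dequeued: float     # scheduler hands it to a model replica
    first_token: float  # first token streamed back to the client
    last_token: float   # final token streamed back
    token_count: int    # total tokens emitted

def stage_metrics(t: RequestTimeline) -> dict:
    """Break one request into the stage-level SLIs discussed above."""
    streaming = t.last_token - t.first_token
    return {
        "queue_time_s": t.dequeued - t.arrival,
        "ttft_s": t.first_token - t.arrival,       # time-to-first-token
        "completion_s": t.last_token - t.arrival,  # end-to-end latency
        # Streaming pace after the first token; guard single-token responses.
        "tokens_per_s": (t.token_count - 1) / streaming if streaming > 0 else 0.0,
    }
```

With per-request records like this, a TTFT spike that coincides with a large `queue_time_s` points at scheduling or capacity, while a spike with normal queue time points at the model path (e.g., a cold replica or model swap).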
Writing SLOs and using error budgets — examples and mindset
SLOs should be business-facing, measurable, and actionable. A typical set for a user-facing chat API might look like:
- TTFT p95 < 300 ms (rolling 30 days).
- Streaming throughput: 90% of streaming sessions sustain ≥ 10 tokens/sec after the first 5 tokens.
- Availability: 99.9% successful requests (rolling 30 days).
- Hallucination rate: ≤ 1% for top-tier factual queries (measured by periodic sampling and verification over 30 days).
- Error budget: allow 0.1% of requests per month to exceed SLOs before triggering mitigation playbooks.
Error budgets let teams trade reliability for feature velocity: once the budget is consumed, prioritize reliability work or throttle risky launches. For generative models, error budgets can be split across dimensions (latency budget vs. correctness budget) because you might accept a small latency regression to reduce hallucinations or vice versa. Academic and operational work on multi-SLO serving recommends treating stage-specific SLOs and adaptive allocation as first-class concerns. (arxiv.org)
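Per-dimension budgets can be tracked with very little machinery. A sketch of that accounting, using hypothetical traffic numbers alongside the SLO targets above (a simplification: real burn-rate alerting also considers the rate of spend, not just the remainder):

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to miss the SLO in the window, e.g. 99.9% -> 0.1%."""
    return int(total_requests * (1.0 - slo_target))

def budget_remaining(slo_target: float, total: int, bad: int) -> float:
    """Fraction of the window's error budget still unspent (negative = overspent)."""
    budget = total * (1.0 - slo_target)
    return 1.0 - bad / budget if budget > 0 else 0.0

# Hypothetical month: 10M requests at a 99.9% latency SLO, and 500k
# factual queries sampled against the 1% hallucination ceiling.
# Separate pools per dimension, so a TTFT regression can't mask
# a correctness regression (or vice versa).
latency_left = budget_remaining(0.999, 10_000_000, bad=4_000)     # budget left
correctness_left = budget_remaining(0.99, 500_000, bad=5_500)     # overspent
```

When one pool goes negative, the mitigation playbook for that dimension fires — throttling risky launches for a latency overspend, tightening verification or rolling back a model for a correctness overspend.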
Instrumentation tips: measuring the right things reliably
- Use client-observed metrics for user-perceived latency (TTFT from the edge), and correlate with server-side stage metrics to find root causes.
- Record percentiles using histograms (not averages). For Prometheus, histogram_quantile or summaries are typical approaches for p95/p99. Example PromQL for p95 request latency:
```
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le))
```
For TTFT, name the metric request_time_to_first_token_seconds and use the same pattern.
- Track correctness via automated verifiers where possible (redundant models, fact-checker), and keep a labeled sample for periodic human validation; automated signals alone can drift.
- Instrument cost and resource signals (GPU utilization, queue lengths) so you can connect SLO degradation to capacity issues.
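For intuition about what that PromQL computes, here is a minimal Python re-implementation of the linear interpolation `histogram_quantile` performs over cumulative `le` buckets (a simplified sketch for illustration, not Prometheus' exact code):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (le_upper_bound, count) buckets,
    sorted ascending, with the last bucket's bound being +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: best we can do is the
                # previous finite bound.
                return prev_bound
            # Linear interpolation between the bucket's boundaries.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound
```

The takeaway for bucket design: the estimate is only as precise as the bucket boundaries around your SLO threshold, so place a boundary at (or near) the target, e.g. a 0.3 s bucket for a 300 ms TTFT SLO.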
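On the correctness side, a hedged sketch of the sampling workflow: estimate the failure rate from a labeled random sample and report a confidence interval, so you can tell real drift from sampling noise (the `verifier` callable is a hypothetical stand-in for your automated checker or human-review queue):

```python
import math
import random

def sampled_rate(items, verifier, sample_size, seed=0):
    """Estimate a failure rate by labeling a random sample.

    Returns (point_estimate, (lo, hi)) where (lo, hi) is a 95%
    normal-approximation confidence interval, clamped to [0, 1].
    `verifier(item)` returns True when the item fails the check.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    failures = sum(1 for item in sample if verifier(item))
    p = failures / len(sample)
    half_width = 1.96 * math.sqrt(p * (1 - p) / len(sample))
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))
```

If the interval is wider than the gap between the measured rate and the SLO target (say, a 1% hallucination ceiling), the sample is too small to decide whether you are in or out of budget — increase the sample before paging anyone.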
Policy and SLA framing
SLAs for generative services must reflect the realities of both technical variability and legal risk:
- Be explicit about what counts as a “successful request” (e.g., excludes responses flagged for safety or that required human intervention).
- If correctness/safety is in scope, offer differentiated SLAs (e.g., “Tier A customers get human-in-the-loop verification with 99.99% availability”).
- Set reasonable limits and exclusions (e.g., model updates, third-party model changes, training-data drift). Contracts should align incentives: if you promise high factuality, ensure resourcing and monitoring match that promise.
Operationalizing multi-dimensional SLOs
Generative systems are multi-dimensional — latency, throughput, and correctness are linked. Some practical rules of thumb:
- Start simple: pick 2–3 high-leverage SLIs (TTFT, availability, hallucination rate) and iterate. Over-instrumentation without clear owners leads to alert fatigue.
- Make SLOs visible and tied to teams: dashboards that show remaining error budget and which releases consumed budget are powerful motivators.
- Automate autoscaling around SLOs where possible: modern research shows that operator-level scaling and stage-aware autoscaling preserve SLOs more efficiently for large models than naive strategies. Use capacity decisions that are SLO-aware rather than purely utilization-based. (arxiv.org)
A short reality check
Generative AI services add subjective, business, and safety dimensions to reliability. You’ll never reduce everything to a single “uptime” number — and you shouldn’t try. The goal of SLIs and SLOs in this space is to translate user experience and risk into measurable, actionable targets so engineering can plan capacity, prioritize fixes, and negotiate realistic SLAs.
Measured the right way, SLIs are your stage monitors; SLOs are the rehearsal schedule; SLAs are the ticket promise to the audience. Keep the music tight, but accept that improvisation is part of the art — if you instrument well and make trade-offs explicit, you’ll make better decisions under pressure.
References
- Service Level Indicator overview. (en.wikipedia.org)
- Guidance on SRE and SLO practices and adoption. (sre.google)
- Google Cloud: AI/ML reliability and how SRE practices apply. (cloud.google.com)
- Research on latency-focused SLOs for generative models (time-to-first-token, time-between-tokens). (arxiv.org)
- SLOs-Serve: multi-stage SLO serving for LLM pipelines. (arxiv.org)