Making SLOs Sing for Generative AI: SLIs (TTFT/TPOT), SLOs, and SLAs Explained
Generative AI services — chatbots, assistants, code generators — changed the choreography of reliability. Instead of a single uptime percentage, these systems have a rhythm: the first token that shows up, the steadiness of token generation, and the way long requests steal the stage. This article explains SLIs, SLOs, and SLAs in plain terms and shows how to pick measurements and targets that actually reflect user experience for LLM-powered endpoints.
Why the classic trio still matters (but looks different)
- SLI (Service Level Indicator): an observable measurement that reflects how a service behaves.
- SLO (Service Level Objective): the target you set on an SLI over a defined window.
- SLA (Service Level Agreement): a legal/business promise — the SLO plus consequences if you miss it.
Those definitions are the SRE baseline: measure, target, and promise — but generative systems need indicators that capture interaction quality, not just “is the server up?” (sre.google)
Two LLM-native SLIs you’ll see everywhere
For interactive generative services, two latency-focused SLIs matter more than simple request success or CPU utilization:
- Time To First Token (TTFT): time from request arrival to the first output token. It captures the “first bite” of the response — the moment the user feels the system is alive. (emergentmind.com)
- Time Per Output Token (TPOT) or inter-token latency: the pace at which subsequent tokens stream back. For streaming UX, a steady TPOT keeps dialogue fluid.
Think of a streaming reply like a live music performance: TTFT is the pause before the first note; TPOT is the tempo after the song starts. Both matter — a fast start with a glacial tempo still feels bad, and vice versa.
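Both measures fall out directly from per-token timestamps. A minimal sketch (the function name and timestamp layout are assumptions for illustration):

```python
from statistics import mean

def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT = delay to the first token; TPOT = mean gap between subsequent tokens."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = mean(gaps) if gaps else 0.0
    return ttft, tpot

# Hypothetical stream: request at t=0.0 s, tokens 40 ms apart after a 250 ms start.
ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.29, 0.33, 0.37, 0.41])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")  # TTFT=0.250s TPOT=0.040s
```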
How to convert those SLIs into SLOs (practical rules)
- Pick the user experience first. Are users doing short chat queries, or long document generations? Interactive chat needs tight TTFT; batch generation tolerates longer starts but needs predictable throughput.
- Use percentiles, not averages. SLOs are usually written as P95 or P99 thresholds (e.g., “P95 TTFT < 300 ms over 30 days”).
- Set the window to capture realistic variability: 7, 30, or 90 days depending on seasonality and business cycles.
- Don’t confuse current performance with a target. A common anti-pattern is “set SLO = what we already achieve.” Targets should be ambitious but achievable and linked to an error budget.
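A percentile-over-window check is easy to make concrete. A sketch using nearest-rank percentiles (the helper names are made up for illustration):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(values)
    return ranked[max(0, math.ceil(p * len(ranked) / 100) - 1)]

def meets_slo(ttft_samples_ms: list[float], target_ms: float = 250, p: float = 95) -> bool:
    """True if the chosen percentile of TTFT samples is within the target."""
    return percentile(ttft_samples_ms, p) <= target_ms

# 95 fast requests and 5 slow ones: the P95 sits in the fast group, so the SLO holds.
samples = [120.0] * 95 + [900.0] * 5
print(meets_slo(samples))  # True
```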
Example SLOs (templates)
- Interactive chat endpoint:
- SLI: TTFT measured per request
- SLO: 95% of requests have TTFT ≤ 250 ms over a 30-day window
- Streaming generation:
- SLI: TPOT measured as median inter-token latency
- SLO: median TPOT ≤ 40 ms, and P95 TPOT ≤ 120 ms over 30 days
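The streaming template pairs two thresholds on the same SLI; a sketch of checking both at once (function and parameter names are illustrative):

```python
import math
from statistics import median

def tpot_slo_met(tpot_samples_ms: list[float],
                 median_target_ms: float = 40,
                 p95_target_ms: float = 120) -> bool:
    """True only if both the median and the P95 inter-token latency are within target."""
    ranked = sorted(tpot_samples_ms)
    p95 = ranked[max(0, math.ceil(95 * len(ranked) / 100) - 1)]
    return median(tpot_samples_ms) <= median_target_ms and p95 <= p95_target_ms

# Steady stream with a few slow tokens: median 35 ms, P95 still within 120 ms.
print(tpot_slo_met([35.0] * 96 + [110.0] * 4))   # True
print(tpot_slo_met([35.0] * 90 + [200.0] * 10))  # False (P95 breached)
```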
A simple SLO record (pseudo-JSON)
{
  "service": "chat-endpoint",
  "sli": "ttft_ms",
  "slo": {
    "target": 250,
    "percentile": 95,
    "window_days": 30
  },
  "aggregation": "per_request"
}
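A record like that is enough to drive an automated check. A sketch that parses it and reports attainment (the field names follow the record above; the function itself is illustrative):

```python
import json

RECORD = json.loads("""
{
  "service": "chat-endpoint",
  "sli": "ttft_ms",
  "slo": {"target": 250, "percentile": 95, "window_days": 30},
  "aggregation": "per_request"
}
""")

def attainment(samples_ms: list[float], record: dict) -> tuple[float, bool]:
    """Fraction of requests within target, and whether that fraction meets the SLO percentile."""
    target = record["slo"]["target"]
    frac = sum(1 for s in samples_ms if s <= target) / len(samples_ms)
    return frac, frac >= record["slo"]["percentile"] / 100

frac, ok = attainment([200.0] * 97 + [400.0] * 3, RECORD)
print(f"{frac:.0%} within target, SLO met: {ok}")  # 97% within target, SLO met: True
```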
Why error budgets and multi-SLO thinking matter for LLMs
Error budgets — the allowed fraction of time an SLO can be missed — turn reliability into a planning signal. For LLM services you often need multiple SLOs (TTFT, TPOT, availability, correctness signals for retrieval) and to trade between them. Recent research and production systems explicitly design schedulers and serving stacks that are “SLO-aware,” because heterogeneous requests (short interactive vs. long batch) compete for the same GPUs and I/O. That means your autoscaling, batching, and prioritization must respect multiple SLOs simultaneously; otherwise you’ll improve throughput at the cost of TTFT violations for interactive users. (arxiv.org)
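For a request-count SLO, the error budget reduces to simple arithmetic over the window. A sketch (the names are assumptions, not a standard API):

```python
def error_budget(slo_percentile: float, total: int, bad: int) -> tuple[float, float]:
    """Remaining budget (in requests) and the fraction of the budget already burned."""
    allowed_bad = (1 - slo_percentile / 100) * total
    burned = bad / allowed_bad if allowed_bad else float("inf")
    return allowed_bad - bad, burned

# A 95% SLO over 10,000 requests allows 500 misses; 300 misses burns 60% of the budget.
remaining, burned = error_budget(95, 10_000, 300)
print(f"remaining={remaining:.0f} burned={burned:.0%}")  # remaining=200 burned=60%
```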
Measuring accurately: telemetry that maps to the user
- Measure per-request at the edge (capture arrival time, first token emitted, per-token timestamps). Aggregation after the fact hides spikes.
- Instrument both client-facing latency and internal stage latencies (prefill, decode, network). This helps isolate whether the issue is model compute, I/O, or request queuing.
- Use sampling wisely: capture complete traces for slow or high-cost requests, and sample the rest for trend analysis.
Google Cloud and other observability tooling now include SLO creation and dashboards that make these practices actionable — the tools can show SLO attainment, burn rate, and error-budget alerts, which are essential for operational decisions. (cloud.google.com)
Common pitfalls (and how to think about them)
- Measuring the wrong thing: “Is the server up?” won’t tell you if responses are unusable due to high TTFT or stuttering TPOT.
- One-size-fits-all SLO: a single global SLO for all request types either over-promises for heavy jobs or under-delights interactive users. Segment request classes and set appropriate SLOs.
- Overfitting to percentiles: P99 targets are noble but costly; make sure the business value justifies the cost.
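Segmenting request classes can start as a lookup table keyed by a cheap classifier; the targets and the classification rule below are purely illustrative:

```python
# Hypothetical per-class targets: interactive chat vs. long batch generation.
SLOS_BY_CLASS = {
    "interactive": {"ttft_p95_ms": 250, "tpot_p95_ms": 120},
    "batch":       {"ttft_p95_ms": 5000, "tpot_p95_ms": 200},
}

def classify(request: dict) -> str:
    """Toy rule: a large max_tokens suggests a batch-style generation job."""
    return "batch" if request.get("max_tokens", 0) > 2048 else "interactive"

def slo_for(request: dict) -> dict:
    """Look up the SLO set that applies to this request's class."""
    return SLOS_BY_CLASS[classify(request)]

print(slo_for({"max_tokens": 256})["ttft_p95_ms"])   # 250
print(slo_for({"max_tokens": 8192})["ttft_p95_ms"])  # 5000
```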
Operational trade-offs and the art of negotiation
SLOs create a conversation between product, engineering, and finance. Tight TTFT increases GPU provisioning and clever scheduling complexity; relaxed SLOs save cost but may frustrate users. Use error budgets as a currency: when the budget is healthy, teams can ship features; when it’s depleted, reliability work and mitigations take priority. That simple economy — ship vs. stabilize — is the core advantage of SRE thinking.
Final note: reliability is subjective — measure what users feel
For LLM services, traditional “uptime” is necessary but not sufficient. Reliability that users notice is about responsiveness and steady throughput. Define SLIs that reflect the first token and the steady stream, set SLOs that match your user journeys, and make SLAs only when your business is ready to back them with consequences. Recent systems and papers show this isn’t theoretical: LLM-serving research increasingly designs schedulers and autoscalers around TTFT and TPOT SLOs, which proves the approach scales beyond labs to production. (emergentmind.com)
If you remember one thing: design SLOs like a playlist. The opening note (TTFT) sets expectations; the tempo (TPOT) keeps people listening. Tune both, keep an eye on your error budget, and your generative AI service will feel reliably musical.