SLIs, SLOs, and SLAs — a practical guide for modern services
Reliability promises live at three levels: SLIs (what you measure), SLOs (what you aim for), and SLAs (what you contract). Getting them right means measuring what users actually experience, setting realistic targets that balance velocity and risk, and translating commitments into clear operational policy. Below I walk through the fundamentals and show concrete examples you can use for web apps and inference services.
Quick definitions (the short, useful versions)
- SLI — Service Level Indicator: a measurable signal of user experience (e.g., request success rate, request latency, or page render time). (sre.google)
- SLO — Service Level Objective: a time-bounded reliability target expressed over an SLI (e.g., “99.9% of GET requests complete in <200 ms over 30 days”). (sre.google)
- SLA — Service Level Agreement: a formal contract describing financial or legal consequences if SLOs aren’t met. SLAs are contracts; SLOs are engineering targets. (sre.google)
Why these three, and the central role of error budgets
An SLO deliberately sets its target below 100%, which creates an explicit gap between perfect reliability and what users actually need. That gap is the error budget: the allowable amount of “bad behavior” inside your measurement window. For example, a 99.9% SLO over one million requests permits up to 1,000 failed requests. Error budgets are a practical tool to balance feature velocity and stability: if you burn budget quickly, you should slow down changes and fix stability issues; if budget is plentiful, you can move faster. Google’s SRE guidance lays out this approach and gives practical examples of how teams use error budgets to make release decisions. (sre.google)
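To make the error-budget arithmetic concrete, here is a minimal Python sketch. The function name `error_budget`, its parameters, and the sample request counts are illustrative assumptions, not part of any SRE tooling.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error-budget usage for one measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of events the SLO allows to be bad.
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad > 0:
        consumed_fraction = bad_events / allowed_bad
    else:
        # No traffic means no budget; any bad event would exceed it.
        consumed_fraction = 0.0 if bad_events == 0 else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "remaining_bad": allowed_bad - bad_events,
        "consumed_fraction": consumed_fraction,
    }


# Hypothetical window: 1,000,000 requests, 400 of them bad, 99.9% SLO.
budget = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=400)
print(budget)
```

With these sample numbers, roughly 40% of the window's budget is consumed, so the team still has room to ship changes.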
Choose SLIs that map to user experience
Pick SLIs that represent the user’s journey, not only internal signals.
Examples:
- Web apps: success rate (HTTP 2xx), frontend latency, and user-centric metrics like Core Web Vitals (LCP, INP, CLS) which capture perceived page quality. Core Web Vitals were explicitly designed to represent user experience and are commonly reused as SLIs. (web.dev)
- APIs: request success rate, p99 latency for critical calls, and semantic correctness checks (sometimes implemented as canaries that validate business logic end-to-end). (docs.aws.amazon.com)
- ML inference: latency at tail quantiles (p95/p99), per-model throughput, and correctness/quality checks (e.g., silent error or drift detection). Recent research and production work emphasizes SLO-aware schedulers and dynamic SLO handling for inference workloads. (arxiv.org)
Guidelines for picking SLIs:
- Start with the user’s “happy path” (the primary flow you want to protect).
- Avoid mixing signals in a single SLI (e.g., don’t conflate client-side timeouts with server processing time).
- Prefer ratios (good requests / total requests) or quantiles over raw counters, because they are easier to reason about for targets.
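As a rough illustration of the ratio and quantile guidance above, the following sketch computes both kinds of SLI from raw request samples. The `Request` type, the sample data, and the choice of the nearest-rank quantile method are assumptions made for the example; production systems usually compute these in the metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Ratio SLI: good requests (HTTP 2xx) divided by total requests."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_quantile(requests: list[Request], q: float) -> float:
    """Quantile SLI using the nearest-rank method (0 < q <= 1)."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Ten hypothetical requests: nine succeed, one fails; latencies 10..100 ms.
sample = [Request(200, 10.0 * i) for i in range(1, 10)] + [Request(500, 100.0)]
print(availability_sli(sample), latency_quantile(sample, 0.95))
```

Both values are normalized (a fraction and a latency at a fixed quantile), so they can be compared against a target directly, which is harder with a raw error counter.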
Example SLOs and how to think about targets
SLOs should be:
- Specific: SLI, threshold, and measurement window (e.g., “99.9% of checkout POSTs succeed within 500 ms, measured over 30 days”).
- Measurable: instrumented and queryable in your observability system.
- Business-informed: chosen by product and SRE together, informed by user impact and cost.
Simple example targets commonly seen in practice:
- Availability: 99.9% (≈43.2 minutes of downtime per 30 days) or 99.95% (≈21.6 minutes per 30 days). Each additional “nine” increases cost and complexity. (techbytes.app)
- Latency: 95% under X ms and 99% under Y ms for user-facing calls (combine quantiles with availability where appropriate). (sre.google)
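The downtime figures for availability targets follow from simple arithmetic, sketched below. The helper name `allowed_downtime_minutes` and the 30-day default window are assumptions chosen to match the examples above.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability target permits per window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


for target in (0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} -> {minutes:.2f} minutes per 30 days")
```

Each extra nine cuts the allowance by a factor of ten, which is one way to see why the cost of higher targets rises so quickly.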
Practical measurement notes
- Windowing: Use rolling windows (e.g., 30 days) for SLO calculations to smooth bursts and reflect recent behavior. (sre.google)
- Counting units: Define “good” and “bad” clearly—are retries counted? Are client-side errors counted? Many teams instrument both server-side success and client-side total experience, and treat them as separate SLIs. (docs.aws.amazon.com)
- Canaries and semantic checks: For feature correctness and subtle failures, synthetic canaries that exercise real business logic are invaluable; they provide a viewpoint complementary to raw telemetry. (docs.aws.amazon.com)
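The rolling-window idea can be sketched in a few lines. This example assumes pre-aggregated daily counts of good and total events; the class name `RollingSLI` and the bucket granularity are hypothetical choices, and what counts as "good" must already be decided upstream.

```python
from collections import deque


class RollingSLI:
    """Availability ratio over the most recent `window_days` daily buckets."""

    def __init__(self, window_days: int = 30):
        # deque with maxlen drops the oldest bucket automatically.
        self.buckets = deque(maxlen=window_days)

    def add_day(self, good: int, total: int) -> None:
        if good > total:
            raise ValueError("good events cannot exceed total events")
        self.buckets.append((good, total))

    def value(self):
        """Return good/total across the window, or None with no traffic."""
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return good / total if total else None


# Hypothetical 3-day window so the eviction is easy to follow.
sli = RollingSLI(window_days=3)
for good, total in [(100, 100), (90, 100), (100, 100), (100, 100)]:
    sli.add_day(good, total)
print(sli.value())
```

Once the bad day ages out of the window, the SLI recovers on its own, which is the behavior a rolling window is meant to provide.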
Example Prometheus-style SLI (availability ratio):
# fraction of successful requests over 30 days (pseudo-PromQL)
sum(rate(http_requests_total{job="api",status=~"2.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
Special considerations for modern workloads
- Frontend UX: For consumer web apps, replace or augment raw server-side SLIs with client-side experience metrics (Core Web Vitals) to measure perceived performance. This aligns engineering work to what users actually feel. (web.dev)
- ML/Inference: Inference workloads bring variable compute cost and tail-latency issues. Research and production systems increasingly use SLO-aware schedulers or dynamic scaling to meet per-request SLOs while controlling costs. If you serve models, include correctness and tail-latency SLIs and consider per-model SLOs. (arxiv.org)
Put policy around the numbers
An SLO without an error-budget policy is just a target on a dashboard. Formalize:
- Who owns the budget? (team or service)
- What actions occur at thresholds? (alerts, freeze on releases, mitigation sprints)
- How do you communicate budget burn to product and stakeholders?
Google’s error-budget playbook is a useful template for structuring these policies and translating budget burn into concrete operational responses. (sre.google)
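One way to make such a policy unambiguous is to encode it as data. The sketch below maps the consumed fraction of the error budget to an action; the thresholds (50%, 75%, 100%) and the action wording are hypothetical and should be replaced by whatever your team agrees on.

```python
# Hypothetical thresholds, highest first; tune these to your own policy.
POLICY = [
    (1.00, "freeze releases"),
    (0.75, "schedule mitigation sprint"),
    (0.50, "alert service owners"),
]


def policy_action(consumed_fraction: float) -> str:
    """Return the action for the highest threshold that has been crossed."""
    if consumed_fraction < 0:
        raise ValueError("consumed_fraction cannot be negative")
    for threshold, action in POLICY:
        if consumed_fraction >= threshold:
            return action
    return "no action"


for consumed in (0.4, 0.5, 0.8, 1.2):
    print(f"{consumed:.0%} of budget consumed: {policy_action(consumed)}")
```

Keeping the thresholds in a single table makes the policy easy to review with product stakeholders and to change without touching the logic.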
Final thought
SLIs, SLOs, and SLAs are a system of measurement, incentives, and commitments. Keep the SLIs user-focused, choose SLO targets jointly with product and SRE, and treat the error budget as the operational lever that balances innovation and reliability. Use canaries and client-side signals for coverage, and for novel workloads (like ML inference) add SLO-aware controls to handle cost and tail latency.