SLIs, SLOs, and SLAs — a practical guide for modern services

Reliability promises live at three levels: SLIs (what you measure), SLOs (what you aim for), and SLAs (what you contract). Getting them right means measuring what users actually experience, setting realistic targets that balance velocity and risk, and translating commitments into clear operational policy. Below I walk through the fundamentals and show concrete examples you can use for web apps and inference services.

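Quick definitions (the short, useful versions)

SLI (service level indicator): a quantitative measure of some aspect of the service as users experience it, such as the fraction of requests that succeed or that complete within a latency threshold.

SLO (service level objective): a target value or range for an SLI over a stated window, for example 99.9% of requests succeeding over 30 days.

SLA (service level agreement): a contract with customers that spells out consequences, such as credits or refunds, if the stated service level is not met.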

Why these three, and the central role of error budgets

SLOs create an explicit gap between perfect reliability and what you actually need. That gap is the error budget — the allowable amount of “bad behavior” inside your measurement window. Error budgets are a practical tool to balance feature velocity and stability: if you burn budget quickly, you should slow down changes and fix stability issues; if budget is plentiful, you can move faster. Google’s SRE guidance lays out this approach and gives practical examples of how teams use error budgets to make release decisions. (sre.google)
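To make the error-budget arithmetic concrete, here is a minimal sketch in Python. The names `allowed_downtime_minutes` and `ErrorBudget` are hypothetical helpers invented for this illustration, and the 99.9% target and 30-day window are example values, not recommendations.

```python
from dataclasses import dataclass


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full outage a time-based SLO tolerates in the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one measurement window (illustrative)."""
    slo_target: float   # e.g. 0.999 means 99.9% of requests must be good
    total_events: int   # all requests counted in the window
    bad_events: int     # requests that failed the SLI

    @property
    def allowed_bad_events(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - target) of all events.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def consumed_fraction(self) -> float:
        """Share of the budget already spent; above 1.0 means the SLO is missed."""
        allowed = self.allowed_bad_events
        if allowed <= 0:
            # A 100% target (or an empty window) leaves no budget at all.
            return 0.0 if self.bad_events == 0 else float("inf")
        return self.bad_events / allowed


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(0.999, 30), 1))  # 43.2
    budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
    print(round(budget.consumed_fraction, 2))             # 0.25
```

In this example a 99.9% target over 30 days leaves roughly 43 minutes of total outage, and 250 failed requests out of a million spend about a quarter of the request-based budget.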

Choose SLIs that map to user experience

Pick SLIs that represent the user’s journey, not only internal signals.

Examples:

Guidelines for picking SLIs:

Example SLOs and how to think about targets

SLOs should be:

Simple example targets commonly seen in practice:

Practical measurement notes

Example Prometheus-style SLI (availability ratio):

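# Fraction of successful requests over the last 30 days (PromQL sketch).
# Metric and label names are illustrative. Only 2xx responses count as
# success here, so redirects and client errors lower the ratio.
sum(rate(http_requests_total{job="api",status=~"2.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))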
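The same ratio can be checked offline from per-status request counts, which is handy for validating a dashboard against raw logs. The sketch below is a hypothetical helper, not part of any Prometheus client library. It assumes counts keyed by status-code strings and, like the query above, treats only 2xx responses as good by default.

```python
import re
from typing import Mapping, Optional


def availability(counts_by_status: Mapping[str, int],
                 good_pattern: str = r"2..") -> Optional[float]:
    """Return good/total, or None when the window saw no traffic."""
    total = sum(counts_by_status.values())
    if total == 0:
        return None  # no requests: the SLI is undefined, not 0% or 100%
    matcher = re.compile(good_pattern)
    # fullmatch mirrors PromQL label regexes, which are anchored at both ends.
    good = sum(count for status, count in counts_by_status.items()
               if matcher.fullmatch(status))
    return good / total


if __name__ == "__main__":
    print(availability({"200": 990, "201": 5, "500": 5}))  # 0.995
```

Returning `None` for an empty window is a design choice in this sketch: it forces the caller to decide how quiet periods should count rather than silently reporting a perfect or failing score.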

Special considerations for modern workloads

Put policy around the numbers

An SLO without an error-budget policy is just a target on a dashboard. Formalize:

Google’s error-budget playbook is a useful template for structuring these policies and translating budget burn into concrete operational responses. (sre.google)
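One way to make such a policy executable is to map budget consumption to a named response. The sketch below is illustrative only: the `Action` names and the 0.75 and 1.0 thresholds are assumptions made for this example, not values taken from Google's playbook, and a real policy should set them jointly with product and SRE.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "ship normally"
    PRIORITIZE_RELIABILITY = "prioritize reliability work"
    FREEZE_RELEASES = "freeze non-critical releases"


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the budget is being spent relative to plan.

    A burn rate of 1.0 spends the budget exactly at the end of the window;
    2.0 spends it in half the window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave any budget")
    return error_rate / budget


def policy_action(budget_consumed: float,
                  warn_at: float = 0.75,      # assumed threshold
                  freeze_at: float = 1.0) -> Action:  # assumed threshold
    """Translate the fraction of budget consumed into an operational response."""
    if budget_consumed >= freeze_at:
        return Action.FREEZE_RELEASES
    if budget_consumed >= warn_at:
        return Action.PRIORITIZE_RELIABILITY
    return Action.SHIP_NORMALLY


if __name__ == "__main__":
    print(round(burn_rate(error_rate=0.002, slo_target=0.999), 2))  # 2.0
    print(policy_action(0.8).value)  # prioritize reliability work
```

Encoding the thresholds as parameters keeps the policy reviewable in one place and lets each service tune them without changing the decision logic.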

Final thought

SLIs, SLOs, and SLAs are a system of measurement, incentives, and commitments. Keep the SLIs user-focused, choose SLO targets jointly with product and SRE, and treat the error budget as the operational lever that balances innovation and reliability. Use canaries and client-side signals for coverage, and for novel workloads (like ML inference) add SLO-aware controls to handle cost and tail latency.