Stop Chasing Nines: Make User Experience Your Reliability Metric

When your favorite song skips, you don’t care that your stereo’s “uptime” is 99.99%. You care that the music doesn’t feel right. Software reliability is exactly the same: what matters is whether people can do the thing they came to do—right now—without errors, confusing states, or strange delays.

In the last few months, we’ve had a string of reminders. Reddit had another U.S. outage on September 26, 2025; one public tracker pegged it at roughly 45 minutes. Earlier this summer, Reddit also saw global incidents in June and July, each spiking to tens of thousands (and in one case over a hundred thousand) user reports before recovery. These are perfect examples of how reliability shows up for real people: as login failures, blank screens, and “try again later” loops—not as decimal places on a status page. (downforeveryoneorjustme.com)

In mid-July, a misconfiguration knocked Cloudflare’s 1.1.1.1 public DNS resolver off the internet for 62 minutes. For anyone relying on 1.1.1.1, “the internet” effectively broke. That’s the user experience definition of reliability—can I load anything?—and it doesn’t care whether internal services were “up” in isolation. (blog.cloudflare.com)

And because reliability is a team sport across dependencies, the June 12 incidents showed how a third-party cloud can ripple outward. A Google Cloud outage cascaded into many popular apps going dark; Cloudflare, which uses GCP for some services, saw multiple products degrade that day, even as core CDN and DNS kept flowing. If your app is “up” but your identity provider, router, or DNS isn’t, your users still can’t finish their task. (techcrunch.com)

The point: reliability is a feeling at the edge. So let’s measure it that way.

Why traditional uptime lies to users

“Four nines” availability can hide a thousand papercuts. A 200 response that renders an empty timeline is technically “available,” but experientially broken. Google’s SRE guidance is blunt on this: write SLOs around user-centric actions. Instead of measuring only request success rate or median latency, track whether the user’s critical journey—“search, add to cart, pay”—actually completes within a reasonable time and lands in a sensible state. (sre.google)

That framing also acknowledges reality: every extra “nine” costs more and helps less if it doesn’t move the experience. Worse, the textbook math assumes failures are independent; shared dependencies often fail together, so multiplying 99.9% × 99.9% doesn’t predict user-perceived availability. Model the journeys, not just the parts. (sre.google)
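
To make the math concrete, here’s a toy Python calculation with hypothetical services and made-up numbers; it only illustrates how much room failure correlation leaves around the naive product:

# Two services a checkout journey needs, each individually "99.9% available".
a = b = 0.999

independent = a * b                                  # failures uncorrelated: ~99.80%
always_together = min(a, b)                          # shared dependency, outages overlap: 99.90%
never_together = max(0.0, 1 - ((1 - a) + (1 - b)))   # outages never overlap: 99.80%

# The spread looks tiny, but 0.1% of a 28-day window is roughly 40 minutes of
# user-visible pain, so which regime you are in matters more than the decimals.
print(f"{independent:.4%}  {always_together:.4%}  {never_together:.4%}")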

Measure reliability the way users live it

Here’s a practical set of experience-centric SLIs you can adopt quickly: journey completion rate (did the person finish what they came to do?), time-to-done for the whole journey (tracked at a high percentile such as p95), and a visible-failure rate covering the errors and confusing states users actually see.

Google’s SRE workbook recommends building SLOs around “critical user journeys,” even when they cross multiple components. That’s your north star. (sre.google)
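
One lightweight way to pin those journeys down is to write them out as data that product and engineering can both read. The journeys, step names, and targets below are illustrative assumptions; the checkout thresholds simply mirror the query sketched in the next section:

# Hypothetical critical-journey definitions. Each journey crosses several
# components, which is exactly why per-service uptime can miss it.
CRITICAL_JOURNEYS = {
    "checkout": {
        "steps": ["search", "add_to_cart", "pay"],
        "success_target": 0.985,       # 28-day journey completion rate
        "p95_time_to_done_ms": 8000,   # end to end, as the user experiences it
    },
    "create_post": {
        "steps": ["compose", "submit", "render_in_feed"],
        "success_target": 0.99,        # illustrative target, not a recommendation
        "p95_time_to_done_ms": 3000,
    },
}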

What to instrument (without boiling the ocean)

You don’t need an observability moonshot to start.

A minimal starting query for a checkout-like flow might look like:

journey_success_rate = 
  completed_journeys_28d / started_journeys_28d

p95_time_to_done = percentile(time_to_done_ms, 95, last_28d)

alert if journey_success_rate < 0.985 for 20m
  or p95_time_to_done > 8000ms for 20m
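
If it helps to see that as running code, here’s a minimal Python sketch. The Journey record, field names, and thresholds are assumptions for illustration, and the 20-minute “for” window would live in your alerting system rather than in these functions:

from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Journey:
    started_ms: int
    completed_ms: int | None  # None if the user never finished

def journey_success_rate(journeys: list[Journey]) -> float:
    """Completed journeys divided by started journeys, over whatever window you pass in."""
    started = len(journeys)
    completed = sum(1 for j in journeys if j.completed_ms is not None)
    return completed / started if started else 1.0

def p95_time_to_done_ms(journeys: list[Journey]) -> float:
    """95th percentile of end-to-end duration, for journeys that completed."""
    durations = [j.completed_ms - j.started_ms for j in journeys if j.completed_ms is not None]
    if len(durations) < 2:
        return float(durations[0]) if durations else 0.0
    return quantiles(durations, n=20)[-1]  # last of the 19 cut points is the p95 estimate

def should_page(journeys: list[Journey]) -> bool:
    # Thresholds mirror the pseudocode above; tune them per journey.
    return journey_success_rate(journeys) < 0.985 or p95_time_to_done_ms(journeys) > 8000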

Keep it human: if you can’t explain your SLI to a product manager in two minutes, it’s probably not user-centric enough.

Operate with experience-level error budgets

Once you define experience SLIs, manage them with error budgets the same way you do for backend latency. If your “create post” success rate dips below target, slow rollouts and spend engineering cycles on the regression. That conversation lands better with everyone because it’s grounded in what users felt, not in abstract metrics. Google’s Customer Reliability Engineering workshop material leans into exactly this framing to align teams. (sre.google)
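
Here’s a small sketch of the budget arithmetic, reusing the journey_success_rate idea from above; the 25% freeze line is an illustrative policy choice, not something prescribed by the SRE material:

def error_budget_remaining(success_rate: float, slo_target: float = 0.985) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0 or less means it is blown."""
    allowed_failures = 1.0 - slo_target   # e.g. 1.5% of journeys may fail in the window
    if allowed_failures <= 0:
        return 0.0
    actual_failures = 1.0 - success_rate
    return 1.0 - (actual_failures / allowed_failures)

def freeze_feature_rollouts(success_rate: float, slo_target: float = 0.985) -> bool:
    # One common policy: pause rollouts once most of the budget is spent and put
    # the engineering cycles into the regression instead.
    return error_budget_remaining(success_rate, slo_target) < 0.25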

Experience budgets also guide incident response. In Cloudflare’s June 12 write-up, they describe “kill switches” to keep people moving during upstream failures—degrading nonessential checks so legit users weren’t completely blocked. That’s experience-preserving resilience in action: favor “some functionality now” over “perfect later.” (blog.cloudflare.com)
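
In application code, the shape of that idea is a flag plus a fallback. This is a generic sketch with a hypothetical nonessential check, not a reconstruction of Cloudflare’s implementation:

# Hypothetical kill switch: when an upstream dependency is struggling, skip a
# nonessential check instead of failing the whole journey.
NONESSENTIAL_ENRICHMENT_ENABLED = True  # an operator or health probe can flip this off

def fetch_recommendations(user_id: str) -> list[str]:
    """Nonessential: nice-to-have personalization that calls an upstream service."""
    raise TimeoutError("upstream dependency timed out")  # stand-in for a real call

def render_page(user_id: str) -> dict:
    page = {"user_id": user_id, "core_content": "the thing the user came for"}
    if NONESSENTIAL_ENRICHMENT_ENABLED:
        try:
            page["recommendations"] = fetch_recommendations(user_id)
        except TimeoutError:
            page["recommendations"] = []  # degrade: some functionality now beats perfect later
    return page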

Case study mash-up: dependencies and perception

One more twist: sometimes a massive event is a non-event to users. Last week, Cloudflare said it autonomously mitigated a record 22.2 Tbps DDoS attack that lasted only about 40 seconds. If users didn’t feel it, that’s a win—but it only registers as a win if you’re measuring the experience. (techradar.com)

A starter playbook you can ship this quarter

Pick two or three critical user journeys and write them down with product. Instrument each one end to end: journey success rate and time-to-done, measured where the user experiences it. Set targets you can explain in plain language, attach an error budget, and agree up front what happens when it burns (slower rollouts, regression work first). Decide now which nonessential checks you will degrade when a dependency wobbles. Then review the journey numbers weekly, right next to your backend SLOs.

The quiet success metric

ThousandEyes tracked 302 global outage events the week of September 15–21 alone. That’s the background noise your users live with now, and it’s not slowing down. If you want reliability that feels invisible, you have to measure the feeling. Your logs can say “200 OK” while the human says “nope.” Optimize for the human. (networkworld.com)

You don’t need an orchestra of dashboards to start—just pick the melody that matters: the moment a person completes the thing they came for. Everything else is accompaniment. When you tune your reliability to that melody, the music stops skipping—and users keep listening.