Stop Chasing Nines: Make User Experience Your Reliability Metric
When your favorite song skips, you don’t care that your stereo’s “uptime” is 99.99%. You care that the music doesn’t feel right. Software reliability is exactly the same: what matters is whether people can do the thing they came to do—right now—without errors, confusing states, or strange delays.
In the last few months, we’ve had a string of reminders. Reddit had another U.S. outage on September 26, 2025; one public tracker pegged it at roughly 45 minutes. Earlier this summer, Reddit also saw global incidents in June and July, each spiking to tens of thousands (and in one case over a hundred thousand) user reports before recovery. These are perfect examples of how reliability shows up for real people: as login failures, blank screens, and “try again later” loops—not as decimal places on a status page. (downforeveryoneorjustme.com)
Back in mid-July, a misconfiguration knocked Cloudflare’s 1.1.1.1 public DNS resolver off the internet for 62 minutes. For anyone relying on 1.1.1.1, “the internet” effectively broke. That’s the user experience definition of reliability—can I load anything?—and it doesn’t care whether internal services were “up” in isolation. (blog.cloudflare.com)
And because reliability is a team sport across dependencies, the June 12 incidents showed how a third-party cloud can ripple outward. A Google Cloud outage cascaded into many popular apps going dark; Cloudflare, which uses GCP for some services, saw multiple products degrade that day, even as core CDN and DNS kept flowing. If your app is “up” but your identity provider, router, or DNS isn’t, your users still can’t finish their task. (techcrunch.com)
The point: reliability is a feeling at the edge. So let’s measure it that way.
Why traditional uptime lies to users
“Four nines” availability can hide a thousand papercuts. A 200 response that renders an empty timeline is technically “available,” but experientially broken. Google’s SRE guidance is blunt on this: write SLOs around user-centric actions. Instead of measuring only request success rate or median latency, track whether the user’s critical journey—“search, add to cart, pay”—actually completes within a reasonable time and with sensible states. (sre.google)
That framing also acknowledges reality: every extra “nine” costs more and helps less if it doesn’t move the experience. Worse, shared dependencies often fail together, so multiplying “99.9% x 99.9%” doesn’t predict user-perceived availability. Model the journeys, not just the parts. (sre.google)
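To put numbers on that, here’s a tiny back-of-the-envelope sketch in Python; the component availabilities and journey counts are made up for illustration:

# Naive "multiply the nines" estimate vs. what journey-level measurement reports.
# All numbers below are illustrative, not from any real system.
component_availability = [0.999, 0.999, 0.999]   # e.g. DNS, auth, API, each "three nines"

naive_estimate = 1.0
for a in component_availability:
    naive_estimate *= a                           # assumes components fail independently
print(f"Naive serial estimate: {naive_estimate:.4%}")   # ~99.70%

# Journey-level measurement: count real user attempts that completed.
# When components share a failure domain (same region, same deploy, same provider),
# they tend to fail together, so the measured number can land far from the estimate.
started_journeys = 1_000_000
completed_journeys = 995_800                      # hypothetical RUM count for the window
print(f"Measured journey availability: {completed_journeys / started_journeys:.4%}")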
Measure reliability the way users live it
Here’s a practical set of experience-centric SLIs you can adopt quickly:
- Journey success rate: Percentage of attempts that complete a critical flow (e.g., “open app → view feed” or “create post → see it live”) without retries or error banners.
- Time to task done (TTD): 95th percentile time from first intent to successful completion for that journey.
- Experience availability: Share of users who can complete at least one key action in the last 5 minutes—useful during partial outages.
- Degradation detection: Rate of UI fallbacks (skeletons, offline mode, “try again”) served per user-minute.
- Error state clarity: Share of errors with a human-actionable message and recovery path (not just codes).
- Frustration signals: Rage taps/clicks, back-and-forth navigation, or repeat submissions within a short window.
Google’s SRE workbook recommends building SLOs around “critical user journeys,” even when they cross multiple components. That’s your north star. (sre.google)
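Two of those SLIs, experience availability and frustration signals, are less standard than the rest, so here is a minimal sketch of how they could be computed from raw RUM events. This is plain Python over a hypothetical event shape; the field names, window sizes, and thresholds are all assumptions:

from collections import defaultdict

NOW = 1_700_000_000        # "current" unix time, fixed here for illustration
AVAIL_WINDOW_S = 5 * 60    # experience-availability window
RAGE_WINDOW_S = 3          # taps this close together count as rage taps
KEY_ACTIONS = {"feed_loaded", "post_created"}

# Hypothetical RUM events: (user_id, unix_timestamp, event_name)
events = [
    ("u1", NOW - 40,  "post_created"),
    ("u2", NOW - 200, "feed_loaded"),
    ("u3", NOW - 400, "tap_submit"),
    ("u3", NOW - 399, "tap_submit"),
    ("u3", NOW - 398, "tap_submit"),
]

# Experience availability: share of recently active users who completed a key action
# inside the last 5 minutes.
active_users = {u for u, ts, _ in events if NOW - ts <= AVAIL_WINDOW_S * 4}
served_users = {u for u, ts, name in events
                if name in KEY_ACTIONS and NOW - ts <= AVAIL_WINDOW_S}
experience_availability = len(served_users) / len(active_users) if active_users else 1.0

# Frustration signal: users who fired three or more identical taps within a few seconds.
taps = defaultdict(list)
for u, ts, name in events:
    if name == "tap_submit":
        taps[u].append(ts)
rage_users = {u for u, stamps in taps.items()
              if any(sum(1 for t in stamps if abs(t - anchor) <= RAGE_WINDOW_S) >= 3
                     for anchor in stamps)}

print(f"experience availability: {experience_availability:.1%}, rage users: {len(rage_users)}")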
What to instrument (without boiling the ocean)
You don’t need an observability moonshot to start.
- Real User Monitoring (RUM): Add lightweight beacons to track journey starts, stops, and timing. Capture “experience availability” as “journey completed in last N minutes by user cohort.”
- Synthetic path checks: Headless scripts that run the whole flow every minute from multiple regions: log in, create a post, see it appear. Fail if any step is slow or wrong (see the sketch after this list).
- Dependency surfacing: Tag journey events with the dependencies they touched (DNS, auth, payments). This ties “why” to “what users felt.”
- Cohort slicing: Break down by region, platform, and network—many incidents that look “global” are regional failures in disguise.
- Error UX telemetry: Log the UI error ID the user saw and whether “retry” fixed it—this is the fastest way to find confusing states.
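For the synthetic path check above, a bare-bones sketch might look like the following. It uses Python with the requests library; the endpoints, payloads, and latency budget are placeholders for whatever your real flow uses:

import time
import requests

BASE_URL = "https://example.com"   # placeholder for your app's base URL
STEP_BUDGET_MS = 3000              # assumed per-step latency budget

def step(name, method, path, **kwargs):
    """Run one step of the journey and fail loudly if it is slow or wrong."""
    start = time.monotonic()
    resp = requests.request(method, f"{BASE_URL}{path}", timeout=10, **kwargs)
    elapsed_ms = (time.monotonic() - start) * 1000
    assert resp.ok, f"{name}: HTTP {resp.status_code}"
    assert elapsed_ms < STEP_BUDGET_MS, f"{name}: too slow ({elapsed_ms:.0f} ms)"
    return resp

def run_journey():
    # Log in, create a post, then confirm the post is actually visible in the feed.
    login = step("login", "POST", "/api/login",
                 json={"user": "synthetic-probe", "password": "REDACTED"})
    headers = {"Authorization": f"Bearer {login.json().get('token')}"}
    post = step("create_post", "POST", "/api/posts",
                json={"body": "synthetic check"}, headers=headers)
    feed = step("view_feed", "GET", "/api/feed", headers=headers)
    assert any(p.get("id") == post.json().get("id")
               for p in feed.json().get("posts", [])), "view_feed: new post not visible"

if __name__ == "__main__":
    run_journey()   # schedule this every minute, from each region you care about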
A minimal starting query for a checkout-like flow might look like:
journey_success_rate = completed_journeys_28d / started_journeys_28d
p95_time_to_done = percentile(time_to_done_ms, 95, last_28d)

alert if journey_success_rate < 0.985 for 20m
    or p95_time_to_done > 8000ms for 20m
Keep it human: if you can’t explain your SLI to a product manager in two minutes, it’s probably not user-centric enough.
Operate with experience-level error budgets
Once you define experience SLIs, manage them with error budgets the same way you do for backend latency. If your “create post” success rate dips below target, slow rollouts and spend engineering cycles on the regression. That conversation lands better with everyone because it’s grounded in what users felt, not in abstract metrics. Google’s Customer Reliability Engineering workshop material leans into exactly this framing to align teams. (sre.google)
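A rough sketch of how that decision can be mechanized is below; the SLO target and the traffic counts are assumptions, not recommendations:

# Experience-level error budget for the "create post" journey over a 28-day window.
SLO_TARGET = 0.985   # assumed target: 98.5% of journeys succeed

def budget_burned(started, completed):
    """Fraction of the window's error budget consumed so far."""
    allowed_failures = (1 - SLO_TARGET) * started
    actual_failures = started - completed
    return actual_failures / allowed_failures if allowed_failures else 1.0

burn = budget_burned(started=2_000_000, completed=1_962_000)   # hypothetical counts
if burn >= 1.0:
    print("Budget exhausted: freeze risky rollouts, put engineers on the regression.")
elif burn >= 0.5:
    print(f"{burn:.0%} of budget burned: slow rollouts, dig into top journey failures.")
else:
    print(f"{burn:.0%} of budget burned: ship as usual.")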
Experience budgets also guide incident response. In Cloudflare’s June 12 write-up, they describe “kill switches” to keep people moving during upstream failures—degrading nonessential checks so legit users weren’t completely blocked. That’s experience-preserving resilience in action: favor “some functionality now” over “perfect later.” (blog.cloudflare.com)
Case study mash-up: dependencies and perception
- DNS as the experience choke point: On July 14, Cloudflare’s 1.1.1.1 outage showed how quickly user-perceived reliability can crater even when your app is healthy. If the resolver fails, your experience fails. This argues for journey SLIs that include DNS resolution time/failure rates, and for client-side caching or alternate resolvers during brownouts (see the resolver-fallback sketch after this list). (blog.cloudflare.com)
- Cloud outage, local pain: The June 12 Google Cloud incident cascaded to consumer apps. Even if your dashboards were green, your login or media pipeline might depend on a region that wasn’t. Instrument dependencies inside your journey traces; fail open carefully when it’s safe. (techcrunch.com)
- Social platforms and user trust: Reddit’s repeated 2025 incidents (June, July, and late September) illustrate how outages translate into immediate user frustration and, sometimes, market reactions. Tracking only server-side uptime would miss most of that story; journey SLOs make it visible and actionable. (reuters.com)
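For the resolver-fallback idea in the first bullet, here is one way it could look on the client side. This sketch uses the dnspython package; the resolver list and the two-second cutoff are assumptions:

import dns.exception
import dns.resolver   # pip install dnspython

# Ordered list of resolvers to try; substitute whatever fits your environment.
FALLBACK_RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]

def resolve_with_fallback(hostname, record_type="A"):
    """Try each resolver in turn so a single resolver brownout doesn't stall the journey."""
    last_error = None
    for server in FALLBACK_RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        resolver.lifetime = 2.0   # give up on this resolver after two seconds
        try:
            answer = resolver.resolve(hostname, record_type)
            return [record.to_text() for record in answer]
        except dns.exception.DNSException as exc:
            last_error = exc
    raise RuntimeError(f"all resolvers failed for {hostname}") from last_error

print(resolve_with_fallback("example.com"))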
One more twist: sometimes a massive event is a non-event to users. Last week, Cloudflare said it autonomously mitigated a record 22.2 Tbps DDoS in 40 seconds. If users didn’t feel it, that’s a win—but it only registers as a win if you’re measuring the experience. (techradar.com)
A starter playbook you can ship this quarter
- Pick three user journeys. Example: “open app → see feed,” “search → view item,” “start checkout → pay.”
- Define a simple SLO per journey: success rate over 28 days and p95 time-to-done. Set a target you can actually meet, then tighten later.
- Wire up RUM events and two synthetic checks per region. Keep payloads tiny; protect privacy.
- Create an experience budget dashboard and page on breach. Start with “breach for 20 minutes” to avoid alert fatigue.
- Add two graceful-degradation switches per journey: cached feed, offline queue for posts, delayed verification for low-risk actions (one possible switch is sketched after this list).
- Practice blameless reviews around journeys, not services: “Why couldn’t a user post for 12 minutes?” beats “Service X returned 502s.”
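As one sketch of what a graceful-degradation switch could look like in code, here is a cached-feed fallback; the flag store, cache location, and upstream call are stand-ins for whatever you already run:

import json
from pathlib import Path

# Stand-in flag store: in production this would live in your feature-flag service.
FLAGS = {"feed.serve_cached_on_error": True}
CACHE_PATH = Path("/tmp/feed-cache.json")

def fetch_feed_live(user_id):
    # Placeholder for the real upstream call.
    return [{"id": 1, "body": "hello"}]

def get_feed(user_id):
    """Prefer the live feed; fall back to the last cached copy when the switch is on."""
    try:
        feed = fetch_feed_live(user_id)
        CACHE_PATH.write_text(json.dumps(feed))   # refresh the fallback copy on success
        return {"feed": feed, "degraded": False}
    except Exception:
        if FLAGS["feed.serve_cached_on_error"] and CACHE_PATH.exists():
            return {"feed": json.loads(CACHE_PATH.read_text()), "degraded": True}
        raise   # no safe fallback: surface a clear, human-actionable error instead

print(get_feed("user-123"))

Responses served with degraded set to true are exactly the “UI fallbacks per user-minute” signal from the SLI list earlier, so the switch and the metric reinforce each other.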
The quiet success metric
ThousandEyes tracked 302 global outage events the week of September 15–21 alone. That’s the background noise your users live with now, and it’s not slowing down. If you want reliability that feels invisible, you have to measure the feeling. Your logs can say “200 OK” while the human says “nope.” Optimize for the human. (networkworld.com)
You don’t need an orchestra of dashboards to start—just pick the melody that matters: the moment a person completes the thing they came for. Everything else is accompaniment. When you tune your reliability to that melody, the music stops skipping—and users keep listening.