Wrong SLI metrics mean green dashboards while users churn. This post covers the discipline of picking signals that move with what users actually feel, and spotting the ones that look reliable but lie.

SLI Design: Picking Metrics That Actually Correlate With User Experience

The most expensive SLI mistake is the one where the dashboard stays green while users churn. Pick the wrong indicator — average latency instead of p99, 5xx-rate instead of error-by-customer — and you end up with a reliability program that's correct about things nobody cares about. After a few iterations across our services, this is the discipline we run for picking SLIs that move when user experience moves.

The point of an SLI, restated #

A Service Level Indicator is a metric you measure that's supposed to correlate with whether your users are having a good time. Picking the right one is the hard part. The SLO (Service Level Objective) is the threshold; the SLI is the measurement. Get the SLI wrong and the SLO is meaningless.

The test for a good SLI: when users report problems, the SLI should be moving in the wrong direction. When users are happy, it should be within target. That sounds tautological, but it isn't: most production SLIs we've inherited fail this test.

The five-step audit #

For every customer-facing service, we periodically run this check:

List what users actually care about for that service. For a checkout API: did my payment succeed? Was it fast?
Map each "what users care about" to a measurable metric. Payment success → success rate of POST /checkout. Speed → response time of POST /checkout.
Sanity-check the metric by looking at past incidents. During the last bad day, did this metric move?
Tune the aggregation: averages hide tail latency, and medians hide the slowest 5% of requests. Use p95 or p99 for latency; rate-of-bad-events for errors.
Sanity-check the inverse — when the metric was bad, were users actually unhappy? Or did we just generate noise?

Most services have 1–3 SLIs that survive this audit. Six SLIs per service is a sign of confusion, not rigor.
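To make the aggregation step concrete, here is a minimal sketch of how the same latency sample looks through different aggregations. The sample data and the `percentile` helper (a plain nearest-rank implementation) are assumptions for illustration, not what our metrics pipeline actually runs.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 980 fast requests at 80 ms and 20 slow ones at 8 s.
latencies_ms = [80] * 980 + [8000] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
# mean=238.4 p50=80 p95=80 p99=8000
```

In this sample p95 also reads 80 ms because only 2% of requests are slow, which is one reason to track p99 alongside it.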

Common bad SLIs #

A few that look reasonable and aren't:

Average latency. Averages hide the slow tail, and so does the median: a p50 of 80ms looks great even when p99 is 8 seconds. Users who hit the tail churn. Use p95 or p99.

5xx error rate alone. A service can be 100% 200s while returning wrong data. Pair status-code with content correctness when possible (does the response contain a valid result?).

Uptime from an external pinger. A synthetic check that hits /health every minute can return 200 while a particular user's request fails because of database lock contention specific to their data. Sample actual user-traffic outcomes, not a synthetic probe.

CPU/memory utilization. Resource metrics are interesting but they're causes, not user experience. Users don't care if your CPU is at 50% or 90% as long as their requests succeed and are fast.

Aggregate error rate across all endpoints. Hides which endpoint is broken. If /search is at 0.1% errors and /checkout is at 10%, the aggregate looks fine because /search dominates volume. Use per-endpoint or per-critical-flow SLIs instead.
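The aggregate-versus-per-endpoint problem is easy to see with arithmetic. The sketch below uses the /search and /checkout error rates from the example above; the request volumes and the `EndpointStats` type are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


# Hypothetical one-hour traffic in which /search dominates volume.
traffic = {
    "/search": EndpointStats(requests=1_000_000, errors=1_000),  # 0.1% errors
    "/checkout": EndpointStats(requests=10_000, errors=1_000),   # 10% errors
}

total_requests = sum(s.requests for s in traffic.values())
total_errors = sum(s.errors for s in traffic.values())
aggregate_rate = total_errors / total_requests

per_endpoint = {path: s.error_rate for path, s in traffic.items()}

print(f"aggregate: {aggregate_rate:.3%}")  # aggregate: 0.198%
for path, rate in per_endpoint.items():
    print(f"{path}: {rate:.1%}")
```

With these volumes the aggregate sits around 0.2% while one in ten checkouts is failing, so an alert on the aggregate would likely stay quiet.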

The SLIs that survived in our system #

For our customer-facing API:

Request success rate per critical operation. Not all endpoints; the ones users actually depend on. POST /checkout, POST /login, GET /me. Each gets its own SLI. Aggregating these into one number loses signal.

P95 and P99 latency on the same critical operations. Two numbers, not one. P95 catches "this is slow" trends; P99 catches "the tail is exploding" issues that show up before most users notice.

Background job completion rate. For services that do async work after the response. Did the email actually send? Did the webhook fire? A request that returns 200 doesn't mean the downstream work succeeded.

Per-customer error rate for our top-50 customers. The aggregate looks fine even when one large customer is having a 30% error rate; per-customer surfaces it. Catches issues that would otherwise show up only when the account team gets a complaint email.

That's it. Four SLI types per critical-path service. Sometimes a fifth (for services that have a specific user-facing freshness requirement — e.g. "the dashboard is showing data less than 5 minutes old"). Rarely more.
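As a rough illustration of the per-customer SLI, the sketch below groups request outcomes by customer and compares the result with the aggregate. The `(customer_id, succeeded)` record shape, the customer names, the `min_requests` cutoff, and the 1% flagging threshold are all assumptions, not our production schema.

```python
from collections import defaultdict


def per_customer_error_rates(requests, min_requests=100):
    """Return {customer_id: error_rate} for customers with enough traffic to judge."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for customer_id, succeeded in requests:
        totals[customer_id] += 1
        if not succeeded:
            failures[customer_id] += 1
    return {
        customer: failures[customer] / count
        for customer, count in totals.items()
        if count >= min_requests
    }


# Hypothetical window: "acme" fails 30% of its requests, everyone else is healthy.
sample = (
    [("acme", False)] * 60
    + [("acme", True)] * 140
    + [("globex", True)] * 99_800
)

rates = per_customer_error_rates(sample)
aggregate = sum(1 for _, ok in sample if not ok) / len(sample)
flagged = [customer for customer, rate in rates.items() if rate > 0.01]

print(f"aggregate={aggregate:.2%} flagged={flagged}")
# aggregate=0.06% flagged=['acme']
```

The cutoff on request count keeps a customer with three requests and one failure from looking like a 33% outage.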

How we test correlation #

Two ways:

Backtest against past incidents. For each significant incident in the last 6 months, look at the SLI during the incident window. Did it move? How much? If the SLI was steady during a user-reported outage, the SLI is wrong.

Forward correlation against user feedback. Track customer support ticket volume and user-reported issues. Plot them against the SLI. The correlation should be visible. If support tickets spike during periods when the SLI was green, something's missed.

We've reworked SLIs twice based on this exercise. Both times we found we were tracking technical health (server CPU, queue depth) and missing user-facing outcomes (specific feature failures, latency in a specific region).
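Both checks can be sketched in a few lines. The version below assumes SLI samples arrive as `(timestamp, value)` pairs and that ticket counts are available per day; the function names, the sample values, and the choice of Pearson correlation as the "is the correlation visible" measure are illustrative assumptions rather than a description of our tooling.

```python
from datetime import datetime, timedelta
from math import sqrt


def worst_sli_in_window(samples, start, end):
    """Lowest SLI value observed between start and end (inclusive), or None."""
    in_window = [value for ts, value in samples if start <= ts <= end]
    return min(in_window) if in_window else None


def sli_caught_incident(samples, start, end, target):
    """True if the SLI dropped below its target at any point during the incident."""
    worst = worst_sli_in_window(samples, start, end)
    return worst is not None and worst < target


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical backtest: success-rate samples every 10 minutes around one incident.
t0 = datetime(2024, 1, 1, 12, 0)
values = [0.9995, 0.9993, 0.9710, 0.9640, 0.9991, 0.9996]
samples = [(t0 + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]
incident = (t0 + timedelta(minutes=15), t0 + timedelta(minutes=35))
caught = sli_caught_incident(samples, *incident, target=0.999)

# Hypothetical forward check: daily error rate vs. daily support tickets.
daily_error_rate = [0.001, 0.001, 0.020, 0.030, 0.002, 0.001]
daily_tickets = [4, 5, 31, 40, 7, 3]
r = pearson(daily_error_rate, daily_tickets)

print(f"caught={caught} r={r:.2f}")
```

An SLI for which `caught` is False on a user-reported outage, or whose correlation with ticket volume is near zero, is a candidate for rework.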

SLO target setting #

Once the SLI is right, the target is easier:

Critical user-facing operations: 99.9% success, p99 < 1s (typical for our shape).
Background / async: 99.5% completion within 5 minutes.
Internal tools: 99% — internal users are more tolerant.
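It helps to translate each target into the error budget it implies. The sketch below does that arithmetic; the 30-day window and the one-million-request volume are assumptions, since the targets above don't fix an SLO window.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def allowed_failures(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates out of total_requests."""
    return round((1.0 - slo_target) * total_requests)


# Targets from the list above; window length and request volume are assumed.
targets = {"critical": 0.999, "async": 0.995, "internal": 0.99}

for name, target in targets.items():
    minutes = error_budget_minutes(target)
    failures = allowed_failures(target, 1_000_000)
    print(f"{name}: {minutes:.1f} min / 30 days, {failures} failures per 1M requests")
# critical: 43.2 min / 30 days, 1000 failures per 1M requests
# async: 216.0 min / 30 days, 5000 failures per 1M requests
# internal: 432.0 min / 30 days, 10000 failures per 1M requests
```

For comparison, 99.99% over the same window leaves about 4.3 minutes, which is less time than most on-call engineers need to open a laptop.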

Don't pick targets that are higher than your team's actual capacity to maintain. 99.99% sounds great until you're on call at 2am for the fourth time this month. Pick targets you can actually defend.

What we don't SLI #

Not every service has SLIs. Internal admin tools, batch jobs, exploratory data work — they have basic health monitoring but not SLOs. SLIs are operational discipline; over-applying them to low-stakes systems generates noise.

We focus SLI work on:

The handful of critical customer-facing services
Internal services whose failure would impact a customer-facing service
A few high-volume background processors

Everything else gets monitoring without explicit SLOs.

Burn rate alerting #

The right way to alert on an SLO violation: not "the SLI crossed the threshold for 30 seconds" but "we're burning the error budget faster than we can sustain over the SLO window."

Two-rate alerting:

Fast burn: 5-minute window, alert if we'd consume X% of budget at this rate
Slow burn: 1-hour window, lower threshold

This way you get paged on real incidents (fast burn) AND on slow degradations (slow burn) but not on every transient blip.
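A minimal sketch of the two-rate check follows. Burn rate here is the observed error rate divided by the budgeted error rate (1 minus the SLO target), so 1.0 means the budget lasts exactly the SLO window. The threshold values, the 30-day window, and the function names are hypothetical; the real thresholds depend on how much budget you are willing to lose before a page.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1.0 - slo_target)


def budget_consumed(error_rate, slo_target, window_hours, slo_window_days=30):
    """Fraction of the whole SLO window's budget spent if this rate holds for window_hours."""
    return burn_rate(error_rate, slo_target) * window_hours / (slo_window_days * 24)


# Hypothetical thresholds; the real values are a policy decision.
FAST_BURN_THRESHOLD = 14.0  # applied to the 5-minute window
SLOW_BURN_THRESHOLD = 3.0   # applied to the 1-hour window


def evaluate(fast_error_rate, slow_error_rate, slo_target=0.999):
    """Return the alerts that fire for the given 5-minute and 1-hour error rates."""
    alerts = []
    if burn_rate(fast_error_rate, slo_target) >= FAST_BURN_THRESHOLD:
        alerts.append("fast-burn")
    if burn_rate(slow_error_rate, slo_target) >= SLOW_BURN_THRESHOLD:
        alerts.append("slow-burn")
    return alerts


print(evaluate(fast_error_rate=0.02, slow_error_rate=0.001))     # ['fast-burn']
print(evaluate(fast_error_rate=0.004, slow_error_rate=0.004))    # ['slow-burn']
print(evaluate(fast_error_rate=0.0005, slow_error_rate=0.0005))  # []
```

A 5-minute blip at 0.5% errors against a 99.9% target is a burn rate of about 5, which stays under the assumed fast threshold and does not page unless it persists long enough to show up in the 1-hour window.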

Worth its own post; see "burn rate alerting" below in the reading list.

What I'd tell a team starting #

Pick fewer SLIs, more carefully. 2–4 per service is enough; 8 is mostly noise.
User-facing outcomes, not resource utilization. CPU isn't an SLI.
Backtest against incidents. If the SLI didn't move during the last bad day, it's the wrong SLI.
Per-critical-flow, not aggregate. Aggregates hide bad behavior; rely on them at your own risk.
Targets you can actually maintain. Aspirational 99.99% will burn the team out.

What to read next #

Burn-rate alerting — the SLO discipline that prevents alert fatigue — what to do once you've picked the SLI
Deep dive: SLO-based monitoring for APIs — the larger SLI/SLO pattern this post slots into
Monitoring that actually helps on-call — alerting + dashboards on top of SLIs

SLI design isn't glamorous and it never feels done — you re-tune every time you ship a feature or have an incident. But the cost of getting it wrong is a reliability program that looks healthy while users leave. The audit takes an afternoon per service; doing it once a year is enough to keep the SLIs honest.

About Kiril Urbonas