Wrong SLI metrics mean green dashboards while users churn. The discipline of picking signals that move with what users actually feel, and the ones that look reliable but lie.
The most expensive SLI mistake is the one where the dashboard stays green while users churn. Pick the wrong indicator — average latency instead of p99, 5xx-rate instead of error-by-customer — and you end up with a reliability program that's correct about things nobody cares about. After a few iterations across our services, this is the discipline we run for picking SLIs that move when user experience moves.
A Service Level Indicator (SLI) is a metric you measure because it should correlate with whether your users are having a good time. The SLO (Service Level Objective) is the threshold; the SLI is the measurement. Picking the right SLI is the hard part: get it wrong and the SLO is meaningless.
The test for a good SLI: when users report problems, the SLI should be moving in the wrong direction; when users are happy, it should be within target. That sounds tautological, but it isn't: most production SLIs we've inherited fail this test.
For every customer-facing service, we periodically run this check:
Most services have 1–3 SLIs that survive this audit. Six SLIs per service is a sign of confusion, not rigor.
A few that look reasonable and aren't:
Average latency. Averages hide the slow tail. A median (p50) of 80 ms looks great even when p99 is 8 seconds, and the mean blurs that tail into a number that still looks tolerable. Users who hit the tail churn. Use p95 or p99.
5xx error rate alone. A service can return 200 on 100% of requests while serving wrong data. Pair the status-code check with a content-correctness check when possible (does the response contain a valid result?).
Uptime from an external pinger. A synthetic check that hits /health every minute can return 200 while a particular user's request fails because of database lock contention specific to their data. Sample actual user-traffic outcomes, not a synthetic probe.
CPU/memory utilization. Resource metrics are interesting but they're causes, not user experience. Users don't care if your CPU is at 50% or 90% as long as their requests succeed and are fast.
Aggregate error rate across all endpoints. It hides which endpoint is broken. If /search is at 0.1% errors and /checkout is at 10%, the aggregate looks fine because /search dominates volume. Use per-endpoint or per-critical-flow SLIs instead.
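To make two of these traps concrete, here is a minimal sketch with made-up numbers. The latency sample, the traffic volumes, and the `percentile` helper are all assumptions for illustration; they are not taken from our services.

```python
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100, done in integer arithmetic to avoid float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 requests stuck in the slow tail (ms).
latencies_ms = [80.0] * 98 + [8000.0] * 2

latency_summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p99": percentile(latencies_ms, 99),
}

# Hypothetical traffic: (total requests, failed requests) per endpoint.
traffic = {
    "/search": (990_000, 990),     # 0.1% errors
    "/checkout": (10_000, 1_000),  # 10% errors
}

per_endpoint_error_rate = {
    endpoint: failed / total for endpoint, (total, failed) in traffic.items()
}
aggregate_error_rate = (
    sum(failed for _, failed in traffic.values())
    / sum(total for total, _ in traffic.values())
)

if __name__ == "__main__":
    print(latency_summary)
    print(per_endpoint_error_rate)
    print(f"aggregate error rate: {aggregate_error_rate:.3%}")
```

With these numbers the median stays at 80 ms while p99 sits at 8 seconds, and the aggregate error rate is about 0.2% while /checkout fails one request in ten.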
For our customer-facing API:
Request success rate per critical operation. Not all endpoints; the ones users actually depend on. POST /checkout, POST /login, GET /me. Each gets its own SLI. Aggregating these into one number loses signal.
P95 and P99 latency on the same critical operations. Two numbers, not one. P95 catches "this is slow" trends; P99 catches "the tail is exploding" issues that show up before most users notice.
Background job completion rate. For services that do async work after the response. Did the email actually send? Did the webhook fire? A request that returns 200 doesn't mean the downstream work succeeded.
Per-customer error rate for our top-50 customers. The aggregate looks fine even when one large customer is seeing a 30% error rate; the per-customer view surfaces it. It catches issues that would otherwise show up only when the account team gets a complaint email.
That's it. Four SLI types per critical-path service. Sometimes a fifth (for services that have a specific user-facing freshness requirement — e.g. "the dashboard is showing data less than 5 minutes old"). Rarely more.
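As a rough sketch of the per-operation and per-customer breakdowns, the same request log can be grouped by different keys. The `Request` record, the customer names, and the traffic mix below are hypothetical; a real pipeline would read these from access logs or a metrics store.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Request(NamedTuple):
    customer: str
    operation: str
    ok: bool


def success_rate_by(
    requests: Iterable[Request], key: Callable[[Request], str]
) -> dict[str, float]:
    """Success rate per group, where `key` picks the grouping dimension."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for request in requests:
        group = key(request)
        totals[group] += 1
        successes[group] += int(request.ok)
    return {group: successes[group] / totals[group] for group in totals}


# Hypothetical traffic: one large customer fails 30% of its checkouts.
sample = (
    [Request("globex", "POST /checkout", True)] * 9_900
    + [Request("acme", "POST /checkout", True)] * 70
    + [Request("acme", "POST /checkout", False)] * 30
)

overall = sum(r.ok for r in sample) / len(sample)
by_customer = success_rate_by(sample, key=lambda r: r.customer)
by_operation = success_rate_by(sample, key=lambda r: r.operation)

if __name__ == "__main__":
    print(f"overall: {overall:.1%}")
    print(by_customer)
    print(by_operation)
```

In this sample the overall success rate is 99.7%, which would pass most targets, while the hypothetical customer "acme" is at 70%.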
Two ways:
Backtest against past incidents. For each significant incident in the last 6 months, look at the SLI during the incident window. Did it move? How much? If the SLI was steady during a user-reported outage, the SLI is wrong.
Forward correlation against user feedback. Track customer support ticket volume and user-reported issues. Plot them against the SLI. The correlation should be visible. If support tickets spike during periods when the SLI was green, something's missed.
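The backtest can be sketched as a small function that checks, for each incident window, whether the SLI ever dropped below target. The minute-indexed series, the incident names, and the 0.99 target are assumptions for illustration only.

```python
from typing import NamedTuple


class Incident(NamedTuple):
    name: str
    start: int  # minute index, inclusive
    end: int    # minute index, exclusive


def backtest(
    sli_by_minute: dict[int, float], incidents: list[Incident], target: float
) -> dict[str, bool]:
    """Map each incident to True if the SLI fell below target during its window."""
    results = {}
    for incident in incidents:
        window = [
            sli_by_minute[minute]
            for minute in range(incident.start, incident.end)
            if minute in sli_by_minute
        ]
        # An empty window means no data, which we treat as "the SLI did not move".
        results[incident.name] = bool(window) and min(window) < target
    return results


# Hypothetical two-hour series: healthy except for a dip in minutes 30-39.
sli_by_minute = {minute: 0.999 for minute in range(120)}
for minute in range(30, 40):
    sli_by_minute[minute] = 0.95

incidents = [
    Incident("checkout-outage", 30, 40),  # the SLI saw this one
    Incident("eu-latency", 90, 100),      # users complained, the SLI stayed flat
]

detected = backtest(sli_by_minute, incidents, target=0.99)

if __name__ == "__main__":
    for name, moved in detected.items():
        print(f"{name}: {'SLI moved' if moved else 'SLI was blind'}")
```

Any incident that comes back False is a candidate for a new or reworked SLI.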
We've reworked SLIs twice based on this exercise. Both times we found we were tracking technical health (server CPU, queue depth) and missing user-facing outcomes (specific feature failures, latency in a specific region).
Once the SLI is right, the target is easier:
Don't pick targets higher than your team can actually maintain. 99.99% sounds great until you're on call at 2am for the fourth time this month. Pick targets you can defend.
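It helps to translate a target into the error budget it leaves. The sketch below assumes a 30-day window and a time-based SLO (full outage minutes); both are assumptions, and a request-based SLO would count failed requests instead.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the error budget allows over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        budget = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days leaves about {budget:.1f} minutes")
```

Over 30 days, 99.9% leaves roughly 43 minutes of budget and 99.99% leaves roughly 4 minutes, which is less than most teams need just to acknowledge a page.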
Not every service has SLIs. Internal admin tools, batch jobs, exploratory data work — they have basic health monitoring but not SLOs. SLIs are operational discipline; over-applying them to low-stakes systems generates noise.
We focus SLI work on:
Everything else gets monitoring without explicit SLOs.
The right way to alert on an SLO violation: not "the SLI crossed the threshold for 30 seconds" but "we're burning the error budget faster than we can sustain over the SLO window."
Two-rate alerting:
This way you get paged on real incidents (fast burn) AND on slow degradations (slow burn) but not on every transient blip.
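A minimal sketch of the idea follows. Burn rate is the observed error rate divided by the error rate the SLO allows. The window lengths and the thresholds used here (14.4x over one hour, 6x over six hours) are commonly cited starting points for a 30-day SLO window, not values from this post, and the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1 - slo_target)


def fired_alerts(
    error_rate_1h: float,
    error_rate_6h: float,
    slo_target: float,
    fast_threshold: float = 14.4,  # assumed starting point, tune per service
    slow_threshold: float = 6.0,   # assumed starting point, tune per service
) -> list[str]:
    """Return which burn-rate alerts fire for the given window error rates."""
    alerts = []
    if burn_rate(error_rate_1h, slo_target) >= fast_threshold:
        alerts.append("fast-burn")
    if burn_rate(error_rate_6h, slo_target) >= slow_threshold:
        alerts.append("slow-burn")
    return alerts


if __name__ == "__main__":
    slo = 0.999  # a 0.1% error budget
    print(fired_alerts(0.02, 0.004, slo))    # sharp incident in the last hour
    print(fired_alerts(0.005, 0.007, slo))   # slow degradation over six hours
    print(fired_alerts(0.005, 0.0011, slo))  # transient blip, nothing fires
```

A short spike that recovers quickly moves neither window enough to cross its threshold, which is what keeps transient blips from paging anyone.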
Worth its own post; see "burn rate alerting" below in the reading list.
SLI design isn't glamorous and it never feels done — you re-tune every time you ship a feature or have an incident. But the cost of getting it wrong is a reliability program that looks healthy while users leave. The audit takes an afternoon per service; doing it once a year is enough to keep the SLIs honest.
Explore more articles in this category
Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. The metrics and alerts that make pipelines observable.
Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.
SBOMs and signed attestations sound like checkboxes until you need to answer "did this artifact come from our pipeline?" The minimum viable supply-chain story we run.