Cause-based alerts page you for things that don't matter and miss things that do. How we rebuilt alerting around SLO burn rates — multi-window, multi-burn-rate — and cut pages while catching more real pain.
Our old alerting was a museum of causes: "CPU > 80%", "disk > 90%", "pod restarted", "queue depth > 1000". Each one fired regularly. Almost none of them correlated with users actually having a bad time. We were paged at 3am for a CPU spike that auto-resolved, and we missed a 40-minute checkout degradation because no single resource threshold tripped. Rebuilding around SLO burn rates fixed both directions of that problem.
A cause-based alert (CPU, memory, restarts) assumes you know in advance which resource problem will hurt users. You don't. High CPU might be fine. Low CPU with a deadlocked thread pool might be an outage. The only thing that reliably means "users are in pain" is measuring the thing users experience: request success rate, latency, freshness.
So the first move was to define SLIs at the user boundary:
Then SLOs: 99.9% availability over 30 days. That 0.1% is the error budget — the amount of failure we're allowed before it's a problem.
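To make the budget concrete, here is a minimal sketch of the arithmetic. The function names (`error_budget`, `allowed_failures`) are hypothetical helpers for illustration, not part of any monitoring library, and the one-million-request figure is an invented example volume.

```python
def error_budget(slo_target: float, window_days: float = 30.0) -> dict:
    """Return the error budget implied by an availability SLO."""
    budget_fraction = 1.0 - slo_target          # 99.9% -> 0.001
    window_minutes = window_days * 24 * 60      # 30 days -> 43,200 minutes
    return {
        "budget_fraction": budget_fraction,
        # Minutes of *total* outage that would spend the whole budget.
        "full_outage_minutes": budget_fraction * window_minutes,
    }


def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the budget tolerates for a given volume."""
    return (1.0 - slo_target) * total_requests


budget = error_budget(0.999)
print(round(budget["full_outage_minutes"], 1))    # 43.2
print(round(allowed_failures(0.999, 1_000_000)))  # 1000
```

Read this way, a 99.9% SLO over 30 days tolerates roughly 43 minutes of full outage, or about one failed request in every thousand.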
A burn rate of 1 means you're spending error budget exactly fast enough to exhaust it at the end of the window. A burn rate of 14.4 means you'll exhaust a 30-day budget in roughly 2 days if it continues. Alerting on burn rate means alerting on how fast users are being hurt — which is exactly what should determine urgency.
burn_rate = (error_rate_observed) / (1 - SLO_target)
# For 99.9% SLO, budget = 0.001
# 5% errors → burn_rate = 0.05 / 0.001 = 50 → page NOW
# 0.2% errors → burn_rate = 0.002 / 0.001 = 2 → not urgent, ticket
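The same formula as a small runnable sketch. `burn_rate` and `hours_to_exhaustion` are illustrative names, and the exhaustion estimate assumes the full 30-day budget is still unspent when the burn starts.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days * 24 / rate


print(round(burn_rate(0.05, 0.999), 1))    # 50.0  (5% errors)
print(round(burn_rate(0.002, 0.999), 1))   # 2.0   (0.2% errors)
print(round(hours_to_exhaustion(14.4), 1)) # 50.0  (about 2 days)
```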
A single threshold is either too twitchy (pages on a 30-second blip) or too slow (misses a steady slow burn). The Google SRE pattern checks each burn-rate threshold over two windows, a long one and a short one, and uses a different threshold for each severity:
# Fast burn: big problem, page immediately
- alert: ErrorBudgetFastBurn
  expr: |
    burn_rate_1h > 14.4 and burn_rate_5m > 14.4
  for: 2m
  labels:
    severity: page
# Slow burn: chronic bleed, page during business hours
- alert: ErrorBudgetSlowBurn
  expr: |
    burn_rate_6h > 6 and burn_rate_30m > 6
  labels:
    severity: page
# Ticket: slow bleed, no page
- alert: ErrorBudgetTicket
  expr: |
    burn_rate_24h > 3 and burn_rate_2h > 3
  labels:
    severity: ticket
The two-window AND is the key trick. The long window (1h, 6h, 24h) confirms the problem is sustained, not a transient. The short window (5m, 30m, 2h) confirms it's still happening right now, so the alert auto-resolves quickly once you fix it instead of staying red for an hour after recovery. You get both a low false-positive rate and a fast reset.
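The behaviour of the two-window AND can be sketched on synthetic data. This is a toy model, not the Prometheus evaluation itself: it assumes one error-rate sample per minute with equal traffic in every minute, and the names `window_burn_rate` and `fast_burn_fires` are made up for the example.

```python
from typing import Sequence

SLO_TARGET = 0.999  # assumed SLO, matching the 99.9% used in this article


def window_burn_rate(error_rates: Sequence[float], window_minutes: int,
                     slo_target: float = SLO_TARGET) -> float:
    """Mean burn rate over the most recent `window_minutes` samples."""
    recent = error_rates[-window_minutes:]
    if not recent:
        return 0.0
    mean_error_rate = sum(recent) / len(recent)
    return mean_error_rate / (1.0 - slo_target)


def fast_burn_fires(error_rates: Sequence[float], threshold: float = 14.4) -> bool:
    """Fire only if both the 1h and the 5m window exceed the threshold."""
    return (window_burn_rate(error_rates, 60) > threshold
            and window_burn_rate(error_rates, 5) > threshold)


blip = [0.0] * 59 + [0.05]              # one bad minute at the end
sustained = [0.02] * 60                 # 2% errors for a full hour
recovered = [0.05] * 50 + [0.0] * 10    # bad for 50 minutes, fixed 10 minutes ago

print(fast_burn_fires(blip))       # False: the 1h window barely moves
print(fast_burn_fires(sustained))  # True: both windows are above 14.4
print(fast_burn_fires(recovered))  # False: the 5m window has already cleared
```

The third case is the reset property: the 1h window is still far above the threshold, but the short window lets the alert clear within minutes of the fix.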
The burn-rate/window pairs map to "what fraction of the month's budget would this consume before we notice":
| Severity | Burn rate | Long window | Budget consumed at alert |
|---|---|---|---|
| Page (fast) | 14.4 | 1h | ~2% |
| Page (slow) | 6 | 6h | ~5% |
| Ticket | 3 | 24h | ~10% |
Fast burn pages because a 2% budget hit in an hour means a real outage is underway. The ticket-level slow bleed doesn't justify waking someone — but it does need fixing before it accumulates.
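The "budget consumed at alert" column follows from the burn rate and the long window alone. A quick check, assuming a 30-day SLO period and a burn rate that holds constant for the whole long window (`budget_consumed` is an illustrative helper name):

```python
def budget_consumed(burn_rate: float, long_window_hours: float,
                    slo_period_days: float = 30.0) -> float:
    """Fraction of the period's budget spent by burning at `burn_rate`
    for the entire long window."""
    return burn_rate * long_window_hours / (slo_period_days * 24)


tiers = [("page (fast)", 14.4, 1), ("page (slow)", 6, 6), ("ticket", 3, 24)]
for severity, rate, hours in tiers:
    print(f"{severity}: {budget_consumed(rate, hours):.0%}")
# page (fast): 2%
# page (slow): 5%
# ticket: 10%
```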
We didn't delete every resource alert. A handful survive as leading indicators at ticket severity, not page: "disk will fill in 4 hours at current rate", "cert expires in 7 days". These predict a future symptom and give lead time to act. The rule: cause-based alerts may inform, but only symptom-based (SLO) alerts may page. Everything that wakes a human must trace back to a user having a bad time.
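A leading indicator like "disk will fill in 4 hours at current rate" is a simple extrapolation. The sketch below assumes linear growth between two samples; the function names, the sample values, and the 4-hour lead time are illustrative assumptions rather than a description of any particular alerting rule.

```python
def hours_until_full(used_then: float, used_now: float,
                     elapsed_hours: float, capacity: float) -> float:
    """Linearly extrapolate disk usage; all sizes share one unit (e.g. GB)."""
    growth_per_hour = (used_now - used_then) / elapsed_hours
    if growth_per_hour <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return (capacity - used_now) / growth_per_hour


def needs_ticket(used_then: float, used_now: float, elapsed_hours: float,
                 capacity: float, lead_time_hours: float = 4.0) -> bool:
    """Open a ticket (not a page) when the projected fill time is within the lead time."""
    return hours_until_full(used_then, used_now, elapsed_hours, capacity) <= lead_time_hours


# 70 GB an hour ago, 76 GB now, 100 GB disk: 24 GB left at 6 GB/hour.
print(hours_until_full(70, 76, 1, 100))  # 4.0
print(needs_ticket(70, 76, 1, 100))      # True
```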