Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.
The first version of every alerting setup is the same: "page if error rate > 1% for 5 minutes." It works for a week, then the alerts start firing during minor blips that resolve on their own. The team gets paged, looks at it, says "false alarm," and ignores it next time. By month three, an actual outage doesn't get a response.
Burn-rate alerting fixes this. The question changes from "is the error rate currently elevated?" to "are we consuming our error budget faster than we can sustain?" The second is a much better signal — it's calibrated to real impact on the SLO over time.
If your SLO is "99.9% of requests succeed over a 30-day window," your error budget is 0.1% of requests over 30 days. Concretely: 1 million requests in 30 days × 0.1% = 1,000 allowed errors over the month.
The error budget is a quota. You can burn it however you want — a sustained 0.1% error rate uses it linearly across the month; a 30-second outage spikes it but might still leave the budget intact. What matters is the rate of consumption.
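The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of our tooling: the function names are made up for the example, and the outage calculation assumes traffic is spread uniformly across the window.

```python
SECONDS_PER_DAY = 86_400


def error_budget_fraction(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 -> 0.001."""
    return 1.0 - slo


def allowed_errors(total_requests: int, slo: float) -> float:
    """Number of failed requests the budget tolerates over the window."""
    return total_requests * error_budget_fraction(slo)


def budget_used_by_outage(outage_seconds: float, slo: float, window_days: int = 30) -> float:
    """Share of the budget a full outage consumes (assumes uniform traffic)."""
    budget_seconds = window_days * SECONDS_PER_DAY * error_budget_fraction(slo)
    return outage_seconds / budget_seconds


if __name__ == "__main__":
    # 1 million requests at a 99.9% SLO
    print(round(allowed_errors(1_000_000, 0.999)))
    # Share of a 30-day budget eaten by a 30-second full outage
    print(f"{budget_used_by_outage(30, 0.999):.1%}")
```

Under these assumptions a 30-second full outage uses a little over 1% of the monthly budget, which is why a short spike on its own is not worth a page.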
Burn-rate alerting watches the rate. Specifically: how fast you're consuming your error budget right now, relative to the pace that would use up exactly the whole budget by the end of the window.
For an SLO of 99.9% (allowed error rate 0.1%) over 30 days:
burn_rate = current_error_rate / 0.001
A burn rate of 1 means "we're burning the budget at the rate that exactly consumes it over 30 days" — i.e., we're at the SLO limit, sustainably. Burn rate of 10 means we're consuming the month's budget in 3 days; burn rate of 100 means we'd consume the whole month in 7 hours.
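To make the relationship between burn rate and time-to-exhaustion concrete, here is a small sketch. The helper names are illustrative, and it assumes the burn rate stays constant from the start of a 30-day window.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Current error rate divided by the allowed error rate (1 - SLO)."""
    return error_rate / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    for rate in (1, 10, 100):
        hours = hours_to_exhaustion(rate)
        print(f"burn rate {rate:>3}: {hours:g} hours ({hours / 24:g} days)")
```

A burn rate of 10 works out to 72 hours (3 days) and a burn rate of 100 to 7.2 hours, matching the figures above.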
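The Google SRE workbook's pattern, slightly adapted to our taste, uses two windows: a fast one (burn rate > 14.4 over 1 hour) and a slow one (burn rate > 6 over 6 hours).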
Both windows must agree. The fast window catches acute incidents (full outage right now); the slow window catches gradual degradation (1% errors all day). Together they cover the spectrum without firing on transient blips.
A 5-minute spike to 5% errors would set the 1-hour window's average to about 0.4%, a burn rate of 4. Below the threshold; no alert. If that spike persists for 30 minutes, the 1-hour average hits 2.5%, a burn rate of 25, and the alert fires. (With steady traffic, the average crosses the 1.44% threshold about 17 minutes into the spike.)
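The window averages in this example are easy to reproduce. The sketch below assumes uniform traffic and a zero error rate outside the spike; `window_error_rate` and `minutes_to_cross` are hypothetical helpers written for this illustration.

```python
ERROR_BUDGET = 0.001  # 99.9% SLO


def window_error_rate(spike_minutes: float, spike_error_rate: float,
                      window_minutes: float = 60) -> float:
    """Average error rate over the window, with no errors outside the spike."""
    covered = min(spike_minutes, window_minutes)
    return covered * spike_error_rate / window_minutes


def minutes_to_cross(threshold_rate: float, spike_error_rate: float,
                     window_minutes: float = 60) -> float:
    """Minutes of spike needed before the window average exceeds the threshold."""
    return threshold_rate / spike_error_rate * window_minutes


if __name__ == "__main__":
    for minutes in (5, 30):
        avg = window_error_rate(minutes, 0.05)
        print(f"{minutes:>2} min at 5%: average {avg:.2%}, burn rate {avg / ERROR_BUDGET:.1f}")
    print(f"crosses 1.44% after {minutes_to_cross(0.0144, 0.05):.1f} minutes")
```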
For an HTTP service with SLO "99.9% non-5xx over 30 days":
- alert: APIErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="api"}[1h]))
    ) > 0.001 * 14.4
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "API error budget burning at >14.4× sustained rate (fast)"
    runbook: https://runbooks.internal/api-error-budget-burn
The 0.001 * 14.4 is the error budget (1 - 0.999 = 0.001, i.e. 0.1%) times the burn-rate threshold (14.4). It simplifies to "alert if error rate > 1.44%."
The for: 5m keeps the rule from firing on instantaneous blips. The 1-hour rate inside is already smoothed.
The slow burn rule has the same shape with [6h], * 6, and a longer for: (30m in our setup), which works out to "alert if error rate > 0.6%."
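The two thresholds are not arbitrary. In the SRE workbook's framing, 14.4 is the burn rate that consumes 2% of a 30-day budget in one hour, and 6 is the rate that consumes 5% in six hours. A quick sketch of that arithmetic, with illustrative function names:

```python
def error_rate_threshold(slo: float, burn_rate_threshold: float) -> float:
    """Error rate at which a rule with this burn-rate threshold trips."""
    return (1.0 - slo) * burn_rate_threshold


def budget_share_consumed(burn_rate: float, window_hours: float,
                          slo_period_hours: float = 30 * 24) -> float:
    """Share of the period's budget used if burn_rate holds for window_hours."""
    return burn_rate * window_hours / slo_period_hours


if __name__ == "__main__":
    for name, rate, window in (("fast", 14.4, 1), ("slow", 6, 6)):
        threshold = error_rate_threshold(0.999, rate)
        share = budget_share_consumed(rate, window)
        print(f"{name}: error rate > {threshold:.2%}, "
              f"{share:.0%} of budget per {window}h window")
```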
Static thresholds have two failure modes:
Too sensitive. "Page on >0.5% errors." Fires on every minor blip; team gets numb.
Too insensitive. "Page on >5% errors." Misses the slow degradation that consumes the budget over a day — you discover you've blown the SLO retrospectively.
Burn-rate alerting is calibrated to the SLO automatically. The math is the same regardless of which service you're alerting on; only the SLI changes. We use the same rule template across every service that has an SLO, with the service name and rate threshold as parameters.
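A parameterized template along those lines might look like the sketch below. It is not our actual tooling: `render_burn_rule`, the template text, and the `APIErrorBudgetSlowBurn` alert name are hypothetical, and the metric and label names are simply carried over from the fast-burn rule shown earlier.

```python
# Hypothetical template; doubled braces are literal braces after str.format().
RULE_TEMPLATE = """\
- alert: {alert_name}
  expr: |
    (
      sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[{window}]))
      /
      sum(rate(http_requests_total{{service="{service}"}}[{window}]))
    ) > {budget} * {burn_rate}
  for: {hold}
  labels:
    severity: {severity}
"""


def render_burn_rule(alert_name: str, service: str, window: str,
                     burn_rate: float, hold: str, severity: str,
                     budget: float = 0.001) -> str:
    """Render one burn-rate rule as YAML text for the given service and window."""
    return RULE_TEMPLATE.format(
        alert_name=alert_name, service=service, window=window,
        burn_rate=burn_rate, hold=hold, severity=severity, budget=budget,
    )


if __name__ == "__main__":
    # The slow-burn counterpart of the fast-burn rule above.
    print(render_burn_rule("APIErrorBudgetSlowBurn", "api", "6h", 6, "30m", "ticket"))
```

Keeping the budget as an explicit per-call parameter means each rendered rule carries its own SLO target rather than relying on a shared global.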
The taxonomy that survived after a year of tuning:
Severity: page (wake someone up). Fast burn rate > 14.4 sustained 5 minutes. These are the alerts that mean "something real is happening, get on it now."
Severity: ticket (work item, daytime response). Slow burn rate > 6 sustained 30 minutes. Indicates degradation that needs investigation but isn't a hair-on-fire incident.
Severity: info (Slack notification). Slow burn rate > 3 sustained 1 hour. A heads-up that something is trending. Often resolves on its own; useful to know.
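The three tiers can be written down as data plus a small classifier. This is an illustrative sketch only: it assumes one burn-rate reading per minute for each window (most recent last) and approximates "sustained" as every reading in the hold period exceeding the threshold. The `Tier` class and `classify` function are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    severity: str
    window: str         # "fast" or "slow"
    burn_rate: float    # threshold the burn rate must exceed
    hold_minutes: int   # how long it must stay above the threshold


# Ordered from most to least severe.
TIERS = [
    Tier("page", "fast", 14.4, 5),
    Tier("ticket", "slow", 6.0, 30),
    Tier("info", "slow", 3.0, 60),
]


def tier_fires(tier: Tier, history: Sequence[float]) -> bool:
    """True if the last hold_minutes readings all exceed the tier's threshold."""
    recent = history[-tier.hold_minutes:]
    return len(recent) == tier.hold_minutes and all(b > tier.burn_rate for b in recent)


def classify(fast_history: Sequence[float], slow_history: Sequence[float]) -> Optional[str]:
    """Return the most severe tier that fires, or None if nothing does."""
    histories = {"fast": fast_history, "slow": slow_history}
    for tier in TIERS:
        if tier_fires(tier, histories[tier.window]):
            return tier.severity
    return None


if __name__ == "__main__":
    print(classify([20.0] * 5, [1.0] * 60))
    print(classify([1.0] * 5, [7.0] * 30))
    print(classify([1.0] * 5, [4.0] * 60))
```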
We tuned these by looking at the last quarter of incidents and asking: "what would have woken us up correctly, and what would have woken us for nothing?" Numbers above are what landed.
The same mechanic works for latency SLOs. If your SLO is "99% of requests served in under 1s":
slow_ratio = rate(http_requests_above_1s_total[window]) / rate(http_requests_total[window])
burn_rate = slow_ratio / 0.01
We have these set up for our latency-sensitive services. Same alerting plumbing; just a different SLI.
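The latency version of the calculation is the same division with a different "bad event" count. A minimal sketch, assuming you already have counts of slow and total requests for the window; `latency_burn_rate` is an illustrative name, not an existing function in our codebase.

```python
def latency_burn_rate(slow_requests: int, total_requests: int, slo: float = 0.99) -> float:
    """Burn rate for a latency SLO: share of slow requests over the allowed share."""
    if total_requests == 0:
        return 0.0  # no traffic, nothing burned
    slow_ratio = slow_requests / total_requests
    return slow_ratio / (1.0 - slo)


if __name__ == "__main__":
    # 30 of 1,000 requests slower than 1s, against a 99% SLO
    print(f"{latency_burn_rate(30, 1_000):.2f}")
```

With 3% of requests over the 1s limit and a 1% allowance, the burn rate comes out at about 3.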
A few patterns from the first version of our setup that we changed:
Forgetting the for: clause. Without it, instantaneous transients (a 200ms blip in metrics scraping) could fire alerts. The 5-minute for: is essential.
Using a single very-long window. "Alert if burn rate > 5 over 24 hours." Catches the slow burn but is way too slow on acute incidents: a 30-minute incident at, say, 5% errors raises the 24h average by only about 0.1 percentage points, a burn rate of roughly 1, nowhere near the threshold. Multi-window covers both.
Not pinning the SLO target to each rule. We tried having a global slo_target variable, but different services have different SLO targets, so we eventually set the target per rule.
Alerting on cause metrics instead of SLI metrics. Pages firing on CPU utilization, memory pressure, disk space. These can be useful, but they're causes, not SLO violations. SLI-based pages tell you when users are experiencing problems; cause-based pages tell you when something might fail later. Treat them differently.
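To see why the single long window from the list above reacts so slowly, compare how the same incident looks through different windows. The numbers here are assumptions chosen for illustration: a 30-minute incident at 5% errors, uniform traffic, no errors otherwise, and a 99.9% SLO.

```python
def burn_rate_after_incident(incident_minutes: float, incident_error_rate: float,
                             window_minutes: float, budget: float = 0.001) -> float:
    """Burn rate a window reports right after an incident (no other errors)."""
    covered = min(incident_minutes, window_minutes)
    window_average = covered * incident_error_rate / window_minutes
    return window_average / budget


if __name__ == "__main__":
    for name, minutes in (("1h", 60), ("6h", 360), ("24h", 1440)):
        rate = burn_rate_after_incident(30, 0.05, minutes)
        print(f"{name:>3} window: burn rate {rate:.2f}")
```

The 1-hour window reports a burn rate of 25 and pages, while the 24-hour window sits near 1 and would stay silent under a "burn rate > 5" rule.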
A few habits that go with the alerting:
Every SLO alert has a runbook. The runbook explains how to investigate, what to roll back, who else to wake. Pages without runbooks are noise.
Quarterly SLO review. Are the targets still appropriate? Are we consistently meeting them with margin (target too loose)? Consistently violating them (target too tight)?
Post-incident: did the alert fire correctly? For every paged incident, ask: was the page useful, or was it after the damage was done? Was it a false alarm? Tune thresholds based on real incident data, not theory.
Don't alert on SLOs you're not committed to. If product won't allow you to delay shipping when the budget is blown, the SLO is decorative. Pick honest targets.
The for: clause is mandatory. Otherwise transients drive you crazy.
Burn-rate alerting is one of those discipline shifts that takes a week to set up and pays back forever. Once the page-rate aligns with actual problems, on-call quality improves: you respond to the alert because experience teaches that the alert is real. Static-threshold alerting eventually trains the opposite behavior. The math is simple; the operational difference is large.