Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.

Burn-Rate Alerting — The SLO Discipline That Prevents Alert Fatigue

The first version of every alerting setup is the same: "page if error rate > 1% for 5 minutes." It works for a week, then the alerts start firing during minor blips that resolve on their own. The team gets paged, looks at it, says "false alarm," ignores it next time. By month three, the actual outage doesn't get a response.

Burn-rate alerting fixes this. The question changes from "is the error rate currently elevated?" to "are we consuming our error budget faster than we can sustain?" The second is a much better signal — it's calibrated to real impact on the SLO over time.

The SLO + error budget framing #

If your SLO is "99.9% of requests succeed over a 30-day window," your error budget is 0.1% of requests over 30 days. Concretely: 1 million requests in 30 days × 0.1% = 1,000 allowed errors over the month.
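The arithmetic above can be sketched in a few lines. The function name `allowed_errors` is made up for illustration and is not part of any library:

```python
def allowed_errors(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    error_budget = 1.0 - slo_target  # a 99.9% target leaves a 0.1% budget
    return error_budget * total_requests


# 1 million requests under a 99.9% SLO leaves room for about 1,000 errors.
print(round(allowed_errors(0.999, 1_000_000)))
```

The result is rounded because `1.0 - 0.999` is not exact in floating point.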

The error budget is a quota. You can burn it however you want — a sustained 0.1% error rate uses it linearly across the month; a 30-second outage spikes it but might still leave the budget intact. What matters is the rate of consumption.

Burn-rate alerting watches the rate. Specifically: how fast you're consuming your error budget right now, relative to the steady pace that would use up exactly the whole budget over the window.

The burn rate metric #

For an SLO of 99.9% (allowed error rate 0.1%) over 30 days:

burn_rate = current_error_rate / 0.001

A burn rate of 1 means we're burning the budget at exactly the rate that consumes it over 30 days, i.e., we're at the SLO limit, sustainably. A burn rate of 10 means we'd consume the month's budget in 3 days; a burn rate of 100 means we'd consume it in about 7 hours.
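To make the relationship between burn rate and time-to-exhaustion concrete, here is a minimal sketch. It assumes a 30-day window, a full budget at the start, and a burn rate that stays constant; the helper names are illustrative:

```python
from datetime import timedelta

SLO_WINDOW = timedelta(days=30)


def burn_rate(error_rate: float, slo_target: float = 0.999) -> float:
    """Observed error rate as a multiple of the allowed error rate."""
    return error_rate / (1.0 - slo_target)


def time_to_exhaustion(rate: float, window: timedelta = SLO_WINDOW) -> timedelta:
    """How long a full budget lasts if the given burn rate is held constant."""
    return window / rate


for multiple in (1, 10, 100):
    print(multiple, time_to_exhaustion(multiple))
```

Dividing the window by the burn rate is all there is to it: a burn rate of 10 gives 3 days, and a burn rate of 100 gives 7 hours 12 minutes.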

Multi-window, multi-rate alerts #

The Google SRE workbook's pattern, slightly adapted to our taste:

Fast burn: 1-hour window, alert if burn rate > 14.4. (Consumes 2% of monthly budget in 1 hour.)
Slow burn: 6-hour window, alert if burn rate > 6. (Consumes 5% of monthly budget in 6 hours.)

The two alerts are evaluated independently; either can fire on its own. The fast window catches acute incidents (full outage right now); the slow window catches gradual degradation (1% errors all day). Together they cover the spectrum without firing on transient blips.
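The 14.4 and 6 thresholds are not arbitrary: each follows from the share of the monthly budget you are willing to spend inside the alert window. A small sketch of that derivation, assuming a 30-day SLO window (the function name is illustrative):

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


print(round(burn_rate_threshold(0.02, 1), 2))  # fast burn: 2% of budget in 1 hour
print(round(burn_rate_threshold(0.05, 6), 2))  # slow burn: 5% of budget in 6 hours
```

If you pick a different window length or budget fraction, the same formula gives the matching threshold.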

A 5-minute spike to 5% errors, in an otherwise clean hour, sets the 1-hour window's average to about 0.4%, a burn rate of about 4. That is below the threshold, so no alert. If the spike persists, the 1-hour average crosses 1.44% after roughly 17 minutes and the alert condition becomes true; by 30 minutes the average is 2.5%, a burn rate of 25.
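The spike example can be checked with a rough model. This sketch assumes constant traffic and a flat baseline error rate (zero by default) outside the spike; real traffic is lumpier, so treat the numbers as approximate:

```python
def window_average(spike_rate: float, spike_minutes: float,
                   window_minutes: float = 60.0, baseline: float = 0.0) -> float:
    """Average error rate over the window when a spike sits on a flat baseline."""
    spike_share = min(spike_minutes, window_minutes) / window_minutes
    return spike_rate * spike_share + baseline * (1.0 - spike_share)


ERROR_BUDGET = 0.001        # 99.9% SLO
FAST_BURN_THRESHOLD = 14.4

for minutes in (5, 30):
    burn = window_average(0.05, minutes) / ERROR_BUDGET
    print(f"{minutes}-minute spike: burn rate {burn:.1f}, "
          f"over threshold: {burn > FAST_BURN_THRESHOLD}")
```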

A concrete Prometheus rule #

For an HTTP service with SLO "99.9% non-5xx over 30 days":

- alert: APIErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="api"}[1h]))
    ) > 0.001 * 14.4
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "API error budget burning at >14.4× sustained rate (fast)"
    runbook: https://runbooks.internal/api-error-budget-burn

The 0.001 * 14.4 is the error budget (1 - 0.999 = 0.001, i.e. 0.1%) times the burn-rate threshold (14.4). It simplifies to "alert if error rate > 1.44%."

The for: 5m keeps the rule from firing on instantaneous blips. The 1-hour rate inside is already smoothed.

The slow burn rule has the same shape with [6h] and * 6.

Why this beats static thresholds #

Static thresholds have two failure modes:

Too sensitive. "Page on >0.5% errors." Fires on every minor blip; team gets numb.

Too insensitive. "Page on >5% errors." Misses the slow degradation that consumes the budget over a day — you discover you've blown the SLO retrospectively.

Burn rate alerting is calibrated to the SLO automatically. The math is the same regardless of which service you're alerting on — only the SLI changes. We use the same rule template across every service that has an SLO, with the service name and rate threshold as parameters.
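The template itself isn't reproduced here. As an illustration only, a hypothetical Python helper that builds the PromQL condition from parameters might look like the sketch below. It reuses the `http_requests_total` metric and `service`/`status` labels from the rule above, and the function name `burn_rate_expr` is an assumption:

```python
def burn_rate_expr(service: str, window: str,
                   error_budget: float, threshold: float) -> str:
    """Build the PromQL condition for one burn-rate alert (illustrative only)."""
    selector = f'service="{service}"'
    errors = f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{{selector}}}[{window}]))'
    return f"({errors} / {total}) > {error_budget} * {threshold}"


# Fast-burn and slow-burn conditions for the same service.
print(burn_rate_expr("api", "1h", 0.001, 14.4))
print(burn_rate_expr("api", "6h", 0.001, 6))
```

Passing the error budget explicitly, rather than reading it from a shared constant, keeps each service's SLO target attached to its own rule.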

What we actually alert on #

The taxonomy that survived after a year of tuning:

Severity: page (wake someone up). Fast burn rate > 14.4 sustained 5 minutes. These are the alerts that mean "something real is happening, get on it now."

Severity: ticket (work item, daytime response). Slow burn rate > 6 sustained 30 minutes. Indicates degradation that needs investigation but isn't a hair-on-fire incident.

Severity: info (Slack notification). Slow burn rate > 3 sustained 1 hour. A heads-up that something is trending. Often resolves on its own; useful to know.

We tuned these by looking at the last quarter of incidents and asking: "what would have woken us up correctly, and what would have woken us for nothing?" Numbers above are what landed.
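For readers who prefer the taxonomy as logic, here is a small sketch of the mapping. It assumes the burn rates passed in have already held for their required duration (5 minutes, 30 minutes, 1 hour), which in practice the alerting system enforces; the function is illustrative, not how the alerts are actually wired:

```python
from typing import Optional


def severity(fast_burn: float, slow_burn: float) -> Optional[str]:
    """Map already-sustained burn rates to the severity tiers described above."""
    if fast_burn > 14.4:
        return "page"    # wake someone up
    if slow_burn > 6:
        return "ticket"  # daytime work item
    if slow_burn > 3:
        return "info"    # Slack heads-up
    return None          # within budget, stay quiet


print(severity(fast_burn=20.0, slow_burn=2.0))
print(severity(fast_burn=1.0, slow_burn=4.0))
```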

Beyond error rate: latency burn-rate #

The same mechanic works for latency SLOs. If your SLO is "99% of requests served in under 1s":

Compute the proportion of slow requests: rate(http_requests_above_1s_total[window]) / rate(http_requests_total[window]).
Compute burn rate against the 1% slow-request budget: slow_ratio / 0.01.
Apply the same multi-window thresholds.

We have these set up for our latency-sensitive services. Same alerting plumbing; just a different SLI.
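The latency steps above translate directly into code. This sketch takes raw counts over some window rather than Prometheus rates, and the function name is an assumption made for illustration:

```python
def latency_burn_rate(slow_requests: float, total_requests: float,
                      latency_slo: float = 0.99) -> float:
    """Burn rate for a latency SLO: share of slow requests over the allowed share."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, nothing burned
    slow_ratio = slow_requests / total_requests
    return slow_ratio / (1.0 - latency_slo)


# 30 of 1,000 requests slower than 1s: 3% slow against a 1% budget.
print(round(latency_burn_rate(30, 1000), 2))
```

The returned value is compared against the same 14.4 and 6 thresholds as the error-rate burn rate.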

Mistakes we made #

A few patterns from the first version of our setup that we changed:

Forgetting the for: clause. Without it, instantaneous transients (a 200ms blip in metrics scraping) could fire alerts. The 5-minute for: is essential.

Using a single very-long window. "Alert if burn rate > 5 over 24 hours." Catches the slow burn but is way too slow on acute incidents: a 30-minute partial outage at 10% errors moves the 24h average to only about 0.2%, a burn rate of about 2, so it never alerts. Multi-window covers both.

Not pinning the SLO target to each rule. We tried having a global slo_target variable, but different services have different SLO targets, so we eventually set the target per rule.

Alerting on cause metrics instead of SLI metrics. Pages firing on CPU utilization, memory pressure, disk space. These can be useful, but they're causes, not SLO violations. SLI-based pages tell you when users are experiencing problems; cause-based pages tell you when something might fail later. Treat them differently.

Operational discipline #

A few habits that go with the alerting:

Every SLO alert has a runbook. The runbook explains how to investigate, what to roll back, who else to wake. Pages without runbooks are noise.

Quarterly SLO review. Are the targets still appropriate? Are we consistently meeting them with margin (target too loose)? Consistently violating them (target too tight)?

Post-incident: did the alert fire correctly? For every paged incident, ask: was the page useful, or was it after the damage was done? Was it a false alarm? Tune thresholds based on real incident data, not theory.

Don't alert on SLOs you're not committed to. If product won't allow you to delay shipping when the budget is blown, the SLO is decorative. Pick honest targets.

What I'd tell a team starting #

Multi-window, multi-rate. Don't try to make a single window work.
The for: clause is mandatory. Otherwise transients drive you crazy.
One SLO per critical service, not five. SLO sprawl makes the discipline incoherent.
Connect the burn-rate page to a real on-call runbook. Otherwise the page is theater.

What to read next #

SLI design — picking metrics that actually correlate with user experience — the SLI side that this alerting hangs off
Deep dive: SLO-based monitoring for APIs — the broader SLO pattern
Monitoring that actually helps on-call — alerts + dashboards together
Incident postmortems that actually prevent repeat failures — what to do after the page fires

Burn-rate alerting is one of those discipline shifts that takes a week to set up and pays back forever. Once the page-rate aligns with actual problems, on-call quality improves: you respond to the alert because experience teaches that the alert is real. Static-threshold alerting eventually trains the opposite behavior. The math is simple; the operational difference is large.

About Kiril Urbonas

You might have missed

Prompt Engineering Best Practices: Maximizing LLM Performance

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025

Embedding Models Comparison: Choosing the Right Model for Your Use Case