Cause-based alerts page you for things that don't matter and miss things that do. How we rebuilt alerting around SLO burn rates — multi-window, multi-burn-rate — and cut pages while catching more real pain.

Alert on Symptoms, Not Causes — SLO Burn-Rate Alerting in Practice

Our old alerting was a museum of causes: "CPU > 80%", "disk > 90%", "pod restarted", "queue depth > 1000". Each one fired regularly. Almost none of them correlated with users actually having a bad time. We were paged at 3am for a CPU spike that auto-resolved, and we missed a 40-minute checkout degradation because no single resource threshold tripped. Rebuilding around SLO burn rates fixed both directions of that problem.

Causes lie; symptoms tell the truth

A cause-based alert (CPU, memory, restarts) assumes you know in advance which resource problem will hurt users. You don't. High CPU might be fine. Low CPU with a deadlocked thread pool might be an outage. The only thing that reliably means "users are in pain" is measuring the thing users experience: request success rate, latency, freshness.

So the first move was to define SLIs at the user boundary:

Availability: fraction of requests that return non-5xx
Latency: fraction of requests served under 300ms
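To make the two SLIs concrete, here is a minimal sketch of how they could be computed from raw request records. The `Request` type, its field names, and the sample data are hypothetical, introduced only for illustration; in practice these ratios usually come from metrics counters rather than individual records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field name)
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests):
    """Fraction of requests that returned a non-5xx status."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests served in under threshold_ms."""
    if not requests:
        raise ValueError("need at least one request")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Illustrative sample: one slow request, one 5xx, one 4xx (which still counts as available).
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(404, 95.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Note that a 404 counts as "good" for availability here, because the definition above only excludes 5xx responses.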

Then SLOs: 99.9% availability over 30 days. That 0.1% is the error budget — the amount of failure we're allowed before it's a problem.
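It can help to translate that 0.1% into something tangible. The sketch below converts an SLO target into minutes of total outage and into a count of tolerated failed requests; the function names and the 10 million request figure are illustrative assumptions, not numbers from our system.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of total outage the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def allowed_failures(slo_target, total_requests):
    """Number of failed requests the SLO tolerates out of total_requests."""
    return (1.0 - slo_target) * total_requests


# 99.9% over 30 days leaves about 43 minutes of full outage...
print(round(error_budget_minutes(0.999), 1))       # 43.2
# ...or 10,000 failures out of a hypothetical 10 million requests.
print(round(allowed_failures(0.999, 10_000_000)))  # 10000
```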

Burn rate: how fast you're spending the budget

A burn rate of 1 means you're spending error budget exactly fast enough to exhaust it at the end of the window. A burn rate of 14.4 means you'll exhaust a 30-day budget in roughly 2 days if it continues. Alerting on burn rate means alerting on how fast users are being hurt — which is exactly what should determine urgency.

burn_rate = (error_rate_observed) / (1 - SLO_target)

# For 99.9% SLO, budget = 0.001
# 5% errors → burn_rate = 0.05 / 0.001 = 50  → page NOW
# 0.2% errors → burn_rate = 0.002 / 0.001 = 2  → not urgent, ticket-level at most
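The same formula as a small runnable sketch, with a helper that turns a burn rate into "days until the budget is gone". The function names are illustrative, and the sketch assumes the error rate stays constant for the whole window.

```python
def burn_rate(error_rate, slo_target):
    """Observed error rate divided by the error budget (1 - SLO target)."""
    return error_rate / (1.0 - slo_target)


def days_to_exhaustion(rate, window_days=30):
    """Days until the budget is spent if this burn rate continues unchanged."""
    if rate <= 0:
        return float("inf")  # no errors, the budget is never exhausted
    return window_days / rate


print(round(burn_rate(0.05, 0.999), 1))    # 50.0 (5% errors against a 99.9% SLO)
print(round(burn_rate(0.001, 0.999), 1))   # 1.0  (spending exactly on budget)
print(round(days_to_exhaustion(14.4), 2))  # 2.08 (the "roughly 2 days" figure)
```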

Multi-window, multi-burn-rate: fast and slow

A single threshold is either too twitchy (pages on a 30-second blip) or too slow (misses a steady slow burn). The Google SRE pattern uses several burn-rate thresholds, each checked over two windows, a long one and a short one:

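# Prometheus-style alerting rules. Assumes burn_rate_5m, burn_rate_1h, etc.
# are recording rules defined elsewhere, one per window.

# Fast burn: big problem, page immediately
- alert: ErrorBudgetFastBurn
  expr: |
    burn_rate_1h > 14.4 and burn_rate_5m > 14.4
  for: 2m
  labels:
    severity: page

# Slow burn: chronic bleed, page during business hours
- alert: ErrorBudgetSlowBurn
  expr: |
    burn_rate_6h > 6 and burn_rate_30m > 6
  labels:
    severity: page

# Ticket: slow leak, fix it without waking anyone
- alert: ErrorBudgetTicket
  expr: |
    burn_rate_24h > 3 and burn_rate_2h > 3
  labels:
    severity: ticket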

The two-window AND is the key trick. The long window (1h, 6h, 24h) confirms the problem is sustained rather than a transient. The short window (5m, 30m, 2h) confirms it is still happening right now, so the alert auto-resolves quickly once you fix it instead of staying red for an hour after recovery. You get both a low false-positive rate and a fast reset.
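A small simulation shows why both windows are needed. This is a sketch under simplifying assumptions: error rates are sampled once per minute, traffic is even across minutes (so a plain mean is a fair window average), and the function names are made up for illustration. It models only the fast-burn condition and ignores the `for:` delay.

```python
def window_burn_rate(error_rates, window, slo_target=0.999):
    """Mean burn rate over the last `window` per-minute error-rate samples."""
    recent = error_rates[-window:]
    return (sum(recent) / len(recent)) / (1.0 - slo_target)


def fast_burn_fires(error_rates, threshold=14.4, long_window=60, short_window=5):
    """True only when both the 1h and the 5m burn rates exceed the threshold."""
    return (
        window_burn_rate(error_rates, long_window) > threshold
        and window_burn_rate(error_rates, short_window) > threshold
    )


# One bad minute after a clean hour: the 5m window trips, the 1h window does not.
blip = [0.0] * 59 + [0.10]
# A full hour at 2% errors: both windows see a burn rate of about 20.
sustained = [0.02] * 60
# 50 bad minutes, then 10 clean ones: the 1h window is still hot, the 5m window is clear.
recovered = [0.05] * 50 + [0.0] * 10

print(fast_burn_fires(blip))       # False
print(fast_burn_fires(sustained))  # True
print(fast_burn_fires(recovered))  # False
```

The `recovered` case is the fast reset: with only the long window, that alert would stay red for most of another hour.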

Tuning the thresholds to budget consumption

The burn-rate/window pairs map to "what fraction of the month's budget would this consume before we notice":

Severity	Burn rate	Long window	Budget consumed at alert
Page (fast)	14.4	1h	~2%
Page (slow)	6	6h	~5%
Ticket	3	24h	~10%

Fast burn pages because a 2% budget hit in an hour means a real outage is underway. The ticket-level slow bleed doesn't justify waking someone — but it does need fixing before it accumulates.
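The "budget consumed" column follows directly from burn rate times the long window, divided by the SLO period. A minimal sketch that reproduces the table, assuming the 30-day period used throughout (the function name is illustrative):

```python
def budget_consumed(burn_rate, long_window_hours, slo_period_days=30):
    """Fraction of the period's error budget spent by the time the long window fills."""
    return burn_rate * long_window_hours / (slo_period_days * 24)


tiers = [
    ("Page (fast)", 14.4, 1),
    ("Page (slow)", 6, 6),
    ("Ticket", 3, 24),
]
for name, rate, hours in tiers:
    print(f"{name}: {budget_consumed(rate, hours):.0%}")
# Page (fast): 2%
# Page (slow): 5%
# Ticket: 10%
```

Working backwards is just as useful: pick the budget fraction you are willing to lose before hearing about it, pick a window, and solve for the burn-rate threshold.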

What changed on call

Total pages dropped roughly 70%. The CPU/disk/restart noise was almost all cause-based and almost all not user-impacting.
The pages that did fire correlated with real user pain nearly every time, so people stopped ignoring them.
The checkout-degradation class of incident — no single resource tripped, but success rate quietly dropped — now pages within minutes because the symptom is what's measured.

Keep a few cause-based alerts — as predictors

We didn't delete every resource alert. A handful survive as leading indicators at ticket severity, not page: "disk will fill in 4 hours at current rate", "cert expires in 7 days". These predict a future symptom and give lead time to act. The rule: cause-based alerts may inform, but only symptom-based (SLO) alerts may page. Everything that wakes a human must trace back to a user having a bad time.
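A predictor like "disk will fill in 4 hours at current rate" can be as simple as a linear extrapolation from two usage samples. The sketch below is an illustration, not our production rule; the function name, units, and sample numbers are hypothetical.

```python
def hours_until_full(used_then, used_now, hours_between, capacity):
    """Linear extrapolation of disk usage; returns None if usage is not growing."""
    growth_per_hour = (used_now - used_then) / hours_between
    if growth_per_hour <= 0:
        return None  # flat or shrinking usage, nothing to predict
    return (capacity - used_now) / growth_per_hour


# Hypothetical volume: 700 GB two hours ago, 800 GB now, 1000 GB capacity.
remaining = hours_until_full(used_then=700, used_now=800, hours_between=2, capacity=1000)
print(remaining)  # 4.0

# Leading indicator only: this opens a ticket, it never pages.
if remaining is not None and remaining <= 4:
    print("ticket: disk predicted to fill within 4 hours")
```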

About Kiril Urbonas
