We replaced 47 percentile threshold alerts with 3 SLO burn-rate alerts. The on-call rotation gets paged less and catches more.
For a long time we monitored our APIs with the obvious thing: per-percentile threshold alerts. "If p95 > 1s for 5 minutes, page." It works. Sort of. After a year of running it, we'd accumulated 47 separate latency-related alerts across our service catalog and they were paging us at all the wrong times.
Then we read the "Alerting on SLOs" chapter of Google's Site Reliability Workbook (the one edited by Betsy Beyer and colleagues) and rebuilt the whole thing. We have 3 latency-related alerts now. They fire about a third as often, and each one is a real signal. This post covers the actual implementation, what worked, and where we still struggle.
Two failure modes that this kind of alert hits constantly:
The blip. p95 spikes for 4 minutes during a brief upstream issue. The threshold is "5 minutes." The alert doesn't fire. But the customer experience genuinely was bad for those 4 minutes. We just chose not to page on it.
The slow burn. p95 sits at 980ms (just under the 1s threshold) for 6 hours. Customer experience is steadily degrading. The threshold is never crossed; nothing ever pages. By the time someone notices, half the day's error budget is gone.
Both are misses. The threshold alert is essentially a step function applied to a noisy time series — sensitive to where the step is, blind to magnitude and duration of the breach.
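Both misses can be reproduced with a toy model of a threshold rule. This is an illustrative sketch, not our production alerting: `threshold_alert_fires` is a hypothetical helper that assumes one p95 sample per minute and mimics "p95 > 1s for 5 minutes."

```python
def threshold_alert_fires(p95_per_minute, threshold_s=1.0, for_minutes=5):
    """Return True if p95 exceeds threshold_s for for_minutes consecutive samples.

    Hypothetical helper: assumes one p95 sample per minute.
    """
    consecutive = 0
    for p95 in p95_per_minute:
        # Count the current streak of breaching samples; reset on a healthy one.
        consecutive = consecutive + 1 if p95 > threshold_s else 0
        if consecutive >= for_minutes:
            return True
    return False


# The blip: 4 bad minutes surrounded by healthy ones.
blip = [0.3] * 10 + [2.5] * 4 + [0.3] * 10
# The slow burn: 6 hours parked just under the threshold.
slow_burn = [0.98] * (6 * 60)
# The case the rule was designed for: 5 straight minutes over 1s.
sustained = [0.3] * 10 + [1.2] * 5

print(threshold_alert_fires(blip))       # False
print(threshold_alert_fires(slow_burn))  # False
print(threshold_alert_fires(sustained))  # True
```

The rule only sees "above or below the line, for long enough." How far above and for how long in total never enter the decision.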
An SLO-based alert measures how fast you're burning your error budget. The error budget is the fraction of requests you're allowed to fail per period: 1 minus the SLO target.
Our checkout API has an SLO: 99.5% of requests should complete in <500ms over a 30-day window. That's a budget of 0.5% slow requests per 30 days.
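If you spend that budget evenly, you can sustain a 0.5% slow-request rate for the whole window. At that steady pace, each hour uses 1/720 of the budget (about 0.14%), since a 30-day window has 720 hours. If you're burning faster than that, you're going to run out of budget before the window resets.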
The alert fires when burn-rate is high enough that you'll exhaust your budget faster than the window allows.
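To make the budget concrete, here is the arithmetic as a small sketch. The SLO target and window come from the checkout example; the traffic volume (`requests_in_window`) is a made-up number used only for illustration, and the per-hour share assumes traffic is roughly uniform.

```python
SLO_TARGET = 0.995   # 99.5% of requests under 500ms
WINDOW_DAYS = 30

error_budget = 1 - SLO_TARGET       # fraction of requests allowed to be slow
window_hours = WINDOW_DAYS * 24     # 720 hours in the window

# Hypothetical traffic volume, only to make the budget tangible.
requests_in_window = 10_000_000
allowed_slow_requests = round(requests_in_window * error_budget)

# At a burn rate of exactly 1x, each hour uses an equal share of the budget
# (assumes roughly uniform traffic).
budget_share_per_hour = 1 / window_hours

print(f"error budget: {error_budget:.3%}")
print(f"allowed slow requests: {allowed_slow_requests}")
print(f"budget used per hour at 1x: {budget_share_per_hour:.3%}")
```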
We use Prometheus + Alertmanager. The two foundational queries:
# Fraction of non-5xx requests served in under 500ms over 5m
# (numerator: requests in the le="0.5" bucket, denominator: all requests)
sum(rate(http_request_duration_seconds_bucket{
  service="checkout", code!~"5..", le="0.5"
}[5m]))
/
sum(rate(http_request_duration_seconds_count{
  service="checkout", code!~"5.."
}[5m]))
Wait — that's not quite right. We actually want the slow rate, which is 1 - fast_rate. Let me write it the way we actually have it:
1 - (
  sum(rate(http_request_duration_seconds_bucket{
    service="checkout", code!~"5..", le="0.5"
  }[5m]))
  /
  sum(rate(http_request_duration_seconds_count{
    service="checkout", code!~"5.."
  }[5m]))
)
That's the fraction of requests that are slow over the last 5 minutes. If that number is, say, 0.05, then 5% of recent requests violated the SLO. Compared to our 0.5% budget, we're burning 10× faster than allowed.
The factor actual_slow_rate / SLO_slow_rate_target is the burn rate.
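The same ratio as a tiny function, as a sketch. The name `burn_rate` and its default target are illustrative choices, not something from our rule files; the 0.995 default matches the checkout SLO.

```python
def burn_rate(slow_fraction, slo_target=0.995):
    """Observed slow fraction divided by the allowed slow fraction (1 - SLO target)."""
    allowed = 1 - slo_target
    return slow_fraction / allowed


print(round(burn_rate(0.05), 2))    # 10.0  (5% slow against a 0.5% budget)
print(round(burn_rate(0.005), 2))   # 1.0   (exactly on budget)
print(round(burn_rate(0.072), 2))   # 14.4  (7.2% slow)
```

A burn rate of 1 means the budget lasts exactly as long as the window. Anything above 1 means it runs out early.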
A single burn-rate alert is too noisy on short windows and too slow on long windows. The Google approach pairs two windows:
- alert: CheckoutBurnRateFast
  expr: |
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.5"}[1h]))
        / sum(rate(http_request_duration_seconds_count{service="checkout"}[1h]))
      )
    ) / 0.005 > 14.4
    and
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.5"}[5m]))
        / sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
      )
    ) / 0.005 > 14.4
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Checkout API burning SLO budget at >14.4× rate (1h sustained, 5m current)"
    runbook: https://wiki.internal/oncall/checkout-slo-burn
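Two clauses joined by and: the 1-hour clause shows the burn has been sustained, and the 5-minute clause shows it is still happening right now.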
The 14.4× number isn't magic. A 30-day window has 720 hours, so burning at 14.4× for one hour uses 14.4 / 720 = 2% of the budget. We page if any single hour would consume 2% of the budget.
Pairing the windows means: a 30-second blip can't trigger a page (the 1-hour window won't have shifted enough). And an issue that started 8 hours ago and has since recovered won't trigger either (the 5-minute window will be back to normal).
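The pairing logic can be sketched in a few lines. The helper names and scenario numbers below are hypothetical, and the blip scenario assumes uniform traffic so that 30 bad seconds translate directly into a slow fraction for each window.

```python
def burn_rate(slow_fraction, slo_target=0.995):
    """Observed slow fraction divided by the allowed slow fraction."""
    return slow_fraction / (1 - slo_target)


def fast_burn_pages(slow_1h, slow_5m, threshold=14.4):
    """Page only if both windows exceed the threshold (mirrors the `and`)."""
    return burn_rate(slow_1h) > threshold and burn_rate(slow_5m) > threshold


# Scenario 1: 30 seconds of 100% slow requests.
# The 5m window burns at 20x, but the 1h window only at about 1.67x.
blip_1h = 30 / 3600
blip_5m = 30 / 300

# Scenario 2: an incident that just recovered (made-up numbers).
# The long window is still elevated, the short window is healthy again.
recovered_1h, recovered_5m = 0.10, 0.001

# Scenario 3: an ongoing incident, 10% slow in both windows (20x burn).
ongoing_1h, ongoing_5m = 0.10, 0.10

print(fast_burn_pages(blip_1h, blip_5m))            # False
print(fast_burn_pages(recovered_1h, recovered_5m))  # False
print(fast_burn_pages(ongoing_1h, ongoing_5m))      # True
```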
The fast burn catches "something is on fire." The slow burn catches "something has been quietly degraded for hours."
- alert: CheckoutBurnRateSlow
  expr: |
    (... same query, [6h] window) / 0.005 > 6
    and
    (... same query, [30m] window) / 0.005 > 6
  for: 15m
  labels: { severity: page }
  annotations:
    summary: "Checkout API has burned SLO budget at >6× rate for 6h"
Burning at 6× for 6 hours uses 6 × 6 / 720 = 5% of the 30-day budget, which is meaningful even if no single moment is dramatic.
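Both thresholds fall out of the same formula: budget consumed equals burn rate times hours, divided by the hours in the SLO window. A minimal sketch, with illustrative function names:

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def budget_consumed(burn_rate, hours, window_hours=WINDOW_HOURS):
    """Fraction of the window's error budget used by burning at burn_rate for hours."""
    return burn_rate * hours / window_hours


def hours_to_exhaustion(burn_rate, window_hours=WINDOW_HOURS):
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return window_hours / burn_rate


print(f"{budget_consumed(14.4, 1):.1%}")     # 2.0%
print(f"{budget_consumed(6, 6):.1%}")        # 5.0%
print(round(hours_to_exhaustion(14.4), 1))   # 50.0
print(round(hours_to_exhaustion(6), 1))      # 120.0
```

Read the other way around: a sustained 14.4× burn empties the monthly budget in about two days, and a sustained 6× burn empties it in five.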
Three latency-related pages now exist for any given API:
We deleted 44 alerts. Most were per-endpoint p95 thresholds that were essentially noise. A few were error-rate thresholds that the availability burn-rate alert subsumes more cleanly.
The slow-burn alert catches things humans don't. Twice in the first month, our slow-burn alert fired for a degradation that nobody on the team had noticed. Both were real customer issues that had been slowly worsening for hours. Threshold alerts wouldn't have fired because no single moment crossed a hard line.
The fast-burn alert is quieter than threshold alerts. We expected more noise; we got less. A blip that was 100% slow for 30 seconds doesn't trigger because the 1-hour window doesn't move enough. The same blip with a threshold alert would have woken us up.
Tuning the burn rate per service is necessary. 14.4× / 6× are reasonable defaults, but a service with low traffic has more variance and needs a higher threshold to avoid false positives. We tuned per-service after the first month of data; 4 of our 14 services now use slightly different multipliers.
Defining the SLO target in the first place. "99.5% within 500ms" is a number we picked from product input plus historical baseline. Six months in we've revisited it twice. The number isn't sacred — it's a forcing function for the conversation about what's acceptable.
Communicating burn rate to non-engineers. "We're at 14.4×" doesn't mean anything to product or business stakeholders. We translate it: "We'll consume 2% of our monthly budget if this keeps up for an hour." That phrasing landed.
Getting metrics right at low volume. Some of our internal APIs see 50 requests/hour. At that volume a single slow request swings the slow fraction dramatically, so burn-rate math becomes statistically noisy. For those we kept threshold alerts; we only use SLO-based monitoring on services with at least ~100 req/min.
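A quick way to see the low-volume problem is to ask what a single slow request does to the burn rate. This is a sketch with a hypothetical helper, using the two traffic levels mentioned above and the 99.5% target:

```python
def burn_rate_from_counts(slow, total, slo_target=0.995):
    """Burn rate computed from raw request counts in one alert window."""
    return (slow / total) / (1 - slo_target)


# 50 requests/hour is roughly 4 requests in a 5-minute window.
requests_in_5m_low = round(50 * 5 / 60)   # 4
# 100 requests/minute is 500 requests in a 5-minute window.
requests_in_5m_high = 100 * 5             # 500

# Effect of exactly one slow request in each case.
low = burn_rate_from_counts(1, requests_in_5m_low)
high = burn_rate_from_counts(1, requests_in_5m_high)

print(round(low, 1))    # 50.0  (one slow request looks like a 50x burn)
print(round(high, 1))   # 0.4   (one slow request is well within budget)
```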
You don't put SLO burn-rate alerts on a service overnight. The ramp:
Total: about 5 weeks for the first service. We did that one carefully, then templated the rest; subsequent services took about 3 days each.
Start with one service. Specifically: your most-traffic-receiving service that has a clear customer-impact metric. Get burn-rate alerting working there. Live with it for a month. Then template.
The trap is to try to convert all services at once. The SLO-burn approach has a learning curve — for the engineers writing the queries, for the on-call team responding to a different alert shape, for the people reviewing post-incident. One service at a time gives the team time to adjust.
The reward, if you stick with it: fewer pages, more meaningful pages, and a number — burn rate — that the team can talk about precisely instead of arguing about thresholds.