We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.
For about a year, our on-call channel was a graveyard of red squiggles nobody acted on. Roughly 200 alerts a week. Maybe ten of those mapped to anything you'd call an incident. The rest were noise: flapping thresholds, dependency hiccups, alarms that fired during routine deploys, and a couple of legacy probes nobody could remember why we kept.
Last quarter we tore the whole thing down and rebuilt it. We're now at around 15 actionable alerts a week. People answer them.
Before this rework, we wrote our alerts the way most teams do: someone notices a problem, opens a PR adding a Prometheus rule, gets it merged, moves on. The result was a pile of rules with no consistent shape. Some pinged on absolute values, some on rates, some on percentage of a window. Annotations were inconsistent or missing. Some had a runbook link; most didn't.
We landed on three rules for every alert we now keep:
Roughly 70% of our existing alerts failed one of those three on first review. We deleted them outright instead of trying to fix them.
The cull itself took a week. We exported every alert rule, opened a spreadsheet, and went through each one with the on-call team. For each rule we asked: when did this last fire? When it fired, did anyone do anything?
For ~140 rules, the answer to at least one of those questions was "I don't know" or "no." Those went straight to deletion. Another 30 had fired exactly once, six months ago, in a way the codebase had since rendered impossible. Those went too.
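The two review questions can be written down as a small decision function. This is a minimal sketch, not the spreadsheet we used: `RuleReview`, `verdict`, the extra "review" bucket, and the 180-day cutoff are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RuleReview:
    """One spreadsheet row (hypothetical shape)."""
    name: str
    last_fired: Optional[date]   # None means "I don't know"
    acted_on: Optional[bool]     # None means "I don't know"


def verdict(review: RuleReview, today: date, stale_after_days: int = 180) -> str:
    """Return 'delete', 'review', or 'keep' for a single alert rule."""
    # "I don't know" to either question is treated the same as "no".
    if review.last_fired is None or review.acted_on is None:
        return "delete"
    if not review.acted_on:
        return "delete"
    # Assumption: a rule that last fired long ago gets a human second look
    # instead of an automatic decision.
    if today - review.last_fired > timedelta(days=stale_after_days):
        return "review"
    return "keep"
```

Rules that land in "review" still need a person to decide whether the failure mode can happen at all, which is how the 30 obsolete rules were caught.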
What we kept was small. About 40 rules covering five customer-facing endpoints, our queue depth, payment processing, two databases, and the auth service. Each one mapped to a section in our incident response handbook.
Almost everything we kept fits one of three patterns.
```yaml
- alert: CheckoutLatencyP95High
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{
        service="checkout", route="/api/orders"
      }[5m]))
    ) > 1.0
  for: 5m
  labels: { severity: page, team: payments }
  annotations:
    summary: Checkout p95 > 1s for 5m
    runbook: https://wiki.internal/oncall/checkout-latency
    impact: Users abandon orders at p95 > 1s based on funnel data
```
The impact annotation matters. When someone gets paged at 3am, the first question is "do I care right now?" Spelling out the customer cost in the alert removes the guesswork.
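One way to keep annotations from drifting again is a small lint over the rule definitions. The sketch below assumes each rule has already been loaded into a plain dict (for example from the rules YAML); `REQUIRED_ANNOTATIONS` and `missing_annotations` are hypothetical names, and the required set simply mirrors the three annotations on the checkout rule.

```python
from typing import List

# Assumption: every kept alert must carry these three annotations.
REQUIRED_ANNOTATIONS = ("summary", "runbook", "impact")


def missing_annotations(rule: dict) -> List[str]:
    """Return the required annotation keys that are absent or blank."""
    annotations = rule.get("annotations") or {}
    missing = []
    for key in REQUIRED_ANNOTATIONS:
        value = annotations.get(key)
        # Only a non-empty string counts as present.
        if not (isinstance(value, str) and value.strip()):
            missing.append(key)
    return missing


checkout_rule = {
    "alert": "CheckoutLatencyP95High",
    "annotations": {
        "summary": "Checkout p95 > 1s for 5m",
        "runbook": "https://wiki.internal/oncall/checkout-latency",
        "impact": "Users abandon orders at p95 > 1s based on funnel data",
    },
}
```

Run in CI, a check like this would fail the PR that adds a rule without a runbook or an impact statement, instead of leaving it for the next cull.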
For services with SLOs we use multi-window burn-rate alerts. The classic Google SRE workbook approach. We tuned the windows to our incident reality: a 1-hour fast burn paired with a 6-hour slow burn. Anything faster gives too many false pages on benign blips; anything slower lets the budget drain before someone notices.
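The arithmetic behind burn rate is short enough to sketch. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The thresholds below (14.4 for the 1-hour window, 6 for the 6-hour window) are the SRE workbook's values for a 30-day SLO and are an assumption here, as is treating the two windows as independent conditions; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_hours: float = 720.0) -> float:
    """Fraction of the period's error budget spent at `rate` for `window_hours`."""
    return rate * window_hours / period_hours


def should_alert(err_1h: float, err_6h: float, slo_target: float,
                 fast_threshold: float = 14.4, slow_threshold: float = 6.0) -> bool:
    """Fire if either the 1-hour fast burn or the 6-hour slow burn is exceeded."""
    fast = burn_rate(err_1h, slo_target) >= fast_threshold
    slow = burn_rate(err_6h, slo_target) >= slow_threshold
    return fast or slow
```

With these thresholds, a fast-burn page means roughly 2% of a 30-day budget went in one hour, and a slow-burn page means roughly 5% went in six hours.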
We also keep a handful of alerts on resources that fail in ways customers can feel before latency breaks: queue depth (jobs piling up), connection pool exhaustion (timeouts about to start), disk fill on stateful pods (writes about to fail). These are early warnings. They don't page; they go to a daytime channel where someone picks them up during business hours.
Every alert links to one canonical dashboard. Not three, not "the closest match" — exactly one, for that exact alert. The dashboard has the same name as the alert when it's clear, or it gets a label.
The dashboard layout we use is unimaginative on purpose:
Anyone glancing at the dashboard during a page should be able to answer "is this getting worse, getting better, or stable, and is anything else weird at the same time" within ~15 seconds. If they can't, the dashboard is wrong.
We delete dashboards that aren't linked from a runbook or alert. Last quarter we deleted 38 dashboards that nobody had viewed in 90 days. Nobody noticed.
Runbooks rot faster than code. We've started treating them like part of the production system: every alert's runbook link is checked monthly by a small script that hits the URL, follows it to the latest version, and flags anything that hasn't been edited in 90+ days.
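The staleness half of that check is simple to sketch. The version below assumes the last-edited date for each runbook has already been fetched (how you get it depends on your wiki, so that part is left out); `stale_runbooks` is a hypothetical name, and every URL in the usage note other than the checkout one is made up for illustration.

```python
from datetime import date
from typing import Dict, List


def stale_runbooks(last_edited: Dict[str, date], today: date,
                   max_age_days: int = 90) -> List[str]:
    """Return runbook URLs whose last edit is max_age_days or more in the past."""
    return sorted(
        url
        for url, edited in last_edited.items()
        if (today - edited).days >= max_age_days
    )
```

For example, calling it with `{"https://wiki.internal/oncall/checkout-latency": date(2026, 2, 1), "https://wiki.internal/oncall/queue-depth": date(2025, 11, 1)}` and `today=date(2026, 3, 1)` flags only the second (hypothetical) URL.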
Each runbook follows a fixed shape:
```markdown
## Symptom
What you'll see when this fires.

## First 5 minutes
- Check X dashboard: <link>
- If [condition], jump to Mitigation A
- If [condition], jump to Mitigation B
- If neither, page the [team] team

## Mitigation A
Specific commands or links. Steps that have been tested.

## Mitigation B
...

## Common causes (last 12 months)
- 2026-01-15: cache eviction config push, fixed by reverting <PR link>
- 2025-11-03: upstream provider outage, fixed by failover to <region>
```
The "Common causes" section is the most valuable part. New on-call engineers read those entries and learn the system in 30 minutes flat.
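A fixed shape is easy to check mechanically. The sketch below assumes runbooks are stored as Markdown with `##` section headings, as in the template; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and matching on heading prefixes is a design choice so that "Common causes (last 12 months)" and "Mitigation A" both count.

```python
import re
from typing import List

# Assumption: a runbook needs at least these sections, matched by heading prefix.
REQUIRED_SECTIONS = ("Symptom", "First 5 minutes", "Mitigation", "Common causes")


def missing_sections(markdown_text: str) -> List[str]:
    """Return required section names with no matching level-2 heading."""
    headings = [
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", markdown_text, re.MULTILINE)
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]
```

An empty result means the runbook has every section; anything else names what a new on-call engineer would find missing at 3am.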
Before: average week had 28 alerts, one person was on call, most weren't acted on, two were real incidents.
After: average week has 4 alerts, all are acted on, two are real incidents.
The on-call burden actually went down, even though the same people are responding. Because they trust the alerts now, they don't have to triage in their head before opening the laptop. The alert says what it says; the runbook says what to do.
We also pay attention to who's on call when we deploy risky changes. The release engineer is encouraged to stay nearby for the first hour. We don't auto-roll-back; we have alerts that page the human who owns the change. Most of our recent fast recoveries came from this loop.
We don't have a clean answer for alerting on customer-cohort-specific issues. If 2% of users hit a problem, our overall metrics don't move enough to trip thresholds, but those customers are usually the loudest. We're experimenting with synthetic users from a couple of geographies, but that's a different post.
We also haven't solved deploy-time noise completely. We silence all non-page alerts during a deploy window, which catches most of the false positives, but occasionally a real degradation gets masked. The trade-off is intentional; we'd rather miss a 5-minute window of signal than train the team to ignore alerts.
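The silencing rule itself fits in a few lines. This is a sketch of the decision only, assuming alerts carry a `severity` label with `page` as the paging value (as in the checkout rule) and that the deploy window is known as a start and end time; `should_notify` is a hypothetical name, and in practice the silence would live in the alert router, not in application code.

```python
from datetime import datetime
from typing import Optional, Tuple


def should_notify(severity: str, fired_at: datetime,
                  deploy_window: Optional[Tuple[datetime, datetime]]) -> bool:
    """Pages always go through; everything else is muted during a deploy window."""
    if severity == "page":
        return True
    if deploy_window is None:
        return True
    start, end = deploy_window
    # Non-page alerts are suppressed only while the deploy window is open.
    return not (start <= fired_at <= end)
```

The masked-degradation risk is visible in the last line: a real non-page signal that fires inside the window is dropped, which is the trade-off described above.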
If I joined a new team tomorrow, I'd start by running this query against their alerts: how many fired in the last 90 days, how many resulted in action, how many had no runbook? The answer would tell me whether their monitoring helps or just decorates.
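That audit can be sketched as a small summary over per-rule counts. The data shape is an assumption: `AlertStats` and `audit` are hypothetical names, and "actioned" means whatever your team records as a human doing something in response.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class AlertStats:
    """Per-rule counts for the last 90 days (hypothetical shape)."""
    name: str
    fired: int
    actioned: int
    has_runbook: bool


def audit(alerts: List[AlertStats]) -> Dict[str, Union[int, float]]:
    """Summarize how often alerts fire, lead to action, and lack a runbook."""
    firings = sum(a.fired for a in alerts)
    actioned = sum(a.actioned for a in alerts)
    return {
        "rules": len(alerts),
        "rules_that_fired": sum(1 for a in alerts if a.fired > 0),
        "firings": firings,
        "actioned": actioned,
        # Share of firings that led to action; 0.0 when nothing fired.
        "action_rate": actioned / firings if firings else 0.0,
        "rules_without_runbook": sum(1 for a in alerts if not a.has_runbook),
    }
```

A low `action_rate` next to a high `rules_without_runbook` count is the pattern that suggests the monitoring is decoration.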
Most monitoring rebuilds aren't a tooling problem. They're a culling problem. Delete more than you think.