We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.

Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks

For about a year, our on-call channel was a graveyard of red squiggles nobody acted on. Roughly 200 alerts a week. Maybe ten of those mapped to anything you'd call an incident. The rest were noise: flapping thresholds, dependency hiccups, alarms that fired during routine deploys, and a couple of legacy probes nobody could remember why we kept.

Last quarter we tore the whole thing down and rebuilt it. We're now at around 15 actionable alerts a week. People answer them.

What "actionable" actually meant #

Before this rework, we wrote our alerts the way most teams do: someone notices a problem, opens a PR adding a Prometheus rule, gets it merged, moves on. The result was a pile of rules with no consistent shape. Some pinged on absolute values, some on rates, some on percentage of a window. Annotations were inconsistent or missing. Some had a runbook link; most didn't.

We landed on three rules for every alert we now keep:

It must reflect something a customer or system would notice. Not "CPU > 80% for 5 minutes" but "p95 latency on the checkout path > 1s for 5 minutes."
There must be at least one human action that fixes it. If the only response is "wait it out," it's a metric, not an alert.
The runbook link in the annotations must point to something written in the last 90 days. If the runbook says to call someone who left the company, the alert is broken.

Roughly 70% of our existing alerts failed at least one of those three rules on first review. We deleted them outright instead of trying to fix them.

The cull #

The cull itself took a week. We exported every alert rule, opened a spreadsheet, and went through each one with the on-call team. For each rule we asked: when did this last fire? When it fired, did anyone do anything?

For ~140 rules, the answer to at least one of those questions was "I don't know" or "no." Those went straight to deletion. Another 30 had fired exactly once, six months ago, in a way the codebase had since rendered impossible. Those went too.

What we kept was small. About 40 rules covering five customer-facing endpoints, our queue depth, payment processing, two databases, and the auth service. Each one mapped to a section in our incident response handbook.

Three alert shapes that survived #

Almost everything we kept fits one of three patterns.

Symptom-based on customer impact #

```yaml
- alert: CheckoutLatencyP95High
  # p95 of checkout request latency, computed over a 5-minute rate window
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{
        service="checkout", route="/api/orders"
      }[5m]))
    ) > 1.0
  for: 5m  # condition must hold for 5 minutes before the alert fires
  labels: { severity: page, team: payments }
  annotations:
    summary: Checkout p95 > 1s for 5m
    runbook: https://wiki.internal/oncall/checkout-latency
    impact: Users abandon orders at p95 > 1s based on funnel data
```

The impact annotation matters. When someone gets paged at 3am, the first question is "do I care right now?" Spelling out the customer cost in the alert removes the guesswork.
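A convention like this is easy to enforce mechanically. The sketch below is one hedged way to lint a rule that has already been parsed into a dictionary shaped like the example above. The function name `missing_annotations` and the choice of required keys are assumptions for illustration, not tooling described in this post.

```python
# Assumed policy: every paging alert carries these three annotations.
REQUIRED_ANNOTATIONS = ("summary", "runbook", "impact")


def missing_annotations(rule: dict) -> list[str]:
    """Return the required annotation keys that are absent or blank on a paging rule."""
    if rule.get("labels", {}).get("severity") != "page":
        return []  # this sketch only holds paging alerts to the bar
    annotations = rule.get("annotations", {})
    return [
        key for key in REQUIRED_ANNOTATIONS
        if not str(annotations.get(key, "")).strip()
    ]


# The checkout rule from above, with the impact annotation deliberately left out.
checkout_rule = {
    "alert": "CheckoutLatencyP95High",
    "labels": {"severity": "page", "team": "payments"},
    "annotations": {
        "summary": "Checkout p95 > 1s for 5m",
        "runbook": "https://wiki.internal/oncall/checkout-latency",
    },
}

print(missing_annotations(checkout_rule))  # ['impact']
```

Run in CI against every rule file, a check along these lines would fail the PR that adds a page without saying what the page costs the customer.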

Error budget burn #

For services with SLOs we use multi-window burn-rate alerts, the approach described in the Google SRE Workbook. We tuned the windows to our incident reality: a 1-hour fast burn paired with a 6-hour slow burn. Shorter windows gave us too many false pages on benign blips; longer ones let the budget drain before someone noticed.
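For readers who haven't worked with burn rates, here is the arithmetic as a small Python sketch. The 30-day SLO period, the 99.9% target, and the budget fractions (2% of the budget in 1 hour, 5% in 6 hours, which are the figures the SRE Workbook pairs with those windows) are assumptions for illustration. The post does not state the thresholds actually in use, and whether the two windows are combined with AND or OR is left to the caller.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the error budget is spent within `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def windows_over_threshold(err_1h: float, err_6h: float, slo_target: float = 0.999) -> dict:
    """Report which window is burning budget faster than its assumed threshold."""
    return {
        "1h": burn_rate(err_1h, slo_target) >= burn_rate_threshold(0.02, 1),
        "6h": burn_rate(err_6h, slo_target) >= burn_rate_threshold(0.05, 6),
    }


# 2% errors over the last hour, 0.3% over the last six hours, against a 99.9% SLO.
print(windows_over_threshold(err_1h=0.02, err_6h=0.003))  # {'1h': True, '6h': False}
```

With these assumed fractions the thresholds work out to roughly 14.4x for the 1-hour window and 6x for the 6-hour window, so a short, sharp spike trips the fast window while a slow leak only shows up in the slow one.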

Saturation #

A handful of alerts cover resources that fail in ways customers can feel before latency breaks: queue depth (jobs piling up), connection pool exhaustion (timeouts about to start), disk fill on stateful pods (writes about to fail). These are early warnings. They don't page; they go to a daytime channel where someone picks them up during business hours.

Dashboards that on-call actually opens #

Every alert links to one canonical dashboard. Not three, not "the closest match": exactly one, for that exact alert. The dashboard shares the alert's name when that name is clear on its own; otherwise it carries a label that ties it to the alert.

The dashboard layout we use is unimaginative on purpose:

Top row: the metric in the alert, on a 6-hour window with annotations for deploys
Second row: the four most likely upstream/downstream signals (request rate, error rate, dependency p95, and saturation)
Third row: log volume on the affected service for the same window

Anyone glancing at the dashboard during a page should be able to answer "is this getting worse, getting better, or stable, and is anything else weird at the same time" within ~15 seconds. If they can't, the dashboard is wrong.

We delete dashboards that aren't linked from a runbook or alert. Last quarter we deleted 38 dashboards that nobody had viewed in 90 days. Nobody noticed.

Runbooks that don't lie #

Runbooks rot faster than code. We've started treating them like part of the production system: every alert's runbook link is checked monthly by a small script that hits the URL, follows it to the latest version, and flags anything that hasn't been edited in 90+ days.
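The core of such a check fits in a few lines. This is a hedged sketch rather than the actual script: it assumes alert rules are available as dictionaries with a `runbook` annotation, and it takes the "when was this page last edited" lookup as a function, because that call depends entirely on the wiki in use. `stale_runbooks`, `last_edited`, and the example date are all hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, Optional

MAX_AGE = timedelta(days=90)


def stale_runbooks(
    alerts: Iterable[dict],
    last_edited: Callable[[str], Optional[date]],
    today: date,
) -> list[tuple[str, str]]:
    """Return (alert name, reason) for alerts whose runbook is missing, unreachable, or stale."""
    flagged = []
    for alert in alerts:
        url = alert.get("annotations", {}).get("runbook")
        if not url:
            flagged.append((alert["alert"], "no runbook link"))
            continue
        edited = last_edited(url)  # hypothetical wiki lookup; None means the page was not found
        if edited is None:
            flagged.append((alert["alert"], "runbook not reachable"))
        elif today - edited >= MAX_AGE:
            flagged.append((alert["alert"], f"last edited {edited.isoformat()}"))
    return flagged


# Stand-in for the wiki: a dict of URL -> last edit date (made-up date).
edits = {"https://wiki.internal/oncall/checkout-latency": date(2025, 9, 1)}
alerts = [
    {
        "alert": "CheckoutLatencyP95High",
        "annotations": {"runbook": "https://wiki.internal/oncall/checkout-latency"},
    }
]
print(stale_runbooks(alerts, edits.get, today=date(2026, 1, 15)))
# [('CheckoutLatencyP95High', 'last edited 2025-09-01')]
```

Passing the lookup in as a function keeps the freshness rule testable without network access; the real script would swap in whatever API the wiki exposes.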

Each runbook follows a fixed shape:

```markdown
## Symptom
What you'll see when this fires.

## First 5 minutes
- Check X dashboard: <link>
- If [condition], jump to Mitigation A
- If [condition], jump to Mitigation B
- If neither, page the [team] team

## Mitigation A
Specific commands or links. Steps that have been tested.

## Mitigation B
...

## Common causes (last 12 months)
- 2026-01-15: cache eviction config push, fixed by reverting <PR link>
- 2025-11-03: upstream provider outage, fixed by failover to <region>
```

The "Common causes" section is the most valuable part. New on-call engineers read those entries and learn the system in 30 minutes flat.

What changed for the people on rotation #

Before: the average week had 28 alerts for the one person on call. Most weren't acted on; two were real incidents.

After: the average week has 4 alerts. All are acted on; two are real incidents.

The on-call burden actually went down, even though the same people are responding. Because they trust the alerts now, they don't have to triage in their head before opening the laptop. The alert says what it says; the runbook says what to do.

We also pay attention to who's on call when we deploy risky changes. The release engineer is encouraged to stay nearby for the first hour. We don't auto-roll-back; we have alerts that page the human who owns the change. Most of our recent fast recoveries came from this loop.

What we still don't have right #

We don't have a clean answer for alerting on customer-cohort-specific issues. If 2% of users hit a problem, our overall metrics don't move enough to trip thresholds, but those customers are usually the loudest. We're experimenting with synthetic users from a couple of geographies, but that's a different post.

We also haven't solved deploy-time noise completely. We silence all non-page alerts during a deploy window, which catches most of the false positives, but occasionally a real degradation gets masked. The trade-off is intentional; we'd rather miss a 5-minute window of signal than train the team to ignore alerts.
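The rule itself is simple enough to state as code. The sketch below is illustrative only: the `ticket` severity label, the function name `is_silenced`, and the 5-minute window length are assumptions, and in practice this logic would live in the alert router rather than in application code.

```python
from datetime import datetime, timedelta


def is_silenced(severity: str, fired_at: datetime,
                deploy_start: datetime, window: timedelta) -> bool:
    """Non-page alerts are muted while a deploy window is open; pages always go out."""
    if severity == "page":
        return False
    return deploy_start <= fired_at < deploy_start + window


deploy = datetime(2026, 1, 15, 14, 0)
window = timedelta(minutes=5)  # assumed window length

print(is_silenced("ticket", datetime(2026, 1, 15, 14, 3), deploy, window))  # True
print(is_silenced("page", datetime(2026, 1, 15, 14, 3), deploy, window))    # False
print(is_silenced("ticket", datetime(2026, 1, 15, 14, 6), deploy, window))  # False
```

The masking risk is visible in the first case: a non-page alert that fires three minutes into the deploy is dropped whether or not the degradation behind it is real.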

What I'd do on day one of a new team #

If I joined a new team tomorrow, I'd start by running this query against their alerts: how many fired in the last 90 days, how many resulted in action, how many had no runbook? The answer would tell me whether their monitoring helps or just decorates.
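As a rough sketch of that audit, assuming someone has already exported one record per alert rule (from the alerting system's history, a spreadsheet, or memory): the `AlertRecord` fields and the sample alerts below are hypothetical, and the "acted on" flag in particular usually has to be filled in by asking the on-call team.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    name: str
    last_fired: Optional[date]  # None when nobody knows
    acted_on: bool              # did anyone do anything the last time it fired?
    runbook: Optional[str]      # runbook URL, if any


def audit(records: list, today: date, window_days: int = 90) -> dict:
    """Count how many alerts fired in the window, led to action, or lack a runbook."""
    cutoff = today - timedelta(days=window_days)
    fired = [r for r in records if r.last_fired is not None and r.last_fired >= cutoff]
    return {
        "total": len(records),
        "fired_in_window": len(fired),
        "acted_on": sum(1 for r in fired if r.acted_on),
        "no_runbook": sum(1 for r in records if not r.runbook),
    }


# Made-up sample data for illustration.
records = [
    AlertRecord("CheckoutLatencyP95High", date(2026, 1, 10), True,
                "https://wiki.internal/oncall/checkout-latency"),
    AlertRecord("LegacyProbeDown", None, False, None),
    AlertRecord("HighCPU", date(2026, 1, 12), False, None),
]
print(audit(records, today=date(2026, 1, 15)))
# {'total': 3, 'fired_in_window': 2, 'acted_on': 1, 'no_runbook': 2}
```

A wide gap between `fired_in_window` and `acted_on`, or a `no_runbook` count close to `total`, is the signal that a cull is overdue.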

Most monitoring rebuilds aren't a tooling problem. They're a culling problem. Delete more than you think.

About Kiril urbonas
