How we went from 200 alerts per week (most ignored) to 15 actionable alerts with clear runbooks and useful dashboards.
Our on-call engineers were getting 200 alerts per week. They ignored most of them. Then a real outage went unnoticed for 47 minutes because the alert was buried in noise. We rebuilt our monitoring from scratch.
We audited one week of alerts and found that on-call engineers had trained themselves to ignore Slack notifications entirely.
We categorized every alert into three tiers:
| Tier | Criteria | Notification | Example |
|---|---|---|---|
| P1 - Page | Customer-facing impact now | PagerDuty + Phone | API error rate > 5% for 5min |
| P2 - Notify | Likely to become P1 soon | Slack channel | Disk usage > 85% |
| P3 - Log | For investigation later | Dashboard only | Pod restart count > 3/hour |
Rule: If an alert doesn't have a clear action, it's not an alert—it's a metric.
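The tiers map naturally onto notification routing. A minimal Alertmanager routing sketch of this idea, assuming `severity` labels of `critical`/`warning`/`info` for P1/P2/P3 and placeholder receiver names (`pagerduty-oncall`, `slack-alerts` are hypothetical):

```yaml
# Alertmanager routing sketch: severity labels decide who gets notified.
# Receiver names here are placeholders, not the original team's config.
route:
  receiver: slack-alerts          # default: P2 goes to the Slack channel
  routes:
    - matchers:
        - severity="critical"     # P1: page a human
      receiver: pagerduty-oncall
    - matchers:
        - severity="info"         # P3: dashboard only, no notification
      receiver: "null"
receivers:
  - name: pagerduty-oncall        # pagerduty_configs would go here
  - name: slack-alerts            # slack_configs would go here
  - name: "null"                  # empty receiver: alert is recorded, nobody pinged
```

The empty `"null"` receiver is a common pattern for the "it's a metric, not an alert" tier: the alert still exists for dashboards and history, but no notification fires.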
The biggest source of noise was threshold-based alerts that toggled rapidly:
```yaml
# Before: fires every time CPU spikes briefly
- alert: HighCPU
  expr: container_cpu_usage > 0.8
  for: 0m  # instant fire
```

```yaml
# After: sustained high CPU is the real problem
- alert: HighCPU
  expr: container_cpu_usage > 0.8
  for: 10m  # must sustain for 10 minutes
  labels:
    severity: warning
```
Adding `for: 10m` to sustained-condition alerts eliminated 70% of our noise.
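A further layer of smoothing, beyond what's shown above, is averaging the expression itself so a single scrape spike can never satisfy the condition. A sketch, assuming `container_cpu_usage` is a gauge (the window sizes are illustrative):

```yaml
# Compare the 10-minute average rather than the instantaneous value;
# combined with for:, only a genuinely sustained problem can fire.
- alert: HighCPU
  expr: avg_over_time(container_cpu_usage[10m]) > 0.8
  for: 5m
  labels:
    severity: warning
```

The trade-off is latency: a smoothed alert fires later, which is acceptable for a P2 warning but may not be for a P1 page.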
Every P1 and P2 alert now includes a runbook link:
```yaml
- alert: DatabaseConnectionPoolExhausted
  expr: pg_stat_activity_count > pg_settings_max_connections * 0.9
  for: 5m
  annotations:
    summary: "Database connection pool near limit"
    runbook: "https://wiki.internal/runbooks/db-pool-exhausted"
```
The runbook answers three questions:
We replaced 12 generic dashboards with 3 purpose-built ones:
Each dashboard has a "What am I looking at?" text panel at the top explaining how to read it.
| Metric | Before | After |
|---|---|---|
| Alerts per week | 200 | 15 |
| Mean time to acknowledge | 23 min | 4 min |
| Alerts with runbooks | 10% | 100% (P1/P2) |
| Missed real incidents | 2/month | 0 |
Good monitoring isn't about collecting more data. It's about surfacing the right signal at the right time to the right person.