Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. These are the metrics and alerts that make pipelines observable.
The first time our CI pipeline broke for 6 hours on a Friday, we found out Monday morning when an engineer asked why their PR had been "waiting on checks" all weekend. CI hadn't alerted because nothing in our monitoring stack treated CI failures as worth waking someone over. Production was healthy; alerts stayed silent; the build pipeline was a graveyard.
That weekend cost a sprint of velocity. Since then we've treated CI pipelines as production infrastructure with their own observability and alerting. This post covers what we monitor.
The failure modes we've actually seen:
npm install fails for an hour. CI dies en masse.
Each of these has a signature. Without monitoring, nobody sees the signature; they see the symptom (PRs not merging) after the fact.
Because we treat the pipeline as a service, we instrument it like one:
Success rate per workflow. Per-repo, per-workflow, last 1h / 6h / 24h. A workflow whose success rate drops below 90% over an hour is a red flag.
Queue time p95. How long jobs wait before running. Healthy is < 30 seconds; > 5 minutes means the runner pool is undersized or unhealthy.
Build duration p95. Per workflow. Sudden increases mean either the workflow changed or runner performance changed (often a noisy neighbor on self-hosted, or an issue with a service the build depends on).
Runner online count. How many runners are currently registered and healthy. It should stay above a floor set for the pool; alert if it drops below.
Per-step failure counts. A test that's failing 10% of the time across runs is flaky; the metric surfaces it before someone has to manually find a pattern.
Total daily build minutes. Cost tracking. Sudden spikes mean someone introduced a heavy test or a build is in a retry loop.
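As a rough illustration of the first two signals, here is a minimal Python sketch that computes a success rate and a nearest-rank p95 from a batch of finished runs. The `Run` record and its field names are hypothetical, not taken from our receiver; in practice these numbers would more likely come from queries against the metrics store.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Run:
    """One finished workflow run (hypothetical record shape)."""
    workflow: str
    succeeded: bool
    queue_seconds: float
    duration_seconds: float


def success_rate(runs: Sequence[Run]) -> Optional[float]:
    """Fraction of runs that succeeded, or None when there is no data."""
    if not runs:
        return None
    return sum(1 for r in runs if r.succeeded) / len(runs)


def p95(values: Sequence[float]) -> Optional[float]:
    """Nearest-rank 95th percentile, or None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    # 1-based nearest rank, ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Example: nine quick successes and one failed run that sat in the queue.
runs = [Run("ci", True, 10.0, 300.0)] * 9 + [Run("ci", False, 400.0, 310.0)]
hourly_success = success_rate(runs)
queue_p95 = p95([r.queue_seconds for r in runs])
```

Returning `None` for an empty window keeps "no runs at all" distinct from "every run failed", which matters for a workflow that only runs a few times a day.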
GitHub Actions emits a webhook on every workflow run completion. We have a tiny webhook receiver that turns those events into Prometheus metrics.
The receiver itself is ~50 lines of code. Once metrics are in Prometheus, the rest is standard observability — Grafana dashboards, Alertmanager rules.
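The core of such a receiver could look something like the sketch below. It is not our actual code: it only shows the step that turns one `workflow_run` payload into a metric sample, leaving out the HTTP server, signature verification, and the Prometheus export. The field names (`action`, `workflow_run.run_started_at`, `workflow_run.updated_at`, `workflow_run.conclusion`, `repository.full_name`) reflect my understanding of GitHub's payload and should be checked against the webhook documentation. Using `updated_at` as the finish time is an approximation.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-05-03T04:13:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def run_to_sample(payload: dict) -> Optional[dict]:
    """Turn a workflow_run webhook payload into one metric sample.

    Returns None for events that are not a completed run.
    """
    if payload.get("action") != "completed":
        return None
    run = payload.get("workflow_run")
    if not run:
        return None
    started = parse_ts(run["run_started_at"])
    finished = parse_ts(run["updated_at"])  # assumed to approximate finish time
    return {
        "repo": payload["repository"]["full_name"],
        "workflow": run["name"],
        "conclusion": run["conclusion"],
        "duration_seconds": (finished - started).total_seconds(),
    }


# Hypothetical payload, trimmed to the fields the sketch reads.
example = {
    "action": "completed",
    "repository": {"full_name": "acme/shop"},
    "workflow_run": {
        "name": "ci",
        "conclusion": "failure",
        "run_started_at": "2024-05-03T04:13:00Z",
        "updated_at": "2024-05-03T04:42:00Z",
    },
}
sample = run_to_sample(example)
```

Each sample would then increment a counter labelled by repo, workflow, and conclusion, and feed a duration histogram. Queue time probably needs the per-job `workflow_job` events instead, since it depends on when each job was created versus when it started.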
For platforms other than GitHub Actions (Buildkite, CircleCI), the same pattern applies: they all have webhooks or APIs that expose the same kind of data.
The alert rules we run:
Workflow success rate < 85% over 1 hour. Page (during business hours; ticket overnight). Indicates a flaky test, a broken dependency, or a credential issue. In every case, someone needs to look.
Queue time p95 > 5 minutes for 15 minutes. Page. Runner pool is undersized or unhealthy.
Zero successful runs of a critical workflow in 2 hours during weekday business hours. Page. The build is broken enough that nothing's getting through.
Runner online count below floor for 10 minutes. Page. Self-hosted runners died and didn't recover.
Daily build minutes 2× the trailing 7-day average. Slack notification (not page). Could be a legitimate spike (release day) or a runaway pipeline.
We don't page on individual workflow run failures — that would be way too noisy. We page on patterns of failure.
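To make "patterns, not individual failures" concrete, here is a small Python sketch of two of the rules above written as plain functions. In practice these would be Alertmanager rules over Prometheus queries; the function names and the `min_runs` guard are assumptions added for illustration, not part of our rule set.

```python
from typing import Sequence


def success_rate_breached(
    successes: int,
    total: int,
    threshold: float = 0.85,
    min_runs: int = 5,  # hypothetical guard: ignore windows with too few runs
) -> bool:
    """True when the window's success rate is below the threshold."""
    if total < min_runs:
        # One or two failed runs are not a pattern yet.
        return False
    return successes / total < threshold


def build_minutes_spiking(
    today: float,
    previous_days: Sequence[float],
    factor: float = 2.0,
) -> bool:
    """True when today's build minutes reach `factor` times the trailing average."""
    if not previous_days:
        return False
    average = sum(previous_days) / len(previous_days)
    return average > 0 and today >= factor * average


# A single failed run does not trip the rule; 8 successes out of 10 does.
single_failure = success_rate_breached(successes=0, total=1)
bad_hour = success_rate_breached(successes=8, total=10)
runaway = build_minutes_spiking(2100.0, [1000.0] * 7)
```

The minimum-run guard is one way to keep a quiet workflow from paging on a lone failure. The separate "zero successful runs in 2 hours" rule covers the case where a low-volume workflow really is broken.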
We keep a single dashboard with one row per signal.
Engineers look at this dashboard about as often as the production dashboard. Sometimes more, when build failures are blocking their work.
Each alert has a runbook. The most-used one:
Alert: Workflow success rate < 85%
- Check the dashboard for which workflow is failing.
- Open the most recent failed run; read the error.
- Is it a real test failure? A credential issue? Dependency download failure?
  - Real test: find the responsible team, ping them.
  - Credential: check expiry dates of GitHub Actions secrets; rotate if expired.
  - Dependency: check the upstream package server's status.
- If the cause is unclear: rerun a failed job. Sometimes the error is more verbose on retry.
It's prosaic, but the runbook gets you to the cause faster than starting from scratch.
Real examples of CI issues caught by alerts within minutes instead of hours:
A test became flaky after a dependency upgrade. Per-step failure rate alert fired within 90 minutes. The team noticed before the merge queue accumulated dozens of retries.
Runner disk fill. Runner online count dropped (the runner went unhealthy when disk hit 95%). Pager fired within 10 minutes. We added a disk-usage cleanup step to the runner provisioning.
Image registry hiccup. Push step started failing across all workflows that pushed images. Success rate alert fired; we diagnosed it as upstream registry rate-limiting and waited it out (rather than thrashing retries).
Credential rotation that was scheduled and forgotten. A secret expired at 4am on a weekend. We were paged. Fixed in 10 minutes. Without the alert, the next merge attempt Monday morning would have been the discovery.
Each of these would have been a much bigger productivity hit if we'd found out hours later. The alerting cost is small; the productivity protection is real.
A few patterns we tried and dropped:
Alerting on every workflow failure. Too noisy. Real test failures (the kind of thing CI is supposed to catch) shouldn't page anyone. The pattern-based alerts above are the right level.
SLO on CI uptime. We tried formalizing CI as an SLO; got bogged down in defining "available." For us, the alerts cover it without the SLO ceremony.
A separate ops team for the build pipeline. The pipeline is everyone's responsibility. Anyone who touches CI configs is responsible for keeping the alerts green for their workflows. The platform team owns the runners and shared workflows; product teams own their service-specific ones.
The mistakes to avoid:
Treating CI as "developer concerns, not ops concerns." It's both. A broken CI pipeline is a productivity outage that's just as costly as a small production incident.
Logging CI failures only to GitHub. The fact that GitHub knows isn't enough — nobody reads GitHub for system-level alerts. Wire it into your monitoring stack like everything else.
Alerting on individual failures. Too noisy; it trains the team to ignore alerts. Pattern-based alerts (success rate, queue time, repeated step failures) are the right signal.
Forgetting the cost dashboard. Pipeline cost isn't an outage signal but it's a useful health signal — sudden spikes indicate something's wrong.
CI pipelines aren't optional; they're as important to throughput as any production service. Treating them as observable infrastructure — with metrics, alerts, dashboards — turns "wait, when did the pipeline break?" into "the pipeline alerted at 04:13 and was fixed by 04:42." The discipline is the same as any service; the implementation is a webhook plus standard tooling.