Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. This post covers the metrics and alerts that make pipelines observable.

Pipeline Observability: Why CI Failures Don't Trigger Alerts (And Should)

The first time our CI pipeline broke for 6 hours on a Friday, we found out Monday morning when an engineer asked why their PR had been "waiting on checks" all weekend. CI hadn't alerted because nothing in our monitoring stack treated CI failures as worth waking someone over. Production was healthy; alerts stayed silent; the build pipeline was a graveyard.

That weekend cost a sprint of velocity. Since then we've treated CI pipelines as production infrastructure with their own observability and alerting. This post covers what we monitor.

What goes wrong with CI #

The failure modes we've actually seen:

Runners stop coming online. Self-hosted runners crash; replacement runners fail to register. New jobs queue indefinitely.
Credentials expire silently. A token used for pushing images expires; every build fails with the same error.
Upstream dependencies break. A package server has an outage; npm install fails for an hour. CI dies en masse.
Disk fills. Self-hosted runners accumulate Docker images; disk hits 100%; new builds fail with no clear error.
One specific test goes flaky. Most builds pass; the flaky test fails on 30% of runs. Merges are delayed by retries.
Cache poisoning. A bad cached layer or dependency keeps getting reused; clean rebuilds fix it; nobody notices the cache.

Each of these has a signature. Without monitoring, nobody sees the signature; they see the symptom (PRs not merging) after the fact.

The metrics that earn their place #

Treating the pipeline as a service, we instrument it like a service:

Success rate per workflow. Per-repo, per-workflow, last 1h / 6h / 24h. A workflow whose success rate drops below 90% over an hour is a red flag.

Queue time p95. How long jobs wait before running. Healthy is < 30 seconds; > 5 minutes means the runner pool is undersized or unhealthy.

Build duration p95. Per workflow. Sudden increases mean either the workflow changed or runner performance changed (often a noisy neighbor on self-hosted runners, or an issue with a service the build depends on).

Runner online count. How many runners are currently registered and healthy. It should stay above a defined floor; alert if it drops below that floor.

Per-step failure counts. A test that's failing 10% of the time across runs is flaky; the metric surfaces it before someone has to manually find a pattern.

Total daily build minutes. Cost tracking. Sudden spikes mean someone introduced a heavy test or a build is in a retry loop.
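As a rough illustration of the first two metrics, the sketch below computes a success rate and a nearest-rank p95 from a list of finished runs. The `Run` record and its field names are assumptions made for this example; they are not part of any CI platform's API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one finished workflow run."""
    workflow: str
    succeeded: bool
    queue_seconds: float     # time spent waiting for a runner
    duration_seconds: float  # time spent executing


def success_rate(runs):
    """Fraction of runs that succeeded, or None when there is no data."""
    if not runs:
        return None
    return sum(r.succeeded for r in runs) / len(runs)


def p95(values):
    """Nearest-rank 95th percentile, or None for an empty input."""
    if not values:
        return None
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


# Example data: 20 runs, every 10th one fails, queue time grows with i.
runs = [Run("ci", i % 10 != 0, float(i), 60.0 + i) for i in range(1, 21)]
rate = success_rate(runs)
queue_p95 = p95([r.queue_seconds for r in runs])
```

Returning `None` for an empty window, rather than 0 or 1, keeps "no builds ran" distinguishable from "every build failed" once these values feed an alert.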

How we collect #

GitHub sends a workflow_run webhook event each time a workflow run completes. We have a tiny webhook receiver that:

Parses the workflow run payload.
Extracts the metrics above.
Pushes them as Prometheus metrics (via the Prometheus Pushgateway, or a custom counter exporter).

The receiver itself is ~50 lines of code. Once metrics are in Prometheus, the rest is standard observability — Grafana dashboards, Alertmanager rules.
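A minimal sketch of such a receiver, using only the Python standard library, might look like the following. It takes the "custom counter exporter" route: it counts completed runs and serves them in the Prometheus text format for scraping. The metric name `ci_workflow_runs_total`, the port, and the helper names are assumptions for illustration. The payload fields (`action`, `workflow_run.name`, `conclusion`, `run_started_at`, `updated_at`, `repository.full_name`) follow the `workflow_run` event as we understand it and should be checked against the current GitHub documentation.

```python
import json
from collections import Counter
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

# (repo, workflow, conclusion) -> number of completed runs seen
RUNS_TOTAL = Counter()


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2024-05-03T10:15:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def extract_metrics(payload):
    """Return the fields we care about for a completed run, else None."""
    if payload.get("action") != "completed" or "workflow_run" not in payload:
        return None
    run = payload["workflow_run"]
    started = parse_ts(run["run_started_at"])
    finished = parse_ts(run["updated_at"])
    return {
        "repo": payload["repository"]["full_name"],
        "workflow": run["name"],
        "conclusion": run["conclusion"],
        # updated_at is used as an approximation of the finish time
        "duration_seconds": (finished - started).total_seconds(),
    }


def record(metrics):
    """Increment the run counter for this repo/workflow/conclusion."""
    key = (metrics["repo"], metrics["workflow"], metrics["conclusion"])
    RUNS_TOTAL[key] += 1


def render_prometheus():
    """Render the counters in the Prometheus text exposition format.

    Label values are not escaped here; names containing quotes or
    backslashes would need escaping in real use.
    """
    lines = ["# TYPE ci_workflow_runs_total counter"]
    for (repo, workflow, conclusion), count in sorted(RUNS_TOTAL.items()):
        lines.append(
            f'ci_workflow_runs_total{{repo="{repo}",workflow="{workflow}",'
            f'conclusion="{conclusion}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):  # webhook deliveries
        length = int(self.headers.get("Content-Length", 0))
        metrics = extract_metrics(json.loads(self.rfile.read(length) or b"{}"))
        if metrics:
            record(metrics)
        self.send_response(204)
        self.end_headers()

    def do_GET(self):  # Prometheus scrape
        body = render_prometheus().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("", port), Handler).serve_forever()
```

Calling `serve()` starts the listener. The sketch leaves out webhook signature verification (the `X-Hub-Signature-256` header) and duration histograms, both of which a real deployment would want.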

For platforms other than GitHub Actions (Buildkite, CircleCI), the same pattern applies: they all have webhooks or APIs that expose the same kind of data.

Alerts that fire #

The alert rules we run:

Workflow success rate < 85% over 1 hour. Page (during business hours; ticket overnight). Usually indicates a flaky test, a broken dependency, or a credential issue. Whatever the cause, someone needs to look.

Queue time p95 > 5 minutes for 15 minutes. Page. Runner pool is undersized or unhealthy.

Zero successful runs of a critical workflow in 2 hours during weekday business hours. Page. The build is broken enough that nothing's getting through.

Runner online count below floor for 10 minutes. Page. Self-hosted runners died and didn't recover.

Daily build minutes 2× the trailing 7-day average. Slack notification (not page). Could be a legitimate spike (release day) or a runaway pipeline.

We don't page on individual workflow run failures — that would be way too noisy. We page on patterns of failure.
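To make "patterns, not individual failures" concrete, here is a small sketch of the success-rate rule as plain Python rather than as an Alertmanager rule. The 85% threshold and the one-hour window come from the rule above; the minimum run count and the definition of business hours (Monday to Friday, 09:00 to 18:00) are assumptions added for the example.

```python
from datetime import datetime, timedelta

SUCCESS_RATE_THRESHOLD = 0.85  # from the rule above
WINDOW = timedelta(hours=1)    # from the rule above
MIN_RUNS = 5                   # assumption: ignore windows with too little data


def success_rate_alert(runs, now):
    """Decide how to route the success-rate alert.

    runs: iterable of (finished_at, succeeded) tuples.
    Returns "page", "ticket", or None when no alert should fire.
    """
    recent = [ok for finished_at, ok in runs if now - WINDOW <= finished_at <= now]
    if len(recent) < MIN_RUNS:
        return None  # a single failed run never alerts on its own
    rate = sum(recent) / len(recent)
    if rate >= SUCCESS_RATE_THRESHOLD:
        return None
    # Assumed business hours: Monday-Friday, 09:00-18:00 local time.
    business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    return "page" if business_hours else "ticket"
```

The minimum run count is what keeps one failed run in a quiet hour from looking like a 0% success rate.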

The dashboard #

A single dashboard, one row per signal:

Workflow success rate over time, faceted by workflow. Spot which one is degrading.
Queue time + run duration, separately. Run duration affects throughput; queue time affects perceived speed.
Per-runner health for self-hosted. Each runner's last-heartbeat, jobs run last hour.
Per-step failure rate, last 24h. Surfaces flaky tests.
Cost trend — daily build minutes.

Engineers look at this dashboard about as often as the production dashboard. Sometimes more, when build failures are blocking their work.
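The per-step failure rate panel can be approximated with a few lines of aggregation. In this sketch the input shape, a stream of `(step_name, passed)` pairs, and the 5% to 50% band used to label a step "likely flaky" are assumptions for illustration. A step that fails every time is broken rather than flaky, so it falls outside the band.

```python
from collections import defaultdict


def step_failure_rates(step_results):
    """Return {step_name: failure_rate} from (step_name, passed) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for step, passed in step_results:
        totals[step] += 1
        if not passed:
            failures[step] += 1
    return {step: failures[step] / totals[step] for step in totals}


def likely_flaky(rates, low=0.05, high=0.5):
    """Steps that fail sometimes but not always; thresholds are assumptions."""
    return sorted(step for step, rate in rates.items() if low <= rate <= high)
```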

The runbook #

Per alert, a runbook. The most-used one:

Alert: Workflow success rate < 85%

Check the dashboard for which workflow is failing.

Open the most recent failed run; read the error.

Is it a real test failure? A credential issue? Dependency download failure?

Real test: find the responsible team, ping them.

Credential: check expiry dates of GitHub Actions secrets; rotate if expired.

Dependency: check the upstream package server's status.

If the cause is unclear: rerun a failed job. Sometimes the error is more verbose on retry.

It's prosaic. The runbook gets you to the cause faster than starting from scratch.
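The branching in that runbook can also be written down as a first-pass triage helper. This is only a sketch: the regular expressions are illustrative guesses at what each class of failure tends to look like in a log, not patterns taken from any particular tool, and they would need tuning against real output.

```python
import re

# Illustrative patterns only; real log text varies by tool and version.
TRIAGE_RULES = [
    ("credential", re.compile(r"401|403|unauthorized|denied|expired", re.I)),
    ("dependency", re.compile(r"ETIMEDOUT|ECONNRESET|503|registry", re.I)),
    ("test", re.compile(r"assert|test.*failed", re.I)),
]

# Next steps, taken from the runbook above.
NEXT_STEP = {
    "credential": "check expiry dates of GitHub Actions secrets; rotate if expired",
    "dependency": "check the upstream package server's status",
    "test": "find the responsible team and ping them",
    "unclear": "rerun a failed job; the error is sometimes more verbose on retry",
}


def triage(log_text):
    """Return (cause, next_step) for the first rule that matches the log."""
    for cause, pattern in TRIAGE_RULES:
        if pattern.search(log_text):
            return cause, NEXT_STEP[cause]
    return "unclear", NEXT_STEP["unclear"]
```

Rule order matters here: the first match wins, so the most specific patterns belong at the top.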

What we caught with this #

Real examples of CI issues caught by alerts within minutes instead of hours:

A test became flaky after a dependency upgrade. Per-step failure rate alert fired within 90 minutes. The team noticed before the merge queue accumulated dozens of retries.

Runner disk fill. Runner online count dropped (the runner went unhealthy when disk hit 95%). Pager fired within 10 minutes. We added a disk-usage cleanup step to the runner provisioning.

Image registry hiccup. Push step started failing across all workflows that pushed images. Success rate alert fired; we diagnosed it as upstream registry rate-limiting and waited it out (rather than thrashing retries).

Credential rotation that was scheduled and forgotten. A secret expired at 4am on a weekend. We were paged. Fixed in 10 minutes. Without the alert, the next merge attempt Monday morning would have been the discovery.

Each of these would have been a much bigger productivity hit if we'd found out hours later. The alerting cost is small; the productivity protection is real.
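For the disk-fill case, a check along these lines could run on each self-hosted runner. It is a sketch under assumptions: the 95% limit matches the point at which the runner above went unhealthy, while the 80% cleanup threshold and the function names are made up for the example.

```python
import shutil

DISK_USAGE_LIMIT = 0.95  # the runner above went unhealthy at 95%
CLEANUP_AT = 0.80        # assumption: start cleaning well before the limit


def disk_usage_fraction(path="/"):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_action(fraction):
    """Map a usage fraction to "ok", "cleanup", or "unhealthy"."""
    if fraction >= DISK_USAGE_LIMIT:
        return "unhealthy"
    if fraction >= CLEANUP_AT:
        return "cleanup"
    return "ok"
```

A "cleanup" result would trigger whatever the provisioning step does, such as pruning unused Docker images, before the runner reaches the unhealthy state.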

What we don't do #

A few patterns we tried and dropped:

Alerting on every workflow failure. Too noisy. Real test failures (the kind of thing CI is supposed to catch) shouldn't page anyone. The pattern-based alerts above are the right level.

SLO on CI uptime. We tried formalizing CI as an SLO; got bogged down in defining "available." For us, the alerts cover it without the SLO ceremony.

A separate ops team for the build pipeline. The pipeline is everyone's responsibility. Anyone who touches CI configs is responsible for keeping the alerts green for their workflows. The platform team owns the runners and shared workflows; product teams own their service-specific ones.

Common mistakes #

Treating CI as "a developer concern, not an ops concern." It's both. A broken CI pipeline is a productivity outage that's just as costly as a small production incident.

Logging CI failures only to GitHub. The fact that GitHub knows isn't enough — nobody reads GitHub for system-level alerts. Wire it into your monitoring stack like everything else.

Alerting on individual failures. Too noisy; trains the team to ignore. Pattern-based alerts (success rate, queue time, repeated step failures) are the right signal.

Forgetting the cost dashboard. Pipeline cost isn't an outage signal but it's a useful health signal — sudden spikes indicate something's wrong.
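The cost signal is cheap to compute. The sketch below flags a day whose build minutes exceed twice the trailing 7-day average, matching the Slack rule described earlier. The function name and the oldest-first list input are assumptions for the example.

```python
def build_minutes_spike(daily_minutes, factor=2.0, window=7):
    """Return True if the latest day exceeds factor x the trailing average.

    daily_minutes: oldest-first list of daily totals, ending with today.
    Returns False when there is not yet a full trailing window.
    """
    if len(daily_minutes) < window + 1:
        return False
    *history, today = daily_minutes
    trailing = history[-window:]
    baseline = sum(trailing) / window
    return baseline > 0 and today > factor * baseline
```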

What I'd tell a team starting #

Treat the build pipeline as production. Same monitoring discipline.
Webhook → metrics → dashboard + alerts. A small webhook receiver is enough.
Alert on patterns, not individual failures. Success rate, queue time, repeated step failures.
One runbook per alert. Make the on-call response fast.
Watch the cost trend. Cheapest leading indicator of pipeline misbehavior.

What to read next #

Scalable CI/CD with GitHub Actions — the pipeline patterns the observability watches over
CI/CD pipeline optimization — where the time goes — making the pipeline fast complements making it observable
Burn-rate alerting — the SLO discipline that prevents alert fatigue — adjacent alerting discipline
Monitoring that actually helps on-call — broader monitoring patterns

CI pipelines aren't optional; they're as important to throughput as any production service. Treating them as observable infrastructure — with metrics, alerts, dashboards — turns "wait, when did the pipeline break?" into "the pipeline alerted at 04:13 and was fixed by 04:42." The discipline is the same as any service; the implementation is a webhook plus standard tooling.

About Kiril Urbonas