Our CI was 73% green at its worst. People trusted it less than a coin flip. Six things we did to get to 96%, in rough order of impact.

GitHub Actions Pipeline Reliability

A year ago our default-branch CI green rate was 73%. The other 27% was a soup of timeouts, flaky tests, transient registry pull failures, and the occasional real failure. Engineers had stopped reading the failure messages; they just hit Re-run and moved on.

That's worse than a pipeline that is simply red. When the team trains itself to ignore CI, real failures slip through. We did six things over a quarter to bring the green rate up to 96%. They're listed below in rough order of impact.

1. Pin everything

The biggest single contributor to flakiness was unpinned actions and base images. A workflow that ran actions/setup-node@v3 would silently move from 3.0.0 to 3.8.1 over months, and once in a while a new version had a bug that broke our specific use case for a day until upstream patched it.

Every action is now pinned to a SHA, not a tag:

yaml.yaml

# Before — moves over time
- uses: actions/setup-node@v4

# After — frozen
- uses: actions/setup-node@60edb5dd545a775178f52524783378180af0d1f8 # v4.0.2

Dependabot keeps these up to date with PRs we review. The diff per upgrade is small and intentional. If a Dependabot PR breaks CI, it's contained — main is still on the previous SHA.
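One way to keep the pinning rule from eroding is a small check that reports any action referenced by tag or branch. This is a minimal sketch, not our actual tooling: the function name find_unpinned_actions is hypothetical, and it assumes each uses: reference sits on its own line in the workflow file.

```shell
#!/usr/bin/env bash

# find_unpinned_actions FILE...
# Prints "file:line:content" for every `uses:` reference that is not pinned
# to a full 40-character commit SHA. Local actions (./path) are skipped.
find_unpinned_actions() {
  grep -nHE '^[[:space:]]*(-[[:space:]]+)?uses:[[:space:]]*[^[:space:]]+' "$@" |
    grep -vE 'uses:[[:space:]]*\./' |
    grep -vE '@[0-9a-f]{40}([[:space:]]|$)' || true   # no matches is not an error
}
```

Run over .github/workflows/*.yml in a lint job, a non-empty result could be treated as a failure so that a tag reference is caught before it reaches main.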

We do the same for Docker base images (digest pinning, not tag). For our self-built builder image (with our toolchain pre-installed) we use a content-addressable tag like builder:8a4f2c1. Nothing in CI references :latest or :main ever.

This change alone took us from 73% to about 86% green over two weeks.

2. Quarantine flaky tests, don't fix them inline

We had ~40 tests that flaked sporadically. The natural instinct is to fix them as you find them, but engineers under time pressure don't fix them — they retry the job and merge. The tests stay flaky.

We added a script that runs after every CI failure: it checks if the failed test has flaked in the last 30 days. If yes, the test is automatically quarantined — moved to a separate test job that runs but doesn't block merges. The author of the original PR isn't blocked.
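The lookup at the heart of that script can be sketched roughly as follows. This is an illustration under stated assumptions, not the script we run: the flake history file, its format (one "<unix_timestamp> <test_name>" pair per line, test names without whitespace), and the function names are all hypothetical.

```shell
#!/usr/bin/env bash

WINDOW_DAYS=30

# flaked_recently TEST_NAME HISTORY_FILE [NOW]
# Exits 0 if TEST_NAME has a recorded flake within the last WINDOW_DAYS, else 1.
# NOW is a Unix timestamp and defaults to the current time.
flaked_recently() {
  local test_name="$1" history_file="$2" now="${3:-$(date +%s)}"
  awk -v name="$test_name" -v now="$now" -v window="$((WINDOW_DAYS * 86400))" '
    $2 == name && (now - $1) <= window { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$history_file"
}

# decide TEST_NAME HISTORY_FILE [NOW]
# Prints "quarantine" for a test that flaked recently, "block" otherwise.
decide() {
  if flaked_recently "$@"; then
    echo "quarantine"
  else
    echo "block"
  fi
}
```

A failure in a test with no recent flake history still blocks the PR, which keeps real regressions from being quarantined on their first appearance.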

Quarantined tests get a Jira ticket auto-filed against the team that owns them. The ticket has a 30-day SLA. If it's not fixed in 30 days, the test gets deleted from the codebase, full stop. We've deleted maybe 12 tests this way; nobody has missed any of them.

This had two effects: PRs stopped getting blocked by flakes (so engineers stopped retrying and merging through), and the team learned which areas of the codebase were producing flaky tests, which was usually a sign of a real reliability bug in the system under test.

3. Cache aggressively, but pin the cache key

Most workflows do npm install or pip install as part of every job. Caching this saves enormous time but also introduces its own flakiness — a cache miss shouldn't cause a job to fail, but a cache containing the WRONG dependencies will.

Our cache key includes:

yaml.yaml

- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}-${{ hashFiles('.nvmrc') }}

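The package-lock.json hash means a lockfile change invalidates the cache. The .nvmrc hash means a Node version change invalidates the cache. The runner.os component keeps Linux, Windows, and macOS caches apart. Note that runner.os only reports the OS family (Linux, Windows, or macOS), so on its own it does not change when a job moves from ubuntu-22.04 to ubuntu-24.04; invalidating on an image upgrade needs an extra key component, such as the image label the job runs on. We never take a cache hit across these dimensions; we only reuse caches when the inputs are byte-identical to what produced them.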

We had a class of weird failures where a cache from package-lock.json revision A was being reused on revision B because the cache key didn't include the lockfile. The dependencies were technically wrong but compatible enough to mostly-work. Fixed by hashing the lockfile.

4. Tighten timeouts

Before this work, our default job timeout was 6 hours (the GitHub default). The intent was "don't make tests fail just because they're slow." The effect was that one stuck test could waste an hour of runner time before someone noticed and cancelled it.

We set explicit timeouts everywhere:

yaml.yaml

jobs:
  unit-tests:
    timeout-minutes: 12   # we measured; p99 is ~7m
  integration-tests:
    timeout-minutes: 25
  deploy:
    timeout-minutes: 15

Numbers come from measuring p99 actual runtime over a month and adding 50%. Anything that takes longer than that has gone wrong; we'd rather fail fast and re-run.
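The arithmetic is simple enough to script. A minimal sketch, assuming a file with one job duration in minutes per line (a hypothetical export of a month of runs); the function name suggest_timeout is made up for illustration, and the percentile uses the nearest-rank method, which is one of several reasonable definitions.

```shell
#!/usr/bin/env bash

# suggest_timeout FILE
# FILE holds one job duration in minutes per line.
# Prints the nearest-rank p99 plus 50%, rounded up to a whole minute.
suggest_timeout() {
  sort -n "$1" | awk '
    { d[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 99 + 99) / 100)   # ceil(0.99 * NR)
      budget = d[rank] * 1.5             # p99 plus 50% headroom
      timeout = int(budget)
      if (timeout < budget) timeout++    # round up to a whole minute
      print timeout
    }'
}
```

The result is a starting point for timeout-minutes; re-measuring occasionally keeps the budget honest as the test suite grows.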

Stuck jobs went from "occasionally noticed by a human after 60 minutes" to "killed automatically in 15-25 minutes." Total CI runner time dropped about 22%.

5. Fail fast and explicitly

A workflow that fails on step 14 of 18 should report exactly what failed, not bury the error in 500 lines of teardown logs. Two changes made the failures actually readable:

yaml.yaml

# Exit a run step at its first failing command, including failures inside a pipeline
defaults:
  run:
    shell: bash --noprofile --norc -eo pipefail {0}

# And use ::error:: annotations for the most common failure modes
- name: Lint
  run: |
    if ! npm run lint; then
      echo "::error::lint failed — see above for the offending file"
      exit 1
    fi

The ::error:: annotation surfaces the message in the annotations list on the workflow run summary and in the PR's Checks tab. Engineers see it without opening the log. Getting to the details is one click instead of a scroll through 200 lines.
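The error workflow command also accepts file and line parameters, which attach the annotation to a specific line of a file. The following is a minimal sketch, assuming a tool that prints path:line:message with no space after the colons (a hypothetical format; real linters differ). The function name annotate_errors is made up, and the sketch does not escape special characters or handle paths containing colons or commas.

```shell
#!/usr/bin/env bash

# annotate_errors
# Reads "path:line:message" lines on stdin and prints one ::error workflow
# command per line. Any further colons stay inside the message.
annotate_errors() {
  while IFS=: read -r file line message; do
    [ -n "$file" ] || continue   # skip blank lines
    printf '::error file=%s,line=%s::%s\n' "$file" "$line" "$message"
  done
}
```

Piping a linter's output through a filter like this in the failing branch of the Lint step would give one annotation per offending line rather than a single generic message.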

We also wrapped our most common test command to print only the failed tests at the end:

bash.bash

# In our test runner wrapper
# ... <run tests> ...
echo "::group::Failed tests summary"
# grep exits 1 when nothing matches, which would fail the step under -eo pipefail
grep -E "FAIL|✗" test-output.log | head -20 || true
echo "::endgroup::"

Helps engineers see what they need to fix in 5 seconds instead of 30.

6. Concurrency groups to prevent self-collision

We had a class of issue where two runs for the same branch (the second triggered by a follow-up push after a small fix) would both execute in parallel. The first one would win, the second would race against the first's deployment artifacts, and either could fail in confusing ways.

yaml.yaml

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

Cancels older runs of the same workflow + ref. The latest push wins. Eliminates the self-collision class of failures.

For the deploy workflow, we use a slightly different version that doesn't cancel — it queues:

yaml.yaml

concurrency:
  group: deploy-${{ github.event.inputs.environment }}
  cancel-in-progress: false

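Two simultaneous deploys to prod would be bad; one waiting for the other to finish is fine. Note that GitHub keeps at most one pending run per concurrency group, so if a third deploy is queued while one is already waiting, the older pending run is cancelled and the newest takes its place.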

What we measure now

The metric we track is simple: percentage of PRs whose CI is green on first run. Not "green eventually after retries"; first run.

That number was 73% a year ago. It's 96% now. The remaining 4% are real failures that an engineer needs to fix.
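For anyone who wants to track the same first-run metric, here is a rough sketch of the calculation. It assumes a hypothetical export with one line per CI run in the form "<pr_number> <run_attempt> <conclusion>"; the file format and the function name first_run_green_rate are illustrative, not part of any GitHub tooling.

```shell
#!/usr/bin/env bash

# first_run_green_rate FILE
# FILE has one line per CI run: "<pr_number> <run_attempt> <conclusion>".
# Prints the percentage of PRs whose first attempt succeeded; retries are ignored.
first_run_green_rate() {
  awk '
    $2 == 1 { total++; if ($3 == "success") green++ }
    END {
      if (total == 0) exit 1
      printf "%.1f\n", 100 * green / total
    }' "$1"
}
```

Counting only attempt 1 is the point: a PR that went green on its second attempt still counts as red here.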

We also track: median time to first feedback (PR open → first check completes). It was about 18 minutes; it's now 7. Most of that came from caching (#3) and timeout discipline (#4) reducing total runner time.

What we don't bother with

Self-hosted runners. We considered them for cost reasons. The reliability of self-hosted is on us; the reliability of GitHub-hosted is on GitHub. We picked the trade-off where we're not the SRE for our own runners.

Custom test sharding. We use jest --shard for the JS tests but didn't build anything bespoke. The naive sharding gave us most of the speedup; the more sophisticated balancing hasn't been worth the complexity.

What I'd tell a team starting from "CI is unreliable"

Pin everything first. The single biggest accelerant of CI flakiness is versions moving silently. Pinning is mechanical; do it in one PR per repo and set up Dependabot.

Then quarantine flaky tests. Don't try to fix them inline. The discipline of "any flake gets quarantined, owner has 30 days, then it's deleted" works because it removes the option to retry-and-merge.

Everything after that is incremental. The first two will get you most of the way.

The trap is to treat CI flakiness as a list of individual bugs to fix. It's a system that tolerates flakiness or doesn't. If you make flakes painful for everyone (quarantine + 30-day SLA + deletion), the team's behaviour changes and the system gets reliable. If you don't, no amount of individual fixes catches up.

About Kiril Urbonas