Our CI was 73% green at its worst point. People trusted it less than a coin flip. Here are six things we did to get to 96%, in rough order of impact.
A year ago our default-branch CI green rate was 73%. The other 27% was a soup of timeouts, flaky tests, transient registry pulls, and the occasional real failure. Engineers had stopped reading the failure messages — they just hit Re-run and moved on.
That's a worse state than a plainly red build. When the team trains itself to ignore CI, real failures slip through. We did six things over a quarter to bring the green rate up to 96%. They're listed below in order of the value they delivered, biggest first.
The biggest single contributor to flakiness was unpinned actions and base images. A workflow that ran actions/setup-node@v3 would silently move from 3.0.0 to 3.8.1 over months, and once in a while a new version had a bug that broke our specific use case for a day until the maintainers patched it.
Every action is now pinned to a SHA, not a tag:
# Before — moves over time
- uses: actions/setup-node@v4
# After — frozen
- uses: actions/setup-node@60edb5dd545a775178f52524783378180af0d1f8 # v4.0.2
Dependabot keeps these up to date with PRs we review. The diff per upgrade is small and intentional. If a Dependabot PR breaks CI, it's contained — main is still on the previous SHA.
We do the same for Docker base images (digest pinning, not tag). For our self-built builder image (with our toolchain pre-installed) we use a content-addressable tag like builder:8a4f2c1. Nothing in CI references :latest or :main ever.
This change alone took us from 73% to about 86% green over two weeks.
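Pinning only holds if nothing unpinned creeps back in. The following is a minimal sketch of a check that could run in CI, assuming workflow files are read as plain text; the function name `find_unpinned` and the regular expressions are illustrative, not something the setup above is known to use.

```python
import re

# Matches a `uses:` reference and captures the action name and the ref after "@".
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^@\s]+)@(\S+)")
# A full commit SHA is exactly 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def find_unpinned(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, action, ref) for every `uses:` not pinned to a SHA."""
    unpinned = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match is None:
            # Not a `uses:` line, or a local action (./path) with no "@ref".
            continue
        action, ref = match.groups()
        if not SHA_RE.match(ref):
            unpinned.append((lineno, action, ref))
    return unpinned


if __name__ == "__main__":
    sample = (
        "steps:\n"
        "  - uses: actions/setup-node@v4\n"
        "  - uses: actions/setup-node@60edb5dd545a775178f52524783378180af0d1f8 # v4.0.2\n"
    )
    for lineno, action, ref in find_unpinned(sample):
        print(f"line {lineno}: {action}@{ref} is not pinned to a commit SHA")
```

A check like this fails the build on tag references such as `@v4` while leaving SHA-pinned lines, including the trailing version comment, alone.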
We had ~40 tests that flaked sporadically. The natural instinct is to fix them as you find them, but engineers under time pressure don't fix them — they retry the job and merge. The tests stay flaky.
We added a script that runs after every CI failure: it checks if the failed test has flaked in the last 30 days. If yes, the test is automatically quarantined — moved to a separate test job that runs but doesn't block merges. The author of the original PR isn't blocked.
Quarantined tests get a Jira ticket auto-filed against the team that owns them. The ticket has a 30-day SLA. If it's not fixed in 30 days, the test gets deleted from the codebase, full stop. We've deleted maybe 12 tests this way; nobody has missed any of them.
This had two effects: PRs stopped getting blocked by flakes (so engineers stopped retrying and merging through), and the team learned which areas of the codebase were producing flaky tests, which was usually a sign of a real reliability bug in the system under test.
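The quarantine rule can be written down as a small decision function. This is a sketch under assumptions: how flakes are detected and stored is not described above, so the `FlakeRecord` type, its fields, and the action names returned by `next_action` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

FLAKE_WINDOW = timedelta(days=30)  # look-back window for earlier flakes
FIX_SLA = timedelta(days=30)       # time the owning team has to fix the test


@dataclass
class FlakeRecord:
    name: str
    flake_dates: list[date] = field(default_factory=list)  # days the test flaked
    quarantined_on: Optional[date] = None                  # None: still blocks merges


def next_action(record: FlakeRecord, failed_now: bool, today: date) -> str:
    """Decide what happens to one test after a CI run."""
    if record.quarantined_on is not None:
        # Quarantined tests are deleted once the SLA has expired.
        overdue = today - record.quarantined_on > FIX_SLA
        return "delete" if overdue else "stay-quarantined"
    if not failed_now:
        return "none"
    flaked_recently = any(
        timedelta(0) <= today - d <= FLAKE_WINDOW for d in record.flake_dates
    )
    # A failure with recent flake history is quarantined and ticketed;
    # one without it is treated as a real failure and blocks the merge.
    return "quarantine" if flaked_recently else "block"
```

The point of keeping the rule this mechanical is that nobody has to decide, under deadline pressure, whether a particular flake deserves attention.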
Most workflows do npm install or pip install as part of every job. Caching this saves enormous time but also introduces its own flakiness — a cache miss shouldn't cause a job to fail, but a cache containing the WRONG dependencies will.
Our cache key includes:
- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}-${{ hashFiles('.nvmrc') }}
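The package-lock.json hash means a lockfile change invalidates the cache. The .nvmrc hash means a Node version change invalidates the cache. The runner.os component keeps Linux, macOS, and Windows caches apart; note that it evaluates to Linux on every Ubuntu image, so on its own it does not separate ubuntu-22.04 from ubuntu-24.04, and the image label has to go into the key as well if that distinction matters. We never take a cache hit across these dimensions; we only reuse a cache when the inputs that produced it are byte-identical.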
We had a class of weird failures where a cache from package-lock.json revision A was being reused on revision B because the cache key didn't include the lockfile. The dependencies were technically wrong but compatible enough to mostly-work. Fixed by hashing the lockfile.
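The principle behind the key is that every input to the install step contributes to it. A minimal sketch of that idea, assuming plain SHA-256 as a stand-in for hashFiles (whose exact algorithm is not reproduced here) and a hypothetical helper named `cache_key`:

```python
import hashlib


def cache_key(runner_os: str, lockfile: bytes, nvmrc: bytes) -> str:
    """Build a key that changes whenever any input to the install changes."""
    lock_hash = hashlib.sha256(lockfile).hexdigest()
    node_hash = hashlib.sha256(nvmrc).hexdigest()
    return f"npm-{runner_os}-{lock_hash}-{node_hash}"


if __name__ == "__main__":
    rev_a = cache_key("Linux", b'{"lockfileVersion": 3, "rev": "A"}', b"20\n")
    rev_b = cache_key("Linux", b'{"lockfileVersion": 3, "rev": "B"}', b"20\n")
    # Different lockfile contents can never share a cache entry.
    print(rev_a == rev_b)
```

With the lockfile left out of the key, revisions A and B would map to the same entry, which is the failure mode described above.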
Before this work, our default job timeout was 6 hours (the GitHub default). The intent was "don't make tests fail just because they're slow." The effect was that one stuck test could waste an hour of runner time before someone noticed and cancelled it.
We set explicit timeouts everywhere:
jobs:
  unit-tests:
    timeout-minutes: 12 # we measured; p99 is ~7m
  integration-tests:
    timeout-minutes: 25
  deploy:
    timeout-minutes: 15
Numbers come from measuring p99 actual runtime over a month and adding 50%. Anything that takes longer than that has gone wrong; we'd rather fail fast and re-run.
Stuck jobs went from "occasionally noticed by a human after 60 minutes" to "killed automatically in 15-25 minutes." Total CI runner time dropped about 22%.
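The "p99 plus 50%" rule is easy to compute from a month of job durations. This sketch assumes the durations are already available as a list of minutes and uses a nearest-rank percentile; the function names are illustrative.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timeout_minutes(runtimes_min: list[float], headroom: float = 0.5) -> int:
    """p99 runtime plus headroom, rounded up to a whole minute."""
    return math.ceil(percentile(runtimes_min, 99) * (1 + headroom))


if __name__ == "__main__":
    # Hypothetical month of unit-test runtimes: mostly 4 minutes, a slow tail at 7.
    samples = [4.0] * 98 + [7.0, 7.0]
    print(timeout_minutes(samples))
```

With a p99 of exactly 7 minutes this yields 11, in the same range as the 12 used for unit tests; rounding up a little further is a judgment call.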
A workflow that fails on step 14 of 18 should report exactly what failed, not bury the error in 500 lines of teardown logs. Two changes made the failures actually readable:
# Fail each run step on the first failing command, including failures inside a pipeline
defaults:
  run:
    shell: bash --noprofile --norc -eo pipefail {0}
# And use ::error:: annotations for the most common failure modes
- name: Lint
  run: |
    if ! npm run lint; then
      echo "::error::lint failed; see the log above for the offending file"
      exit 1
    fi
The ::error:: annotation surfaces the message in the run summary and in the PR's checks view, so engineers see it without scrolling. Getting to the failing step is one click instead of a scroll through 200 lines of log.
We also wrapped our most common test command to print only the failed tests at the end:
# In our test runner wrapper
... <run tests> ...
echo "::group::Failed tests summary"
# -m 20 caps the output; "|| true" keeps a no-match grep from failing the step under -e
grep -E -m 20 "FAIL|✗" test-output.log || true
echo "::endgroup::"
This helps engineers see what they need to fix in about 5 seconds instead of 30.
We had a class of issue where two runs for the same branch (the second triggered by a small follow-up push) would execute in parallel. The second would race against the first's deployment artifacts, and either could fail in confusing ways.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
Cancels older runs of the same workflow + ref. The latest push wins. Eliminates the self-collision class of failures.
For the deploy workflow, we use a slightly different version that doesn't cancel — it queues:
concurrency:
  group: deploy-${{ github.event.inputs.environment }}
  cancel-in-progress: false
Two simultaneous deploys to prod would be bad; one waiting for the other to finish is fine.
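The difference between the two settings is easier to see in a toy model. This is a simplified sketch of the documented behaviour, not GitHub's implementation; the `ConcurrencyGroup` class is hypothetical. One detail it captures is that a group holds at most one pending run, so with cancel-in-progress set to false a newer queued run replaces an older queued one instead of lining up behind it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConcurrencyGroup:
    cancel_in_progress: bool
    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: list[str] = field(default_factory=list)

    def start(self, run_id: str) -> None:
        """A new run arrives for this group."""
        if self.running is None:
            self.running = run_id
        elif self.cancel_in_progress:
            # Latest push wins: the in-flight run is cancelled.
            self.cancelled.append(self.running)
            self.running = run_id
        else:
            # Only one run may wait; an older waiting run is dropped.
            if self.pending is not None:
                self.cancelled.append(self.pending)
            self.pending = run_id

    def finish(self) -> None:
        """The running run completes; the pending one, if any, takes over."""
        self.running, self.pending = self.pending, None


if __name__ == "__main__":
    deploy = ConcurrencyGroup(cancel_in_progress=False)
    for run in ("deploy-1", "deploy-2", "deploy-3"):
        deploy.start(run)
    print(deploy.running, deploy.pending, deploy.cancelled)
```

For deploys this means the queue never grows beyond one waiting run, which is usually what you want: the most recent requested state is the one that ships next.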
The metric we track is simple: percentage of PRs whose CI is green on first run. Not "green eventually after retries"; first run.
That number was 73% a year ago. It's 96% now. The 4% red is real failures that an engineer needs to fix.
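The distinction between "green on first run" and "green eventually" matters when computing the number. A minimal sketch, assuming one record per CI attempt with a hypothetical `CiRun` shape (where the data comes from is not specified above):

```python
from dataclasses import dataclass


@dataclass
class CiRun:
    pr: int        # pull request number
    attempt: int   # 1 for the first run, 2 or more for re-runs
    passed: bool


def first_run_green_rate(runs: list[CiRun]) -> float:
    """Share of first attempts that passed; re-runs are ignored entirely."""
    first_attempts = [r for r in runs if r.attempt == 1]
    if not first_attempts:
        raise ValueError("no first-attempt runs to measure")
    return sum(r.passed for r in first_attempts) / len(first_attempts)


if __name__ == "__main__":
    runs = [
        CiRun(pr=1, attempt=1, passed=True),
        CiRun(pr=2, attempt=1, passed=False),
        CiRun(pr=2, attempt=2, passed=True),  # re-run rescued it; still counts as red
        CiRun(pr=3, attempt=1, passed=True),
        CiRun(pr=4, attempt=1, passed=True),
    ]
    print(f"{first_run_green_rate(runs):.0%}")
```

Counting PR 2 as green because its second attempt passed would hide exactly the retry-and-merge behaviour the metric is meant to expose.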
We also track: median time to first feedback (PR open → first check completes). It was about 18 minutes; it's now 7. Most of that came from caching (#3) and timeout discipline (#4) reducing total runner time.
... jest --shard for the JS tests but didn't build anything bespoke. The naive sharding gave us most of the speedup; the more sophisticated balancing hasn't been worth the complexity.
Pin everything first. The single biggest accelerant of CI flakiness is moving versions silently. Pinning is mechanical; do it in one PR per repo and set up Dependabot.
Then quarantine flaky tests. Don't try to fix them inline. The discipline of "any flake gets quarantined, owner has 30 days, then it's deleted" works because it removes the option to retry-and-merge.
Everything after that is incremental. The first two will get you most of the way.
The trap is to treat CI flakiness as a list of individual bugs to fix. It's a system that tolerates flakiness or doesn't. If you make flakes painful for everyone (quarantine + 30-day SLA + deletion), the team's behaviour changes and the system gets reliable. If you don't, no amount of individual fixes catches up.