Most CI caches either miss constantly or restore stale junk. The cache-key discipline, scope boundaries, and measurements that turned our pipeline cache from theatre into real minutes saved.
Caching is the first optimization everyone reaches for when CI gets slow, and the one most often done wrong. A cache that misses on every run costs you upload/download time for zero benefit. A cache that restores stale artifacts costs you a flaky build that's worse than no cache at all. This is what we learned getting our pipeline cache from "configured" to "actually saving minutes."
Before tuning anything, instrument the hit rate. Most CI systems report cache restore status; surface it.
- name: Restore deps
  id: cache
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
- name: Report cache status
  run: echo "cache-hit=${{ steps.cache.outputs.cache-hit }}"
We found our "cache" was hitting 11% of the time. The key included a timestamp someone added during debugging months earlier. Every run wrote a new entry and never matched. The cache was pure overhead.
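To turn those per-run status lines into a single number, something like the following could work. It is a minimal sketch: the `hit_rate` helper and the sample lines are hypothetical, and it assumes you have already collected the `cache-hit=...` lines from your job logs by some means.

```python
from collections import Counter

def hit_rate(log_lines):
    """Return the fraction of runs whose log line reports cache-hit=true."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line.startswith("cache-hit="):
            continue  # ignore unrelated log output
        value = line.split("=", 1)[1]
        # Assumption: only the literal "true" is an exact-key hit; "false"
        # (prefix fallback) and an empty value (nothing restored) are misses.
        counts["hit" if value == "true" else "miss"] += 1
    total = counts["hit"] + counts["miss"]
    return counts["hit"] / total if total else 0.0

# Made-up sample: one exact hit out of four runs.
sample = ["cache-hit=true", "cache-hit=false", "cache-hit=", "cache-hit=false"]
print(f"{hit_rate(sample):.0%}")  # 25%
```

Counting prefix-fallback restores as misses is a deliberate choice here: it measures how often the key matched exactly, which is the number a timestamp in the key drives toward zero.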
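A cache key has one job: change exactly when the cached content should change, and not before. Two failure modes: a key that changes too often (our timestamp) misses on every run, and a key that changes too rarely restores stale content.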
The right key is a hash of the inputs that determine the output. For dependencies, that's the lockfile — not package.json, the lockfile, because that's what pins exact versions.
key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
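The property that matters is that the key is a pure function of the lockfile's contents. A rough illustration in Python, where `cache_key` and the lockfile bytes are invented for the example and SHA-256 stands in for what `hashFiles` computes:

```python
import hashlib

def cache_key(os_name, lockfile_bytes):
    """Build a key shaped like npm-<os>-<sha256 of the lockfile contents>."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"npm-{os_name}-{digest}"

# Hypothetical lockfile contents before and after a one-package bump.
lock_v1 = b'{"lodash": "4.17.20"}'
lock_v2 = b'{"lodash": "4.17.21"}'

# Same lockfile -> same key, so an identical run gets an exact hit.
assert cache_key("Linux", lock_v1) == cache_key("Linux", lock_v1)
# Changed lockfile -> new key, so the old entry is never an exact match.
assert cache_key("Linux", lock_v1) != cache_key("Linux", lock_v2)
```

Anything mixed into the key that is not an input to the install (a timestamp, a run number) breaks the first assertion and turns every run into a miss.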
When the exact key misses (lockfile changed), restore-keys lets you fall back to the most recent prefix match. You get the previous run's ~/.npm download cache, so npm ci fetches only the packages that changed instead of downloading everything.
restore-keys: |
  npm-${{ runner.os }}-
This is the difference between a 90-second cold install and a 12-second warm one on a one-package bump. The exact-key write keeps future identical runs at a full hit; the prefix fallback keeps changed runs from going fully cold.
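The lookup order can be modelled in a few lines. This is a simplified sketch, not the cache service's actual implementation: `resolve` and the entries are hypothetical, and it ignores details such as branch scoping.

```python
def resolve(key, restore_keys, entries):
    """Pick a cache entry: exact key first, then newest entry matching a prefix.

    entries is a list of (key, created_at) pairs.
    Returns (matched_key, exact_hit), or (None, False) if nothing matches.
    """
    for entry_key, _ in entries:
        if entry_key == key:
            return entry_key, True
    for prefix in restore_keys:  # prefixes are tried in the order listed
        matches = [(created, k) for k, created in entries if k.startswith(prefix)]
        if matches:
            return max(matches)[1], False  # most recently created match
    return None, False

# Two earlier runs saved entries under different lockfile hashes.
entries = [("npm-Linux-aaa", 1), ("npm-Linux-bbb", 2)]

print(resolve("npm-Linux-bbb", ["npm-Linux-"], entries))  # ('npm-Linux-bbb', True)
print(resolve("npm-Linux-ccc", ["npm-Linux-"], entries))  # ('npm-Linux-bbb', False)
print(resolve("npm-macOS-ccc", ["npm-macOS-"], entries))  # (None, False)
```

The second call is the one-package-bump case: no exact match, but the newest entry with the same prefix is restored, so most packages are already local.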
Don't cache build outputs keyed on source unless the build is deterministic. We cached a compiled bundle keyed on the lockfile, and it served a stale bundle because the source had changed but dependencies hadn't. Cache the inputs to an expensive step (downloaded packages, base layers), not the step's product, unless you key on the full input set.
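If you do cache a build product, the key has to cover everything the build reads. A minimal sketch of the idea, with a hypothetical `build_key` helper and made-up file contents:

```python
import hashlib

def build_key(inputs):
    """Hash every build input. inputs maps file path -> file contents (bytes)."""
    h = hashlib.sha256()
    for path in sorted(inputs):  # sorted so iteration order cannot change the key
        h.update(path.encode())
        h.update(b"\0")          # separator between path and contents
        h.update(inputs[path])
        h.update(b"\0")
    return "bundle-" + h.hexdigest()

deps = {"package-lock.json": b"lock-v1"}
before = {**deps, "src/app.js": b"console.log(1)"}
after = {**deps, "src/app.js": b"console.log(2)"}

# A key built from the lockfile alone is identical before and after the source
# edit, which is how a stale bundle gets served.
lockfile_only_key = build_key(deps)

# A key over the full input set changes when the source changes.
assert build_key(before) != build_key(after)
```

In workflow terms this corresponds to hashing the source tree along with the lockfile, at the price of a miss on nearly every commit, which is why caching the inputs is usually the better trade.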
Good caching candidates:
- Package manager download caches (~/.npm, ~/.cache/pip, ~/.cargo)
- Compiler caches (sccache, ccache)
- Docker layer caches via --cache-from
Bad candidates:
- node_modules itself across major version bumps (platform-specific binaries)
For image builds, ordering the Dockerfile so that rarely-changing layers come first is more impactful than any external cache. Copy the lockfile and install before copying source:
COPY package-lock.json package.json ./
RUN npm ci
COPY . .
RUN npm run build
Now a source-only change reuses the npm ci layer. Combined with registry-backed cache (--cache-from type=registry), cold runners still get warm layers.
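Why the ordering matters can be seen in a toy model of layer caching. This is a deliberate simplification, not Docker's real algorithm: each layer's identity is assumed to depend on its parent layer, its instruction text, and the files it copies, and `layer_ids`, `reused`, and `dockerfile` are hypothetical helpers.

```python
import hashlib

def layer_ids(steps):
    """steps is a list of (instruction, input_bytes); each id chains on its parent."""
    ids, parent = [], ""
    for instruction, data in steps:
        digest = hashlib.sha256(parent.encode() + instruction.encode() + data).hexdigest()
        ids.append(digest)
        parent = digest
    return ids

def reused(old, new):
    """Count the leading layers whose ids are unchanged (cache hits)."""
    count = 0
    for a, b in zip(old, new):
        if a != b:
            break  # every layer after the first change is rebuilt
        count += 1
    return count

def dockerfile(lock, src):
    """The four steps from the snippet above, with the bytes each one reads."""
    return [
        ("COPY package-lock.json package.json ./", lock),
        ("RUN npm ci", b""),
        ("COPY . .", lock + src),
        ("RUN npm run build", b""),
    ]

old = layer_ids(dockerfile(b"lock-v1", b"src-v1"))
source_edit = layer_ids(dockerfile(b"lock-v1", b"src-v2"))
dependency_bump = layer_ids(dockerfile(b"lock-v2", b"src-v1"))

print(reused(old, source_edit))      # 2: the lockfile COPY and npm ci are reused
print(reused(old, dependency_bump))  # 0: everything rebuilds
```

If `COPY . .` came first, the source edit would invalidate the first layer and the count would drop to zero for every commit.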
After fixing the key, adding restore-keys, and reordering the Dockerfile:
The cache upload/download overhead is real — roughly 8–15s per cache. It only pays off when the hit rate is high enough that the saved work exceeds that overhead. Below ~40% hit rate, we found several caches were net-negative and removed them.
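The break-even point is simple arithmetic. In the sketch below, the 12 s overhead sits inside the 8–15 s range above, while the 30 s saved per hit is a hypothetical figure chosen only to illustrate a cache whose break-even lands near 40%. It also assumes the overhead is paid on every run, which slightly overstates the cost of exact hits.

```python
def net_saving(hit_rate, saved_on_hit, overhead):
    """Expected seconds saved per run; a negative value means the cache costs time."""
    return hit_rate * saved_on_hit - overhead

def break_even_hit_rate(saved_on_hit, overhead):
    """Hit rate at which the expected saving exactly cancels the overhead."""
    return overhead / saved_on_hit

OVERHEAD = 12.0      # seconds, within the 8-15 s range quoted above
SAVED_ON_HIT = 30.0  # seconds, hypothetical

print(f"{break_even_hit_rate(SAVED_ON_HIT, OVERHEAD):.0%}")  # 40%
print(f"{net_saving(0.11, SAVED_ON_HIT, OVERHEAD):.1f}")     # -8.7
print(f"{net_saving(0.90, SAVED_ON_HIT, OVERHEAD):.1f}")     # 15.0
```

A cache that saves much more per hit breaks even at a far lower hit rate, so the threshold is worth computing per cache instead of applying one number everywhere.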
Cache the expensive, deterministic inputs. Key on the exact thing that invalidates them. Add a prefix fallback for partial reuse. Then measure the hit rate — a cache you don't measure is a cache you can't trust, and an untrusted cache is one more thing making your builds slow and flaky at the same time.