We mapped every byte that ends up in our production containers. The map showed three places trust was implicit. Each became a control.
We started this work after a news cycle about a compromised npm package. Nothing of ours was affected, but the question that came up internally was uncomfortable: if a malicious package were introduced into one of our dependencies tomorrow, would we ship it to production?
Honest answer: probably yes, within a day or two. We had no controls between "PR merged" and "container running in prod" that would catch a malicious dependency. The work below was the response.
Before adding controls, we mapped where bytes that end up in our production containers come from. The chain looked like:
package.json/requirements.txt/go.mod
node:20-alpine
Each numbered item represents a place where an attacker could insert hostile code if they compromised the corresponding party. We hadn't actively controlled roughly half of those points (3, 5, 7, 8).
The first instance of trust we had to make explicit was the resolver. Saying "we depend on react: ^19" trusts npm's resolver to pick a safe version. Saying "we depend on react@19.2.4 with hash sha512-..." trusts only what we've previously verified.
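To make the difference concrete, here is a minimal Python sketch of the check a pinned hash enables. It assumes the npm lockfile convention of an integrity string in the form sha512-<base64 digest>; the function names and the sample bytes are hypothetical, not part of our tooling.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Return an npm-style integrity string: 'sha512-' + base64(SHA-512 digest)."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def matches_lockfile(data: bytes, expected_integrity: str) -> bool:
    """True only if the downloaded bytes hash to the value recorded in the lockfile."""
    return sri_sha512(data) == expected_integrity


# Hypothetical tarball contents; a real check would read the downloaded .tgz.
verified = b"contents of the tarball we reviewed"
recorded = sri_sha512(verified)  # the value a lockfile would store

print(matches_lockfile(verified, recorded))              # True
print(matches_lockfile(b"tampered contents", recorded))  # False
```

A version range asks the resolver to choose; a recorded hash only asks it to fetch, and any substitution shows up as a mismatch.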
We require:
Lockfiles committed (package-lock.json, poetry.lock, go.sum)
npm ci (not npm install): fails if the lockfile is stale or any hash mismatches
The CI check is the one that closes the gap. Without it, an engineer could accidentally update a dependency by running npm install foo locally and not commit the lockfile change. The next CI run would re-resolve, possibly to a different version, possibly a compromised one.
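One way to audit lockfile hygiene is to look for entries that carry no hash at all. The sketch below assumes the lockfileVersion 2/3 layout of package-lock.json (a top-level packages map keyed by install path); the sample lockfile, the example-unpinned package, the registry URL, and the function name are made up for illustration.

```python
import json


def entries_missing_integrity(lock: dict) -> list[str]:
    """List package paths in a package-lock.json that have no integrity hash."""
    missing = []
    for path, meta in lock.get("packages", {}).items():
        # The root project (""), workspace links, and bundled dependencies
        # are not expected to carry their own integrity field.
        if path == "" or meta.get("link") or meta.get("inBundle"):
            continue
        if "integrity" not in meta:
            missing.append(path)
    return sorted(missing)


# Hypothetical lockfile; the hash is elided on purpose.
sample_lock = json.loads("""
{
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "app", "version": "1.0.0"},
    "node_modules/react": {
      "version": "19.2.4",
      "resolved": "https://registry.npmjs.org/react/-/react-19.2.4.tgz",
      "integrity": "sha512-..."
    },
    "node_modules/example-unpinned": {"version": "1.3.0"}
  }
}
""")

print(entries_missing_integrity(sample_lock))  # ['node_modules/example-unpinned']
```

Some legitimate entries (git dependencies, for example) may also show up here, so treat the output as a review list rather than a hard failure.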
This caught one issue in our codebase — a dev dependency had drifted in a feature branch. Renovate noticed and opened a PR. Mundane fix; the value is the visibility.
FROM node:20-alpine points at a moving tag. Yesterday's node:20-alpine can differ from today's: the image content is whatever was on Docker Hub when you built. If node:20-alpine were silently replaced by a malicious version (we've never seen this happen, but it's possible), we'd have shipped it.
We pin to digest:
FROM node:20-alpine@sha256:1a7d234c8b00b001a... AS builder
The digest is content-addressed. The bytes can't change without the digest changing. Renovate updates these via PR; we see the digest change and approve the new one explicitly.
For our self-built builder image (with our toolchain), we tag by content hash and pin to that:
FROM ourorg/builder@sha256:8a4f2c1...
Nothing in our Dockerfiles references :latest or :main or any moving tag.
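A rule like this is easy to check mechanically. The following Python sketch flags FROM lines that are not pinned by digest. It is an illustration under simplifying assumptions (no ARG substitution in image names, no line continuations inside FROM), and unpinned_base_images is a hypothetical helper rather than the check we run.

```python
import re

# FROM [--platform=...] <image> [AS <stage>]
FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?",
    re.IGNORECASE,
)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return image references in FROM lines that lack an @sha256: digest."""
    stages = set()   # names of earlier build stages, e.g. "builder"
    unpinned = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image, alias = match.group(1), match.group(2)
        is_stage_ref = image.lower() in stages  # FROM builder refers to a stage
        if image != "scratch" and not is_stage_ref and "@sha256:" not in image:
            unpinned.append(image)
        if alias:
            stages.add(alias.lower())
    return unpinned


sample = """\
FROM node:20-alpine@sha256:1a7d234c8b00b001a... AS builder
RUN npm ci
FROM builder AS test
FROM nginx:latest
FROM scratch
"""

print(unpinned_base_images(sample))  # ['nginx:latest']
```

Running something like this over every Dockerfile in a repository gives both a CI gate and a count of pinned versus unpinned references.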
Pinning by digest stops compromise of the base layer. It doesn't stop compromise of OUR builds — if our CI were compromised, an attacker could push a malicious image with a valid (pinned) name to our own registry, and our cluster would pull it.
We sign every image we build and require valid signatures on pull. The implementation uses sigstore/cosign:
# In CI, after build:
cosign sign --key cosign.key "$REGISTRY/$IMAGE@$DIGEST"
# In Kubernetes, via admission policy:
# Reject any pod whose image isn't signed by our key
The Kubernetes admission policy is a Kyverno rule:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: enforce
  rules:
    - name: check-image-signature
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "registry.internal/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...our public key...
                      -----END PUBLIC KEY-----
Now even if an attacker pushed an image to our registry, the cluster would refuse to pull it without a valid signature from our CI key.
The signing key lives in HashiCorp Vault and is only accessible to the CI workflow, gated by GitHub OIDC + branch restriction (only main branch builds can sign). Even a compromised PR can't sign an image.
Honest about scope:
A compromised direct dependency you've blessed. If one of our pinned dependencies is malicious (e.g., a popular package that someone took over), we'll ship it. Pinning slows the attack but doesn't prevent it. Mitigation: Trivy scans + Snyk for known vulnerabilities; vigilance on Dependabot PRs.
A compromised CI runner. If GitHub Actions itself is compromised, our signing key is still safe (it's in Vault), but anything that the CI is allowed to do (build, scan, sign) could be done with malicious intent. Mitigation: limit blast radius with role-scoped credentials, monitor for unusual activity.
A zero-day in a base image. If node:20-alpine has a critical CVE we don't know about, we ship the CVE. Mitigation: regular base-image updates, pinned but rotated weekly.
A malicious commit by an authorized contributor. If a developer with merge rights goes rogue, none of these controls help. Mitigation: code review requirement, audit trails, separation of duties for sensitive paths.
These aren't gaps in our supply chain controls per se — they're attacks at a different layer. We document them so people know what we're not promising.
For every image we build, we generate a Software Bill of Materials (SBOM) and attach it as a sigstore attestation:
syft "$REGISTRY/$IMAGE@$DIGEST" -o spdx-json > sbom.spdx.json
cosign attest --key cosign.key --predicate sbom.spdx.json --type spdxjson \
  "$REGISTRY/$IMAGE@$DIGEST"
When a CVE in a popular package gets disclosed, we can answer "which of our running images contains the affected version" by querying the SBOMs. Without SBOMs, that's a manual exercise that takes hours and gets stale immediately.
We've used this once for real, when an xz-utils issue was reported in early 2024. Within an hour of the disclosure, we knew which of our images were affected and could prioritize rebuild and rollout. Without SBOMs we'd have spent half a day at minimum.
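A query of that kind can be sketched in a few lines of Python. This assumes SPDX 2.x JSON documents (as produced by syft -o spdx-json), where each entry in the top-level packages array has name and versionInfo fields. The image names, version strings, and the images_containing helper are illustrative, and real distro versions often carry suffixes that need looser matching than the exact comparison used here.

```python
def images_containing(
    sboms: dict[str, dict], package: str, bad_versions: set[str]
) -> list[str]:
    """Return the images whose SBOM lists `package` at one of `bad_versions`."""
    hits = []
    for image, sbom in sboms.items():
        for pkg in sbom.get("packages", []):
            if pkg.get("name") == package and pkg.get("versionInfo") in bad_versions:
                hits.append(image)
                break  # one match is enough to flag the image
    return sorted(hits)


# Hypothetical SBOMs keyed by image reference; in practice these would be
# loaded with json.load from the stored sbom.spdx.json files.
sboms = {
    "registry.internal/api@sha256:aaa...": {
        "packages": [
            {"name": "xz-utils", "versionInfo": "5.6.1"},
            {"name": "zlib", "versionInfo": "1.3.1"},
        ]
    },
    "registry.internal/web@sha256:bbb...": {
        "packages": [{"name": "xz-utils", "versionInfo": "5.4.5"}]
    },
}

print(images_containing(sboms, "xz-utils", {"5.6.0", "5.6.1"}))
# ['registry.internal/api@sha256:aaa...']
```

Keying the SBOMs by digest rather than by tag matters: the answer then names exactly the bytes that are running.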
A subtle but important thing: our CI doesn't have any long-lived AWS or signing credentials. It authenticates to AWS via GitHub OIDC (assuming a role) and to Vault via the same mechanism (Vault is configured to trust GitHub OIDC tokens for specific repos and branches).
The trust policy is tight:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": "arn:aws:iam::123:oidc-provider/token.actions.githubusercontent.com"},
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:sub": "repo:ourorg/ourrepo:ref:refs/heads/main"
      }
    }
  }]
}
Only the main branch's Actions can assume this role. PR builds (which run untrusted code from contributors) cannot. This means: a malicious PR cannot publish a signed image. Only main-branch builds can. Combined with branch protection rules requiring code review before merge, the path from "malicious PR" to "running in prod" requires compromising a reviewer's account.
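The effect of the StringEquals condition can be modeled in a few lines of Python. This is a simplified stand-in for what AWS evaluates, not its implementation. The sub values follow the format GitHub documents for its OIDC tokens (repo:ORG/REPO:ref:refs/heads/BRANCH for branch builds, repo:ORG/REPO:pull_request for pull requests), and may_assume_role is a hypothetical name.

```python
# The only subject the trust policy accepts (exact match, no wildcards).
ALLOWED_SUB = "repo:ourorg/ourrepo:ref:refs/heads/main"


def may_assume_role(claims: dict) -> bool:
    """Model of StringEquals on the token's sub claim: exact equality only."""
    return claims.get("sub") == ALLOWED_SUB


main_build = {"sub": "repo:ourorg/ourrepo:ref:refs/heads/main"}
pr_build = {"sub": "repo:ourorg/ourrepo:pull_request"}
feature_build = {"sub": "repo:ourorg/ourrepo:ref:refs/heads/feature-x"}
other_repo = {"sub": "repo:someone/fork:ref:refs/heads/main"}

print(may_assume_role(main_build))     # True
print(may_assume_role(pr_build))       # False
print(may_assume_role(feature_build))  # False
print(may_assume_role(other_repo))     # False
```

Exact matching is the design choice that matters here: a wildcard condition such as StringLike on repo:ourorg/* would let any branch or pull request in the organization through.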
Three things, monthly:
:tag references vs @sha256:... references. Goal: 100%. Currently 100% in main branches.
The numbers being 100% is the point. If they slip, something has slipped past the controls; we'd want to know.
The controls add real friction for engineers:
Total overhead: maybe 30-60 minutes per engineer per month. We accept the cost.
Do it in this order:
Skipping ahead is tempting but each layer relies on the previous one. SBOMs without pinning are useless because the SBOM doesn't reflect what you actually shipped. Signing without OIDC means a leaked CI key is a disaster.
The whole thing took us about a quarter of dedicated effort, spread across two engineers. Most of the benefit came from steps 1-3; steps 4-5 are paying it forward against a class of incidents we hope to never see.