We scan every container image in CI and at runtime. Trivy + Cosign + admission controllers. The setup that earns its place and what we wish we'd known.
Every container image we build goes through security scanning. The scanners flag CVEs in base images and dependencies; the policies determine what blocks deploys; the runtime checks verify what's actually running matches what was scanned. After tuning the system over a few years, this is what we run, with the production reasons each piece earns its place.
The goals:
Each of these maps to specific tooling.
A typical image's security journey:
Build → Scan (CI) → Sign → Push to registry → Verify on deploy → Run
In detail:
If anything fails along the way, the deploy is blocked. The unhappy path doesn't give you "a slightly less secure container" — it gives you no deploy.
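The fail-closed behavior can be sketched in a few lines of Python. This is an illustration only: `run_pipeline` and the stage names are hypothetical, and a real pipeline would call the scanner, signer, and registry instead of the stand-in lambdas used here.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure (fail closed).

    Returns (deployable, names of the stages that passed).
    """
    passed: list[str] = []
    for name, step in stages:
        try:
            ok = step()
        except Exception:
            ok = False  # a crashing stage counts as a failure, never as a pass
        if not ok:
            return False, passed
        passed.append(name)
    return True, passed


# Stand-in stages: the scan fails, so signing never runs.
example_stages: list[Stage] = [
    ("build", lambda: True),
    ("scan", lambda: False),
    ("sign", lambda: True),
]
```

With `example_stages`, the result is "not deployable" and only the build stage is recorded as passed. There is no partial success to deploy.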
Trivy is the open-source scanner we use. It checks:
A typical Trivy run on a container takes 10-30 seconds. Fast enough to fit in CI.
The scan in CI:
trivy image \
  --severity HIGH,CRITICAL \
  --exit-code 1 \
  --ignore-unfixed \
  --format table \
  $IMAGE_TAG
What this means:
--ignore-unfixed → only fail on CVEs that have a fix available (don't block on unfixable issues, which are usually old CVEs in stable distros)
The naive setup is "any HIGH/CRITICAL CVE blocks the deploy." This is too strict; you'll be unable to ship anything because some library will always have a recent CVE.
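To make the gating logic concrete, here is a minimal Python sketch that applies the same two filters to a Trivy report, assuming the scan was run with `--format json` instead of `table`. The function name and the sample report (including its placeholder CVE IDs) are hypothetical. The field names `Results`, `Vulnerabilities`, `Severity`, and `FixedVersion` follow Trivy's JSON output.

```python
BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict, ignore_unfixed: bool = True) -> list[dict]:
    """Return the vulnerabilities that would fail the build."""
    blocking = []
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") not in BLOCKING_SEVERITIES:
                continue  # below the severity threshold
            if ignore_unfixed and not vuln.get("FixedVersion"):
                continue  # no fix available yet, so nothing actionable
            blocking.append(vuln)
    return blocking


# Hypothetical report fragment with placeholder CVE IDs.
sample_report = {
    "Results": [
        {
            "Target": "example-image",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL", "FixedVersion": "1.2.4"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH", "FixedVersion": ""},
                {"VulnerabilityID": "CVE-0000-0003", "Severity": "MEDIUM", "FixedVersion": "2.0.1"},
            ],
        }
    ]
}
```

In the sample, only the first finding blocks: the second has no fix and the third is below the threshold.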
Our thresholds:
The allowlist is non-trivial. Each entry has an owner, a reason, and an expiry date. An allowlist entry that's been there for 6 months gets re-reviewed.
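An allowlist with owner, reason, and expiry could be modeled roughly as follows. This is a sketch, not our actual tooling: the class, the function names, and the 182-day review window (an approximation of "6 months") are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_AFTER = timedelta(days=182)  # assumption: roughly six months


@dataclass(frozen=True)
class AllowlistEntry:
    cve_id: str
    owner: str
    reason: str
    added: date
    expires: date


def is_allowed(cve_id: str, allowlist: list[AllowlistEntry], today: date) -> bool:
    """A CVE is allowed only while a matching entry has not expired."""
    return any(e.cve_id == cve_id and today <= e.expires for e in allowlist)


def needs_review(entry: AllowlistEntry, today: date) -> bool:
    """Flag entries that have been on the list for about six months."""
    return today - entry.added >= REVIEW_AFTER


# Hypothetical entries with placeholder CVE IDs.
example_allowlist = [
    AllowlistEntry("CVE-0000-0001", "team-a", "vulnerable function not called",
                   added=date(2024, 1, 10), expires=date(2024, 7, 1)),
    AllowlistEntry("CVE-0000-0002", "team-b", "waiting on upstream fix",
                   added=date(2023, 9, 1), expires=date(2024, 5, 1)),
]
```

An expired entry stops suppressing its CVE, so the build fails again until someone renews the exception or fixes the issue.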
The base image determines most of your CVE surface. Switching from ubuntu:22.04 to gcr.io/distroless/static cut our average HIGH/CRITICAL CVE count from ~25 to ~2, because distroless contains far fewer packages.
Our base image rules:
Slim variants (debian:slim, not debian)
Pinned versions (debian:12.4-slim, not debian:slim)
The "rebuild nightly" point is important. Even if your code doesn't change, the upstream base might have new CVE fixes. Our nightly rebuild catches these without manual intervention.
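The pinning rule (debian:12.4-slim, not debian:slim) lends itself to a mechanical check. The Python sketch below is a heuristic under stated assumptions: `is_pinned` is a hypothetical helper that treats a reference as pinned if it carries a digest or a tag starting with a version number. It does not judge how specific the version is.

```python
import re

VERSION_TAG = re.compile(r"^v?\d+(\.\d+)*")


def is_pinned(image_ref: str) -> bool:
    """Heuristic: True if the reference has a digest or a versioned tag."""
    if "@sha256:" in image_ref:
        return True  # digest references are immutable
    # Only look after the last '/', so a registry port is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit 'latest'
    tag = last_segment.split(":", 1)[1]
    return bool(VERSION_TAG.match(tag))
```

A check like this could run over the `FROM` lines of each Dockerfile in CI. A floating tag such as `debian:slim` fails it, while `debian:12.4-slim` passes.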
Trivy tells us what's IN the image. Cosign tells us if it's the image we built.
In CI, after building and scanning, we sign:
cosign sign --key cosign.key $IMAGE_TAG
The signature is stored in the registry alongside the image (Cosign pushes it under a separate "sidecar" tag derived from the image digest).
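For reference, the tag-based layout Cosign has used by default names the signature after the image digest, with the colon replaced by a dash and a `.sig` suffix. The Python sketch below only derives that tag name; `cosign_signature_tag` is a hypothetical helper, and newer Cosign versions may store signatures differently depending on configuration.

```python
def cosign_signature_tag(image_digest: str) -> str:
    """Derive the sidecar tag for a digest such as 'sha256:<64 hex chars>'."""
    algo, sep, hex_digest = image_digest.partition(":")
    is_hex = all(c in "0123456789abcdef" for c in hex_digest)
    if algo != "sha256" or not sep or len(hex_digest) != 64 or not is_hex:
        raise ValueError(f"not a sha256 digest: {image_digest!r}")
    return f"{algo}-{hex_digest}.sig"
```

Because the tag is tied to the digest and not to a mutable tag like `latest`, the signature follows the exact image content that was scanned.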
At deploy time, an admission controller (Kyverno or Gatekeeper) verifies the signature:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  rules:
    - name: verify-cosign
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "our-ecr-repo/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
A pod whose image isn't signed by our key gets rejected at admission. This is the runtime defense against "someone pushes an unauthorized image to ECR."
We started with this in audit-only mode for a few weeks (just logging violations) before enforcing. Found a few internal images that weren't going through our signing pipeline; fixed those before turning enforcement on.
Software Bill of Materials: a structured list of all components in an image. We generate one per image with Syft:
syft $IMAGE_TAG -o cyclonedx-json > sbom.json
The SBOM is stored alongside the image (also as a Cosign attestation).
Use cases:
We don't use SBOMs much day-to-day, but having them means we're prepared for compliance asks and incidents.
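During an incident, the typical question is "which images contain component X?". A minimal Python sketch of that lookup against a CycloneDX JSON SBOM follows. The function name and sample data are hypothetical, and the sketch only inspects the top-level `components` list, not nested components.

```python
def find_components(sbom: dict, name_fragment: str) -> list[tuple[str, str]]:
    """Return (name, version) for components whose name contains the fragment."""
    needle = name_fragment.lower()
    return [
        (component.get("name", ""), component.get("version", ""))
        for component in sbom.get("components") or []
        if needle in component.get("name", "").lower()
    ]


# Hypothetical SBOM fragment in CycloneDX JSON shape.
sample_sbom = {
    "bomFormat": "CycloneDX",
    "components": [
        {"type": "library", "name": "log4j-core", "version": "2.14.1"},
        {"type": "library", "name": "jackson-databind", "version": "2.15.2"},
    ],
}
```

Run over the stored SBOMs for every deployed image, this answers the question without pulling or re-scanning the images.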
Build-time scanning catches what's IN the image. Runtime detection catches what the image DOES.
Falco is the open-source runtime security tool we use. It watches kernel events for suspicious behavior:
Reads of sensitive files (/etc/shadow, ~/.aws/credentials)
Suspicious commands (nc -e /bin/sh, a reverse shell)
Falco rules are extensive out of the box; we tune them to reduce noise.
When a Falco rule fires:
We've had ~15 high-severity Falco hits in the past year. None were real attacks; all were false positives, usually a CI runner doing something unusual. The signal hasn't surfaced an actual attack yet, but the visibility is worth the effort.
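Separating high-severity hits from the rest can be sketched as a filter over Falco's JSON output, where each event carries a `priority` field. The Python below is an assumption-laden illustration: the function name, the `critical` cutoff, and the sample events are hypothetical, and the comparison is case-insensitive to avoid depending on the exact casing Falco emits.

```python
import json

# Falco priorities from least to most severe.
PRIORITY_ORDER = [
    "debug", "informational", "notice", "warning",
    "error", "critical", "alert", "emergency",
]


def is_high_severity(event_line: str, minimum: str = "critical") -> bool:
    """True if a Falco JSON event is at or above the minimum priority."""
    event = json.loads(event_line)
    priority = str(event.get("priority", "")).lower()
    if priority not in PRIORITY_ORDER:
        return False  # unknown or missing priority is not escalated here
    return PRIORITY_ORDER.index(priority) >= PRIORITY_ORDER.index(minimum)


# Hypothetical events, trimmed to the fields the sketch uses.
shell_event = json.dumps({"rule": "Example shell rule", "priority": "Critical"})
noise_event = json.dumps({"rule": "Example noisy rule", "priority": "Notice"})
```

Tuning then becomes a matter of adjusting rule priorities and the cutoff, not silencing rules outright.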
Things that bite teams:
False positives. Some CVEs are flagged but unexploitable in your context (you don't use the vulnerable function). Without an allowlist mechanism, you fail builds for issues that don't matter. Allowlist with reason + expiry helps.
Vuln database coverage. Trivy's database is good for major distros and ecosystems. For obscure dependencies, coverage is patchier. On critical images we run multiple scanners (Trivy, plus Snyk for the languages Snyk covers better).
Build-time vs runtime mismatch. The image is scanned at build time, but packages are sometimes added at runtime (they shouldn't be, but it happens). Runtime detection (Falco) helps catch these.
Speed. Scanning a large image (~1GB) can take 30+ seconds. For small frequent CI runs this is significant. Caching layers helps; switching to smaller base images helps more.
Noise after a major CVE announcement. When log4shell hit, every Java image flagged. All hands on deck for a couple of days. The pipeline survived but barely; we built tooling to do bulk allowlisting / bulk re-scanning afterward.
Times scanning saved us:
A base image upgrade introduced a new HIGH CVE. We were updating from node:18-alpine to node:20-alpine. The newer alpine had a temporarily-vulnerable musl. CI blocked the deploy; we waited a week for the upstream fix; rebuilt; deploy went through. Without scanning, the vulnerability would have been in production.
A typosquatted npm package made it into a dev dependency. It had a name similar to a real package and carried malicious code. Trivy flagged it in the npm scan. We removed the dependency and reported it upstream.
An old image was still being deployed in prod. A service hadn't been redeployed in 8 months, and upstream had shipped several CVE patches that its image lacked. Our nightly rebuild + scan flagged this; we kicked off a deploy of the fresh image.
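The stale-image case can also be caught directly by comparing image build times against a maximum age. The Python sketch below assumes you can obtain a build timestamp per deployed service (for example from the image's `created` metadata). The function name, the 30-day threshold, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_images(
    deployed: dict[str, datetime], now: datetime, max_age_days: int = 30
) -> list[str]:
    """Return the services whose running image is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, built in deployed.items() if built < cutoff)


# Hypothetical inventory: service name -> image build time (UTC).
example_deployed = {
    "billing": datetime(2024, 5, 25, tzinfo=timezone.utc),
    "legacy-report": datetime(2023, 10, 1, tzinfo=timezone.utc),
}
example_now = datetime(2024, 6, 1, tzinfo=timezone.utc)
```

Here only `legacy-report` is reported, since its image was built about eight months before `example_now`.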
Self-hosted setup:
Total marginal cost: ~$200/month. The engineer time is real but small.
Compared to commercial alternatives (Snyk, Aqua, Sysdig): the commercial tools have better UX and more features but cost $5-20k+/month for our scale. The open-source stack is good enough for our needs.
Scan in CI; block on critical findings. The simplest setup catches a lot.
Use distroless or slim base images. Most of your CVE surface comes from the base; smaller bases = fewer CVEs.
Sign images and verify at admission. Defense against unauthorized images. Cosign + Kyverno is the standard.
Allowlist with expiry, not "ignore forever." Each exception is reviewed periodically.
Rebuild nightly. Catches upstream patches without manual intervention.
Add runtime detection eventually. Falco or similar for visibility into what's running.
Don't try to fix everything immediately. Triage by severity; address the actionable items; track the rest.
Container security isn't one big thing. It's a pipeline with multiple checkpoints, each catching a different class of issue. The pieces are well-known; the discipline is in keeping the thresholds reasonable, the allowlist clean, and the rebuilds happening. Once the system is running, it does its work quietly. The win is the bad images that never made it to production — which you don't see, because they didn't.