Our base image went from 1.2 GB and 200+ CVEs to 80 MB and 4 CVEs. Most of the work wasn't clever — it was deletion.
When we audited our container images last spring, the average production image was 1.2 GB. The Trivy report showed 200+ CVEs across them, mostly inherited from base layers we hadn't picked deliberately. Cold-start times for autoscaled services were noticeably worse than they should have been because pulling a 1.2 GB image to a fresh node took ~12 seconds.
A quarter later we're at an average of ~80 MB per image, ~4 CVEs each, and cold-start image-pull time is under 2 seconds. The work was unglamorous. None of it was clever.
We did the audit in a spreadsheet. For each of our ~22 production images, we recorded: base image, total size, layer count, CVEs by severity, and what the image actually needed to run.
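The "CVEs by severity" column can be filled in from Trivy's JSON output rather than by hand. The following is a minimal sketch, not the tooling used for the audit: it assumes a report produced with `trivy image --format json`, and the function name `count_by_severity` is hypothetical.

```python
from collections import Counter


def count_by_severity(report: dict) -> Counter:
    """Count vulnerabilities per severity in a parsed Trivy JSON report.

    Assumes the usual report layout: a top-level "Results" list whose
    entries may carry a "Vulnerabilities" list with a "Severity" field.
    """
    counts = Counter()
    for result in report.get("Results") or []:
        # "Vulnerabilities" is missing or null for targets with no findings.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts
```

Loading each saved report with `json.load` and passing it to this function gives one row of severity counts per image.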
The patterns that fell out:
- node:18 (full Debian-based, ~1 GB). They needed Node and curl. That's it.
- python:3.11 (similarly full). Same story.
- ubuntu:22.04 as a base for what amounted to a single Go binary.
- /app and one in /srv/app. Nobody knew why.

The opportunity was deleting things, not adding security tools.
For the Node services we moved to gcr.io/distroless/nodejs20-debian12. For Go binaries we moved to gcr.io/distroless/static-debian12. For Python we used python:3.11-slim as an intermediate step (distroless Python is harder to use because of native deps); slim was good enough.
# Before
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "server.js"]
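# After
# Build stage: install all dependencies in case the build step needs
# devDependencies, build, then prune back to production dependencies.
# Note: native addons compiled on Alpine (musl) will not load on the
# glibc-based runtime image below.
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build && npm prune --omit=dev

# Runtime stage: only the build output and production node_modules.
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER nonroot
# The distroless Node image's entrypoint is already node, so CMD is just the script.
CMD ["dist/server.js"]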
Image size dropped from ~1 GB to ~140 MB on the Node services. CVEs from inherited layers dropped to near zero (distroless ships only what you need to run).
Most of our images had been single-stage — build tools and runtime tools in the same final image. Splitting them was mechanical:
1. FROM <full toolchain image>: install build tools, compile, produce artifacts
2. FROM <minimal runtime image>: copy ONLY the artifacts and runtime deps from stage 1

The trap is COPY --from=builder /app /app. That tends to copy more than the artifact. The fix: COPY --from=builder /app/dist /app/dist, or COPY --from=builder /app/myapp /myapp. Be precise.
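A check for the broad-copy trap is easy to automate. The sketch below is a rough heuristic, not something from the original migration: it flags any `COPY --from=...` whose source path has one component or fewer (such as `/app` or `/`). The function name `broad_copies` and the depth threshold are assumptions, and it only handles the shell form of `COPY`, not the JSON-array form.

```python
def broad_copies(dockerfile_text: str, max_depth: int = 1) -> list[str]:
    """Return COPY --from lines whose source path looks too broad.

    A source is treated as "broad" when it has max_depth path components
    or fewer, e.g. "/app" (1 component) or "/" (0 components).
    """
    flagged = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        parts = line.split()
        if len(parts) < 3 or parts[0].upper() != "COPY":
            continue
        flags = [p for p in parts[1:] if p.startswith("--")]
        paths = [p for p in parts[1:] if not p.startswith("--")]
        # Only cross-stage copies are of interest here.
        if not any(f.startswith("--from=") for f in flags) or len(paths) < 2:
            continue
        for src in paths[:-1]:  # the last path is the destination
            depth = len([c for c in src.split("/") if c and c != "."])
            if depth <= max_depth:
                flagged.append(line)
                break
    return flagged
```

Run against a Dockerfile's text, it returns the offending lines verbatim, which makes it easy to print them in a pre-commit hook or CI step.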
Most distroless images have a nonroot user (UID 65532) baked in. We added USER nonroot to every Dockerfile. With runAsNonRoot: true in the container's securityContext, the kubelet refuses to start a container that would run as root, and allowPrivilegeEscalation: false stops a process from gaining more privileges than its parent.
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
Read-only root filesystem caught a few apps that were writing temp files in surprising places — we mounted explicit emptyDir volumes for the directories they actually needed and called the rest a bug.
FROM node:20-alpine is a moving target. Today's 20-alpine is different from yesterday's. We pin to the digest:
FROM node:20-alpine@sha256:1a7d234c8b... AS builder
FROM gcr.io/distroless/nodejs20-debian12@sha256:9fa2f99... AS runtime
Dependabot bumps these via PR. We see the digest change in the diff. If a base image bumps in a way that breaks our build, we see it explicitly instead of being surprised next Tuesday.
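Digest pinning is only useful if every base image is pinned, so it helps to check for stragglers. This is a minimal sketch under stated assumptions, not part of the original setup: the function name `unpinned_bases` is hypothetical, and it does not resolve `ARG`-substituted image names.

```python
def unpinned_bases(dockerfile_text: str) -> list[str]:
    """Return base images in FROM lines that are not pinned by digest.

    References to earlier build stages and the special "scratch" base
    are skipped, since neither can carry a digest.
    """
    stages = set()
    unpinned = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_stage_ref = image.lower() in stages
        if image != "scratch" and not is_stage_ref and "@sha256:" not in image:
            unpinned.append(image)
        # Remember the alias from "FROM image AS name" for later stages.
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2].lower())
    return unpinned
```

An empty result means every external base image in the file carries a digest.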
We added Trivy to every image build. The CI fails on CRITICAL or HIGH CVEs unless an explicit allowlist entry is set:
- name: Scan image
  uses: aquasecurity/trivy-action@<pinned-sha>
  with:
    image-ref: ${{ steps.build.outputs.tag }}
    severity: CRITICAL,HIGH
    exit-code: 1
    ignore-unfixed: true  # don't fail on CVEs without a patch yet
    trivyignores: .trivyignore
The .trivyignore file lists CVEs we've consciously accepted, with a comment explaining why and an expiration date for review.
This stopped us from shipping new HIGH CVEs. The existing ones got cleaned up by re-baselining as we did the migration to distroless.
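An allowlist with review dates only works if someone notices when a date passes. The sketch below shows one way to surface that. It assumes the dates in .trivyignore use the `exp:YYYY-MM-DD` suffix that Trivy supports after a vulnerability ID (the original file may record them differently, for example in comments). The function name `review_ignores` is hypothetical, and the IDs in the usage note are placeholders, not real CVEs.

```python
from datetime import date


def review_ignores(text: str, today: date) -> dict:
    """Split .trivyignore entries into expired and missing-expiry lists.

    Assumes one vulnerability ID per line, optionally followed by
    "exp:YYYY-MM-DD", with "#" starting a comment.
    """
    expired, no_expiry = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        vuln_id = fields[0]
        exp = next((f[4:] for f in fields[1:] if f.startswith("exp:")), None)
        if exp is None:
            no_expiry.append(vuln_id)
        elif date.fromisoformat(exp) < today:
            expired.append(vuln_id)
    return {"expired": expired, "no_expiry": no_expiry}
```

For a file containing `CVE-0000-0001 exp:2024-01-01` and `CVE-0000-0003`, calling `review_ignores(text, date(2025, 1, 1))` reports the first as expired and the second as missing an expiry, both of which are worth a second look.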
Two things broke along the way, both fixable.
The Python migration to python:3.11-slim broke a service that needed libpq-dev (for psycopg2 to compile). The fix was to install it explicitly in the build stage and use psycopg2-binary instead of psycopg2 so the runtime image didn't need libpq. Took a couple of hours.
The Go service migration to distroless broke a service that was shelling out to curl for a health check. We replaced the curl call with a small Go HTTP request inside the binary. Cleaner anyway. Took an hour.
kubectl debug with an ephemeral container is the right tool; baking a shell into production images defeats the point of distroless.

| Metric | Before | After |
|---|---|---|
| Avg image size | ~1.2 GB | ~80 MB |
| Avg CVEs per image (HIGH+) | 200+ | 4 |
| Cold-start image pull p95 | 12s | 1.8s |
| Total registry storage | ~140 GB | ~12 GB |
| CI build time (avg, full chain) | 4m 20s | 2m 50s |
The cold-start improvement was an unexpected bonus. Pulling 80 MB instead of 1.2 GB shaved 10+ seconds off autoscaling response time. For traffic spikes, that's the difference between absorbing the spike and dropping requests.
In order of return:
-slim) base images. This is the single biggest move.

The first two get you 80% of the size reduction. Numbers three and four are mechanical and prevent regression. Number five enforces the bar.
What I wouldn't bother with on day one: image signing, custom Trivy DB, content trust workflows. They're great once you've done the basics; they don't help if your base image is still ubuntu:latest.