Our base image went from 1.2 GB and 200+ CVEs to 80 MB and 4 CVEs. Most of the work wasn't clever — it was deletion.

Docker Image Hardening for Production

When we audited our container images last spring, the average production image was 1.2 GB. The Trivy report showed 200+ CVEs across them, mostly inherited from base layers we hadn't picked deliberately. Cold-start times for autoscaled services were noticeably worse than they should have been because pulling a 1.2 GB image to a fresh node took ~12 seconds.

A quarter later we're at an average of ~80 MB per image and ~4 CVEs each, and cold-start image-pull time is under 2 seconds. The work was unglamorous. None of it was clever.

What was wrong #

We did the audit in a spreadsheet. For each of our ~22 production images, we recorded: base image, total size, layer count, CVEs by severity, and what the image actually needed to run.

The patterns that fell out:

9 images used node:18 (full Debian-based, ~1 GB). They needed Node and curl. That's it.
5 images used python:3.11 (similarly full). Same story.
4 images used ubuntu:22.04 as a base for what amounted to a single Go binary.
2 images had inherited layers from a builder stage that nobody had cleaned up — so they shipped GCC, make, npm, and a node_modules tree alongside the actual app.
1 image had two copies of the application baked in, one in /app and one in /srv/app. Nobody knew why.

The opportunity was deleting things, not adding security tools.

The five changes that did most of the work #

1. Switch to distroless or minimal bases #

For the Node services we moved to gcr.io/distroless/nodejs20-debian12. For Go binaries we moved to gcr.io/distroless/static-debian12. For Python we used python:3.11-slim as an intermediate step (distroless Python is harder to use because of native deps); slim was good enough.

dockerfile.dockerfile

# Before
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "server.js"]

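# After
# Note: native addons compiled on Alpine (musl) will not load on the
# Debian-based (glibc) runtime; use a Debian-based builder if you have any.
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install dev dependencies too; the build step usually needs them
RUN npm ci
COPY . .
# Build, then drop dev dependencies before node_modules is copied out
RUN npm run build && npm prune --omit=dev

FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER nonroot
# The distroless Node image's entrypoint is already node
CMD ["dist/server.js"]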

Image size dropped from ~1 GB to ~140 MB on the Node services. CVEs from inherited layers dropped to near zero (distroless ships only what you need to run).

2. Multi-stage builds, ruthlessly #

Most of our images had been single-stage — build tools and runtime tools in the same final image. Splitting them was mechanical:

Stage 1: FROM <full toolchain image> — install build tools, compile, produce artifacts
Stage 2: FROM <minimal runtime image> — copy ONLY the artifacts and runtime deps from stage 1

The trap is COPY --from=builder /app /app. That tends to copy more than the artifact. The fix: COPY --from=builder /app/dist /app/dist, or COPY --from=builder /app/myapp /myapp. Be precise.
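As a rough illustration of the two-stage split with a precise COPY target, here is a minimal sketch for one of the Go services. The Go version tag, the ./cmd/myapp package path, and the /out/myapp output path are assumptions made for the example, not details of our actual builds.

```dockerfile
# Stage 1: full toolchain. Tag and paths are illustrative.
FROM golang:1.22 AS builder
WORKDIR /src
# Assumes the module has both go.mod and go.sum
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# CGO_ENABLED=0 produces a statically linked binary, which the static
# distroless base needs because it ships no libc.
RUN CGO_ENABLED=0 go build -o /out/myapp ./cmd/myapp

# Stage 2: minimal runtime. Copy ONLY the binary.
FROM gcr.io/distroless/static-debian12
COPY --from=builder /out/myapp /myapp
USER nonroot
ENTRYPOINT ["/myapp"]
```

Writing the binary to a dedicated output directory makes the precise COPY easy: nothing else lives under /out, so there is nothing to copy by accident.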

3. Run as non-root #

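Most distroless images have a nonroot user (UID 65532) baked in. We added USER nonroot to every Dockerfile. In the container's securityContext, runAsNonRoot: true makes the kubelet refuse to start a container that would run as root, and allowPrivilegeEscalation: false blocks setuid-style escalation. Setting runAsUser: 65532 explicitly matters here: the kubelet cannot verify runAsNonRoot against a non-numeric USER such as nonroot.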

yaml.yaml

securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]

Read-only root filesystem caught a few apps that were writing temp files in surprising places — we mounted explicit emptyDir volumes for the directories they actually needed and called the rest a bug.
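A minimal sketch of that emptyDir pattern, assuming an app that only needs a writable /tmp. The container name, image reference, and mount path are placeholders rather than values from our manifests.

```yaml
# Pod spec fragment; names and paths are illustrative.
spec:
  containers:
    - name: app
      image: registry.example.com/app@sha256:<digest>
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp   # the one directory this app is allowed to write to
  volumes:
    - name: tmp
      emptyDir: {}
```

Every other path stays read-only, so a write anywhere else fails loudly instead of silently filling the container layer.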

4. Pin base images by digest, not tag #

FROM node:20-alpine is a moving target: the tag is re-pointed whenever the image is rebuilt, so today's 20-alpine can differ from yesterday's. We pin to the digest:

dockerfile.dockerfile

FROM node:20-alpine@sha256:1a7d234c8b... AS builder
FROM gcr.io/distroless/nodejs20-debian12@sha256:9fa2f99... AS runtime

Dependabot bumps these via PR. We see the digest change in the diff. If a base image bumps in a way that breaks our build, we see it explicitly instead of being surprised next Tuesday.
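For reference, a Dependabot configuration for Docker base images can be as small as the sketch below. The directory and the weekly interval are assumptions; adjust them to wherever the Dockerfile lives and how often you want bump PRs.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "docker"
    directory: "/"          # assumed location of the Dockerfile
    schedule:
      interval: "weekly"    # assumed cadence
```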

5. Scan in CI, not as a one-off #

We added Trivy to every image build. The CI fails on CRITICAL or HIGH CVEs unless an explicit allowlist entry is set:

yaml.yaml

- name: Scan image
  uses: aquasecurity/trivy-action@<pinned-sha>
  with:
    image-ref: ${{ steps.build.outputs.tag }}
    severity: CRITICAL,HIGH
    exit-code: 1
    ignore-unfixed: true   # don't fail on CVEs without a patch yet
    trivyignores: .trivyignore

The .trivyignore file lists CVEs we've consciously accepted, with a comment explaining why and an expiration date for review.

This stopped us from shipping new HIGH CVEs. The existing ones got cleaned up by re-baselining as we did the migration to distroless.
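A .trivyignore entry of the kind described above might look like the following sketch. The CVE IDs are placeholders and the dates and justifications are invented for illustration; the exp: suffix is Trivy's syntax for an entry that stops being ignored after the given date.

```text
# Accepted: vulnerable code path is not reachable from our service.
# Re-review before the expiry date.
CVE-YYYY-NNNNN exp:2025-06-30

# Accepted: waiting on an upstream base image fix, tracked in our backlog.
CVE-YYYY-MMMMM exp:2025-03-31
```

Once an entry expires, the scan fails again on that CVE, which forces the review instead of relying on someone remembering it.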

What broke during the migration #

Two things, both fixable.

The Python migration to python:3.11-slim broke a service that needed libpq-dev (for psycopg2 to compile). The fix was to install it explicitly in the build stage and use psycopg2-binary instead of psycopg2 so the runtime image didn't need libpq. Took a couple of hours.
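A minimal sketch of what the slim-based Python layout can look like, assuming a requirements.txt that lists psycopg2-binary and an entry point at app/main.py (both hypothetical). The psycopg2-binary wheel bundles its own libpq, which is why the runtime stage installs no system packages.

```dockerfile
# Build stage: install dependencies into a virtualenv.
FROM python:3.11-slim AS builder
RUN python -m venv /venv
COPY requirements.txt .
# Any packages needed only to compile other dependencies would be
# installed here, in the build stage, and never reach the runtime image.
RUN /venv/bin/pip install --no-cache-dir -r requirements.txt

# Runtime stage: same base, so the virtualenv's interpreter path still resolves.
FROM python:3.11-slim
COPY --from=builder /venv /venv
COPY app/ /app/
# python:3.11-slim has no nonroot user, so use a numeric UID
USER 65532:65532
ENTRYPOINT ["/venv/bin/python", "/app/main.py"]
```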

Migrating the Go services to distroless broke one that was shelling out to curl for a health check; the static distroless image ships neither curl nor a shell. We replaced the curl call with a small Go HTTP request inside the binary. Cleaner anyway. Took an hour.

What we don't do #

We don't add a shell to distroless images for "debugging." kubectl debug with an ephemeral container is the right tool; baking a shell into production images defeats the point of distroless.
We don't sign images with cosign yet. It's on the roadmap but we want to land registry-side image scanning + admission control first.
We don't try to get to zero CVEs. Some HIGH CVEs in OS layers don't have patches yet; we accept and review monthly. The goal is "explicitly known and tracked," not "zero."

Numbers after the migration #

Metric	Before	After
Avg image size	~1.2 GB	~80 MB
Avg CVEs per image (HIGH+)	200+	4
Cold-start image pull p95	12s	1.8s
Total registry storage	~140 GB	~12 GB
CI build time (avg, full chain)	4m 20s	2m 50s

The cold-start improvement was an unexpected bonus. Pulling 80 MB instead of 1.2 GB shaved 10+ seconds off autoscaling response time. For traffic spikes, that's the difference between absorbing the spike and dropping requests.

The 80/20 list for a team starting fresh #

In order of return:

Move to distroless (or -slim) base images. This is the single biggest move.
Multi-stage builds with precise COPY targets.
Run as non-root with read-only root filesystem.
Pin base images by digest. Use Dependabot.
Scan in CI and fail on HIGH CVEs (with an allowlist for tracked exceptions).

The first two get you 80% of the size reduction. Numbers three and four are mechanical and prevent regression. Number five enforces the bar.

What I wouldn't bother with on day one: image signing, custom Trivy DB, content trust workflows. They're great once you've done the basics; they don't help if your base image is still ubuntu:latest.

About Kiril Urbonas