We cut our average production image size by 78% with multi-stage builds. The patterns that worked, the ones that didn't, and the production gotchas.
A few years ago our average production image size was around 1.2GB. Today it's 240MB. The biggest single change was committing seriously to multi-stage builds. This post covers the patterns that worked and the ones we abandoned, with the production reasons behind each.
A multi-stage Dockerfile has multiple FROM lines. Each starts a new stage with a fresh filesystem. You can COPY --from=<stage> between them.
The point is to separate "what you need to build the artifact" from "what you need to run it." Build needs compilers, dev dependencies, source code, build caches. Runtime needs none of these — just the binary or compiled output, plus its runtime dependencies.
Before multi-stage builds, the workaround was to split the work into a build container and a runtime container, with a CI orchestration step copying artifacts between them. Multi-stage replaces that with a single Dockerfile.
# Stage 1: build
FROM golang:1.22-alpine AS build
WORKDIR /src
# Cache dependencies in their own layer
COPY go.mod go.sum ./
RUN go mod download
# Build with static linking + version info (pass --build-arg VERSION=<commit>)
COPY . .
ARG VERSION=dev
RUN CGO_ENABLED=0 GOOS=linux \
    go build -ldflags="-s -w -X main.version=${VERSION}" \
    -o /app ./cmd/server
# Stage 2: runtime
FROM gcr.io/distroless/static-debian12 AS runtime
COPY --from=build /app /app
USER 65532:65532
ENTRYPOINT ["/app"]
Result: ~15MB final image. The Go binary itself is the bulk of it.
Three things to highlight:
COPY go.mod go.sum ./ followed by go mod download sits in its own layer, before the source is copied. This caches dependencies: they only re-download when go.mod or go.sum changes. For a service with 80+ dependencies this saves ~40 seconds per build.
-ldflags="-s -w" strips the symbol table and DWARF debug information from the binary. ~30% size reduction.
The runtime base is distroless/static. No shell, no package manager, no nothing — just the Go binary running as a non-root user (65532 is distroless's nonroot UID).
Node is harder because runtime needs node_modules, which is huge:
# Stage 1: build (with full toolchain)
FROM node:20-bookworm AS build
WORKDIR /src
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build && \
npm prune --production
# Stage 2: runtime
FROM gcr.io/distroless/nodejs20-debian12 AS runtime
WORKDIR /app
COPY --from=build /src/dist /app/dist
COPY --from=build /src/node_modules /app/node_modules
COPY --from=build /src/package.json /app/package.json
USER nonroot
CMD ["dist/server.js"]
The tricks:
npm ci after copying lockfile-only: same dependency caching trick as Go's go mod download.
npm prune --production after build: removes dev dependencies. Cuts node_modules size by 40-60%.
The distroless nodejs20 runtime base is smaller than node:20-slim (~250MB) and significantly smaller than the full node:20 (~1.1GB).
The final size of one of our Node services: ~280MB. Could be smaller with esbuild bundling (one bundle, no node_modules), but for our case the marginal benefit didn't justify the bundler complexity.
Python multi-stage is fiddly because of compiled dependencies:
# Stage 1: build (with build tools for compiling C extensions)
FROM python:3.12-slim AS build
WORKDIR /src
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ libffi-dev && \
rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
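# Stage 2: runtime (no build tools)
FROM python:3.12-slim AS runtime
# Non-root user whose home holds the copied packages (/root is not readable by UID 1000)
RUN useradd --create-home --uid 1000 app
WORKDIR /app
COPY --from=build --chown=1000:1000 /root/.local /home/app/.local
COPY . .
ENV PATH=/home/app/.local/bin:$PATH
USER 1000:1000
CMD ["python", "-m", "myapp"]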
The pip install --user puts packages in ~/.local, which is a single directory we can copy. Without --user, pip installs into system-wide locations that are harder to copy cleanly.
Python images are still chunky (350-450MB typically). The Python interpreter alone is ~50MB, then add scientific libraries (numpy, pandas, torch, etc.) and you're often at 1GB+. We've cut these down with:
--no-cache-dir on pip (prevents wheel cache bloat)
CPU-only builds where a GPU isn't needed (torch+cpu is ~200MB; torch with CUDA is ~2GB)
Removing __pycache__ directories before the final stage
Distroless Python exists but we found it brittle (path issues, missing C libraries some packages need). We stick with python:3.12-slim.
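The Python size reductions listed above could look roughly like this in the build stage. This is a sketch, not our exact Dockerfile: it assumes the service runs without a GPU, that requirements.txt lists torch without a CUDA-specific pin, and that the runtime stage copies /root/.local as in the example earlier.

```dockerfile
FROM python:3.12-slim AS build
WORKDIR /src
COPY requirements.txt .

# CPU-only torch from the PyTorch CPU wheel index (assumption: no GPU at runtime).
RUN pip install --user --no-cache-dir \
        --index-url https://download.pytorch.org/whl/cpu torch

# Everything else; --no-cache-dir keeps pip's wheel cache out of the layer.
RUN pip install --user --no-cache-dir -r requirements.txt

# Drop bytecode caches before the runtime stage copies /root/.local.
RUN find /root/.local -type d -name __pycache__ -prune -exec rm -rf {} +
```

Removing __pycache__ trades a little startup time for size, since Python recompiles modules on first import when the bytecode is missing.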
Single-stage with aggressive cleanup. The "old" pattern: build everything in one stage, then rm -rf /usr/share/doc /var/cache/apt/archives /tmp/* etc. It works only if every cleanup runs in the same RUN instruction as the step that created the files; a cleanup in a later RUN just adds a layer on top, and the deleted bytes stay in the image. Multi-stage is cleaner and more aggressive.
Scratch as a runtime base. FROM scratch is the smallest possible — but you have nothing. No CA certs (HTTPS calls fail), no /etc/passwd (some libraries crash trying to look up the current user), no DNS (in some configurations). Distroless gives you the minimum you actually need without going overboard.
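To make those gaps concrete, here is a sketch of what a scratch-based Go image has to carry by hand. The passwd entry and the user name app are hypothetical, and the paths assume an Alpine build stage; distroless ships equivalents of both files already.

```dockerfile
FROM golang:1.22-alpine AS build
# CA bundle so outbound HTTPS calls can verify certificates.
RUN apk add --no-cache ca-certificates
# Hypothetical minimal passwd file so user lookups for UID 65532 succeed.
RUN echo 'app:x:65532:65532::/nonexistent:/sbin/nologin' > /tmp/passwd
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

FROM scratch
COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
COPY --from=build /tmp/passwd /etc/passwd
COPY --from=build /app /app
USER 65532:65532
ENTRYPOINT ["/app"]
```

Every file the application turns out to need becomes another COPY line to maintain, which is the overhead distroless removes.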
BuildKit cache mounts everywhere. RUN --mount=type=cache,target=/root/.cache/pip pip install ... speeds up CI builds. We use this for some services. The catch: cache mounts live on the builder and don't transfer between CI runners by default. If your CI doesn't have persistent storage, the cache mount starts empty on every run and saves nothing.
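For reference, a cache mount in a Python build stage might look like the sketch below. It assumes BuildKit is the active builder and that pip runs as root, so its cache lives under /root/.cache/pip.

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.12-slim AS build
WORKDIR /src
COPY requirements.txt .

# The mount keeps pip's download and wheel cache on the builder between builds.
# It is never written into an image layer, so --no-cache-dir is dropped here.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --user -r requirements.txt
```

On a runner with a persistent builder, a rebuild after a requirements.txt change only downloads the packages that changed.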
Builds with embedded git history. We tried COPY .git ./ to embed version info, then realized it bloats the build context and the build stage. We use a build arg instead: --build-arg VERSION=$(git rev-parse HEAD).
A few production gotchas:
Distroless lacks shell, so no docker exec debugging. When you need to investigate a running container, distroless has no /bin/sh. Mitigation: we have a "debug" build of each service that uses gcr.io/distroless/...:debug (which adds busybox) and is push-to-prod-able when needed.
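One way to build such a debug variant from the same Dockerfile is to make the runtime tag a build argument. This is a sketch: RUNTIME_TAG is a name chosen for illustration, while nonroot and debug-nonroot are tags the distroless project publishes.

```dockerfile
# Tag of the distroless runtime base; override at build time for a debug image.
ARG RUNTIME_TAG=nonroot

FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# An ARG declared before the first FROM can be used in later FROM lines.
FROM gcr.io/distroless/static-debian12:${RUNTIME_TAG} AS runtime
COPY --from=build /app /app
USER 65532:65532
ENTRYPOINT ["/app"]
```

A plain docker build . gives the production image; docker build --build-arg RUNTIME_TAG=debug-nonroot -t myservice:debug . (myservice is a placeholder) gives the variant with a busybox shell for docker exec.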
Read-only filesystem expectations. Some apps assume they can write temp files in /tmp. Distroless's /tmp is empty by default. We mount an emptyDir volume in Kubernetes for /tmp to give the app writable scratch space.
HEALTHCHECK doesn't work in distroless out of the box. Docker's HEALTHCHECK runs a command inside the container (usually curl or a shell one-liner). Distroless has neither curl nor a shell. We use Kubernetes liveness/readiness probes (HTTP and TCP probes are run by the kubelet from outside the container) and skip Dockerfile HEALTHCHECK entirely.
Time zone and locale data. Distroless includes minimal locale data. If your app needs specific timezones, you might need gcr.io/distroless/base instead of static, or copy the tzdata files manually. We hit this once; the fix was switching to base.
User permissions. Distroless's nonroot user is UID 65532. If you write files in the build stage, they're owned by root. When the runtime container runs as 65532, those files might not be readable depending on permissions. We use COPY --chown=65532:65532 to handle this.
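A minimal sketch of the ownership fix, assuming a hypothetical /out/data directory that the build stage produces and the service needs to write to at runtime:

```dockerfile
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server ./cmd/server
# Hypothetical build-time output; created by root in this stage.
RUN mkdir -p /out/data && echo '{}' > /out/data/state.json

FROM gcr.io/distroless/static-debian12 AS runtime
# The binary only needs to be readable and executable, so root ownership is fine.
COPY --from=build /out/server /app/server
# --chown rewrites ownership during the copy so UID/GID 65532 can write here.
COPY --from=build --chown=65532:65532 /out/data /app/data
USER 65532:65532
ENTRYPOINT ["/app/server"]
```

Because distroless has no shell, a RUN chown in the runtime stage is not an option; the ownership has to be set on the COPY itself.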
After three years of optimization:
| Language | Base | Typical size |
|---|---|---|
| Go | distroless/static | 15-30 MB |
| Rust | distroless/static (or scratch) | 10-25 MB |
| Node.js | distroless/nodejs | 200-400 MB |
| Python | python:slim | 350-700 MB |
| Java | distroless/java | 250-400 MB |
We track image size as a metric. Any new service > 500MB without justification goes back for a redesign. Most fit comfortably under target.
The size matters for:
Pull time. A 1GB image vs a 250MB image: pull time differs 4-5x. In a node-replacement scenario (a node fails and 30 pods need to start on a new node), this is the difference between 30 seconds and 2 minutes of pull contention.
Cold-start scaling. Same as above but for autoscaling. New pod from scratch with a 1GB image takes longer to be ready.
Storage cost. ECR / Docker Hub charges for storage and transfer. Across 40 services × hundreds of historical tags, this adds up. We've cut our ECR storage bill by ~70% with image size reductions.
Security surface. Smaller images contain fewer packages → fewer CVEs to scan → fewer reasons to rebuild. This is the underrated win. Our distroless-based services have ~95% fewer CVE findings than they did on ubuntu:22.04-based images.
Cosign signatures verified at runtime. We sign images with Cosign in CI. Verification at runtime (Kubernetes admission controller) is configured for staging but not yet enforced in prod. Roadmap.
SBOM generation. Software Bill of Materials, generated by Syft. We produce SBOMs but don't yet have downstream consumers (compliance, supply chain audit). Will be useful when we do.
Build cache distribution. Each CI runner builds from scratch. Build caches are local. For a fleet of runners, a remote cache (registry-based or BuildKit S3) would help. Haven't built this; build times are acceptable.
Two stages is the minimum. If your Dockerfile has only one FROM, you have something to fix.
Cache dependencies in their own layer. Copy lockfile, install, then copy source. The single biggest CI-time win.
Use distroless or scratch for runtime. No shell, no package manager, no excess. Smaller and more secure.
Track image size as a metric. What gets measured gets managed. We have CI reject PRs that grow image size by > 10% without justification.
Build for production, not for debugging. When you need to debug, switch to a debug image. Don't keep production images bloated for the rare debugging case.
Multi-stage is one of those "the obvious answer is the right answer" tools. The patterns are well-known; the discipline is to apply them consistently across every service. Once we did, the other operational benefits (faster deploys, smaller security surface, lower storage bills) showed up almost as side effects.