We cut our average CI build time from 28 minutes to 6 minutes. The changes that mattered, ranked by impact.

CI/CD Pipeline Optimization: Where the Time Goes

A while ago our average CI build took 28 minutes. Engineers context-switched away during builds, PRs sat for hours waiting for checks, and the deploy queue compounded the delay. We did focused work to bring the median down to 6 minutes. Most of the savings came from a small number of specific changes, some of them less obvious than others. This post covers what worked, ranked by impact.

The starting point

A typical PR triggered:

Linting + format checks: 4 min
Unit tests: 12 min
Integration tests: 8 min
Build + push container image: 5 min
Security scan: 4 min
Deploy preview: 5 min

Total wall-clock: ~28 min. Several stages ran in parallel, but the long pole was unit tests followed by integration tests.

CI ran on self-hosted runners in a fixed-size EC2 fleet (~12 runners). At peak times jobs queued, adding 5-15 minutes of waiting on top of the build itself.

Change 1: Cache, cache, cache (largest impact)

The biggest single win: aggressive caching of dependencies and build artifacts.

What we cache:

npm node_modules keyed by package-lock.json hash
Python pip cache keyed by requirements.txt hash
Maven .m2 repository keyed by pom.xml hash
Docker image layers (BuildKit cache mount or registry-based cache)
CI job results for unchanged files (path-based skipping)

GitHub Actions provides the actions/cache action, which stores entries in backend storage tied to the repo (10GB limit). That was sufficient at our scale.

A specific example: a Node service's CI did npm ci every run, taking ~3 minutes. With cache, ~30 seconds when the lockfile is unchanged.
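As a rough sketch of the Node case above, assuming a GitHub Actions workflow with package-lock.json at the repository root (the step id and key prefix are illustrative, not taken from our pipelines):

```yaml
# Sketch only: the step id and key prefix are hypothetical.
steps:
  - uses: actions/checkout@v4

  # Restore node_modules when package-lock.json is unchanged.
  - name: Restore node_modules
    id: node-modules-cache
    uses: actions/cache@v4
    with:
      path: node_modules
      key: ${{ runner.os }}-node-modules-${{ hashFiles('package-lock.json') }}

  # npm ci removes any existing node_modules, so skip it on a cache hit.
  - name: Install dependencies
    if: steps.node-modules-cache.outputs.cache-hit != 'true'
    run: npm ci
```

Because the key contains the lockfile hash, any dependency change produces a new key and a full install, so the cached directory cannot silently go stale.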

Saving across all services: ~6 minutes typical. Largest single change.

Change 2: Parallel test execution

The unit test suite had ~1500 tests and ran single-threaded, taking 12 minutes.

Changes:

Pytest with pytest-xdist to parallelize by file across N workers
Jest with --maxWorkers to parallelize across cores
For services where tests don't parallelize cleanly, we sharded by test file across multiple CI jobs

A 12-minute serial run became a ~3-minute parallel run with 4 workers, with no test logic changes.
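A minimal sketch of what this can look like in a workflow. The job names, runner label, worker count, and shard count are assumptions, and dependency installation steps are omitted for brevity:

```yaml
# Sketch only: job names, worker counts, and shard counts are assumptions.
jobs:
  python-unit:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # -n 4 starts four pytest-xdist workers; --dist loadfile keeps all
      # tests from one file on the same worker.
      - run: pytest -n 4 --dist loadfile

  js-unit:
    runs-on: self-hosted
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      # Each matrix job runs one quarter of the test files (needs Jest 28+).
      - run: npx jest --maxWorkers=4 --shard=${{ matrix.shard }}/4
```

Grouping by file tends to be the gentler option for suites with per-file fixtures, since tests that share module-level setup stay together.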

Watch out for test interdependencies. Some tests assumed that earlier tests had already set up state, and parallel runs broke them. Fixing the inter-test coupling was real work but the right thing to do; those tests were brittle anyway.

Saving: ~9 minutes on the unit test stage (12 minutes down to ~3).

Change 3: Path-based skipping

Not every PR touches every service. A doc-only change shouldn't run all the integration tests.

We added path filters:

A PR that only changes *.md files skips all CI except docs validation.
A PR that only changes one service skips test runs for unrelated services.
A PR that only changes infrastructure runs only the infra checks.

GitHub Actions' paths: filter for workflows handles this. We set it up per-service.

Watch out: dependencies between services. A change to a shared library might affect three services. We err on the side of running too many tests rather than too few.
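A per-service trigger could look roughly like this. The workflow name, directory layout, and make target are hypothetical; the point is that the shared library path is listed alongside the service's own path:

```yaml
# Sketch only: the workflow name and directory layout are hypothetical.
name: billing-service-ci
on:
  pull_request:
    paths:
      - 'services/billing/**'
      # Shared code this service depends on; a change here also triggers the run.
      - 'libs/shared/**'
      # Re-run when the workflow definition itself changes.
      - '.github/workflows/billing-service-ci.yml'
      # Negated pattern, listed last: Markdown-only changes do not trigger it.
      - '!**/*.md'
jobs:
  test:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: make test   # placeholder for the service's test command
```

Listing shared paths explicitly in every dependent workflow is the "too many rather than too few" bias in practice: forgetting one means a silent gap, so it is worth reviewing these lists whenever a new shared dependency appears.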

Saving: ~3 minutes on average (depends on the PR; some save more, most save less).

Change 4: Faster runner instances

Our CI runners were m5.xlarge (4 vCPU, 16GB). Compute-bound jobs (large compilations, parallel test runs) bottlenecked on CPU.

We switched to c6i.4xlarge for compute-heavy jobs (16 vCPU, 32GB). Cost per minute is about 4x, but jobs run about 3x faster, so cost per CI run comes out at roughly 4/3 of what it was: slightly higher, close enough to even for us, and engineers wait less.

For light jobs (lint, security scans), we kept the smaller instances; there, cost matters more than time.
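One way to route jobs to the right instance size is with runner labels. This is a sketch under the assumption that the two pools were registered with custom labels; "compute-heavy" and "light" are made-up names:

```yaml
# Sketch only: "compute-heavy" and "light" are hypothetical labels assigned
# when registering the self-hosted runners.
jobs:
  unit-tests:
    # Routed to the large instances.
    runs-on: [self-hosted, compute-heavy]
    steps:
      - uses: actions/checkout@v4
      - run: pytest -n auto

  lint:
    # Routed to the smaller, cheaper instances.
    runs-on: [self-hosted, light]
    steps:
      - uses: actions/checkout@v4
      - run: ruff check .
```

A job only lands on a runner that carries every label in the list, so the split is enforced by the scheduler rather than by convention.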

Saving: ~2 minutes on heavy jobs.

Change 5: Build cache for Docker images

Building containers without cache means every layer rebuilds. With proper layer caching, only changed layers rebuild.

We use BuildKit with registry-based cache:

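```yaml
- uses: docker/build-push-action@v5
  with:
    # Reuse layers from the cache image published by earlier builds
    cache-from: type=registry,ref=our-ecr-repo/cache:latest
    # Write layers back; mode=max also exports intermediate-stage layers
    cache-to: type=registry,ref=our-ecr-repo/cache:latest,mode=max
```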

The cache image is updated on every successful build. New PRs pull from this cache; only changed layers actually rebuild.

The Dockerfile structure matters too: install dependencies in their own layer (so the npm ci layer is reused when the lockfile is unchanged) and copy the source code last.

For a Node service: image build went from 4 min to 45 seconds in the typical case (only app code changed; deps cached).

Saving: ~3 minutes per build typically.

Change 6: Concurrency limits and queue management

CI runner saturation was causing 5-15 min queue times during peak.

Changes:

Added more runners (auto-scaling based on queue depth)
Per-PR concurrency limits (cancel the previous run when a new commit arrives, so only the latest commit runs)
Job-level priority (PRs to main get higher priority than nightly batch jobs)

The "kill previous run" change is important. If a developer pushes 3 commits in 10 minutes, only the third needs full CI; the first two are obsolete. Without this, all three queue and run.
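In GitHub Actions this behaviour can be expressed with a top-level concurrency block. A minimal sketch; the group naming is an assumption, and any string that is unique per PR would do:

```yaml
# Sketch only: the group name format is an assumption.
concurrency:
  # One group per workflow and pull request (falls back to the ref for
  # events that are not pull requests).
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  # A new run in the same group cancels the run still in progress.
  cancel-in-progress: true
```

Including the workflow name in the group keeps unrelated workflows on the same PR from cancelling each other.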

Saving: queue time dropped from 5-15 min to 0-3 min during peak.

Change 7: Smaller integration test sets per change

Integration tests originally ran the full suite for every PR. We split into:

Core integration tests: must pass for any PR. ~5 min.
Extended integration tests: run on PRs that touch related code, plus nightly. ~12 min.
Full integration tests: run on PRs to main only, plus nightly. ~25 min.

The split is by code paths exercised, not arbitrary categorization. Tests that exercise critical paths are in the core; tests that exercise edge cases of specific features are extended.
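A sketch of how the core and full tiers could be wired up, assuming the integration tests run under pytest and the core tests carry a marker named "core" (the marker, test path, and cron time are all hypothetical). The extended tier would additionally need a path filter and is left out here:

```yaml
# Sketch only: the "core" marker, test path, and cron time are hypothetical.
name: integration
on:
  pull_request:
  schedule:
    - cron: '0 3 * * *'   # nightly, in UTC
jobs:
  core:
    # Runs on every PR.
    if: github.event_name == 'pull_request'
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: pytest -m core tests/integration

  full:
    # Runs nightly and on PRs that target main.
    if: github.event_name == 'schedule' || github.base_ref == 'main'
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/integration
```

Keeping the tier membership in the test code (a marker) rather than in the workflow file means a test moves between tiers in the same PR that changes it.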

Saving: ~4 minutes on PRs that don't trigger the extended tests.

Change 8: Pre-commit checks moved to local

Some checks (formatting, basic linting) are fast and obvious. We moved them from CI to pre-commit hooks:

ruff format for Python
prettier for TypeScript / JSON / YAML
golangci-lint for Go
markdownlint for docs

Pre-commit catches these at commit time, so CI rarely has to report them. CI still runs a final check in case someone bypassed the hook, but the failure mode is "your local hook should have caught this," not "wait 30 seconds for CI to tell you."
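For reference, a pre-commit configuration covering the four tools might look like the following. This assumes the pre-commit framework; the rev values are illustrative pins, not the versions we run, so pin whatever release you have vetted:

```yaml
# .pre-commit-config.yaml
# Sketch only: rev values are illustrative pins, not the versions we run.
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
      - id: prettier
  - repo: https://github.com/golangci/golangci-lint
    rev: v1.61.0
    hooks:
      - id: golangci-lint
  - repo: https://github.com/igorshubovych/markdownlint-cli
    rev: v0.42.0
    hooks:
      - id: markdownlint
```

The CI backstop can then run the same hooks with pre-commit run --all-files, so local and CI results come from one configuration.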

Saving: ~30 seconds. Small in CI time but improves developer experience.

Change 9: Eager builds (slightly different)

For some workflows, we kick off the deploy build before tests pass. The image gets built and pushed; if tests fail, the image is just orphaned. If tests pass, the deploy can use the already-built image.

This trades some ECR storage cost for parallelism. For services with long build times (5+ min), it is worthwhile.
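In workflow terms, the eager build is simply a build job that does not declare a dependency on the test job. A sketch, with job names and make targets as placeholders:

```yaml
# Sketch only: job names and make targets are placeholders.
jobs:
  test:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: make test

  build-image:
    # No "needs: test", so the build starts at the same time as the tests.
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: make image-push

  deploy-preview:
    # Waits for both jobs. If tests fail, this job is skipped and the
    # already-pushed image is left orphaned.
    needs: [test, build-image]
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: make deploy-preview
```

The gate moves from the build to the deploy: nothing untested ships, but the build no longer sits behind the slowest test stage.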

Saving: ~2 minutes when it pays off (most successful PRs).

What didn't work

Some changes we tried that didn't help:

Bigger CI runners across the board. Diminishing returns. Beyond c6i.4xlarge, our tests didn't get faster. Some bottlenecks are I/O or single-threaded.

Distributed test sharding via custom orchestration. We considered breaking test runs across many small workers via a custom orchestration layer. The plumbing complexity wasn't worth the marginal speedup over pytest-xdist.

Pre-warmed test databases via persistent state. We tried keeping a long-running test database in CI with state pre-loaded. State drift between tests caused flakiness; reverted to per-job database setup.

Reducing test count. We considered "are all these tests needed?" Mostly yes. The few we removed were tests of long-deprecated code paths.
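For the per-job database setup mentioned above, a service container is one common approach. This sketch assumes PostgreSQL and Linux runners with Docker available, neither of which is stated in this post; the credentials and DATABASE_URL variable are placeholders:

```yaml
# Sketch only: assumes PostgreSQL and a Linux runner with Docker.
# Credentials and the DATABASE_URL variable are placeholders.
jobs:
  integration:
    runs-on: self-hosted
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        # Wait until the database accepts connections before steps run.
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 5s
          --health-timeout 5s
          --health-retries 10
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/integration
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/postgres
```

Each job gets a fresh, empty database that is discarded afterwards, which removes the state drift at the price of a few seconds of startup.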

The current state

Median CI time: ~6 minutes. p95: ~9 minutes. Queue time: usually < 1 minute.

Per-PR breakdown:

Lint + pre-commit equivalent checks: ~30 sec
Unit tests (parallel): ~3 min
Integration tests (relevant subset): ~3 min
Build + push image (cached): ~1 min
Security scan (cached): ~1 min
Deploy preview (parallel with above): ~3 min

Most things run in parallel, so the longest path determines the wall-clock time we see.

Operational discipline

Maintaining fast CI requires ongoing discipline:

Review CI durations weekly. A creep of 30 seconds is invisible on any single build but adds up across every build.

Fail fast, fail clear. Expensive checks (integration tests) run after cheap checks (lint, unit). A failure surfaces in the cheaper tier first, and the engineer fixes it without paying the full CI cost.

Test code is code. Bad tests cause flakiness, slow runs, and false confidence. We treat test code with the same review standards as production code.

Don't add more checks blindly. Each new check adds wall-clock time. Adding checks should require justification.

Cost reality

Self-hosted CI runners on Spot:

Compute cost: ~$400/month for our runner fleet
Storage (ECR cache, build artifacts): ~$50/month
Engineer time on CI maintenance: ~4 hours/week ongoing

Compared to the engineer time saved (every engineer waits less per PR), the ROI is large.

Compared to managed CI alternatives (CircleCI, Buildkite cloud, GitHub-hosted runners): we're cheaper and faster. The trade-off is the operational overhead.

What I'd tell a team starting

Cache aggressively. Dependencies, build artifacts, container layers. Largest single lever.

Parallelize tests. Most test suites parallelize fine with off-the-shelf tools.

Path-based skipping for unaffected jobs. Doc PRs shouldn't run integration tests.

Faster runners for compute-heavy jobs. Cost-per-CI-run stays similar; engineers wait less.

Per-PR concurrency limits. Don't run obsolete commits' jobs.

Pre-commit hooks for fast local checks. Faster feedback for developers; less CI work.

Watch the trends. Build time creeps up. Check weekly and act when it does.

CI optimization is one of the highest-ROI engineering investments per hour of effort. Engineers wait less; PRs merge faster; deploys flow more smoothly. The 22-minute improvement per build, multiplied across hundreds of PRs per week, is significant time recaptured.

About Kiril Urbonas
