Scalable CI/CD with GitHub Actions

We run roughly 600 GitHub Actions workflow runs per day across 80 repos. At small scale, GitHub Actions just works. At our scale, specific patterns matter: runner management, secrets handling, workflow reuse, and the limits that surface only when you push hard. This post covers what we've learned about keeping CI fast and reliable as that volume grew.

Why GitHub Actions

We chose Actions because:

Tight integration with our code (we already use GitHub for source control)
A healthy runner ecosystem
Reusable workflows that make standardization possible
Reasonable cost for our volume (mostly self-hosted runners)

For comparison, we've also used CircleCI, Jenkins, and Buildkite. Each has its strengths. Actions wins on "we already use GitHub" simplicity.

Self-hosted vs GitHub-hosted runners

Both have their place:

GitHub-hosted runners:

Free for public repos; included tier for private (varies by plan)
Zero ops
Cold start each run (a fresh VM, so nothing persists on disk between jobs)
Limited concurrent runners on lower plans

Self-hosted runners:

Pay for compute (we use spot EC2)
Cache state between runs (faster overall)
Can size for your workload
More setup and maintenance

We use a mix: self-hosted for the bulk of work; GitHub-hosted as a fallback for "I want this to run somewhere else right now."

For self-hosted, we run the Actions Runner Controller (ARC) on Kubernetes. Runners are pods that spin up to handle jobs and are torn down afterwards, and the pool auto-scales on queue depth.

Our setup: ~30 active runners at peak, scaling down to an idle baseline of ~5. Cost: ~$300-400/month on EC2 spot.
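
As a rough illustration of those numbers, here is a sketch of Helm values for ARC's runner scale set chart (gha-runner-scale-set). The post doesn't say which ARC mode or chart is in use, so the mode, the org URL, and the secret name below are assumptions. Only the 5 and 30 come from the figures above.

```yaml
# values.yaml for the gha-runner-scale-set Helm chart (assumed ARC mode)
githubConfigUrl: https://github.com/company   # hypothetical org URL
githubConfigSecret: arc-github-app            # hypothetical, pre-created Kubernetes secret
minRunners: 5    # idle baseline kept warm to absorb bursts
maxRunners: 30   # ceiling reached at peak
```

In that mode, jobs typically select the pool through the scale set's installation name in runs-on rather than a generic label, so the label you target depends on how ARC was installed.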

Workflow structure: reusable, composable

The biggest scaling lever: reusable workflows. Without them, we'd have 80 repos × N workflows each = lots of duplicated YAML.

A reusable workflow:

# .github/workflows/standard-build.yml in a central repo
on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
    secrets:
      aws-credentials:
        required: true

jobs:
  build:
    runs-on: self-hosted
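    steps:
      - uses: actions/checkout@v4
      # Local ./ paths resolve inside the caller's checkout, so reference the
      # shared action by repo and ref (assumes it lives in the central repo).
      - uses: company/.github/.github/actions/setup-build-env@v3
      - run: ./build.sh "${{ inputs.service-name }}"
      - run: ./test.sh
      - run: ./push.sh "${{ inputs.service-name }}"
        env:
          # Assumption: push.sh reads the credentials from this variable.
          AWS_CREDENTIALS: ${{ secrets.aws-credentials }}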

A repo using it:

on: pull_request

jobs:
  build:
    uses: company/.github/.github/workflows/standard-build.yml@v3
    with:
      service-name: my-service
    secrets:
      aws-credentials: ${{ secrets.AWS_CREDENTIALS }}

Most of our repos call ~3 reusable workflows. A new service needs only a short workflow file, under ten lines, that points at the standardized reusables.

Composite actions: smaller building blocks

Below the workflow level, composite actions encapsulate steps:

# .github/actions/setup-build-env/action.yml
name: Setup build env
runs:
  using: composite
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: 20
    - uses: actions/cache@v4
      with:
        path: ~/.npm
        key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    - run: npm ci
      shell: bash

A reusable workflow uses this with one line. Adding a new dependency cache step happens in one place.

We have ~15 composite actions for common patterns: setup-build-env, log-into-aws, run-trivy, etc.
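
Composite actions can also take inputs, which keeps one action usable across services. The sketch below is a guess at what something like log-into-aws might look like. The input names and the default region are hypothetical, and the real action may differ.

```yaml
# .github/actions/log-into-aws/action.yml (illustrative contents)
name: Log into AWS
inputs:
  role-arn:
    description: IAM role to assume through OIDC
    required: true
  aws-region:
    description: AWS region for the session
    required: false
    default: us-east-1
runs:
  using: composite
  steps:
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ inputs.role-arn }}
        aws-region: ${{ inputs.aws-region }}
```

The calling job still has to grant permissions: id-token: write itself, because a composite action runs with the permissions of the job that uses it.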

Caching strategy

Caching is the biggest performance lever. Our caches:

npm download cache (~/.npm) keyed by package-lock.json hash
Python pip cache (~/.cache/pip) keyed by requirements.txt hash
Maven .m2 repository keyed by pom.xml hash
Docker layers via BuildKit registry-based cache
Build artifacts (compiled binaries, generated files) keyed by source hash

GitHub's actions/cache is the standard mechanism. The default cache limit is 10 GB per repo, with older entries evicted once it is exceeded; we usually fit.
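
The pip and Maven entries from the list above could be expressed with actions/cache roughly as follows. This is a sketch: the restore-keys fallbacks are an addition for illustration, not something the post states.

```yaml
steps:
  # pip download cache, keyed by the requirements hash
  - uses: actions/cache@v4
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
      restore-keys: |
        ${{ runner.os }}-pip-

  # Maven local repository, keyed by the pom.xml hash
  - uses: actions/cache@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
      restore-keys: |
        ${{ runner.os }}-maven-
```

A restore-keys prefix lets a run start from the most recent older cache when the exact key misses, which usually beats starting empty.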

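For Docker layer caching, a registry-based cache (type=registry,ref=cache:latest, where ref is shorthand for a full image reference) works across runners because the cache lives in the registry rather than on a runner's disk. GitHub's built-in cache (type=gha) works too, but it counts against the same per-repo cache limit.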
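
With docker/build-push-action, a registry-backed cache might be wired up like this. The registry host, image name, and buildcache tag are placeholders, and the registry login step is omitted for brevity.

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: docker/setup-buildx-action@v3
  - uses: docker/build-push-action@v6
    with:
      context: .
      push: true
      tags: registry.example.com/my-service:${{ github.sha }}   # hypothetical image
      # read layers from, and write them back to, a dedicated cache tag
      cache-from: type=registry,ref=registry.example.com/my-service:buildcache
      cache-to: type=registry,ref=registry.example.com/my-service:buildcache,mode=max
```

mode=max exports layers from intermediate build stages as well, which costs more registry storage in exchange for more cache hits.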

Secrets handling

GitHub Actions secrets come in three scopes: organization, repository, and environment.

Our discipline:

Org-level: shared secrets used by many repos (CI base credentials, etc.)
Repo-level: repo-specific config that isn't sensitive
Environment-level: production credentials gated by environment approval

Environments are powerful — they can require human approval before secrets are exposed to a job. We use this for production deploys: the job pauses, waits for an approver, then runs.

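deploy-prod:
  environment: production   # the job is held until an approver signs off
  runs-on: self-hosted
  steps:
    - uses: actions/checkout@v4   # deploy.sh lives in the repo
    - run: ./deploy.sh
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.PROD_AWS_KEY }}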

The "production" environment is configured to require manual approval. The job is held before it starts; once an approver clicks, the deploy proceeds.

For per-job temporary credentials, we use OIDC federation: GitHub Actions issues a JWT that AWS / GCP trusts. No long-lived credentials to manage.

OIDC for cloud auth

Instead of long-lived AWS access keys in GitHub:

permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read

jobs:
  deploy:   # illustrative job name
    runs-on: self-hosted
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123:role/github-actions-deployer
          aws-region: us-east-1

GitHub mints a JWT; AWS exchanges it for temporary credentials. The JWT identifies the specific workflow/branch, so we can scope what each can do.

Setting this up was a one-time effort; ongoing maintenance is essentially zero. Vastly better than managing rotated access keys per repo.
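
The scoping happens in the role's trust policy on the AWS side. Below is a minimal sketch written as CloudFormation. The post doesn't say how the role is defined, so the tooling, the repository name company/my-service, and the main branch are all assumptions. It also assumes the GitHub OIDC provider already exists in the account.

```yaml
Resources:
  GitHubActionsDeployer:
    Type: AWS::IAM::Role
    Properties:
      RoleName: github-actions-deployer
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Federated: !Sub "arn:aws:iam::${AWS::AccountId}:oidc-provider/token.actions.githubusercontent.com"
            Action: sts:AssumeRoleWithWebIdentity
            Condition:
              StringEquals:
                "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                # only workflows running on main in this one repo may assume the role
                "token.actions.githubusercontent.com:sub": "repo:company/my-service:ref:refs/heads/main"
```

When a job declares an environment, the sub claim takes the form repo:OWNER/REPO:environment:NAME instead, so the condition has to match whichever shape the deploying job produces.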

Concurrency control

Two patterns:

Per-PR concurrency: when a new commit is pushed, cancel the in-progress run for the same PR.

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

Per-environment concurrency: only one deploy to production at a time.

concurrency:
  group: production-deploy
  cancel-in-progress: false  # queue, don't cancel

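These prevent a class of issues. Without concurrency control, multiple deploys can race; with it, they run one at a time. Note that a group holds at most one pending run: a newer queued deploy replaces an older one that is still waiting.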
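
When one workflow file handles both pull requests and pushes, the cancel behavior can be made conditional. This is a sketch of that variation, not the configuration used in our repos.

```yaml
concurrency:
  # PR runs are grouped per pull request; other events fall back to the ref
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  # cancel superseded PR runs, but let runs triggered by other events finish
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```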

What broke at scale

Specific issues we've hit:

Self-hosted runner pool exhaustion. During heavy days, queues built up. Auto-scaling up wasn't fast enough. Fix: keep a higher baseline of idle runners, even though it costs more, to absorb the burst.

Secret rotation orphaned old workflows. Rotating an org-level secret broke any job that hadn't picked up the new value. Fix: rotation procedure now includes "re-run any in-progress workflows."

Workflow run history slowness. With many concurrent runs, the GitHub UI sometimes lagged. Pure UX issue; the runs themselves were fine. We added a custom dashboard that aggregates runs across repos for ops visibility.

Cache contention. Heavy parallel builds hitting the same cache key sometimes caused thrash. Fix: more granular cache keys; smaller caches.
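
One way to make keys more granular is to add the job name, so parallel jobs stop competing for a single entry. The exact key layout here is an assumption for illustration.

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    # job id in the key: each parallel job reads and writes its own entry
    key: ${{ runner.os }}-${{ github.job }}-npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-${{ github.job }}-npm-
```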

OIDC federation misconfigured. A config typo gave a CI workflow more AWS permissions than intended. Caught in a security review. Fix: stricter trust policies (specific repo + branch + workflow), reviewed quarterly.

GitHub API rate limits. Some workflow patterns (lots of API calls per run) hit rate limits during peak. Fix: we built tooling that respects rate limits and uses GraphQL where possible, since one GraphQL query can often replace several REST calls.
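
A cheap guard is to check the remaining quota before an API-heavy step. The sketch below uses the gh CLI, which has to be present on self-hosted runner images. The threshold of 100 is arbitrary.

```yaml
- name: Check remaining API quota
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # /rate_limit reports per-resource quotas; "core" covers the REST API
    remaining=$(gh api rate_limit --jq '.resources.core.remaining')
    echo "core API calls remaining: $remaining"
    if [ "$remaining" -lt 100 ]; then
      echo "::warning::GitHub API quota is low; skip non-essential calls"
    fi
```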

Specific Actions we've stopped using

Generic "deploy to AWS" actions that wrap basic awscli. We use awscli directly. Less abstraction; clearer behavior.

Some community actions for build steps. If the action is doing something simple, we'd rather inline it (less supply chain risk). For complex actions, we evaluate before adding.

actions/checkout@v3 (and earlier). We pin to specific recent versions rather than a floating "latest" ref, because old versions of common actions get deprecated.
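
Pinned versions go stale unless something proposes upgrades. Dependabot can open those pull requests. The post doesn't say whether we use it, so treat this as one possible approach, with the weekly interval as an arbitrary choice.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: weekly
```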

Cost reality

Our monthly GitHub Actions bill:

GitHub Actions usage (org plan): included in our plan, plus ~$200/month for overage on a few high-traffic repos
Self-hosted runner compute (EC2 spot): ~$350/month
Cache storage (S3 for some Docker registry caches): ~$30/month

Total: ~$580/month. That is comparable to or cheaper than commercial CI alternatives at our scale.

Monitoring CI itself

We monitor:

Workflow run duration (per repo, per workflow) — trends matter; sudden increases are signal
Workflow run success rate — flake detection
Queue time for self-hosted runners — auto-scaling tuning signal
Cost per repo per month — for chargeback to teams

Datadog has a GitHub Actions integration; we use it for dashboards and add custom metrics on top via the API.

What I'd tell a team starting

Reusable workflows from day one. Adding them later means migrating every repo.

Self-hosted runners pay off above ~50 runs/day. Below that, GitHub-hosted is fine.

OIDC for cloud auth. Don't store long-lived access keys in GitHub.

Concurrency control on PRs. Cancel obsolete runs.

Composite actions for repeated steps. Reduces duplication.

Watch the trend on workflow duration. Slow CI compounds.

Environment-gated secrets for production. Manual approval before production credentials are exposed.

GitHub Actions at scale is mostly a discipline question. The features exist; the question is whether your team uses them consistently. The teams that get the most out of Actions have standardized workflows, reusable building blocks, and disciplined secrets management. The teams that struggle have copy-pasted workflows that drift, secrets sprinkled across repos, and unpredictable CI behavior. The patterns above are how we stayed in the first camp as we grew.

About Kiril Urbonas
