We run ~600 GitHub Actions workflow runs per day across 80 repos. The patterns that scale and the ones that hit limits we didn't expect.
We run roughly 600 GitHub Actions workflow runs per day across 80 repos. At small scale, GitHub Actions just works. At our scale, specific patterns matter — runner management, secrets handling, workflow reuse, and the limits that surface only when you push hard. This post is what we've learned to keep CI fast and reliable as scale grew.
We chose Actions because:
Comparison points: we've also used CircleCI, Jenkins, and Buildkite. Each has its strengths. Actions wins for "we already use GitHub" simplicity.
Both have their place:
GitHub-hosted runners:
Self-hosted runners:
We use a mix: self-hosted for the bulk of work; GitHub-hosted as a fallback for "I want this to run somewhere else right now."
For self-hosted, we run Actions Runner Controller (ARC) on Kubernetes. Runners are pods that spin up to handle jobs and are torn down afterward, and the pool autoscales on queue depth.
Our setup: ~30 active runners during peak, scaling down to ~5 idle baseline. Cost: ~$300-400/month on EC2 spot.
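As a rough sketch of how those bounds could be expressed, here is a minimal values file for ARC's `gha-runner-scale-set` Helm chart. The post does not say which ARC mode is in use, so the chart choice is an assumption, as are the org URL (inferred from the `company` org in the examples) and the secret name.

```yaml
# values.yaml for the gha-runner-scale-set Helm chart (assumed ARC mode)
githubConfigUrl: https://github.com/company   # org URL, inferred from the examples in this post
githubConfigSecret: arc-github-app            # hypothetical, pre-created Kubernetes secret
minRunners: 5                                 # idle baseline
maxRunners: 30                                # ceiling during peak
```

In this mode, jobs target a scale set by its Helm release name in `runs-on` rather than by the `self-hosted` label, so the `runs-on` values in the examples here would need to change accordingly.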
The biggest scaling lever: reusable workflows. Without them, we'd have 80 repos × N workflows each = lots of duplicated YAML.
A reusable workflow:
    # .github/workflows/standard-build.yml in a central repo
    on:
      workflow_call:
        inputs:
          service-name:
            required: true
            type: string
        secrets:
          aws-credentials:
            required: true
    jobs:
      build:
        runs-on: self-hosted
        steps:
          - uses: actions/checkout@v4
          # A "./" path resolves against the caller's checkout, so the composite
          # action in the central repo is referenced by its full name instead.
          - uses: company/.github/.github/actions/setup-build-env@v3
          - run: ./build.sh ${{ inputs.service-name }}
          - run: ./test.sh
          - run: ./push.sh ${{ inputs.service-name }}
A repo using it:
    on: pull_request
    jobs:
      build:
        uses: company/.github/.github/workflows/standard-build.yml@v3
        with:
          service-name: my-service
        secrets:
          aws-credentials: ${{ secrets.AWS_CREDENTIALS }}
Most of our repos call ~3 reusable workflows. New service = a short workflow file (under ten lines) pointing at standardized reusables.
Below the workflow level, composite actions encapsulate steps:
    # .github/actions/setup-build-env/action.yml
    name: Setup build env
    runs:
      using: composite
      steps:
        - uses: actions/setup-node@v4
          with:
            node-version: 20
        - uses: actions/cache@v4
          with:
            path: ~/.npm
            key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
        - run: npm ci
          shell: bash
A reusable workflow uses this with one line. Adding a new dependency cache step happens in one place.
We have ~15 composite actions for common patterns: setup-build-env, log-into-aws, run-trivy, etc.
Caching is the biggest performance lever. Our caches:
- npm cache (~/.npm) keyed by package-lock.json hash
- pip cache (~/.cache/pip) keyed by requirements.txt hash
- .m2 repository keyed by pom.xml hash

GitHub's actions/cache is the standard mechanism. The default cache size limit is 10 GB per repo; we usually fit.
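The pip and Maven caches can follow the same shape as the npm one. A minimal sketch of the two steps, assuming the default cache locations and a `restore-keys` fallback (the fallback is an addition for illustration, not something the setup above is known to use):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    # Fall back to the most recent pip cache when the lockfile hash changes.
    restore-keys: |
      ${{ runner.os }}-pip-
- uses: actions/cache@v4
  with:
    path: ~/.m2/repository
    key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
    restore-keys: |
      ${{ runner.os }}-maven-
```

With `restore-keys`, a changed dependency file still starts from a mostly warm cache instead of an empty one, at the cost of caches that slowly accumulate stale packages.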
For Docker layer caching, registry-based cache (type=registry,ref=cache:latest) works across runners. GitHub's built-in cache (type=gha) works too but has size limits.
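A registry-backed layer cache might be wired up roughly like this. It is a sketch, not the configuration behind this post: the registry host `registry.example.com`, the image name, and the `buildcache` tag are placeholders, and it assumes an earlier step has already logged in to the registry.

```yaml
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    tags: registry.example.com/my-service:${{ github.sha }}
    # Read layers from, and write them back to, a dedicated cache tag.
    cache-from: type=registry,ref=registry.example.com/my-service:buildcache
    cache-to: type=registry,ref=registry.example.com/my-service:buildcache,mode=max
```

`mode=max` exports layers from intermediate build stages as well as the final image, which makes the cache larger but improves hit rates for multi-stage Dockerfiles.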
Secrets in GitHub Actions can be scoped at three levels: organization, repository, and environment.
Our discipline:
Environments are powerful — they can require human approval before secrets are exposed to a job. We use this for production deploys: the job pauses, waits for an approver, then runs.
    deploy-prod:
      environment: production
      runs-on: self-hosted
      steps:
        - run: ./deploy.sh
          env:
            AWS_ACCESS_KEY_ID: ${{ secrets.PROD_AWS_KEY }}
The "production" environment is configured to require manual approval. The job waits before it starts; an approver clicks; the deploy proceeds.
For per-job temporary credentials, we use OIDC federation: GitHub Actions issues a JWT that AWS / GCP trusts. No long-lived credentials to manage.
Instead of long-lived AWS access keys in GitHub:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123:role/github-actions-deployer
          aws-region: us-east-1
GitHub mints a JWT; AWS exchanges it for temporary credentials. The JWT identifies the specific workflow/branch, so we can scope what each can do.
Setting this up was a one-time effort; ongoing maintenance is essentially zero. Vastly better than managing rotated access keys per repo.
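The scoping itself lives in the IAM role's trust policy. As a sketch only, here is what a role limited to one repo and one branch could look like in CloudFormation YAML. The repo name `company/my-service`, the branch, and the logical resource name are assumptions for illustration, and the sketch presumes the GitHub OIDC provider is already registered in the account.

```yaml
Resources:
  GithubActionsDeployerRole:          # hypothetical logical name
    Type: AWS::IAM::Role
    Properties:
      RoleName: github-actions-deployer
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Federated: !Sub "arn:aws:iam::${AWS::AccountId}:oidc-provider/token.actions.githubusercontent.com"
            Action: "sts:AssumeRoleWithWebIdentity"
            Condition:
              StringEquals:
                # Only tokens minted for AWS STS...
                "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                # ...by this repo, running on this branch (assumed values).
                "token.actions.githubusercontent.com:sub": "repo:company/my-service:ref:refs/heads/main"
```

One detail to watch: when a job declares an `environment`, GitHub formats the `sub` claim as `repo:<org>/<repo>:environment:<name>` instead of the branch ref, so the condition has to match that form for environment-gated deploys.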
Two patterns:
Per-PR concurrency: when a new commit is pushed, cancel the in-progress run for the same PR.
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true
Per-environment concurrency: only one deploy to production at a time.
    concurrency:
      group: production-deploy
      cancel-in-progress: false # queue, don't cancel
These prevent a class of issues. Without concurrency control, multiple deploys can race; with it, you serialize cleanly.
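The deploy variant can also sit on the job rather than the whole workflow, so that only the deploy is serialized. A sketch that combines it with the environment gate, reusing the `deploy-prod` job from the secrets example:

```yaml
jobs:
  deploy-prod:
    environment: production
    # Job-level group: other jobs in this workflow still run in parallel.
    concurrency:
      group: production-deploy
      cancel-in-progress: false
    runs-on: self-hosted
    steps:
      - run: ./deploy.sh
```

Per GitHub's documentation, a concurrency group holds at most one running and one pending job. With `cancel-in-progress: false` the running deploy is never interrupted, but a newly queued deploy replaces any older one still pending, so this is a queue of depth one rather than an unbounded one.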
Specific issues we've hit:
Self-hosted runner pool exhaustion. During heavy days, queues built up. Auto-scaling up wasn't fast enough. Fix: keep a higher baseline of idle runners, even though it costs more, to absorb the burst.
Secret rotation orphaned old workflows. Rotating an org-level secret broke any job that hadn't picked up the new value. Fix: rotation procedure now includes "re-run any in-progress workflows."
Workflow run history slowness. With many concurrent runs, the GitHub UI sometimes lagged. Pure UX issue; the runs themselves were fine. We added a custom dashboard that aggregates runs across repos for ops visibility.
Cache contention. Heavy parallel builds hitting the same cache key sometimes caused thrash. Fix: more granular cache keys; smaller caches.
OIDC federation misconfigured. A config typo gave a CI workflow more AWS permissions than intended. Caught in a security review. Fix: stricter trust policies (specific repo + branch + workflow), reviewed quarterly.
GitHub API rate limits. Some workflow patterns (lots of API calls per run) hit rate limits during peak. Fix: built tooling that respects rate limits; uses GraphQL where possible (cheaper than REST per query).
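For the rate-limit case, one simple guard is a step that checks the remaining quota before an API-heavy part of a job. This is a sketch rather than our actual tooling: it assumes the `gh` CLI is installed on the runner, and the threshold of 100 requests is an arbitrary illustrative value.

```yaml
- name: Wait if the API quota is nearly exhausted
  shell: bash
  env:
    GH_TOKEN: ${{ github.token }}
    MIN_REMAINING: "100"   # hypothetical threshold
  run: |
    # The rate_limit endpoint itself does not count against the quota.
    remaining=$(gh api rate_limit --jq '.resources.core.remaining')
    reset=$(gh api rate_limit --jq '.resources.core.reset')
    echo "core requests remaining: ${remaining}"
    if [ "${remaining}" -lt "${MIN_REMAINING}" ]; then
      # reset is a Unix timestamp; sleep until the window rolls over.
      wait_seconds=$(( reset - $(date +%s) ))
      if [ "${wait_seconds}" -gt 0 ]; then
        echo "sleeping ${wait_seconds}s until the quota resets"
        sleep "${wait_seconds}"
      fi
    fi
```

Sleeping holds a runner for up to the length of the rate-limit window, so on a busy pool it may be preferable to fail fast and retry the job later.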
Generic "deploy to AWS" actions that wrap basic awscli. We use awscli directly. Less abstraction; clearer behavior.
Some community actions for build steps. If the action is doing something simple, we'd rather inline it (less supply chain risk). For complex actions, we evaluate before adding.
actions/checkout@v3 (and earlier). We pin to specific recent versions rather than a floating "latest"; old versions of common actions get deprecated.
Our monthly GitHub Actions bill:
Total: ~$580/month. Compared to commercial CI alternatives, comparable or cheaper at our scale.
We monitor:
Datadog has a GitHub Actions integration; we use it for dashboards. Custom metrics on top via the API.
Reusable workflows from day one. Adding them later means migrating every repo.
Self-hosted runners pay off above ~50 runs/day. Below that, GitHub-hosted is fine.
OIDC for cloud auth. Don't store long-lived access keys in GitHub.
Concurrency control on PRs. Cancel obsolete runs.
Composite actions for repeated steps. Reduces duplication.
Watch the trend on workflow duration. Slow CI compounds.
Environment-gated secrets for production. Manual approval before production credentials are exposed.
GitHub Actions at scale is mostly a discipline question. The features exist; the question is whether your team uses them consistently. The teams that get the most out of Actions have standardized workflows, reusable building blocks, and disciplined secrets management. The teams that struggle have copy-pasted workflows that drift, secrets sprinkled across repos, and unpredictable CI behavior. The patterns above are how we stayed in the first camp as we grew.