Bills hit $3,400/mo for runner minutes. We moved to self-hosted on EKS spot. The savings were real; the surprises were too.
Our GitHub-hosted runner bill grew to $3,420/month across ~140 active workflows. We migrated to self-hosted runners on EKS using spot instances. The new bill is $210/month. Here's what worked, what broke, and what we'd do differently.
Three forces compounded:
GitHub's Linux x64 minutes are cheap ($0.008/min), but $3,400/mo gets attention.
```
┌──────────────────────────────────────────────────┐
│ GitHub Actions Workflow                          │
│ runs-on: [self-hosted, linux, x64, prod]         │
└────────────────────────┬─────────────────────────┘
                         │ webhook
                         ▼
┌──────────────────────────────────────────────────┐
│ Actions Runner Controller (ARC) on EKS           │
│ - Watches GitHub queue                           │
│ - Spins up ephemeral runner pod per job          │
│ - Pod runs on spot c7i.xlarge node pool          │
└──────────────────────────────────────────────────┘
```
We use the official actions-runner-controller Helm chart. Each runner is a fresh pod, scheduled on a Karpenter-managed spot node pool.
```yaml
# values.yaml (trimmed)
template:
  spec:
    nodeSelector:
      karpenter.sh/capacity-type: spot
    tolerations:
      - key: ci
        operator: Equal
        value: "true"
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        resources:
          requests: { cpu: 2, memory: 4Gi }
          limits: { cpu: 4, memory: 8Gi }
```
A custom Karpenter NodePool provisions c7i.xlarge and c7i.2xlarge spot nodes with a CI-only taint so other workloads don't land there.
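A NodePool along these lines does that provisioning. This is a sketch, not our exact config; the pool name, the `EC2NodeClass` name, and the CPU limit are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci-spot                  # illustrative name
spec:
  template:
    spec:
      taints:
        - key: ci                # CI-only taint; runner pods tolerate it
          value: "true"
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7i.xlarge", "c7i.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ci                 # assumes an EC2NodeClass named "ci" exists
  limits:
    cpu: "200"                   # cap total CI capacity; number is illustrative
```

The taint plus the matching toleration in the runner spec keeps non-CI workloads off these nodes and CI pods off everything else.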
ARC doesn't run macOS. Our Electron build needs macOS for code signing.
Fix: kept macOS jobs on GitHub-hosted runners (the expensive ones), moved everything else to self-hosted. macOS still costs ~$800/mo but it's 5 jobs/day, not 50.
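The split is just a per-job `runs-on` decision in the same workflow. A sketch (job names and steps are illustrative):

```yaml
jobs:
  sign-macos:
    runs-on: macos-14                          # stays GitHub-hosted for code signing
    steps:
      - run: ./scripts/sign.sh                 # hypothetical signing script

  build-linux:
    runs-on: [self-hosted, linux, x64, spot]   # everything else moves to self-hosted
    steps:
      - run: npm run build
```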
Roughly 1 in 30 jobs got terminated mid-run. The job retried, but engineers saw red Xs and got nervous.
Fix: two-tier setup. Critical jobs (deploys, release builds) run on runs-on: [self-hosted, linux, x64, on-demand] with a small on-demand node pool. Bulk jobs (tests, lints, scans) tolerate spot interruption and just retry.
```yaml
deploy:
  runs-on: [self-hosted, linux, x64, on-demand]  # never spot

test:
  runs-on: [self-hosted, linux, x64, spot]       # spot is fine
```
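To make the retry automatic instead of a manual "re-run failed jobs" click, a small follow-up workflow can re-run failures once via the `gh` CLI. A sketch; the watched workflow name is an assumption, and the `run_attempt == 1` guard prevents retry loops:

```yaml
name: retry-spot-failures
on:
  workflow_run:
    workflows: ["ci"]            # assumes the bulk workflow is named "ci"
    types: [completed]

jobs:
  rerun:
    # Only retry first-attempt failures, once
    if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt == 1
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - run: gh run rerun ${{ github.event.workflow_run.id }} --failed
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
```

This retries genuine flakes and spot kills alike, so it's worth pairing with alerting on repeat failures.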
GitHub-hosted runners have caching for actions/cache baked in. Self-hosted runners need their own cache backend or each job downloads dependencies fresh.
```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('package-lock.json') }}
```
Without a cache backend, every run starts on a brand-new pod with an empty cache, so every job redownloaded every dependency.
Fix: deployed an S3-backed cache server using runs-on/cache-action so actions/cache writes/reads from S3. Cache hit rate went from 0% to 78%; average job time dropped from 6.2 min to 2.4 min.
About 30% of our jobs build container images. The runner pod can't run `docker build` without a Docker daemon, and giving it one in-pod means privileged mode, which is a security risk.
Fix: switched to rootless BuildKit with Buildx's remote-builder pattern, pointing every job at a shared buildkitd endpoint.
```yaml
- uses: docker/setup-buildx-action@v3
  with:
    driver: remote
    endpoint: tcp://buildkitd.ci-system.svc:1234
```
A long-lived buildkitd deployment handles all builds. Cache layers are shared across PR branches. Image build time dropped 40% from the cache reuse alone.
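A minimal sketch of that long-lived builder, assuming the `ci-system` namespace from the endpoint above; sizing, security settings, and cache persistence are simplified here and would need hardening for real use:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: buildkitd
  namespace: ci-system
spec:
  replicas: 1
  selector:
    matchLabels: { app: buildkitd }
  template:
    metadata:
      labels: { app: buildkitd }
    spec:
      containers:
        - name: buildkitd
          image: moby/buildkit:rootless
          args: ["--addr", "tcp://0.0.0.0:1234", "--oci-worker-no-process-sandbox"]
          securityContext:
            runAsUser: 1000                      # rootless: no privileged mode needed
            seccompProfile: { type: Unconfined } # rootless buildkit needs a relaxed profile
---
apiVersion: v1
kind: Service
metadata:
  name: buildkitd
  namespace: ci-system
spec:
  selector: { app: buildkitd }
  ports:
    - port: 1234
```

Because all builds land on the same buildkitd, its layer cache is shared across jobs and branches for free, which is where the 40% build-time drop came from.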
Our previous workflows used AWS_ACCESS_KEY_ID from GitHub secrets. On self-hosted we wanted to use IAM Roles for Service Accounts (IRSA).
Fix: each runner pod has a service account with a scoped IAM role. The job assumes the role automatically; no AWS keys in GitHub secrets at all.
```yaml
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/github-runner-deploy
```
This was the biggest security win of the migration. Zero static AWS credentials in GitHub.
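With IRSA, deploy steps just call the AWS CLI or SDK and credentials come from the pod's service account. A quick sanity-check job looks like this (sketch):

```yaml
verify-identity:
  runs-on: [self-hosted, linux, x64, on-demand]
  steps:
    # No configure-aws-credentials step, no secrets: the SDK picks up
    # the web-identity token mounted into the pod by IRSA.
    - run: aws sts get-caller-identity
```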
| Metric | Before | After |
|---|---|---|
| Monthly cost | $3,420 | $210 |
| Avg PR feedback time | 12 min | 9 min |
| Cache hit rate | 64% | 78% |
| Spot interruption rate | n/a | 3.4% |
| Static AWS credentials in GH | 11 | 0 |
| On-call pages from CI | 2 | 4 |
The slight uptick in CI-related on-call (2 → 4) is because we now own more of the stack. None were severe.
Label routing is a gotcha: a job that requests only a subset of labels (say, just `self-hosted`) can be picked up by any runner pool carrying those labels, including `prod`. Be specific on both the job's `runs-on` and the pool's labels so the wrong pool doesn't grab jobs.

Don't migrate if any of these apply:
Strong case if all of these are true:
We hit those thresholds; the migration paid for itself in 6 weeks.