We upgraded a 60-node EKS cluster from 1.27 to 1.31 over six months. Four minor versions, one bad surprise, zero customer impact. Here's the playbook.
We took our production EKS cluster from 1.27 to 1.31 over six months: four minor version jumps, each with a control-plane upgrade and a full node-group rotation. The cluster runs about 60 nodes at peak with ~14 production services. No customer-visible impact.
That outcome wasn't from being clever; it was from being boring on purpose. Every step had a written checklist that we executed without improvising. This post is the checklist, plus the one bad surprise that didn't break us only because the rollback path was already prepped.
Roughly one minor-version upgrade every 6-8 weeks. Faster than that and the team felt rushed; slower and we'd accumulate dead deprecation work and end up doing two upgrades back-to-back, which is worse.
Each upgrade has the same five phases. Total elapsed time per upgrade: 10-14 days. Engineering time: maybe 2-3 days, distributed across the calendar.
Before touching anything, one engineer reads:
The output of this phase is a one-page document called upgrade-N.M.md in our infra repo. It lists:
This document gets a peer review like any PR. It's the most important hour of the upgrade.
For 1.30 specifically, the doc flagged that PodDisruptionBudget behavior had changed for unhealthy pods, that cgroups v2 was the default, and that the in-tree GCE PD plugin removal didn't affect us (we're on AWS). The first two would have caused issues if we'd skipped reading; the third we could rule out only because we had read it.
kubectl convert and kubent both help, but neither catches everything. We do a four-pass scan of every manifest in our infra repos:
kubent for known deprecations

The PRs from this phase always land before any cluster work. They merge to main, deploy through the normal release pipeline, and we let them bake on the current cluster version for at least a week.
We have a staging cluster that mirrors production at smaller scale (10-15 nodes). It runs a copy of every production workload at about 10% replicas.
Order of operations on staging:
Day 1 morning: Upgrade control plane
Day 1 afternoon: Update aws-vpc-cni, kube-proxy, CoreDNS
Day 2: Rotate node groups (gradual, 25% of nodes per cycle)
Day 3: Soak. Run full integration tests. Check every dashboard.
Soak time matters. Some failure modes only show up after hours under load. We've caught:
If staging is clean for 48 continuous hours of realistic synthetic traffic, we proceed.
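The 48-hour rule can be written down as a small gate. This is a sketch under one assumption that the text does not spell out: any alert during the soak resets the clock, which is one reading of "continuous". The names `soak_passed` and `alert_times` are hypothetical, and the alert timestamps are assumed to come from whatever alerting system is in use.

```python
from datetime import datetime, timedelta

REQUIRED_CLEAN = timedelta(hours=48)  # the 48-hour soak rule


def soak_passed(soak_start, now, alert_times, required=REQUIRED_CLEAN):
    """Return True if staging has been clean for at least `required`.

    The clean period starts at the soak start or at the most recent alert
    inside the soak window, whichever is later (assumption: any alert
    resets the clock).
    """
    in_window = [t for t in alert_times if soak_start <= t <= now]
    clean_since = max(in_window, default=soak_start)
    return now - clean_since >= required
```

With no alerts the gate opens exactly 48 hours after the soak starts. An alert 20 hours in pushes the earliest promotion to hour 68.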
Same order as staging, slower pace.
Day 1: Control plane upgrade. Watch metrics for 24h.
Day 2-3: Add-on upgrades, one at a time, with bake periods.
Day 4-5: Node rotation, 10% of nodes at a time, 30-min spacing.
The node rotation is the riskiest part because it involves the most concurrent change. We use Karpenter, so it's mostly automatic: we drain old nodes and Karpenter spins up new ones at the target version. The 10% / 30-min cadence is a hard guardrail that limits blast radius if something goes wrong.
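To make the cadence concrete, here is a minimal Python sketch that turns a node list into a drain schedule. It only plans the batches; the draining itself and Karpenter's replacement of nodes are out of scope. `plan_rotation` and the node names are hypothetical, and rounding the batch size down (with a floor of one node) is an assumption.

```python
from datetime import datetime, timedelta


def plan_rotation(nodes, start, percent=10, spacing=timedelta(minutes=30)):
    """Split `nodes` into batches of at most `percent` of the fleet and give
    each batch a start time `spacing` after the previous one.

    Returns a list of (start_time, batch) tuples in drain order.
    """
    if not nodes:
        return []
    # Integer arithmetic avoids float rounding; always drain at least one node.
    batch_size = max(1, len(nodes) * percent // 100)
    plan = []
    for i in range(0, len(nodes), batch_size):
        batch_index = i // batch_size
        plan.append((start + batch_index * spacing, nodes[i:i + batch_size]))
    return plan
```

For a 60-node cluster this yields ten batches of six nodes, with the last batch starting four and a half hours after the first. A pause on a red signal then amounts to not starting the next batch in the plan.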
During node rotation, we watch:
If any signal goes red, we pause rotation. We don't roll back unless something is actively broken; usually we just pause until we understand the signal.
After all nodes are on the new version, we run for a full business week before declaring the upgrade done. During that week the on-call team is primed for upgrade-related issues: a separate Slack channel, an explicit "if anything weird happens, mention this thread" instruction.
End-of-upgrade we do a short retrospective: what surprised us, what worked, anything to add to the next upgrade's checklist. The retrospective doc gets attached to the upgrade-N.M.md from Phase 1.
During the 1.29 upgrade, the Karpenter version we were on (an older 0.32.x) turned out to be incompatible with a kubelet behavior change in 1.29 around graceful node shutdown. New nodes came up but then occasionally got marked NotReady for ~90 seconds during their first reconciliation.
We caught this on staging during the soak. The fix was to upgrade Karpenter to 0.36.x first, then re-upgrade staging, then proceed. Cost us a week.
What kept this from being a real incident: the Phase 1 changelog read had flagged "kubelet graceful shutdown changes" as a behavior to monitor, the staging soak had time to surface it, and we hadn't promoted to production yet. None of the three on its own would have been enough; all three together turned a potential prod incident into a one-week schedule slip.
Skip a minor version. We thought about jumping 1.27 directly to 1.29 once. The upgrade matrix isn't designed for that — many tools test against N→N+1 only — and the failure modes compound. We've stuck to single minor jumps since.
Upgrade during a release week. Twice we tried to slot an upgrade into the week of a product release. Both times the upgrade got disrupted because all the engineers were busy with the release. Now upgrades happen in their own dedicated week, never adjacent to product launches.
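The single-minor-jump rule is easy to encode so that nobody plans a skipped version by accident. A minimal sketch, assuming versions are plain `major.minor` strings; `upgrade_path` is a hypothetical helper, not part of any EKS tooling.

```python
def upgrade_path(current, target):
    """Return every minor version to step through from `current` to `target`,
    one minor at a time. Both are 'major.minor' strings."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")
    # One entry per upgrade: never skip a minor version.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.27", "1.31"))  # ['1.28', '1.29', '1.30', '1.31']
```

Each entry in the result is one full pass through the five phases, so the length of the list multiplied by the 6-8 week cadence gives a rough calendar estimate.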
Two things matter most: read the changelogs, and run a real staging cluster.
Most teams that have rough upgrades have one of these problems: they didn't read the changelogs (so they're surprised by behavior changes) or their staging cluster is too thin/synthetic to catch real issues. Both are fixable. Neither is glamorous.
The third thing, which doesn't matter for the first upgrade but matters a lot for the tenth: build a checklist and treat every upgrade as boring. The boring upgrades take less time, fail less often, and don't keep anyone up at night. Boring is the goal.