We upgraded a 60-node EKS cluster from 1.27 to 1.31 over six months. Four minor versions, one bad surprise, zero customer impact. Here's the playbook.

Kubernetes Cluster Upgrade Strategy

We took our production EKS cluster from 1.27 to 1.31 over six months: four minor version jumps, which means four control-plane upgrades, with full node-group rotations each time. The cluster runs about 60 nodes at peak with ~14 production services. No customer-visible impact.

That outcome wasn't from being clever; it was from being boring on purpose. Every step had a written checklist that we executed without improvising. This post is the checklist, plus the one bad surprise that didn't break us only because the rollback path was already prepped.

The cadence we settled on

Roughly one minor-version upgrade every 6-8 weeks. Faster than that and the team felt rushed; slower and deprecation work piled up until we ended up doing two upgrades back-to-back, which is worse.

Each upgrade has the same five phases. Total elapsed time per upgrade: 10-14 days. Engineering time: maybe 2-3 days, distributed across the calendar.

Phase 1: Read every changelog (1 day)

Before touching anything, one engineer reads:

The Kubernetes release notes for the target minor version, top to bottom
The EKS-specific release notes (AWS sometimes lags or has its own quirks)
The release notes for every major add-on we run: Istio, ArgoCD, Karpenter, Prometheus stack, External DNS, cert-manager, AWS Load Balancer Controller, the Vault Agent injector

The output of this phase is a one-page document called upgrade-N.M.md in our infra repo. It lists:

Deprecations that affect us (with file paths to manifests we need to update)
New defaults we should be aware of
Add-on minimum/maximum versions for the target K8s version
Anything in the changelogs marked "behavior change" or "potentially breaking"
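The add-on minimum/maximum line is the easiest part of that document to turn into a mechanical check. Here is a minimal sketch, assuming a hand-maintained compatibility table. The function names are hypothetical and every version number below is a made-up placeholder, not real compatibility data; the real ranges come from each add-on's release notes.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

Version = Tuple[int, ...]

class Range(NamedTuple):
    minimum: Version             # lowest add-on version supported on the target K8s
    maximum: Optional[Version]   # highest supported version, or None if no known cap

def parse(version: str) -> Version:
    """Turn '0.36.2' or 'v0.36.2' into (0, 36, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))

def incompatible(installed: Dict[str, str], supported: Dict[str, Range]) -> List[str]:
    """Return one message per add-on whose installed version is out of range."""
    problems = []
    for name, version in sorted(installed.items()):
        allowed = supported.get(name)
        if allowed is None:
            problems.append(f"{name}: no compatibility entry, check release notes")
            continue
        parsed = parse(version)
        if parsed < allowed.minimum:
            problems.append(f"{name}: {version} is below the minimum")
        elif allowed.maximum is not None and parsed > allowed.maximum:
            problems.append(f"{name}: {version} is above the maximum")
    return problems

# Placeholder numbers for illustration only.
SUPPORTED = {
    "karpenter": Range(minimum=(0, 34, 0), maximum=None),
    "cert-manager": Range(minimum=(1, 12, 0), maximum=(1, 14, 99)),
}
INSTALLED = {"karpenter": "0.32.1", "cert-manager": "1.13.2", "external-dns": "0.14.0"}

for line in incompatible(INSTALLED, SUPPORTED):
    print(line)
```

An add-on with no entry in the table is reported rather than silently passed, so a missing row forces someone to go read the release notes.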

This document gets a peer review like any PR. It's the most important hour of the upgrade.

For 1.30 specifically, the doc flagged that PodDisruptionBudget behavior had changed for unhealthy pods, that cgroups v2 was the default, and that the in-tree GCE PD plugin removal didn't affect us (we're on AWS). The first two would have caused issues if we'd skipped the reading; the third we could rule out up front instead of wondering about it later.

Phase 2: Update manifests for deprecations (1-2 days)

kubectl convert and kubent both help, but neither catches everything. We do a four-pass scan of every manifest in our infra repos:

kubent for known deprecations
Our own grep-based scan against a list of patterns we maintain
A dry-run apply against a fresh test cluster on the target version
Application teams reviewing their own manifests for less-obvious changes (e.g. annotation behavior changes that don't fail apply but change runtime behavior)
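The grep-based pass is simple enough to sketch. The version below is an illustration under assumptions, not a drop-in tool: the function names are hypothetical, and the three patterns are just well-known removed API versions standing in for whatever list a team maintains.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real list grows with every changelog read.
# Each entry is (compiled regex, reason to show the reader).
PATTERNS = [
    (re.compile(r"^\s*apiVersion:\s*policy/v1beta1\s*$"),
     "policy/v1beta1 was removed in 1.25"),
    (re.compile(r"^\s*apiVersion:\s*autoscaling/v2beta2\s*$"),
     "autoscaling/v2beta2 was removed in 1.26"),
    (re.compile(r"^\s*apiVersion:\s*flowcontrol\.apiserver\.k8s\.io/v1beta2\s*$"),
     "flowcontrol.apiserver.k8s.io/v1beta2 was removed in 1.29"),
]

def scan_text(text):
    """Yield (line_number, reason) for every line that matches a known pattern."""
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in PATTERNS:
            if pattern.search(line):
                yield number, reason

def scan_tree(root):
    """Scan every .yaml/.yml file under root and return 'path:line: reason' strings."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in {".yaml", ".yml"}:
            for number, reason in scan_text(path.read_text(encoding="utf-8")):
                findings.append(f"{path}:{number}: {reason}")
    return findings

# Small inline demo so the sketch runs without a repo checkout.
SAMPLE = "apiVersion: autoscaling/v2beta2\nkind: HorizontalPodAutoscaler\n"
for number, reason in scan_text(SAMPLE):
    print(f"sample.yaml:{number}: {reason}")
```

A line-based scan like this misses templated manifests (Helm, Kustomize overlays) unless it runs against rendered output, which is one reason it is only one of the four passes.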

The PRs from this phase always land before any cluster work. They merge to main, deploy through the normal release pipeline, and we let them bake on the current cluster version for at least a week.

Phase 3: Upgrade the staging cluster (2-3 days)

We have a staging cluster that mirrors production at smaller scale (10-15 nodes). It runs a copy of every production workload at about 10% replicas.

Order of operations on staging:

Day 1 morning:    Upgrade control plane
Day 1 afternoon:  Update aws-vpc-cni, kube-proxy, CoreDNS
Day 2:            Rotate node groups (gradual, 25% of nodes per cycle)
Day 3:            Soak. Run full integration tests. Check every dashboard.

Soak time matters. Some failure modes take hours under load to show up. We've caught:

A subtle DNS resolution slowdown that only showed up when CoreDNS pods came under heavy concurrent load (we adjusted the autoscaling on staging to provoke it)
A CSI driver issue that only triggered on PVC resize operations
A networking issue that only fired on specific pod-to-pod communication paths

If staging is clean for 48 continuous hours under realistic synthetic traffic, we proceed.
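"Continuous" is the word doing the work in that rule: any alert resets the clock. A minimal sketch of the gate, assuming alert timestamps are already collected from whatever monitoring is in place; the function name and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

REQUIRED_CLEAN = timedelta(hours=48)

def soak_passed(soak_start, alert_times, now, required=REQUIRED_CLEAN):
    """Return True once the current alert-free streak is at least `required` long.

    The streak starts at the most recent alert inside the soak window,
    or at the start of the soak if nothing has fired.
    """
    relevant = [t for t in alert_times if soak_start <= t <= now]
    streak_start = max(relevant, default=soak_start)
    return now - streak_start >= required

start = datetime(2024, 3, 4, 9, 0)
alerts = [datetime(2024, 3, 4, 21, 30)]  # one alert 12.5 hours into the soak

# 48h after the soak started, but only 35.5h since the alert: not yet.
print(soak_passed(start, alerts, datetime(2024, 3, 6, 9, 0)))
# Exactly 48h after the alert: the gate opens.
print(soak_passed(start, alerts, datetime(2024, 3, 6, 21, 30)))
```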

Phase 4: Upgrade production (4-5 days, mostly waiting)

Same order as staging, slower pace.

Day 1:        Control plane upgrade. Watch metrics for 24h.
Day 2-3:      Add-on upgrades, one at a time, with bake periods.
Day 4-5:      Node rotation, 10% of nodes at a time, 30-min spacing.

Node rotation is the riskiest part because it changes the most things at once. We use Karpenter, so it's mostly automatic: we drain old nodes and Karpenter spins up new ones at the target version. The 10% / 30-min cadence is a hard guardrail that limits blast radius if something goes wrong.
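The 10% / 30-min guardrail amounts to a batching plan. The sketch below only computes the plan; it does not drain anything, and the function name and node names are hypothetical.

```python
from datetime import timedelta

def rotation_plan(nodes, percent=10, spacing=timedelta(minutes=30)):
    """Split nodes into batches of at most `percent` of the fleet.

    Returns (offset_from_start, batch) pairs. Batch size is rounded down,
    with a floor of one node so small clusters still make progress.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    size = max(1, len(nodes) * percent // 100)
    return [
        (spacing * index, nodes[start:start + size])
        for index, start in enumerate(range(0, len(nodes), size))
    ]

nodes = [f"node-{i:02d}" for i in range(60)]
plan = rotation_plan(nodes)
print(len(plan), "batches of", len(plan[0][1]), "nodes")
print("last batch starts at +", plan[-1][0])
```

Rounding the batch size down rather than up keeps the cap honest: a batch never exceeds the stated percentage of the fleet unless the fleet is too small for that to be possible.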

During node rotation, we watch:

Pod eviction success rate (any failures = stop and investigate)
Service endpoint availability (we want zero gaps for any service during rotation)
CPU/memory pressure on nodes during the bake period (a new node version can have different resource overhead)

If any signal goes red, we pause rotation. We don't roll back unless something is actively broken; usually we just pause until we understand the signal.
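The pause rule can be written down as a small gate that runs between batches. This is a sketch under assumptions: the field names are hypothetical, and how each number is collected (metrics queries, API calls) is left out.

```python
from dataclasses import dataclass

@dataclass
class RotationSignals:
    failed_evictions: int             # evictions that did not complete
    services_without_endpoints: int   # services that dropped to zero ready endpoints
    nodes_under_pressure: int         # nodes reporting CPU or memory pressure

def reasons_to_pause(signals):
    """Return the list of red signals; an empty list means continue rotating."""
    reasons = []
    if signals.failed_evictions > 0:
        reasons.append(f"{signals.failed_evictions} pod eviction(s) failed")
    if signals.services_without_endpoints > 0:
        reasons.append(f"{signals.services_without_endpoints} service(s) had an endpoint gap")
    if signals.nodes_under_pressure > 0:
        reasons.append(f"{signals.nodes_under_pressure} node(s) under resource pressure")
    return reasons

observed = RotationSignals(failed_evictions=0, services_without_endpoints=1, nodes_under_pressure=0)
reasons = reasons_to_pause(observed)
print("PAUSE: " + "; ".join(reasons) if reasons else "continue")
```

The gate only ever says "pause" or "continue". Deciding to roll back stays a human call, made after someone understands the signal.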

Phase 5: Soak, then close out (3-5 days)

After all nodes are on the new version, we run for a full business week before declaring the upgrade done. During that week the on-call team is primed for upgrade-related issues: a separate Slack channel, an explicit "if anything weird happens, mention this thread" instruction.

End-of-upgrade we do a short retrospective: what surprised us, what worked, anything to add to the next upgrade's checklist. The retrospective doc gets attached to the upgrade-N.M.md from Phase 1.

The bad surprise that didn't break us

During the 1.29 upgrade, the Karpenter version we were on (an older 0.32.x) turned out to be incompatible with a kubelet behavior change in 1.29 around graceful node shutdown. New nodes came up but then occasionally got marked NotReady for ~90 seconds during their first reconciliation.

We caught this on staging during the soak. The fix was to upgrade Karpenter to 0.36.x first, then re-upgrade staging, then proceed. Cost us a week.

What kept this from being a real incident: the Phase 1 changelog read had flagged "kubelet graceful shutdown changes" as a behavior to monitor, the staging soak had time to surface it, and we hadn't promoted to production yet. None of the three on its own would have been enough; all three together turned a potential prod incident into a one-week schedule slip.

What we wouldn't do again

Skip a minor version. We thought about jumping 1.27 directly to 1.29 once. The upgrade matrix isn't designed for that — many tools test against N→N+1 only — and the failure modes compound. We've stuck to single minor jumps since.
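The single-jump rule is easy to encode so that a plan cannot skip a version by accident. A minimal sketch, with a hypothetical function name, assuming versions are written as "major.minor":

```python
def upgrade_path(current, target):
    """List every minor version to step through, one jump at a time."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current; downgrades are not supported")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]

print(upgrade_path("1.27", "1.31"))  # ['1.28', '1.29', '1.30', '1.31']
```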

Upgrade during a release week. Twice we tried to slot an upgrade into the week of a product release. Both times the upgrade got disrupted because all the engineers were busy with the release. Now upgrades happen in their own dedicated week, never adjacent to product launches.

What we don't bother with

Staging upgrades by node pool (some teams do canary node pools first). Karpenter's gradual rotation gives us most of that benefit for less complexity.
Rolling back a control-plane upgrade. EKS doesn't support it cleanly. Our rollback strategy is "fix forward fast"; we've never needed it because Phase 3 catches the issues.
Multi-cluster blue/green for the cluster itself. Too expensive in our setup. We've thought about it for major version jumps but minor versions don't justify the cost.

What I'd tell a team doing this for the first time

Two things matter most: read the changelogs, and run a real staging cluster.

Most teams that have rough upgrades have one of these problems: they didn't read the changelogs (so they're surprised by behavior changes) or their staging cluster is too thin/synthetic to catch real issues. Both are fixable. Neither is glamorous.

The third thing, which doesn't matter for the first upgrade but matters a lot for the tenth: build a checklist and treat every upgrade as boring. The boring upgrades take less time, fail less often, and don't keep anyone up at night. Boring is the goal.

About Kiril Urbonas