After two years of running Karpenter on production EKS clusters, the NodePool patterns that survived, the ones we replaced, and the tuning that matters.
Karpenter replaced our Cluster Autoscaler about two years ago. The pitch — provisions nodes that match exactly what pending pods need, instead of scaling a pre-defined node group — held up. The reality is that getting Karpenter to behave well in production took several iterations on NodePool design. This post is what survived.
Karpenter watches for unschedulable pods, decides what kind of node would fit them, and provisions one (or several) directly via the cloud provider's API. It also handles the reverse: when nodes are underutilized, it consolidates pods onto fewer nodes and terminates the surplus.
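What triggers a launch is a pod that no existing node can fit. A minimal sketch (pod name and image are hypothetical): a pending pod requesting 4 CPU and 16Gi of memory; if nothing in the cluster has that much headroom, Karpenter picks an instance type that does and launches it.

# Hypothetical pending pod: if no existing node has 4 CPU / 16Gi free,
# Karpenter launches an instance type that fits, subject to the NodePool rules below.
apiVersion: v1
kind: Pod
metadata:
  name: report-builder
spec:
  containers:
    - name: worker
      image: example/report-builder:latest
      resources:
        requests:
          cpu: "4"
          memory: 16Gi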
Compared to Cluster Autoscaler, Karpenter is faster (no fixed node groups to manage, scaling decisions in seconds), more flexible (the right instance type per pod, not pre-decided), and operationally simpler (one controller, not a maze of ASGs). At scale the cost differences add up — picking the right instance for each workload saves real money.
The tradeoff is more responsibility on the team. Karpenter's defaults work for simple workloads; production usually needs tuning.
Three pools, each purpose-built. We tried both fewer and more, and landed on three.
system — for cluster add-ons (Argo CD, monitoring agents, CoreDNS, etc.). On-demand nodes only, no spot. Stable resource shapes (no r6gd weirdness), small types (m6i.large-ish). Taints prevent application pods from landing here (sketched below).
web — for stateless application pods. Mixed instance types (Karpenter picks the cheapest fit from a wide list), 90% spot. Application pods tolerate this pool by default.
batch — for batch jobs, ML preprocessing, anything that can survive interruption. 100% spot, wide instance variety. Different tolerations so only batch workloads land here.
We tried having one NodePool per service early on — every team had its own. It was a maintenance disaster. Three shared pools with clear differentiators (interruption tolerance, workload type) is the right granularity.
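The taint mechanics on the system pool are plain NodePool fields. A minimal sketch following the description above; the dedicated=system taint key is our own convention, not anything Karpenter requires:

# Sketch of the system NodePool (values illustrative)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: system
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # no spot for cluster add-ons
      taints:
        - key: dedicated          # our convention; pods need a matching toleration to land here
          value: system
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default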
web for spot + stability

The pool that took the most iteration:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: web
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: node.kubernetes.io/instance-type
          operator: NotIn
          values: ["t3.nano", "t3.micro"] # too small to be useful
      nodeClassRef:
        group: karpenter.k8s.aws   # group/kind are required in the v1 API
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "2000"
    memory: 8000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"
Key choices:
consolidateAfter: 1m. Karpenter waits 1 minute of underutilization before consolidating. Aggressive enough to save money; not so aggressive that it thrashes during normal variance.
budgets: 10% of nodes. At most 10% of the nodes in this pool can be disrupted at once. Bigger budgets = faster consolidation but more risk of mass pod evictions. 10% was our balance.
Spot interruption handling. Karpenter receives the 2-minute spot interruption notice from AWS via EventBridge and starts draining the node immediately. By the time the spot reclaim happens, the pods have ideally rescheduled elsewhere.
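For those notices to reach Karpenter, the controller watches an SQS queue fed by EventBridge rules. A minimal sketch of the Helm values, assuming a recent chart version where the key is settings.interruptionQueue (older charts used settings.aws.interruptionQueueName); the queue name is hypothetical:

# values.yaml fragment for the Karpenter Helm chart
settings:
  clusterName: main-cluster
  # Native interruption handling: Karpenter drains the node as soon as
  # the 2-minute notice lands on this queue.
  interruptionQueue: main-cluster-karpenter-interruptions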
Two minutes is not always enough. Things we changed:
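The levers are the ones called out in the takeaways at the end: PodDisruptionBudgets and termination grace periods that actually fit inside the notice window. A minimal sketch of both, with hypothetical names:

# Hypothetical web Deployment: leave enough time to drain in-flight requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      terminationGracePeriodSeconds: 60   # must fit inside the 2-minute notice
      containers:
        - name: app
          image: example/web-api:latest
---
# PDB: don't let consolidation or a spot reclaim take the service below 80% of replicas
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: web-api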
We see roughly 50 spot interruptions per week across the fleet during busy periods. With these mitigations, almost none turn into customer-visible incidents.
Two cases we don't put on spot:
Both go on the system (on-demand) pool with explicit selectors.
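Pinning them is a node selector plus a toleration for whatever taint the system pool applies. A sketch using the karpenter.sh/nodepool label Karpenter puts on the nodes it launches, and the hypothetical dedicated=system taint from earlier:

# Pod spec fragment: force scheduling onto the on-demand system pool
spec:
  nodeSelector:
    karpenter.sh/nodepool: system        # label Karpenter applies to nodes it launches
  tolerations:
    - key: dedicated                     # matches the system pool's taint (our convention)
      operator: Equal
      value: system
      effect: NoSchedule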
The NodePool says "what kind of node"; the EC2NodeClass says "what that node looks like."
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  role: KarpenterNodeRole
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "main-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "main-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
A few choices that matter:
amiFamily: Bottlerocket. Bottlerocket is the minimal-footprint, container-optimized OS from AWS. Smaller attack surface than full Amazon Linux, and boot time is ~30 seconds, fast for autoscaling response.
Discovery tags. Subnets and security groups are selected by the karpenter.sh/discovery: <cluster-name> tag, which we apply at Terraform-time.

Before Karpenter, we had Cluster Autoscaler with fixed node groups: m5.4xlarge for general compute, m5.large for system pods. Effective utilization was ~55% across the fleet, with lots of partial-fit nodes.
After Karpenter, the net effect on the AWS compute bill was a drop of roughly 52% for the same workload over the year following the migration. The numbers are approximate, but the direction was clear.
Two ongoing pain points:
Pod-to-node-startup race. When Karpenter provisions a new node for a pending pod, the pod sometimes lands on a different existing node that became available in the meantime (because some other pod terminated). The new node sits idle for a few minutes before consolidation reclaims it. Wasted spend, but small.
Mixed-arch in the same Deployment. A Deployment with replicas: 10 might end up with 6 pods on AMD64 nodes and 4 on ARM64. Both architectures work but it can confuse perf analysis (per-pod metrics across two CPU shapes are not directly comparable). For perf-sensitive services we pin to one arch with node selectors.
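Pinning is just the standard architecture label on the pod template. A sketch showing only the relevant fragment:

# Pod template fragment for a perf-sensitive service: keep every replica on one CPU architecture
spec:
  nodeSelector:
    kubernetes.io/arch: amd64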
Worth it but not trivial. The path that worked for us:
Took about two months for our fleet. The dual-running phase is a bit ugly (two autoscalers competing for the same cluster) but lets you migrate incrementally.
Three NodePools is enough for most teams. system, web, batch. Resist the per-team-pool urge.
Spot + on-demand fallback in the same pool. Don't make separate pools by capacity type.
Tune consolidateAfter to your tolerance. Faster consolidation = more savings + more churn. 1 minute works for us.
PDBs and termination grace periods matter on spot. Otherwise spot interruptions become incidents.
Watch interruption rates. AWS publishes per-instance-family spot interruption stats. Some families are way more interruption-prone than others. We exclude the worst offenders from our pools (see the sketch after this list).
Karpenter is one operator; treat it like any other. Monitor it, alert on its health, upgrade it in the same cadence as other operators.
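The exclusion itself is one more NodePool requirement. A sketch; the family values are placeholders, not a recommendation, so fill them in from the Spot Instance Advisor data for your regions:

# Fragment of a NodePool's template.spec.requirements
- key: karpenter.k8s.aws/instance-family
  operator: NotIn
  values: ["<family-1>", "<family-2>"]   # placeholders for the interruption-prone families you exclude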
Karpenter is one of the highest-ROI Kubernetes operational moves we've made. The cost savings are real; the operational simplicity over fixed node groups is real; the team time it freed up was real. The patterns above aren't exotic — they're just what survived a few iterations.