After two years of running Karpenter on production EKS clusters, the NodePool patterns that survived, the ones we replaced, and the tuning that matters.
Karpenter replaced our Cluster Autoscaler about two years ago. The pitch — provisions nodes that match exactly what pending pods need, instead of scaling a pre-defined node group — held up. The reality is that getting Karpenter to behave well in production took several iterations on NodePool design. This post is what survived.
Karpenter watches for unschedulable pods, decides what kind of node would fit them, and provisions one (or several) directly via the cloud provider's API. It also handles the reverse: when nodes are underutilized, it consolidates pods onto fewer nodes and terminates the surplus.
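What triggers a launch is a pod that no existing node can fit. A minimal sketch (pod name and image are hypothetical): a pending pod requesting 4 CPU and 16Gi of memory; if nothing in the cluster has that much headroom, Karpenter picks an instance type that does and launches it.

# Hypothetical pending pod: if no existing node has 4 CPU / 16Gi free,
# Karpenter launches an instance type that fits, subject to the NodePool rules below.
apiVersion: v1
kind: Pod
metadata:
  name: report-builder
spec:
  containers:
    - name: worker
      image: example/report-builder:latest
      resources:
        requests:
          cpu: "4"
          memory: 16Gi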
Compared to Cluster Autoscaler, Karpenter is faster (no fixed node groups to manage, scaling decisions in seconds), more flexible (the right instance type per pod, not pre-decided), and operationally simpler (one controller, not a maze of ASGs). At scale the cost differences add up — picking the right instance for each workload saves real money.
The tradeoff is more responsibility on the team. Karpenter's defaults work for simple workloads; production usually needs tuning.
Three pools, each purpose-built. We tried both fewer and more, and landed on three.
system — for cluster add-ons (Argo CD, monitoring agents, CoreDNS, etc.). On-demand nodes only, no spot. Stable resource shapes (no r6gd weirdness), small types (m6i.large-ish). Taints prevent application pods from landing here (sketched below).
web — for stateless application pods. Mixed instance types (Karpenter picks the cheapest fit from a wide list), 90% spot. Application pods tolerate this pool by default.
batch — for batch jobs, ML preprocessing, anything that can survive interruption. 100% spot, wide instance variety. Different tolerations so only batch workloads land here.
We tried having one NodePool per service early on — every team had its own. It was a maintenance disaster. Three shared pools with clear differentiators (interruption tolerance, workload type) is the right granularity.
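The taint mechanics on the system pool are plain NodePool fields. A minimal sketch following the description above; the dedicated=system taint key is our own convention, not anything Karpenter requires:

# Sketch of the system NodePool (values illustrative)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: system
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # no spot for cluster add-ons
      taints:
        - key: dedicated          # our convention; pods need a matching toleration to land here
          value: system
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default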
web for spot + stability

The pool that took the most iteration:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: web
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: node.kubernetes.io/instance-type
          operator: NotIn
          values: ["t3.nano", "t3.micro"] # too small to be useful
      nodeClassRef:
        group: karpenter.k8s.aws   # group/kind are required in the v1 API
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "2000"
    memory: 8000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"
Key choices:
consolidateAfter: 1m. Karpenter waits 1 minute of underutilization before consolidating. Aggressive enough to save money; not so aggressive that it thrashes during normal variance.
budgets: 10% of nodes. At most 10% of the nodes in this pool can be disrupted at once. Bigger budgets = faster consolidation but more risk of mass pod evictions. 10% was our balance.
Spot interruption handling. Karpenter receives the 2-minute spot interruption notice from AWS via EventBridge and starts draining the node immediately. By the time the spot reclaim happens, the pods have ideally rescheduled elsewhere.
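For those notices to reach Karpenter, the controller watches an SQS queue fed by EventBridge rules. A minimal sketch of the Helm values, assuming a recent chart version where the key is settings.interruptionQueue (older charts used settings.aws.interruptionQueueName); the queue name is hypothetical:

# values.yaml fragment for the Karpenter Helm chart
settings:
  clusterName: main-cluster
  # Native interruption handling: Karpenter drains the node as soon as
  # the 2-minute notice lands on this queue.
  interruptionQueue: main-cluster-karpenter-interruptions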
Two minutes is not always enough. Things we changed:
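The levers are the ones called out in the takeaways at the end: PodDisruptionBudgets and termination grace periods that actually fit inside the notice window. A minimal sketch of both, with hypothetical names:

# Hypothetical web Deployment: leave enough time to drain in-flight requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      terminationGracePeriodSeconds: 60   # must fit inside the 2-minute notice
      containers:
        - name: app
          image: example/web-api:latest
---
# PDB: don't let consolidation or a spot reclaim take the service below 80% of replicas
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: web-api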
We see roughly 50 spot interruptions per week across the fleet during busy periods. With these mitigations, almost none turn into customer-visible incidents.
Two cases we don't put on spot:
Both go on the system (on-demand) pool with explicit selectors.
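Pinning them is a node selector plus a toleration for whatever taint the system pool applies. A sketch using the karpenter.sh/nodepool label Karpenter puts on the nodes it launches, and the hypothetical dedicated=system taint from earlier:

# Pod spec fragment: force scheduling onto the on-demand system pool
spec:
  nodeSelector:
    karpenter.sh/nodepool: system        # label Karpenter applies to nodes it launches
  tolerations:
    - key: dedicated                     # matches the system pool's taint (our convention)
      operator: Equal
      value: system
      effect: NoSchedule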
The NodePool says "what kind of node"; the EC2NodeClass says "what that node looks like."
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  role: KarpenterNodeRole
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "main-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "main-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
A few choices that matter:
amiFamily: Bottlerocket. Bottlerocket is the minimal-footprint, container-optimized OS from AWS. Smaller attack surface than full Amazon Linux, and boot time is ~30 seconds, fast for autoscaling response.
Discovery tags. Subnets and security groups are selected by the karpenter.sh/discovery: <cluster-name> tag, which we apply at Terraform-time.

Before Karpenter, we had Cluster Autoscaler with fixed node groups: m5.4xlarge for general compute, m5.large for system pods. Effective utilization was ~55% across the fleet, with lots of partial-fit nodes.
After Karpenter, the net effect on the AWS compute bill was a drop of roughly 52% for the same workload over the year following the migration. The numbers are approximate, but the direction was clear.
Two ongoing pain points:
Pod-to-node-startup race. When Karpenter provisions a new node for a pending pod, the pod sometimes lands on a different existing node that became available in the meantime (because some other pod terminated). The new node sits idle for a few minutes before consolidation reclaims it. Wasted spend, but small.
Mixed-arch in the same Deployment. A Deployment with replicas: 10 might end up with 6 pods on AMD64 nodes and 4 on ARM64. Both architectures work but it can confuse perf analysis (per-pod metrics across two CPU shapes are not directly comparable). For perf-sensitive services we pin to one arch with node selectors.
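Pinning is just the standard architecture label on the pod template. A sketch showing only the relevant fragment:

# Pod template fragment for a perf-sensitive service: keep every replica on one CPU architecture
spec:
  nodeSelector:
    kubernetes.io/arch: amd64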
Worth it but not trivial. The path that worked for us:
Took about two months for our fleet. The dual-running phase is a bit ugly (two autoscalers competing for the same cluster) but lets you migrate incrementally.
Three NodePools is enough for most teams. system, web, batch. Resist the per-team-pool urge.
Spot + on-demand fallback in the same pool. Don't make separate pools by capacity type.
Tune consolidateAfter to your tolerance. Faster consolidation = more savings + more churn. 1 minute works for us.
PDBs and termination grace periods matter on spot. Otherwise spot interruptions become incidents.
Watch interruption rates. AWS publishes per-instance-family spot interruption stats. Some families are way more interruption-prone than others. We exclude the worst offenders from our pools (see the sketch after this list).
Karpenter is one operator; treat it like any other. Monitor it, alert on its health, upgrade it in the same cadence as other operators.
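The exclusion itself is one more NodePool requirement. A sketch; the family values are placeholders, not a recommendation, so fill them in from the Spot Instance Advisor data for your regions:

# Fragment of a NodePool's template.spec.requirements
- key: karpenter.k8s.aws/instance-family
  operator: NotIn
  values: ["<family-1>", "<family-2>"]   # placeholders for the interruption-prone families you exclude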
Karpenter is one of the highest-ROI Kubernetes operational moves we've made. The cost savings are real; the operational simplicity over fixed node groups is real; the team time it freed up was real. The patterns above aren't exotic — they're just what survived a few iterations.