HPA, VPA, and Cluster Autoscaler / Karpenter solve overlapping problems badly when you don't understand which one owns what. The mental model that keeps them from fighting.
Three different autoscalers exist for Kubernetes, and they don't naturally play well together. We learned this the hard way during a traffic spike where HPA was trying to scale up replicas, VPA was trying to resize the existing pods, Cluster Autoscaler was provisioning new nodes — and the workload landed in an unstable state for ~30 minutes while these three subsystems argued.
After that incident we built a clearer mental model for which scaler owns what. It has held up for the six months since, with no similar incidents.
Horizontal Pod Autoscaler (HPA): changes the number of pod replicas based on metrics. "Too much CPU? Add more pods. Too little? Remove some."
Vertical Pod Autoscaler (VPA): changes the resource requests/limits of existing pods. "Pods using more CPU than requested? Bump up the request value."
Cluster Autoscaler / Karpenter: changes the number of nodes in the cluster. "Pods can't schedule because no node has room? Add a node."
The three operate at different layers (replica count, pod resources, node count) but they affect each other's outcomes.
Pods under VPA can be in the middle of having their requests recalculated when HPA kicks in to add more replicas. The new replicas inherit stale requests; the cluster autoscaler provisions nodes for those stale requests; the resource picture is wrong at every layer.
The classic failure: VPA recommends "this pod really needs 4 CPU not 1." HPA, looking at CPU usage, sees pods at 90% of their request and adds replicas. Now you have 5 pods each requesting 1 CPU, but each really needs 4. They don't fit on existing nodes; cluster autoscaler provisions giant nodes. Cost balloons. Eventually everything stabilizes but the path was wasteful.
A workload is in one of three states, and you pick scalers accordingly:
State A: stateless, well-understood load profile. Use HPA. Set static resource requests; don't run VPA. The requests come from baseline measurement; HPA scales replicas with traffic.
State B: workload with unpredictable resource needs per pod (e.g., serving heterogeneous customer workloads). Run VPA in recommendation mode (it suggests but doesn't apply). Use the suggestions to update requests at deploy time. Run HPA on top of the right-sized requests.
State C: workload where you genuinely need vertical scaling (databases, ML inference where each pod handles huge requests). Run VPA in auto mode. Don't run HPA on it; vertical resizing is the whole autoscaling story for these workloads.
These three states cover most of our workloads. The trap is running both HPA and VPA-auto on the same deployment — they fight.
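As a minimal sketch of State B, a VPA object in recommendation mode might look like the following. The object name and target Deployment are placeholders, and the example assumes the VPA components are already installed in the cluster. The `updateMode: "Off"` setting is what makes VPA compute recommendations without evicting or mutating pods.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa              # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # placeholder target
  updatePolicy:
    updateMode: "Off"        # recommend only; never evict or resize pods
```

For State C, the same object with `updateMode: "Auto"` lets VPA apply its recommendations itself, in which case the Deployment should not also carry a CPU- or memory-based HPA.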
CPU is the default metric and the worst one. CPU usage isn't a good proxy for load on most modern services: they're often I/O bound, or they have idle threads, or the work is spiky in a way CPU averages hide.
We scale on application-specific metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"
"Scale to keep ~50 RPS per pod." Pulled from Prometheus via the metrics adapter. Maps directly to how we think about capacity.
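For intuition, the HPA's documented scaling rule is a ratio of the observed metric value to the target value, rounded up. The numbers below (4 replicas averaging 80 RPS each against the 50 RPS target) are made up for illustration:

```latex
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{80}{50} \right\rceil
  = \lceil 6.4 \rceil
  = 7
```

The result is then clamped to `minReplicas`/`maxReplicas` and rate-limited by any `behavior` policies on the HPA.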
For workloads where RPS isn't meaningful (background workers), we scale on queue depth:
metric:
  name: queue_depth
target:
  type: AverageValue
  averageValue: "10"
"Scale to keep average queue depth at 10 jobs per worker."
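The fragment above leaves out the surrounding metric source. One way to wire it up, assuming the queue depth is exposed through the external metrics API as a single queue-wide value, is an `External` metric with an `AverageValue` target, which divides the total by the current replica count before comparing it to the target. The HPA name, Deployment name, and replica bounds here are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker             # hypothetical Deployment
  minReplicas: 2             # illustrative bounds
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: queue_depth  # must match what the metrics adapter exposes
        target:
          type: AverageValue # total queue depth / replica count
          averageValue: "10"
```

If the adapter instead reports queue depth per pod, a `Pods` metric like the RPS example is the closer fit.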
The HPA scales up fast and scales down slowly by default. We pin the behaviour explicitly, with values slightly more conservative than the Kubernetes built-in defaults (which apply no stabilization window on scale-up and a 300-second window on scale-down). These settings are reasonable for most workloads:
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 100
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
Translation: scale up by at most 100% per minute, after a 60s stabilization window. Scale down by at most 50% per minute, after a 300s stabilization window. We've adjusted the scale-down window for one workload that was flapping; everything else uses these values.
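For a workload that flaps, lengthening the scale-down stabilization window is the usual first lever: the HPA then acts on the highest replica recommendation seen within the window, so brief dips in load don't remove pods. A sketch of that override follows; the 600-second value is illustrative only, not a recommendation.

```yaml
# Fragment of an HPA spec (sits under `spec:`).
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600   # illustrative; the API allows 0-3600
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
```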
We use VPA in three places, all in recommendation mode (not auto):
The recommendations show up in our dashboards. A human reviews and bumps the requests in the deployment. We don't let VPA do it automatically because in-place pod resize is still maturing in K8s; the disruption from VPA evicting pods to apply new requests is sometimes worse than the suboptimal sizing.
This is conservative. Some teams run VPA-auto and it works for them. We've found the predictability of explicit requests valuable.
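For reference, the recommendation a reviewer reads lives in the VPA object's status. The shape below follows the VPA API (`containerRecommendations` with `lowerBound`, `target`, and `upperBound`); the container name and every number are invented for illustration.

```yaml
status:
  recommendation:
    containerRecommendations:
      - containerName: api     # illustrative container name
        lowerBound:            # below this, VPA considers the pod under-provisioned
          cpu: 600m
          memory: 512Mi
        target:                # the figure to compare against current requests
          cpu: "1"
          memory: 768Mi
        upperBound:            # above this, VPA considers the pod over-provisioned
          cpu: "2"
          memory: 1Gi
```

`kubectl describe vpa <name>` prints the same fields.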
We use Karpenter. It's faster than the older Cluster Autoscaler and handles bin-packing more aggressively. The configuration:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      # karpenter.sh/v1 also requires a nodeClassRef here (pointing at an EC2NodeClass); not shown.
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [c7i, m7i, r7i]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 2000
    memory: 2000Gi
Karpenter picks instance types based on the aggregate requirements of pending pods. When nodes end up empty or underutilized, it consolidates the remaining pods onto fewer or cheaper nodes.
The interaction with HPA: when HPA scales pods up, Karpenter sees pending pods and provisions nodes within 60-90s. The chain is HPA → schedule attempts fail → Karpenter steps in. As long as request values are accurate, this works fluidly.
The incident I mentioned at the top: VPA-auto on a service running HPA. VPA decided pods needed 4× the CPU. It evicted pods one by one to apply new requests. With fewer pods serving the same traffic, per-pod load rose and HPA scaled up. The new pods had the old request value because VPA hadn't gotten to them yet. Cluster Autoscaler provisioned nodes for the old request size.
Net result: we briefly had 3× more pods than needed, on undersized nodes, while VPA was still mid-resize. CPU saturated. Latency spiked. Eventually it stabilized.
The fix was switching that service to "VPA-recommendation, manual application" mode. Recommendations now flow into our deploy pipeline instead of being applied autonomously.
A few queries we run weekly:
For a new workload, start with HPA on a meaningful metric (RPS, queue depth — not CPU). Set requests based on baseline measurement. Don't enable VPA initially.
Run VPA in recommendation mode periodically to catch drift. Don't put it on auto unless you've thought hard about why.
Use Karpenter (or Cluster Autoscaler if Karpenter doesn't fit). Most defaults are fine.
The biggest mistake is overengineering autoscaling early. A simple HPA on a meaningful metric handles 80% of cases. The complexity of VPA + Karpenter + custom metrics is justified for high-scale or unusual workloads, not the average service.