HPA, VPA, and Cluster Autoscaler / Karpenter solve overlapping problems, and they solve them badly when you don't understand which one owns what. Here is the mental model that keeps them from fighting.

Kubernetes Autoscaling: HPA, VPA, and Cluster Autoscaler

Three different autoscalers exist for Kubernetes, and they don't naturally play well together. We learned this the hard way during a traffic spike where HPA was trying to scale up replicas, VPA was trying to resize the existing pods, Cluster Autoscaler was provisioning new nodes — and the workload landed in an unstable state for ~30 minutes while these three subsystems argued.

After that incident we built a clearer mental model for which scaler owns what. The model below has held up for the six months since, with no similar incidents.

What each scaler actually does #

Horizontal Pod Autoscaler (HPA): changes the number of pod replicas based on metrics. "Too much CPU? Add more pods. Too little? Remove some."

Vertical Pod Autoscaler (VPA): changes the resource requests/limits of existing pods. "Pods using more CPU than requested? Bump up the request value."

Cluster Autoscaler / Karpenter: changes the number of nodes in the cluster. "Pods can't schedule because no node has room? Add a node."

The three operate at different layers (replica count, pod resources, node count) but they affect each other's outcomes.

Where they conflict #

Pods managed by VPA can be in the middle of having their requests recalculated when HPA kicks in to add replicas. The new replicas inherit stale requests, the cluster autoscaler provisions nodes sized for those stale requests, and the resource picture is wrong at every layer.

The classic failure: VPA recommends "this pod really needs 4 CPU, not 1." HPA, looking at CPU usage, sees pods at 90% of their request and adds replicas. Now you have 5 pods each requesting 1 CPU, but each really needs 4. They don't fit on existing nodes, so cluster autoscaler provisions giant nodes and cost balloons. Eventually everything stabilizes, but the path there is wasteful.
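The replica count in that failure can be worked through with the HPA's documented scaling formula. The starting replica count (3) and the 60% utilization target are assumed numbers chosen for illustration, not values from our cluster:

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 3 \times \frac{90\%}{60\%} \right\rceil
  = \lceil 4.5 \rceil
  = 5
\]
```

The formula only compares usage against the existing request. It has no notion that the request itself is wrong, so it produces 5 pods that each still request 1 CPU.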

The mental model #

A workload is in one of three states, and you pick scalers accordingly:

State A: stateless, well-understood load profile. Use HPA. Set static resource requests; don't run VPA. The requests come from baseline measurement; HPA scales replicas with traffic.

State B: workload with unpredictable resource needs per pod (e.g., serving heterogeneous customer workloads). Run VPA in recommendation mode (updateMode "Off": it suggests but doesn't apply). Use the suggestions to update requests at deploy time. Run HPA on top of the right-sized requests.

State C: workload where you genuinely need vertical scaling (databases, ML inference where each pod handles huge requests). Run VPA in auto mode. Don't run HPA on it; vertical resizing is the scaling mechanism for these workloads.

These three states cover most of our workloads. The trap is running both HPA and VPA-auto on the same deployment — they fight.
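On the VPA side, States B and C differ only in `updateMode`. A minimal sketch, assuming the VPA components are installed in the cluster; the `api` Deployment and the `api-vpa` object name are placeholders for illustration:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api              # placeholder workload
  updatePolicy:
    # State B: "Off" computes recommendations but never evicts or mutates pods.
    # State C would set "Auto" here instead, with no HPA targeting the same Deployment.
    updateMode: "Off"
```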

HPA: what to scale on #

CPU is the metric most people reach for first, and it is usually the worst choice. CPU usage isn't a good proxy for load on most modern services: they're often I/O bound, or they have idle threads, or the work is spiky in a way CPU averages hide.

We scale on application-specific metrics:

yaml.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"

"Scale to keep ~50 RPS per pod." Pulled from Prometheus via the metrics adapter. Maps directly to how we think about capacity.

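For the HPA to see `http_requests_per_second`, something has to publish it through the custom metrics API. If the adapter is prometheus-adapter (an assumption; the text above only says "the metrics adapter"), a rule along these lines would derive the per-pod rate from a counter. The counter name `http_requests_total` and the 2-minute rate window are also assumed:

```yaml
rules:
  # Assumption: the application exports a counter named http_requests_total
  # with namespace and pod labels attached by Prometheus.
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      # Expose http_requests_total as http_requests_per_second.
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # Per-pod request rate over an assumed 2-minute window.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```
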
For workloads where RPS isn't meaningful (background workers), we scale on queue depth:

yaml.yaml

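metrics:
  # Assumption: queue depth is published as an External metric (one value for
  # the whole queue). With AverageValue, the HPA divides it by the replica count.
  - type: External
    external:
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: "10"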

"Scale to keep average queue depth at 10 jobs per worker."

HPA: stabilization windows #

The HPA's built-in behaviour is to scale up fast and scale down slowly: no stabilization window on the way up, a 300-second window on the way down. The block below is our standard configuration. It is slightly more conservative than the built-in defaults on scale-up, and it is reasonable for most workloads:

yaml.yaml

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 100
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60

Translation: scale up by up to 100% per minute (after 60s stabilization). Scale down by up to 50% per minute (after 300s stabilization). We've adjusted the scale-down window for one workload that was flapping; everything else uses this configuration unchanged.
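For a flapping workload, the adjustment is typically a longer scale-down window plus a cap on how many pods can be removed at once. The values below are hypothetical, shown only to illustrate the shape of such an override, and are not the ones we run:

```yaml
behavior:
  scaleDown:
    # Longer window: act on the highest replica recommendation seen in the
    # last 10 minutes, so short dips in load don't trigger a scale-down.
    stabilizationWindowSeconds: 600
    policies:
      # Remove at most 2 pods per minute, regardless of fleet size.
      - type: Pods
        value: 2
        periodSeconds: 60
```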

VPA: when it actually pays off #

We use VPA in three places, all in recommendation mode (not auto):

- New services where we don't know what to set
- Services where load profile changed (after a refactor or upstream change)
- Periodically (quarterly) to catch slow drift in resource needs

The recommendations show up in our dashboards. A human reviews and bumps the requests in the deployment. We don't let VPA do it automatically because in-place pod resize is still maturing in K8s; the disruption from VPA evicting pods to apply new requests is sometimes worse than the suboptimal sizing.
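The recommendation a reviewer reads lives in the VPA object's status. The excerpt below shows the shape of that field; the container name and every number in it are illustrative, not taken from our cluster:

```yaml
status:
  recommendation:
    containerRecommendations:
      - containerName: api      # placeholder container name
        lowerBound:             # below this, the container is likely under-provisioned
          cpu: 350m
          memory: 400Mi
        target:                 # the recommended request; this is what gets copied into the Deployment
          cpu: 600m
          memory: 512Mi
        upperBound:             # above this, the request is likely wasteful
          cpu: "2"
          memory: 1Gi
```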

This is conservative. Some teams run VPA-auto and it works for them. We've found the predictability of explicit requests valuable.

Cluster Autoscaler / Karpenter #

We use Karpenter. It's faster than the older Cluster Autoscaler and handles bin-packing more aggressively. The configuration:

yaml.yaml

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
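    spec:
      # Assumption: an EC2NodeClass named "default" exists; v1 NodePools require a nodeClassRef.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements: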
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [c7i, m7i, r7i]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 2000
    memory: 2000Gi

Karpenter picks instance types based on the aggregate requests of pending pods. As pods are removed or rescheduled, it consolidates the remaining ones onto fewer or cheaper nodes.

The interaction with HPA: when HPA scales pods up, Karpenter sees pending pods and provisions nodes within 60-90s. The chain is HPA → schedule attempts fail → Karpenter steps in. As long as request values are accurate, this works fluidly.
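Karpenter sizes nodes from the pods' requests, not from their actual usage, which is why accurate requests matter in that chain. A sketch of the relevant Deployment fields follows; the image and all resource numbers are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  # spec.replicas is intentionally omitted: the HPA owns the replica count.
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # placeholder image
          resources:
            requests:          # what the scheduler and Karpenter plan capacity around
              cpu: "1"
              memory: 1Gi
            limits:
              memory: 1Gi
```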

What broke during our bad day #

The incident I mentioned at the top: VPA-auto on a service running HPA. VPA decided pods needed 4× the CPU. It evicted pods one by one to apply new requests. HPA, seeing reduced pod count, scaled up. The new pods had the OLD request value because VPA hadn't gotten to them yet. Cluster Autoscaler provisioned nodes for the old request size.

Net result: we briefly had 3× more pods than needed, on undersized nodes, while VPA was still mid-resize. CPU saturated. Latency spiked. Eventually it stabilized.

The fix was switching that service to "VPA-recommendation, manual application" mode. Recommendations now flow into our deploy pipeline instead of being applied autonomously.

What we don't do #

- Run both HPA and VPA-auto on the same deployment. They don't compose.
- Scale on memory. Memory usage is sticky; pods that grew memory don't shed it as load drops, so memory-based HPA tends to scale down too little. We use CPU or app-level metrics, not memory, except in specific cases.
- Use predictive scaling (e.g., AWS Predictive Scaling for ASGs). We tried it; the predictions weren't accurate enough to be worth the complexity. Reactive scaling with good metrics is hard to beat.

Useful operational checks #

A few queries we run weekly:

- Pods continuously near their resource limits: indicates the requests should be bumped per the VPA recommendation, or the pod has a bug.
- HPA at min replicas for > 24h: maybe min is too high; you're paying for unused capacity.
- HPA at max replicas frequently: max is too low, or you have a quality problem masquerading as load.
- Nodes consistently at < 30% utilization: Karpenter consolidation may be limited by pod anti-affinity rules; investigate.
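The two HPA checks can be expressed as Prometheus rules. This is a sketch that assumes kube-state-metrics is installed and scraped; the group name, alert names, the 30-minute threshold, and the severity label are made up for illustration:

```yaml
groups:
  - name: autoscaling-checks          # hypothetical group name
    rules:
      # HPA has been sitting at its ceiling: max may be too low.
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= on (namespace, horizontalpodautoscaler)
          kube_horizontalpodautoscaler_spec_max_replicas
        for: 30m                      # assumed threshold
        labels:
          severity: info
      # HPA has not left its floor for a day: min may be too high.
      - alert: HPAPinnedAtMinReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            <= on (namespace, horizontalpodautoscaler)
          kube_horizontalpodautoscaler_spec_min_replicas
        for: 24h
        labels:
          severity: info
```

The same expressions work as ad hoc queries if a weekly review is preferred over alerting.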

What I'd tell a team starting #

For a new workload, start with HPA on a meaningful metric (RPS, queue depth — not CPU). Set requests based on baseline measurement. Don't enable VPA initially.

Run VPA in recommendation mode periodically to catch drift. Don't put it on auto unless you've thought hard about why.

Use Karpenter (or Cluster Autoscaler if Karpenter doesn't fit). Most defaults are fine.

The biggest mistake is overengineering autoscaling early. A simple HPA on a meaningful metric handles 80% of cases. The complexity of VPA + Karpenter + custom metrics is justified for high-scale or unusual workloads, not the average service.


About Kiril Urbonas