Bad resource requests waste money or trigger OOMs. The methodology we use to right-size requests based on actual usage, and the gotchas the autoscalers don't fix.

Kubernetes Resource Requests — Right-Sizing Without Guessing

Resource requests in Kubernetes are one of those settings that look simple and quietly cost you a lot. Set them too low and pods get evicted or OOM-killed at random. Set them too high and you waste cluster capacity proportional to the over-allocation. Most teams pick numbers by hand the first time and never revisit them; the numbers drift far from reality, and the cluster bill grows.

This post is the methodology we run to keep resource requests close to actual usage. It's not glamorous — there's no auto-magic. But it consistently reclaims 25–40% of cluster capacity across the services we apply it to.

What requests vs limits actually do

A short refresher because the distinction matters:

resources.requests is what the scheduler uses to decide if a pod fits on a node. It's a reservation — that much CPU/memory is set aside for this pod regardless of what it actually uses.
resources.limits is the cap — if the pod tries to use more memory than the limit, the kernel OOM-kills it; CPU above the limit gets throttled.

Requests drive cost (you pay for reserved capacity even if the pod is idle). Limits drive reliability (a runaway pod can't take down its neighbors).

The most common misconfiguration: request = limit. This is sometimes correct (latency-critical workloads that need their full reservation) and often just a defensive guess that wastes capacity.

The data you actually need

Before changing anything, gather usage data per pod over a realistic window:

CPU: the p95 and p99 of actual CPU usage over the last 14 days, per pod.
Memory: the maximum (not p95!) of actual memory usage over the last 14 days. Memory limits are hard caps; once you hit them you die.

p95 vs max matters for the two resources differently:

CPU is compressible. Going over your request briefly is fine — you compete with other pods for spare cycles, but you don't get killed. So request can be near p95 or even median; spare CPU on the node absorbs spikes.
Memory is not compressible. Going over your limit means you die. So limits must be above your worst-case memory; requests can be below that if you're OK with eviction risk.
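
To make the p95-versus-max distinction concrete, here is a minimal Python sketch over made-up samples for one pod. The sample values and the nearest-rank `percentile` helper are illustrative assumptions, not output from our clusters; in practice the monitoring stack does this aggregation for you.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical samples for one pod: CPU in millicores, memory in MiB.
cpu_millicores = [90, 95, 100, 100, 105, 110, 95, 100, 105, 110,
                  115, 120, 100, 95, 90, 105, 110, 130, 125, 600]
memory_mib = [180, 185, 190, 190, 195, 200, 200, 205, 210, 410]

cpu_p95 = percentile(cpu_millicores, 95)  # sizing input for CPU; ignores the rare spike
memory_max = max(memory_mib)              # sizing input for memory; the one peak counts
```

With these samples the single CPU spike to 600m does not move the p95, while the single memory peak does set the memory figure. That asymmetry is the point: a missed CPU spike costs some latency, a missed memory peak costs the pod.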

We use Prometheus + kube-state-metrics + cAdvisor for these. The queries we run:

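quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{...}[5m])[14d:5m])

max_over_time(container_memory_working_set_bytes{...}[14d])

(The label selectors are elided; the shape is what matters. container_cpu_usage_seconds_total is a counter, so the CPU query takes rate() over it first and applies the quantile through a subquery.)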

The methodology, per service

For each service, the loop:

Pull 14 days of usage data. CPU p95 + memory max per pod.
Set request to ~1.2× p95 CPU usage and ~1.15× max memory. The buffer is for spikes and growth.
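Set the memory limit to ~1.5× the memory request (cap protection). Leave the CPU limit unset if the workload can share spare node CPU with its neighbors, or set it to ~3× request if you need a hard cap, accepting that the pod is throttled above it.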
Deploy, observe for a week. Watch for evictions, OOM-kills, throttling rate.
Adjust if needed. Usually one iteration is enough.
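
The arithmetic in steps 2 and 3 can be sketched as a small Python helper. The function and field names are hypothetical, the multipliers are the ones listed above, and rounding to whole millicores and MiB is our simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sizing:
    cpu_request_m: int            # millicores
    cpu_limit_m: Optional[int]    # None means no CPU limit is set
    memory_request_mib: int
    memory_limit_mib: int


def recommend(
    cpu_p95_m: float,
    memory_max_mib: float,
    cpu_limit_factor: Optional[float] = None,
) -> Sizing:
    """Apply the buffers from the loop above to measured usage."""
    cpu_request = round(cpu_p95_m * 1.2)           # ~1.2x p95 CPU
    memory_request = round(memory_max_mib * 1.15)  # ~1.15x max memory
    memory_limit = round(memory_request * 1.5)     # ~1.5x the memory request
    cpu_limit = (
        None if cpu_limit_factor is None else round(cpu_request * cpu_limit_factor)
    )
    return Sizing(cpu_request, cpu_limit, memory_request, memory_limit)


# Hypothetical measurements: 130m CPU at p95, 400Mi memory at peak.
no_cpu_limit = recommend(130, 400)
capped = recommend(130, 400, cpu_limit_factor=3.0)
```

For these inputs the helper suggests a 156m CPU request, a 460Mi memory request and a 690Mi memory limit; the capped variant adds a 468m CPU limit. Treat the output as a starting point for the observe-and-adjust steps, not a final answer.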

Aggressive teams skip the buffer and run at exactly 1.0× p95. We've found the buffer pays for itself — without it, normal day-to-day variance triggers evictions and pages.

The big shift: stop using the request you copied from another service's manifest. Use the data.

What we keep finding

Patterns across our services:

Most services are over-requested 2–5×. A copy-paste default of requests: {cpu: 500m, memory: 512Mi} becomes the standard. Actual usage for many: 100m CPU, 200Mi memory. 5× and 2.5× over-provisioned respectively.
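
The over-request factors quoted above are a simple ratio of request to measured usage. A minimal Python sketch follows; the helper names are ours, and the parsers handle only the `m`, `Ki`, `Mi` and `Gi` suffixes, which is a deliberate simplification of the full Kubernetes quantity syntax.

```python
_MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity: str) -> float:
    """Parse a CPU quantity such as '500m' or '0.5' into millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000


def memory_bytes(quantity: str) -> float:
    """Parse a memory quantity such as '512Mi' into bytes (binary suffixes only)."""
    for suffix, multiplier in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * multiplier
    return float(quantity)  # no suffix: plain bytes


def over_request_factor(requested: float, used: float) -> float:
    """How many times larger the request is than measured usage."""
    if used <= 0:
        raise ValueError("measured usage must be positive")
    return requested / used


# The copy-paste default from the text versus the typical measured usage.
cpu_factor = over_request_factor(cpu_millicores("500m"), cpu_millicores("100m"))
memory_factor = over_request_factor(memory_bytes("512Mi"), memory_bytes("200Mi"))
```

Sorting workloads by this factor, weighted by replica count, is a quick way to find where the reserved-but-idle capacity sits.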

A few services are under-requested. Usually older ones that haven't been touched in a year while their workload grew. They run at ~200% of their request, are constantly throttled, and the team has gotten used to "this service is just slow."

Memory limits set without thinking. limits.memory: 2Gi when usage is 200Mi. Limit isn't the problem; nothing's hitting it. But the request being equal to the limit (2Gi) is.

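Java/JVM workloads with wrong heap sizes. A container-aware JVM sizes its maximum heap as a percentage of the container's memory limit (25% by default; many images and inherited flags raise it). With a 2Gi limit and MaxRAMPercentage=75, the heap is allowed to grow to 1.5Gi even if the service's live data fits in 500Mi, so measured memory reflects the heap cap rather than actual need. Setting -XX:MaxRAMPercentage or -Xmx explicitly (for example -Xmx512m), based on measured heap usage, aligns heap with actual need; then size the container request and limit around that.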
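
The relationship between the container limit and the heap cap is plain arithmetic. The sketch below assumes a container-aware JVM whose default MaxRAMPercentage is 25 and ignores the separate small-memory rules and non-heap memory (metaspace, thread stacks, direct buffers), so read it as an approximation. The function name is ours.

```python
def max_heap_mib(container_limit_mib: float, max_ram_percentage: float = 25.0) -> float:
    """Approximate maximum heap for a given container memory limit and MaxRAMPercentage."""
    if not 0 < max_ram_percentage <= 100:
        raise ValueError("max_ram_percentage must be in (0, 100]")
    return container_limit_mib * max_ram_percentage / 100.0


limit_mib = 2048  # a 2Gi container limit

default_heap = max_heap_mib(limit_mib)        # default percentage
half_heap = max_heap_mib(limit_mib, 50.0)     # -XX:MaxRAMPercentage=50
generous_heap = max_heap_mib(limit_mib, 75.0) # -XX:MaxRAMPercentage=75
```

For a 2Gi limit this gives 512Mi, 1Gi and 1.5Gi respectively. Whatever heap cap you pick, leave room between it and the container limit for non-heap memory, or the pod gets OOM-killed with a heap that looks healthy.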

Tools that help

A few that earn their place:

Vertical Pod Autoscaler (VPA) in recommendation mode. VPA can run in Off mode, where it generates recommendations but doesn't apply them automatically. The recommendations are the data you'd compute yourself, packaged in a CRD. We run VPA in Off mode on every namespace and use its recommendations as the input to the manual right-sizing process.

We don't run VPA in Auto mode — auto-applying its recommendations involves pod restarts, and the recommendations themselves can be unstable for bursty workloads. Recommendations + human review is the sweet spot.
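
A recommendation-only VPA object is small. The Python sketch below builds one as a dictionary and prints it as JSON, which kubectl accepts in place of YAML. The workload name, namespace and the `-vpa` naming suffix are hypothetical; the `autoscaling.k8s.io/v1` API group and the `updateMode: "Off"` field are what the VPA CRD uses in the versions we have run, so check them against the version installed in your cluster.

```python
import json


def vpa_recommendation_only(name: str, namespace: str, kind: str = "Deployment") -> dict:
    """Build a VerticalPodAutoscaler that only produces recommendations."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{name}-vpa", "namespace": namespace},
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": kind, "name": name},
            # "Off": compute recommendations, never evict or mutate pods.
            "updatePolicy": {"updateMode": "Off"},
        },
    }


# Hypothetical workload "checkout" in namespace "shop".
manifest = vpa_recommendation_only("checkout", "shop")
print(json.dumps(manifest, indent=2))
```

Once the object has collected some history, its recommendations appear in the object's status and can be compared with the numbers from your own queries.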

Kubecost (or the open-source OpenCost). Surfaces cost per pod per workload. Makes the "this service is 4× over-provisioned" conversation concrete: "this service is costing $400/month; right-sized it would be $100."

Goldilocks. Sits on top of VPA and produces dashboards showing current vs recommended requests per workload. Good for periodic reviews; not strictly necessary if you already query VPA recommendations directly.

When request = limit is correct

A few cases where the standard advice ("requests below limits") doesn't apply:

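Latency-sensitive workloads (low-latency APIs, real-time data processing). You want predictable performance, which means guaranteed CPU. Setting request = limit for both CPU and memory on every container gives the pod the Guaranteed QoS class: it is the last to be evicted under node pressure, and with the kubelet's static CPU manager policy and whole-CPU requests it can be pinned to exclusive cores.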
Single-pod-per-node high-traffic services where the node IS the pod. The "right-size" question doesn't really apply.
Workloads sensitive to noisy neighbors. When CPU spikes are unacceptable, guaranteed QoS is the only safe path.

For these, set request = limit explicitly with the value you actually need. Just don't do it because of a copy-paste; do it because of the workload's requirements.
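
Which QoS class a pod lands in follows mechanically from its requests and limits. Here is a minimal Python sketch of the rule as we understand it from the Kubernetes documentation; the function name and input shape are ours, and it compares quantities as strings, so "1" and "1000m" would wrongly count as different.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable or BestEffort from its containers."""
    has_any = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            has_any = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pinned = {"requests": {"cpu": "2", "memory": "4Gi"}, "limits": {"cpu": "2", "memory": "4Gi"}}
typical = {"requests": {"cpu": "156m", "memory": "460Mi"}, "limits": {"memory": "690Mi"}}

pinned_class = qos_class([pinned])
typical_class = qos_class([typical])
```

The practical consequence: one sidecar without matching requests and limits is enough to drop the whole pod out of Guaranteed, so check every container, not just the main one.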

Where right-sizing doesn't help

A few situations:

HPA-managed deployments where the bottleneck is replica count, not pod size. If your autoscaler is constantly adding pods, you don't have an over-request problem; you have a scaling problem. Fix that first.

Workloads with bimodal usage. A service that's idle most of the day and bursts at peak. Right-sizing to "average" leaves no headroom for peaks; right-sizing to peaks wastes capacity off-peak. Solutions: HPA for replica count, or accepting some waste for predictability.

Cold-start sensitive workloads. Some apps need warm capacity (e.g. JVM apps with long startup). Cutting requests aggressively can cause unnecessary scaling churn. Be conservative with these.

What we measure after right-sizing

Cluster CPU & memory allocation vs usage. The gap shrinks. Typical post-right-sizing: 70–85% of allocated capacity is actually used. Pre-right-sizing was often 30–45%.
Pod evictions due to resource pressure. Should stay near zero. If they spike after right-sizing, we over-trimmed.
OOM kill count per workload. Same — should be zero. If a pod is getting OOM-killed, raise its memory limit (and probably its request).
CPU throttling rate. Per-pod metric showing how often the pod was throttled. Some throttling is fine for non-latency-sensitive workloads; high sustained throttling means the CPU limit, and usually the request it was derived from, is too low.
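
Two of these measurements reduce to simple ratios. The Python sketch below assumes you already have the raw numbers: summed usage and summed requests for the cluster, and the increase over some window of the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total for a pod. The function names and the example figures are hypothetical.

```python
def utilization(used: float, requested: float) -> float:
    """Fraction of requested (allocated) capacity that is actually used."""
    if requested <= 0:
        raise ValueError("requested capacity must be positive")
    return used / requested


def throttle_ratio(throttled_periods: float, total_periods: float) -> float:
    """Share of CFS scheduling periods in which the container was throttled."""
    if total_periods <= 0:
        return 0.0  # no CFS periods recorded, e.g. no CPU limit set
    return throttled_periods / total_periods


# Hypothetical cluster: same 42 cores of usage, before and after trimming requests.
before = utilization(used=42, requested=120)
after = utilization(used=42, requested=56)

# Hypothetical pod: 1,200 throttled periods out of 6,000 in the window.
throttled = throttle_ratio(1200, 6000)
```

In this example utilization of allocated CPU moves from 35% to 75% with no change in real usage, and the pod is throttled in 20% of its periods, which is worth a look if it serves latency-sensitive traffic.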

Common mistakes

Setting requests based on what the app says it needs. Application docs ("requires 2GB memory") are often very conservative. Use actual measurements.

Right-sizing once and never revisiting. Workloads change. Revisit every quarter for any service whose usage has materially shifted.

Trying to right-size before you have monitoring. If you don't have per-pod usage data, you're guessing. Wire up monitoring first.

Aggressive cuts during incident response. Right-sizing is a quiet-time activity. Don't change requests during an outage.

What I'd tell a team starting

Run VPA in Off mode on everything. Free; no behavior change; gives you data.
Pick the 5 most expensive workloads first. That's where the dollars are.
One iteration of measurement, change, observe per quarter. Don't churn weekly.
Stop using copy-paste defaults for new services. Set the initial request based on an actual benchmark or a similar service's measured usage.

What to read next

Karpenter — node provisioning patterns at scale — what right-sized pods give the node-autoscaler a chance to do well
Kubernetes 101 — Pods, Deployments, and Services explained — the basics this builds on
Container resource limits — what they actually do — limits-side details
AWS cost optimization strategies — broader cost work, of which this is one piece

Right-sizing isn't exotic and it isn't subtle. It's just doing the measurement-and-adjustment work that most teams skip. The 25–40% cluster-capacity savings show up reliably across the services we've applied it to. The cost is a few hours per service, every quarter. Best ROI per hour of engineering I know.

About Kiril Urbonas