Bad resource requests waste money or trigger OOMs. The methodology we use to right-size requests based on actual usage, and the gotchas the autoscalers don't fix.
Resource requests in Kubernetes are one of those settings that look simple and quietly cost you a lot. Set them too low and pods get evicted or OOM-killed at random. Set them too high and you waste cluster capacity proportional to the over-allocation. Most teams pick numbers by hand the first time and never revisit them; the numbers drift far from reality, and the cluster bill grows.
This post is the methodology we run to keep resource requests close to actual usage. It's not glamorous — there's no auto-magic. But it consistently reclaims 25–40% of cluster capacity across the services we apply it to.
A short refresher because the distinction matters:
resources.requests is what the scheduler uses to decide if a pod fits on a node. It's a reservation: that much CPU/memory is set aside for this pod regardless of what it actually uses.
resources.limits is the cap: if the pod tries to use more memory than the limit, the kernel OOM-kills it; CPU above the limit gets throttled.
Requests drive cost (you pay for reserved capacity even if the pod is idle). Limits drive reliability (a runaway pod can't take down its neighbors).
The most common misconfiguration: request = limit. This is sometimes correct (latency-critical workloads that need their full reservation) and often just a defensive guess that wastes capacity.
Before changing anything, gather usage data per pod over a realistic window:
p95 vs max matters for the two resources differently:
We use Prometheus + kube-state-metrics + cAdvisor for these. The queries we run:
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{...}[5m])[14d:5m])
max_over_time(container_memory_working_set_bytes{...}[14d])
(The CPU metric is a counter, so the p95 is taken over its 5-minute rate. Translate these into the actual Prometheus queries you'd run; the shape is what matters.)
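To make the query shape concrete, here is a minimal Python sketch that fills in a label selector and returns both query strings. The function name, the example selector, and the 5m step are assumptions for illustration, not part of our tooling. Because container_cpu_usage_seconds_total is a counter, the sketch wraps it in rate() inside a subquery before taking the quantile.

```python
def build_usage_queries(selector: str, window: str = "14d", step: str = "5m") -> dict[str, str]:
    """Return PromQL strings for p95 CPU (cores) and peak memory (bytes).

    `selector` is a raw PromQL label selector body, e.g. 'namespace="shop"'.
    The window and step defaults are illustrative assumptions.
    """
    # CPU usage is a counter: take its per-second rate, then the p95 of that
    # rate over the whole window using a subquery.
    cpu = (
        "quantile_over_time(0.95, "
        f"rate(container_cpu_usage_seconds_total{{{selector}}}[{step}])[{window}:{step}])"
    )
    # Working-set memory is a gauge, so max_over_time applies directly.
    mem = f"max_over_time(container_memory_working_set_bytes{{{selector}}}[{window}])"
    return {"cpu_p95_cores": cpu, "memory_max_bytes": mem}


if __name__ == "__main__":
    # Hypothetical namespace and container labels.
    for name, query in build_usage_queries('namespace="shop",container="api"').items():
        print(f"{name}: {query}")
```

The returned strings can be pasted into the Prometheus UI or sent to whatever query client you already use.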
For each service, the loop:
Aggressive teams skip the buffer and run at exactly 1.0× p95. We've found the buffer pays for itself — without it, normal day-to-day variance triggers evictions and pages.
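The arithmetic behind a buffered request can be sketched in a few lines of Python. This assumes CPU is sized from p95 and memory from the observed maximum, matching the queries above. The 1.2 buffer is a placeholder assumption, not a recommendation from this post; substitute whatever multiplier your own data supports.

```python
import math


def recommend_requests(cpu_p95_cores: float, mem_max_bytes: float, buffer: float = 1.2) -> dict[str, str]:
    """Turn measured usage into request strings, padded by `buffer`.

    buffer=1.2 is a hypothetical default; buffer=1.0 means no headroom.
    """
    if buffer < 1.0:
        raise ValueError("buffer below 1.0 would request less than measured usage")
    # Round away floating-point noise before ceil so 120.0000001 stays 120.
    cpu_millicores = math.ceil(round(cpu_p95_cores * buffer * 1000, 6))
    mem_mib = math.ceil(round(mem_max_bytes * buffer / (1024 * 1024), 6))
    return {"cpu": f"{cpu_millicores}m", "memory": f"{mem_mib}Mi"}


if __name__ == "__main__":
    # Example inputs: p95 of 0.1 cores and a 200Mi memory peak.
    print(recommend_requests(0.1, 200 * 1024 * 1024))
```

Rounding up to whole millicores and whole MiB keeps the output directly usable in a manifest.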
The big shift: stop using the request you copied from another service's manifest. Use the data.
Patterns across our services:
Most services are over-requested 2–5×. A copy-paste default of requests: {cpu: 500m, memory: 512Mi} becomes the standard. Actual usage for many: 100m CPU, 200Mi memory. 5× and 2.5× over-provisioned respectively.
A few services are under-requested. Usually older ones that haven't been touched in a year while their workload grew. They run at ~200% of their request, are constantly throttled, and the team has gotten used to "this service is just slow."
Memory limits set without thinking. limits.memory: 2Gi when usage is 200Mi. Limit isn't the problem; nothing's hitting it. But the request being equal to the limit (2Gi) is.
Java/JVM workloads with wrong heap sizes. The JVM defaults to a maximum heap size based on a percentage of container memory, so the heap ceiling follows the limit rather than actual need. If your container's memory limit is 2Gi but actual usage is 500Mi, the JVM can hold on to far more heap than the workload needs "just in case." Setting -XX:MaxRAMPercentage=50 or -Xmx512m explicitly aligns heap with actual need.
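For the JVM case, the heap ceiling is simple arithmetic on the container limit. The Python sketch below assumes a container-aware JVM where -XX:MaxRAMPercentage caps the maximum heap at that share of the container memory limit (25 is the JVM default). It deliberately ignores non-heap memory such as metaspace, thread stacks, and direct buffers, so treat the result as a heap ceiling, not total process memory.

```python
def max_heap_mib(container_limit_mib: float, max_ram_percentage: float = 25.0) -> float:
    """Approximate maximum heap for a given container limit and MaxRAMPercentage.

    Simplified model: heap ceiling = limit * percentage / 100.
    """
    if not 0 < max_ram_percentage <= 100:
        raise ValueError("max_ram_percentage must be in (0, 100]")
    return container_limit_mib * max_ram_percentage / 100.0


if __name__ == "__main__":
    limit_mib = 2048  # the 2Gi limit from the example above
    for pct in (25.0, 50.0, 75.0):
        print(f"MaxRAMPercentage={pct}: heap ceiling {max_heap_mib(limit_mib, pct)} MiB")
```

Note the difference between the two flags: a percentage moves with the limit, so it stays sensible after you right-size the container, while -Xmx pins an absolute value that has to be updated by hand.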
A few that earn their place:
Vertical Pod Autoscaler (VPA) in recommendation mode. VPA can run in Off mode, where it generates recommendations but doesn't apply them automatically. The recommendations are the data you'd compute yourself, packaged in a CRD. We run VPA in Off mode on every namespace and use its recommendations as the input to the manual right-sizing process.
We don't run VPA in Auto mode — auto-applying its recommendations involves pod restarts, and the recommendations themselves can be unstable for bursty workloads. Recommendations + human review is the sweet spot.
Kubecost (or the open-source OpenCost). Surfaces cost per pod per workload. Makes the "this service is 5× over-provisioned" conversation concrete: "this service is costing $400/month; right-sized it would be $80."
Goldilocks. Sits on top of VPA and produces dashboards showing current vs recommended requests per workload. Good for periodic reviews; not strictly necessary if you already query VPA recommendations directly.
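If you query VPA recommendations directly, the per-container targets live under status.recommendation.containerRecommendations on the VerticalPodAutoscaler object. Here is a minimal Python sketch that pulls them out of the JSON form of that object. The sample object, its names, and its values are made up for illustration.

```python
import json

# Hypothetical VPA object, trimmed to the fields the sketch reads.
SAMPLE_VPA = json.loads("""
{
  "apiVersion": "autoscaling.k8s.io/v1",
  "kind": "VerticalPodAutoscaler",
  "metadata": {"name": "api-vpa", "namespace": "shop"},
  "status": {
    "recommendation": {
      "containerRecommendations": [
        {
          "containerName": "api",
          "lowerBound": {"cpu": "80m", "memory": "180Mi"},
          "target": {"cpu": "120m", "memory": "240Mi"},
          "upperBound": {"cpu": "400m", "memory": "600Mi"}
        }
      ]
    }
  }
}
""")


def vpa_targets(vpa_object: dict) -> dict[str, dict[str, str]]:
    """Map container name -> recommended target resources.

    Returns an empty dict when the VPA has not produced a recommendation yet.
    """
    status = vpa_object.get("status") or {}
    recommendation = status.get("recommendation") or {}
    recs = recommendation.get("containerRecommendations") or []
    return {r["containerName"]: r.get("target", {}) for r in recs}


if __name__ == "__main__":
    print(vpa_targets(SAMPLE_VPA))
```

In practice the input would come from something like `kubectl get vpa <name> -o json`. The lowerBound and upperBound fields are worth reading alongside the target, since a wide gap between them is a hint that the workload is bursty and the recommendation needs human review.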
A few cases where the standard advice ("requests below limits") doesn't apply:
Guaranteed QoS class: first to get CPU under contention.
For these, set request = limit explicitly with the value you actually need. Just don't do it because of a copy-paste; do it because of the workload's requirements.
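For reference, Kubernetes derives a pod's QoS class from its requests and limits: Guaranteed when every container has CPU and memory limits with requests equal to them, BestEffort when nothing is set, Burstable otherwise. The Python sketch below mirrors that rule on plain dicts. It is a simplification: it compares quantity strings literally (so "500m" and "0.5" would not match) and the input shape is an assumption, not the real pod spec.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults an omitted request to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    pinned = [{"requests": {"cpu": "500m", "memory": "512Mi"},
               "limits": {"cpu": "500m", "memory": "512Mi"}}]
    print(qos_class(pinned))
```

A check like this is handy in review: it makes it obvious when a manifest lands in Guaranteed by copy-paste rather than by intent.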
A few situations:
HPA-managed deployments where the bottleneck is replica count, not pod size. If your autoscaler is constantly adding pods, you don't have an over-request problem; you have a scaling problem. Fix that first.
Workloads with bimodal usage. A service that's idle most of the day and bursts at peak. Right-sizing to "average" leaves no headroom for peaks; right-sizing to peaks wastes capacity off-peak. Solutions: HPA for replica count, or accepting some waste for predictability.
Cold-start sensitive workloads. Some apps need warm capacity (e.g. JVM apps with long startup). Cutting requests aggressively can cause unnecessary scaling churn. Be conservative on these.
Setting requests based on what the app says it needs. Application docs ("requires 2GB memory") are often very conservative. Use actual measurements.
Right-sizing once and never revisiting. Workloads change. Revisit every quarter for any service whose usage has materially shifted.
Trying to right-size before you have monitoring. If you don't have per-pod usage data, you're guessing. Wire up monitoring first.
Aggressive cuts during incident response. Right-sizing is a quiet-time activity. Don't change requests during an outage.
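For the quarterly revisit mentioned above, one way to decide which services have "materially shifted" is to compare each service's measured usage against the previous review. The Python sketch below does that. The 20% threshold, the service names, and the input shape are all assumptions for illustration; pick a threshold that matches your own variance.

```python
def services_to_review(
    previous: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.2,
) -> list[str]:
    """Return services whose usage moved more than `threshold` since last review.

    Values are usage in any consistent unit (e.g. p95 CPU cores).
    threshold=0.2 (20% relative change) is a hypothetical default.
    """
    flagged = []
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue  # no comparable data; these need a manual look
        if abs(new - old) / old > threshold:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    last_quarter = {"api": 0.10, "worker": 0.50, "cron": 0.05}
    this_quarter = {"api": 0.21, "worker": 0.52, "cron": 0.02}
    print(services_to_review(last_quarter, this_quarter))
```

The check flags movement in both directions on purpose: growth points at under-requested services, shrinkage at capacity you can reclaim.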
Right-sizing isn't exotic and it isn't subtle. It's just doing the measurement-and-adjustment work that most teams skip. The 25–40% cluster-capacity savings shows up reliably across services we've applied it to. The cost is a few hours per service, every quarter. Best ROI per hour of engineering I know.