cpu.shares vs cpu.cfs_quota_us vs memory.max — the cgroup mechanics behind Kubernetes resource limits, and the surprises that explain the weird symptoms you've seen.
A Kubernetes pod spec has resources.requests.cpu: 500m and resources.limits.cpu: 1. What does that mean to the Linux kernel? The mental model most teams use is "the container gets between 500m and 1000m of CPU." That's roughly right and substantially wrong. Knowing what's actually happening at the cgroup level explains a bunch of weird symptoms — random throttling on a barely-loaded service, OOM-kills at memory levels well below the limit, "CPU at 50% but everything's slow."
This post covers what the runtime actually does with your resource limits.
When you set resources.requests.cpu: 500m, the container runtime translates that to a cgroup cpu.shares value of 512 (1024 = 1 vCPU; 500m = 0.5 vCPU = 512 shares). cpu.shares is a weight: it controls how much CPU time a cgroup gets relative to other cgroups when there's contention. (On cgroup v2 the same request is written as a cpu.weight value; the relative-weight behavior is the same.)
Read carefully: it only matters when there's contention. If your node has spare CPU, your container gets all the CPU it wants regardless of its request. The request is a guarantee under contention, not a cap.
The scheduler uses requests to place the pod (the sum of requests on a node must be ≤ the node's allocatable CPU), but at runtime, the kernel only enforces shares relatively.
Practical consequence: a node packed at 90% of CPU request capacity can have one cgroup getting much more than its "request" share if its neighbors are idle. This is fine; it's the design.
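To make the weight behavior concrete, here is a minimal sketch of the request-to-shares arithmetic and of how CPU would be divided among cgroups that are busy at the same moment. The pod names and the 2-CPU node are made up, and the split assumes every busy cgroup can use whatever it is offered, which real workloads often can't.

```python
from __future__ import annotations


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cgroup v1 cpu.shares weight."""
    min_shares = 2  # the kernel does not accept a weight below 2
    return max(min_shares, milli_cpu * 1024 // 1000)


def split_cpu(shares: dict[str, int], busy: set[str], node_cpus: float) -> dict[str, float]:
    """Divide node CPU among the busy cgroups in proportion to their shares.

    Simplification: idle cgroups get nothing and their weight is ignored,
    and every busy cgroup is assumed able to use all the CPU it is offered.
    """
    total = sum(shares[name] for name in busy)
    return {
        name: (node_cpus * shares[name] / total if name in busy else 0.0)
        for name in shares
    }


# Hypothetical pods on a hypothetical 2-CPU node.
shares = {"api": milli_cpu_to_shares(500), "batch": milli_cpu_to_shares(1500)}
print(shares)
print(split_cpu(shares, busy={"api", "batch"}, node_cpus=2.0))  # both contending
print(split_cpu(shares, busy={"api"}, node_cpus=2.0))           # neighbor idle
```

With both pods busy, the split matches the requests. With the neighbor idle, the 500m pod is offered the whole node.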
CPU limits work differently. They use CFS bandwidth control, not shares. The kernel allocates the cgroup a quota of CPU-microseconds per period:
cpu.cfs_period_us = 100000 (100ms by default)
cpu.cfs_quota_us = limit_in_millicpu × 100 (so 500m → 50000 = 50% of a period)
With that 500m limit, the cgroup can use at most 50ms of CPU time in each 100ms window. Once it hits that quota, it's throttled until the next period starts.
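The limit-to-quota arithmetic is simple enough to sketch. The function name is ours, not a kubelet API, and the only assumption is the default 100ms period.

```python
DEFAULT_PERIOD_US = 100_000  # 100ms, the default CFS period


def milli_cpu_to_quota_us(milli_cpu: int, period_us: int = DEFAULT_PERIOD_US) -> int:
    """CPU limit in millicores -> cpu.cfs_quota_us for the given period."""
    return milli_cpu * period_us // 1000


for limit in (100, 500, 1000, 2000):
    quota = milli_cpu_to_quota_us(limit)
    print(f"{limit}m -> {quota}us of CPU time per {DEFAULT_PERIOD_US}us period")
```

Note that a limit above 1 CPU yields a quota larger than the period. That is expected: the quota is CPU time summed across all cores.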
This has a subtle and important consequence: a cgroup with a 1-CPU limit can still get throttled even when the node is idle. The limit is enforced per period, not averaged over a second. A multi-threaded burst that consumes 100ms of CPU time in less than 100ms of wall-clock time exhausts the quota and stalls the cgroup until the next period starts.
This is why you can see "CPU at 50% but everything's slow" — the average CPU usage is 50%, but the instantaneous usage hit the quota during specific 100ms windows, causing throttling.
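The kernel counts throttled periods in the cgroup's cpu.stat file, so this is checkable rather than guessable. Below is a small sketch that computes the fraction of periods in which the cgroup was throttled. The sample text and its numbers are made up; on a real node you would read the file from the container's cgroup directory, whose path depends on your cgroup version and runtime.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Made-up sample in the cgroup v1 layout (v2 reports throttled_usec instead
# of throttled_time, but nr_periods and nr_throttled are the same).
SAMPLE_CPU_STAT = """nr_periods 1200
nr_throttled 300
throttled_time 15000000000
"""

print(f"throttled in {throttle_ratio(SAMPLE_CPU_STAT):.0%} of periods")
```

A service can show modest average CPU and still have a high ratio here, which is exactly the "50% but slow" symptom.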
Real example: a Java service with a CPU limit of 2, which is a quota of 200ms per period. The JVM runs many GC threads concurrently. During GC, every thread is trying to run at once, so total demand exceeds 2 CPUs for the duration of the GC. The cgroup burns its quota in under 50ms and sits throttled for the rest of the period. GC takes 3× longer than it should because it keeps getting paused.
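A toy model shows where a slowdown like that comes from. The thread count (8) and the amount of GC work (600ms of CPU time) are illustrative numbers, not measurements, and the model assumes the node has enough idle cores to run every thread at once.

```python
def wall_clock_ms(cpu_work_ms: float, threads: int, limit_cpus: float,
                  period_ms: float = 100.0) -> float:
    """Wall-clock time for always-runnable threads to finish a fixed amount
    of total CPU work under a CFS quota (toy model, not a scheduler)."""
    if threads <= 0 or limit_cpus <= 0:
        raise ValueError("threads and limit_cpus must be positive")
    quota_ms = limit_cpus * period_ms
    elapsed = 0.0
    remaining = cpu_work_ms
    while remaining > 0:
        # CPU time the cgroup can consume in this period.
        budget = min(quota_ms, threads * period_ms)
        if remaining <= budget:
            # Work finishes inside this period without hitting the quota.
            return elapsed + remaining / threads
        # Quota exhausted: the cgroup waits for the next period.
        remaining -= budget
        elapsed += period_ms
    return elapsed


limited = wall_clock_ms(600, threads=8, limit_cpus=2)
unlimited = wall_clock_ms(600, threads=8, limit_cpus=8)
print(f"limit=2: {limited}ms, no effective limit: {unlimited}ms, "
      f"slowdown {limited / unlimited:.1f}x")
```

In each full period the 8 threads burn the 200ms quota in 25ms and then sit idle for 75ms, which is where the extra wall-clock time goes.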
The fix is often "remove the CPU limit." This sounds reckless but is correct in many cases: if you have HA across nodes and your concern was "a runaway service can't eat the whole node," the request (via cpu.shares) already prevents that under contention. The limit was only protecting against the case of no contention, which is the case where you don't need protection.
We've removed CPU limits on a number of latency-sensitive workloads. Removed throttling, no other side effects. Memory limits stay.
Memory limits work the opposite way from CPU limits: they're a hard ceiling, not a quota. The cgroup can use up to memory.max bytes (memory.limit_in_bytes on cgroup v1). If an allocation would push it over and the kernel can't reclaim enough to get back under, the OOM killer terminates the worst-offending process in the cgroup (often the container's main process, which means container death).
Memory is not compressible: you can't "throttle" a process's memory the way you throttle its CPU. When the kernel hits the wall, things die.
A few wrinkles:
The OOM killer chooses which process to kill. Within the cgroup, it picks the one with the highest "oom_score" (a kernel heuristic based mainly on the process's memory footprint, shifted by oom_score_adj). For containers with one process, this is straightforward. For containers running an init + a worker, the worker usually gets killed, the init exits, the container dies.
Page cache counts. Memory usage from the cgroup's perspective includes page cache (file-system buffers). A workload that does heavy file I/O can have a memory usage much higher than its RSS. Limits include both. This surprises people who size limits based on ps/RSS.
Memory pressure precedes OOM kill. Before the cgroup goes over the limit, the kernel will swap (if swap is enabled, which it usually isn't in Kubernetes) or reclaim page cache. You can see this in the cgroup's memory.pressure file (cgroup v2). It's a leading indicator before actual OOM kills.
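Two of these wrinkles are easy to inspect from the cgroup's own files. The sketch below assumes the cgroup v2 layouts of memory.stat (anon and file keys) and memory.pressure (PSI lines starting with some and full). The sample contents are made up; on a real node you would read the files from the container's cgroup directory.

```python
from __future__ import annotations


def memory_breakdown(memory_stat_text: str) -> dict[str, int]:
    """Split cgroup v2 memory.stat into anonymous memory and page cache, in bytes."""
    stats = {}
    for line in memory_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return {"anon": stats["anon"], "file": stats["file"]}


def parse_pressure(pressure_text: str) -> dict[str, dict[str, float]]:
    """Parse PSI lines such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in pressure_text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {key: float(value)
                        for key, value in (field.split("=") for field in fields)}
    return result


# Made-up samples: 300 MiB of anonymous memory, 1500 MiB of page cache.
SAMPLE_STAT = "anon 314572800\nfile 1572864000\nkernel_stack 1048576\n"
SAMPLE_PRESSURE = (
    "some avg10=4.20 avg60=1.10 avg300=0.30 total=123456\n"
    "full avg10=1.50 avg60=0.40 avg300=0.10 total=45678\n"
)

usage = memory_breakdown(SAMPLE_STAT)
print(f"page cache is {usage['file'] / (usage['anon'] + usage['file']):.0%} of anon + file")
print("some avg10:", parse_pressure(SAMPLE_PRESSURE)["some"]["avg10"])
```

In the sample, most of the charged memory is page cache, which is the case where sizing a limit from RSS alone goes wrong.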
A common confusion: a container with a 2Gi memory limit, but cat /proc/meminfo inside the container shows the host's full memory (often much larger).
This is because /proc/meminfo reads from the host kernel, not the cgroup. The container's namespace doesn't isolate /proc/meminfo (or /proc/cpuinfo, or several others). Tools that read these files (like the JVM sizing its heap based on system memory) misbehave inside containers.
Modern JVMs (Java 10+, plus 8u191+ via backport) and Node.js automatically detect cgroup limits via /sys/fs/cgroup/... files. Older runtimes need explicit configuration (-XX:MaxRAMPercentage, --max-old-space-size).
This is the source of countless "my container died at 2Gi even though /proc/meminfo shows 32Gi" debugging sessions.
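For a runtime or script that has to size itself, the safe pattern is to take the smaller of the host total and the cgroup limit. A minimal sketch, working on file contents rather than paths so it can be tried anywhere; the sample values (32Gi host, 2Gi limit) mirror the scenario above, and the function names are ours.

```python
from typing import Optional


def parse_meminfo_total(meminfo_text: str) -> int:
    """Return MemTotal from /proc/meminfo contents, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024  # the file reports kB
    raise ValueError("MemTotal not found")


def parse_cgroup_limit(limit_text: str) -> Optional[int]:
    """Parse a cgroup memory limit; cgroup v2 writes 'max' when unlimited."""
    value = limit_text.strip()
    return None if value == "max" else int(value)


def effective_memory_bytes(meminfo_text: str, limit_text: str) -> int:
    """Memory the process should plan around: min(host total, cgroup limit)."""
    host = parse_meminfo_total(meminfo_text)
    limit = parse_cgroup_limit(limit_text)
    # min() also covers cgroup v1, where "unlimited" is a very large number.
    return host if limit is None else min(host, limit)


SAMPLE_MEMINFO = "MemTotal:       33554432 kB\nMemFree:        1234567 kB\n"
print(effective_memory_bytes(SAMPLE_MEMINFO, "2147483648\n"))  # limited container
print(effective_memory_bytes(SAMPLE_MEMINFO, "max\n"))         # no limit set
```

Inside a container the limit text would typically come from memory.max on cgroup v2 or memory.limit_in_bytes on cgroup v1.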
When you set cpu: 100m (0.1 CPU), the kernel doesn't give you 10% of a single CPU. It gives you a quota of 10ms per 100ms period, which you can use across any available cores.
So a 100m limit on a 16-core node lets your workload use 10ms of CPU time per 100ms period, spent on any of the 16 cores. A pure single-threaded workload can use a single core for 10ms. A multi-threaded workload can run on 16 cores simultaneously and burn the entire quota in under 1ms of wall-clock time (10ms ÷ 16 ≈ 0.6ms).
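The burn-time arithmetic, as a quick sketch. It assumes every thread is runnable the whole time and each one gets its own core.

```python
def quota_burn_ms(limit_milli: int, busy_threads: int, period_ms: float = 100.0) -> float:
    """Wall-clock time for busy threads to exhaust one period's quota."""
    quota_ms = limit_milli * period_ms / 1000  # CPU time allowed per period
    return quota_ms / busy_threads


for threads in (1, 4, 16):
    print(f"100m limit, {threads} busy threads: quota gone after "
          f"{quota_burn_ms(100, threads)}ms of wall-clock time")
```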
In practice this means:
The default 100ms period is a Kubernetes/runtime default. The kernel supports periods as short as 1ms. Shorter periods = more granular throttling (less wall-clock waiting after hitting the quota).
There's a kubelet setting for this (the --cpu-cfs-quota-period flag, or cpuCFSQuotaPeriod in the kubelet config file). Most teams leave it at the default. We've experimented with shorter periods (5ms) on latency-sensitive workloads and seen p99 latency improvements when the workload was hitting throttling. Side effects: slightly more scheduling overhead. Not a universal win; depends on the workload.
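Why a shorter period can help tail latency: the share of time spent throttled stays the same, but each individual stall gets shorter. A rough sketch under the same toy assumptions as before (threads always runnable, one core each); the 2-CPU limit and 8 threads are illustrative.

```python
def worst_case_stall_ms(limit_milli: int, busy_threads: int, period_ms: float) -> float:
    """Longest single wait after the quota runs out within one period."""
    quota_ms = limit_milli * period_ms / 1000
    burn_ms = quota_ms / busy_threads  # wall-clock time until the quota is gone
    return max(0.0, period_ms - burn_ms)


for period_ms in (100.0, 5.0):
    stall = worst_case_stall_ms(2000, busy_threads=8, period_ms=period_ms)
    print(f"period {period_ms}ms: up to {stall}ms stalled per period")
```

The ratio of stalled time to period length does not change between the two cases. What changes is the size of the largest pause a single request can land in.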
Three rules that we've found generally work:
Set memory limits aggressively. Memory limits prevent runaway processes from taking down the node. Set them at ~1.3× max observed usage. Set them.
Be careful with CPU limits. For latency-sensitive workloads, consider removing them entirely. The request (cpu.shares) handles the "noisy neighbor" case under contention; the limit only adds throttling risk.
Match runtime config to limits. JVM -XX:MaxRAMPercentage, Node --max-old-space-size, Python's gc settings — all should be aware of the container's memory limit, not the host's.
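As a worked version of the first and third rules, here is a sketch that turns a peak usage figure into a limit and matching runtime flags. The 1.3× headroom comes from the rule above. The 75% heap fraction is purely an assumption for illustration; pick your own based on how much non-heap memory the process needs.

```python
MIB = 1024 * 1024


def suggested_limit_mib(max_observed_bytes: int, headroom_percent: int = 130) -> int:
    """Memory limit in MiB: max observed usage times headroom, rounded up."""
    scaled = max_observed_bytes * headroom_percent
    divisor = 100 * MIB
    return (scaled + divisor - 1) // divisor  # integer ceiling division


def jvm_flag(heap_percent: float = 75.0) -> str:
    """JVM heap sized as a percentage of the container limit (assumed 75%)."""
    return f"-XX:MaxRAMPercentage={heap_percent:.1f}"


def node_flag(limit_mib: int, heap_percent: int = 75) -> str:
    """Node.js old-space size in MiB, as a share of the limit (assumed 75%)."""
    return f"--max-old-space-size={limit_mib * heap_percent // 100}"


limit = suggested_limit_mib(1000 * MIB)  # hypothetical peak of 1000 MiB
print(f"memory limit: {limit}Mi")
print(jvm_flag())
print(node_flag(limit))
```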
Container resource limits look like simple numbers and aren't. Knowing what the kernel does with them turns "weird unexplained slowness" into "the cgroup hit cfs_quota_us at 47ms into the period and got throttled for 53ms." The diagnostic value alone is worth understanding the mechanism.