cpu.shares vs cpu.cfs_quota_us vs memory.max — the cgroup mechanics behind Kubernetes resource limits, and the surprises that explain the weird symptoms you've seen.

Container Resource Limits — What They Actually Do at the Kernel Level

A Kubernetes pod spec has resources.requests.cpu: 500m and resources.limits.cpu: 1. What does that mean to the Linux kernel? The mental model most teams use is "the container gets between 500m and 1000m of CPU." That's roughly right and substantially wrong. Knowing what's actually happening at the cgroup level explains a bunch of weird symptoms — random throttling on a barely-loaded service, OOM-kills at memory levels well below the limit, "CPU at 50% but everything's slow."

This post walks through what the runtime and the kernel actually do with your resource limits.

CPU requests = cpu.shares #

When you set resources.requests.cpu: 500m, the container runtime translates that to a cgroup cpu.shares value of 512 (1024 = 1 vCPU; 500m = 0.5 vCPU = 512 shares). On cgroup v2 the same request lands in cpu.weight, rescaled to that file's 1 to 10000 range, but the semantics are the same. cpu.shares is a weight: it controls how much CPU time a cgroup gets relative to other cgroups when there's contention.
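A minimal sketch of that conversion, for illustration only. The helper name is made up, and the floor of 2 is an assumption based on the smallest value cpu.shares accepts on cgroup v1.

```python
SHARES_PER_CPU = 1024  # cgroup v1 convention: 1024 shares per full CPU
MIN_SHARES = 2         # assumed floor: the smallest cpu.shares value the kernel accepts


def millicpu_to_shares(request_millicpu: int) -> int:
    """Convert a CPU request in millicores to a cgroup v1 cpu.shares weight."""
    shares = request_millicpu * SHARES_PER_CPU // 1000
    return max(shares, MIN_SHARES)


print(millicpu_to_shares(500))   # 500m request -> 512
print(millicpu_to_shares(1000))  # 1 full CPU  -> 1024
```

Note that the result is only a weight. Nothing in it says "half a CPU" until other cgroups are competing for the same cores.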

Read carefully: it only matters when there's contention. If your node has spare CPU, your container gets all the CPU it wants regardless of its request. The request is a guarantee under contention, not a cap.

The scheduler uses requests to place the pod (the sum of requests on a node must be ≤ node capacity), but at runtime, the kernel only enforces shares relatively.

Practical consequence: a node packed at 90% of CPU request capacity can have one cgroup getting much more than its "request" share if its neighbors are idle. This is fine; it's the design.
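To make the "relative weight" behavior concrete, here is a toy model, not the CFS algorithm itself. It assumes each cgroup has a fixed CPU demand and splits capacity by weight, handing unused share back to whoever still wants it. The function name and the example cgroups are hypothetical.

```python
def allocate_cpu(capacity_cpus, cgroups):
    """Split CPU among cgroups by weight, never giving one more than it asks for.

    cgroups maps name -> (shares, demand_in_cpus).
    """
    alloc = {name: 0.0 for name in cgroups}
    active = {name for name, (_, demand) in cgroups.items() if demand > 0}
    remaining = float(capacity_cpus)

    while active and remaining > 1e-9:
        total_shares = sum(cgroups[name][0] for name in active)
        fair = {name: remaining * cgroups[name][0] / total_shares for name in active}
        # Cgroups that want no more than their weighted share are fully satisfied.
        satisfied = {name for name in active if cgroups[name][1] <= fair[name]}
        if not satisfied:
            # Everyone wants more than their share: pure contention, split by weight.
            alloc.update(fair)
            break
        for name in satisfied:
            alloc[name] = cgroups[name][1]
            remaining -= cgroups[name][1]
        active -= satisfied
    return alloc


# Idle neighbors: the 512-share cgroup takes the whole 4-CPU node.
print(allocate_cpu(4, {"api": (512, 4.0), "batch": (1024, 0.0), "cron": (512, 0.0)}))

# Full contention: the same cgroup is squeezed down to its weighted share.
print(allocate_cpu(4, {"api": (512, 4.0), "batch": (1024, 4.0), "cron": (512, 4.0)}))
```

In the first call "api" gets all 4 CPUs despite holding a quarter of the shares. In the second it gets 1 CPU, which is the only situation where the request acts as a guarantee.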

CPU limits = cpu.cfs_quota_us + cpu.cfs_period_us #

CPU limits work differently. They use CFS bandwidth control, not shares. The kernel allocates the cgroup a quota of CPU-microseconds per period:

cpu.cfs_period_us = 100000 (100ms by default)
cpu.cfs_quota_us = limit_in_millicpu × 100 (so 500m → 50000 = 50% of a period)

With a 500m limit, the cgroup can use at most 50ms of CPU time in each 100ms window. Once it hits that quota, every thread in it is throttled until the next period starts.
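The arithmetic can be sketched in a few lines. The helper names are illustrative. The second function formats the pair the way cgroup v2 stores it, as a single "quota period" line in cpu.max.

```python
DEFAULT_PERIOD_US = 100_000  # 100ms


def cfs_quota_us(limit_millicpu: int, period_us: int = DEFAULT_PERIOD_US) -> int:
    """CPU-microseconds the cgroup may consume per period for a given limit."""
    return limit_millicpu * period_us // 1000


def cpu_max_line(limit_millicpu: int, period_us: int = DEFAULT_PERIOD_US) -> str:
    """The same limit in cgroup v2 form: the contents of cpu.max."""
    return f"{cfs_quota_us(limit_millicpu, period_us)} {period_us}"


print(cfs_quota_us(500))   # 50000
print(cpu_max_line(2000))  # 200000 100000
```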

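This has a subtle and important consequence: a cgroup with a 1-CPU limit can still get throttled even when the node is idle. The quota is enforced per period, not averaged over a second, and it counts CPU time summed across all threads. If four threads each run for 25ms at the start of a period, the cgroup has spent its 100ms of quota after 25ms of wall-clock time and sits throttled for the remaining 75ms.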

This is why you can see "CPU at 50% but everything's slow" — the average CPU usage is 50%, but the instantaneous usage hit the quota during specific 100ms windows, causing throttling.
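The kernel records this in the cgroup's cpu.stat file, which is the quickest way to confirm the symptom. Below is a small sketch that parses it. The field names are the cgroup v2 ones (cgroup v1 reports throttled_time in nanoseconds instead of throttled_usec), and the sample numbers are invented for illustration: roughly one hour of periods at about 50% average CPU.

```python
# Invented sample in the cgroup v2 cpu.stat format.
SAMPLE_CPU_STAT = """\
usage_usec 1843200000
user_usec 1500000000
system_usec 343200000
nr_periods 36000
nr_throttled 5400
throttled_usec 216000000
"""


def parse_cpu_stat(text):
    """Parse 'key value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttle_summary(stats):
    """Share of periods that hit the quota, and the average stall when they did."""
    periods = stats["nr_periods"]
    throttled = stats["nr_throttled"]
    return {
        "throttled_period_ratio": throttled / periods if periods else 0.0,
        "avg_throttle_ms": stats["throttled_usec"] / throttled / 1000 if throttled else 0.0,
    }


summary = throttle_summary(parse_cpu_stat(SAMPLE_CPU_STAT))
print(f"throttled in {summary['throttled_period_ratio']:.0%} of periods, "
      f"avg {summary['avg_throttle_ms']:.0f}ms per throttled period")
```

In a real container you would read the text from the cgroup's cpu.stat rather than a constant, and compare two readings taken some time apart, since the counters are cumulative.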

Why this matters in practice #

Real example: a Java service with a CPU limit of 2, which is a quota of 200ms of CPU time per 100ms period. The JVM runs many GC threads in parallel. During GC, every thread is trying to run at once, so total demand exceeds 2 CPUs for the duration of the collection. With more than four busy threads the cgroup burns its quota in under 50ms and is throttled for the rest of the period. GC takes 3× longer than it should because it keeps getting paused.
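A back-of-the-envelope model shows how a slowdown of that size can arise. It assumes an otherwise idle node, perfectly parallel work, and quota handed out all at once at the start of each period. The 8 GC threads and 600ms of total GC work are made-up numbers chosen for illustration.

```python
def wall_clock_ms(cpu_work_ms, threads, limit_cpus, period_ms=100.0):
    """Wall-clock time to finish cpu_work_ms of parallel work under a CFS quota."""
    quota_ms = limit_cpus * period_ms
    # CPU time the cgroup can actually consume in one period.
    per_period = min(quota_ms, threads * period_ms)
    elapsed = 0.0
    remaining = cpu_work_ms
    while remaining > per_period:
        # Quota exhausted: the cgroup waits out the rest of this period.
        remaining -= per_period
        elapsed += period_ms
    # Final period: the leftover work runs on all threads at once.
    return elapsed + remaining / threads


limited = wall_clock_ms(600, threads=8, limit_cpus=2)
unlimited = wall_clock_ms(600, threads=8, limit_cpus=8)
print(limited, unlimited, limited / unlimited)  # 225.0 75.0 3.0
```

With the limit, the threads run for 25ms, stall for 75ms, and repeat. The work itself is unchanged; the extra time is spent waiting for the next period.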

The fix is often "remove the CPU limit." This sounds reckless but is correct in many cases: if you have HA across nodes and your concern was "a runaway service can't eat the whole node," the request (via cpu.shares) already prevents that under contention. The limit was only protecting against the case of no contention, which is the case where you don't need protection.

We've removed CPU limits on a number of latency-sensitive workloads. Removed throttling, no other side effects. Memory limits stay.

Memory limits = memory.max (or memory.limit_in_bytes on cgroup v1) #

Memory limits work the opposite way from CPU limits: they're a hard ceiling, not a quota that refills. The cgroup can use up to memory.max bytes. When an allocation would push it past that and the kernel can't reclaim enough memory to make room, the OOM killer terminates the worst-offending process in the cgroup (often the container's main process, which means container death).

Memory is not compressible: you can't "throttle" a process's memory the way you throttle its CPU. When the kernel hits the wall, things die.

A few wrinkles:

The OOM killer chooses which process to kill. Within the cgroup, it picks the one with the highest oom_score (a kernel heuristic driven mostly by the process's memory footprint, shifted by oom_score_adj). For containers with one process, this is straightforward. For containers running an init plus a worker, the worker usually gets killed, the init exits, and the container dies.

Page cache counts. Memory usage from the cgroup's perspective includes page cache (file-system buffers). A workload that does heavy file I/O can have a memory usage much higher than its RSS. Limits include both. This surprises people who size limits based on ps/RSS.

Memory pressure precedes OOM kill. As usage approaches the limit, the kernel first reclaims page cache and, if swap is enabled (it usually isn't in Kubernetes), swaps out anonymous pages. On cgroup v2 you can watch this in the cgroup's memory.pressure file, which makes it a leading indicator ahead of actual OOM kills.
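Both the page-cache split and the pressure signal are readable from the cgroup's own files. The sketch below parses the cgroup v2 formats of memory.stat and memory.pressure. The sample contents are invented, and the helper names are illustrative.

```python
MIB = 1024 * 1024

# Invented samples in the cgroup v2 file formats.
SAMPLE_MEMORY_STAT = """\
anon 805306368
file 1073741824
kernel_stack 1048576
"""

SAMPLE_MEMORY_PRESSURE = """\
some avg10=4.21 avg60=1.30 avg300=0.40 total=8123456
full avg10=1.05 avg60=0.32 avg300=0.10 total=2034567
"""


def parse_memory_stat(text):
    """Parse 'key value' lines from memory.stat into a dict of byte counts."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def parse_pressure(text):
    """Parse PSI lines into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


stat = parse_memory_stat(SAMPLE_MEMORY_STAT)
psi = parse_pressure(SAMPLE_MEMORY_PRESSURE)
print(f"anon (roughly what RSS shows): {stat['anon'] // MIB} MiB")
print(f"file (page cache):             {stat['file'] // MIB} MiB")
print(f"stalled on memory, last 10s:   {psi['some']['avg10']}% some, "
      f"{psi['full']['avg10']}% full")
```

In this sample more than half of the charged memory is page cache, which a limit sized from RSS alone would not have accounted for. A rising avg10 under "some" is the kind of early signal worth alerting on before OOM kills start.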

Why containers see "wrong" memory in /proc/meminfo #

A common confusion: a container has a 2Gi memory limit, but cat /proc/meminfo inside the container shows the host's full memory (often much larger).

This is because /proc/meminfo reads from the host kernel, not the cgroup. The container's namespace doesn't isolate /proc/meminfo (or /proc/cpuinfo, or several others). Tools that read these files (like the JVM sizing its heap based on system memory) misbehave inside containers.

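Modern JVMs (Java 10+, with the support backported to 8u191) and recent Node.js releases detect cgroup limits automatically via /sys/fs/cgroup/... files. Support for cgroup v2 arrived later than v1 in both, so check your exact runtime version. Runtimes that predate this need an explicit cap (-Xmx, --max-old-space-size). Even on a container-aware JVM, -XX:MaxRAMPercentage is worth setting, because the default heap is only a quarter of the detected limit.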

This is the source of countless "my container died at 2Gi even though /proc/meminfo shows 32Gi" debugging sessions.
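For a runtime or script that has no built-in detection, reading the limit directly is straightforward. This is a sketch under assumptions: it expects the container to see its own cgroup at /sys/fs/cgroup (the usual setup), checks the cgroup v2 file first and then the common cgroup v1 location, and treats very large v1 values as "no limit". The function name and threshold are illustrative.

```python
from pathlib import Path

# cgroup v1 reports "no limit" as a huge number rather than a keyword.
V1_UNLIMITED_THRESHOLD = 1 << 60


def container_memory_limit(cgroup_root="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    root = Path(cgroup_root)
    candidates = [
        root / "memory.max",                        # cgroup v2
        root / "memory" / "memory.limit_in_bytes",  # cgroup v1
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable: try the next layout
        if raw == "max":
            return None  # cgroup v2 spelling of "no limit"
        value = int(raw)
        return None if value >= V1_UNLIMITED_THRESHOLD else value
    return None


limit = container_memory_limit()
print("no cgroup memory limit found" if limit is None else f"{limit} bytes")
```

Inside a container with a 2Gi limit this should print 2147483648 bytes, regardless of what /proc/meminfo claims.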

CPU "fractional" allocation isn't fractional #

When you set cpu: 100m (0.1 CPU), the kernel doesn't give you 10% of a single CPU. It gives you a quota of 10ms per 100ms period, which you can use across any available cores.

So a 100m limit on a 16-core node lets your workload use 10ms of CPU time per 100ms period, spread over any of the 16 cores. A pure single-threaded workload can use a single core for 10ms. A multi-threaded workload can run on all 16 cores simultaneously and burn the entire quota in under 1ms of wall-clock time (10ms ÷ 16 ≈ 0.6ms).

In practice this means:

Heavily parallel workloads with low limits get throttled aggressively (lots of cores wanting CPU; quota exhausted fast).
Single-threaded workloads use their quota more evenly across wall-clock time.
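The effect of thread count on a fixed quota is easy to tabulate. This is an idealized sketch: it assumes every thread is runnable for the whole burst and ignores that the kernel hands quota out to CPUs in small slices, so real numbers will be less tidy. The helper name is made up.

```python
def ms_until_throttled(limit_millicpu, busy_threads, period_ms=100.0):
    """Wall-clock ms a cgroup runs in each period before its quota is gone."""
    quota_ms = limit_millicpu * period_ms / 1000
    # A cgroup that cannot spend its quota within the period is never throttled.
    return min(quota_ms / busy_threads, period_ms)


for threads in (1, 4, 16):
    run_ms = ms_until_throttled(100, threads)
    print(f"{threads:>2} threads: runs {run_ms:.3f}ms, "
          f"then waits {100 - run_ms:.3f}ms")
```

The total CPU time is 10ms per period in every row. What changes is how quickly it is spent and how long the workload then sits idle.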

CPU period tuning #

The default 100ms period is a Kubernetes/runtime default. The kernel supports periods as short as 1ms. Shorter periods = more granular throttling (less wall-clock waiting after hitting the quota).

The period isn't something you tune on the kernel directly; the kubelet controls it node-wide (cpuCFSQuotaPeriod in the kubelet config, or the --cpu-cfs-quota-period flag, which has historically sat behind the CustomCPUCFSQuotaPeriod feature gate). Most teams leave it at the default. We've experimented with shorter periods (5ms) on latency-sensitive workloads and seen p99 latency improvements when the workload was hitting throttling. Side effects: slightly more scheduling overhead. Not a universal win; depends on the workload.
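A rough way to see why a shorter period helps tail latency is to compute the longest stall a throttled cgroup can experience. This is the same idealized model of quota and periods, with hypothetical numbers: a 2-CPU limit and 8 busy threads.

```python
def worst_case_stall_ms(limit_cpus, busy_threads, period_ms):
    """Longest wait for the next period after the quota runs out."""
    quota_ms = limit_cpus * period_ms
    run_ms = min(quota_ms / busy_threads, period_ms)
    return period_ms - run_ms


for period_ms in (100.0, 20.0, 5.0):
    stall = worst_case_stall_ms(limit_cpus=2, busy_threads=8, period_ms=period_ms)
    print(f"period {period_ms:>5.1f}ms -> stall up to {stall:.2f}ms")
```

The fraction of time spent throttled is 75% in every case, so throughput is unchanged. Only the length of each individual pause shrinks, which is what a p99 measurement notices.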

What we tell teams #

Three rules that we've found generally work:

Set memory limits aggressively. Memory limits prevent runaway processes from taking down the node. Set them at ~1.3× max observed usage. Set them.
Be careful with CPU limits. For latency-sensitive workloads, consider removing them entirely. The request (cpu.shares) handles the "noisy neighbor" case under contention; the limit only adds throttling risk.
Match runtime config to limits. JVM -XX:MaxRAMPercentage, Node --max-old-space-size, Python's gc settings — all should be aware of the container's memory limit, not the host's.
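The first rule reduces to simple arithmetic over whatever usage samples you have. A sketch, assuming samples are in bytes and that you want the result rounded up to whole Mi. The function name and the integer-percent headroom parameter are illustrative choices, not an established convention.

```python
MIB = 1024 * 1024


def suggest_memory_limit(observed_bytes, headroom_percent=130):
    """Return a Kubernetes-style limit string at ~1.3x the observed peak."""
    peak = max(observed_bytes)
    # Ceiling division keeps everything in integers and never rounds down.
    target = -(-peak * headroom_percent // 100)
    return f"{-(-target // MIB)}Mi"


samples = [700 * MIB, 1024 * MIB, 910 * MIB]  # made-up usage samples
print(suggest_memory_limit(samples))  # 1332Mi
```

Feed it working-set or total cgroup usage rather than RSS alone, since page cache counts against the limit too.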

What to read next #

Linux container internals — how containers actually work — the foundation cgroups are built on
Kubernetes resource requests — right-sizing without guessing — the request side of the picture
Linux performance tuning for production servers — broader perf knobs
eBPF tools for everyday ops — bpftrace patterns — how to diagnose throttling

Container resource limits look like simple numbers and aren't. Knowing what the kernel does with them turns "weird unexplained slowness" into "the cgroup hit cfs_quota_us at 47ms into the period and got throttled for 53ms." The diagnostic value alone is worth understanding the mechanism.

About Kiril Urbonas
