A container is a process with extra kernel features applied. Walking through namespaces, cgroups, and the actual mechanics — the level of detail that makes "container weirdness" debuggable.

How Containers Actually Work in Linux

Most container content stops at "containers are lightweight VMs" or "containers share the host kernel." The first is misleading and the second is true, but neither helps when you're debugging why your container does something weird. This post walks through what's actually under the hood, at the level of detail you need when something goes wrong.

The summary upfront #

A container is a regular Linux process with extra kernel features applied: namespaces for isolation, cgroups for resource limits, and a filesystem layout that looks different from the host. Everything else (Docker, containerd, runc, Kubernetes) is tooling on top of these primitives.

If you understand the primitives, "container behavior" becomes "process behavior in a particular kernel configuration," which is much easier to debug.

Namespaces: the isolation primitive #

A namespace is a kernel feature that gives a process a different view of some part of the system. Linux has eight namespace types:

PID: the process tree. The first process in the namespace has PID 1.
Mount: filesystem mounts. Process sees a different /.
Network: network interfaces, routing tables, sockets.
UTS: hostname and domain name.
IPC: System V IPC and POSIX message queues.
User: UIDs and GIDs (mapping between namespace and host).
Cgroup: which cgroup the process appears to be in.
Time: offsets for the monotonic and boot-time clocks (Linux 5.6 and later).

When you run a container, the runtime creates a new instance of (most of) these namespaces and starts the container process inside them.

You can see them yourself:

sh.sh

# In a running container, check its namespaces
$ ls -la /proc/self/ns/
lrwxrwxrwx 1 root root 0 ipc -> 'ipc:[4026532...]'
lrwxrwxrwx 1 root root 0 mnt -> 'mnt:[4026532...]'
lrwxrwxrwx 1 root root 0 net -> 'net:[4026532...]'
lrwxrwxrwx 1 root root 0 pid -> 'pid:[4026532...]'
...

The numbers in brackets are the namespace IDs. Different containers have different IDs; the host has its own.
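To check whether two processes really share a namespace, you can compare those IDs. A minimal sketch, assuming a Linux /proc and enough privilege to read the other process's ns links; ns_id and same_ns are illustrative helper names, not part of any tool:

```sh
# Extract the numeric ID from a namespace link target such as "pid:[4026531836]".
ns_id() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# Succeed when PIDs $1 and $2 are in the same namespace of type $3 (pid, mnt, net, ...).
# Returns 2 if either link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$(ns_id "$a")" = "$(ns_id "$b")" ]
}
```

For example, `same_ns 1 <pid> net && echo "shares the host network namespace"` run as root on the host tells you whether a container process was started with host networking.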

You can also enter another process's namespaces with nsenter. This is conceptually how kubectl exec works under the hood: the runtime starts a new process that joins the container's namespaces with setns(2).
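A small wrapper makes the nsenter call concrete. This is a sketch, not what any runtime literally executes; enter_ns and the DRY_RUN switch are names invented for this example, and the real call needs root:

```sh
# Run a command inside the namespaces of the process with host PID $1.
# With DRY_RUN set (a switch made up for this sketch), print the command instead.
enter_ns() {
    pid=$1
    shift
    set -- nsenter --target "$pid" --mount --uts --ipc --net --pid -- "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With Docker, the host PID of a container's main process comes from `docker inspect --format '{{.State.Pid}}' <container>`, so `enter_ns <pid> ip addr` shows the container's interfaces even if the image has no shell tools of its own for the job.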

What each namespace gives you #

PID namespace. Inside the container, the entrypoint process sees itself as PID 1. It only sees other processes in the same PID namespace. The host can see all of them with their host-side PIDs (different from the inside-container PIDs).

This is why "the container died but the host didn't notice" can happen: if the container's PID 1 exits, the kernel sends SIGKILL to every other process in that PID namespace. From the host, the container's processes just disappear.
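The two PID views are visible in /proc/<pid>/status, whose NSpid line lists the PID in each nested namespace, outermost first. A minimal sketch that pulls out the innermost one (inner_pid is an illustrative name; the NSpid field needs a reasonably recent kernel):

```sh
# Read /proc/<pid>/status text on stdin and print the PID as seen in the
# innermost PID namespace, which is the last field of the NSpid line.
inner_pid() {
    awk '$1 == "NSpid:" { print $NF }'
}
```

Run from the host as `inner_pid < /proc/<host-pid>/status`. For a container's entrypoint this prints 1, while the host-side PID is the first number on the same line.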

Mount namespace. The container sees a different filesystem layout. Typically the runtime creates a private root filesystem, mounts the container image's layers, and switches into it with pivot_root (or chroot). The container can't see the host's /etc/passwd, the host's /var/log, and so on, unless they are explicitly mounted in.

This is why "the file isn't there" inside the container even though it's there on the host — different mount namespace, different filesystem view.
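You can confirm which filesystem a process sees as / by reading its mountinfo. A sketch under the assumption that the standard mountinfo layout applies (mount point in field 5, filesystem type right after the "-" separator); root_fs_type is an illustrative name:

```sh
# Read /proc/<pid>/mountinfo text on stdin and print the filesystem type
# of the mount whose mount point is "/".
root_fs_type() {
    awk '$5 == "/" {
        for (i = 7; i <= NF; i++)
            if ($i == "-") { print $(i + 1); exit }
    }'
}
```

`root_fs_type < /proc/<pid>/mountinfo` typically prints overlay for a container process and something like ext4 or xfs for a host process.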

Network namespace. The container has its own network interfaces. Most commonly: a veth pair where one end is in the container, the other in the host (or a bridge / virtual switch). Routes, iptables rules, sockets — all are per-namespace.

One subtlety: the container's lo interface is not the host's lo. They don't connect, so a service listening on 127.0.0.1 on the host is not reachable at localhost from inside the container.
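Because /proc/<pid>/net reflects that process's network namespace, the host can list a container's interfaces without entering it. A minimal sketch that parses the /proc/net/dev format (list_ifaces is an illustrative name):

```sh
# Read /proc/<pid>/net/dev text on stdin and print one interface name per line.
# The first two lines of that file are column headers.
list_ifaces() {
    awk -F: 'NR > 2 { gsub(/^[ \t]+/, "", $1); print $1 }'
}
```

Comparing `list_ifaces < /proc/1/net/dev` with `list_ifaces < /proc/<pid>/net/dev` usually shows the host's full interface list on one side and just lo plus one veth end (often named eth0) on the other.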

User namespace. Maps UIDs/GIDs between namespace and host. Inside the container, root might be UID 0; on the host, that maps to (e.g.) UID 100000. So a "root" process in the container is actually unprivileged on the host.

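Most production setups don't enable user namespaces by default: with stock Docker or Kubernetes settings, root in the container is UID 0 on the host, held back only by dropped capabilities, seccomp, and similar controls. (Rootless modes exist, and they are built on user namespaces.) When user namespaces are in use, things like file ownership get interesting because the in-container UID differs from the on-host UID.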
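The mapping lives in /proc/<pid>/uid_map, one range per line: first UID inside the namespace, first UID outside, and the range length. A sketch of the lookup, with host_uid as an illustrative helper name:

```sh
# Translate an in-namespace UID ($1) to the host UID using uid_map text on stdin.
# Each line is "<inside-start> <outside-start> <length>".
# Prints nothing and returns non-zero when the UID is not mapped.
host_uid() {
    awk -v uid="$1" '
        $1 <= uid && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { exit !found }
    '
}
```

With the example mapping from above, `host_uid 0 < /proc/<pid>/uid_map` would print 100000. A process without a separate user namespace shows an identity mapping, so the same call prints 0.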

Cgroups: the resource limit primitive #

Namespaces isolate; cgroups limit. A cgroup ("control group") is a kernel feature for accounting and limiting resource usage of a group of processes.

Cgroups v2 (the modern version) controllers we care about:

cpu: CPU time
memory: memory usage and limits
io: block I/O bandwidth
pids: process/thread count

When a container starts, the runtime creates a cgroup for it and adds the container's processes to that cgroup. Limits set on the cgroup (via the runtime — e.g., --memory=2g) translate to writes to the cgroup's control files.

You can see this directly:

sh.sh

# Find a container's cgroup
$ cat /proc/<pid>/cgroup
0::/system.slice/docker-<container-id>.scope

# Look at its memory limit
$ cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max
2147483648

When the cgroup's usage (memory.current) reaches memory.max and the kernel cannot reclaim enough memory, the OOM killer runs inside that cgroup. The host doesn't crash; only processes in that container are killed.

This is why "memory limit exceeded" kills the container but not the host — the cgroup boundary contains the OOM scope.
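Two small helpers tie the pieces together: one finds the cgroup v2 path from /proc/<pid>/cgroup, the other shows why --memory=2g ends up as 2147483648 in memory.max. This is a sketch; cgroup_v2_path and to_bytes are illustrative names, and to_bytes only handles whole numbers with an optional b, k, m, or g suffix:

```sh
# Print the cgroup v2 path from /proc/<pid>/cgroup text on stdin (the "0::" line).
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Convert a size such as 512m or 2g to bytes, using binary multiples.
to_bytes() {
    num=${1%[bkmgBKMG]}
    case $1 in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}
```

On a cgroup v2 host, `cat "/sys/fs/cgroup$(cgroup_v2_path < /proc/<pid>/cgroup)/memory.current"` then gives the usage figure the kernel compares against memory.max.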

The container image: layered filesystem #

A container image is a stack of layers. Each layer is a tarball of changes (files added, modified, removed) on top of the previous layer.

When the runtime starts a container, it:

Creates an overlay filesystem with the image's layers as the lower layers (read-only)
Adds a writable upper layer for runtime changes
Mounts this overlay as the container's root filesystem

OverlayFS is the kernel feature that does this. The container sees a unified filesystem; the actual data is split across multiple directories on the host.
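The three steps map onto a single overlay mount whose options name the lower, upper, and work directories. A sketch that only builds the option string, so it is safe to run anywhere; overlay_opts and the directory arguments are placeholders for this example:

```sh
# Build the option string for an overlay mount.
# Usage: overlay_opts UPPER WORK LOWER_TOP [LOWER_BELOW...]
# In lowerdir, the leftmost directory is the topmost layer.
overlay_opts() {
    upper=$1
    work=$2
    shift 2
    lower=$(IFS=:; echo "$*")
    echo "lowerdir=$lower,upperdir=$upper,workdir=$work"
}
```

As root, `mount -t overlay overlay -o "$(overlay_opts /upper /work /layer2 /layer1)" /merged` would produce the unified view at /merged, with writes landing in /upper.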

Why this matters operationally:

Multiple containers from the same image share the lower layers. Disk-efficient.
Writes go to the upper layer (per-container). Lost when the container is removed.
Persistent data needs volumes. A volume bind-mounts a host directory (or a specific filesystem) into the container, bypassing the layered filesystem for that path.

How `docker run` actually works #

Walking through what happens when you docker run nginx:

The Docker daemon receives the request and delegates most of the steps below to containerd and runc.
Image is pulled if not already local. Each layer is a content-addressed blob; cached after first pull.
OverlayFS is set up. The image's layers are mounted; an empty upper layer is created.
A new process is created with clone() and the relevant namespace flags (CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWNET, etc.).
The process pivot_roots into the overlay filesystem.
A cgroup is created for the container; the process is added.
Network is set up (veth pair, bridge attachment, etc.).
The container's entrypoint is exec'd as PID 1 in the new PID namespace.
The Docker daemon attaches to the container's stdio so docker logs works.

That's it. No virtualization, no hypervisor, no second kernel. Just one process with extra kernel features applied.
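A heavily simplified version of the namespace and root-switch steps can be sketched with unshare and chroot. This is an assumption-laden toy, not what Docker runs: it uses chroot instead of pivot_root, skips the overlay, cgroup, and network setup, and expects an already unpacked root filesystem. run_isolated and DRY_RUN are names invented for this example:

```sh
# Start a command in fresh PID, mount, UTS, IPC, and network namespaces,
# rooted at the directory $1. Needs root when actually executed.
# With DRY_RUN set (a switch made up for this sketch), print the command instead.
run_isolated() {
    rootfs=$1
    shift
    set -- unshare --fork --pid --mount --uts --ipc --net chroot "$rootfs" "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

`run_isolated /path/to/rootfs /bin/sh` drops you into a shell that is PID 1 in its own namespace and has only a down lo interface. /proc is not mounted inside the new root in this sketch, so tools like ps will not work until you mount it.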

Common "container weirdness" explained #

"My container exited immediately." The entrypoint (PID 1) terminated. PID 1 has special semantics: when it dies, the kernel terminates the PID namespace. Common cause: a bash entrypoint that ran a command and then exited.

"My signals aren't being handled." PID 1 gets special treatment: the kernel does not apply default signal actions to it, so a SIGTERM or SIGINT for which the process has no handler is silently ignored. A shell wrapper running as PID 1 also won't forward signals to the child it started. Use a proper init like tini as PID 1, or write your app to handle the signals itself.

"Why is /proc/1/status showing weird limits?" /proc inside the container shows the container's processes, but some files (like /proc/cpuinfo and /proc/meminfo) are not namespaced and reflect the host's hardware, not the cgroup limits. This confuses tools that size thread pools or heaps from the host's CPU count or total memory; older JVMs were a well-known example.

"My memory usage looks higher than the limit I set." Linux cgroup memory accounting includes page cache and other non-RSS memory. The limit applies to everything charged to the cgroup; the RSS you see in top is just one process's resident set, not the full cgroup usage.

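The signal, CPU, and memory items above each have a small shell-level check or fix. A sketch with illustrative names (entrypoint, cpu_limit, mem_breakdown, and the DEMO_APP_ENV variable are all made up for this example), assuming cgroup v2 file formats:

```sh
# Entrypoint pattern: do setup in the shell, then exec the application so it
# replaces the shell, becomes PID 1 itself, and receives signals directly.
entrypoint() {
    : "${DEMO_APP_ENV:=production}"   # hypothetical setup step
    export DEMO_APP_ENV
    exec "$@"
}

# Read cpu.max text ("<quota> <period>" or "max <period>") on stdin and
# print the CPU limit in cores, or "unlimited".
cpu_limit() {
    LC_ALL=C awk '{ if ($1 == "max") print "unlimited"; else printf "%.2f\n", $1 / $2 }'
}

# Read memory.stat text on stdin and print the anonymous and page-cache totals.
mem_breakdown() {
    awk '$1 == "anon" || $1 == "file" { print $1 "=" $2 }'
}
```

Feeding cpu_limit the cgroup's cpu.max gives the number a thread pool should be sized against, rather than the host's core count from /proc/cpuinfo. mem_breakdown on memory.stat shows how much of the charged memory is page cache (file) rather than process memory (anon).
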
"The container can see processes from another container." Should never happen with proper PID namespace setup. If it does, the namespace setup is broken (e.g., --pid=host was passed, sharing the host's PID namespace).

"DNS works on the host but not in the container." Different network namespaces; different /etc/resolv.conf. The container's resolv.conf is set by the runtime — usually pointing to a runtime-provided resolver.

Container = process: things this implies #

If a container is just a process with extras, then:

Standard process tools work: strace, lsof, gdb etc. can attach to container processes from the host (you need to be root and may need to enter the container's namespaces with nsenter).
Process start is fast, typically on the order of milliseconds, just like starting any other process. The "slow container start" you see is image pull, layer extraction, and network setup, not the process start itself.
You can run "container" workloads without a container runtime. unshare, some cgroup configuration, and an exec are enough. The runtime is convenience.
Containers don't "boot." No init system unless you put one in. No services starting up. Just your entrypoint, immediately running.

What containers DON'T isolate #

Things shared between containers and host:

The kernel. Same kernel, same kernel modules, same syscall surface. A bug in the kernel affects everything.
The hostname (unless using UTS namespace, which most do).
Hardware devices (unless explicitly virtualized — GPUs, etc.).
System time (mostly — time namespace is recent and not always used).
Hardware-level state like CPU cache, memory bandwidth.

This is the trade-off versus VMs: VMs isolate at the hardware level; containers isolate at the syscall level. Containers are lighter; VMs are more isolated.

What this means for security #

Containers are a security boundary, but a thinner one than VMs:

A kernel-level vulnerability can let an attacker escape any container.
Misconfigured container settings (--privileged, --cap-add ALL, mounted /var/run/docker.sock) defeat the security entirely.
Default Docker / Kubernetes settings are reasonable but not bulletproof. We add layers: seccomp profiles, AppArmor/SELinux, read-only root filesystem, no CAP_SYS_ADMIN, non-root user.
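Several of those layers are plain docker run flags. A sketch of one possible baseline, not a recommended policy: the UID, the pids limit, and the run_hardened and DRY_RUN names are arbitrary choices for this example, and many images need extra writable mounts or capabilities before they will start under it:

```sh
# Run an image with a read-only root filesystem, no capabilities,
# no privilege escalation, a non-root user, and a process-count cap.
# With DRY_RUN set (a switch made up for this sketch), print the command instead.
run_hardened() {
    set -- docker run --read-only --cap-drop=ALL \
        --security-opt no-new-privileges --user 1000:1000 --pids-limit 256 "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

`DRY_RUN=1 run_hardened <image>` prints the full command line so you can review it before running anything.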

For multi-tenant scenarios, the security boundary often needs to be stronger than a normal container. Options: gVisor (an additional userspace kernel), Kata Containers (lightweight VMs that look like containers), Firecracker (microVMs from AWS).

What I'd tell someone learning containers #

Read the manpages: namespaces(7), cgroups(7), unshare(1). The kernel docs are clearer than most blog posts.

Try unshare and nsenter directly. Spawning a process in a new PID namespace with one command makes the abstraction concrete.

Look inside /proc and /sys/fs/cgroup. The kernel exposes everything; reading these files demystifies what runtimes are doing.

When something is weird, check the namespaces and cgroups. "Why is this happening" usually has an answer in /proc/<pid>/ns/ or the cgroup files.

Don't think of containers as VMs. They're processes with extras. The mental model of "a small Linux box" leads to wrong intuitions about security, isolation, and behavior.

Containers are one of the great kernel-features-as-platform stories. They didn't add new fundamental capabilities to Linux; they composed existing ones (namespaces, cgroups, OverlayFS) into something useful. Understanding the layers under the runtime is the difference between treating containers as magic and treating them as engineering.

About Kiril Urbonas