A container is a process with extra kernel features applied. Walking through namespaces, cgroups, and the actual mechanics — the level of detail that makes "container weirdness" debuggable.
Most container content stops at "containers are lightweight VMs" or "containers share the host kernel." Both are true and both are useless when you're debugging why your container does something weird. This post walks through what's actually under the hood, at the level of detail you need when something goes wrong.
A container is a regular Linux process with extra kernel features applied: namespaces for isolation, cgroups for resource limits, and a filesystem layout that looks different from the host. Everything else (Docker, containerd, runc, Kubernetes) is tooling on top of these primitives.
If you understand the primitives, "container behavior" becomes "process behavior in a particular kernel configuration," which is much easier to debug.
A namespace is a kernel feature that gives a process a different view of some part of the system. Linux has eight namespace types: mount, PID, network, UTS, IPC, user, cgroup, and time.
When you run a container, the runtime creates a new instance of (most of) these namespaces and starts the container process inside them.
You can see them yourself:
# In a running container, check its namespaces
$ ls -la /proc/self/ns/
lrwxrwxrwx 1 root root 0 ipc -> 'ipc:[4026532...]'
lrwxrwxrwx 1 root root 0 mnt -> 'mnt:[4026532...]'
lrwxrwxrwx 1 root root 0 net -> 'net:[4026532...]'
lrwxrwxrwx 1 root root 0 pid -> 'pid:[4026532...]'
...
The numbers in brackets are the namespace IDs. Different containers have different IDs; the host has its own.
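To check whether two processes share a namespace, comparing those IDs is enough. A minimal sketch, assuming a Linux /proc; the helper names ns_inode_from_link, ns_id, and same_ns are made up for this post:

```shell
# ns_inode_from_link LINK: extract the ID from a link target such as
# "pid:[4026531836]".
ns_inode_from_link() {
  printf '%s\n' "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)]$/\1/'
}

# ns_id PID TYPE: print the namespace ID of one namespace of a process.
# Prints an empty line if the link cannot be read (e.g. missing permissions).
ns_id() {
  ns_inode_from_link "$(readlink "/proc/$1/ns/$2" 2>/dev/null)"
}

# same_ns PID_A PID_B TYPE: succeed if both processes share that namespace.
same_ns() {
  [ -n "$(ns_id "$1" "$3")" ] && [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}

# Example: compare this shell with its parent (usually the same namespace).
if same_ns "$$" "$PPID" net; then
  echo "same network namespace as parent"
else
  echo "different network namespace, or /proc not readable"
fi
```

Run from the host against a container's host-side PID, same_ns will normally fail for pid, mnt, and net, which is the isolation showing up as plain numbers.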
You can also enter another process's namespaces with nsenter. kubectl exec works the same way under the hood: the container runtime starts the new process inside the container's existing namespaces.
PID namespace. Inside the container, the entrypoint process sees itself as PID 1. It only sees other processes in the same PID namespace. The host can see all of them with their host-side PIDs (different from the inside-container PIDs).
This is why "the container died but the host didn't notice" can happen: if the container's PID 1 exits, the kernel kills every other process in that PID namespace. From the host, the container's processes just disappear.
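The host-side and in-container PIDs of a process are both visible in the NSpid line of /proc/<pid>/status, on kernels recent enough to expose that field. A small sketch; inner_pid is a hypothetical helper name and 12345 a placeholder PID:

```shell
# inner_pid: read a /proc/<pid>/status file on stdin and print the PID the
# process has in its innermost PID namespace (the last NSpid field).
inner_pid() {
  awk '$1 == "NSpid:" { print $NF }'
}

# On a host, for a container entrypoint with host PID 12345 (placeholder):
#   inner_pid < /proc/12345/status     # the entrypoint reports 1
#
# For the current shell the two values are normally identical.
if [ -r "/proc/$$/status" ]; then
  inner_pid < "/proc/$$/status"
fi
```

The first NSpid field is the PID as seen from the namespace that mounted /proc; each further field is the PID one namespace level deeper.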
Mount namespace. The container sees a different filesystem layout. Typically the runtime creates a private root filesystem, mounts the container image's layers, and switches into it with chroot or pivot_root. The container can't see the host's /etc/passwd, the host's /var/log, etc.
This is why "the file isn't there" inside the container even though it's there on the host — different mount namespace, different filesystem view.
Network namespace. The container has its own network interfaces. Most commonly: a veth pair where one end is in the container, the other in the host (or a bridge / virtual switch). Routes, iptables rules, sockets — all are per-namespace.
A subtle point: the container's lo interface is not the host's lo. A service listening on 127.0.0.1 on the host is not reachable at 127.0.0.1 from inside the container.
User namespace. Maps UIDs/GIDs between namespace and host. Inside the container, root is UID 0; on the host, that maps to (e.g.) UID 100000. So a "root" process in the container is actually unprivileged on the host.
Most production setups don't enable user namespaces by default: with stock Docker or Kubernetes, UID 0 in the container is UID 0 on the host, held back by dropped capabilities and seccomp rather than by a UID mapping ("rootless" modes are the exception and do rely on user namespaces). When user namespaces are on, things like file ownership get interesting because the in-container UID differs from the on-host UID.
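The mapping itself is readable in /proc/<pid>/uid_map, one range per line as "inside-start outside-start length". A minimal sketch of the translation, with host_uid as a hypothetical helper name:

```shell
# host_uid UID [MAPFILE]: translate an in-namespace UID to the host-side UID
# using uid_map lines of the form "<inside-start> <outside-start> <length>".
# Prints nothing if the UID is not covered by any range.
host_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); exit }
  ' "${2:-/proc/self/uid_map}"
}

# Example with the mapping from the text (container root -> host UID 100000);
# 12345 is a placeholder for a container process's host-side PID:
#   host_uid 0 /proc/12345/uid_map
```

Without user namespaces the file holds a single identity range starting at 0, so host_uid 0 prints 0: container root really is host root.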
Namespaces isolate; cgroups limit. A cgroup ("control group") is a kernel feature for accounting and limiting resource usage of a group of processes.
The cgroups v2 (the modern version) controllers that matter most for containers are memory, cpu, io, and pids.
When a container starts, the runtime creates a cgroup for it and adds the container's processes to that cgroup. Limits set on the cgroup (via the runtime — e.g., --memory=2g) translate to writes to the cgroup's control files.
You can see this directly:
# Find a container's cgroup
$ cat /proc/<pid>/cgroup
0::/system.slice/docker-<container-id>.scope
# Look at its memory limit
$ cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max
2147483648
When the cgroup's memory.current reaches memory.max and the kernel cannot reclaim enough memory, it triggers an OOM kill within the cgroup. The host doesn't crash; just one container dies.
This is why "memory limit exceeded" kills the container but not the host — the cgroup boundary contains the OOM scope.
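When a container vanishes and you suspect the OOM killer, the cgroup's memory.events file keeps a counter. A sketch for cgroup v2 hosts; the helper names and the PID 12345 are placeholders:

```shell
# cgroup_v2_path [CGROUP_FILE]: print the cgroup v2 path from a
# /proc/<pid>/cgroup file (the line that starts with "0::").
cgroup_v2_path() {
  sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# oom_kills EVENTS_FILE: print the oom_kill counter from a memory.events file.
oom_kills() {
  awk '$1 == "oom_kill" { print $2 }' "$1"
}

# Usage on the host, assuming cgroup v2 is mounted at /sys/fs/cgroup:
#   cg="/sys/fs/cgroup$(cgroup_v2_path /proc/12345/cgroup)"
#   cat "$cg/memory.current" "$cg/memory.max"
#   oom_kills "$cg/memory.events"
```

A non-zero oom_kill count means the kernel killed at least one process in that cgroup for exceeding its memory limit.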
A container image is a stack of layers. Each layer is a tarball of changes (files added, modified, removed) on top of the previous layer.
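When the runtime starts a container, it stacks the image's layers as read-only lower directories, adds a writable layer on top, and mounts the merged result as the container's root filesystem.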
OverlayFS is the kernel feature that does this. The container sees a unified filesystem; the actual data is split across multiple directories on the host.
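The layering can be reproduced by hand with a plain overlay mount. A sketch under assumptions: overlay_opts is a hypothetical helper and the directory names are placeholders; the one detail worth showing is that OverlayFS expects lowerdir listed top layer first, the reverse of image order.

```shell
# overlay_opts UPPER WORK LAYER...: build the option string for an overlay
# mount. Layers are given bottom-to-top (image order); lowerdir wants them
# top-to-bottom, so the list is reversed here.
overlay_opts() (
  upper=$1
  work=$2
  shift 2
  lower=
  for layer in "$@"; do
    lower="$layer${lower:+:$lower}"
  done
  printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
)

# Usage (as root; all directories must already exist):
#   mount -t overlay overlay \
#     -o "$(overlay_opts /c/upper /c/work /layers/base /layers/app)" /c/merged
```

Writes made under the merged directory land in the upper directory; the lower layers stay untouched, which is why several containers can share one image on disk.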
Why this matters operationally:
Walking through what happens when you docker run nginx:
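The runtime creates the container process with clone() and the relevant namespace flags (CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWNET, etc.).
The process's stdout and stderr are captured by the runtime, which is how docker logs works.
That's it. No virtualization, no hypervisor, no second kernel. Just one process with extra kernel features applied.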
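The namespace part of that walk-through can be approximated from a shell with unshare(1). This is a rough sketch, not what Docker or runc actually execute: run_isolated and DRY_RUN are names invented here, it uses chroot where real runtimes use pivot_root, and it skips cgroups, /proc mounting, and network setup entirely.

```shell
# run_isolated ROOTFS CMD...: start CMD in new PID, mount, UTS, IPC and
# network namespaces with ROOTFS as its root directory. Needs root.
# With DRY_RUN=1 the command line is printed instead of executed.
run_isolated() (
  rootfs=$1
  shift
  set -- unshare --pid --fork --mount --uts --ipc --net chroot "$rootfs" "$@"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    exec "$@"
  fi
)

# Usage (as root; /srv/rootfs is a placeholder for an unpacked image):
#   run_isolated /srv/rootfs /bin/sh
```

Inside the resulting shell, echo $$ prints 1 and the only network interface is a down lo, which is about as direct a demonstration of "a process with extras" as you can get.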
"My container exited immediately." The entrypoint (PID 1) terminated. PID 1 has special semantics — when it dies, the kernel terminates the PID namespace. Common cause: a bash entrypoint that ran a command and then exited.
"My signals aren't being handled." PID 1 in Linux has reduced signal-handling defaults. Many runtimes don't pass through SIGTERM / SIGINT correctly. Use a proper init like tini as PID 1, or write your app to handle it.
"Why is /proc/1/status showing weird limits?" /proc inside the container shows the container's view, but some files (like /proc/cpuinfo) reflect the host's CPU info, not the cgroup limits. This confuses tools like the JVM that read /proc/cpuinfo to size thread pools.
"My memory usage looks higher than I set the limit." Linux cgroup memory accounting includes page cache and other non-RSS memory. The "limit" is for total memory; the "RSS" you see in top is just the process's resident set, not the full cgroup usage.
"The container can see processes from another container." Should never happen with proper PID namespace setup. If it does, the namespace setup is broken (e.g., --pid=host was passed, sharing the host's PID namespace).
"DNS works on the host but not in the container." Different network namespaces; different /etc/resolv.conf. The container's resolv.conf is set by the runtime — usually pointing to a runtime-provided resolver.
If a container is just a process with extras, then:
strace, lsof, gdb etc. can attach to container processes from the host (you need to be root and may need to enter the container's namespaces with nsenter).
A container can be built by hand: unshare, cgroup configuration, and an exec are enough. The runtime is convenience.
Things shared between containers and host: above all, the kernel.
This is the trade-off versus VMs: VMs isolate at the hardware level; containers isolate at the syscall level. Containers are lighter; VMs are more isolated.
Containers are a security boundary, but a thinner one than VMs:
Permissive settings (--privileged, --cap-add ALL, mounted /var/run/docker.sock) defeat the security entirely.
Reasonable hardening: no CAP_SYS_ADMIN, non-root user.
For multi-tenant scenarios, the security boundary often needs to be stronger than a normal container. Options: gVisor (an additional userspace kernel), Kata Containers (lightweight VMs that look like containers), Firecracker (microVMs from AWS).
Read the manpages: namespaces(7), cgroups(7), unshare(1). The kernel docs are clearer than most blog posts.
Try unshare and nsenter directly. Spawning a process in a new PID namespace with one command makes the abstraction concrete.
Look inside /proc and /sys/fs/cgroup. The kernel exposes everything; reading these files demystifies what runtimes are doing.
When something is weird, check the namespaces and cgroups. "Why is this happening" usually has an answer in /proc/<pid>/ns/ or the cgroup files.
Don't think of containers as VMs. They're processes with extras. The mental model of "a small Linux box" leads to wrong intuitions about security, isolation, and behavior.
Containers are one of the great kernel-features-as-platform stories. They didn't add new fundamental capabilities to Linux; they composed existing ones (namespaces, cgroups, OverlayFS) into something useful. Understanding the layers under the runtime is the difference between treating containers as magic and treating them as engineering.