Containers Are a Lie — The Kernel Truth Behind cgroups and Namespaces

Introduction

Let me defend the provocative title first. There is no "container" object in the kernel source code. Nothing like struct container exists. What we call a container is merely the result of layering namespaces (what can be seen), cgroups (how much can be used), and security mechanisms (what can be done) onto a process. Docker, containerd, runc — all of them are ultimately orchestrators that combine these kernel features.

Why does this matter? If your mental model is "a container is a lightweight VM," you cannot understand why container escape vulnerabilities exist, why all containers share one kernel, or why OOM Kill behaves the way it does. With the accurate model — a combination of kernel features — you can peel away the abstraction during an incident and reach the truth by reading /sys/fs/cgroup and /proc directly.

In this post we build that combination by hand. We try each namespace with unshare, manipulate cgroup files directly, and finally build a mini container in bash and Go.

Container = A Combination of Kernel Features

First, the overall picture.

        The illusion called "container"
  +--------------------------------------+
  |  Actually just a Linux process       |
  |                                      |
  |  + namespaces (isolation: what is    |
  |    visible) pid, net, mnt, uts, ipc, |
  |    user, cgroup, (time)              |
  |                                      |
  |  + cgroups (limits: how much can     |
  |    be used) cpu, memory, io, pids... |
  |                                      |
  |  + security layers (privilege: what  |
  |    can be done) capabilities,        |
  |    seccomp, LSM (SELinux/AppArmor)   |
  |                                      |
  |  + root filesystem (overlayfs image) |
  +--------------------------------------+
            |
            v
     Everyone shares one host kernel
     (the fundamental difference from VMs)

The core thesis: a process inside a container is an ordinary process visible with ps on the host. Only its view, its limits, and its privileges have been manipulated.

A Tour of the Seven Namespaces — Hands-on with unshare

Namespaces partition the view of kernel resources. There are eight stable types today (including time), but seven form the core of containers. Let us try each with the unshare command.

| Namespace | What it isolates | Role in containers |
|-----------|------------------|--------------------|
| pid | process ID space | PIDs start from 1 inside the container |
| net | the entire network stack | per-container interfaces/routes/firewall |
| mnt | mount points | per-container filesystem tree |
| uts | hostname, domain name | per-container hostname |
| ipc | System V IPC, POSIX message queues | shared memory isolation |
| user | UID/GID mappings | the key to rootless containers |
| cgroup | cgroup root view | hides where your cgroup actually is |

One by one:

# 1) uts: the simplest namespace; hostname isolation
sudo unshare --uts bash
hostname container-test   # the host hostname stays unchanged
hostname                  # container-test
exit

# 2) pid: PID space isolation; --fork and --mount-proc are key
sudo unshare --pid --fork --mount-proc bash
echo $$                   # 1  <- this shell is PID 1
ps aux                    # only 2-3 processes visible
exit

# 3) net: you get an empty network stack
sudo unshare --net bash
ip link                   # only lo, and it is DOWN
exit

# 4) mnt: mount tree isolation
sudo unshare --mount bash
mount -t tmpfs tmpfs /mnt # a mount invisible to the host
findmnt /mnt
exit

# 5) ipc: shared memory isolation
ipcmk -M 1024             # create shared memory on the host
sudo unshare --ipc bash
ipcs -m                   # not visible (isolated)
exit

# 6) user: a non-root user becomes "root only inside the namespace"
unshare --user --map-root-user bash   # no sudo needed!
id                        # uid=0(root) ... but
cat /proc/self/uid_map    # see the mapping: 0 <yourUID> 1
exit

# 7) cgroup: isolates the view of cgroup paths
sudo unshare --cgroup bash
cat /proc/self/cgroup     # looks like the root (/)
exit

Combine them all and you already have the skeleton of a container.

unshare --user --map-root-user --pid --fork --mount-proc \
        --net --uts --ipc --cgroup bash

User Namespaces and Rootless Containers

The user namespace is the most important piece from a security standpoint. The key is UID mapping. If UID 0 (root) inside the namespace maps to an unprivileged host UID (say 100000), a process can act like root inside the container while being an ordinary user from the host point of view.

   View inside the container       The truth on the host
  +----------------+          +-----------------------+
  | uid 0 (root)   |  ----->  | uid 100000 (no power) |
  | uid 1 (daemon) |  ----->  | uid 100001            |
  | ...            |          | ...                   |
  | uid 65535      |  ----->  | uid 165535            |
  +----------------+          +-----------------------+

  /etc/subuid and /etc/subgid define the mappable ranges;
  the newuidmap/newgidmap helpers write the mappings.

# Check mapping ranges on the host
cat /etc/subuid    # e.g. youngju:100000:65536
cat /etc/subgid

# Check the mapping of a running process (substitute a real PID)
cat /proc/12345/uid_map

Thanks to this mapping, Podman (which has no daemon) and Docker's rootless mode run the entire runtime, including the Docker daemon, as an ordinary user. If an attacker escapes the container, all they obtain is an unprivileged host UID, which dramatically shrinks the blast radius. Kubernetes user namespace support (hostUsers: false in the pod spec) has been beta and enabled by default since 1.33 and is moving toward stabilization.
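To make the mapping concrete, here is a minimal Go sketch that translates an in-container UID to the ID seen outside, given the text of a uid_map file. The names idMap, parseIDMap, and toHost are illustrative, not part of any library, and the sample line assumes the 100000:65536 range from the diagram above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMap is one line of /proc/<pid>/uid_map: count IDs starting at inside
// map to count IDs starting at outside (the parent namespace, here the host).
type idMap struct {
	inside, outside, count int64
}

// parseIDMap parses the text of a uid_map or gid_map file.
func parseIDMap(text string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(text, "\n") {
		f := strings.Fields(line)
		if len(f) == 0 {
			continue // skip blank lines
		}
		if len(f) != 3 {
			return nil, fmt.Errorf("malformed map line %q", line)
		}
		var n [3]int64
		for i, s := range f {
			v, err := strconv.ParseInt(s, 10, 64)
			if err != nil {
				return nil, err
			}
			n[i] = v
		}
		maps = append(maps, idMap{n[0], n[1], n[2]})
	}
	return maps, nil
}

// toHost translates an in-namespace ID; ok is false when the ID is unmapped.
func toHost(maps []idMap, id int64) (host int64, ok bool) {
	for _, m := range maps {
		if id >= m.inside && id < m.inside+m.count {
			return m.outside + (id - m.inside), true
		}
	}
	return 0, false
}

func main() {
	maps, err := parseIDMap("0 100000 65536\n")
	if err != nil {
		panic(err)
	}
	for _, id := range []int64{0, 1, 65535, 65536} {
		host, ok := toHost(maps, id)
		fmt.Println(id, host, ok)
	}
}
```

An ID outside every range (65536 in this sample) has no host counterpart at all, which is why files owned by unmapped IDs show up as the overflow user "nobody" inside a user namespace.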

cgroup v2 — The Unified Hierarchy and Its File Interface

If namespaces are the view, cgroups are the limits. cgroup v2 unified the per-controller hierarchies of v1 into a single tree and is now the default on major distributions and in Kubernetes.

   /sys/fs/cgroup            (root, cgroup2 mount)
   |-- cgroup.controllers     available controllers
   |-- cgroup.subtree_control controllers delegated to children
   |-- system.slice/          systemd services
   |-- user.slice/            login sessions
   +-- kubepods.slice/        Kubernetes pods
       +-- kubepods-burstable.slice/
           +-- kubepods-burstable-pod<hash>.slice/
               +-- cri-containerd-<hash>.scope/
                   |-- cpu.max
                   |-- memory.max
                   |-- io.max
                   +-- pids.max

The beauty of cgroups is that every control is a file read/write. Let us manipulate them directly.

# Create a new cgroup and enable controllers
sudo mkdir /sys/fs/cgroup/lab
echo "+cpu +memory +io +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# cpu: 20ms per 100ms period = 0.2 CPU
echo "20000 100000" | sudo tee /sys/fs/cgroup/lab/cpu.max

# memory: soft pressure line 192MiB, hard limit 256MiB
echo "201326592" | sudo tee /sys/fs/cgroup/lab/memory.high
echo "268435456" | sudo tee /sys/fs/cgroup/lab/memory.max

# pids: fork bomb protection
echo 128 | sudo tee /sys/fs/cgroup/lab/pids.max

# Put the current shell in and observe
echo $$ | sudo tee /sys/fs/cgroup/lab/cgroup.procs
cat /sys/fs/cgroup/lab/memory.current   # current usage
cat /sys/fs/cgroup/lab/cpu.stat         # usage, nr_throttled
cat /sys/fs/cgroup/lab/memory.events    # low/high/max/oom/oom_kill counts
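
The quota/period arithmetic behind cpu.max is easy to get wrong when reading it back, so here is a small Go sketch that converts the file's content into a CPU count. The function name cpuLimit is an assumption for illustration; the only format assumed is the "<quota> <period>" or "max <period>" pair that the kernel reports.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuLimit converts the content of cpu.max ("<quota> <period>" in
// microseconds, or "max <period>") into a number of CPUs.
// limited is false when the quota is "max".
func cpuLimit(cpuMax string) (cpus float64, limited bool, err error) {
	f := strings.Fields(cpuMax)
	if len(f) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content %q", cpuMax)
	}
	period, perr := strconv.ParseInt(f[1], 10, 64)
	if perr != nil || period <= 0 {
		return 0, false, fmt.Errorf("bad period %q", f[1])
	}
	if f[0] == "max" {
		return 0, false, nil
	}
	quota, qerr := strconv.ParseInt(f[0], 10, 64)
	if qerr != nil || quota < 0 {
		return 0, false, fmt.Errorf("bad quota %q", f[0])
	}
	return float64(quota) / float64(period), true, nil
}

func main() {
	for _, s := range []string{"20000 100000", "50000 100000", "max 100000"} {
		cpus, limited, err := cpuLimit(s)
		fmt.Println(s, "->", cpus, limited, err)
	}
}
```

The same division explains a Kubernetes limit of 500m: it ends up as a quota that is half of the period.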

memory.high vs memory.max — The Two Faces of OOM

The difference between these two is operationally decisive.

| Aspect | memory.high | memory.max |
|--------|-------------|------------|
| Behavior when exceeded | reclaim pressure + forced allocation slowdown | reclaim attempt, then OOM Kill on failure |
| Process survival | does not die (slows down) | can be killed instantly via oom_kill |
| Purpose | gradual pressure, early warning | the last line of defense |
| Observability | the high count in memory.events | the oom_kill count in memory.events |

A process that exceeds only memory.high does not die; instead its allocations slow down and it loses time to reclaim. If "the pod does not die but suddenly got slow," check the high count in memory.events and PSI (pressure stall information).

# The honest indicator of memory pressure: PSI
cat /sys/fs/cgroup/lab/memory.pressure
# some avg10=12.34 ... some tasks spent time waiting on memory
# full avg10=3.21  ... all tasks stalled simultaneously (serious)
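
For collecting PSI as a metric, the file has to be parsed rather than eyeballed. The following Go sketch assumes the usual line layout (a kind followed by avg10, avg60, avg300, and total fields); the type psi and the function parsePSI are hypothetical names, and the sample values are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// psi holds one line of a *.pressure file.
type psi struct {
	Avg10, Avg60, Avg300 float64 // percent of wall time stalled
	TotalUsec            uint64  // cumulative stall time in microseconds
}

// parsePSI returns the parsed lines keyed by kind ("some" or "full").
func parsePSI(text string) (map[string]psi, error) {
	out := make(map[string]psi)
	for _, line := range strings.Split(text, "\n") {
		f := strings.Fields(line)
		if len(f) == 0 {
			continue
		}
		var p psi
		for _, kv := range f[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				return nil, fmt.Errorf("malformed field %q", kv)
			}
			var err error
			switch key {
			case "avg10":
				p.Avg10, err = strconv.ParseFloat(val, 64)
			case "avg60":
				p.Avg60, err = strconv.ParseFloat(val, 64)
			case "avg300":
				p.Avg300, err = strconv.ParseFloat(val, 64)
			case "total":
				p.TotalUsec, err = strconv.ParseUint(val, 10, 64)
			}
			if err != nil {
				return nil, err
			}
		}
		out[f[0]] = p
	}
	return out, nil
}

func main() {
	sample := "some avg10=12.34 avg60=5.00 avg300=1.00 total=123456\n" +
		"full avg10=3.21 avg60=1.00 avg300=0.50 total=6543\n"
	p, err := parsePSI(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("some avg10=%.2f full avg10=%.2f\n", p["some"].Avg10, p["full"].Avg10)
}
```

The averages are convenient for dashboards, while the total counter is the better input for alerting because a delta between two reads does not depend on the kernel's averaging window.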

Kubernetes memory limits are implemented as memory.max. This is exactly why a container exceeding its limit restarts as OOMKilled.
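The two failure modes can be told apart mechanically from memory.events. Below is a rough Go sketch; parseFlatKeyed and verdict are illustrative names, and the thresholds are deliberately naive (any nonzero counter). The counters are cumulative since the cgroup was created, so a real collector would compare deltas between reads.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFlatKeyed parses cgroup v2 "flat keyed" files such as memory.events,
// where every line is "<key> <value>".
func parseFlatKeyed(text string) (map[string]uint64, error) {
	out := make(map[string]uint64)
	for _, line := range strings.Split(text, "\n") {
		f := strings.Fields(line)
		if len(f) == 0 {
			continue
		}
		if len(f) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		v, err := strconv.ParseUint(f[1], 10, 64)
		if err != nil {
			return nil, err
		}
		out[f[0]] = v
	}
	return out, nil
}

// verdict maps the counters onto the two faces described above.
func verdict(events map[string]uint64) string {
	switch {
	case events["oom_kill"] > 0:
		return "oom-killed" // memory.max was hit and reclaim failed
	case events["high"] > 0:
		return "throttled" // memory.high was exceeded; alive but slowed
	default:
		return "ok"
	}
}

func main() {
	ev, err := parseFlatKeyed("low 0\nhigh 12\nmax 0\noom 0\noom_kill 0\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(verdict(ev))
}
```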

Building a Mini Container by Hand

Now we combine the pieces and build a container ourselves. First the bash version.

#!/bin/bash
# mini-container.sh — combining unshare + pivot_root + cgroup
# Prerequisite: a root filesystem (e.g. an extracted alpine mini rootfs)
set -euo pipefail

ROOTFS=/opt/rootfs-alpine
CG=/sys/fs/cgroup/minic

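# 1) Prepare the cgroup: 0.5 CPU, 256MiB, 64 processes (run as root).
#    The controllers must be enabled for children of the root cgroup;
#    systemd hosts usually have them on already, so a failure is ignored.
mkdir -p "$CG"
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control || true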
echo "50000 100000" > "$CG/cpu.max"
echo "268435456"    > "$CG/memory.max"
echo 64             > "$CG/pids.max"

# 2) Create namespaces and run the setup script inside them
exec unshare --pid --fork --mount --uts --ipc --net bash -c '
  set -euo pipefail
  ROOTFS=/opt/rootfs-alpine

  # 2-1) Register ourselves in the cgroup
  echo $$ > /sys/fs/cgroup/minic/cgroup.procs

  # 2-2) Hostname
  hostname minic

  # 2-3) Block mount propagation (avoid polluting the host)
  mount --make-rprivate /

  # 2-4) Prepare pivot_root: new_root must be a mount point
  mount --bind "$ROOTFS" "$ROOTFS"
  cd "$ROOTFS"
  mkdir -p old_root

  # 2-5) Swap the root. Unlike chroot, escaping is hard
  pivot_root . old_root

  # 2-6) Essential virtual filesystems relative to the new root
  mount -t proc  proc  /proc
  mount -t tmpfs tmpfs /tmp

  # 2-7) Cut off access to the old root
  umount -l /old_root
  rmdir /old_root

  # 2-8) Run the container init
  exec /bin/sh
'

Three key points. First, we use pivot_root, not chroot. chroot only changes the view; pivot_root swaps the actual root of the mount namespace and detaches the old root, making escape much harder. Second, without mount --make-rprivate to stop mount event propagation, an umount inside the container could propagate to the host. Third, proc must be remounted inside the pid namespace for ps to show the isolated view.

The same thing in Go becomes a miniature runc.

// minic.go — a mini container in Go (proof of concept)
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
)

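func main() {
	// Guard against a missing subcommand or command before indexing os.Args.
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: minic run <cmd> [args...]")
		os.Exit(2)
	}
	switch os.Args[1] {
	case "run": // parent: re-exec ourselves with new namespaces
		parent()
	case "child": // child: set things up inside the new namespaces
		child()
	default:
		fmt.Fprintln(os.Stderr, "usage: minic run <cmd> [args...]")
		os.Exit(2)
	}
}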

func parent() {
	cmd := exec.Command("/proc/self/exe",
		append([]string{"child"}, os.Args[2:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC |
			syscall.CLONE_NEWNET,
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
}

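// must aborts on the first failed setup step: a half-configured
// container is worse than none.
func must(step string, err error) {
	if err != nil {
		fmt.Fprintf(os.Stderr, "minic: %s: %v\n", step, err)
		os.Exit(1)
	}
}

func child() {
	rootfs := "/opt/rootfs-alpine"

	// cgroup registration (the v2 file interface, plain and simple).
	// Assumes the memory and pids controllers are enabled in
	// /sys/fs/cgroup/cgroup.subtree_control.
	cg := "/sys/fs/cgroup/minic"
	must("mkdir cgroup", os.MkdirAll(cg, 0755))
	must("memory.max", os.WriteFile(filepath.Join(cg, "memory.max"),
		[]byte("268435456"), 0644))
	must("pids.max", os.WriteFile(filepath.Join(cg, "pids.max"),
		[]byte("64"), 0644))
	must("cgroup.procs", os.WriteFile(filepath.Join(cg, "cgroup.procs"),
		[]byte(fmt.Sprint(os.Getpid())), 0644))

	must("sethostname", syscall.Sethostname([]byte("minic")))

	// Block mount propagation + pivot_root
	must("make-rprivate",
		syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""))
	must("bind rootfs", syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND, ""))
	must("mkdir old_root", os.MkdirAll(rootfs+"/old_root", 0700))
	must("pivot_root", syscall.PivotRoot(rootfs, rootfs+"/old_root"))
	must("chdir /", os.Chdir("/"))
	must("mount proc", syscall.Mount("proc", "/proc", "proc", 0, ""))
	must("umount old_root", syscall.Unmount("/old_root", syscall.MNT_DETACH))
	must("rmdir old_root", os.Remove("/old_root"))

	// Execute the user-supplied command
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	must("run command", cmd.Run())
}

go build -o minic minic.go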
sudo ./minic run /bin/sh
# inside: hostname -> minic, ps aux -> from PID 1, ip link -> only lo

Roughly 70 lines reproduce the core isolation of Docker. What separates this from a production runtime starts here — image management, network wiring, and the security layers.

overlayfs — How Image Layers Work

The layer structure of container images is implemented with overlayfs.

   The filesystem the container sees (merged)
  +------------------------------------+
  |  /bin/sh   /etc/app.conf   /tmp/x  |
  +------------------------------------+
      ^ a single composed view
      |
  upperdir (writable, the container layer)   <- only changes recorded
  +------------------------------------+
  |  /etc/app.conf (modified)  /tmp/x  |
  +------------------------------------+
  lowerdir (read-only, the image layers)     <- several can be stacked
  +------------------------------------+
  |  layer3: app binaries              |
  |  layer2: runtime/libraries         |
  |  layer1: base OS                   |
  +------------------------------------+

  - read:   search top-down; the uppermost file wins
  - write:  modifying a lowerdir file copies it up to upperdir first
            (copy-up)
  - delete: a whiteout marker is created in upperdir
            (nothing is actually deleted)

# Hands-on with overlayfs
mkdir -p /tmp/ov/lower /tmp/ov/upper /tmp/ov/work /tmp/ov/merged
echo "from-image" > /tmp/ov/lower/base.txt
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/ov/lower,upperdir=/tmp/ov/upper,workdir=/tmp/ov/work \
  /tmp/ov/merged

echo "modified" > /tmp/ov/merged/base.txt   # triggers copy-up
cat /tmp/ov/lower/base.txt                  # from-image (original kept)
cat /tmp/ov/upper/base.txt                  # modified (the delta)

A hundred containers using the same image share the lowerdir and each carries only a thin upperdir, making container creation fast and disk usage low. The flip side is copy-up cost: workloads that modify large files in the container layer (e.g. database data) must use volumes.
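The read, write, and delete rules from the diagram can be captured in a toy model. The Go sketch below is not how the kernel stores anything (real whiteouts are character devices with device number 0/0, and a real copy-up copies the whole file first); the overlay and entry types are invented purely to show why the image layer never changes.

```go
package main

import "fmt"

// entry is what the writable upper layer stores for a path.
type entry struct {
	data     string
	whiteout bool // true means "this path was deleted in the container"
}

// overlay is a toy model: one writable upper layer over read-only lowers.
type overlay struct {
	upper  map[string]entry
	lowers []map[string]string // topmost image layer first
}

// read searches top-down; a whiteout in upper hides every lower copy.
func (o *overlay) read(path string) (string, bool) {
	if e, ok := o.upper[path]; ok {
		if e.whiteout {
			return "", false
		}
		return e.data, true
	}
	for _, layer := range o.lowers {
		if data, ok := layer[path]; ok {
			return data, true
		}
	}
	return "", false
}

// write never touches a lower layer; the new content lands in upper.
func (o *overlay) write(path, data string) {
	o.upper[path] = entry{data: data}
}

// remove drops the upper copy and records a whiteout if a lower layer
// still contains the path.
func (o *overlay) remove(path string) {
	delete(o.upper, path)
	for _, layer := range o.lowers {
		if _, ok := layer[path]; ok {
			o.upper[path] = entry{whiteout: true}
			return
		}
	}
}

func main() {
	image := map[string]string{"/base.txt": "from-image"}
	o := &overlay{upper: map[string]entry{}, lowers: []map[string]string{image}}

	o.write("/base.txt", "modified")
	got, _ := o.read("/base.txt")
	fmt.Println(got, image["/base.txt"]) // modified from-image

	o.remove("/base.txt")
	_, ok := o.read("/base.txt")
	fmt.Println(ok, image["/base.txt"]) // false from-image
}
```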

capabilities — Splitting root Apart

Traditional Unix privilege was a dichotomy: root (omnipotent) or ordinary user (powerless). Capabilities split root privileges into about 40 pieces so only the needed ones are granted.

# Inspect capabilities of the current process
capsh --print
grep Cap /proc/self/status   # CapEff is the effective capability bitmask
capsh --decode=00000000a80425fb   # decode the bitmask

# Grant a capability to a file (the alternative to setuid root)
sudo setcap cap_net_bind_service=+ep ./myserver  # allow binding ports below 1024 (e.g. 80) without root

Container runtimes start with capabilities heavily trimmed. The default set of Docker/containerd contains around a dozen — CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID, NET_BIND_SERVICE, KILL, and so on — excluding dangerous ones like SYS_ADMIN (effectively root-grade), NET_ADMIN, and SYS_PTRACE.
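What capsh --decode does is a plain bit test against the capability numbers. As a sketch, the Go program below decodes a CapEff-style hex mask; the name table follows the bit order of the kernel's capability.h up to CHECKPOINT_RESTORE (bit 40), and decodeCaps is a hypothetical helper rather than an existing API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// capNames is indexed by bit number (CAP_ prefix omitted).
var capNames = []string{
	"CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
	"KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
	"NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
	"IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE",
	"SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE",
	"SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE",
	"AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG",
	"WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF",
	"CHECKPOINT_RESTORE",
}

// decodeCaps turns a hex bitmask such as the CapEff value in
// /proc/<pid>/status into capability names.
func decodeCaps(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(strings.TrimSpace(hexMask), 16, 64)
	if err != nil {
		return nil, err
	}
	var names []string
	for bit := 0; bit < 64; bit++ {
		if mask&(uint64(1)<<uint(bit)) == 0 {
			continue
		}
		if bit < len(capNames) {
			names = append(names, capNames[bit])
		} else {
			names = append(names, fmt.Sprintf("unknown(%d)", bit))
		}
	}
	return names, nil
}

func main() {
	names, err := decodeCaps("00000000a80425fb")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(names), strings.Join(names, ","))
}
```

For the mask used in the capsh example above, this yields 14 names, with SYS_ADMIN, NET_ADMIN, and SYS_PTRACE absent.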

# Kubernetes: drop everything, add back only what is needed
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
  allowPrivilegeEscalation: false

privileged: true is the switch that disables all of these protections at once: capability trimming, seccomp, and device isolation. Reaching for privileged because "it did not work otherwise" is a decision to discard the entire container security model.

seccomp — Filtering System Calls

seccomp restricts which system calls a process can invoke, via a BPF filter. Linux has over 400 syscalls, but a typical application uses a fraction. Default profiles block dangerous calls such as keyctl, add_key, kexec_load, and open_by_handle_at (a call used in a real container escape).

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat",
                 "mmap", "brk", "exit_group", "futex", "epoll_wait"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

# seccomp state of a process: 0 none, 1 strict, 2 filter
grep Seccomp /proc/self/status

# Apply the runtime default profile in Kubernetes
# securityContext.seccompProfile.type: RuntimeDefault

The practical baseline: apply RuntimeDefault to all workloads, and when a blocked call (EPERM) is suspected, confirm with strace and then allow it narrowly through a custom profile.
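Before shipping a custom profile it helps to check offline what it would do with a given syscall. The Go sketch below loads a profile shaped like the JSON above and answers that question by name only. It is a simplification: real profiles can also carry argument conditions and capability-based includes/excludes, which this ignores, and the profile, rule, and actionFor names are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rule and profile mirror only the fields used in the sample profile.
type rule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

type profile struct {
	DefaultAction string `json:"defaultAction"`
	Syscalls      []rule `json:"syscalls"`
}

const sampleProfile = `{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat",
                "mmap", "brk", "exit_group", "futex", "epoll_wait"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}`

// parseProfile decodes the JSON; unknown fields are ignored.
func parseProfile(data []byte) (profile, error) {
	var p profile
	err := json.Unmarshal(data, &p)
	return p, err
}

// actionFor returns the action of the first rule naming the syscall,
// or the default action when no rule matches.
func actionFor(p profile, syscallName string) string {
	for _, r := range p.Syscalls {
		for _, n := range r.Names {
			if n == syscallName {
				return r.Action
			}
		}
	}
	return p.DefaultAction
}

func main() {
	p, err := parseProfile([]byte(sampleProfile))
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"read", "keyctl"} {
		fmt.Println(name, actionFor(p, name))
	}
}
```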

What Does the Runtime Actually Do — The Job of runc

The division of labor across the runtime layers should now be clear.

  kubelet
    | CRI (gRPC)
    v
  containerd          image pull/storage, snapshots (overlayfs),
    |                  container lifecycle
    | OCI spec (config.json)
    v
  runc                the executor of everything in this post:
                      create namespaces, configure cgroups, pivot_root,
                      apply capabilities, load seccomp, exec the process
                      -- then it exits (it is not a daemon)

runc takes the config.json of the OCI runtime spec (the list of namespaces, cgroup limits, capabilities, seccomp, mounts, rootfs path) and performs precisely and safely what we did by hand above. It shares its essence with our mini container.

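# Peek at the OCI config of a running container
# (the glob is expanded by the root shell, because the directory is
# typically not readable by ordinary users)
sudo sh -c 'cat /run/containerd/io.containerd.runtime.v2.task/k8s.io/*/config.json' \
  | jq '.linux.namespaces, .process.capabilities.effective' | head -30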

The Kubernetes Connection — QoS Classes and cgroup Drivers

How Kubernetes abstractions land on the cgroup tree:

| QoS class | Condition | Meaning on the cgroup tree |
|-----------|-----------|----------------------------|
| Guaranteed | every container has CPU and memory requests equal to limits | fixed cpu.max, lowest OOM score (max protection) |
| Burstable | at least one request or limit set, but not Guaranteed | proportional cpu.weight, medium protection |
| BestEffort | no requests/limits | minimal weight, first OOM victim |

Under memory pressure, the kernel OOM killer picks the process with the highest badness score, which is the process's memory footprint shifted by its oom_score_adj, and the kubelet sets oom_score_adj according to QoS. BestEffort dying first is not a policy decision made somewhere at kill time; it is the direct consequence of cgroups and OOM scores.
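The conditions in the table reduce to a short decision procedure. Here is a simplified Go sketch of it, looking only at CPU and memory per container; the container type and qosClass function are illustrative, an unset value is modeled as 0, and an unset request next to a set limit is treated as equal to the limit (which is how Kubernetes defaults it).

```go
package main

import "fmt"

// container holds the resource settings that matter for QoS; 0 means unset.
type container struct {
	CPURequest, CPULimit int64 // millicores
	MemRequest, MemLimit int64 // bytes
}

// qosClass classifies a pod from its containers.
func qosClass(cs []container) string {
	anySet := false
	guaranteed := true
	for _, c := range cs {
		if c.CPURequest != 0 || c.CPULimit != 0 || c.MemRequest != 0 || c.MemLimit != 0 {
			anySet = true
		}
		// Guaranteed needs a limit for both resources, with requests
		// either unset (defaulted to the limit) or equal to it.
		cpuOK := c.CPULimit != 0 && (c.CPURequest == 0 || c.CPURequest == c.CPULimit)
		memOK := c.MemLimit != 0 && (c.MemRequest == 0 || c.MemRequest == c.MemLimit)
		if !cpuOK || !memOK {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	const mib = 1 << 20
	fmt.Println(qosClass([]container{{500, 500, 256 * mib, 256 * mib}}))
	fmt.Println(qosClass([]container{{250, 500, 128 * mib, 256 * mib}}))
	fmt.Println(qosClass([]container{{}}))
}
```

Note that a single sidecar without limits is enough to push an otherwise Guaranteed pod into Burstable.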

For the cgroup driver, systemd (not cgroupfs) is the standard, because having two managers of the cgroup tree (systemd and kubelet) leads to conflicts. Features premised on cgroup v2 keep growing — memory QoS based on memory.high (the MemoryQoS feature gate), swap control, PSI-based eviction signals — so confirming the node runs v2 is table stakes.

stat -fc %T /sys/fs/cgroup    # cgroup2fs means v2

The Debugging Toolbox

Tools for peeling the abstraction and looking directly.

# 1) lsns: list all namespaces on the host
lsns                       # namespaces by type with a representative PID
lsns -t net                # only net namespaces

# 2) nsenter: enter the namespaces of a given process (run as root)
PID=$(pgrep -f my-app | head -1)
nsenter -t "$PID" -n ss -tlnp      # view sockets in that container net ns
nsenter -t "$PID" -m -p ps aux     # view processes in mnt+pid ns
nsenter -t "$PID" -a bash          # enter everything (effectively exec)
# With -n alone the binary (ss) comes from the host, so even distroless
# containers without debug tools can be inspected. With -m or -a the
# binary is looked up inside the container filesystem, so ps and bash
# must exist in the image.

# 3) Compare namespace IDs directly via /proc
ls -l /proc/$$/ns/         # inode numbers of each ns
ls -l /proc/"$PID"/ns/     # same inode = same namespace

# 4) Trace the cgroup path: process -> cgroup -> limits/usage
cat /proc/"$PID"/cgroup    # find the cgroup path
CG=/sys/fs/cgroup$(grep '^0::' /proc/"$PID"/cgroup | cut -d: -f3-)   # the v2 entry
cat "$CG/memory.current" "$CG/memory.max" "$CG/memory.events"
grep -E "usage|throttled" "$CG/cpu.stat"
cat "$CG/memory.pressure" "$CG/cpu.pressure"   # PSI

# 5) Post-mortem of an OOM Kill
dmesg -T | grep -i -A5 "killed process"
journalctl -k | grep -i oom

In particular, the incident of "the pod was OOMKilled but the graph showed memory below limits" is almost always either the difference between graph resolution (usually 15s or more) and an instantaneous spike, or the difference between memory.current (which includes page cache) and an RSS-only graph. The oom_kill counter in memory.events and dmesg tell the truth.
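Step 4 of the toolbox, going from a process to its cgroup directory, is also the first thing a metrics collector has to do. A minimal Go sketch of that step follows. It assumes cgroup2 is mounted at /sys/fs/cgroup and that the reader sees the path from the host's cgroup namespace; cgroupV2Dir is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// cgroupV2Dir takes the content of /proc/<pid>/cgroup and returns the
// directory that holds the process's memory.events, cpu.stat, and so on.
func cgroupV2Dir(procCgroup string) (string, error) {
	for _, line := range strings.Split(procCgroup, "\n") {
		// Each line is "hierarchy-ID:controller-list:path";
		// the v2 entry always starts with "0::".
		if strings.HasPrefix(line, "0::") {
			rel := strings.TrimPrefix(line, "0::")
			return path.Join("/sys/fs/cgroup", rel), nil
		}
	}
	return "", errors.New("no cgroup v2 entry found (v1-only host?)")
}

func main() {
	sample := "0::/kubepods.slice/kubepods-burstable.slice\n"
	dir, err := cgroupV2Dir(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(path.Join(dir, "memory.events"))
}
```

On a hybrid host the file also contains v1 lines such as "12:memory:/...", which the prefix check skips.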

Pitfalls and Anti-patterns

  • Splashing privileged: true around for debugging convenience. It disables the entire isolation model; add only the capabilities you need instead.
  • Thinking of containers as VMs and forgetting the shared kernel. A kernel panic or exploit in one container is a node-wide problem.
  • Writing database data into the container layer (overlayfs upperdir). Wrong on both copy-up cost and volatility.
  • Running important workloads as BestEffort without memory limits. They are the first victims under pressure.
  • Forgetting the PID 1 problem. The first process of a container has different default signal behavior and is responsible for reaping zombies. Use a lightweight init like tini or the exec pattern in shells.
  • Operating v2 nodes with cgroup v1-era knowledge (file names like memory.limit_in_bytes). The interface is different.
  • Installing debug tools inside containers instead of using nsenter. It breaks image immutability and reproducibility.

Operations Checklist

  • Did you confirm the node runs cgroup v2 and the cgroup driver is systemd?
  • Is seccomp RuntimeDefault applied to all workloads by default?
  • Do you drop ALL capabilities and add back only what is needed?
  • Is the reason for every privileged container documented and reviewed periodically?
  • Have you evaluated rootless/user namespace adoption?
  • Are memory.events (oom_kill, high) and PSI collected as node/pod metrics?
  • Is the QoS class design (what must be Guaranteed) intentional?
  • Do write-heavy data paths use volumes instead of overlayfs?
  • Is PID 1 signal/zombie handling solved in every image?
  • Are lsns/nsenter-based debugging procedures in the runbook?

Closing

The title "containers are a lie" is praise, not an accusation. Namespaces, cgroups, capabilities, seccomp, overlayfs — kernel features that each evolved independently combine to create the illusion of a small isolated machine. As with every good abstraction, enjoy the illusion in normal times. But at the moment of an outage or a security incident, the person who can see through the illusion and read /sys/fs/cgroup and the ns links under /proc directly is the one who solves the problem. May the hands-on exercises in this post be the starting point of that x-ray vision.

References