Container & Docker Internals Deep Dive — Namespaces, cgroups, OverlayFS, seccomp, Capabilities, and Kubernetes (2025)

0. "A container is not a VM"

Many engineers think of containers as "lightweight VMs". They are not:

Aspect              | VM                  | Container
--------------------|---------------------|------------------
Isolation unit      | Hardware            | OS namespace
Kernel              | Per-VM              | Shared with host
Boot time           | Seconds to minutes  | Tens of ms
Memory overhead     | Hundreds of MB      | A few MB
Disk overhead       | GB                  | MB
Isolation strength  | Strong              | Weaker

Container = a process plus kernel features (namespace, cgroup, rootfs). There is no separate OS inside.

This article unpacks what docker run nginx actually does in the Linux kernel, and the decade-plus of engineering history behind it.

1. A short history of containers

1.1 1979 — chroot

Unix V7 introduced the chroot() syscall: "change the filesystem root for this process" so it sees only a restricted directory tree. Not enough on its own: processes, the network, and users remained visible, and escape tricks (chdir, mount) were well known.

1.2 2000 — FreeBSD Jails

chroot extended to isolate network, processes, and users. The real beginning of "lightweight virtualization".

1.3 2004 — Solaris Zones

A fully isolated container concept — the first commercially successful one.

1.4 2006 — cgroups, 2008 — LXC

Google engineers built "process containers" for internal resource isolation, which became Linux cgroups. Namespaces were added in stages. LXC (Linux Containers) combined these into the first practical container runtime.

1.5 2013 — Docker

Solomon Hykes's dotCloud wrapped LXC in a friendly API — the magic of docker run. Killer features: image layers (OverlayFS) plus a registry (Docker Hub). Engineers could "ship their environment" for the first time.

1.6 2015 — OCI, 2016 — containerd, 2017 — Kubernetes

The Open Container Initiative (Docker + CoreOS + Linux Foundation) standardized image and runtime specs. Docker split the runtime out as containerd. Kubernetes became the orchestration standard, abstracting away the Docker engine.

1.7 2024 — Today

Docker Desktop, Podman (daemonless), Kubernetes + containerd + runc, ECS, Cloud Run, Fly.io. Containers are the default unit of cloud deployment.

2. Linux Namespaces — 8 dimensions of isolation

Namespaces restrict "which OS resources this process can see". Eight kinds as of 2024:

2.1 PID Namespace

Process-ID isolation. Inside a container, ps shows:

PID TTY      TIME CMD
  1 ?     00:00:00 nginx
 10 ?     00:00:00 worker
 11 ?     00:00:00 worker

The container's nginx is PID 1. On the host, it might be PID 12345 — different number spaces. PID 1 is special: the kernel expects it to reap zombies (init role), and its termination kills all children. Most apps are not designed as init, so small inits like tini or dumb-init are used.
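
You can reproduce this isolation with unshare (a minimal sketch; requires root):

# Start bash as PID 1 in fresh PID + mount namespaces, with its own /proc
sudo unshare --pid --fork --mount-proc bash
ps aux      # only this shell and its children are visible
echo $$     # prints 1: bash is PID 1 here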

2.2 Mount Namespace

Isolates mount points. Each container has its own / tree — an evolution of chroot. Specific host mounts can be shared.

2.3 Network Namespace

Each container has its own interfaces, routing table, and iptables rules.

# Enter a container's network namespace from the host
sudo nsenter -t <container_pid> -n ip addr

Docker bridge mode: virtual bridge docker0 plus veth pairs.

2.4 UTS Namespace

Unix Time-Sharing System — isolates the hostname and NIS domain name.

2.5 IPC Namespace

Isolates System V IPC and POSIX message queues.

2.6 User Namespace

UID/GID mapping. The container's root (UID 0) can map to an unprivileged host UID like 100000. Critical for security but complex — Docker does not enable it by default; Podman rootless relies on it.
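
A one-line taste of UID mapping (a sketch; needs a reasonably recent util-linux):

unshare --user --map-root-user bash
id                        # uid=0(root), but only inside this namespace
cat /proc/self/uid_map    # e.g. "0 1000 1": UID 0 here is host UID 1000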

2.7 Cgroup Namespace

Isolates the cgroup hierarchy so the container sees its own cgroup path as /. Introduced in Linux 4.6 (2016).

2.8 Time Namespace

New in Linux 5.6 (2020). Each container can have its own CLOCK_BOOTTIME and CLOCK_MONOTONIC offsets. Used by checkpoint/restore (CRIU).

2.9 Manipulating namespaces

unshare --pid --mount --net --fork bash   # new bash in new namespaces
nsenter -t PID -n -p                       # enter existing namespaces

docker exec essentially creates a new process that joins the target container's namespaces (setns()).

3. cgroups — resource limits

3.1 cgroups v1 (2007-)

A separate hierarchy per resource (CPU, memory, block I/O, etc.):

/sys/fs/cgroup/
├── cpu/
│   ├── docker/
│   │   └── <container_id>/
│   │       ├── cpu.cfs_quota_us
│   │       └── cpu.cfs_period_us
├── memory/
├── blkio/
└── ...

Independent trees made configuration complex.

3.2 cgroups v2 (2016-, systemd 232+)

A single unified hierarchy:

/sys/fs/cgroup/
├── user.slice/
├── system.slice/
│   └── docker-<id>.scope/
│       ├── cpu.max         # "100000 100000" = 1 CPU
│       ├── memory.max      # "536870912" = 512MB
│       ├── io.max
│       └── pids.max

One tree, hierarchical allocation, simpler model.

3.3 CPU limits in practice

cpu.max = "50000 100000" means "up to 50ms of CPU every 100ms period" = 0.5 CPU. JVM or Node bursts can be throttled, causing tail latency. Docker's --cpus=1 maps to cpu.max = "100000 100000".
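
The same limit can be set by hand on a cgroup v2 host (a sketch; assumes the cpu controller is enabled in cgroup.subtree_control):

sudo mkdir /sys/fs/cgroup/demo
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max   # 0.5 CPU
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs          # move this shell in
cat /sys/fs/cgroup/demo/cpu.stat                             # nr_throttled, throttled_usec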

3.4 Memory limits

  • memory.max: hard cap. OOM on overflow.
  • memory.high: soft limit triggering reclaim pressure.
  • memory.min, memory.low: protected memory.

Hitting memory.max triggers a cgroup-scoped OOM kill: the container's processes die while the host stays healthy.
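
Continuing the demo cgroup from section 3.3, the memory knobs are plain files too:

echo 536870912 | sudo tee /sys/fs/cgroup/demo/memory.max    # 512MB hard cap
echo 419430400 | sudo tee /sys/fs/cgroup/demo/memory.high   # reclaim pressure from ~400MB
cat /sys/fs/cgroup/demo/memory.events                       # oom_kill counts kills so far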

3.5 cgroup-aware runtimes

JVM 10+ reads cgroup limits (-XX:+UseContainerSupport is on by default). Go 1.19+ has GOMEMLIMIT. Node must be told explicitly via --max-old-space-size. Without these settings, runtimes size themselves for host memory (e.g. 64GB) and get OOM-killed quickly.
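
For example (real flags; the image names and the 460MiB value are illustrative):

# Tell the Go runtime about ~90% of the 512MB cgroup limit
docker run -m 512m -e GOMEMLIMIT=460MiB my-go-app
# JVM: container-aware since JDK 10; cap the heap fraction explicitly
docker run -m 512m my-jvm-app java -XX:MaxRAMPercentage=75.0 -jar app.jar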

4. OverlayFS — the secret of image layers

4.1 Why layers

A Docker image is a stack of read-only layers:

Layer 4 (10KB): app code
Layer 3 (50MB): npm install output
Layer 2 (80MB): apt packages
Layer 1 (70MB): ubuntu base

Each layer is independent and immutable; images share base layers; only the changed top layer needs push/pull.

4.2 OverlayFS structure

┌─────────────────────┐
│  merged view        │  ← what the app sees (the mount point)
├─────────────────────┤
│  upperdir (RW)      │  ← container writes
├─────────────────────┤
│  lowerdir (RO) ×N   │  ← image layers
└─────────────────────┘

Reads fall through to lowerdir; writes go to upperdir; modifications trigger copy-up (copy the original to upperdir before editing).
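
The merged view is easy to build by hand (a minimal sketch; requires root and the overlay kernel module):

mkdir lower upper work merged
echo "from the image" > lower/app.conf
sudo mount -t overlay overlay \
    -o lowerdir=lower,upperdir=upper,workdir=work merged
echo "changed" > merged/app.conf    # write triggers copy-up into upper/
ls upper/                           # app.conf now lives in the RW layer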

4.3 Copy-up cost

Editing 1 byte of a 100MB file copies all 100MB into upperdir first. That is why large DB files do not belong in the container's writable layer: use volume mounts, which bypass the overlay entirely.

4.4 Whiteouts — deletion

rm does not actually remove from lowerdir. Instead a special whiteout file is placed in upperdir so the merged view hides it — but the original still occupies the layer. That is why RUN rm in a later Dockerfile step does not shrink the image; delete in the same RUN that created the file.
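
Continuing the sketch above, a whiteout is visible from the host as a 0/0 character device in upperdir:

rm merged/app.conf       # delete through the overlay
ls -l upper/app.conf     # c--------- ... 0, 0 app.conf: the whiteout marker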

4.5 Storage driver evolution

  • aufs (2013-): early, not upstream.
  • devicemapper (2014-): thin provisioning, slow.
  • btrfs, zfs: filesystem dependent.
  • overlay (2014): upstream, fast.
  • overlay2 (2016-): current default, multiple lowerdirs.

Podman also offers fuse-overlayfs for rootless mode.

5. Container runtimes — runc and friends

5.1 Docker's internal layers

docker CLI
  ↓
dockerd (daemon)
  ↓
containerd (container lifecycle)
  ↓
containerd-shim (one per container)
  ↓
runc (OCI runtime, actually creates containers)
  ↓
Linux kernel (namespaces, cgroups, ...)

5.2 runc's job

Implements the OCI Runtime Specification. Input: rootfs + config.json. Output: a running container. Written in Go/C, it calls syscalls directly: clone() with namespace flags, setns(), cgroup setup, rootfs mount, execve().
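
runc can be driven without Docker at all (a sketch; requires root and a runc binary on PATH):

# Build a rootfs from any image and generate a default OCI config.json
mkdir -p mycontainer/rootfs
docker export $(docker create busybox) | tar -C mycontainer/rootfs -x
cd mycontainer
runc spec               # writes config.json
sudo runc run demo      # namespaces + cgroups + execve, no daemon involved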

5.3 Alternative runtimes

  • crun: C implementation, faster than runc (no Go runtime).
  • youki: Rust implementation.
  • Kata Containers: each container in a microVM — VM-grade isolation.
  • gVisor: Google userspace kernel that intercepts syscalls.
  • Firecracker: AWS Lambda's microVM.

Multi-tenant platforms (Lambda, Fargate) use VM-based runtimes because kernel sharing is too risky.

5.4 The shim

containerd-shim exists per container. It keeps the container alive even if containerd crashes, collects stdout/stderr, and records exit codes. The chain is: kubelet → CRI → containerd → shim → runc.

6. Security — why containers are not "real" isolation

6.1 Shared-kernel risk

All containers share one kernel → a kernel CVE is a mass container escape. Real examples:

  • Dirty COW (CVE-2016-5195): privilege escalation, container escape.
  • runc CVE-2019-5736: overwriting the runc binary to own the host.
  • Leaky Vessels (2024): multiple runc/BuildKit CVEs.

Mitigations: patched kernels, AppArmor/SELinux, seccomp profiles.

6.2 Linux Capabilities

Traditional Unix root (UID 0) has total power — too dangerous for containers. Capabilities split root into pieces:

  • CAP_NET_ADMIN: network config.
  • CAP_SYS_ADMIN: most dangerous operations.
  • CAP_NET_BIND_SERVICE: bind ports below 1024.

Docker default: a restricted cap set (least privilege). Tune with --cap-add / --cap-drop. --privileged grants all caps — effectively host root.
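
In practice, the least-privilege pattern looks like this (standard Docker flags; the exact capability set a given image needs varies):

# Drop everything, add back only what the server actually needs
docker run --cap-drop ALL \
    --cap-add NET_BIND_SERVICE --cap-add CHOWN \
    --cap-add SETUID --cap-add SETGID \
    -p 80:80 nginx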

6.3 seccomp — syscall filter

"Only allow this container to call these syscalls":

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    { "names": ["read", "write", "open"], "action": "SCMP_ACT_ALLOW" }
  ]
}

Docker's default profile allows around 300 syscalls and blocks dangerous ones such as keyctl, mount, and reboot.
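
Applying a custom or relaxed profile is a single flag (the profile path is illustrative):

docker run --security-opt seccomp=/path/to/profile.json nginx
docker run --security-opt seccomp=unconfined nginx   # disables filtering entirely; avoid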

6.4 AppArmor / SELinux

Mandatory access control (MAC) over files and other kernel objects. Ubuntu uses AppArmor, RHEL uses SELinux.

6.5 Rootless containers

Run the container daemon as a non-privileged user, mapping UIDs via user namespace. Podman's default. Docker also supports rootless. Much stronger security, with limitations (no ports below 1024, etc).
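
A quick rootless check with Podman (the mapping values depend on /etc/subuid):

podman run -d -p 8080:80 nginx          # no daemon, no root on the host
podman unshare cat /proc/self/uid_map   # shows how container UIDs map to yours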

6.6 Security checklist

  • Patched kernel.
  • Keep the default seccomp profile (avoid --security-opt seccomp=unconfined).
  • Minimize capabilities (--cap-drop ALL + add only what is needed).
  • Run as non-root user (USER directive).
  • Read-only rootfs (--read-only with tmpfs volumes).
  • Image scanning: Trivy, Grype, Docker Scout.
  • Runtime monitoring: Falco.

7. Building images well

7.1 Layer optimization

# Bad: each RUN creates a new layer; files deleted later still occupy the earlier layer
FROM node:20
COPY package.json .
RUN npm install
COPY . .
RUN npm run build
RUN rm -rf /tmp/cache   # shrinks nothing: the files remain in the npm install layer

# Good: clean up in the same RUN that created the files
FROM node:20
COPY package.json .
RUN npm install && rm -rf /tmp/* /var/cache/*
COPY . .
RUN npm run build

7.2 Multi-stage build

FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm install && npm run build

FROM node:20-alpine
COPY --from=builder /app/dist /app
COPY --from=builder /app/node_modules /app/node_modules
CMD ["node", "/app/index.js"]

Build tools (gcc, webpack) never reach the final image — 5x+ size reduction.

7.3 Distroless images

Google's minimal images (no shell):

FROM gcr.io/distroless/nodejs20
COPY --from=builder /app /app
CMD ["/app/index.js"]

Tens of MB, minimal attack surface — but harder to debug (no exec shell).

7.4 Alpine vs Debian

  • Alpine (node:20-alpine): 5MB base, musl libc.
  • Debian slim (node:20-slim): 80MB, glibc.

Alpine gotcha: musl compatibility (DNS resolver, pthread quirks), extra tooling for Python C extensions.

7.5 BuildKit

DOCKER_BUILDKIT=1 docker build .

Parallel stage execution, registry cache export/import (--cache-to=type=registry,ref=...), build-time secrets (--secret), and a containerd-native image store.
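
A sketch of the secret mount, which keeps credentials out of every layer (BuildKit syntax; the npmrc id and paths are illustrative):

# Dockerfile (requires BuildKit)
# syntax=docker/dockerfile:1
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm install

# Build: the secret is mounted only for that RUN and never stored in a layer
DOCKER_BUILDKIT=1 docker build --secret id=npmrc,src=$HOME/.npmrc .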

8. Networking — from docker0 to CNI

8.1 Docker default bridge

docker0 (bridge, 172.17.0.1)
  ├── veth0 → container1 eth0 (172.17.0.2)
  ├── veth1 → container2 eth0 (172.17.0.3)
  └── veth2 → container3 eth0 (172.17.0.4)

veth pairs (one end on host, one in container), NAT via iptables MASQUERADE, port forwarding via iptables DNAT (-p 80:8080).
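
The same plumbing can be rebuilt by hand with iproute2 (a sketch; assumes the docker0 bridge exists, and the names and addresses are illustrative):

sudo ip netns add c1                                    # stand-in "container" netns
sudo ip link add veth-host type veth peer name veth-c1  # create the veth pair
sudo ip link set veth-c1 netns c1                       # move one end inside
sudo ip link set veth-host master docker0 up            # attach the other to the bridge
sudo ip netns exec c1 ip addr add 172.17.0.42/16 dev veth-c1
sudo ip netns exec c1 ip link set veth-c1 up
sudo ip netns exec c1 ip route add default via 172.17.0.1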

8.2 Network modes

  • bridge (default).
  • host: share host network (no isolation, fastest).
  • none: no network.
  • container: share another container's network (pod-like).

8.3 Kubernetes CNI

Kubernetes mandates a flat network in which every pod gets its own routable IP. Docker's bridge alone is not enough.

  • Flannel: VXLAN tunneling, simple.
  • Calico: BGP-based, eBPF support, network policy.
  • Cilium: eBPF native, rich observability/security.
  • AWS VPC CNI: ENI assigned directly to pods.

CNI (Container Network Interface) is the standard plugin contract.

9. Integration with Kubernetes

9.1 Why Kubernetes dropped Docker (2020)

Kubernetes 1.20 deprecated dockershim; 1.24 (2022) removed it. Reasons: Docker Engine contained features kubelet does not use (image builds, swarm); a shim was needed for CRI; containerd was already Docker's core. Result: kubelet → CRI → containerd direct. OCI images still work the same way.

9.2 Pod = containers sharing namespaces

A Pod is a group of containers sharing network, IPC, UTS namespaces. Mount and PID are independent by default.

pause container (owns the namespaces)
  ├── app container
  └── sidecar container

pause is a tiny no-op process that keeps the namespaces alive.
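
You can emulate the pod model with plain Docker (a sketch using the real pause image; nginx stands in for the app container):

docker run -d --name pause registry.k8s.io/pause:3.9
docker run -d --net=container:pause --ipc=container:pause nginx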

9.3 Init and sidecar containers

  • Init containers: run sequentially before main.
  • Sidecar (native since 2023): run alongside app with independent restart.

Envoy (Istio), Fluent Bit, etc. run as sidecars.

9.4 What orchestration adds

Scheduling, health checks and auto-restart, rolling updates, service discovery and load balancing, storage/config/secrets, horizontal autoscaling — all declarative YAML.

10. Practical tips

10.1 Debugging toolkit

docker ps -a
docker logs -f <container>
docker exec -it <container> sh
docker inspect <container>
docker stats
docker events

Kubernetes:

kubectl logs -f <pod>
kubectl exec -it <pod> -- sh
kubectl describe pod <pod>
kubectl get events --sort-by='.lastTimestamp'

10.2 Debugging distroless

No shell, so exec won't work. Options:

kubectl debug -it <pod> --image=busybox --target=app
# or
docker run --rm -it --pid=container:<id> --net=container:<id> busybox

Share pid/net namespaces to attach a debug container.

10.3 Image size analysis

  • docker history <image>: per-layer sizes.
  • dive tool: explore layer contents.
  • docker image ls --format="...".

10.4 Performance observation

docker stats --no-stream    # point-in-time CPU/memory per container

# Raw cgroup v2 counters (container-specific files live under system.slice/docker-<id>.scope/)
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/cpu.stat

# Debug container sharing the host PID namespace, e.g. to run profiling tools
docker run --cap-add SYS_ADMIN --pid host ...

11. Closing — what one docker run really does

A single docker run nginx performs:

  1. Download nginx image layers from Docker Hub.
  2. Assemble the layer stack with OverlayFS.
  3. containerd creates a containerd-shim.
  4. The shim launches runc.
  5. runc sets up the namespaces (PID, mount, network, UTS, IPC, cgroup), resource limits, and the rootfs.
  6. Applies seccomp/AppArmor profile.
  7. execve("nginx") starts the process.
  8. Container network (veth + bridge) is configured.
  9. iptables port forwarding is added.

Done in under 50ms once the image is cached. Compared with the minutes of a 2000s-era VM boot this is miraculous, resting on a decade-plus of engineering: 2006 cgroups, 2008 namespaces, 2013 Docker, 2015 OCI, 2016 overlay2, 2020 CRI-only Kubernetes.

The price of this convenience is shared-kernel risk — the reason Lambda and Fargate use microVMs. For multi-tenant workloads, do not trust "container = isolation" blindly.

Next: Kubernetes internals — etcd's Raft consensus, scheduler filter/score, controller patterns, custom resources, and the CRI/CNI/CSI plugin system.

References

  • Jérôme Petazzoni — "Anatomy of a Container" (LinuxCon 2015).
  • Michael Kerrisk — "Understanding Linux Namespaces" (LWN series).
  • OCI Image/Runtime Specifications (GitHub).
  • Julia Evans — Container Networking series.
  • Liz Rice — "Container Security" (O'Reilly, 2020).
  • Kubernetes official docs — Pods, CNI, CRI.
  • Red Hat crun blog.
  • Firecracker paper (NSDI 2020).
  • Google gVisor paper.
  • "The Kubernetes Book" — Nigel Poulton.