containerd Container Lifecycle Management

This post analyzes the complete lifecycle of containers in containerd, from creation to termination. We examine the separation of container metadata from execution processes (Tasks), process management through shims, and the integration of various runtime classes.


1. Separation of Container and Task

1.1 Core Concepts

containerd separates container metadata from execution state:

Container (metadata):
  - ID, image reference, snapshot key
  - OCI runtime spec
  - Labels, extension data
  - Persisted in BoltDB

Task (execution state):
  - Actually running process
  - PID, state (created/running/stopped)
  - stdin/stdout/stderr
  - Managed by shim process

1.2 Benefits of Separation

This separation yields three benefits:

1. Container metadata can exist without a Task
   - Create container and start later
   - Preserve metadata of stopped containers

2. Independent of containerd restarts
   - Tasks are managed by shim, surviving containerd restarts
   - Reconnect to existing shims after restart

3. Support for multiple runtimes
   - Container object is runtime-agnostic
   - Runtime selected at Task creation time

2. Container Creation

2.1 Creation Process

Container creation flow:

1. Generate OCI spec from image
        |
        v
2. Prepare snapshot
   - Add Active snapshot to image snapshot chain
   - Writable layer for the container
        |
        v
3. Store container metadata
   - Create Container record in BoltDB
   - Store ID, image, snapshot, runtime, spec
        |
        v
4. Return container object
   (process not yet started)
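Step 3's metadata store is a BoltDB database whose buckets are nested by namespace. A rough in-memory analogy of that layout (illustrative only; the map-of-maps stands in for nested buckets, and the record fields are assumptions, not containerd's schema):

```go
package main

import (
	"errors"
	"fmt"
)

// containerRecord is what gets persisted: metadata only, no process.
type containerRecord struct {
	ID, Image, SnapshotKey, Runtime string
	Spec                            []byte // serialized OCI runtime spec
}

// metaStore mimics BoltDB's namespace -> container-ID -> record
// nesting with nested maps.
type metaStore struct {
	namespaces map[string]map[string]containerRecord
}

func newMetaStore() *metaStore {
	return &metaStore{namespaces: map[string]map[string]containerRecord{}}
}

func (s *metaStore) create(ns string, rec containerRecord) error {
	bucket, ok := s.namespaces[ns]
	if !ok {
		bucket = map[string]containerRecord{}
		s.namespaces[ns] = bucket
	}
	if _, exists := bucket[rec.ID]; exists {
		return errors.New("container already exists: " + rec.ID)
	}
	bucket[rec.ID] = rec
	return nil
}

func (s *metaStore) get(ns, id string) (containerRecord, bool) {
	rec, ok := s.namespaces[ns][id]
	return rec, ok
}

func main() {
	s := newMetaStore()
	s.create("k8s.io", containerRecord{
		ID:      "container-abc",
		Image:   "docker.io/library/nginx:latest",
		Runtime: "io.containerd.runc.v2",
	})
	rec, _ := s.get("k8s.io", "container-abc")
	fmt.Println(rec.Runtime)
}
```

Note that namespacing is part of the key, which is why two namespaces (e.g. `k8s.io` and `default`) can each hold a container with the same ID.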

2.2 OCI Runtime Spec

containerd generates an OCI runtime spec to define the container execution environment:

OCI runtime spec key sections:

ociVersion: "1.0.2"

process:
  terminal: false
  user: uid=0, gid=0
  args: ["/bin/sh"]
  env: ["PATH=/usr/local/sbin:..."]
  cwd: "/"
  capabilities: ...
  rlimits: ...

root:
  path: "rootfs"
  readonly: false

hostname: "container-abc"

mounts:
  - destination: "/proc"
    type: "proc"
    source: "proc"
  - destination: "/dev"
    type: "tmpfs"
    source: "tmpfs"

linux:
  namespaces:
    - type: "pid"
    - type: "network"
    - type: "ipc"
    - type: "uts"
    - type: "mount"
  resources:
    memory:
      limit: 536870912   # 512 MiB
    cpu:
      shares: 1024
      quota: 100000
      period: 100000
  cgroupsPath: "/kubelet/pod-abc/container-xyz"

2.3 Spec Generators (Spec Opts)

containerd spec generation pattern:

Spec Opts are function chains that incrementally build the OCI spec:

WithImageConfig(image)     -> Apply image CMD, ENV, WORKDIR
WithHostNamespace(ns)      -> Share host namespace
WithMemoryLimit(limit)     -> Set memory limit
WithCPUs(cpus)             -> Set CPU limit
WithMounts(mounts)         -> Add mount points
WithProcessArgs(args)      -> Set process arguments
WithRootfsPropagation(p)   -> Set rootfs mount propagation
WithSeccompProfile(p)      -> Apply Seccomp profile
WithApparmorProfile(p)     -> Apply AppArmor profile
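Spec Opts follow Go's functional-options pattern: each opt is a function that mutates the spec, and they are applied in order, so later opts can override what earlier ones (like WithImageConfig) set. A stripped-down sketch of the pattern; the Spec type and these two opts are homegrown stand-ins, not containerd's oci package:

```go
package main

import "fmt"

// Minimal stand-in for the OCI spec being built.
type Spec struct {
	Args        []string
	Env         []string
	MemoryLimit int64
}

// SpecOpt mutates the spec in place; GenerateSpec applies a chain
// of them in order.
type SpecOpt func(*Spec) error

func WithProcessArgs(args ...string) SpecOpt {
	return func(s *Spec) error { s.Args = args; return nil }
}

func WithMemoryLimit(limit int64) SpecOpt {
	return func(s *Spec) error { s.MemoryLimit = limit; return nil }
}

func GenerateSpec(opts ...SpecOpt) (*Spec, error) {
	s := &Spec{}
	for _, o := range opts {
		if err := o(s); err != nil {
			return nil, err
		}
	}
	return s, nil
}

func main() {
	s, _ := GenerateSpec(
		WithProcessArgs("/bin/sh", "-c", "echo hi"),
		WithMemoryLimit(512<<20), // 512 MiB
	)
	fmt.Println(s.Args, s.MemoryLimit)
}
```

The chain shape is why opts compose so well: a caller can mix image defaults, resource limits, and security profiles without the spec builder knowing about any of them.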

3. Task Execution

3.1 Task Creation

Task creation flow:

1. Check container's runtime type
   (e.g., io.containerd.runc.v2)
        |
        v
2. Execute shim binary
   (containerd-shim-runc-v2 start)
        |
        v
3. Shim returns ttrpc socket address
        |
        v
4. containerd sends Create request to shim
   - Pass OCI spec
   - Pass bundle path
        |
        v
5. Shim executes runc create
   - Create namespaces
   - Configure cgroups
   - Mount rootfs
   - Create process (not yet started)
        |
        v
6. Task state: Created

3.2 Task Start

Task start:

1. containerd sends Start request to shim
        |
        v
2. Shim executes runc start
   - Start container process init
   - Synchronize via exec.fifo
        |
        v
3. Task state: Running
   - PID assigned
   - stdin/stdout/stderr connected

3.3 Task State Transitions

Task state machine:

  Created
     |
     | Start()
     v
  Running
     |
     +-- Kill(signal) -> Send signal
     |
     +-- Pause()  -> Paused
     |                 |
     |                 +-- Resume() -> Running
     |
     +-- Process exits -> Stopped
     |
     v
  Stopped
     |
     | Delete()
     v
  (deleted)

3.4 Exec (Additional Processes)

Exec operation:

Add a new process to an already running container:

1. Create ExecProcess
   - Define new process spec (args, env, user)
   - Assign execID
        |
        v
2. Send Exec request to shim
        |
        v
3. Execute runc exec
   - Enter existing container namespaces
   - Start new process
        |
        v
4. Independently manage stdin/stdout/stderr

Use cases: kubectl exec, docker exec

4. Shim Lifecycle

4.1 Shim Start

Shim start process:

1. containerd fork/execs shim binary
   containerd-shim-runc-v2 -namespace k8s.io \
     -id container-abc \
     -address /run/containerd/containerd.sock \
     start
        |
        v
2. Shim daemonizes itself
   - Detach from parent process (setsid)
   - Run independently of containerd
        |
        v
3. Create ttrpc Unix socket
   /run/containerd/s/abc123...
        |
        v
4. Output socket address to stdout
   containerd reads this address to connect
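Steps 1–4 amount to a simple handshake: containerd execs the binary and reads the socket address from its first line of stdout. A rough sketch of that handshake, using `echo` as a stand-in for the shim binary (illustrative only; the real code also passes the namespace/id/address flags and handles the shim daemonizing):

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

// startShim launches a shim-like binary and returns the socket
// address it prints on stdout, mirroring containerd's handshake.
func startShim(binary string, args ...string) (string, error) {
	cmd := exec.Command(binary, args...)
	out, err := cmd.StdoutPipe()
	if err != nil {
		return "", err
	}
	if err := cmd.Start(); err != nil {
		return "", err
	}
	// The shim's "start" subcommand writes the ttrpc socket
	// address as its first output line, then daemonizes.
	addr, err := bufio.NewReader(out).ReadString('\n')
	if err != nil {
		return "", err
	}
	cmd.Wait()
	return strings.TrimSpace(addr), nil
}

func main() {
	// echo stands in for `containerd-shim-runc-v2 ... start`.
	addr, err := startShim("echo", "unix:///run/containerd/s/abc123")
	fmt.Println(addr, err)
}
```

containerd then dials the returned address over ttrpc and issues Create/Start against it.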

4.2 Shim Responsibilities

Shim key responsibilities:

1. Process management:
   - Act as parent of container process
   - Collect exit status via wait4()
   - Detect and report OOM events

2. I/O management:
   - Manage stdin/stdout/stderr FIFOs
   - Connect to log drivers
   - Copy I/O (container process <-> FIFO)

3. Communication with containerd:
   - Receive commands via ttrpc
   - Report events (TaskExit, etc.)
   - Respond to status queries

4. containerd restart resilience:
   - Continue running when containerd restarts
   - Restarted containerd reconnects to existing shim
   - State recovery

4.3 Shim Shutdown

Shim shutdown:

1. Receive Task Delete request
        |
        v
2. Clean up container resources
   - Delete cgroups
   - Clean up namespaces
   - Unmount rootfs
        |
        v
3. Close ttrpc socket
        |
        v
4. Shim process exits

5. Checkpoint/Restore

5.1 Checkpoint

Checkpoint operation:

Save running container state as a snapshot:

1. Invoke CRIU (Checkpoint/Restore in Userspace)
        |
        v
2. Dump process memory
   - Save memory pages
   - Save file descriptor state
   - Save network connection state
        |
        v
3. Create checkpoint image
   - CRIU image file set
   - Stored alongside container spec
        |
        v
4. Optionally stop the container

Use cases:
  - Live migration
  - Fast start (restore from pre-warmed state)
  - Debugging (capture state at specific point)

5.2 Restore

Restore operation:

1. Load checkpoint image
        |
        v
2. Prepare new container environment
   - Create namespaces
   - Mount rootfs
        |
        v
3. Execute CRIU restore
   - Restore memory pages
   - Restore process state
   - Reconnect file descriptors
        |
        v
4. Resume process execution

6. Runtime Classes

6.1 Multiple Runtime Support

containerd supports various runtimes through the shim interface:

Runtime class comparison:

+----------+--------------+----------+-----------+---------------+
| Runtime  | Isolation    | Overhead | Startup   | Compatibility |
+----------+--------------+----------+-----------+---------------+
| runc     | Namespace    | Minimal  | Fast      | Best          |
| kata     | Light VM     | Medium   | Medium    | High          |
| gVisor   | User kernel  | Low      | Fast      | Medium        |
| Wasm     | Wasm sandbox | Minimal  | Very fast | Limited       |
+----------+--------------+----------+-----------+---------------+

6.2 runc

runc:

- Default OCI runtime
- Linux namespace and cgroup-based isolation
- Uses host kernel directly
- Lowest overhead
- Suitable for all Linux container workloads

shim: containerd-shim-runc-v2
config.toml:
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"

6.3 Kata Containers

Kata Containers:

- Runs containers inside lightweight VMs
- Uses QEMU/Cloud-Hypervisor/Firecracker
- Strong isolation with separate guest kernel
- Suited for multi-tenant environments
- VM overhead exists

shim: containerd-shim-kata-v2
config.toml:
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
    runtime_type = "io.containerd.kata.v2"

6.4 gVisor

gVisor (runsc):

- User-space kernel (Sentry)
- Intercepts and reimplements system calls
- Reduces host kernel attack surface
- Operates via ptrace or KVM
- Some system calls unsupported

shim: containerd-shim-runsc-v1
config.toml:
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
    runtime_type = "io.containerd.runsc.v1"

6.5 WebAssembly (Wasm)

Wasm runtime:

- Runs WebAssembly binaries as containers
- Uses Wasmtime, WasmEdge, etc.
- Very fast startup (millisecond range)
- Minimal memory usage
- Portable binaries
- Limited system access (WASI)

shim: containerd-shim-wasmtime-v1 (from the runwasi project; the shim
  binary and runtime_type vary by Wasm engine)
config.toml:
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.wasm]
    runtime_type = "io.containerd.wasmtime.v1"

6.6 RuntimeClass Selection

Kubernetes RuntimeClass integration:

1. Define RuntimeClass resource:
   apiVersion: node.k8s.io/v1
   kind: RuntimeClass
   metadata:
     name: kata
   handler: kata

2. Specify RuntimeClass in Pod:
   spec:
     runtimeClassName: kata
     containers:
       - name: app
         image: nginx

3. containerd selects runtime matching handler:
   handler "kata" -> containerd.runtimes.kata config
   -> execute containerd-shim-kata-v2
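The mapping from runtime_type to shim binary is mechanical: the last two dotted fields of the runtime name become part of the binary name. A sketch of that naming convention (simplified from containerd's actual binary-name resolution, which also consults per-runtime config overrides):

```go
package main

import (
	"fmt"
	"strings"
)

// shimBinary derives the shim binary name from a v2 runtime name,
// e.g. "io.containerd.runc.v2" -> "containerd-shim-runc-v2".
func shimBinary(runtimeType string) string {
	parts := strings.Split(runtimeType, ".")
	if len(parts) < 2 {
		return ""
	}
	name := parts[len(parts)-2]    // e.g. "runc", "kata", "runsc"
	version := parts[len(parts)-1] // e.g. "v2", "v1"
	return fmt.Sprintf("containerd-shim-%s-%s", name, version)
}

func main() {
	for _, rt := range []string{
		"io.containerd.runc.v2",
		"io.containerd.kata.v2",
		"io.containerd.runsc.v1",
	} {
		fmt.Println(rt, "->", shimBinary(rt))
	}
}
```

This is what lets a single handler string in a Kubernetes RuntimeClass select an entirely different isolation technology with no changes to the Pod spec beyond runtimeClassName.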

7. Summary

containerd container lifecycle management revolves around the separation of Container (metadata) and Task (execution), process isolation through shims, and support for diverse runtime classes. The shim's daemonized design ensures containers survive containerd restarts, while the standardized OCI runtime spec interface integrates runtimes like runc, Kata, gVisor, and Wasm seamlessly.