[containerd] CRI Implementation: Kubernetes Runtime Integration


containerd implements the Kubernetes CRI (Container Runtime Interface) as a built-in plugin. This post analyzes the CRI gRPC service implementation, Pod Sandbox management, container spec translation, the streaming API, RuntimeClass, and NRI.


1. CRI gRPC Service

1.1 Service Structure

CRI consists of two gRPC services:

RuntimeService:
  +-- PodSandbox management
  |     RunPodSandbox
  |     StopPodSandbox
  |     RemovePodSandbox
  |     PodSandboxStatus
  |     ListPodSandbox
  |
  +-- Container management
  |     CreateContainer
  |     StartContainer
  |     StopContainer
  |     RemoveContainer
  |     ListContainers
  |     ContainerStatus
  |     UpdateContainerResources
  |
  +-- Streaming
  |     ExecSync
  |     Exec
  |     Attach
  |     PortForward
  |
  +-- Runtime info
        Status
        Version

ImageService:
  +-- PullImage
  +-- ListImages
  +-- ImageStatus
  +-- RemoveImage
  +-- ImageFsInfo
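The shape of the two services can be sketched as Go interfaces. This is a trimmed illustration, not the actual generated interfaces from the CRI protobuf (k8s.io/cri-api); method names follow the list above, but the signatures are simplified stand-ins:

```go
package main

import "fmt"

// Trimmed sketch of the two CRI services. The real interfaces are
// generated from the CRI protobuf; signatures here are simplified.
type RuntimeService interface {
	RunPodSandbox(config string) (sandboxID string, err error)
	StopPodSandbox(sandboxID string) error
	CreateContainer(sandboxID, config string) (containerID string, err error)
	StartContainer(containerID string) error
	Version() string
}

type ImageService interface {
	PullImage(ref string) (imageRef string, err error)
	RemoveImage(ref string) error
}

// fakeRuntime is a stand-in implementation used only to show the shape.
type fakeRuntime struct{}

func (fakeRuntime) RunPodSandbox(config string) (string, error)    { return "sandbox-1", nil }
func (fakeRuntime) StopPodSandbox(id string) error                 { return nil }
func (fakeRuntime) CreateContainer(sb, cfg string) (string, error) { return "ctr-1", nil }
func (fakeRuntime) StartContainer(id string) error                 { return nil }
func (fakeRuntime) Version() string                                { return "0.1.0" }

func main() {
	var rt RuntimeService = fakeRuntime{}
	id, _ := rt.RunPodSandbox("{}")
	fmt.Println("sandbox:", id, "version:", rt.Version())
}
```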

1.2 Socket Configuration

CRI socket:

containerd serves CRI on the same gRPC socket as its core API:
  /run/containerd/containerd.sock

kubelet configuration:
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock

CRI plugin registers CRI services on the containerd server:
  Plugin ID: io.containerd.grpc.v1.cri

2. Pod Sandbox

2.1 Pod Sandbox Concept

A Pod Sandbox represents the isolation environment for a Pod:

Pod Sandbox composition:

Pod Sandbox = Pause container + shared namespaces

Shared resources:
  - Network namespace (same IP, port space)
  - IPC namespace (inter-process communication)
  - UTS namespace (hostname)
  - PID namespace (optional)

Isolated resources:
  - Mount namespace (per container)
  - cgroup (per container resource limits)

2.2 RunPodSandbox Flow

RunPodSandbox processing:

1. Create Sandbox metadata
   - Generate ID
   - Create log directory
        |
        v
2. Pull Pause image
   - Determine image from sandbox_image config
   - Default: registry.k8s.io/pause:3.9
        |
        v
3. Prepare Pause container snapshot
        |
        v
4. Generate OCI spec
   - Minimal spec for Pause container
   - Include hostname, DNS configuration
        |
        v
5. Create network namespace
   - Create namespace file at /var/run/netns/
        |
        v
6. Call CNI plugin
   - Create network interface
   - Allocate IP
        |
        v
7. Create and start Pause container Task
        |
        v
8. Set Sandbox state to SANDBOX_READY

2.3 Pause Container

Pause container role:

1. Namespace holder:
   - First process in network namespace
   - Namespaces persist even if app containers exit
   - Binds namespace lifecycle to Pod

2. PID 1 role:
   - Init process of Pod PID namespace
   - Reaps zombie processes
   - Minimal resource usage (approx. 1MB)

3. Behavior:
   - Waits indefinitely via pause() system call
   - Exits on SIGTERM

3. Container Spec Translation

3.1 CRI Request to OCI Spec

Spec translation process:

CRI ContainerConfig:
  - Image
  - Command, Args
  - Envs
  - Mounts
  - Devices
  - SecurityContext
  - Resources
        |
        v
containerd CRI plugin translates
        |
        v
OCI Runtime Spec:
  - root (image snapshot path)
  - process (command, env, capabilities)
  - mounts (volumes, special filesystems)
  - linux.resources (cgroup settings)
  - linux.namespaces (shared with Sandbox)
  - hooks (OCI hooks)

3.2 Resource Translation

Kubernetes resources -> OCI resource translation:

CPU:
  requests.cpu: 250m
    -> linux.resources.cpu.shares = 256
       (based on 1000m = 1024 shares)

  limits.cpu: 500m
    -> linux.resources.cpu.quota = 50000
       linux.resources.cpu.period = 100000
       (500m/1000m * 100000us)

Memory:
  limits.memory: 512Mi
    -> linux.resources.memory.limit = 536870912
       (in bytes)

  requests.memory:
    -> Used for scheduling only, not reflected in OCI spec

Hugepages:
  limits.hugepages-2Mi: 100Mi
    -> linux.resources.hugepageLimits:
         pageSize: "2MB"
         limit: 104857600
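The arithmetic behind these translations is simple enough to spell out. A simplified sketch of the milliCPU conversions (the real kubelet/CRI code also applies minimum share values and rounding edge cases not shown here):

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024   // 1000m CPU == 1024 cgroup cpu.shares
	quotaPeriodUs = 100000 // default CFS period in microseconds
)

// milliCPUToShares mirrors how requests.cpu maps to cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / 1000
}

// milliCPUToQuota mirrors how limits.cpu maps to CFS quota/period.
func milliCPUToQuota(milliCPU int64) (quota, period int64) {
	return milliCPU * quotaPeriodUs / 1000, quotaPeriodUs
}

func main() {
	fmt.Println(milliCPUToShares(250)) // 256
	q, p := milliCPUToQuota(500)
	fmt.Println(q, p) // 50000 100000
	// limits.memory: 512Mi in bytes:
	fmt.Println(int64(512) * 1024 * 1024) // 536870912
}
```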

3.3 Security Context Translation

SecurityContext -> OCI spec translation:

runAsUser: 1000
  -> process.user.uid = 1000

runAsGroup: 1000
  -> process.user.gid = 1000

readOnlyRootFilesystem: true
  -> root.readonly = true

privileged: true
  -> Grant all capabilities
  -> Allow all device access
  -> Disable AppArmor/SELinux/Seccomp

capabilities:
  add: ["NET_ADMIN"]
  drop: ["ALL"]
  -> process.capabilities configuration

seccompProfile:
  type: RuntimeDefault
  -> Apply linux.seccomp profile
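The mappings above are essentially field-by-field copies. A minimal sketch, using simplified stand-in types (not the real CRI or OCI spec structs) to show the identity, root filesystem, and capability translations:

```go
package main

import "fmt"

// Simplified stand-ins for the CRI SecurityContext and the OCI spec
// fields it maps to (field names chosen for illustration).
type SecurityContext struct {
	RunAsUser, RunAsGroup  int64
	ReadOnlyRootFilesystem bool
	AddCapabilities        []string
	DropAll                bool
}

type OCIProcess struct {
	UID, GID     int64
	Capabilities []string
}

type OCISpec struct {
	Process      OCIProcess
	RootReadonly bool
}

// applySecurityContext sketches the CRI -> OCI translation: identity,
// root filesystem mode, and capability add/drop are copied across.
func applySecurityContext(sc SecurityContext, spec *OCISpec) {
	spec.Process.UID = sc.RunAsUser
	spec.Process.GID = sc.RunAsGroup
	spec.RootReadonly = sc.ReadOnlyRootFilesystem
	if sc.DropAll {
		spec.Process.Capabilities = nil // drop: ["ALL"] clears the default set
	}
	spec.Process.Capabilities = append(spec.Process.Capabilities, sc.AddCapabilities...)
}

func main() {
	spec := OCISpec{Process: OCIProcess{Capabilities: []string{"CAP_CHOWN"}}}
	applySecurityContext(SecurityContext{
		RunAsUser: 1000, RunAsGroup: 1000,
		ReadOnlyRootFilesystem: true,
		DropAll:                true,
		AddCapabilities:        []string{"CAP_NET_ADMIN"},
	}, &spec)
	fmt.Printf("%+v\n", spec)
}
```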

4. Streaming API

4.1 ExecSync

ExecSync operation:

Synchronously execute command in container:

1. kubelet calls ExecSync(containerID, cmd, timeout)
        |
        v
2. containerd sends Exec request to shim
        |
        v
3. Shim executes runc exec
   - Create new process in container namespaces
        |
        v
4. Capture stdout/stderr
        |
        v
5. Wait for process exit
        |
        v
6. Return exit code + stdout + stderr

Use cases: exec-based liveness/readiness probes (kubectl exec uses the streaming Exec API instead)

4.2 Exec (Async Streaming)

Exec streaming operation:

1. kubelet calls Exec(containerID, cmd, stdin, stdout, stderr)
        |
        v
2. containerd returns a streaming URL
   - URL points at the CRI plugin's built-in streaming server;
     kubelet proxies clients to it through its own endpoint
     (e.g. https://node:10250/exec/...)
        |
        v
3. kubelet passes URL to client
        |
        v
4. Client connects to streaming server via WebSocket/SPDY
        |
        v
5. Streaming server performs actual Exec via containerd
        |
        v
6. Bidirectional stdin/stdout/stderr streaming

Streaming protocols:
  - SPDY (legacy)
  - WebSocket (modern)

4.3 Attach

Attach operation:

Connect to a running container's main process:

1. Generate streaming URL (similar to Exec)
        |
        v
2. Connect to container's stdin/stdout/stderr
   - Does not create new process
   - Directly connects to existing process I/O
        |
        v
3. Bidirectional streaming

Use case: kubectl attach

4.4 PortForward

PortForward operation:

Forward local traffic to Pod port:

1. Generate streaming URL
        |
        v
2. Enter the Pod's network namespace
   - Dial a TCP connection to the specified port
        |
        v
3. Bidirectional data transfer between local port and Pod port

Implementation:
  containerd enters the Pod's network namespace and
  creates a TCP connection to the target port.

Use case: kubectl port-forward

5. RuntimeClass

5.1 RuntimeClass Mapping

RuntimeClass processing:

1. Kubernetes RuntimeClass resource:
   apiVersion: node.k8s.io/v1
   kind: RuntimeClass
   metadata:
     name: kata
   handler: kata

2. kubelet passes runtime_handler = "kata"
   when calling CRI RunPodSandbox

3. containerd maps handler to runtime config:
   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
     runtime_type = "io.containerd.kata.v2"

4. Create Task with corresponding shim binary:
   containerd-shim-kata-v2

5.2 Default Runtime

Default runtime configuration:

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

If the Pod has no runtimeClassName, the default runtime (runc) is used.
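The handler lookup and default fallback amount to a table keyed by the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.*] sections. A minimal sketch (types and function names are illustrative, not containerd's internal API):

```go
package main

import "fmt"

// runtimeConfig stands in for a [plugins...runtimes.*] config section.
type runtimeConfig struct {
	RuntimeType string // selects the shim binary
}

var runtimes = map[string]runtimeConfig{
	"runc": {RuntimeType: "io.containerd.runc.v2"},
	"kata": {RuntimeType: "io.containerd.kata.v2"},
}

// resolveRuntime maps the runtime_handler from RunPodSandbox to a
// runtime config, falling back to the default when no runtimeClassName
// was set on the Pod (empty handler).
func resolveRuntime(handler, defaultName string) (runtimeConfig, error) {
	if handler == "" {
		handler = defaultName
	}
	rc, ok := runtimes[handler]
	if !ok {
		return runtimeConfig{}, fmt.Errorf("no runtime configured for handler %q", handler)
	}
	return rc, nil
}

func main() {
	rc, _ := resolveRuntime("kata", "runc")
	fmt.Println(rc.RuntimeType) // io.containerd.kata.v2
	rc, _ = resolveRuntime("", "runc")
	fmt.Println(rc.RuntimeType) // io.containerd.runc.v2
}
```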

5.3 RuntimeClass Overhead

RuntimeClass resource overhead:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"

Overhead processing:
  - kubelet adds overhead to Pod resources
  - Scheduler includes overhead when selecting nodes
  - Reflects fixed costs of VM-based runtimes
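The accounting is a straightforward sum. A sketch of how the scheduler-visible memory request is computed for the kata RuntimeClass above (function name is illustrative):

```go
package main

import "fmt"

// podMemoryRequest sketches how RuntimeClass overhead is accounted:
// the scheduler sees the sum of container requests plus the pod-fixed
// overhead declared on the RuntimeClass.
func podMemoryRequest(containerRequestsMi []int64, overheadMi int64) int64 {
	var total int64
	for _, r := range containerRequestsMi {
		total += r
	}
	return total + overheadMi
}

func main() {
	// Two containers requesting 256Mi and 128Mi under the kata
	// RuntimeClass above (podFixed memory overhead: 160Mi).
	fmt.Println(podMemoryRequest([]int64{256, 128}, 160), "Mi") // 544 Mi
}
```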

6. NRI (Node Resource Interface)

6.1 NRI Overview

NRI is a plugin extension mechanism for containerd that allows registering hooks on container lifecycle events:

NRI architecture:

kubelet -> containerd
               |
               +-- NRI Plugin 1 (resource allocation)
               +-- NRI Plugin 2 (topology awareness)
               +-- NRI Plugin 3 (monitoring)

NRI plugins receive container lifecycle events and
can modify the OCI spec.

6.2 NRI Hook Points

NRI hook points:

1. RunPodSandbox:
   - Called on Pod creation
   - Pod-level resource allocation

2. CreateContainer:
   - Called on container creation
   - Can modify OCI spec
   - CPU pinning, memory NUMA allocation, etc.

3. StartContainer:
   - Called on container start

4. UpdateContainer:
   - Called on resource update

5. StopContainer:
   - Called on container stop
   - Resource release

6. RemoveContainer:
   - Called on container removal
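The hook pattern is: receive a lifecycle event, return an adjustment to the container's configuration. A hypothetical, heavily simplified sketch of that pattern (the real plugin API lives in the github.com/containerd/nri module; these types are stand-ins):

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the NRI plugin API. Only the
// hook-point pattern is shown: receive an event, return an adjustment.
type Container struct {
	Name   string
	CPUSet string // cpuset.cpus cgroup value
}

type Adjustment struct {
	CPUSet string
}

type Plugin interface {
	CreateContainer(c Container) (Adjustment, error)
}

// cpuPinner pins every container it sees to a fixed set of cores --
// the kind of policy a topology-aware NRI plugin would implement.
type cpuPinner struct{ cores string }

func (p cpuPinner) CreateContainer(c Container) (Adjustment, error) {
	return Adjustment{CPUSet: p.cores}, nil
}

func main() {
	var pl Plugin = cpuPinner{cores: "0-3"}
	adj, _ := pl.CreateContainer(Container{Name: "app"})
	fmt.Println("pin to cpus:", adj.CPUSet)
}
```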

6.3 NRI Use Cases

NRI use cases:

1. CPU/memory topology-aware allocation:
   - NUMA-aware CPU pinning
   - Allocate memory to specific NUMA nodes
   - Integration with topology manager

2. Device resource management:
   - GPU allocation optimization
   - RDMA resource management
   - Complements the device plugin framework

3. Security policy enforcement:
   - Dynamic Seccomp profiles
   - Runtime security rule injection

4. Monitoring/auditing:
   - Container start/stop event logging
   - Resource usage tracking

7. Image Service

7.1 Image Pull

CRI PullImage processing:

1. kubelet calls PullImage(imageSpec, authConfig)
        |
        v
2. containerd resolves image reference
   - Tag or digest
   - Apply registry auth credentials
        |
        v
3. Download image
   - Manifest, Config, Layers
   - Store in k8s.io namespace
        |
        v
4. Unpack layers
   - Create snapshot chain via Snapshotter
        |
        v
5. Return image reference (imageRef)

7.2 Image Caching

containerd image caching:
  - Skip download if layer already exists in Content Store
  - Skip unpacking if snapshot already exists in Snapshotter
  - Accurate deduplication based on digests

kubelet image policy:
  imagePullPolicy: Always
    -> Always check registry manifest (layers leverage cache)
  imagePullPolicy: IfNotPresent
    -> Pull only if not available locally
  imagePullPolicy: Never
    -> Use local images only
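The policy decision itself is a small branch on top of containerd's caching. A sketch (function name is illustrative, not kubelet's internal API):

```go
package main

import "fmt"

// shouldPull sketches kubelet's imagePullPolicy decision. Even when a
// pull is attempted, containerd's digest-based dedup means layers that
// already exist in the Content Store are not re-downloaded.
func shouldPull(policy string, presentLocally bool) bool {
	switch policy {
	case "Always":
		return true // still cheap if the manifest digest is unchanged
	case "IfNotPresent":
		return !presentLocally
	case "Never":
		return false
	}
	return !presentLocally // default behaves like IfNotPresent
}

func main() {
	fmt.Println(shouldPull("Always", true))       // true
	fmt.Println(shouldPull("IfNotPresent", true)) // false
	fmt.Println(shouldPull("Never", false))       // false
}
```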

8. Monitoring and Debugging

8.1 CRI Metrics

containerd CRI-related metrics:

container_runtime_cri_operations_total:            CRI operation count
container_runtime_cri_operations_errors_total:     CRI operation error count
container_runtime_cri_operations_latency_seconds:  CRI operation latency

containerd internal metrics:
  containerd_task_count:                  Running Task count
  containerd_container_count:             Container count
  containerd_image_pull_duration_seconds: Image pull duration

8.2 Debugging Tools

Debugging tools:

1. crictl (CRI CLI):
   crictl ps              # List containers
   crictl pods            # List pods
   crictl images          # List images
   crictl inspect CONTAINER_ID  # Container details
   crictl logs CONTAINER_ID     # Container logs
   crictl exec -it CONTAINER_ID /bin/sh  # exec

2. ctr (containerd CLI):
   ctr -n k8s.io containers list
   ctr -n k8s.io tasks list
   ctr -n k8s.io images list

3. containerd logs:
   journalctl -u containerd -f

9. Summary

containerd's CRI implementation is the core interface between Kubernetes and container runtimes. Pod-level isolation via Pod Sandbox, accurate CRI-to-OCI spec translation, WebSocket/SPDY-based streaming, multi-runtime support via RuntimeClass, and flexible extension via NRI are its key features. This layered design establishes containerd as a reliable container runtime for Kubernetes.