containerd Image Management: OCI Images and Snapshots

The containerd image management subsystem handles storing, distributing, and unpacking images based on the OCI image spec. This post analyzes the internals of the Content Store, the Snapshotter, the image pull flow, and garbage collection.


1. OCI Image Spec

1.1 Image Structure

An OCI image consists of three core components:

OCI image structure:

1. Image Index (Fat Manifest)
   - Supports multiple platforms (linux/amd64, linux/arm64, etc.)
   - Points to per-platform Manifests

2. Image Manifest
   - Digest of Config object
   - Layer list (ordered)
   - Media type information

3. Image Config
   - Environment variables, entrypoint, CMD
   - Layer diff ID list
   - Creation history

1.2 Content Addressable Storage

Content Addressable Storage:

All objects are identified by SHA256 digest:
  sha256:abc123... -> Image Index JSON
  sha256:def456... -> Image Manifest JSON
  sha256:789ghi... -> Image Config JSON
  sha256:jkl012... -> Layer tar.gz

Benefits:
  - Deduplication: identical layers stored only once
  - Integrity verification: validate data via digest
  - Caching: digest-based cache lookups

2. Content Store

2.1 Overview

The Content Store is containerd's content-addressable storage, managing all binary data for images.

Content Store directory structure:

/var/lib/containerd/io.containerd.content.v1.content/
  blobs/
    sha256/
      abc123...  (Image Index)
      def456...  (Image Manifest)
      789ghi...  (Image Config)
      jkl012...  (Layer 1 tar.gz)
      mno345...  (Layer 2 tar.gz)
  ingest/
    (temporary download data)

2.2 Content Store API

Content Store key operations:

Info(digest)     -> Query content metadata (size, creation time)
ReaderAt(digest) -> Read content (io.ReaderAt interface)
Writer(ref)      -> Write content (atomic commit)
Delete(digest)   -> Delete content
ListStatuses()   -> Query in-progress writes
Abort(ref)       -> Cancel in-progress write

2.3 Ingest Process

Content write (Ingest) process:

1. Create Writer (assign reference key)
        |
        v
2. Create temporary file in ingest/ directory
        |
        v
3. Stream data writes
   (e.g., downloading layers from registry)
        |
        v
4. Digest verification
   (compare expected digest with actual data hash)
        |
        v
5. Atomic commit
   (move from ingest/ -> blobs/sha256/)
        |
        v
6. Clean up ingest/ on failure

3. Snapshotter

3.1 Snapshotter Overview

The Snapshotter is a plugin that manages image layers as filesystem snapshots, preparing the root filesystem for containers.

Snapshotter role:

Image layers (tar.gz)
    |
    v
Snapshotter converts each layer into a snapshot
    |
    v
Stacks snapshots to form a unified filesystem
    |
    v
Provides mount point to container

3.2 Snapshot Types

Snapshot types:

1. Committed
   - Read-only snapshot
   - Corresponds to image layers
   - Sharable across multiple containers

2. Active
   - Read/write snapshot
   - Writable layer for a container
   - Assigned to a single container

3.3 overlayfs Snapshotter

The most widely used Snapshotter:

overlayfs operation:

Layer 1 (base): /snapshots/1/fs  (lowerdir)
Layer 2 (app):  /snapshots/2/fs  (lowerdir)
Write layer:    /snapshots/3/fs  (upperdir)
Work directory: /snapshots/3/work (workdir)

Mount:
  mount -t overlay overlay \
    -o lowerdir=/snapshots/2/fs:/snapshots/1/fs,upperdir=/snapshots/3/fs,workdir=/snapshots/3/work \
    /container/rootfs

Benefits:
  - Copy-on-Write: copies only on modification
  - Fast container startup
  - Layer sharing saves disk space

3.4 native Snapshotter

native Snapshotter:

- Stores each snapshot in an independent directory
- Fully copies parent snapshot (using hardlinks)
- Used in environments without overlayfs support
- Higher disk usage
- Simple and highly portable

3.5 devmapper Snapshotter

devmapper Snapshotter:

- Uses Linux device mapper thin provisioning
- Block-level Copy-on-Write
- Suited for high-performance workloads
- Used with Firecracker microVMs
- Complex setup (requires thin-pool pre-configuration)

Use cases:
  - AWS Fargate (Firecracker)
  - High-performance container environments
  - Block storage-based infrastructure

3.6 Snapshotter API

Snapshotter key operations:

Stat(key)              -> Query snapshot info
Prepare(key, parent)   -> Create Active snapshot (writable)
View(key, parent)      -> Read-only view of Committed snapshot
Commit(name, key)      -> Convert Active snapshot to Committed
Mounts(key)            -> Return mount info for snapshot
Remove(key)            -> Delete snapshot

4. Image Pull Flow

4.1 Complete Flow

Image pull complete flow:

1. Resolve image reference
   docker.io/library/nginx:latest
        |
        v
2. Download Image Index/Manifest
   - Fetch manifest from registry
   - Select manifest for target platform
        |
        v
3. Download Config
   - Download image config JSON
   - Store in Content Store
        |
        v
4. Download layers (parallel)
   - Store each layer in Content Store
   - Skip already existing layers
        |
        v
5. Unpack layers
   - Read layers from Content Store
   - Create snapshots via Snapshotter
        |
        v
6. Register image metadata
   - Create image record in BoltDB
   - Map tags to digests

4.2 Layer Download Details

Layer download:

1. Extract layer digest list from manifest
2. Check if already exists in Content Store
3. Download only missing layers from registry
4. Transfer Service manages downloads:
   - Concurrent download limit (default 3)
   - Progress tracking
   - Retry logic
5. Each layer stored gzip-compressed in Content Store

4.3 Layer Unpacking

Layer unpacking:

1. Read layer blob from Content Store
2. Decompress gzip
3. Extract tar archive
4. Create snapshot in Snapshotter:
   a. First layer: Prepare without parent
   b. Apply layer contents to snapshot
   c. Commit to convert to read-only
   d. Next layer: Prepare with previous snapshot as parent
5. Complete final snapshot chain

Snapshot chain:
  Layer 1 (committed) <- Layer 2 (committed) <- Layer 3 (committed)

5. Image Metadata

5.1 Image Record

Image metadata (BoltDB):

Image record:
  - Name: "docker.io/library/nginx:latest"
  - Target:
      MediaType: "application/vnd.oci.image.index.v1+json"
      Digest: "sha256:abc123..."
      Size: 1234
  - Labels:
      "containerd.io/gc.ref.content.0": "sha256:def456..."
      "containerd.io/gc.ref.content.1": "sha256:789ghi..."
  - CreatedAt: 2026-03-20T00:00:00Z
  - UpdatedAt: 2026-03-20T00:00:00Z

5.2 Querying Images

# List images with ctr
ctr -n k8s.io images list

# Verify that image content is complete and unpacked
ctr -n k8s.io images check

# Inspect image content
ctr -n k8s.io content get sha256:abc123... | jq .

6. Garbage Collection

6.1 GC Mechanism

Garbage collection operation:

1. Identify root objects:
   - Image records
   - Container records
   - Lease records

2. Trace references (Mark):
   - Image -> Manifest -> Config + Layers
   - Container -> Snapshot chain
   - Lease -> Protected resources

3. Delete unreferenced objects (Sweep):
   - Delete unreferenced blobs from Content Store
   - Delete unreferenced snapshots from Snapshotter
   - Clean up orphaned metadata records

6.2 GC Labels

GC reference labels:

containerd manages GC references via labels:

Image labels:
  "containerd.io/gc.ref.content.0": "sha256:..."  (manifest reference)
  "containerd.io/gc.ref.content.1": "sha256:..."  (layer reference)

Content labels:
  "containerd.io/gc.ref.content.config": "sha256:..." (config reference)
  "containerd.io/gc.ref.content.l.0": "sha256:..."    (layer reference)

Snapshot labels:
  "containerd.io/gc.ref.snapshot.overlayfs": "sha256:..." (snapshot reference)

6.3 Lease

Lease:

- Protects in-progress operation resources from GC
- Protects downloaded layers during image pull
- Protects snapshots during container creation
- TTL-based automatic expiration
- Can be explicitly deleted after operation completes

Example:
  Image pull starts -> Lease created
  Layer download -> Lease protects content
  Image registration complete -> Lease deleted (image record holds references)

6.4 GC Scheduling

GC triggers:

1. Periodic execution:
   - Paced by the GC scheduler plugin (io.containerd.gc.v1.scheduler) in the containerd config
   - Tuned via options such as pause_threshold, deletion_threshold, and startup_delay rather than a fixed interval

2. Event-based:
   - On image deletion
   - On container deletion
   - Explicit API call

3. Manual execution via ctr:
   ctr -n k8s.io content prune

7. Summary

containerd image management is built on three pillars: Content Store's content-addressable storage, Snapshotter's layer management, and GC's resource cleanup. The overlayfs Snapshotter's Copy-on-Write mechanism enables fast container startup and efficient disk usage, while Lease-based GC protection ensures image operation safety.