Docker BuildKit & Image Layers Complete Guide 2025: LLB, Cache Mount, Multi-Stage, OCI, Build Optimization Deep Dive

Introduction: The Evolution of docker build

10 Years Ago vs Today

Docker build in 2015:

$ docker build -t myapp .
Sending build context to Docker daemon  45.2MB
Step 1/12 : FROM node:14
Step 2/12 : WORKDIR /app
Step 3/12 : COPY package.json .
...
  • Sequential execution: one layer at a time.
  • Build context transfer: all files sent to the daemon.
  • Cache breaks: a single file change forces a full rebuild.
  • 5-10 minutes every time.

The same build in 2025 (BuildKit):

$ DOCKER_BUILDKIT=1 docker build -t myapp .
[+] Building 12.3s (15/15) FINISHED
 => [internal] load build definition       0.0s
 => [internal] load metadata                0.3s
 => [build 1/4] FROM node:14                0.0s (cached)
 => [build 2/4] COPY package.json .         0.1s
 => [build 3/4] RUN --mount=type=cache... npm install   8.2s
 ...
  • Parallel execution: independent steps at the same time.
  • Cache mount: reuse npm/maven/pip.
  • Fine-grained caching: only changed files are invalidated.
  • Seconds.

Same Dockerfile, 10x faster or more. The key difference: BuildKit.

What This Article Covers

  1. Docker image structure: OCI spec.
  2. Layers and UnionFS: the foundation of containers.
  3. Legacy builder vs BuildKit: what changed.
  4. LLB (Low-Level Builder): BuildKit's DSL.
  5. Multi-stage builds: small and secure.
  6. Cache strategies: layer, mount, registry.
  7. Reproducible builds.
  8. Security: distroless, scanning, SBOM.
  9. Practical optimization techniques.

1. The OCI Image Format

Open Container Initiative

The OCI (Open Container Initiative) is a Linux Foundation project defining container standards. Two major specifications:

  1. Runtime Spec: how to run containers.
  2. Image Spec: image format.

Docker, Podman, containerd, and CRI-O all comply with OCI, so their images are mutually interoperable.

Structure of an OCI Image

An OCI image is a collection of a few files:

manifest.json          # Image metadata
config.json            # Container configuration
layers/                # Filesystem layers
├── blob1.tar.gz
├── blob2.tar.gz
└── blob3.tar.gz

Manifest

Image manifest: a list of all components of the image.

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",
    "size": 1234
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:def456...",
      "size": 54321
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:ghi789...",
      "size": 123456
    }
  ]
}

Each entry is content-addressable: identified by SHA-256 digest.
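
You can fetch this structure straight from a registry; a quick sketch, assuming Docker Buildx is installed. For a multi-platform tag the raw output is an image index rather than a single manifest; pass a platform-specific digest to see a manifest shaped like the JSON above:

docker buildx imagetools inspect --raw node:20
docker buildx imagetools inspect --raw node:20@sha256:...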

Image Config

{
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "Env": ["PATH=/usr/local/sbin:..."],
    "Cmd": ["node", "server.js"],
    "WorkingDir": "/app",
    "ExposedPorts": {
      "3000/tcp": {}
    }
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:aaa...",
      "sha256:bbb..."
    ]
  },
  "history": [
    {
      "created": "2025-01-01T00:00:00Z",
      "created_by": "/bin/sh -c #(nop) ADD file:... in /"
    }
  ]
}
  • rootfs.diff_ids: hashes of layers after decompression. Different from manifest digests.
  • history: creation history of each layer.
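
Both pieces of the config can be checked locally once the image is pulled; a small sketch:

# diff_ids: uncompressed layer digests recorded in the image config
docker image inspect --format '{{json .RootFS.Layers}}' node:20

# history: the command that produced each layer
docker history --no-trunc node:20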

Content Addressable Storage

Every blob is identified by a hash:

  • Layer file: sha256:def456...
  • Config: sha256:abc123...
  • Manifest: itself a hash.

Benefits:

  • Deduplication: identical layers stored only once.
  • Integrity: verified by hash.
  • Immutability: same hash = same content.

Image Index (Multi-arch)

Images for multiple platforms:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:amd64...",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:arm64...",
      "platform": {
        "architecture": "arm64",
        "os": "linux"
      }
    }
  ]
}

On docker pull, the manifest matching the host platform is selected.
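
You can list the per-platform manifests behind a tag without pulling anything; a hedged example using Buildx:

docker buildx imagetools inspect node:20
# Prints the index digest plus one manifest entry per platform
# (linux/amd64, linux/arm64, ...).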


2. Layers and UnionFS

The Nature of Layers

Each layer of a Docker image is a tar archive holding filesystem changes (deltas).

Example:

FROM alpine:3.19       # Layer 1: base Alpine filesystem
COPY app.js /app/      # Layer 2: adds /app/app.js
RUN apk add nodejs     # Layer 3: nodejs package + dependencies
RUN chmod +x /app/start.sh  # Layer 4: file permission change

Each layer contains only the change from the previous state:

  • Layer 1: all Alpine files.
  • Layer 2: one file, /app/app.js.
  • Layer 3: /usr/bin/node, libraries, and hundreds of files.
  • Layer 4: just metadata for /app/start.sh.

Union Filesystem

A union filesystem (OverlayFS on modern systems; AUFS historically) merges multiple layers into one unified view:

container writable layer (read-write, upper)
Layer 3 (read-only)
Layer 2 (read-only)
Layer 1 (read-only)
        │  OverlayFS merges all of the above
        ▼
merged view: /app/app.js, /usr/bin/node, ...

Behavior:

  • Read: searches from the top layer down. Returns when found.
  • Write: stores to the topmost writable layer (copy-on-write).
  • Delete: marked with a whiteout file.
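
The same mechanics can be reproduced by hand with a plain OverlayFS mount; a minimal experiment (requires root, directory names are arbitrary):

mkdir -p lower1 lower2 upper work merged
echo "from lower1" > lower1/a.txt
echo "from lower2" > lower2/b.txt

# lowerdirs act as read-only "image layers"; upper is the writable layer
sudo mount -t overlay overlay \
  -o lowerdir=lower2:lower1,upperdir=upper,workdir=work merged

ls merged/                # a.txt and b.txt appear side by side
echo hi >> merged/a.txt   # copy-on-write: a.txt is copied up into upper/
rm merged/b.txt           # delete: a whiteout entry is created in upper/
ls lower1 lower2          # the original "layers" are untouched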

Copy-on-Write

When modifying a file:

  1. Read the file from the source layer.
  2. Copy to the writable layer.
  3. Modify the copy.
  4. The original is unchanged.

Result:

  • Multiple containers share the same image.
  • Each container keeps only its own changes.
  • Hundreds of containers using the same image → stored once.

Layer Sharing

An important benefit: multiple images share layers.

Image A: ubuntu:22.04 + node + myapp
Image B: ubuntu:22.04 + node + other-app

Shared layers:
- ubuntu:22.04 (stored once)
- node (stored once)

Unique layers:
- myapp (A only)
- other-app (B only)

Disk savings:

  • 10 images sharing the same base → 1x base + 10x app layers.
  • Tens of GB becomes a few GB.
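
docker system df shows this sharing directly; the verbose form breaks each image's size into shared and unique bytes:

docker system df -v
# The images table lists SHARED SIZE (layers reused by other images)
# and UNIQUE SIZE (layers only this image owns).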

Trade-offs in Layer Count

More layers:

  • Fine-grained build cache → faster builds.
  • Parallel downloads on pull possible.
  • But metadata overhead per layer.

Fewer layers:

  • Smaller images.
  • Pull may be more efficient.
  • Lower cache granularity.

Practical recommendation: 10-20 layers. Not too few, not too many.

Layer Size Limits

Technically no limit. However:

  • Single layer > 10 GB: problems possible.
  • Total image > 5 GB: pull time issues.
  • File count > 1 million: performance degradation.

In practice, a few GB or less and tens of thousands of files is appropriate.


3. Legacy Builder vs BuildKit

Legacy Docker Builder

Docker's pre-2017 build method:

Dockerfile → Docker Daemon → Sequential steps

Problems:

  1. Sequential only: independent steps also run sequentially.
  2. Client-server architecture: all files transferred to the daemon.
  3. Monolithic: no build graph.
  4. Limited caching: layer-level only.
  5. Hard to extend.

The Arrival of BuildKit

Announced in 2017 and shipped as an opt-in builder in Docker 18.06 (2018). Now the default builder in Docker Desktop and Docker Engine 23+, and also used by other tooling such as Docker Buildx, nerdctl (containerd), and standalone buildctl.

Key innovations:

  1. Build Graph: expresses builds as a DAG.
  2. Parallel execution: independent steps built simultaneously.
  3. Cache mount: cache directories used during builds.
  4. Secret mount: secrets used during builds (not left in the image).
  5. Multi-stage optimization: only required stages are built.
  6. Remote cache: registry as cache.
  7. Multi-platform: multiple archs at once.

Enabling BuildKit

Docker 23+: enabled by default.

Explicit enablement (older versions):

export DOCKER_BUILDKIT=1
docker build .

Daemon config file:

// /etc/docker/daemon.json
{
  "features": {
    "buildkit": true
  }
}

Performance Differences

Actual measurements of the same Dockerfile:

Build                      Legacy   BuildKit
Cold (no cache)            180s     120s
Warm (1 file changed)      180s     15s
Warm (independent steps)   60s      5s

The difference made by BuildKit's fine-grained caching and parallelization.


4. BuildKit's LLB

Low-Level Builder

LLB (Low-Level Builder) is BuildKit's internal representation:

Dockerfile → Parser → LLB → Executor → Image

LLB is a low-level DSL expressing a build graph. Dockerfile is one of several frontends. Other frontends are possible:

  • Dockerfile frontend: default.
  • Buildpack frontend: Cloud Native Buildpacks.
  • Custom frontend: your own DSL.
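
The frontend is chosen per build. With the Dockerfile frontend, the first line of the Dockerfile pins the frontend image that BuildKit pulls and runs before parsing; a minimal sketch:

# syntax=docker/dockerfile:1.6
FROM node:20-alpine
# Features like RUN --mount below are implemented by the frontend,
# not by the Docker daemon itself.
RUN --mount=type=cache,target=/root/.npm npm --version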

DAG-based Builds

BuildKit converts the Dockerfile into a DAG:

FROM node:18 AS base
RUN apt-get update && apt-get install -y python3

FROM base AS deps
COPY package.json .
RUN npm install

FROM base AS build
COPY . .
RUN npm run build

FROM nginx:alpine AS runtime
COPY --from=build /app/dist /usr/share/nginx/html

DAG representation:

        node:18 (FROM)
              │
            base
        ┌─────┴─────┐
        ▼           ▼
      deps        build         nginx:alpine (FROM)
        │           │                  │
        ▼           │                  ▼
    [unused]        └─ COPY --from ─▶ runtime (final image)

Note: the deps stage is not in the final image. BuildKit detects this and skips the build. The legacy builder would build it unconditionally.

Parallelism

Independent stages run in parallel:

FROM alpine AS a
RUN sleep 10

FROM alpine AS b
RUN sleep 10

FROM alpine
COPY --from=a /tmp /a
COPY --from=b /tmp /b
  • Legacy: 20s (10 + 10).
  • BuildKit: 10s (parallel).

Using LLB Directly

You can generate LLB with Go code:

import "github.com/moby/buildkit/client/llb"

st := llb.Image("alpine:3.19").
    Run(llb.Shlex("apk add --no-cache curl")).
    Root()

def, err := st.Marshal(ctx)
// Send to BuildKit

Advanced use cases: tools like Dagger and Earthly generate LLB directly and use BuildKit as their execution backend.


5. Multi-Stage Builds

Why Multi-Stage

Problem: build tools end up in the image.

# Bad example
FROM golang:1.21
WORKDIR /app
COPY . .
RUN go build -o myapp

CMD ["./myapp"]

Resulting image: ~1 GB (Go compiler, source, and dependencies all included).

Solution: multi-stage build.

# Build stage
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp

# Runtime stage
FROM alpine:3.19
COPY --from=builder /app/myapp /myapp
CMD ["/myapp"]

Resulting image: ~10 MB (Alpine + binary only).

100x smaller. Build tools exist only in the builder stage, not in the final image.

COPY --from

Copy files from another stage or image:

FROM alpine AS tools
RUN apk add --no-cache curl jq

FROM scratch
COPY --from=tools /usr/bin/curl /usr/bin/curl
COPY --from=tools /usr/bin/jq /usr/bin/jq
# Only the binaries you need

FROM scratch
COPY --from=python:3.11 /usr/local/bin/python3 /usr/local/bin/python3
# Works from registry images too

Stage Independence

Each stage is independent:

  • Different base images.
  • Different tools.
  • Different purposes.

Build perspective: BuildKit detects inter-stage dependencies. Builds only what's needed.

Common Patterns

1. Compiled languages (Go, Rust, C):

FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/myapp /usr/local/bin/
CMD ["myapp"]

2. Node.js:

FROM node:20 AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci

FROM node:20 AS build
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]

3. Python:

FROM python:3.11 AS builder
COPY requirements.txt .
RUN pip install --target=/deps -r requirements.txt

FROM python:3.11-slim
COPY --from=builder /deps /usr/local/lib/python3.11/site-packages
COPY . /app
WORKDIR /app
CMD ["python", "main.py"]

4. Java:

FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline
COPY src ./src
RUN mvn package -DskipTests

FROM eclipse-temurin:21-jre
COPY --from=builder /app/target/*.jar /app.jar
CMD ["java", "-jar", "/app.jar"]

Stage Reuse

Reference the same stage from multiple places:

FROM alpine AS common
RUN apk add --no-cache ca-certificates

FROM common AS app1
# ...

FROM common AS app2
# ...

The common stage is built only once and shared.


6. Cache Strategies

Layer Cache (default)

The order in the Dockerfile affects caching:

FROM node:20
COPY . .                    # Copy everything
RUN npm install             # Runs every time

Problem: changing just one line in source invalidates the COPY . cache, and npm install runs again.

Correct order:

FROM node:20
COPY package*.json ./       # Only dependency files
RUN npm install             # Install dependencies
COPY . .                    # Rest (cache-independent)

Principle: things that rarely change on top, things that change often on the bottom.

.dockerignore

Files to exclude from the build context:

node_modules
.git
*.log
.env
dist
coverage

Effects:

  • Smaller build context.
  • COPY . cache stability.
  • Prevent leaking secrets.

Cache Mount (BuildKit essential)

A cache mount is a cache directory used only during the build. It is not included in the layer.

# syntax=docker/dockerfile:1.6

FROM node:20
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm \
    npm install
COPY . .
RUN npm run build

Behavior:

  • /root/.npm is a persistent cache. Maintained across builds.
  • Reuses packages npm has already downloaded.
  • Not included in the image.

Performance difference:

  • First build: 60s.
  • Next build (cache used): 5s.

Language-specific Cache Mount Examples

Node.js (npm):

RUN --mount=type=cache,target=/root/.npm \
    npm ci

Node.js (pnpm):

RUN --mount=type=cache,target=/root/.local/share/pnpm/store \
    pnpm install --frozen-lockfile

Python (pip):

RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

Go:

RUN --mount=type=cache,target=/root/.cache/go-build \
    --mount=type=cache,target=/go/pkg/mod \
    go build -o /app ./...

Rust (cargo):

RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/target \
    cargo build --release

Java (Maven):

RUN --mount=type=cache,target=/root/.m2 \
    mvn package
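
apt (Debian/Ubuntu) can be cached the same way; a sketch (Debian-based images delete downloaded packages by default, hence the docker-clean removal; sharing=locked serializes concurrent builds on the same cache):

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    rm -f /etc/apt/apt.conf.d/docker-clean && \
    apt-get update && apt-get install -y curl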

Registry Cache

Store the build cache in a registry to share with other machines:

docker buildx build \
  --cache-to type=registry,ref=myregistry.com/myapp:cache \
  --cache-from type=registry,ref=myregistry.com/myapp:cache \
  -t myapp:latest .

Useful in CI/CD:

  • Shares cache across CI machines.
  • Drastically reduces build times.
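
By default only the layers of the exported image are written to the cache (mode=min); mode=max also exports intermediate stages, at the cost of a larger cache ref:

docker buildx build \
  --cache-to type=registry,ref=myregistry.com/myapp:cache,mode=max \
  --cache-from type=registry,ref=myregistry.com/myapp:cache \
  -t myapp:latest .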

Inline Cache

Embed cache metadata in the image itself:

docker buildx build \
  --cache-to type=inline \
  --push \
  -t myregistry.com/myapp:latest .

Use a pulled image directly as cache.
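
On another machine, point --cache-from at the pushed image itself:

docker buildx build \
  --cache-from type=registry,ref=myregistry.com/myapp:latest \
  -t myregistry.com/myapp:latest .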

Git Repository as a Named Context

With Buildx, a remote Git repository (or another image) can be exposed to the build as a named context. A bind mount then makes its contents available during a RUN step without copying them into a layer; here, from=myorg/myrepo refers to that named context (or, if none is defined, to an image with that name):

# syntax=docker/dockerfile:1.6

FROM alpine
RUN --mount=type=bind,from=myorg/myrepo,target=/src \
    ...
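
The mapping from that name to an actual repository happens on the command line; a sketch assuming a Buildx version with named-context support:

docker buildx build \
  --build-context myorg/myrepo=https://github.com/myorg/myrepo.git \
  -t myapp .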

7. Secret Management

Problem: Secrets During Build

# Bad: token remains in the image
ARG GITHUB_TOKEN
RUN git clone https://$GITHUB_TOKEN@github.com/private/repo.git

Result:

  • Token exposed in docker history.
  • Stored in image layers.
  • Stays forever.

Secret Mount

BuildKit's secret mount:

# syntax=docker/dockerfile:1.6

FROM alpine
RUN --mount=type=secret,id=github_token \
    TOKEN=$(cat /run/secrets/github_token) && \
    git clone https://$TOKEN@github.com/private/repo.git

Build command:

echo $GITHUB_TOKEN | docker buildx build \
  --secret id=github_token,src=/dev/stdin \
  -t myapp .

Effect:

  • Secret used only during the build.
  • Not left in the image.
  • Not in any layer.
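
If the token already lives in an environment variable (typical in CI), newer Buildx releases can read the secret from the environment instead of a file; a hedged alternative:

docker buildx build \
  --secret id=github_token,env=GITHUB_TOKEN \
  -t myapp .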

SSH Mount

Pass the SSH agent socket to the build:

# syntax=docker/dockerfile:1.6

FROM alpine
RUN apk add --no-cache openssh-client git
RUN mkdir -p -m 0700 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
RUN --mount=type=ssh \
    git clone git@github.com:private/repo.git

Build command:

docker buildx build --ssh default -t myapp .

The SSH key is never included in the image.


8. Reproducible Builds

Problem: Non-Reproducibility

Same Dockerfile, same source → different image?

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl

Reasons:

  • State of apt repos at the time of apt-get update.
  • Package versions change over time.
  • Local environment differences (network, cache).

Result: the image built yesterday differs from the one built today. Problematic for security audits and debugging.

Requirements for Reproducible Builds

Fully reproducible:

  1. Pinned base image: FROM ubuntu:22.04@sha256:... (digest pinning).
  2. Pinned package versions: apt install curl=7.81.0-1ubuntu1.15.
  3. Use lockfiles: package-lock.json, requirements.txt (version pinning).
  4. Deterministic builds: time-independent.
  5. Clean environment: no external dependencies.

Digest Pinning

# Bad: tags can move
FROM node:20

# Good: digest is immutable
FROM node:20@sha256:8f0c5a7a1d0c7b8f...

Check digest:

docker pull node:20
docker images --digests | grep node

Package Pinning

apt (Debian/Ubuntu):

RUN apt-get update && apt-get install -y \
    curl=7.81.0-1ubuntu1.15 \
    && rm -rf /var/lib/apt/lists/*

pip:

requests==2.31.0
flask==3.0.0

npm:

{
  "dependencies": {
    "express": "4.18.2"
  }
}

Plus commit package-lock.json.

SOURCE_DATE_EPOCH

Toolchains can embed timestamps, build IDs, and absolute paths into binaries. Pin the timestamp with SOURCE_DATE_EPOCH and strip the rest with flags such as -trimpath:

FROM golang:1.21
ARG SOURCE_DATE_EPOCH=0
ENV SOURCE_DATE_EPOCH=$SOURCE_DATE_EPOCH
RUN go build -ldflags="-buildid=" -trimpath .

Verification

# Build multiple times
docker build -t myapp:1 .
docker build -t myapp:2 .

# Compare images
docker inspect myapp:1 | jq .RootFS.Layers
docker inspect myapp:2 | jq .RootFS.Layers

# If identical, it's reproducible

In practice: 100% reproducible builds are rare. Near reproducible is usually enough.


9. Image Security

Distroless Images

Google's distroless project: minimal images containing only the essential runtime.

FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp

FROM gcr.io/distroless/base-debian12
COPY --from=builder /app/myapp /
CMD ["/myapp"]

Characteristics:

  • No shell: no /bin/sh.
  • No package manager: no apt, apk.
  • Minimal size: tens of MB.
  • Minimized attack surface.
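
When you do need to poke around inside a distroless container, the project publishes :debug tag variants that add a busybox shell; use them for troubleshooting only, not for production images:

FROM gcr.io/distroless/base-debian12:debug
COPY --from=builder /app/myapp /
CMD ["/myapp"]
# Then: docker run --rm -it --entrypoint=sh myapp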

Scratch Image

Scratch: a fully empty image.

FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o myapp

FROM scratch
COPY --from=builder /app/myapp /
CMD ["/myapp"]

Result: just the binary. A few MB.

Caveats:

  • Static linking required (CGO_ENABLED=0).
  • Debugging is hard (no shell).
  • Copy CA certificates if you need them.

Image Scanning

Trivy (Aqua Security):

trivy image myapp:latest
  • CVE scanning.
  • Secret detection.
  • Config file analysis.
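
In CI you typically want the scan to fail the pipeline; a sketch using Trivy's severity filter and exit code:

trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:latest
# Non-zero exit when HIGH/CRITICAL CVEs are found, so the CI job fails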

Grype (Anchore):

grype myapp:latest

Other tools: Snyk, Clair, Docker Scout, etc.

SBOM (Software Bill of Materials)

A complete list of software in the image:

Generate:

syft myapp:latest -o spdx-json > sbom.json

Contents:

  • All packages and versions.
  • License info.
  • Dependency tree.

Uses:

  • Supply chain security.
  • License compliance.
  • CVE matching.
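
The SBOM can also be fed back into a scanner, so CVE matching does not require re-pulling the image; a hedged example with Grype:

syft myapp:latest -o spdx-json > sbom.json
grype sbom:./sbom.json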

Image Signing

Sign images with cosign:

# Sign
cosign sign --key cosign.key myregistry.com/myapp:latest

# Verify
cosign verify --key cosign.pub myregistry.com/myapp:latest

Keyless signing (OIDC):

cosign sign myregistry.com/myapp:latest
# Automatic signing via GitHub Actions OIDC

Part of the Sigstore project.


10. Practical Optimization

Minimize Layers

Bad:

RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y jq
RUN rm -rf /var/lib/apt/lists/*

Four layers. Intermediate state is preserved in layers.

Good:

RUN apt-get update && \
    apt-get install -y \
      curl \
      jq && \
    rm -rf /var/lib/apt/lists/*

One layer. A smaller image.

Remove Temporary Files

Install + clean up in the same layer:

RUN apt-get update && \
    apt-get install -y build-essential && \
    make && \
    apt-get purge -y build-essential && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*

Use .dockerignore

# Not needed for build
node_modules
.git
.github
*.log
*.md

# Security risk
.env
.env.local
**/secrets/

# Large temporary files
dist
build
coverage
.next

Effects:

  • Smaller build context.
  • Improved cache stability.
  • Improved security.

Small Base Images

# Bad: large base
FROM node:20              # ~1.1 GB

# Good: slim
FROM node:20-slim         # ~240 MB

# Better: alpine
FROM node:20-alpine       # ~180 MB

# Best: distroless
FROM gcr.io/distroless/nodejs20-debian12  # ~180 MB, no shell

Layer Order Optimization

Principle: order instructions from least frequently changing to most frequently changing.

FROM node:20-alpine

# 1. System dependencies (rarely change)
RUN apk add --no-cache libc6-compat

# 2. Package manager config (rarely change)
WORKDIR /app
COPY package.json package-lock.json ./

# 3. Install dependencies (only when package.json changes)
RUN --mount=type=cache,target=/root/.npm \
    npm ci --only=production

# 4. Application code (frequently changes)
COPY . .

# 5. Build (only when code changes)
RUN npm run build

CMD ["node", "dist/server.js"]

Multi-Platform Build

Build for multiple platforms at once with buildx:

docker buildx create --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t myregistry.com/myapp:latest \
  --push .

Result: AMD64 and ARM64 images are pushed as a single manifest list.

BuildKit + CI/CD

GitHub Actions example:

- uses: docker/setup-qemu-action@v3   # emulation for the non-native platform
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myregistry.com/myapp:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max
    platforms: linux/amd64,linux/arm64

type=gha: uses GitHub Actions cache. Shared cache across CI runs.


11. Common Pitfalls

Pitfall 1: Cache Invalidation

FROM node:20
WORKDIR /app
COPY . .  # One line changes, everything invalidated
RUN npm install

Fix: dependency files first.

Pitfall 2: Huge Build Context

$ docker build .
Sending build context: 500 MB  # Suspicious

Cause: no .dockerignore, so node_modules, .git are included.

Fix: add .dockerignore.

Pitfall 3: Leaking Secrets

ARG AWS_SECRET_KEY
ENV AWS_SECRET_KEY=$AWS_SECRET_KEY
# Forever in the image

Fix: --mount=type=secret.

Pitfall 4: Wrong Stage Choice

FROM node:20 AS builder
# ... build stuff ...

FROM node:20-alpine
COPY --from=builder /app .  # glibc binaries on alpine
# Won't run!

Fix: use base images with matching C libraries (both glibc or both musl), or ship a statically linked binary.

Pitfall 5: Platform Mismatch

Build on Apple Silicon:

docker build -t myapp .
# Built as arm64
docker push myregistry.com/myapp
# On the server: "exec format error"

Fix: specify --platform linux/amd64.

Pitfall 6: Root User

FROM node:20
# Runs as root by default
CMD ["node", "server.js"]

Problem: security risk.

Fix:

# pick a UID the base image doesn't already use (node images reserve 1000 for "node")
RUN useradd --create-home --uid 10001 appuser
USER appuser

Pitfall 7: Unnecessary Files

COPY . .  # Everything

Fix: copy only what you need.

COPY src ./src
COPY package.json tsconfig.json ./

12. Debugging

Verbose Build Logs

docker buildx build --progress=plain .
# Print all logs

Inspecting Intermediate Containers

Legacy builder:

docker commit <intermediate_container_id> debug-image
docker run -it debug-image sh

BuildKit does not keep intermediate containers around. Instead, comment out the failing instruction (or insert a temporary stage boundary just before it), rebuild, and inspect the state that was reached.

Or use --target to build up to a specific stage:

docker build --target builder -t myapp:debug .
docker run -it myapp:debug sh

Image Analysis

dive: explore the filesystem by layer.

dive myapp:latest
  • Layer-by-layer contents.
  • Wasted space.
  • Size statistics.

docker history:

docker history myapp:latest

Command and size of each layer.

skopeo: inspect images.

skopeo inspect docker://nginx:latest

Benchmark

time docker build -t myapp .

Use tools like Hyperfine for repeatable measurement.
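
A hedged hyperfine setup for warm- and cold-cache comparisons:

# warm cache: repeated builds with the cache intact
hyperfine 'docker build -q -t myapp .'

# cold cache: prune the build cache before every run
hyperfine --prepare 'docker builder prune -af' 'docker build -q -t myapp .'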


Quiz Review

Q1. What are three reasons BuildKit is faster than the legacy builder?

A.

1. Parallelism

The legacy builder runs the Dockerfile sequentially:

FROM → RUN 1 → RUN 2 → RUN 3 → ...

Each step waits for the previous one. Even independent work is sequential.

BuildKit is DAG-based:

FROM ─┬─ RUN 1 (independent)
      └─ RUN 2 (independent)
          RUN 3 (depends on 1, 2)

Independent steps run concurrently. Most obvious in multi-stage builds:

FROM alpine AS a
RUN sleep 10

FROM alpine AS b
RUN sleep 10

FROM alpine
COPY --from=a /x /
COPY --from=b /y /
  • Legacy: 20s (10 + 10).
  • BuildKit: 10s (parallel).

In practice, 2-5x speedup on complex builds with multiple stages.

2. Fine-grained Cache Invalidation

Legacy builder cache:

Step 1/5 : COPY . .        ← One line changes,
Step 2/5 : RUN npm install ← this reruns too
Step 3/5 : RUN npm build   ← and this too

Everything reruns.

BuildKit:

  • Cache mount keeps the npm cache persistent.
  • Per-file change tracking.
  • File-level cache hashes.
# BuildKit with cache mount
RUN --mount=type=cache,target=/root/.npm \
    npm ci

Result:

  • First build: 60s.
  • Next build: 5s (npm hits cache).

3. Skip Unnecessary Work

In multi-stage builds, BuildKit builds only the stages actually needed:

FROM node:20 AS deps
RUN npm install

FROM node:20 AS build
COPY --from=deps /app/node_modules .
RUN npm run build

FROM node:20 AS test
COPY --from=deps /app/node_modules .
RUN npm test   ← Not needed for default build

FROM node:20-slim
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]

Legacy: builds the test stage too.

BuildKit: the final image only references build, so test is skipped.

You can also use --target test to build only the test stage.

Performance comparison:

Complex Dockerfile with 10 stages:

Scenario                          Legacy       BuildKit
Cold build                        300s         180s
Warm build (source change)        300s         20s
Warm build (dependency change)    300s         60s
Single stage only                 Impossible   10s

Additional benefits:

4. Remote Cache: registry as cache:

docker buildx build \
  --cache-to type=registry,ref=myapp:cache \
  --cache-from type=registry,ref=myapp:cache

Cache shared across CI machines.

5. Secret Mount: secrets used only during build, not in the image.

6. Multi-platform: multiple architectures at once.

Takeaway:

BuildKit is not a simple upgrade but a fundamental redesign. The shift from viewing a Dockerfile as a "list of commands" to a "DAG" made all these optimizations possible.

Today (2025), Docker Desktop ships BuildKit by default, and even the docker build command uses BuildKit. The legacy builder is a thing of the past.

Practical recommendations:

  • All CI/CD: BuildKit is mandatory.
  • Local development: enabled by default (recent Docker).
  • Cache management: registry cache + gha cache.
  • Dockerfile optimization: use BuildKit features (# syntax=docker/dockerfile:1.6).

Migrating an old-style Dockerfile to BuildKit style alone can give 3-10x faster builds. Without any code changes.

Q2. How does a multi-stage build reduce image size by 100x?

A.

Basic idea: "separate build tools from runtime".

Problem scenario: building a Go app

Go apps need compilation. Building via Docker requires an image with the Go compiler:

Without multi-stage:

FROM golang:1.21
WORKDIR /app
COPY . .
RUN go build -o myapp
CMD ["./myapp"]

Result: ~1.1 GB image.

Why so big:

  • Go compiler + standard library: ~600 MB.
  • Build cache (/go/pkg): ~200 MB.
  • Source + dependencies: ~100 MB.
  • OS (Debian base): ~200 MB.
  • Only the binary is needed at runtime (~10 MB).

99% is waste.

Multi-stage solution:

# Stage 1: build
FROM golang:1.21 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o myapp

# Stage 2: runtime
FROM alpine:3.19
COPY --from=builder /app/myapp /usr/local/bin/
CMD ["myapp"]

Result: ~17 MB image.

  • Alpine base: 5 MB.
  • Go binary: 10 MB.
  • CA certificates, tzdata, etc.: 2 MB.

65x smaller. In some cases, up to 100x.

How it works:

The core of multi-stage builds:

  1. Each stage is independent: can use different base images.
  2. COPY --from=stage: copy selectively from an earlier stage.
  3. Final stage becomes the resulting image: earlier stages are not included.

Internally:

BuildKit converts the Dockerfile into a DAG:

builder (FROM golang:1.21)
   │ RUN, COPY, ...
   └→ /app/myapp (artifact)
                       │ COPY --from=builder
runtime (FROM alpine:3.19)
   └→ final image (deployment)

The builder stage is intermediate. Not included in the final image. Invisible in docker images after the build.

More extreme: scratch

FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o myapp

FROM scratch
COPY --from=builder /app/myapp /
CMD ["/myapp"]

Result: ~10 MB (binary only).

  • scratch: a completely empty image. 0 bytes.
  • Only the binary is needed.
  • No shell, no libraries.

Constraints:

  • Static linking required (CGO_ENABLED=0).
  • No dynamic dependencies.
  • Hard to debug.

Real example: Rust

# Stage 1: cargo dependency cache
FROM rust:1.75 AS chef
WORKDIR /app
RUN cargo install cargo-chef

FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
RUN cargo build --release

# Stage 2: minimal runtime
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/myapp /usr/local/bin/
CMD ["myapp"]

Result: 80 MB vs 1.5 GB (full Rust image).

Complex Node.js example:

# Stage 1: dev dependencies + build
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci  # install all deps (including dev)
COPY . .
RUN npm run build  # TypeScript compile, etc.

# Stage 2: production deps only
FROM node:20 AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

# Stage 3: minimal runtime
FROM node:20-slim
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package*.json ./
CMD ["node", "dist/server.js"]

Result:

  • node:20: 1.1 GB
  • node:20 + all deps + build: 1.5 GB
  • This setup: 350 MB

4x smaller. With further optimization (alpine), down to 150 MB.

Benefits summary:

1. Pull speed:

  • 1 GB image pull: 30-60s.
  • 100 MB image pull: 3-5s.
  • Big impact on CI/CD and deployment speed.

2. Network cost:

  • AWS egress: $0.09/GB.
  • 1000 pull/day x 1 GB = $90/day.
  • 100 MB image: $9/day.
  • 1/10 the cost.

3. Storage:

  • Registry storage.
  • Node's image cache.
  • Smaller means savings.

4. Security:

  • Smaller attack surface.
  • Unused binaries = potential CVEs.
  • Scratch/distroless = minimal attack surface.

5. Startup time:

  • Image pull + startup.
  • Small image = fast startup.
  • Important for serverless and scale-out.

Pitfalls and solutions:

Pitfall 1: missing libraries

Binary not statically linked:

./myapp: error while loading shared libraries: libcurl.so.4

Fix:

  • Static linking (-static, CGO_ENABLED=0).
  • Or copy required libraries.

Pitfall 2: CA certificates

On HTTPS requests:

x509: certificate signed by unknown authority

Fix:

FROM alpine:3.19
RUN apk add --no-cache ca-certificates

Or:

FROM scratch
COPY --from=builder /etc/ssl/certs /etc/ssl/certs
COPY --from=builder /app/myapp /

Pitfall 3: timezone

FROM scratch
COPY --from=builder /usr/share/zoneinfo /usr/share/zoneinfo
ENV TZ=UTC

Pitfall 4: DNS

Some languages need /etc/nsswitch.conf:

COPY --from=builder /etc/nsswitch.conf /etc/nsswitch.conf

Pitfall 5: no shell

Scratch has no shell. Exec form is required:

CMD ["/myapp"]  # exec form (OK)
# The following won't work
# CMD /myapp    # shell form (NO)

Practical recommendations:

Language   Recommended image      Size
Go         scratch                ~10 MB
Rust       debian-slim            ~80 MB
Java       eclipse-temurin-jre    ~250 MB
Node.js    node:alpine            ~150 MB
Python     python:slim            ~120 MB
C/C++      alpine                 ~5 MB

Takeaway:

Multi-stage builds are one of Docker's biggest innovations (introduced in Docker 17.05, 2017). A single Dockerfile expresses the entire build pipeline while keeping the final image minimal.

It is a clever way around a fundamental limitation of single-stage builds: previously you had to maintain separate "build" and "runtime" images and shuttle artifacts between them. Complicated.

With multi-stage:

  • One Dockerfile.
  • Simplified CI/CD.
  • Optimal image size.
  • Reproducible.

Today, 99% of production Dockerfiles are multi-stage. Not using it requires a reason.

If your Dockerfile is single-stage, switching to multi-stage is the easiest optimization. The image almost always gets smaller and more secure.


Closing: The Evolution of Builds

Key Points

  1. OCI spec: container standards. Followed by all tools.
  2. Layers + UnionFS: sharing and efficiency.
  3. BuildKit: DAG-based, parallel, powerful caching.
  4. Multi-stage: the key to small images.
  5. Cache mount: dramatic build time reduction.
  6. Secret mount: safe secret management.
  7. Distroless/Scratch: minimal attack surface.
  8. Reproducible builds: audit and trust.

Production Checklist

Production Dockerfile:

  • Multi-stage build.
  • Slim/alpine/distroless base.
  • .dockerignore configured.
  • Dependency files COPY'd first.
  • Cache mounts (npm, pip, cargo).
  • Non-root user.
  • Secrets via --mount=type=secret.
  • Digest pinning (@sha256:...).
  • HEALTHCHECK defined.
  • Label metadata.
  • Multi-platform builds.
  • Image scanning.
  • SBOM generation.
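
Two checklist items that rarely show up in the examples above are HEALTHCHECK and labels; a sketch with illustrative values (the /healthz endpoint and the GIT_SHA build arg are assumptions):

ARG GIT_SHA=unknown
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:3000/healthz || exit 1
LABEL org.opencontainers.image.source="https://github.com/myorg/myapp" \
      org.opencontainers.image.revision="$GIT_SHA"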

Final Lesson

Docker popularized containers. BuildKit reinvented builds. Together, they became the foundation of modern CI/CD.

Today the following is taken for granted:

  • Minute-long builds → seconds.
  • GB images → MB images.
  • Opaque builds → reproducible.
  • Secret leakage → secret mount.

But all of this is achieved only when Dockerfiles are written properly. Badly written Dockerfiles are still slow, bloated, and insecure.

The knowledge in this post is something every engineer should know. Backend, frontend, DevOps, SRE, all of them. The images you build are pulled thousands of times and run tens of thousands of times in production. Small, fast, and secure images determine the efficiency of the entire system.

When you write your next Dockerfile, remember the principles from this post:

  • Multi-stage.
  • Cache mount.
  • Minimal base.
  • Order matters.
  • Don't leak secrets.
  • Think reproducibility.

These habits make 5x faster CI, 10x smaller images, and safer systems.

