[containerd] Networking and Storage

containerd does not implement networking and storage directly but integrates with external plugins through standard interfaces. This post analyzes network configuration via CNI, namespace management, volume mounts, device access, and security module integration.

1. CNI Integration

1.1 CNI Overview

Container Network Interface (CNI) is the standard interface for container networking. containerd calls CNI plugins to configure networks.

CNI call flow:

kubelet -> containerd (CRI RunPodSandbox)
                |
                v
        Create network namespace
                |
                v
        Call CNI plugin
        (ADD command)
                |
                v
        IP allocation, routing setup, interface creation
                |
                v
        Return result to containerd

1.2 CNI Configuration

CNI configuration file location:
  Config directory: /etc/cni/net.d/
  Binary directory: /opt/cni/bin/

containerd CNI configuration (config.toml, containerd 1.x plugin ID shown; containerd 2.x uses "io.containerd.cri.v1.runtime"):
  [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"
    max_conf_num = 1

1.3 CNI Plugin Chain

Network configuration is defined as a plugin chain in a .conflist file (for example, 10-calico.conflist):

1. Main plugin (calico, cilium, flannel, etc.):
   - Create network interface
   - IP allocation (IPAM)
   - Routing rule setup

2. Meta plugin (bandwidth, portmap, etc.):
   - Bandwidth limiting
   - Port mapping
   - Firewall rules

Execution order:
  ADD: Main -> Meta plugins (forward)
  DEL: Meta -> Main plugins (reverse)
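
The chain and its ordering can be sketched in a few lines of Python. This is a minimal illustration, not a real Calico configuration: the network name, subnet, and the choice of the reference plugins bridge, host-local, portmap, and bandwidth are assumptions made for the example.

```python
import json

# Hypothetical conflist. The name, subnet, and plugin options are illustrative
# values, not taken from a real cluster.
CONFLIST = {
    "cniVersion": "1.0.0",
    "name": "example-net",
    "plugins": [
        {"type": "bridge", "bridge": "cni0",
         "ipam": {"type": "host-local", "subnet": "10.22.0.0/16"}},
        {"type": "portmap", "capabilities": {"portMappings": True}},
        {"type": "bandwidth", "capabilities": {"bandwidth": True}},
    ],
}


def plugin_order(conflist, command):
    """Return plugin types in the order a runtime would invoke them."""
    types = [plugin["type"] for plugin in conflist["plugins"]]
    if command == "ADD":
        return types          # list order: main plugin first
    if command == "DEL":
        return types[::-1]    # reverse order: main plugin last
    raise ValueError(f"unsupported command: {command}")


if __name__ == "__main__":
    print(json.dumps(CONFLIST, indent=2))
    print("ADD:", plugin_order(CONFLIST, "ADD"))
    print("DEL:", plugin_order(CONFLIST, "DEL"))
```

Running DEL in reverse lets each meta plugin undo its changes while the interface created by the main plugin still exists.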

1.4 CNI Call Details

CNI ADD execution detail:

1. containerd determines network namespace path
   /var/run/netns/cni-abc123

2. Set CNI environment variables:
   CNI_COMMAND=ADD
   CNI_CONTAINERID=abc123
   CNI_NETNS=/var/run/netns/cni-abc123
   CNI_IFNAME=eth0
   CNI_PATH=/opt/cni/bin

3. Execute CNI plugin binary
   Pass config JSON via stdin

4. Plugin returns result via stdout:
   - Assigned IP address
   - Gateway address
   - DNS configuration
   - Routing information

5. containerd stores the result
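
Steps 2 and 3 can be sketched as a helper that prepares, but does not run, a single plugin call. The function name build_cni_invocation and the sample plugin config are hypothetical. containerd itself performs this through a Go CNI library, so the code below only mirrors the protocol shape.

```python
import json
import os


def build_cni_invocation(container_id, netns_path, plugin_conf,
                         ifname="eth0", cni_path="/opt/cni/bin",
                         command="ADD"):
    """Return (argv, env, stdin_bytes) for one CNI plugin call."""
    env = {
        "CNI_COMMAND": command,
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,
        "CNI_IFNAME": ifname,
        "CNI_PATH": cni_path,
    }
    # The plugin binary is looked up by its "type" inside the CNI_PATH directory.
    argv = [os.path.join(cni_path, plugin_conf["type"])]
    # The network configuration is handed to the plugin as JSON on stdin.
    stdin_bytes = json.dumps(plugin_conf).encode("utf-8")
    return argv, env, stdin_bytes


if __name__ == "__main__":
    # Hypothetical plugin config used only for illustration.
    conf = {"cniVersion": "1.0.0", "name": "example-net", "type": "bridge"}
    argv, env, stdin_bytes = build_cni_invocation(
        "abc123", "/var/run/netns/cni-abc123", conf)
    print(argv, env, stdin_bytes.decode("utf-8"))
```

A caller could pass these values to subprocess.run(argv, env=env, input=stdin_bytes, capture_output=True) and parse the JSON result from stdout. That requires root privileges and an installed plugin, so it is left out here.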

2. Network Namespaces

2.1 Namespace Creation

Pod network namespace:

During Pod Sandbox creation:
1. Create new network namespace with unshare(CLONE_NEWNET)
2. Persist via bind mount at /var/run/netns/
3. Invoke CNI plugins against this namespace (its path is passed as CNI_NETNS)
4. All containers in the Pod share this namespace

Namespace sharing:
  Pause (sandbox) container joins the network namespace created above
  App containers join the same namespace
  -> Containers in Pod can communicate via localhost
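
One way to confirm that two processes share a network namespace on Linux is to compare the targets of their /proc/<pid>/ns/net links. The sketch below assumes a Linux host with /proc mounted, and the helper names are hypothetical.

```python
import os
import re

_NS_LINK = re.compile(r"^net:\[(\d+)\]$")


def parse_ns_link(link):
    """Extract the namespace inode from a link target such as 'net:[4026531992]'."""
    match = _NS_LINK.match(link)
    if match is None:
        raise ValueError(f"not a network namespace link: {link!r}")
    return int(match.group(1))


def netns_inode(pid):
    """Read the network namespace inode of a process (Linux only)."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/net"))


def same_netns(pid_a, pid_b):
    """True when both processes are in the same network namespace."""
    return netns_inode(pid_a) == netns_inode(pid_b)


if __name__ == "__main__":
    if os.path.exists("/proc/self/ns/net"):
        me = os.getpid()
        print("netns inode:", netns_inode(me))
        print("shares netns with itself:", same_netns(me, me))
```

Applied to the pause process and an app container process of the same Pod, same_netns would be expected to return True.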

2.2 Namespace Cleanup

Namespace cleanup:

During Pod deletion:
1. CNI DEL command releases network resources
   - Return IP address
   - Delete interface
   - Remove routing rules
2. Unmount bind mount from /var/run/netns/
3. Kernel destroys the network namespace once nothing references it

3. Volume Mounts

3.1 Mount Types

containerd manages volumes through mount configuration in the OCI spec:

Mount types:

1. bind mount:
   - Mount host file/directory into container
   - Host and container share the same data
   - Used for ConfigMap, Secret, emptyDir, etc.

2. tmpfs mount:
   - Memory-based filesystem
   - Data lost on container termination
   - Used for /dev/shm, /run, etc.

3. Special filesystems:
   - proc: /proc
   - sysfs: /sys
   - cgroup: /sys/fs/cgroup
   - devpts: /dev/pts

3.2 Mount Propagation

Mount propagation options:

1. private:
   - No mount event propagation
   - Default

2. rprivate:
   - Recursive private

3. shared:
   - Bidirectional mount event propagation
   - Mount on host -> visible in container
   - Mount in container -> visible on host

4. rshared:
   - Recursive shared

5. slave:
   - Host -> container unidirectional propagation
   - Useful for volume plugins

6. rslave:
   - Recursive slave

Kubernetes usage:
  - Controlled via the mountPropagation field of a volume mount
  - None -> rprivate, HostToContainer -> rslave, Bidirectional -> rshared
  - CSI drivers typically use Bidirectional (rshared)
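
The relationship between the Kubernetes values and the Linux propagation modes can be written as a small lookup. The mapping follows the Kubernetes documentation (None is rprivate, HostToContainer is rslave, Bidirectional is rshared). The option layout in the returned list is an assumption made for illustration, not containerd's exact output.

```python
# Kubernetes mountPropagation value -> Linux propagation mode.
K8S_TO_LINUX = {
    "None": "rprivate",
    "HostToContainer": "rslave",
    "Bidirectional": "rshared",
}


def bind_mount_options(propagation="None", read_only=False):
    """Build an illustrative option list for a bind mount."""
    try:
        mode = K8S_TO_LINUX[propagation]
    except KeyError:
        raise ValueError(f"unknown mountPropagation: {propagation}") from None
    return ["rbind", mode, "ro" if read_only else "rw"]


if __name__ == "__main__":
    for value in K8S_TO_LINUX:
        print(value, "->", bind_mount_options(value))
```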

3.3 CRI Volume Processing

Volume processing via CRI:

kubelet prepares each volume on the host and passes it to containerd as a CRI mount:

1. emptyDir:
   - kubelet creates directory on host
   - Passed to container as bind mount

2. hostPath:
   - Direct bind mount of host path

3. ConfigMap/Secret:
   - kubelet writes the data to a per-Pod directory on the host (tmpfs-backed for Secrets)
   - Passed to container as bind mount

4. PersistentVolumeClaim:
   - kubelet mounts volume via CSI driver
   - Mounted path passed as bind mount

containerd's role:
  - Reflect kubelet-prepared mount info in OCI spec
  - runc performs the actual mount
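
The translation containerd performs can be pictured as a conversion from a CRI mount to an OCI mount entry. The sketch below is simplified: CRIMount models only three fields of the CRI Mount message, propagation and SELinux relabeling are left out, and the host path with its POD_UID placeholder is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CRIMount:
    """Simplified stand-in for the CRI Mount message."""
    container_path: str
    host_path: str
    readonly: bool = False


def to_oci_mount(mount):
    """Convert a CRI mount into an OCI runtime spec mount entry."""
    return {
        "destination": mount.container_path,
        "type": "bind",
        "source": mount.host_path,
        "options": ["rbind", "ro" if mount.readonly else "rw"],
    }


if __name__ == "__main__":
    # Hypothetical emptyDir path prepared by the kubelet.
    empty_dir = CRIMount(
        container_path="/cache",
        host_path="/var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/cache",
    )
    print(to_oci_mount(empty_dir))
```

Every volume type in the list above reaches the runtime in this same shape, which is why containerd does not need to know whether the source was an emptyDir, a ConfigMap, or a CSI volume.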

4. Device Access

4.1 Device Mapping

Device access mechanism:

OCI spec devices section:
  linux:
    devices:
      - path: "/dev/nvidia0"
        type: "c"
        major: 195
        minor: 0
        fileMode: 438
        uid: 0
        gid: 0

Cgroup device access control:
  linux:
    resources:
      devices:
        - allow: true
          type: "c"
          major: 195
          access: "rwm"
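
The numeric fields in the devices entry come straight from the device node's stat data: fileMode 438 is the decimal form of octal 0666, and major/minor are decoded from st_rdev. The helper below is a hypothetical sketch of that derivation on a Unix host.

```python
import os
import stat


def oci_device(path, st_mode, st_rdev, uid=0, gid=0):
    """Build an OCI 'linux.devices' entry from stat() data of a device node."""
    if stat.S_ISCHR(st_mode):
        dev_type = "c"    # character device
    elif stat.S_ISBLK(st_mode):
        dev_type = "b"    # block device
    else:
        raise ValueError(f"{path} is not a device node")
    return {
        "path": path,
        "type": dev_type,
        "major": os.major(st_rdev),
        "minor": os.minor(st_rdev),
        "fileMode": stat.S_IMODE(st_mode),   # permission bits as a decimal integer
        "uid": uid,
        "gid": gid,
    }


if __name__ == "__main__":
    st = os.stat("/dev/null")
    print(oci_device("/dev/null", st.st_mode, st.st_rdev))
```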

4.2 GPU Support

GPU access (NVIDIA):

NVIDIA Container Toolkit integration:

1. nvidia-container-runtime-hook:
   - Operates as OCI runtime hook
   - Runs before container start
   - Mounts NVIDIA driver libraries into container
   - Adds GPU device nodes to container

2. CDI (Container Device Interface):
   - Vendor-neutral device standard
   - Define device specs in /etc/cdi/
   - containerd reads CDI specs and reflects in OCI spec

CDI spec example:
  cdiVersion: "0.5.0"
  kind: "nvidia.com/gpu"
  devices:
    - name: "0"
      containerEdits:
        deviceNodes:
          - path: "/dev/nvidia0"
        mounts:
          - hostPath: "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so"
            containerPath: "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so"
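
To make "reflects in OCI spec" concrete, the sketch below resolves a fully qualified device name of the form kind=name and merges its containerEdits into a spec dictionary. It is a deliberately reduced model: real CDI handling also fills in device type and major/minor numbers from the host and supports env, hooks, and more. The function names are hypothetical.

```python
import copy

# Same content as the CDI spec example above, expressed as a dictionary.
CDI_SPEC = {
    "cdiVersion": "0.5.0",
    "kind": "nvidia.com/gpu",
    "devices": [{
        "name": "0",
        "containerEdits": {
            "deviceNodes": [{"path": "/dev/nvidia0"}],
            "mounts": [{
                "hostPath": "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so",
                "containerPath": "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so",
            }],
        },
    }],
}


def find_device(cdi_spec, qualified_name):
    """Look up a device by its qualified name, e.g. 'nvidia.com/gpu=0'."""
    kind, _, name = qualified_name.rpartition("=")
    if kind == cdi_spec["kind"]:
        for device in cdi_spec["devices"]:
            if device["name"] == name:
                return device
    raise KeyError(qualified_name)


def apply_container_edits(oci_spec, edits):
    """Return a copy of the OCI spec with device nodes and mounts added."""
    spec = copy.deepcopy(oci_spec)
    devices = spec.setdefault("linux", {}).setdefault("devices", [])
    for node in edits.get("deviceNodes", []):
        devices.append({"path": node["path"]})
    mounts = spec.setdefault("mounts", [])
    for mount in edits.get("mounts", []):
        mounts.append({"source": mount["hostPath"],
                       "destination": mount["containerPath"]})
    return spec


if __name__ == "__main__":
    device = find_device(CDI_SPEC, "nvidia.com/gpu=0")
    print(apply_container_edits({"ociVersion": "1.1.0"}, device["containerEdits"]))
```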

4.3 Other Devices

Other device access:

1. FPGA:
   - Expose FPGA devices via CDI specs
   - Vendor-specific device plugins

2. InfiniBand/RDMA:
   - Map /dev/infiniband/* devices
   - Share network device namespace

3. Serial/USB:
   - Direct host device mapping
   - Privileged mode or explicit device allowlist

5. SELinux Integration

5.1 SELinux Context

SELinux container security:

SELinux settings in OCI spec:
  process:
    selinuxLabel: "system_u:system_r:container_t:s0:c1,c2"
  linux:
    mountLabel: "system_u:object_r:container_file_t:s0:c1,c2"

Components:
  - user: system_u
  - role: system_r (process) / object_r (file)
  - type: container_t (process) / container_file_t (file)
  - level: s0:c1,c2 (MCS category)

MCS (Multi-Category Security):
  - Assigns unique categories to each container
  - Prevents access to other containers' files
  - Isolation between host and container
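
The label structure and the MCS idea can be illustrated with a small parser. The dominance check is a simplified model that assumes equal sensitivity levels and plain category lists. Ranges such as c0.c1023 are not expanded, and the kernel's actual policy evaluation involves far more than this comparison.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_label(label):
    """Split 'user:role:type:level' (the level itself may contain ':')."""
    parts = label.split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"malformed SELinux label: {label!r}")
    return SELinuxContext(*parts)


def mcs_categories(level):
    """Return the category set of a level such as 's0:c1,c2'."""
    _, _, categories = level.partition(":")
    return frozenset(c for c in categories.split(",") if c)


def dominates(process_label, file_label):
    """Simplified MCS check: same sensitivity, process categories cover the file's."""
    proc, obj = parse_label(process_label), parse_label(file_label)
    same_sensitivity = proc.level.split(":")[0] == obj.level.split(":")[0]
    return same_sensitivity and mcs_categories(proc.level) >= mcs_categories(obj.level)


if __name__ == "__main__":
    own_files = "system_u:object_r:container_file_t:s0:c1,c2"
    print(dominates("system_u:system_r:container_t:s0:c1,c2", own_files))
    print(dominates("system_u:system_r:container_t:s0:c3,c4", own_files))
```

In this model a container labeled s0:c3,c4 fails the check against files labeled s0:c1,c2, which is the isolation property described above.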

5.2 SELinux Processing Flow

SELinux application:

1. kubelet determines Pod SELinux options
   - securityContext.seLinuxOptions (user, role, type, level)

2. Passed to containerd via CRI as SELinux options

3. containerd derives the labels and reflects them in the OCI spec
   - Process label: process security context
   - Mount label: file security context
   - Automatic MCS category assignment when no level is given

4. runc applies at execution:
   - Apply SELinux label to process
   - Apply SELinux label to rootfs
   - Apply SELinux label to mounts

6. AppArmor Integration

6.1 AppArmor Profiles

AppArmor container security:

Default profile: cri-containerd.apparmor.d

Key rules:
  - Filesystem access restrictions
    deny /proc/kcore r,
    deny /sys/firmware/** r,
  - Network access control
  - Capability restrictions
  - Mount operation restrictions

Profile application:
  OCI spec:
    process:
      apparmorProfile: "cri-containerd.apparmor.d"

6.2 Custom Profiles

Custom AppArmor profiles:

1. Install profile on host:
   Place profile file in /etc/apparmor.d/
   apparmor_parser -r /etc/apparmor.d/my-profile

2. Specify in Pod (annotation form; Kubernetes 1.30+ also provides securityContext.appArmorProfile):
   annotations:
     container.apparmor.security.beta.kubernetes.io/app: localhost/my-profile

3. containerd reflects in OCI spec:
   process:
     apparmorProfile: "my-profile"
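
The annotation value syntax (runtime/default, localhost/<name>, unconfined) can be resolved to a profile name as sketched below. This only illustrates the value format. In a real cluster the kubelet translates the setting into CRI fields before containerd sees it, and treating a missing annotation as runtime/default is an assumption of this example.

```python
ANNOTATION_PREFIX = "container.apparmor.security.beta.kubernetes.io/"
DEFAULT_PROFILE = "cri-containerd.apparmor.d"


def resolve_apparmor_profile(annotations, container_name):
    """Return the AppArmor profile name for a container, or None if unconfined."""
    # Assumption: a missing annotation is treated like "runtime/default".
    value = annotations.get(ANNOTATION_PREFIX + container_name, "runtime/default")
    if value == "runtime/default":
        return DEFAULT_PROFILE
    if value == "unconfined":
        return None
    if value.startswith("localhost/"):
        return value[len("localhost/"):]
    raise ValueError(f"unsupported AppArmor profile reference: {value!r}")


if __name__ == "__main__":
    pod_annotations = {ANNOTATION_PREFIX + "app": "localhost/my-profile"}
    print(resolve_apparmor_profile(pod_annotations, "app"))
    print(resolve_apparmor_profile(pod_annotations, "sidecar"))
```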

7. Seccomp Integration

7.1 Seccomp Profiles

Seccomp (Secure Computing):

Define allowed/blocked system calls:

Default action: SCMP_ACT_ERRNO (deny)

Allowed system calls example:
  - read, write, open, close
  - mmap, mprotect, munmap
  - socket, connect, accept
  - ...

Blocked system calls example:
  - mount, umount (prevent container escape)
  - reboot
  - kexec_load
  - ptrace (in some environments)
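
A deny-by-default profile in the JSON layout used by OCI runtimes, together with a toy evaluator, might look like the sketch below. The allowlist is far shorter than any real default profile, and argument-based conditions are ignored. The evaluator is hypothetical and only demonstrates how the default action and the rules interact.

```python
import json

# Abbreviated, illustrative profile. Real profiles list hundreds of syscalls.
PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "open", "close",
                      "mmap", "mprotect", "munmap",
                      "socket", "connect", "accept"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}


def action_for(profile, syscall):
    """Return the action of the first rule naming the syscall, else the default."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    print(json.dumps(PROFILE, indent=2))
    for name in ("read", "mount", "kexec_load"):
        print(name, "->", action_for(PROFILE, name))
```

Because the default action is SCMP_ACT_ERRNO, mount, reboot, and kexec_load are rejected simply by being absent from the allowlist.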

7.2 Seccomp Application

Seccomp profile application:

1. Kubernetes SecurityContext:
   securityContext:
     seccompProfile:
       type: RuntimeDefault

2. RuntimeDefault profile:
   - containerd/runc default Seccomp profile
   - Blocks dangerous system calls
   - Suitable for most workloads

3. Custom profile:
   securityContext:
     seccompProfile:
       type: Localhost
       localhostProfile: "profiles/my-seccomp.json"
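
The localhostProfile value is a path relative to the kubelet's seccomp directory, which defaults to /var/lib/kubelet/seccomp. The helper below sketches that resolution. The rejection of paths that escape the root is an extra safety check added for the example, not a description of the kubelet's own validation.

```python
import posixpath

# Default kubelet seccomp profile root. Clusters with a custom kubelet root
# directory will use a different path.
SECCOMP_ROOT = "/var/lib/kubelet/seccomp"


def resolve_localhost_profile(localhost_profile, root=SECCOMP_ROOT):
    """Resolve a relative localhostProfile value to an absolute host path."""
    path = posixpath.normpath(posixpath.join(root, localhost_profile))
    # Extra check for this sketch: refuse absolute or '..' paths that leave root.
    if not path.startswith(root + "/"):
        raise ValueError(f"profile path escapes {root}: {localhost_profile!r}")
    return path


if __name__ == "__main__":
    print(resolve_localhost_profile("profiles/my-seccomp.json"))
```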

8. Summary

containerd networking and storage follow a delegation model built on standard interfaces. Network configuration via CNI, mount management via the OCI spec, device access via CDI, and security isolation via SELinux/AppArmor/Seccomp are the key pillars. This standards-based design lets containerd integrate flexibly with a range of networking solutions and security modules. The next post analyzes containerd's CRI implementation and Kubernetes runtime integration.