[Virtualization] 07. NVIDIA GPU Operator: Automating GPU Management on Kubernetes
- Introduction
- What is GPU Operator?
- ClusterPolicy CRD
- Component Details
- Initialization Order
- Installation
- GPU Time-Slicing Configuration
- MIG Configuration Example
- Monitoring Dashboard
- Conclusion
Introduction
Using GPUs on Kubernetes requires installing and managing multiple software components: NVIDIA drivers, container runtime configuration, device plugins, and monitoring. GPU Operator automates all of this, handling everything from Day-0 setup to Day-2 operations.
What is GPU Operator?
NVIDIA GPU Operator is an Operator that automatically deploys and manages all NVIDIA software components needed for GPU usage on Kubernetes.
The Problem It Solves
Let's look at what you would otherwise have to install and manage by hand to use GPUs on Kubernetes.
Manual Installation:
```
+----------------------+
| 1. GPU Driver        |  <-- direct host OS installation
+----------------------+
| 2. Container Toolkit |  <-- containerd/CRI-O config changes
+----------------------+
| 3. Device Plugin     |  <-- K8s DaemonSet deployment
+----------------------+
| 4. GFD               |  <-- GPU labeling DaemonSet
+----------------------+
| 5. DCGM              |  <-- monitoring agent
+----------------------+
| 6. MIG Manager       |  <-- MIG profile management (optional)
+----------------------+
```
With GPU Operator:
```
+------------------+
| ClusterPolicy CR |  <-- one resource automates everything
+------------------+
```
ClusterPolicy CRD
GPU Operator is built on the Operator Framework and declaratively manages all components through the ClusterPolicy CRD.
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: '550.90.07'
    repository: nvcr.io/nvidia
    image: driver
  toolkit:
    enabled: true
    version: v1.16.1-ubuntu20.04
  devicePlugin:
    enabled: true
    version: v0.16.1
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    version: 3.3.7-3.5.0-ubuntu22.04
  gfd:
    enabled: true
    version: v0.16.1
  migManager:
    enabled: true
    version: v0.8.0
  nodeStatusExporter:
    enabled: true
  validator:
    version: v24.6.2
```
Component Details
1. NVIDIA Driver DaemonSet
Containerizes the GPU driver and deploys it as a DaemonSet, eliminating the need for direct host installation.
```
+--------------------------------------------+
| Worker Node                                |
|  +--------------------------------------+  |
|  | NVIDIA Driver Container (DaemonSet)  |  |
|  |                                      |  |
|  |  - compiles and loads kernel modules |  |
|  |  - nvidia.ko, nvidia-uvm.ko          |  |
|  |  - creates /dev/nvidia* devices      |  |
|  +--------------------------------------+  |
|                    |                       |
|  +--------------------------------------+  |
|  | Host Kernel                          |  |
|  +--------------------------------------+  |
+--------------------------------------------+
```
Key features include the following.
- No need to install drivers directly on the host OS
- Driver version upgrades via rolling updates
- Automatic driver compilation for kernel versions
- Pre-compiled driver images also supported
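The precompiled-driver option maps to a ClusterPolicy setting. A minimal sketch, assuming the `driver.usePrecompiled` field of recent GPU Operator releases; check the ClusterPolicy reference for your version before relying on it:

```yaml
# Fragment of a ClusterPolicy spec: pull precompiled driver images instead
# of compiling the kernel module on every node
# (driver.usePrecompiled is assumed to exist in your GPU Operator version)
spec:
  driver:
    enabled: true
    usePrecompiled: true
    version: '550.90.07'
```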
2. NVIDIA Container Toolkit
Configures containerd or CRI-O runtime to recognize NVIDIA GPUs.
Container Execution Flow:
```
kubelet --> containerd --> nvidia-container-runtime-hook
                                         |
                                         v
                              nvidia-container-cli
                                         |
                                         v
                        mounts GPU devices/libraries
                             into the container
```
- Automatically patches containerd/CRI-O configuration
- Registers nvidia-container-runtime hook
- Auto-injects GPU libraries and devices into containers
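The registered runtime is also surfaced to Kubernetes as a RuntimeClass. GPU Operator typically creates this object itself; the manifest below is an illustrative sketch of what it looks like:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match the runtime name registered in the containerd config
handler: nvidia
```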
3. Device Plugin
Registers nvidia.com/gpu resources with Kubernetes so Pods can request GPUs.
```yaml
# Example: requesting a GPU in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```
The Device Plugin operation flow is as follows.
```
+----------------+       +---------------------+       +-----------------+
| Device Plugin  |  -->  | kubelet             |  -->  | API Server      |
| (DaemonSet)    |       | (gRPC registration) |       | (node resource  |
+----------------+       +---------------------+       |  update)        |
                                                       +-----------------+
```
- Registers GPU devices with kubelet via gRPC
- Manages GPU allocation/deallocation
- Supports topology-aware scheduling
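Because nvidia.com/gpu is a regular extended resource, standard Kubernetes machinery such as ResourceQuota applies to it as well. A sketch (the namespace name is illustrative):

```yaml
# Cap total GPU requests in a namespace; extended resources are quota'd
# using the requests.<resource-name> syntax
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team   # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
```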
4. GPU Feature Discovery (GFD)
Detects each node's GPU properties and publishes them as Kubernetes node labels.
```bash
# Example node labels added by GFD
kubectl get node worker-gpu-01 -o json | jq '.metadata.labels' | grep nvidia

# Example output:
# "nvidia.com/cuda.driver.major": "550"
# "nvidia.com/cuda.driver.minor": "90"
# "nvidia.com/cuda.driver.rev": "07"
# "nvidia.com/cuda.runtime.major": "12"
# "nvidia.com/gpu.count": "4"
# "nvidia.com/gpu.family": "ampere"
# "nvidia.com/gpu.machine": "DGX-A100"
# "nvidia.com/gpu.memory": "81920"
# "nvidia.com/gpu.product": "A100-SXM4-80GB"
# "nvidia.com/gpu.replicas": "1"
# "nvidia.com/mig.capable": "true"
```
Detected information includes the following.
| Label | Description |
|---|---|
| gpu.product | GPU model name (A100, H100, etc.) |
| gpu.memory | GPU memory capacity (MB) |
| gpu.family | GPU architecture (ampere, hopper, etc.) |
| cuda.driver.major | CUDA driver major version |
| cuda.runtime.major | CUDA runtime major version |
| mig.capable | MIG support status |
| gpu.count | Number of GPUs |
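These labels are what make GPU-aware scheduling possible: a Pod can be pinned to a specific GPU model with a plain nodeSelector. For example (the product string must match the GFD label actually present on your nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-on-a100
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB   # GFD-provided label
  containers:
  - name: trainer
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```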
5. DCGM + DCGM Exporter
Monitors GPU health using NVIDIA DCGM (Data Center GPU Manager) and exports Prometheus metrics.
```
+------------------+       +------------------+       +------------------+
|       DCGM       |  -->  |  DCGM Exporter   |  -->  |    Prometheus    |
| (GPU Monitoring) |       | (Metric Convert) |       | (Collect/Store)  |
+------------------+       +------------------+       +--------+---------+
                                                               |
                                                      +--------v---------+
                                                      |     Grafana      |
                                                      | (Visualization)  |
                                                      +------------------+
```
Key metrics include the following.
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (C) |
| DCGM_FI_DEV_POWER_USAGE | GPU power usage (W) |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe transmit throughput |
| DCGM_FI_DEV_XID_ERRORS | XID error count |
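Once these metrics land in Prometheus, alerting follows naturally. A sketch using the Prometheus Operator's PrometheusRule CRD; the thresholds are illustrative, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu
    rules:
    # XID errors usually indicate driver/hardware faults worth investigating
    - alert: GpuXidErrors
      expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 5m
      labels:
        severity: warning
    # illustrative temperature threshold
    - alert: GpuTooHot
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 10m
      labels:
        severity: critical
```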
6. MIG Manager
Dynamically configures Multi-Instance GPU (MIG) profiles on NVIDIA A100, H100, and similar GPUs.
A100 80GB MIG Profile Examples:
```
Full GPU (80GB)

+------------------------------------------------------------------+
|                           1g.10gb x 7                            |
|  +--------+--------+--------+--------+--------+--------+--------+|
|  |  10GB  |  10GB  |  10GB  |  10GB  |  10GB  |  10GB  |  10GB  ||
|  +--------+--------+--------+--------+--------+--------+--------+|
+------------------------------------------------------------------+

Or

+---------------------------------+--------------------------------+
|             3g.40gb             |            4g.40gb             |
|              40GB               |             40GB               |
+---------------------------------+--------------------------------+

Or

+------------------------------------------------------------------+
|                           7g.80gb x 1                            |
|                              80GB                                |
+------------------------------------------------------------------+
```
7. vGPU Manager (Optional)
Creates and manages virtual GPUs in environments with NVIDIA vGPU licenses.
- Host driver deployment
- vGPU instance creation and management
- vGPU allocation support for KubeVirt VMs
8. Node Feature Discovery (NFD)
A prerequisite for GPU Operator that detects hardware features on nodes.
```bash
# Example labels added by NFD
# feature.node.kubernetes.io/pci-10de.present=true (NVIDIA PCI device)
# feature.node.kubernetes.io/kernel-version.major=5
# feature.node.kubernetes.io/system-os_release.ID=ubuntu
```
Initialization Order
GPU Operator components initialize sequentially based on dependencies.
```
1. NFD (Node Feature Discovery)
        |
        v
2. NVIDIA Driver
        |
        v
3. NVIDIA Container Toolkit
        |
        v
4. Device Plugin
        |
        v
5. GPU Feature Discovery (GFD)
        |
        v
6. DCGM / DCGM Exporter
        |
        v
7. MIG Manager (optional)
```
Each step must complete before the next begins. Validator Pods verify correct operation at each stage.
Installation
Installation via Helm
```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the GPU Operator namespace
kubectl create namespace gpu-operator

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version=550.90.07 \
  --set toolkit.version=v1.16.1-ubuntu20.04

# Watch installation progress
kubectl get pods -n gpu-operator -w
```
When Drivers Are Already Installed on Host
```bash
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false
```
Installation Verification
```bash
# Check all component Pods
kubectl get pods -n gpu-operator

# Verify GPU node resources
kubectl describe node worker-gpu-01 | grep -A 10 "Allocatable"

# Run a test Pod (recent kubectl versions dropped the `kubectl run --limits`
# flag, so apply a minimal manifest instead)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs gpu-test
```
GPU Time-Slicing Configuration
Allows multiple Pods to share a GPU on devices that do not support MIG.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```
```bash
# Apply the ConfigMap
kubectl apply -f time-slicing-config.yaml

# Add the time-slicing config to the ClusterPolicy
kubectl patch clusterpolicy cluster-policy \
  --type=merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```
After applying time-slicing, GPU resources change as follows.
```bash
# Before: nvidia.com/gpu: 1
# After:  nvidia.com/gpu: 4 (with replicas=4)
```
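With replicas: 4, a node with one physical GPU advertises nvidia.com/gpu: 4, so four Pods each requesting a single GPU can be scheduled onto it and time-share the device. For example:

```yaml
# Four replicas sharing one physical GPU via time-slicing; each still
# requests nvidia.com/gpu: 1 (the workload command is a placeholder)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shared-gpu-inference
  template:
    metadata:
      labels:
        app: shared-gpu-inference
    spec:
      containers:
      - name: inference
        image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
```

Keep in mind that time-slicing provides no memory or fault isolation between the sharing Pods; MIG is the option when hard isolation is required.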
MIG Configuration Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.10gb": 7
      all-3g.40gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 2
      mixed-mig:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 1
          "1g.10gb": 4
```
```bash
# Apply a MIG profile by labeling the node
kubectl label node worker-gpu-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG devices
kubectl describe node worker-gpu-01 | grep nvidia.com/mig
# nvidia.com/mig-1g.10gb: 7
```
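Once the node advertises MIG resources, Pods consume them like any other extended resource; with the default mixed strategy the resource name encodes the profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG slice
```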
Monitoring Dashboard
Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics
    interval: 15s
```
Key Grafana Dashboard Panels
| Panel | PromQL Query |
|---|---|
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL |
| Memory Usage | DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP |
| Power Usage | DCGM_FI_DEV_POWER_USAGE |
| XID Errors | rate(DCGM_FI_DEV_XID_ERRORS[5m]) |
Conclusion
NVIDIA GPU Operator significantly reduces the complexity of managing GPU infrastructure on Kubernetes. With a single ClusterPolicy, you can automate everything from drivers to monitoring, and maximize GPU utilization through MIG and time-slicing.
In the next post, we will explore combining KubeVirt with GPU Operator to leverage GPU acceleration in VMs.
Quiz: GPU Operator Knowledge Check
Q1. What is the correct initialization order of GPU Operator components?
A) Device Plugin -> Driver -> Toolkit -> GFD B) Driver -> Toolkit -> Device Plugin -> GFD C) Toolkit -> Driver -> Device Plugin -> GFD D) GFD -> Driver -> Toolkit -> Device Plugin
Answer: B) Components initialize in the order: Driver -> Toolkit -> Device Plugin -> GFD. Each step depends on the previous one.
Q2. What is the role of GPU Feature Discovery (GFD)?
A) Install GPU drivers B) Configure container runtime C) Detect GPU info and add node labels D) Export GPU metrics to Prometheus
Answer: C) GFD detects GPU model, driver version, CUDA version, MIG support, and more, adding them as Kubernetes node labels.
Q3. How can multiple Pods share a GPU that does not support MIG?
A) Use vGPU Manager B) Configure GPU time-slicing C) Increase Device Plugin replicas D) GPU partitioning is not possible
Answer: B) GPU time-slicing enables multiple Pods to share a GPU through time-division on devices without MIG support.
Q4. Which metric is NOT provided by DCGM Exporter?
A) GPU utilization B) GPU temperature C) Per-pod network bandwidth D) XID error count
Answer: C) DCGM Exporter only provides GPU-related metrics. Network bandwidth is outside the scope of GPU monitoring.