[Virtualization] 07. NVIDIA GPU Operator: Automating GPU Management on Kubernetes
- Introduction
- What is GPU Operator?
- ClusterPolicy CRD
- Component Details
- Initialization Order
- Installation
- GPU Time-Slicing Configuration
- MIG Configuration Example
- Monitoring Dashboard
- Conclusion
Introduction
Using GPUs on Kubernetes requires installing and managing multiple software components: NVIDIA drivers, container runtime configuration, device plugins, and monitoring. GPU Operator automates all of this, handling everything from Day-0 setup to Day-2 operations.
What is GPU Operator?
NVIDIA GPU Operator is an Operator that automatically deploys and manages all NVIDIA software components needed for GPU usage on Kubernetes.
The Problem It Solves
Let's look at what you would otherwise have to install and manage by hand to use GPUs on Kubernetes.
Manual Installation:
```
+----------------------+
| 1. GPU Driver        |  <-- direct host OS installation
+----------------------+
| 2. Container Toolkit |  <-- containerd/CRI-O config changes
+----------------------+
| 3. Device Plugin     |  <-- K8s DaemonSet deployment
+----------------------+
| 4. GFD               |  <-- GPU labeling DaemonSet
+----------------------+
| 5. DCGM              |  <-- monitoring agent
+----------------------+
| 6. MIG Manager       |  <-- MIG profile management (optional)
+----------------------+
```
With GPU Operator:
```
+------------------+
| ClusterPolicy CR |  <-- one resource automates everything
+------------------+
```
ClusterPolicy CRD
GPU Operator is built on the Operator Framework and declaratively manages all components through the ClusterPolicy CRD.
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: '550.90.07'
    repository: nvcr.io/nvidia
    image: driver
  toolkit:
    enabled: true
    version: v1.16.1-ubuntu20.04
  devicePlugin:
    enabled: true
    version: v0.16.1
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    version: 3.3.7-3.5.0-ubuntu22.04
  gfd:
    enabled: true
    version: v0.16.1
  migManager:
    enabled: true
    version: v0.8.0
  nodeStatusExporter:
    enabled: true
  validator:
    version: v24.6.2
```
Component Details
1. NVIDIA Driver DaemonSet
Containerizes the GPU driver and deploys it as a DaemonSet, eliminating the need for direct host installation.
```
+--------------------------------------------+
| Worker Node                                |
|  +--------------------------------------+  |
|  | NVIDIA Driver Container (DaemonSet)  |  |
|  |                                      |  |
|  |  - compiles and loads kernel modules |  |
|  |  - nvidia.ko, nvidia-uvm.ko          |  |
|  |  - creates /dev/nvidia* devices      |  |
|  +--------------------------------------+  |
|                    |                       |
|  +--------------------------------------+  |
|  | Host Kernel                          |  |
|  +--------------------------------------+  |
+--------------------------------------------+
```
Key features include the following.
- No need to install drivers directly on the host OS
- Driver version upgrades via rolling updates
- Automatic driver compilation for kernel versions
- Pre-compiled driver images also supported
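The precompiled-driver option maps to a ClusterPolicy setting. A minimal sketch, assuming the `driver.usePrecompiled` field of recent GPU Operator releases; check the ClusterPolicy reference for your version before relying on it:

```yaml
# Fragment of a ClusterPolicy spec: pull precompiled driver images instead
# of compiling the kernel module on every node
# (driver.usePrecompiled is assumed to exist in your GPU Operator version)
spec:
  driver:
    enabled: true
    usePrecompiled: true
    version: '550.90.07'
```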
2. NVIDIA Container Toolkit
Configures containerd or CRI-O runtime to recognize NVIDIA GPUs.
Container Execution Flow:
```
kubelet --> containerd --> nvidia-container-runtime-hook
                                         |
                                         v
                              nvidia-container-cli
                                         |
                                         v
                        mounts GPU devices/libraries
                             into the container
```
- Automatically patches containerd/CRI-O configuration
- Registers nvidia-container-runtime hook
- Auto-injects GPU libraries and devices into containers
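The registered runtime is also surfaced to Kubernetes as a RuntimeClass. GPU Operator typically creates this object itself; the manifest below is an illustrative sketch of what it looks like:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match the runtime name registered in the containerd config
handler: nvidia
```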
3. Device Plugin
Registers nvidia.com/gpu resources with Kubernetes so Pods can request GPUs.
```yaml
# Example: requesting a GPU in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```
The Device Plugin operation flow is as follows.
```
+----------------+       +---------------------+       +-----------------+
| Device Plugin  |  -->  | kubelet             |  -->  | API Server      |
| (DaemonSet)    |       | (gRPC registration) |       | (node resource  |
+----------------+       +---------------------+       |  update)        |
                                                       +-----------------+
```
- Registers GPU devices with kubelet via gRPC
- Manages GPU allocation/deallocation
- Supports topology-aware scheduling
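Because nvidia.com/gpu is a regular extended resource, standard Kubernetes machinery such as ResourceQuota applies to it as well. A sketch (the namespace name is illustrative):

```yaml
# Cap total GPU requests in a namespace; extended resources are quota'd
# using the requests.<resource-name> syntax
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team   # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
```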
4. GPU Feature Discovery (GFD)
Detects each node's GPU properties and publishes them as Kubernetes node labels.
```bash
# Example node labels added by GFD
kubectl get node worker-gpu-01 -o json | jq '.metadata.labels' | grep nvidia

# Example output:
# "nvidia.com/cuda.driver.major": "550"
# "nvidia.com/cuda.driver.minor": "90"
# "nvidia.com/cuda.driver.rev": "07"
# "nvidia.com/cuda.runtime.major": "12"
# "nvidia.com/gpu.count": "4"
# "nvidia.com/gpu.family": "ampere"
# "nvidia.com/gpu.machine": "DGX-A100"
# "nvidia.com/gpu.memory": "81920"
# "nvidia.com/gpu.product": "A100-SXM4-80GB"
# "nvidia.com/gpu.replicas": "1"
# "nvidia.com/mig.capable": "true"
```
Detected information includes the following.
| Label | Description |
|---|---|
| gpu.product | GPU model name (A100, H100, etc.) |
| gpu.memory | GPU memory capacity (MB) |
| gpu.family | GPU architecture (ampere, hopper, etc.) |
| cuda.driver.major | CUDA driver major version |
| cuda.runtime.major | CUDA runtime major version |
| mig.capable | MIG support status |
| gpu.count | Number of GPUs |
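These labels are what make GPU-aware scheduling possible: a Pod can be pinned to a specific GPU model with a plain nodeSelector. For example (the product string must match the GFD label actually present on your nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-on-a100
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB   # GFD-provided label
  containers:
  - name: trainer
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```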
5. DCGM + DCGM Exporter
Monitors GPU health using NVIDIA DCGM (Data Center GPU Manager) and exports Prometheus metrics.
```
+------------------+       +------------------+       +------------------+
|       DCGM       |  -->  |  DCGM Exporter   |  -->  |    Prometheus    |
| (GPU Monitoring) |       | (Metric Convert) |       | (Collect/Store)  |
+------------------+       +------------------+       +--------+---------+
                                                               |
                                                      +--------v---------+
                                                      |     Grafana      |
                                                      | (Visualization)  |
                                                      +------------------+
```
Key metrics include the following.
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (C) |
| DCGM_FI_DEV_POWER_USAGE | GPU power usage (W) |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe transmit throughput |
| DCGM_FI_DEV_XID_ERRORS | XID error count |
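Once these metrics land in Prometheus, alerting follows naturally. A sketch using the Prometheus Operator's PrometheusRule CRD; the thresholds are illustrative, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu
    rules:
    # XID errors usually indicate driver/hardware faults worth investigating
    - alert: GpuXidErrors
      expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 5m
      labels:
        severity: warning
    # illustrative temperature threshold
    - alert: GpuTooHot
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 10m
      labels:
        severity: critical
```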
6. MIG Manager
Dynamically configures Multi-Instance GPU (MIG) profiles on NVIDIA A100, H100, and similar GPUs.
A100 80GB MIG Profile Examples:
```
Full GPU (80GB)

+------------------------------------------------------------------+
|                           1g.10gb x 7                            |
|  +--------+--------+--------+--------+--------+--------+--------+|
|  |  10GB  |  10GB  |  10GB  |  10GB  |  10GB  |  10GB  |  10GB  ||
|  +--------+--------+--------+--------+--------+--------+--------+|
+------------------------------------------------------------------+

Or

+---------------------------------+--------------------------------+
|             3g.40gb             |            4g.40gb             |
|              40GB               |             40GB               |
+---------------------------------+--------------------------------+

Or

+------------------------------------------------------------------+
|                           7g.80gb x 1                            |
|                              80GB                                |
+------------------------------------------------------------------+
```
7. vGPU Manager (Optional)
Creates and manages virtual GPUs in environments with NVIDIA vGPU licenses.
- Host driver deployment
- vGPU instance creation and management
- vGPU allocation support for KubeVirt VMs
8. Node Feature Discovery (NFD)
A prerequisite for GPU Operator that detects hardware features on nodes.
```bash
# Example labels added by NFD
# feature.node.kubernetes.io/pci-10de.present=true (NVIDIA PCI device)
# feature.node.kubernetes.io/kernel-version.major=5
# feature.node.kubernetes.io/system-os_release.ID=ubuntu
```
Initialization Order
GPU Operator components initialize sequentially based on dependencies.
```
1. NFD (Node Feature Discovery)
        |
        v
2. NVIDIA Driver
        |
        v
3. NVIDIA Container Toolkit
        |
        v
4. Device Plugin
        |
        v
5. GPU Feature Discovery (GFD)
        |
        v
6. DCGM / DCGM Exporter
        |
        v
7. MIG Manager (optional)
```
Each step must complete before the next begins. Validator Pods verify correct operation at each stage.
Installation
Installation via Helm
```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the GPU Operator namespace
kubectl create namespace gpu-operator

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version=550.90.07 \
  --set toolkit.version=v1.16.1-ubuntu20.04

# Watch installation progress
kubectl get pods -n gpu-operator -w
```
When Drivers Are Already Installed on Host
```bash
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false
```
Installation Verification
```bash
# Check all component Pods
kubectl get pods -n gpu-operator

# Verify GPU node resources
kubectl describe node worker-gpu-01 | grep -A 10 "Allocatable"

# Run a test Pod (recent kubectl versions dropped the `kubectl run --limits`
# flag, so apply a minimal manifest instead)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs gpu-test
```
GPU Time-Slicing Configuration
Allows multiple Pods to share a GPU on devices that do not support MIG.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```
```bash
# Apply the ConfigMap
kubectl apply -f time-slicing-config.yaml

# Add the time-slicing config to the ClusterPolicy
kubectl patch clusterpolicy cluster-policy \
  --type=merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```
After applying time-slicing, GPU resources change as follows.
```bash
# Before: nvidia.com/gpu: 1
# After:  nvidia.com/gpu: 4 (with replicas=4)
```
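With replicas: 4, a node with one physical GPU advertises nvidia.com/gpu: 4, so four Pods each requesting a single GPU can be scheduled onto it and time-share the device. For example:

```yaml
# Four replicas sharing one physical GPU via time-slicing; each still
# requests nvidia.com/gpu: 1 (the workload command is a placeholder)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shared-gpu-inference
  template:
    metadata:
      labels:
        app: shared-gpu-inference
    spec:
      containers:
      - name: inference
        image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
```

Keep in mind that time-slicing provides no memory or fault isolation between the sharing Pods; MIG is the option when hard isolation is required.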
MIG Configuration Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.10gb": 7
      all-3g.40gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 2
      mixed-mig:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 1
          "1g.10gb": 4
```
```bash
# Apply a MIG profile by labeling the node
kubectl label node worker-gpu-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG devices
kubectl describe node worker-gpu-01 | grep nvidia.com/mig
# nvidia.com/mig-1g.10gb: 7
```
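Once the node advertises MIG resources, Pods consume them like any other extended resource; with the default mixed strategy the resource name encodes the profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG slice
```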
Monitoring Dashboard
Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics
    interval: 15s
```
Key Grafana Dashboard Panels
| Panel | PromQL Query |
|---|---|
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL |
| Memory Usage | DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP |
| Power Usage | DCGM_FI_DEV_POWER_USAGE |
| XID Errors | rate(DCGM_FI_DEV_XID_ERRORS[5m]) |
Conclusion
NVIDIA GPU Operator significantly reduces the complexity of managing GPU infrastructure on Kubernetes. With a single ClusterPolicy, you can automate everything from drivers to monitoring, and maximize GPU utilization through MIG and time-slicing.
In the next post, we will explore combining KubeVirt with GPU Operator to leverage GPU acceleration in VMs.
Quiz: GPU Operator Knowledge Check
Q1. What is the correct initialization order of GPU Operator components?
A) Device Plugin -> Driver -> Toolkit -> GFD B) Driver -> Toolkit -> Device Plugin -> GFD C) Toolkit -> Driver -> Device Plugin -> GFD D) GFD -> Driver -> Toolkit -> Device Plugin
Answer: B) Components initialize in the order: Driver -> Toolkit -> Device Plugin -> GFD. Each step depends on the previous one.
Q2. What is the role of GPU Feature Discovery (GFD)?
A) Install GPU drivers B) Configure container runtime C) Detect GPU info and add node labels D) Export GPU metrics to Prometheus
Answer: C) GFD detects GPU model, driver version, CUDA version, MIG support, and more, adding them as Kubernetes node labels.
Q3. How can multiple Pods share a GPU that does not support MIG?
A) Use vGPU Manager B) Configure GPU time-slicing C) Increase Device Plugin replicas D) GPU partitioning is not possible
Answer: B) GPU time-slicing enables multiple Pods to share a GPU through time-division on devices without MIG support.
Q4. Which metric is NOT provided by DCGM Exporter?
A) GPU utilization B) GPU temperature C) Per-pod network bandwidth D) XID error count
Answer: C) DCGM Exporter only provides GPU-related metrics. Network bandwidth is outside the scope of GPU monitoring.