[Virtualization] 07. NVIDIA GPU Operator: Automating GPU Management on Kubernetes
- Authors
- Youngju Kim (@fjvbn20031)
- Introduction
- What is GPU Operator?
- ClusterPolicy CRD
- Component Details
- Initialization Order
- Installation
- GPU Time-Slicing Configuration
- MIG Configuration Example
- Monitoring Dashboard
- Conclusion
Introduction
Using GPUs on Kubernetes requires installing and managing multiple software components: NVIDIA drivers, container runtime configuration, device plugins, and monitoring. GPU Operator automates all of this, handling everything from Day-0 setup to Day-2 operations.
What is GPU Operator?
NVIDIA GPU Operator is a Kubernetes Operator that automatically deploys and manages all of the NVIDIA software components needed to use GPUs on Kubernetes.
The Problem It Solves
Let us look at what needs to be manually installed to use GPUs on Kubernetes.
Manual installation:

```
+----------------------+
| 1. GPU Driver        |  <-- Direct host OS installation
+----------------------+
| 2. Container Toolkit |  <-- containerd/CRI-O config changes
+----------------------+
| 3. Device Plugin     |  <-- K8s DaemonSet deployment
+----------------------+
| 4. GFD               |  <-- GPU labeling DaemonSet
+----------------------+
| 5. DCGM              |  <-- Monitoring agent
+----------------------+
| 6. MIG Manager       |  <-- MIG profile management (optional)
+----------------------+
```

With GPU Operator:

```
+------------------+
| ClusterPolicy CR |  <-- One resource automates everything
+------------------+
```
ClusterPolicy CRD
GPU Operator is built on the Operator Framework and declaratively manages all components through the ClusterPolicy CRD.
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: "550.90.07"
    repository: nvcr.io/nvidia
    image: driver
  toolkit:
    enabled: true
    version: v1.16.1-ubuntu20.04
  devicePlugin:
    enabled: true
    version: v0.16.1
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    version: 3.3.7-3.5.0-ubuntu22.04
  gfd:
    enabled: true
    version: v0.16.1
  migManager:
    enabled: true
    version: v0.8.0
  nodeStatusExporter:
    enabled: true
  validator:
    version: v24.6.2
```
Component Details
1. NVIDIA Driver DaemonSet
Containerizes the GPU driver and deploys it as a DaemonSet, eliminating the need for direct host installation.
```
+--------------------------------------------+
| Worker Node                                |
|  +--------------------------------------+  |
|  | NVIDIA Driver Container (DaemonSet)  |  |
|  |                                      |  |
|  | - Compiles and loads kernel modules  |  |
|  |   (nvidia.ko, nvidia-uvm.ko)         |  |
|  | - Creates /dev/nvidia* devices       |  |
|  +--------------------------------------+  |
|                     |                      |
|  +--------------------------------------+  |
|  | Host Kernel                          |  |
|  +--------------------------------------+  |
+--------------------------------------------+
```
Key features include the following.
- No need to install drivers directly on the host OS
- Driver version upgrades via rolling updates
- Automatic driver compilation for kernel versions
- Pre-compiled driver images also supported
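For example, switching to pre-compiled driver images is a Helm values change (an illustrative values fragment; the `driver.usePrecompiled` key follows recent gpu-operator charts and should be checked against the chart version in use):

```yaml
driver:
  enabled: true
  usePrecompiled: true    # use pre-built kernel-module images, skip on-node compilation
  version: "550.90.07"
```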
2. NVIDIA Container Toolkit
Configures containerd or CRI-O runtime to recognize NVIDIA GPUs.
Container Execution Flow:
```
kubelet --> containerd --> nvidia-container-runtime-hook
                                      |
                                      v
                            nvidia-container-cli
                                      |
                                      v
                       Mounts GPU devices/libraries
                       into the container
```
- Automatically patches containerd/CRI-O configuration
- Registers nvidia-container-runtime hook
- Auto-injects GPU libraries and devices into containers
3. Device Plugin
Registers nvidia.com/gpu resources with Kubernetes so Pods can request GPUs.
```yaml
# Example: requesting a GPU in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```
The Device Plugin operation flow is as follows.
```
+-----------------+      +--------------------+      +-----------------+
| Device Plugin   | -->  | kubelet            | -->  | API Server      |
| (DaemonSet)     |      | (gRPC registration)|      | (node resource  |
+-----------------+      +--------------------+      |  update)        |
                                                     +-----------------+
```
- Registers GPU devices with kubelet via gRPC
- Manages GPU allocation/deallocation
- Supports topology-aware scheduling
4. GPU Feature Discovery (GFD)
Detects GPU information on each node and exposes it as Kubernetes node labels.
```bash
# Example node labels added by GFD
kubectl get node worker-gpu-01 -o json | jq '.metadata.labels' | grep nvidia
# Example output:
#   "nvidia.com/cuda.driver.major": "550"
#   "nvidia.com/cuda.driver.minor": "90"
#   "nvidia.com/cuda.driver.rev": "07"
#   "nvidia.com/cuda.runtime.major": "12"
#   "nvidia.com/gpu.count": "4"
#   "nvidia.com/gpu.family": "ampere"
#   "nvidia.com/gpu.machine": "DGX-A100"
#   "nvidia.com/gpu.memory": "81920"
#   "nvidia.com/gpu.product": "A100-SXM4-80GB"
#   "nvidia.com/gpu.replicas": "1"
#   "nvidia.com/mig.capable": "true"
```
Detected information includes the following.
| Label | Description |
|---|---|
| gpu.product | GPU model name (A100, H100, etc.) |
| gpu.memory | GPU memory capacity (MB) |
| gpu.family | GPU architecture (ampere, hopper, etc.) |
| cuda.driver.major | CUDA driver major version |
| cuda.runtime.major | CUDA runtime major version |
| mig.capable | MIG support status |
| gpu.count | Number of GPUs |
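These labels can be used directly for scheduling. For example, a Pod can be pinned to A100 nodes with a nodeSelector (an illustrative manifest using the GFD label from the table above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: a100-only
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB   # GFD-provided label
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
```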
5. DCGM + DCGM Exporter
Monitors GPU health using NVIDIA DCGM (Data Center GPU Manager) and exports Prometheus metrics.
```
+------------------+     +------------------+     +------------------+
|       DCGM       | --> |  DCGM Exporter   | --> |    Prometheus    |
| (GPU monitoring) |     | (metric export)  |     | (collect/store)  |
+------------------+     +------------------+     +---------+--------+
                                                            |
                                                  +---------v--------+
                                                  |     Grafana      |
                                                  |  (visualization) |
                                                  +------------------+
```
Key metrics include the following.
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (C) |
| DCGM_FI_DEV_POWER_USAGE | GPU power usage (W) |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe transmit throughput |
| DCGM_FI_DEV_XID_ERRORS | XID error count |
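These metrics feed naturally into alerting. A minimal sketch of a Prometheus alert on XID errors (assuming the Prometheus Operator's PrometheusRule CRD is installed; the `Hostname` label is the one DCGM Exporter typically attaches):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-xid-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GpuXidErrors
      expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "XID errors detected on {{ $labels.Hostname }}"
```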
6. MIG Manager
Dynamically configures Multi-Instance GPU (MIG) profiles on NVIDIA A100, H100, and similar GPUs.
A100 80GB MIG profile examples:

```
Full GPU (80GB)
+------------------------------------------------------------------+
|                           1g.10gb x 7                            |
|  +------+ +------+ +------+ +------+ +------+ +------+ +------+  |
|  | 10GB | | 10GB | | 10GB | | 10GB | | 10GB | | 10GB | | 10GB |  |
|  +------+ +------+ +------+ +------+ +------+ +------+ +------+  |
+------------------------------------------------------------------+

Or

+---------------------------------+--------------------------------+
|             3g.40gb             |            4g.40gb             |
|              40GB               |             40GB               |
+---------------------------------+--------------------------------+

Or

+------------------------------------------------------------------+
|                           7g.80gb x 1                            |
|                              80GB                                |
+------------------------------------------------------------------+
```
7. vGPU Manager (Optional)
Creates and manages virtual GPUs in environments with NVIDIA vGPU licenses.
- Host driver deployment
- vGPU instance creation and management
- vGPU allocation support for KubeVirt VMs
8. Node Feature Discovery (NFD)
NFD is a prerequisite of GPU Operator; it detects hardware features on each node and publishes them as node labels.
```
# Example labels added by NFD
feature.node.kubernetes.io/pci-10de.present=true        # NVIDIA PCI vendor ID
feature.node.kubernetes.io/kernel-version.major=5
feature.node.kubernetes.io/system-os_release.ID=ubuntu
```
Initialization Order
GPU Operator components initialize sequentially based on dependencies.
```
1. NFD (Node Feature Discovery)
          |
          v
2. NVIDIA Driver
          |
          v
3. NVIDIA Container Toolkit
          |
          v
4. Device Plugin
          |
          v
5. GPU Feature Discovery (GFD)
          |
          v
6. DCGM / DCGM Exporter
          |
          v
7. MIG Manager (optional)
```
Each step must complete before the next begins. Validator Pods verify correct operation at each stage.
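The dependency chain can be expressed as a graph and ordered with a topological sort (illustrative Python; the component names simply follow the list above):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each component depends on the one before it in the chain above.
deps = {
    "driver": {"nfd"},
    "toolkit": {"driver"},
    "device-plugin": {"toolkit"},
    "gfd": {"device-plugin"},
    "dcgm-exporter": {"gfd"},
    "mig-manager": {"dcgm-exporter"},
}
order = list(TopologicalSorter(deps).static_order())
print(" -> ".join(order))
# nfd -> driver -> toolkit -> device-plugin -> gfd -> dcgm-exporter -> mig-manager
```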
Installation
Installation via Helm
```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the GPU Operator namespace
kubectl create namespace gpu-operator

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version=550.90.07 \
  --set toolkit.version=v1.16.1-ubuntu20.04

# Check installation status
kubectl get pods -n gpu-operator -w
```
When Drivers Are Already Installed on the Host
```bash
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false
```
Installation Verification
```bash
# Check all component Pods
kubectl get pods -n gpu-operator

# Verify GPU node resources
kubectl describe node worker-gpu-01 | grep -A 10 "Allocatable"

# Run a test Pod (recent kubectl removed --limits, so pass the GPU request via --overrides)
kubectl run gpu-test --image=nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
  --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":1}}}]}}'
kubectl logs gpu-test
```
GPU Time-Slicing Configuration
Time-slicing allows multiple Pods to share a single GPU on devices that do not support MIG.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```
```bash
# Apply the ConfigMap
kubectl apply -f time-slicing-config.yaml

# Point the ClusterPolicy at the time-slicing config
kubectl patch clusterpolicy cluster-policy \
  --type=merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```

After applying time-slicing, the node's advertised GPU resources change as follows.

```
# Before: nvidia.com/gpu: 1
# After:  nvidia.com/gpu: 4 (with replicas=4)
```
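With replicas=4, a node with one physical GPU can run four Pods that each request one (time-sliced) GPU. An illustrative Deployment (names are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-workload
spec:
  replicas: 4                 # all four Pods can fit on one physical GPU
  selector:
    matchLabels:
      app: shared-gpu
  template:
    metadata:
      labels:
        app: shared-gpu
    spec:
      containers:
      - name: cuda-app
        image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1   # one time-sliced replica each
```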
MIG Configuration Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.10gb": 7
      all-3g.40gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 2
      mixed-mig:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 1
          "1g.10gb": 4
```
```bash
# Apply a MIG profile by labeling the node
kubectl label node worker-gpu-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG devices
kubectl describe node worker-gpu-01 | grep nvidia.com/mig
# nvidia.com/mig-1g.10gb: 7
```
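Once the profile is applied, Pods request MIG slices as their own resource type (an illustrative manifest using the single-strategy resource name shown above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG slice
```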
Monitoring Dashboard
Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics
    interval: 15s
```
Key Grafana Dashboard Panels
| Panel | PromQL Query |
|---|---|
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL |
| Memory Usage | DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP |
| Power Usage | DCGM_FI_DEV_POWER_USAGE |
| XID Errors | rate(DCGM_FI_DEV_XID_ERRORS[5m]) |
Conclusion
NVIDIA GPU Operator significantly reduces the complexity of managing GPU infrastructure on Kubernetes. With a single ClusterPolicy, you can automate everything from drivers to monitoring, and maximize GPU utilization through MIG and time-slicing.
In the next post, we will explore combining KubeVirt with GPU Operator to leverage GPU acceleration in VMs.
Quiz: GPU Operator Knowledge Check
Q1. What is the correct initialization order of GPU Operator components?
- A) Device Plugin -> Driver -> Toolkit -> GFD
- B) Driver -> Toolkit -> Device Plugin -> GFD
- C) Toolkit -> Driver -> Device Plugin -> GFD
- D) GFD -> Driver -> Toolkit -> Device Plugin
Answer: B) Components initialize in the order: Driver -> Toolkit -> Device Plugin -> GFD. Each step depends on the previous one.
Q2. What is the role of GPU Feature Discovery (GFD)?
- A) Install GPU drivers
- B) Configure container runtime
- C) Detect GPU info and add node labels
- D) Export GPU metrics to Prometheus
Answer: C) GFD detects GPU model, driver version, CUDA version, MIG support, and more, adding them as Kubernetes node labels.
Q3. How can multiple Pods share a GPU that does not support MIG?
- A) Use vGPU Manager
- B) Configure GPU time-slicing
- C) Increase Device Plugin replicas
- D) GPU partitioning is not possible
Answer: B) GPU time-slicing enables multiple Pods to share a GPU through time-division on devices without MIG support.
Q4. Which metric is NOT provided by DCGM Exporter?
- A) GPU utilization
- B) GPU temperature
- C) Per-pod network bandwidth
- D) XID error count
Answer: C) DCGM Exporter only provides GPU-related metrics. Network bandwidth is outside the scope of GPU monitoring.