[Virtualization] 07. NVIDIA GPU Operator: Automating GPU Management on Kubernetes

Introduction

Using GPUs on Kubernetes requires installing and managing multiple software components: NVIDIA drivers, container runtime configuration, device plugins, and monitoring. GPU Operator automates all of this, handling everything from Day-0 setup to Day-2 operations.

What is GPU Operator?

NVIDIA GPU Operator is an Operator that automatically deploys and manages all NVIDIA software components needed for GPU usage on Kubernetes.

The Problem It Solves

Let us look at what needs to be manually installed to use GPUs on Kubernetes.

Manual Installation:
+----------------------+
| 1. GPU Driver        |  <-- Direct host OS installation
+----------------------+
| 2. Container Toolkit |  <-- containerd/CRI-O config changes
+----------------------+
| 3. Device Plugin     |  <-- K8s DaemonSet deployment
+----------------------+
| 4. GFD               |  <-- GPU labeling DaemonSet
+----------------------+
| 5. DCGM              |  <-- Monitoring agent
+----------------------+
| 6. MIG Manager       |  <-- MIG profile management (optional)
+----------------------+

With GPU Operator:
+----------------------+
| ClusterPolicy CR     |  <-- One resource automates everything
+----------------------+

ClusterPolicy CRD

GPU Operator is built on the Operator Framework and declaratively manages all components through the ClusterPolicy CRD.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: '550.90.07'
    repository: nvcr.io/nvidia
    image: driver
  toolkit:
    enabled: true
    version: v1.16.1-ubuntu20.04
  devicePlugin:
    enabled: true
    version: v0.16.1
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    version: 3.3.7-3.5.0-ubuntu22.04
  gfd:
    enabled: true
    version: v0.16.1
  migManager:
    enabled: true
    version: v0.8.0
  nodeStatusExporter:
    enabled: true
  validator:
    version: v24.6.2

Component Details

1. NVIDIA Driver DaemonSet

Containerizes the GPU driver and deploys it as a DaemonSet, eliminating the need for direct host installation.

+---------------------------------------------+
|                 Worker Node                 |
|  +-------------------------------------+  |
|  | NVIDIA Driver Container (DaemonSet) |  |
|  |                                     |  |
|  | - Compiles and loads kernel modules |  |
|  |   (nvidia.ko, nvidia-uvm.ko)        |  |
|  | - Creates /dev/nvidia* devices      |  |
|  +-------------------------------------+  |
|                     |                     |
|  +-------------------------------------+  |
|  |             Host Kernel             |  |
|  +-------------------------------------+  |
+---------------------------------------------+

Key features include the following.

  • No need to install drivers directly on the host OS
  • Driver version upgrades via rolling updates
  • Automatic driver compilation against the node's kernel version
  • Pre-compiled driver images are also supported

2. NVIDIA Container Toolkit

Configures containerd or CRI-O runtime to recognize NVIDIA GPUs.

Container Execution Flow:

kubelet --> containerd --> nvidia-container-runtime-hook
                                    |
                                    v
                          nvidia-container-cli
                                    |
                                    v
                          Mounts GPU devices/libraries
                          into the container

  • Automatically patches containerd/CRI-O configuration
  • Registers nvidia-container-runtime hook
  • Auto-injects GPU libraries and devices into containers
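
The Operator also registers a RuntimeClass so Pods can opt into the NVIDIA runtime explicitly when it is not the containerd default. A minimal sketch, assuming the commonly used `nvidia` handler name (verify with `kubectl get runtimeclass` on your cluster):

```yaml
# RuntimeClass pointing at the handler the toolkit configures in containerd
# (the "nvidia" name is the common default; confirm it in your cluster)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# Pod explicitly selecting the NVIDIA runtime
apiVersion: v1
kind: Pod
metadata:
  name: runtime-class-demo
spec:
  runtimeClassName: nvidia
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

When the toolkit is configured as the default runtime, `runtimeClassName` can be omitted.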

3. Device Plugin

Registers nvidia.com/gpu resources with Kubernetes so Pods can request GPUs.

# Example: Requesting GPU in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1

The Device Plugin operation flow is as follows.

+-------------------+     +------------------+     +---------------+
| Device Plugin     | --> | kubelet          | --> | API Server    |
| (DaemonSet)       |     | gRPC registration|     | Node resource |
+-------------------+     +------------------+     | update        |
                                                    +---------------+

  • Registers GPU devices with kubelet via gRPC
  • Manages GPU allocation/deallocation
  • Supports topology-aware scheduling

4. GPU Feature Discovery (GFD)

Detects GPU properties on each node and exposes them as Kubernetes node labels.

# Example node labels added by GFD
kubectl get node worker-gpu-01 -o json | jq '.metadata.labels' | grep nvidia

# Example output:
# "nvidia.com/cuda.driver.major": "550"
# "nvidia.com/cuda.driver.minor": "90"
# "nvidia.com/cuda.driver.rev": "07"
# "nvidia.com/cuda.runtime.major": "12"
# "nvidia.com/gpu.count": "4"
# "nvidia.com/gpu.family": "ampere"
# "nvidia.com/gpu.machine": "DGX-A100"
# "nvidia.com/gpu.memory": "81920"
# "nvidia.com/gpu.product": "A100-SXM4-80GB"
# "nvidia.com/gpu.replicas": "1"
# "nvidia.com/mig.capable": "true"

Detected information includes the following.

Label                Description
gpu.product          GPU model name (A100, H100, etc.)
gpu.memory           GPU memory capacity (MB)
gpu.family           GPU architecture (ampere, hopper, etc.)
cuda.driver.major    CUDA driver major version
cuda.runtime.major   CUDA runtime major version
mig.capable          MIG support status
gpu.count            Number of GPUs
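
These labels plug directly into standard Kubernetes scheduling. A minimal sketch that pins a Pod to A100 nodes via a nodeSelector (the product string is assumed to match the GFD output shown above):

```yaml
# Pod scheduled only onto nodes GFD has labeled as A100-SXM4-80GB
apiVersion: v1
kind: Pod
metadata:
  name: a100-job
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```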

5. DCGM + DCGM Exporter

Monitors GPU health using NVIDIA DCGM (Data Center GPU Manager) and exports Prometheus metrics.

+------------------+     +------------------+     +------------------+
| DCGM             | --> | DCGM Exporter    | --> | Prometheus       |
| (GPU Monitoring) |     | (Metric Convert) |     | (Collect/Store)  |
+------------------+     +------------------+     +--------+---------+
                                                           |
                                                  +--------v---------+
                                                  | Grafana          |
                                                  | (Visualization)  |
                                                  +------------------+

Key metrics include the following.

Metric                            Description
DCGM_FI_DEV_GPU_UTIL              GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL         Memory copy utilization (%)
DCGM_FI_DEV_FB_USED               Framebuffer memory used (MB)
DCGM_FI_DEV_FB_FREE               Framebuffer memory free (MB)
DCGM_FI_DEV_GPU_TEMP              GPU temperature (°C)
DCGM_FI_DEV_POWER_USAGE           GPU power usage (W)
DCGM_FI_DEV_PCIE_TX_THROUGHPUT    PCIe transmit throughput
DCGM_FI_DEV_XID_ERRORS            XID error count
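
With these metrics in Prometheus, alerting is a natural next step. A sketch of a PrometheusRule that fires on XID errors, assuming the Prometheus Operator CRDs are installed; the alert name, threshold, and label set here are illustrative, not prescribed by GPU Operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuXidErrors
          # Any XID errors in the last 5 minutes indicate a GPU fault
          expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "XID errors detected on GPU {{ $labels.gpu }}"
```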

6. MIG Manager

Dynamically configures Multi-Instance GPU (MIG) profiles on NVIDIA A100, H100, and similar GPUs.

A100 80GB MIG Profile Examples:

Full GPU (80GB), example partitions:

1g.10gb x 7
+--------+--------+--------+--------+--------+--------+--------+
| 10GB   | 10GB   | 10GB   | 10GB   | 10GB   | 10GB   | 10GB   |
+--------+--------+--------+--------+--------+--------+--------+

Or

3g.40gb + 4g.40gb
+------------------------------+-------------------------------+
|        3g.40gb (40GB)        |        4g.40gb (40GB)         |
+------------------------------+-------------------------------+

Or

7g.80gb x 1
+--------------------------------------------------------------+
|                        7g.80gb (80GB)                        |
+--------------------------------------------------------------+

7. vGPU Manager (Optional)

Creates and manages virtual GPUs in environments with NVIDIA vGPU licenses.

  • Host driver deployment
  • vGPU instance creation and management
  • vGPU allocation support for KubeVirt VMs

8. Node Feature Discovery (NFD)

A prerequisite for GPU Operator that detects hardware features on nodes.

# Example labels added by NFD
# feature.node.kubernetes.io/pci-10de.present=true  (NVIDIA PCI device)
# feature.node.kubernetes.io/kernel-version.major=5
# feature.node.kubernetes.io/system-os_release.ID=ubuntu

Initialization Order

GPU Operator components initialize sequentially based on dependencies.

1. NFD (Node Feature Discovery)
   |
   v
2. NVIDIA Driver
   |
   v
3. NVIDIA Container Toolkit
   |
   v
4. Device Plugin
   |
   v
5. GPU Feature Discovery (GFD)
   |
   v
6. DCGM / DCGM Exporter
   |
   v
7. MIG Manager (optional)

Each step must complete before the next begins. Validator Pods verify correct operation at each stage.

Installation

Installation via Helm

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create GPU Operator namespace
kubectl create namespace gpu-operator

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version=550.90.07 \
  --set toolkit.version=v1.16.1-ubuntu20.04

# Check installation status
kubectl get pods -n gpu-operator -w

When Drivers Are Already Installed on Host

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false

Installation Verification

# Check all component Pods
kubectl get pods -n gpu-operator

# Verify GPU node resources
kubectl describe node worker-gpu-01 | grep -A 10 "Allocatable"

# Run test Pod
kubectl run gpu-test --image=nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
  --limits=nvidia.com/gpu=1 \
  --command -- nvidia-smi
kubectl logs gpu-test

GPU Time-Slicing Configuration

Allows multiple Pods to share a GPU on devices that do not support MIG.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

# Apply ConfigMap
kubectl apply -f time-slicing-config.yaml

# Add time-slicing config to ClusterPolicy
kubectl patch clusterpolicy cluster-policy \
  --type=merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

After applying time-slicing, GPU resources change as follows.

# Before: nvidia.com/gpu: 1
# After:  nvidia.com/gpu: 4 (with replicas=4)
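
One way to exercise this: a Deployment whose replica count matches the time-slicing replicas, so all four Pods can land on a single physical GPU. A sketch with a placeholder image and command:

```yaml
# Four replicas, each requesting one (time-sliced) GPU; with replicas: 4
# in the time-slicing config, all four can share one physical GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-workload
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shared-gpu-workload
  template:
    metadata:
      labels:
        app: shared-gpu-workload
    spec:
      containers:
        - name: worker
          image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
          command: ["sleep", "infinity"]   # placeholder workload
          resources:
            limits:
              nvidia.com/gpu: 1
```

Note that time-slicing provides no memory isolation between the sharing Pods, unlike MIG.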

MIG Configuration Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
      mixed-mig:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4

# Apply MIG profile
kubectl label node worker-gpu-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG devices
kubectl describe node worker-gpu-01 | grep nvidia.com/mig
# nvidia.com/mig-1g.10gb: 7
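
Pods then request a MIG slice by its extended resource name rather than nvidia.com/gpu. A minimal sketch matching the all-1g.10gb profile above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 1g.10gb slice, not a full GPU
```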

Monitoring Dashboard

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s

Key Grafana Dashboard Panels

Panel              PromQL Query
GPU Utilization    DCGM_FI_DEV_GPU_UTIL
Memory Usage       DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
GPU Temperature    DCGM_FI_DEV_GPU_TEMP
Power Usage        DCGM_FI_DEV_POWER_USAGE
XID Errors         rate(DCGM_FI_DEV_XID_ERRORS[5m])

Conclusion

NVIDIA GPU Operator significantly reduces the complexity of managing GPU infrastructure on Kubernetes. With a single ClusterPolicy, you can automate everything from drivers to monitoring, and maximize GPU utilization through MIG and time-slicing.

In the next post, we will explore combining KubeVirt with GPU Operator to leverage GPU acceleration in VMs.


Quiz: GPU Operator Knowledge Check

Q1. What is the correct initialization order of GPU Operator components?

A) Device Plugin -> Driver -> Toolkit -> GFD
B) Driver -> Toolkit -> Device Plugin -> GFD
C) Toolkit -> Driver -> Device Plugin -> GFD
D) GFD -> Driver -> Toolkit -> Device Plugin

Answer: B) Components initialize in the order: Driver -> Toolkit -> Device Plugin -> GFD. Each step depends on the previous one.


Q2. What is the role of GPU Feature Discovery (GFD)?

A) Install GPU drivers
B) Configure container runtime
C) Detect GPU info and add node labels
D) Export GPU metrics to Prometheus

Answer: C) GFD detects GPU model, driver version, CUDA version, MIG support, and more, adding them as Kubernetes node labels.


Q3. How can multiple Pods share a GPU that does not support MIG?

A) Use vGPU Manager
B) Configure GPU time-slicing
C) Increase Device Plugin replicas
D) GPU partitioning is not possible

Answer: B) GPU time-slicing enables multiple Pods to share a GPU through time-division on devices without MIG support.


Q4. Which metric is NOT provided by DCGM Exporter?

A) GPU utilization
B) GPU temperature
C) Per-pod network bandwidth
D) XID error count

Answer: C) DCGM Exporter only provides GPU-related metrics. Network bandwidth is outside the scope of GPU monitoring.