[Virtualization] 07. NVIDIA GPU Operator: Automating GPU Management on Kubernetes

Introduction

Using GPUs on Kubernetes requires installing and managing multiple software components: NVIDIA drivers, container runtime configuration, device plugins, and monitoring. GPU Operator automates all of this, handling everything from Day-0 setup to Day-2 operations.

What is GPU Operator?

NVIDIA GPU Operator is an Operator that automatically deploys and manages all NVIDIA software components needed for GPU usage on Kubernetes.

The Problem It Solves

Let us look at what needs to be manually installed to use GPUs on Kubernetes.

Manual Installation:
+----------------------+
| 1. GPU Driver        |  <-- Direct host OS installation
+----------------------+
| 2. Container Toolkit |  <-- containerd/CRI-O config changes
+----------------------+
| 3. Device Plugin     |  <-- K8s DaemonSet deployment
+----------------------+
| 4. GFD               |  <-- GPU labeling DaemonSet
+----------------------+
| 5. DCGM              |  <-- Monitoring agent
+----------------------+
| 6. MIG Manager       |  <-- MIG profile management (optional)
+----------------------+

With GPU Operator:
+----------------------+
| ClusterPolicy CR     |  <-- One resource automates everything
+----------------------+

ClusterPolicy CRD

GPU Operator is built on the Operator Framework and declaratively manages all components through the ClusterPolicy CRD.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: '550.90.07'
    repository: nvcr.io/nvidia
    image: driver
  toolkit:
    enabled: true
    version: v1.16.1-ubuntu20.04
  devicePlugin:
    enabled: true
    version: v0.16.1
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    version: 3.3.7-3.5.0-ubuntu22.04
  gfd:
    enabled: true
    version: v0.16.1
  migManager:
    enabled: true
    version: v0.8.0
  nodeStatusExporter:
    enabled: true
  validator:
    version: v24.6.2

Component Details

1. NVIDIA Driver DaemonSet

Containerizes the GPU driver and deploys it as a DaemonSet, eliminating the need for direct host installation.

+---------------------------------------------+
|                 Worker Node                 |
|  +-------------------------------------+  |
|  | NVIDIA Driver Container (DaemonSet) |  |
|  |                                     |  |
|  | - Compiles and loads kernel modules |  |
|  |   (nvidia.ko, nvidia-uvm.ko)        |  |
|  | - Creates /dev/nvidia* devices      |  |
|  +-------------------------------------+  |
|                     |                     |
|  +-------------------------------------+  |
|  |             Host Kernel             |  |
|  +-------------------------------------+  |
+---------------------------------------------+

Key features include the following.

  • No need to install drivers directly on the host OS
  • Driver version upgrades via rolling updates
  • Automatic driver compilation against the node's kernel version
  • Pre-compiled driver images are also supported

2. NVIDIA Container Toolkit

Configures containerd or CRI-O runtime to recognize NVIDIA GPUs.

Container Execution Flow:

kubelet --> containerd --> nvidia-container-runtime-hook
                                    |
                                    v
                          nvidia-container-cli
                                    |
                                    v
                          Mounts GPU devices/libraries
                          into the container

  • Automatically patches containerd/CRI-O configuration
  • Registers nvidia-container-runtime hook
  • Auto-injects GPU libraries and devices into containers
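
The Operator also registers a RuntimeClass so Pods can opt into the NVIDIA runtime explicitly when it is not the containerd default. A minimal sketch, assuming the commonly used `nvidia` handler name (verify with `kubectl get runtimeclass` on your cluster):

```yaml
# RuntimeClass pointing at the handler the toolkit configures in containerd
# (the "nvidia" name is the common default; confirm it in your cluster)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# Pod explicitly selecting the NVIDIA runtime
apiVersion: v1
kind: Pod
metadata:
  name: runtime-class-demo
spec:
  runtimeClassName: nvidia
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

When the toolkit is configured as the default runtime, `runtimeClassName` can be omitted.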

3. Device Plugin

Registers nvidia.com/gpu resources with Kubernetes so Pods can request GPUs.

# Example: Requesting GPU in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1

The Device Plugin operation flow is as follows.

+-------------------+     +------------------+     +---------------+
| Device Plugin     | --> | kubelet          | --> | API Server    |
| (DaemonSet)       |     | gRPC registration|     | Node resource |
+-------------------+     +------------------+     | update        |
                                                    +---------------+

  • Registers GPU devices with kubelet via gRPC
  • Manages GPU allocation/deallocation
  • Supports topology-aware scheduling

4. GPU Feature Discovery (GFD)

Detects GPU properties on each node and exposes them as Kubernetes node labels.

# Example node labels added by GFD
kubectl get node worker-gpu-01 -o json | jq '.metadata.labels' | grep nvidia

# Example output:
# "nvidia.com/cuda.driver.major": "550"
# "nvidia.com/cuda.driver.minor": "90"
# "nvidia.com/cuda.driver.rev": "07"
# "nvidia.com/cuda.runtime.major": "12"
# "nvidia.com/gpu.count": "4"
# "nvidia.com/gpu.family": "ampere"
# "nvidia.com/gpu.machine": "DGX-A100"
# "nvidia.com/gpu.memory": "81920"
# "nvidia.com/gpu.product": "A100-SXM4-80GB"
# "nvidia.com/gpu.replicas": "1"
# "nvidia.com/mig.capable": "true"

Detected information includes the following.

Label                Description
gpu.product          GPU model name (A100, H100, etc.)
gpu.memory           GPU memory capacity (MB)
gpu.family           GPU architecture (ampere, hopper, etc.)
cuda.driver.major    CUDA driver major version
cuda.runtime.major   CUDA runtime major version
mig.capable          MIG support status
gpu.count            Number of GPUs
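
These labels plug directly into standard Kubernetes scheduling. A minimal sketch that pins a Pod to A100 nodes via a nodeSelector (the product string is assumed to match the GFD output shown above):

```yaml
# Pod scheduled only onto nodes GFD has labeled as A100-SXM4-80GB
apiVersion: v1
kind: Pod
metadata:
  name: a100-job
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```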

5. DCGM + DCGM Exporter

Monitors GPU health using NVIDIA DCGM (Data Center GPU Manager) and exports Prometheus metrics.

+------------------+     +------------------+     +------------------+
| DCGM             | --> | DCGM Exporter    | --> | Prometheus       |
| (GPU Monitoring) |     | (Metric Convert) |     | (Collect/Store)  |
+------------------+     +------------------+     +--------+---------+
                                                           |
                                                  +--------v---------+
                                                  | Grafana          |
                                                  | (Visualization)  |
                                                  +------------------+

Key metrics include the following.

Metric                            Description
DCGM_FI_DEV_GPU_UTIL              GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL         Memory copy utilization (%)
DCGM_FI_DEV_FB_USED               Framebuffer memory used (MB)
DCGM_FI_DEV_FB_FREE               Framebuffer memory free (MB)
DCGM_FI_DEV_GPU_TEMP              GPU temperature (°C)
DCGM_FI_DEV_POWER_USAGE           GPU power usage (W)
DCGM_FI_DEV_PCIE_TX_THROUGHPUT    PCIe transmit throughput
DCGM_FI_DEV_XID_ERRORS            XID error count
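
With these metrics in Prometheus, alerting is a natural next step. A sketch of a PrometheusRule that fires on XID errors, assuming the Prometheus Operator CRDs are installed; the alert name, threshold, and label set here are illustrative, not prescribed by GPU Operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuXidErrors
          # Any XID errors in the last 5 minutes indicate a GPU fault
          expr: rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "XID errors detected on GPU {{ $labels.gpu }}"
```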

6. MIG Manager

Dynamically configures Multi-Instance GPU (MIG) profiles on NVIDIA A100, H100, and similar GPUs.

A100 80GB MIG Profile Examples:

Full GPU (80GB), example partitions:

1g.10gb x 7
+--------+--------+--------+--------+--------+--------+--------+
| 10GB   | 10GB   | 10GB   | 10GB   | 10GB   | 10GB   | 10GB   |
+--------+--------+--------+--------+--------+--------+--------+

Or

3g.40gb + 4g.40gb
+------------------------------+-------------------------------+
|        3g.40gb (40GB)        |        4g.40gb (40GB)         |
+------------------------------+-------------------------------+

Or

7g.80gb x 1
+--------------------------------------------------------------+
|                        7g.80gb (80GB)                        |
+--------------------------------------------------------------+

7. vGPU Manager (Optional)

Creates and manages virtual GPUs in environments with NVIDIA vGPU licenses.

  • Host driver deployment
  • vGPU instance creation and management
  • vGPU allocation support for KubeVirt VMs

8. Node Feature Discovery (NFD)

A prerequisite for GPU Operator that detects hardware features on nodes.

# Example labels added by NFD
# feature.node.kubernetes.io/pci-10de.present=true  (NVIDIA PCI device)
# feature.node.kubernetes.io/kernel-version.major=5
# feature.node.kubernetes.io/system-os_release.ID=ubuntu

Initialization Order

GPU Operator components initialize sequentially based on dependencies.

1. NFD (Node Feature Discovery)
   |
   v
2. NVIDIA Driver
   |
   v
3. NVIDIA Container Toolkit
   |
   v
4. Device Plugin
   |
   v
5. GPU Feature Discovery (GFD)
   |
   v
6. DCGM / DCGM Exporter
   |
   v
7. MIG Manager (optional)

Each step must complete before the next begins. Validator Pods verify correct operation at each stage.

Installation

Installation via Helm

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create GPU Operator namespace
kubectl create namespace gpu-operator

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version=550.90.07 \
  --set toolkit.version=v1.16.1-ubuntu20.04

# Check installation status
kubectl get pods -n gpu-operator -w

When Drivers Are Already Installed on Host

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false

Installation Verification

# Check all component Pods
kubectl get pods -n gpu-operator

# Verify GPU node resources
kubectl describe node worker-gpu-01 | grep -A 10 "Allocatable"

# Run test Pod
kubectl run gpu-test --image=nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
  --limits=nvidia.com/gpu=1 \
  --command -- nvidia-smi
kubectl logs gpu-test

GPU Time-Slicing Configuration

Allows multiple Pods to share a GPU on devices that do not support MIG.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

# Apply ConfigMap
kubectl apply -f time-slicing-config.yaml

# Add time-slicing config to ClusterPolicy
kubectl patch clusterpolicy cluster-policy \
  --type=merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

After applying time-slicing, GPU resources change as follows.

# Before: nvidia.com/gpu: 1
# After:  nvidia.com/gpu: 4 (with replicas=4)
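
One way to exercise this: a Deployment whose replica count matches the time-slicing replicas, so all four Pods can land on a single physical GPU. A sketch with a placeholder image and command:

```yaml
# Four replicas, each requesting one (time-sliced) GPU; with replicas: 4
# in the time-slicing config, all four can share one physical GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-workload
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shared-gpu-workload
  template:
    metadata:
      labels:
        app: shared-gpu-workload
    spec:
      containers:
        - name: worker
          image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
          command: ["sleep", "infinity"]   # placeholder workload
          resources:
            limits:
              nvidia.com/gpu: 1
```

Note that time-slicing provides no memory isolation between the sharing Pods, unlike MIG.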

MIG Configuration Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
      mixed-mig:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4

# Apply MIG profile
kubectl label node worker-gpu-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG devices
kubectl describe node worker-gpu-01 | grep nvidia.com/mig
# nvidia.com/mig-1g.10gb: 7
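
Pods then request a MIG slice by its extended resource name rather than nvidia.com/gpu. A minimal sketch matching the all-1g.10gb profile above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 1g.10gb slice, not a full GPU
```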

Monitoring Dashboard

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s

Key Grafana Dashboard Panels

Panel              PromQL Query
GPU Utilization    DCGM_FI_DEV_GPU_UTIL
Memory Usage       DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
GPU Temperature    DCGM_FI_DEV_GPU_TEMP
Power Usage        DCGM_FI_DEV_POWER_USAGE
XID Errors         rate(DCGM_FI_DEV_XID_ERRORS[5m])

Conclusion

NVIDIA GPU Operator significantly reduces the complexity of managing GPU infrastructure on Kubernetes. With a single ClusterPolicy, you can automate everything from drivers to monitoring, and maximize GPU utilization through MIG and time-slicing.

In the next post, we will explore combining KubeVirt with GPU Operator to leverage GPU acceleration in VMs.


Quiz: GPU Operator Knowledge Check

Q1. What is the correct initialization order of GPU Operator components?

A) Device Plugin -> Driver -> Toolkit -> GFD
B) Driver -> Toolkit -> Device Plugin -> GFD
C) Toolkit -> Driver -> Device Plugin -> GFD
D) GFD -> Driver -> Toolkit -> Device Plugin

Answer: B) Components initialize in the order: Driver -> Toolkit -> Device Plugin -> GFD. Each step depends on the previous one.


Q2. What is the role of GPU Feature Discovery (GFD)?

A) Install GPU drivers
B) Configure container runtime
C) Detect GPU info and add node labels
D) Export GPU metrics to Prometheus

Answer: C) GFD detects GPU model, driver version, CUDA version, MIG support, and more, adding them as Kubernetes node labels.


Q3. How can multiple Pods share a GPU that does not support MIG?

A) Use vGPU Manager
B) Configure GPU time-slicing
C) Increase Device Plugin replicas
D) GPU partitioning is not possible

Answer: B) GPU time-slicing enables multiple Pods to share a GPU through time-division on devices without MIG support.


Q4. Which metric is NOT provided by DCGM Exporter?

A) GPU utilization
B) GPU temperature
C) Per-pod network bandwidth
D) XID error count

Answer: C) DCGM Exporter only provides GPU-related metrics. Network bandwidth is outside the scope of GPU monitoring.