NVIDIA GPU Operator Complete Guide: Components, Installation, and KubeVirt GPU Passthrough


1. Introduction to GPU Operator

1.1 What is GPU Operator?

NVIDIA GPU Operator is a Kubernetes Operator that automates the installation, configuration, and management of all software components required to utilize GPUs in a Kubernetes cluster. Built on the Kubernetes Operator Framework, it consistently manages drivers, runtimes, device plugins, monitoring tools, and more from Day-0 provisioning through Day-2 operations.

Traditionally, using GPUs in Kubernetes required manually performing the following tasks on each node:

  1. Installing NVIDIA drivers (exact version matching the kernel version)
  2. Installing NVIDIA Container Toolkit and configuring Container Runtime
  3. Deploying NVIDIA Device Plugin
  4. Installing monitoring tools (DCGM Exporter)
  5. Deploying Node Feature Discovery

GPU Operator manages all these processes declaratively through a single Custom Resource Definition (CRD) called ClusterPolicy.

1.2 Why is GPU Operator Needed?

Manual GPU driver installation has the following serious operational issues:

| Issue | Manual Management | GPU Operator |
| --- | --- | --- |
| Driver Installation | SSH into each node for manual install | Auto-deploy via DaemonSet |
| Kernel Update Response | Driver recompile needed on kernel update | Auto-match pre-compiled drivers |
| Version Consistency | Risk of driver version mismatch per node | Unified version via ClusterPolicy |
| New Node Addition | Provisioning script maintenance required | Label-based auto-detect and install |
| Monitoring | Separate setup required | Auto-deploy DCGM Exporter |
| GPU Sharing | Manual MIG/Time-Slicing setup | Declarative ConfigMap-based setup |
| Upgrades | Manual update after node drain | Rolling Update support |

1.3 Supported Environments

Supported OS:

| OS | Version |
| --- | --- |
| Ubuntu | 20.04, 22.04, 24.04 LTS |
| Red Hat Enterprise Linux | 8.x, 9.x |
| CentOS Stream | 8, 9 |
| Rocky Linux | 8.x, 9.x |
| SUSE Linux Enterprise | 15 SP4+ |

Supported GPUs:

| GPU Generation | Example Models | MIG Support |
| --- | --- | --- |
| Turing | T4, RTX 2080 | X |
| Ampere | A100, A30, A10, A2 | O (A100, A30) |
| Hopper | H100, H200 | O |
| Ada Lovelace | L40, L40S, L4 | X |
| Blackwell | B100, B200, GB200 | O |

Supported Kubernetes Version: v1.25+ (recommended: v1.28+)

Supported Container Runtime: containerd, CRI-O

1.4 Architecture Overview

┌──────────────────────────────────────────────────────────┐
│                 Kubernetes Control Plane                 │
│  ┌────────────────────────────────────────────────────┐  │
│  │             GPU Operator Controller                │  │
│  │       (Deployment in gpu-operator namespace)       │  │
│  │                                                    │  │
│  │  ┌───────────────┐    ┌─────────────────────────┐  │  │
│  │  │ ClusterPolicy │───▶│ Reconciliation Loop     │  │  │
│  │  │     (CRD)     │    │  - Watch GPU nodes      │  │  │
│  │  └───────────────┘    │  - Deploy DaemonSets    │  │  │
│  │                       │  - Manage lifecycle     │  │  │
│  │                       └─────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
  │ GPU Worker #1  │  │ GPU Worker #2  │  │ GPU Worker #3  │
  │                │  │                │  │                │
  │ NFD            │  │ NFD            │  │ NFD            │
  │ GFD            │  │ GFD            │  │ GFD            │
  │ NVIDIA Driver  │  │ NVIDIA Driver  │  │ NVIDIA Driver  │
  │ Container      │  │ Container      │  │ Container      │
  │   Toolkit      │  │   Toolkit      │  │   Toolkit      │
  │ Device Plugin  │  │ Device Plugin  │  │ Device Plugin  │
  │ DCGM Exporter  │  │ DCGM Exporter  │  │ DCGM Exporter  │
  │ MIG Manager    │  │ MIG Manager    │  │ MIG Manager    │
  │                │  │                │  │                │
  │ ╔════════════╗ │  │ ╔════════════╗ │  │ ╔════════════╗ │
  │ ║ GPU (A100) ║ │  │ ║ GPU (H100) ║ │  │ ║ GPU (T4)   ║ │
  │ ╚════════════╝ │  │ ╚════════════╝ │  │ ╚════════════╝ │
  └────────────────┘  └────────────────┘  └────────────────┘

(NFD = Node Feature Discovery, GFD = GPU Feature Discovery; every component on a worker node runs as a Pod of its DaemonSet)

The GPU Operator Controller watches the ClusterPolicy CRD, automatically detects worker nodes with GPUs, and deploys all necessary components as DaemonSets. A node is recognized as having an NVIDIA GPU when it has the label feature.node.kubernetes.io/pci-10de.present=true.
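As a quick sanity check, that NFD label can be queried directly to list the nodes GPU Operator will target (a hypothetical one-liner; your node names will differ):

```shell
# List nodes that NFD has labeled as carrying an NVIDIA PCI device (vendor ID 10de)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
```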


2. Detailed Analysis of the 7 Core Components

GPU Operator consists of 7 core components that together form the complete software stack required for GPU workloads. Each component is deployed as a DaemonSet, and the components depend on one another in a fixed order (see the dependency chain in section 2.7).

2.1 NVIDIA Driver (nvidia-driver-daemonset)

The NVIDIA Driver DaemonSet automatically installs the kernel module required to communicate with GPU hardware on each GPU node. This is the most fundamental and essential component of GPU Operator.

Key Features:

  • Auto-build and load NVIDIA kernel modules (nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko)
  • Auto-match drivers to host kernel version
  • Support for both pre-compiled and run-compiled drivers
  • Driver version management and upgrades

Pre-compiled vs Run-compiled:

| Property | Pre-compiled | Run-compiled |
| --- | --- | --- |
| Build Time | None (already built) | 10-20 min on node |
| Kernel Compatibility | Matched to a specific kernel version | Supports all kernel versions |
| Node Boot Time | Fast | Slow |
| Supported OS | Ubuntu (primarily) | Most Linux distributions |
| Setting | kernelModuleType: precompiled | kernelModuleType: runcompiled |

Driver Configuration Example in ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    repository: nvcr.io/nvidia
    image: driver
    version: '560.35.03'
    kernelModuleType: auto # auto | precompiled | runcompiled
    manager:
      env:
        - name: ENABLE_AUTO_DRAIN
          value: 'true'
    rdma:
      enabled: false
    licensingConfig:
      nlsEnabled: false
      configMapName: ''

Note: If NVIDIA drivers are already installed on the host, set driver.enabled=false. Conflicting drivers will prevent the GPU from functioning properly.

2.2 NVIDIA Container Toolkit (nvidia-container-toolkit)

NVIDIA Container Toolkit acts as a bridge enabling Container Runtimes (containerd, CRI-O) to access GPUs from within containers.

Core Operation Principle:

┌─────────────────────────────────────────────────────┐
│              Container Creation Flow                │
│                                                     │
│   kubelet                                           │
│     │                                               │
│     ▼                                               │
│   containerd / CRI-O                                │
│     │                                               │
│     ▼                                               │
│   ┌─────────────────────────────────────┐           │
│   │   NVIDIA Container Runtime Hook     │           │
│   │   (nvidia-container-runtime-hook)   │           │
│   │                                     │           │
│   │   1. Modify OCI spec                │           │
│   │   2. Mount GPU devices              │           │
│   │   3. Inject NVIDIA libraries        │           │
│   │   4. Set environment variables      │           │
│   └─────────────────────────────────────┘           │
│     │                                               │
│     ▼                                               │
│   ┌─────────────────────────────────────┐           │
│   │   libnvidia-container               │           │
│   │   - GPU device node binding         │           │
│   │   - CUDA library mount              │           │
│   │   - Driver library mount            │           │
│   └─────────────────────────────────────┘           │
│     │                                               │
│     ▼                                               │
│   Container (with GPU access)                       │
│   /dev/nvidia0, /dev/nvidiactl, ...                 │
└─────────────────────────────────────────────────────┘

CDI (Container Device Interface) Support:

Starting from GPU Operator v25.x, CDI is enabled by default, allowing the container runtime to recognize GPU devices in a standardized manner.

spec:
  toolkit:
    enabled: true
    version: v1.17.3-ubuntu22.04
  cdi:
    enabled: true # Enable CDI (default: true)
    default: true # Use CDI as default device allocation method
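To inspect CDI on a node, the NVIDIA Container Toolkit ships the `nvidia-ctk` CLI. A sketch, run on the GPU host and assuming the toolkit is installed there:

```shell
# Generate a CDI spec for the GPUs on this host, then list the resulting device names
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list
```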

2.3 NVIDIA Device Plugin (nvidia-device-plugin)

NVIDIA Device Plugin registers GPUs as Kubernetes Extended Resources with kubelet, allowing Pods to request nvidia.com/gpu resources.

Operation Flow:

  1. Device Plugin registers with kubelet's Device Plugin gRPC interface
  2. Reports the number of GPUs on the node to kubelet
  3. Kubernetes Scheduler schedules Pods requesting nvidia.com/gpu to the appropriate node
  4. Device Plugin passes GPU device info to kubelet when a Pod runs

GPU Request Example in Pod:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      command: ['nvidia-smi']
      resources:
        limits:
          nvidia.com/gpu: 1 # Request 1 GPU

Resource Types Managed by Device Plugin:

| Resource Name | Description | Use Case |
| --- | --- | --- |
| nvidia.com/gpu | Full GPU (or Time-Slicing shared) | Default GPU workloads |
| nvidia.com/mig-1g.5gb | MIG 1g.5gb instance | A100 MIG partition |
| nvidia.com/mig-2g.10gb | MIG 2g.10gb instance | A100 MIG partition |
| nvidia.com/mig-3g.20gb | MIG 3g.20gb instance | A100 MIG partition |
| nvidia.com/mig-7g.40gb | MIG 7g.40gb full instance | A100 full MIG |
| nvidia.com/gpu.shared | Time-Slicing shared GPU | When renameByDefault is enabled |

2.4 NVIDIA DCGM & DCGM Exporter

DCGM (Data Center GPU Manager) is a tool for monitoring NVIDIA GPU health and performance, while DCGM Exporter exposes these metrics in Prometheus format.

Key Collected Metrics:

| Metric Name | Description | Unit |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization | % |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization | % |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C |
| DCGM_FI_DEV_POWER_USAGE | Power consumption | W |
| DCGM_FI_DEV_FB_FREE | Free framebuffer memory | MiB |
| DCGM_FI_DEV_FB_USED | Used framebuffer memory | MiB |
| DCGM_FI_DEV_XID_ERRORS | XID error count | count |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Single-bit ECC errors (volatile) | count |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Double-bit ECC errors (volatile) | count |
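Since DCGM Exporter serves these metrics over HTTP on port 9400, a quick spot check is possible by curling the endpoint (from the exporter Pod's node or via `kubectl port-forward`; the localhost address below assumes the port is reachable there):

```shell
# Scrape the exporter endpoint and filter for GPU utilization samples
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```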

2.5 NVIDIA MIG Manager

MIG (Multi-Instance GPU) Manager manages the partitioning of a single physical GPU into multiple independent GPU instances on NVIDIA Ampere and newer architectures (A100, A30, H100, etc.).

Unlike Time-Slicing, MIG provides hardware-level memory isolation and compute isolation. Each MIG instance has its own memory, cache, and SM (Streaming Multiprocessor).

A100 40GB MIG Profiles:

| Profile | GPU Slice | Memory | Max Instances |
| --- | --- | --- | --- |
| 1g.5gb | 1/7 | 5 GB | 7 |
| 2g.10gb | 2/7 | 10 GB | 3 |
| 3g.20gb | 3/7 | 20 GB | 2 |
| 7g.40gb | 7/7 | 40 GB | 1 |
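With GPU Operator, a MIG layout is typically selected by labeling the node; MIG Manager then reconfigures the GPUs to match. A sketch using the built-in `all-1g.5gb` profile (the node name is hypothetical):

```shell
# Ask MIG Manager to carve every GPU on the node into 1g.5gb instances
kubectl label node gpu-node-1 nvidia.com/mig.config=all-1g.5gb --overwrite

# Watch the reconfiguration state reported back as a node label
kubectl get node gpu-node-1 --show-labels | tr ',' '\n' | grep mig.config
```

Once the state reaches success, the node advertises `nvidia.com/mig-1g.5gb` resources instead of `nvidia.com/gpu`.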

2.6 Node Feature Discovery (NFD)

Node Feature Discovery (NFD) automatically detects hardware characteristics of Kubernetes nodes and registers them as node labels. As a dependency of GPU Operator, it plays a key role in identifying GPU nodes.

GPU Operator uses the feature.node.kubernetes.io/pci-10de.present=true label to automatically identify nodes with NVIDIA GPUs and deploy GPU-related DaemonSets only to those nodes.

Note: If NFD is already installed in the cluster, set nfd.enabled=false during GPU Operator installation to prevent duplicate deployment.

2.7 GPU Feature Discovery (GFD)

GPU Feature Discovery (GFD) registers GPU-specific detailed information as node labels, similar to NFD but focused on GPU details. GFD communicates with the NVIDIA Driver to collect GPU model, driver version, CUDA version, and other information.
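The labels GFD applies can be inspected directly on a node; typical keys include `nvidia.com/gpu.product`, `nvidia.com/gpu.count`, `nvidia.com/gpu.memory`, and `nvidia.com/cuda.driver.major` (node name hypothetical):

```shell
# Dump the nvidia.com/* labels GFD has attached to a GPU node
kubectl get node gpu-node-1 --show-labels | tr ',' '\n' | grep nvidia.com
```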

Component Deployment Order (Dependency Chain):

NFD (Node Feature Discovery)
 └──▶ GPU Feature Discovery (GFD)
       └──▶ NVIDIA Driver
             └──▶ NVIDIA Container Toolkit
                   └──▶ NVIDIA Device Plugin
                         └──▶ DCGM / DCGM Exporter
                               └──▶ MIG Manager (MIG-capable GPUs only)
                                     └──▶ Validator (Verification)

Component Role Summary Table:

| Component | DaemonSet Name | Role | Namespace |
| --- | --- | --- | --- |
| NFD | node-feature-discovery-worker | Node hardware labeling | gpu-operator |
| GFD | gpu-feature-discovery | GPU detail labeling | gpu-operator |
| Driver | nvidia-driver-daemonset | GPU kernel module install | gpu-operator |
| Toolkit | nvidia-container-toolkit-daemonset | Container Runtime GPU integration | gpu-operator |
| Device Plugin | nvidia-device-plugin-daemonset | GPU resource kubelet registration | gpu-operator |
| DCGM | nvidia-dcgm | GPU metrics collection daemon | gpu-operator |
| DCGM Exporter | nvidia-dcgm-exporter | Prometheus metrics exposure | gpu-operator |
| MIG Manager | nvidia-mig-manager | MIG partition management | gpu-operator |
| Validator | nvidia-operator-validator | Full stack verification | gpu-operator |

3. Installation Guide (Helm-based)

3.1 Prerequisites

3.1.1 Hardware Requirements

  • Worker nodes with NVIDIA GPUs (Turing architecture or newer recommended)
  • PCIe 3.0+ slots
  • IOMMU support (when using KubeVirt GPU Passthrough)

3.1.2 Software Requirements

# Verify kubectl and helm installation
kubectl version --client
helm version

# Verify Kubernetes cluster access
kubectl get nodes

# Check Container Runtime (containerd or CRI-O)
kubectl get nodes -o wide

3.1.3 Disable Nouveau Driver

The GPU Operator Driver DaemonSet requires that the open-source Nouveau driver be disabled on the host before the NVIDIA kernel module can load.

# Check if nouveau module is loaded
lsmod | grep nouveau

# Disable nouveau (if needed)
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Regenerate initramfs (Ubuntu/Debian; on RHEL-family distros use `sudo dracut --force`)
sudo update-initramfs -u

# Reboot
sudo reboot

3.2 Helm Repo Addition and Basic Installation

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Check available versions
helm search repo nvidia/gpu-operator --versions

# Basic installation (latest version)
helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --wait

3.3 Verification

# nvidia-smi execution test
kubectl run nvidia-smi \
  --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 \
  --overrides='{"apiVersion":"v1","spec":{"restartPolicy":"Never","containers":[{"name":"nvidia-smi","image":"nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

# Wait for the Pod to finish, then check the output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/nvidia-smi --timeout=120s
kubectl logs nvidia-smi

# Cleanup
kubectl delete pod nvidia-smi

4. GPU Sharing Strategies

4.1 Time-Slicing

Time-Slicing shares a single physical GPU among multiple containers by time-based division. Supported on all NVIDIA GPUs.

Characteristics:

  • No hardware-level isolation (no memory isolation)
  • Risk of GPU memory OOM
  • All NVIDIA GPUs supported (Turing+)
  • Simple and flexible configuration
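Time-Slicing is configured through a ConfigMap referenced from ClusterPolicy. A minimal sketch that advertises each physical GPU as 4 schedulable `nvidia.com/gpu` resources (the ConfigMap name, `any` key, and replica count are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

ClusterPolicy then points the Device Plugin at it, e.g. via `devicePlugin.config.name=time-slicing-config` and `devicePlugin.config.default=any`.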

4.2 MIG (Multi-Instance GPU)

MIG is a hardware-level GPU partitioning technology supported on NVIDIA Ampere and newer architectures (A100, A30, H100, H200).

4.3 vGPU (Virtual GPU)

NVIDIA vGPU virtualizes GPUs for sharing across multiple VMs. Primarily used with KubeVirt VMs and requires a separate NVIDIA license.

4.4 GPU Sharing Strategy Comparison

| Property | Time-Slicing | MIG | vGPU |
| --- | --- | --- | --- |
| Isolation Level | None (time-sharing only) | Hardware (memory+compute) | Hardware (virtualization) |
| Memory Isolation | X | O | O |
| Fault Isolation | X | O | O |
| Supported GPUs | All NVIDIA GPUs | A100, A30, H100, H200 | vGPU-capable GPUs |
| Max Partitions | Unlimited by config | Varies by GPU (max 7) | Varies by GPU |
| Performance Overhead | Context switching | Minimal | Virtualization overhead |
| License | Free | Free | Paid (NVIDIA AI Enterprise) |
| Use Cases | Dev/test, light inference | Production inference, multi-tenant | VM-based workloads |

5. KubeVirt and GPU Operator Integration

5.1 KubeVirt Introduction

KubeVirt is a project that enables managing virtual machines (VMs) on Kubernetes. By integrating KubeVirt with GPU Operator, you can assign physical GPUs directly to VMs (Passthrough) or allocate virtual GPUs (vGPU).

Important: A single GPU worker node can only run one type of GPU workload. You cannot mix container, vm-passthrough, and vm-vgpu on the same node.

5.2 GPU Passthrough (PCI Passthrough)

GPU Passthrough assigns the entire physical GPU directly to a KubeVirt VM. The VM can use the GPU at native performance, but GPU sharing is not possible.

Key Steps:

  1. Enable IOMMU/VT-d in BIOS
  2. Add kernel parameters (intel_iommu=on iommu=pt)
  3. Load VFIO driver module
  4. Install GPU Operator with sandboxWorkloads.enabled=true
  5. Label GPU node with nvidia.com/gpu.workload.config=vm-passthrough
  6. Register GPU device in KubeVirt CR
  7. Assign GPU in VirtualMachine spec
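The last two steps can be sketched as follows. The PCI vendor:device ID and resource name below are illustrative (look yours up with `lspci -nn` and the resources advertised on the node), and both manifests are abridged:

```yaml
# KubeVirt CR: permit the host GPU as an assignable device
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10de:2235"          # hypothetical example: NVIDIA A40
          resourceName: "nvidia.com/GA102GL_A40"  # name advertised by the sandbox device plugin
          externalResourceProvider: true
---
# VirtualMachine: attach the permitted GPU to the guest
# (abridged; runStrategy, disks, networks, etc. omitted)
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - deviceName: nvidia.com/GA102GL_A40
              name: gpu1
```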

5.3 vGPU with KubeVirt

vGPU splits a single physical GPU into multiple virtual GPUs that can be shared across multiple VMs. Unlike GPU Passthrough, a single GPU can be used by multiple VMs simultaneously.

Prerequisites:

  • NVIDIA AI Enterprise license or vGPU license
  • GPU supporting vGPU (A100, A10, A30, A16, L40, L4, T4, etc.)

5.4 Live Migration with GPU Constraints

| Constraint | GPU Passthrough | vGPU |
| --- | --- | --- |
| Live Migration | Not supported | Limited support (NVIDIA vGPU Migration) |
| Reason | PCIe direct assignment cannot be migrated | vGPU state transfer needed |
| Alternative | Cold Migration (stop VM, then migrate) | NVIDIA vGPU Migration license required |

6. Monitoring and Observability

6.1 DCGM Exporter -> Prometheus -> Grafana Pipeline

GPU Operator's DCGM Exporter collects metrics from each GPU node and exposes them in Prometheus format on port 9400. Prometheus scrapes these metrics, and Grafana visualizes them using Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).

6.2 Alert Rule Examples

Key alerts include: GPU temperature warning (above 85°C), GPU temperature critical (above 90°C), GPU memory nearly full (above 95% usage), double-bit ECC errors, sustained high GPU utilization (above 95% for 30+ minutes), and XID errors.
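Two of these alerts might be expressed as a PrometheusRule, a sketch assuming the Prometheus Operator is installed and scraping DCGM Exporter (rule names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU temperature above 85°C on {{ $labels.Hostname }}"
        - alert: GpuDoubleBitEccErrors
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Double-bit ECC errors detected on {{ $labels.Hostname }}"
```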


7. Production Deployment Best Practices

7.1 Node Labeling Strategy

In production, GPU nodes and CPU nodes should be clearly distinguished.
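One common convention is a dedicated role label that nodeSelector and affinity rules can key on (label keys and node names below are illustrative, not a GPU Operator requirement):

```shell
# Mark GPU and CPU nodes with explicit role labels
kubectl label node gpu-node-1 node-role.kubernetes.io/gpu=true
kubectl label node cpu-node-1 node-role.kubernetes.io/worker=true
```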

7.2 Taints & Tolerations

Set Taints on GPU nodes so only GPU workloads are scheduled.
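A sketch of the taint plus the matching toleration (taint key and node name are illustrative):

```yaml
# First taint the GPU node so ordinary Pods are kept off it:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Then give GPU workloads a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```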

7.3 ResourceQuota for GPU Allocation Limits

Limit GPU usage per namespace.
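Extended resources are limited per namespace with a `requests.`-prefixed quota entry; a minimal sketch for a hypothetical `team-a` namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # team-a may hold at most 4 GPUs at once
```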

7.4 PriorityClass Configuration

Configure PriorityClass to preempt lower-priority workloads when GPU resources are scarce.
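A minimal sketch (name and value are illustrative); Pods reference it via `spec.priorityClassName`:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000           # higher value preempts lower-priority Pods when GPUs are scarce
globalDefault: false
description: "High priority for production GPU inference workloads"
```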

7.5 Driver Upgrade Strategy

GPU Operator supports Rolling Updates for driver upgrades. Set ENABLE_AUTO_DRAIN=true for automatic node drain before driver upgrade, and maxUnavailable: 1 to upgrade one node at a time in production.
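With the chart already installed, a version bump can be rolled out via Helm; a sketch assuming the `driver.version` value shown in the ClusterPolicy example earlier (the version string is illustrative):

```shell
# Roll the driver to a new version; the operator handles drain and rolling update
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set driver.version=560.35.03
```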


8. Troubleshooting Guide

8.1 GPU not detected (Driver Loading Failure)

Symptoms: nvidia-driver-daemonset Pod in CrashLoopBackOff or Error state

Check driver Pod logs, verify Nouveau module conflicts, check for missing kernel headers, check dmesg for GPU-related errors, and verify GPU hardware recognition.
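In practice those checks might look like this (the `app=` label selectors follow GPU Operator's defaults; verify against your install):

```shell
# Driver DaemonSet Pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100

# Nouveau conflict and kernel headers
lsmod | grep nouveau
ls /usr/src/kernels/$(uname -r) 2>/dev/null || dpkg -l | grep linux-headers-$(uname -r)

# Kernel-log and PCI-level GPU checks
dmesg | grep -i nvidia | tail -20
lspci | grep -i nvidia
```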

8.2 nvidia-smi not found in container

Check Container Toolkit Pod status, Device Plugin Pod status, node GPU resources, and CDI configuration.
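A quick pass over those checks (node name hypothetical):

```shell
# Toolkit and Device Plugin DaemonSet health
kubectl get pods -n gpu-operator -l app=nvidia-container-toolkit-daemonset
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Does the node actually advertise GPU capacity?
kubectl describe node gpu-node-1 | grep -A2 'nvidia.com/gpu'
```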

8.3 CUDA version mismatch

Verify the driver's CUDA support version is greater than or equal to the application's CUDA version.

8.4 MIG partition failure

Check MIG Manager logs, verify MIG status on the node, check for running GPU processes, and retry MIG configuration.

8.5 KubeVirt VM GPU not recognized

Verify node workload label, VFIO/vGPU Manager status, Sandbox Device Plugin status, GPU resources on the node, KubeVirt CR permittedHostDevices, and IOMMU group configuration.


9. References

Official Documentation

Additional Resources

| Resource | URL |
| --- | --- |
| GPU Operator GitHub | https://github.com/NVIDIA/gpu-operator |
| DCGM Exporter Grafana Dashboard | https://grafana.com/grafana/dashboards/12239 |