NVIDIA GPU Operator Complete Guide: Components, Installation, and KubeVirt GPU Passthrough


1. Introduction to GPU Operator

1.1 What is GPU Operator?

NVIDIA GPU Operator is a Kubernetes Operator that automates the installation, configuration, and management of all software components required to utilize GPUs in a Kubernetes cluster. Built on the Kubernetes Operator Framework, it consistently manages drivers, runtimes, device plugins, monitoring tools, and more from Day-0 provisioning through Day-2 operations.

Traditionally, using GPUs in Kubernetes required manually performing the following tasks on each node:

  1. Installing NVIDIA drivers (exact version matching the kernel version)
  2. Installing NVIDIA Container Toolkit and configuring Container Runtime
  3. Deploying NVIDIA Device Plugin
  4. Installing monitoring tools (DCGM Exporter)
  5. Deploying Node Feature Discovery

GPU Operator manages all these processes declaratively through a single Custom Resource Definition (CRD) called ClusterPolicy.

1.2 Why is GPU Operator Needed?

Manual GPU driver installation has the following serious operational issues:

| Issue | Manual Management | GPU Operator |
| --- | --- | --- |
| Driver Installation | SSH into each node for manual install | Auto-deploy via DaemonSet |
| Kernel Update Response | Driver recompile needed on kernel update | Auto-match pre-compiled drivers |
| Version Consistency | Risk of driver version mismatch per node | Unified version via ClusterPolicy |
| New Node Addition | Provisioning script maintenance required | Label-based auto-detect and install |
| Monitoring | Separate setup required | Auto-deploy DCGM Exporter |
| GPU Sharing | Manual MIG/Time-Slicing setup | Declarative ConfigMap-based setup |
| Upgrades | Manual update after node drain | Rolling Update support |

1.3 Supported Environments

Supported OS:

| OS | Version |
| --- | --- |
| Ubuntu | 20.04, 22.04, 24.04 LTS |
| Red Hat Enterprise Linux | 8.x, 9.x |
| CentOS Stream | 8, 9 |
| Rocky Linux | 8.x, 9.x |
| SUSE Linux Enterprise | 15 SP4+ |

Supported GPUs:

| GPU Generation | Example Models | MIG Support |
| --- | --- | --- |
| Turing | T4, RTX 2080 | X |
| Ampere | A100, A30, A10, A2 | O (A100, A30) |
| Hopper | H100, H200 | O |
| Ada Lovelace | L40, L40S, L4 | X |
| Blackwell | B100, B200, GB200 | O |

Supported Kubernetes Version: v1.25+ (recommended: v1.28+)

Supported Container Runtime: containerd, CRI-O

1.4 Architecture Overview

┌──────────────────────────────────────────────────────────┐
│                 Kubernetes Control Plane                 │
│  ┌────────────────────────────────────────────────────┐  │
│  │             GPU Operator Controller                │  │
│  │       (Deployment in gpu-operator namespace)       │  │
│  │                                                    │  │
│  │  ┌───────────────┐    ┌─────────────────────────┐  │  │
│  │  │ ClusterPolicy │───▶│ Reconciliation Loop     │  │  │
│  │  │     (CRD)     │    │  - Watch GPU nodes      │  │  │
│  │  └───────────────┘    │  - Deploy DaemonSets    │  │  │
│  │                       │  - Manage lifecycle     │  │  │
│  │                       └─────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
  │ GPU Worker #1  │  │ GPU Worker #2  │  │ GPU Worker #3  │
  │                │  │                │  │                │
  │ NFD            │  │ NFD            │  │ NFD            │
  │ GFD            │  │ GFD            │  │ GFD            │
  │ NVIDIA Driver  │  │ NVIDIA Driver  │  │ NVIDIA Driver  │
  │ Container      │  │ Container      │  │ Container      │
  │   Toolkit      │  │   Toolkit      │  │   Toolkit      │
  │ Device Plugin  │  │ Device Plugin  │  │ Device Plugin  │
  │ DCGM Exporter  │  │ DCGM Exporter  │  │ DCGM Exporter  │
  │ MIG Manager    │  │ MIG Manager    │  │ MIG Manager    │
  │                │  │                │  │                │
  │ ╔════════════╗ │  │ ╔════════════╗ │  │ ╔════════════╗ │
  │ ║ GPU (A100) ║ │  │ ║ GPU (H100) ║ │  │ ║ GPU (T4)   ║ │
  │ ╚════════════╝ │  │ ╚════════════╝ │  │ ╚════════════╝ │
  └────────────────┘  └────────────────┘  └────────────────┘

(NFD = Node Feature Discovery, GFD = GPU Feature Discovery; every component on a worker node runs as a Pod of its DaemonSet)

The GPU Operator Controller watches the ClusterPolicy CRD, automatically detects worker nodes with GPUs, and deploys all necessary components as DaemonSets. A node is recognized as having an NVIDIA GPU when it has the label feature.node.kubernetes.io/pci-10de.present=true.
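As a quick sanity check, that NFD label can be queried directly to list the nodes GPU Operator will target (a hypothetical one-liner; your node names will differ):

```shell
# List nodes that NFD has labeled as carrying an NVIDIA PCI device (vendor ID 10de)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
```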


2. Detailed Analysis of the 7 Core Components

GPU Operator consists of 7 core components that together form the complete software stack required for GPU workloads. Each component is deployed as a DaemonSet, and the components depend on one another in a fixed order (see the dependency chain in section 2.7).

2.1 NVIDIA Driver (nvidia-driver-daemonset)

The NVIDIA Driver DaemonSet automatically installs the kernel module required to communicate with GPU hardware on each GPU node. This is the most fundamental and essential component of GPU Operator.

Key Features:

  • Auto-build and load NVIDIA kernel modules (nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko)
  • Auto-match drivers to host kernel version
  • Support for both pre-compiled and run-compiled drivers
  • Driver version management and upgrades

Pre-compiled vs Run-compiled:

| Property | Pre-compiled | Run-compiled |
| --- | --- | --- |
| Build Time | None (already built) | 10-20 min on node |
| Kernel Compatibility | Matched to a specific kernel version | Supports all kernel versions |
| Node Boot Time | Fast | Slow |
| Supported OS | Ubuntu (primarily) | Most Linux distributions |
| Setting | kernelModuleType: precompiled | kernelModuleType: runcompiled |

Driver Configuration Example in ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    repository: nvcr.io/nvidia
    image: driver
    version: '560.35.03'
    kernelModuleType: auto # auto | precompiled | runcompiled
    manager:
      env:
        - name: ENABLE_AUTO_DRAIN
          value: 'true'
    rdma:
      enabled: false
    licensingConfig:
      nlsEnabled: false
      configMapName: ''

Note: If NVIDIA drivers are already installed on the host, set driver.enabled=false. Conflicting drivers will prevent the GPU from functioning properly.

2.2 NVIDIA Container Toolkit (nvidia-container-toolkit)

NVIDIA Container Toolkit acts as a bridge enabling Container Runtimes (containerd, CRI-O) to access GPUs from within containers.

Core Operation Principle:

┌─────────────────────────────────────────────────────┐
│              Container Creation Flow                │
│                                                     │
│   kubelet                                           │
│     │                                               │
│     ▼                                               │
│   containerd / CRI-O                                │
│     │                                               │
│     ▼                                               │
│   ┌─────────────────────────────────────┐           │
│   │   NVIDIA Container Runtime Hook     │           │
│   │   (nvidia-container-runtime-hook)   │           │
│   │                                     │           │
│   │   1. Modify OCI spec                │           │
│   │   2. Mount GPU devices              │           │
│   │   3. Inject NVIDIA libraries        │           │
│   │   4. Set environment variables      │           │
│   └─────────────────────────────────────┘           │
│     │                                               │
│     ▼                                               │
│   ┌─────────────────────────────────────┐           │
│   │   libnvidia-container               │           │
│   │   - GPU device node binding         │           │
│   │   - CUDA library mount              │           │
│   │   - Driver library mount            │           │
│   └─────────────────────────────────────┘           │
│     │                                               │
│     ▼                                               │
│   Container (with GPU access)                       │
│   /dev/nvidia0, /dev/nvidiactl, ...                 │
└─────────────────────────────────────────────────────┘

CDI (Container Device Interface) Support:

Starting from GPU Operator v25.x, CDI is enabled by default, allowing the container runtime to recognize GPU devices in a standardized manner.

spec:
  toolkit:
    enabled: true
    version: v1.17.3-ubuntu22.04
  cdi:
    enabled: true # Enable CDI (default: true)
    default: true # Use CDI as default device allocation method
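To inspect CDI on a node, the NVIDIA Container Toolkit ships the `nvidia-ctk` CLI. A sketch, run on the GPU host and assuming the toolkit is installed there:

```shell
# Generate a CDI spec for the GPUs on this host, then list the resulting device names
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list
```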

2.3 NVIDIA Device Plugin (nvidia-device-plugin)

NVIDIA Device Plugin registers GPUs as Kubernetes Extended Resources with kubelet, allowing Pods to request nvidia.com/gpu resources.

Operation Flow:

  1. Device Plugin registers with kubelet's Device Plugin gRPC interface
  2. Reports the number of GPUs on the node to kubelet
  3. Kubernetes Scheduler schedules Pods requesting nvidia.com/gpu to the appropriate node
  4. Device Plugin passes GPU device info to kubelet when a Pod runs

GPU Request Example in Pod:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      command: ['nvidia-smi']
      resources:
        limits:
          nvidia.com/gpu: 1 # Request 1 GPU

Resource Types Managed by Device Plugin:

| Resource Name | Description | Use Case |
| --- | --- | --- |
| nvidia.com/gpu | Full GPU (or Time-Slicing shared) | Default GPU workloads |
| nvidia.com/mig-1g.5gb | MIG 1g.5gb instance | A100 MIG partition |
| nvidia.com/mig-2g.10gb | MIG 2g.10gb instance | A100 MIG partition |
| nvidia.com/mig-3g.20gb | MIG 3g.20gb instance | A100 MIG partition |
| nvidia.com/mig-7g.40gb | MIG 7g.40gb full instance | A100 full MIG |
| nvidia.com/gpu.shared | Time-Slicing shared GPU | When renameByDefault is enabled |

2.4 NVIDIA DCGM & DCGM Exporter

DCGM (Data Center GPU Manager) is a tool for monitoring NVIDIA GPU health and performance, while DCGM Exporter exposes these metrics in Prometheus format.

Key Collected Metrics:

| Metric Name | Description | Unit |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization | % |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization | % |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C |
| DCGM_FI_DEV_POWER_USAGE | Power consumption | W |
| DCGM_FI_DEV_FB_FREE | Free framebuffer memory | MiB |
| DCGM_FI_DEV_FB_USED | Used framebuffer memory | MiB |
| DCGM_FI_DEV_XID_ERRORS | XID error count | count |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Single-bit ECC errors (volatile) | count |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Double-bit ECC errors (volatile) | count |
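Since DCGM Exporter serves these metrics over HTTP on port 9400, a quick spot check is possible by curling the endpoint (from the exporter Pod's node or via `kubectl port-forward`; the localhost address below assumes the port is reachable there):

```shell
# Scrape the exporter endpoint and filter for GPU utilization samples
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```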

2.5 NVIDIA MIG Manager

MIG (Multi-Instance GPU) Manager manages the partitioning of a single physical GPU into multiple independent GPU instances on NVIDIA Ampere and newer architectures (A100, A30, H100, etc.).

Unlike Time-Slicing, MIG provides hardware-level memory isolation and compute isolation. Each MIG instance has its own memory, cache, and SM (Streaming Multiprocessor).

A100 40GB MIG Profiles:

| Profile | GPU Slice | Memory | Max Instances |
| --- | --- | --- | --- |
| 1g.5gb | 1/7 | 5 GB | 7 |
| 2g.10gb | 2/7 | 10 GB | 3 |
| 3g.20gb | 3/7 | 20 GB | 2 |
| 7g.40gb | 7/7 | 40 GB | 1 |
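With GPU Operator, a MIG layout is typically selected by labeling the node; MIG Manager then reconfigures the GPUs to match. A sketch using the built-in `all-1g.5gb` profile (the node name is hypothetical):

```shell
# Ask MIG Manager to carve every GPU on the node into 1g.5gb instances
kubectl label node gpu-node-1 nvidia.com/mig.config=all-1g.5gb --overwrite

# Watch the reconfiguration state reported back as a node label
kubectl get node gpu-node-1 --show-labels | tr ',' '\n' | grep mig.config
```

Once the state reaches success, the node advertises `nvidia.com/mig-1g.5gb` resources instead of `nvidia.com/gpu`.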

2.6 Node Feature Discovery (NFD)

Node Feature Discovery (NFD) automatically detects hardware characteristics of Kubernetes nodes and registers them as node labels. As a dependency of GPU Operator, it plays a key role in identifying GPU nodes.

GPU Operator uses the feature.node.kubernetes.io/pci-10de.present=true label to automatically identify nodes with NVIDIA GPUs and deploy GPU-related DaemonSets only to those nodes.

Note: If NFD is already installed in the cluster, set nfd.enabled=false during GPU Operator installation to prevent duplicate deployment.

2.7 GPU Feature Discovery (GFD)

GPU Feature Discovery (GFD) registers GPU-specific detailed information as node labels, similar to NFD but focused on GPU details. GFD communicates with the NVIDIA Driver to collect GPU model, driver version, CUDA version, and other information.
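The labels GFD applies can be inspected directly on a node; typical keys include `nvidia.com/gpu.product`, `nvidia.com/gpu.count`, `nvidia.com/gpu.memory`, and `nvidia.com/cuda.driver.major` (node name hypothetical):

```shell
# Dump the nvidia.com/* labels GFD has attached to a GPU node
kubectl get node gpu-node-1 --show-labels | tr ',' '\n' | grep nvidia.com
```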

Component Deployment Order (Dependency Chain):

NFD (Node Feature Discovery)
 └──▶ GPU Feature Discovery (GFD)
       └──▶ NVIDIA Driver
             └──▶ NVIDIA Container Toolkit
                   └──▶ NVIDIA Device Plugin
                         └──▶ DCGM / DCGM Exporter
                               └──▶ MIG Manager (MIG-capable GPUs only)
                                     └──▶ Validator (Verification)

Component Role Summary Table:

| Component | DaemonSet Name | Role | Namespace |
| --- | --- | --- | --- |
| NFD | node-feature-discovery-worker | Node hardware labeling | gpu-operator |
| GFD | gpu-feature-discovery | GPU detail labeling | gpu-operator |
| Driver | nvidia-driver-daemonset | GPU kernel module install | gpu-operator |
| Toolkit | nvidia-container-toolkit-daemonset | Container Runtime GPU integration | gpu-operator |
| Device Plugin | nvidia-device-plugin-daemonset | GPU resource kubelet registration | gpu-operator |
| DCGM | nvidia-dcgm | GPU metrics collection daemon | gpu-operator |
| DCGM Exporter | nvidia-dcgm-exporter | Prometheus metrics exposure | gpu-operator |
| MIG Manager | nvidia-mig-manager | MIG partition management | gpu-operator |
| Validator | nvidia-operator-validator | Full stack verification | gpu-operator |

3. Installation Guide (Helm-based)

3.1 Prerequisites

3.1.1 Hardware Requirements

  • Worker nodes with NVIDIA GPUs (Turing architecture or newer recommended)
  • PCIe 3.0+ slots
  • IOMMU support (when using KubeVirt GPU Passthrough)

3.1.2 Software Requirements

# Verify kubectl and helm installation
kubectl version --client
helm version

# Verify Kubernetes cluster access
kubectl get nodes

# Check Container Runtime (containerd or CRI-O)
kubectl get nodes -o wide

3.1.3 Disable Nouveau Driver

The GPU Operator Driver DaemonSet requires that the open-source Nouveau driver be disabled on the host before the NVIDIA kernel module can load.

# Check if nouveau module is loaded
lsmod | grep nouveau

# Disable nouveau (if needed)
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Regenerate initramfs (Ubuntu/Debian; on RHEL-family distros use `sudo dracut --force`)
sudo update-initramfs -u

# Reboot
sudo reboot

3.2 Helm Repo Addition and Basic Installation

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Check available versions
helm search repo nvidia/gpu-operator --versions

# Basic installation (latest version)
helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --wait

3.3 Verification

# nvidia-smi execution test
kubectl run nvidia-smi \
  --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 \
  --overrides='{"apiVersion":"v1","spec":{"restartPolicy":"Never","containers":[{"name":"nvidia-smi","image":"nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

# Wait for the Pod to finish, then check the output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/nvidia-smi --timeout=120s
kubectl logs nvidia-smi

# Cleanup
kubectl delete pod nvidia-smi

4. GPU Sharing Strategies

4.1 Time-Slicing

Time-Slicing shares a single physical GPU among multiple containers by time-based division. Supported on all NVIDIA GPUs.

Characteristics:

  • No hardware-level isolation (no memory isolation)
  • Risk of GPU memory OOM
  • All NVIDIA GPUs supported (Turing+)
  • Simple and flexible configuration
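Time-Slicing is configured through a ConfigMap referenced from ClusterPolicy. A minimal sketch that advertises each physical GPU as 4 schedulable `nvidia.com/gpu` resources (the ConfigMap name, `any` key, and replica count are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

ClusterPolicy then points the Device Plugin at it, e.g. via `devicePlugin.config.name=time-slicing-config` and `devicePlugin.config.default=any`.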

4.2 MIG (Multi-Instance GPU)

MIG is a hardware-level GPU partitioning technology supported on NVIDIA Ampere and newer architectures (A100, A30, H100, H200).

4.3 vGPU (Virtual GPU)

NVIDIA vGPU virtualizes GPUs for sharing across multiple VMs. Primarily used with KubeVirt VMs and requires a separate NVIDIA license.

4.4 GPU Sharing Strategy Comparison

| Property | Time-Slicing | MIG | vGPU |
| --- | --- | --- | --- |
| Isolation Level | None (time-sharing only) | Hardware (memory+compute) | Hardware (virtualization) |
| Memory Isolation | X | O | O |
| Fault Isolation | X | O | O |
| Supported GPUs | All NVIDIA GPUs | A100, A30, H100, H200 | vGPU-capable GPUs |
| Max Partitions | Unlimited by config | Varies by GPU (max 7) | Varies by GPU |
| Performance Overhead | Context switching | Minimal | Virtualization overhead |
| License | Free | Free | Paid (NVIDIA AI Enterprise) |
| Use Cases | Dev/test, light inference | Production inference, multi-tenant | VM-based workloads |

5. KubeVirt and GPU Operator Integration

5.1 KubeVirt Introduction

KubeVirt is a project that enables managing virtual machines (VMs) on Kubernetes. By integrating KubeVirt with GPU Operator, you can assign physical GPUs directly to VMs (Passthrough) or allocate virtual GPUs (vGPU).

Important: A single GPU worker node can only run one type of GPU workload. You cannot mix container, vm-passthrough, and vm-vgpu on the same node.

5.2 GPU Passthrough (PCI Passthrough)

GPU Passthrough assigns the entire physical GPU directly to a KubeVirt VM. The VM can use the GPU at native performance, but GPU sharing is not possible.

Key Steps:

  1. Enable IOMMU/VT-d in BIOS
  2. Add kernel parameters (intel_iommu=on iommu=pt)
  3. Load VFIO driver module
  4. Install GPU Operator with sandboxWorkloads.enabled=true
  5. Label GPU node with nvidia.com/gpu.workload.config=vm-passthrough
  6. Register GPU device in KubeVirt CR
  7. Assign GPU in VirtualMachine spec
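The last two steps can be sketched as follows. The PCI vendor:device ID and resource name below are illustrative (look yours up with `lspci -nn` and the resources advertised on the node), and both manifests are abridged:

```yaml
# KubeVirt CR: permit the host GPU as an assignable device
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10de:2235"          # hypothetical example: NVIDIA A40
          resourceName: "nvidia.com/GA102GL_A40"  # name advertised by the sandbox device plugin
          externalResourceProvider: true
---
# VirtualMachine: attach the permitted GPU to the guest
# (abridged; runStrategy, disks, networks, etc. omitted)
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - deviceName: nvidia.com/GA102GL_A40
              name: gpu1
```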

5.3 vGPU with KubeVirt

vGPU splits a single physical GPU into multiple virtual GPUs that can be shared across multiple VMs. Unlike GPU Passthrough, a single GPU can be used by multiple VMs simultaneously.

Prerequisites:

  • NVIDIA AI Enterprise license or vGPU license
  • GPU supporting vGPU (A100, A10, A30, A16, L40, L4, T4, etc.)

5.4 Live Migration with GPU Constraints

| Constraint | GPU Passthrough | vGPU |
| --- | --- | --- |
| Live Migration | Not supported | Limited support (NVIDIA vGPU Migration) |
| Reason | PCIe direct assignment cannot be migrated | vGPU state transfer needed |
| Alternative | Cold Migration (stop VM, then migrate) | NVIDIA vGPU Migration license required |

6. Monitoring and Observability

6.1 DCGM Exporter -> Prometheus -> Grafana Pipeline

GPU Operator's DCGM Exporter collects metrics from each GPU node and exposes them in Prometheus format on port 9400. Prometheus scrapes these metrics, and Grafana visualizes them using Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).

6.2 Alert Rule Examples

Key alerts include: GPU temperature warning (above 85°C), GPU temperature critical (above 90°C), GPU memory nearly full (above 95% usage), double-bit ECC errors, sustained high GPU utilization (above 95% for 30+ minutes), and XID errors.
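Two of these alerts might be expressed as a PrometheusRule, a sketch assuming the Prometheus Operator is installed and scraping DCGM Exporter (rule names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU temperature above 85°C on {{ $labels.Hostname }}"
        - alert: GpuDoubleBitEccErrors
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Double-bit ECC errors detected on {{ $labels.Hostname }}"
```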


7. Production Deployment Best Practices

7.1 Node Labeling Strategy

In production, GPU nodes and CPU nodes should be clearly distinguished.
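One common convention is a dedicated role label that nodeSelector and affinity rules can key on (label keys and node names below are illustrative, not a GPU Operator requirement):

```shell
# Mark GPU and CPU nodes with explicit role labels
kubectl label node gpu-node-1 node-role.kubernetes.io/gpu=true
kubectl label node cpu-node-1 node-role.kubernetes.io/worker=true
```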

7.2 Taints & Tolerations

Set Taints on GPU nodes so only GPU workloads are scheduled.
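A sketch of the taint plus the matching toleration (taint key and node name are illustrative):

```yaml
# First taint the GPU node so ordinary Pods are kept off it:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Then give GPU workloads a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```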

7.3 ResourceQuota for GPU Allocation Limits

Limit GPU usage per namespace.
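Extended resources are limited per namespace with a `requests.`-prefixed quota entry; a minimal sketch for a hypothetical `team-a` namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # team-a may hold at most 4 GPUs at once
```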

7.4 PriorityClass Configuration

Configure PriorityClass to preempt lower-priority workloads when GPU resources are scarce.
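A minimal sketch (name and value are illustrative); Pods reference it via `spec.priorityClassName`:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000           # higher value preempts lower-priority Pods when GPUs are scarce
globalDefault: false
description: "High priority for production GPU inference workloads"
```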

7.5 Driver Upgrade Strategy

GPU Operator supports Rolling Updates for driver upgrades. Set ENABLE_AUTO_DRAIN=true for automatic node drain before driver upgrade, and maxUnavailable: 1 to upgrade one node at a time in production.
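With the chart already installed, a version bump can be rolled out via Helm; a sketch assuming the `driver.version` value shown in the ClusterPolicy example earlier (the version string is illustrative):

```shell
# Roll the driver to a new version; the operator handles drain and rolling update
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set driver.version=560.35.03
```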


8. Troubleshooting Guide

8.1 GPU not detected (Driver Loading Failure)

Symptoms: nvidia-driver-daemonset Pod in CrashLoopBackOff or Error state

Check driver Pod logs, verify Nouveau module conflicts, check for missing kernel headers, check dmesg for GPU-related errors, and verify GPU hardware recognition.
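In practice those checks might look like this (the `app=` label selectors follow GPU Operator's defaults; verify against your install):

```shell
# Driver DaemonSet Pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100

# Nouveau conflict and kernel headers
lsmod | grep nouveau
ls /usr/src/kernels/$(uname -r) 2>/dev/null || dpkg -l | grep linux-headers-$(uname -r)

# Kernel-log and PCI-level GPU checks
dmesg | grep -i nvidia | tail -20
lspci | grep -i nvidia
```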

8.2 nvidia-smi not found in container

Check Container Toolkit Pod status, Device Plugin Pod status, node GPU resources, and CDI configuration.
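A quick pass over those checks (node name hypothetical):

```shell
# Toolkit and Device Plugin DaemonSet health
kubectl get pods -n gpu-operator -l app=nvidia-container-toolkit-daemonset
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Does the node actually advertise GPU capacity?
kubectl describe node gpu-node-1 | grep -A2 'nvidia.com/gpu'
```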

8.3 CUDA version mismatch

Verify the driver's CUDA support version is greater than or equal to the application's CUDA version.

8.4 MIG partition failure

Check MIG Manager logs, verify MIG status on the node, check for running GPU processes, and retry MIG configuration.

8.5 KubeVirt VM GPU not recognized

Verify node workload label, VFIO/vGPU Manager status, Sandbox Device Plugin status, GPU resources on the node, KubeVirt CR permittedHostDevices, and IOMMU group configuration.


9. References

Official Documentation

Additional Resources

| Resource | URL |
| --- | --- |
| GPU Operator GitHub | https://github.com/NVIDIA/gpu-operator |
| DCGM Exporter Grafana Dashboard | https://grafana.com/grafana/dashboards/12239 |