Author: Youngju Kim (@fjvbn20031)

Contents
- Introduction
- GPU Operator and KubeVirt Integration
- GPU Passthrough Workflow
- vGPU Workflow
- KubeVirt CR Configuration
- Attaching GPUs in VMI Spec
- Guest OS Driver Installation
- Full Architecture Diagram
- Performance Considerations
- Use Cases
- Troubleshooting
- Conclusion
Introduction
With KubeVirt running VMs on Kubernetes and GPU Operator automating GPU management, we can now combine both technologies to leverage GPU acceleration inside VMs.
GPU Operator and KubeVirt Integration
GPU Operator supports two modes for providing GPUs to KubeVirt VMs.
Node Labeling: gpu.workload.config
The nvidia.com/gpu.workload.config label value determines the GPU usage mode.
| Label Value | Description | Target |
|---|---|---|
| container | Default. Standard container GPU | Pod |
| vm-passthrough | Full GPU passthrough to VM | KubeVirt VM |
| vm-vgpu | vGPU instances for VM | KubeVirt VM |
# Set node to VM passthrough mode
kubectl label node worker-gpu-01 \
nvidia.com/gpu.workload.config=vm-passthrough --overwrite
# Set node to vGPU mode
kubectl label node worker-gpu-01 \
nvidia.com/gpu.workload.config=vm-vgpu --overwrite
# Restore to default container mode
kubectl label node worker-gpu-01 \
nvidia.com/gpu.workload.config=container --overwrite
Mode Change Workflow
Node label change
|
v
GPU Operator detects change
|
v
Clean up existing GPU software stack
|
v
Deploy components for new mode
|
+-- container: Driver + Toolkit + Device Plugin
|
+-- vm-passthrough: VFIO Manager + Sandbox Device Plugin
|
+-- vm-vgpu: vGPU Manager + Sandbox Device Plugin
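The label-to-component mapping above can be summarized as a small helper function (an illustrative sketch that mirrors the workflow diagram, not an official GPU Operator CLI):

```shell
# Map a gpu.workload.config value to the components the operator deploys
# (sketch mirroring the workflow above; component names follow the diagram)
components_for() {
  case "$1" in
    container)      echo "driver container-toolkit device-plugin" ;;
    vm-passthrough) echo "vfio-manager sandbox-device-plugin" ;;
    vm-vgpu)        echo "vgpu-manager sandbox-device-plugin" ;;
    *)              echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
components_for vm-passthrough
```

The key point the mapping captures: both VM modes replace the container stack with the Sandbox Device Plugin, and only the host-side manager component differs.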
GPU Passthrough Workflow
GPU passthrough assigns an entire physical GPU directly to a VM, providing near-native performance.
Step 1: Enable IOMMU
IOMMU must be enabled in both the BIOS/UEFI firmware and the kernel.
# For Intel CPUs, add kernel parameter
# GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# For AMD CPUs
# GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"
# Update GRUB and reboot
sudo update-grub
sudo reboot
# Verify IOMMU is enabled
dmesg | grep -i iommu
# Output: DMAR: IOMMU enabled
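Beyond the dmesg check, you can confirm the IOMMU is active by listing the IOMMU groups that sysfs exposes; passthrough requires the GPU to sit in its own group. A minimal sketch (prints a notice when IOMMU is unavailable, e.g. inside a container):

```shell
# List each IOMMU group and the PCI devices it contains (sketch)
list_iommu_groups() {
  if ls /sys/kernel/iommu_groups/*/devices/* >/dev/null 2>&1; then
    for dev in /sys/kernel/iommu_groups/*/devices/*; do
      group=${dev#/sys/kernel/iommu_groups/}
      echo "group ${group%%/*}: $(basename "$dev")"
    done
  else
    echo "no IOMMU groups found (IOMMU disabled or unsupported)"
  fi
}
list_iommu_groups
```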
Step 2: VFIO Manager DaemonSet
GPU Operator automatically deploys the VFIO Manager.
+------------------------------------------+
| Worker Node |
| +------------------------------------+ |
| | VFIO Manager (DaemonSet) | |
| | | |
| | 1. Load vfio-pci kernel module | |
| | 2. Unbind GPU from host driver | |
| | 3. Bind GPU to vfio-pci | |
| | | |
| +------------------------------------+ |
| |
| GPU state: |
| nvidia driver --> vfio-pci driver |
+------------------------------------------+
The VFIO Manager performs the following operations.
- Loads the vfio-pci kernel module
- Unbinds GPU devices from the NVIDIA driver
- Binds GPUs to the vfio-pci driver
- Enables direct VM access to the GPU
Step 3: Sandbox Device Plugin
The Sandbox Device Plugin discovers passthrough-capable GPUs and registers them with kubelet.
+------------------------------------------+
| Sandbox Device Plugin (DaemonSet) |
| |
| 1. Discover GPU devices bound to VFIO |
| 2. Register as nvidia.com/gpu resource |
| 3. KubeVirt VMs can request GPUs |
+------------------------------------------+
vGPU Workflow
vGPU splits a single physical GPU across multiple VMs.
vGPU Manager Driver Deployment
+------------------------------------------+
| Worker Node |
| +------------------------------------+ |
| | vGPU Manager (DaemonSet) | |
| | | |
| | - Install vGPU host driver | |
| | - Create Mediated Devices or | |
| | SR-IOV VFs | |
| +------------------------------------+ |
| |
| +------------------------------------+ |
| | Sandbox Device Plugin | |
| | - Discover vGPU devices | |
| | - Register as nvidia.com/VGPU_TYPE| |
| +------------------------------------+ |
+------------------------------------------+
vGPU Implementation by GPU Architecture
| GPU Architecture | vGPU Implementation |
|---|---|
| Pre-Ampere (V100, etc.) | Mediated Devices (MDEV) |
| Ampere (A100) | MIG-backed vGPU or MDEV |
| Hopper and later (H100) | SR-IOV Virtual Functions |
vGPU Type Examples
Available vGPU profiles on A100 are as follows.
| vGPU Type | Framebuffer | Max Instances |
|---|---|---|
| A100-1-5C | 5GB | 7 |
| A100-2-10C | 10GB | 3 |
| A100-3-20C | 20GB | 2 |
| A100-4-40C | 40GB | 1 |
| A100-1-5CME | 5GB (MIG) | 7 |
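A quick sanity check on the table: a profile's framebuffer times its max instance count must fit within the card's total memory (assumed here to be the 40 GB A100 variant; a sketch, not an NVIDIA tool):

```shell
# Verify framebuffer_GB x max_instances <= total GPU memory (assumed 40 GB)
check_profile() { # usage: check_profile <fb_gb> <max_instances>
  if [ $(( $1 * $2 )) -le 40 ]; then echo "ok"; else echo "over budget"; fi
}
check_profile 5 7    # A100-1-5C
check_profile 20 2   # A100-3-20C
```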
KubeVirt CR Configuration
permittedDevices Setup
GPU devices must be explicitly permitted in the KubeVirt CR before VMs can request them.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: '10DE:20B5'
          resourceName: 'nvidia.com/gpu'
          externalResourceProvider: true
      mediatedDevices:
        - mdevNameSelector: 'NVIDIA A100-1-5C'
          resourceName: 'nvidia.com/NVIDIA_A100-1-5C'
The pciVendorSelector is the PCI vendor/device ID of the GPU. You can check it with the following command.
# Check GPU PCI ID
lspci -nn | grep NVIDIA
# Output: 3b:00.0 3D controller [0302]: NVIDIA Corporation A100 [10DE:20B5]
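The vendor:device pair is the last bracketed field of the lspci line. Extracting it can be sketched as follows (the sample line is illustrative; lspci itself prints the IDs in lowercase, while the KubeVirt CR example above uses uppercase):

```shell
# Pull the [vendor:device] pair out of an `lspci -nn` line and uppercase it
# for use as a pciVendorSelector (sketch; sample input stands in for lspci)
extract_selector() {
  sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p' | tr 'a-f' 'A-F'
}
echo "3b:00.0 3D controller [0302]: NVIDIA Corporation GA100 [10de:20b5]" | extract_selector
```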
Attaching GPUs in VMI Spec
GPU Passthrough VMI
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-passthrough-vm
spec:
  running: true
  template:
    metadata:
      labels:
        app: gpu-vm
    spec:
      domain:
        cpu:
          cores: 8
        memory:
          guest: 32Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          gpus:
            - name: gpu1
              deviceName: nvidia.com/gpu
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          dataVolume:
            name: ubuntu-gpu-dv
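Multiple passthrough GPUs can be attached to one VM by adding entries under gpus, provided the node has that many full GPUs allocatable (a config sketch; the second device name is an assumption):

```yaml
devices:
  gpus:
    - name: gpu1
      deviceName: nvidia.com/gpu
    - name: gpu2
      deviceName: nvidia.com/gpu
```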
vGPU VMI
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4
        memory:
          guest: 16Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          gpus:
            - name: vgpu1
              deviceName: nvidia.com/NVIDIA_A100-1-5C
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          dataVolume:
            name: ubuntu-vgpu-dv
Guest OS Driver Installation
GPU Operator does NOT install drivers inside VMs. You must install them directly in the guest OS.
Linux Guest
# Connect to VM console
virtctl console gpu-passthrough-vm
# Install NVIDIA driver (Ubuntu)
sudo apt-get update
sudo apt-get install -y linux-headers-$(uname -r)
# Install driver (on Ubuntu the driver packages are available from the
# standard repositories; the container toolkit repository is not needed
# inside the guest)
sudo apt-get install -y nvidia-driver-550
# Reboot and verify
sudo reboot
nvidia-smi
Windows Guest
1. Download driver from NVIDIA official site
- Passthrough: standard GPU driver
- vGPU: NVIDIA vGPU guest driver
2. Connect to Windows VM via RDP or VNC
3. Run driver installer
4. Reboot and verify GPU in Device Manager
Full Architecture Diagram
+------------------------------------------------------------------+
| Kubernetes Cluster |
| |
| +---------------------------+ +------------------------------+ |
| | Container Node | | GPU Node (vm-passthrough) | |
| | gpu.workload.config: | | gpu.workload.config: | |
| | container | | vm-passthrough | |
| | | | | |
| | +-------+ +-------+ | | +-------------------------+ | |
| | | Pod | | Pod | | | | VFIO Manager | | |
| | | GPU:1 | | GPU:1 | | | | (vfio-pci bind) | | |
| | +-------+ +-------+ | | +-------------------------+ | |
| | | | | |
| | NVIDIA Driver | | +-------------------------+ | |
| | Container Toolkit | | | Sandbox Device Plugin | | |
| | Device Plugin | | +-------------------------+ | |
| +---------------------------+ | | |
| | +--------+ +--------+ | |
| | | VM 1 | | VM 2 | | |
| | | GPU:1 | | GPU:1 | | |
| | +--------+ +--------+ | |
| +------------------------------+ |
+------------------------------------------------------------------+
Performance Considerations
Passthrough vs vGPU Performance Comparison
| Factor | Passthrough | vGPU |
|---|---|---|
| GPU Utilization | 100% (dedicated) | Shared (by profile) |
| Performance Overhead | 1-3% | 5-10% |
| GPU Memory | Full access | Limited by profile |
| VM Density | 1 GPU = 1 VM | 1 GPU = multiple VMs |
| Live Migration | Not supported | Supported (hypervisor-dependent) |
| License | No additional | vGPU license required |
Benchmark Reference Values
The following are indicative figures for an A100 GPU; actual numbers vary with workload and configuration.
| Workload | Bare Metal | Passthrough | vGPU (Full Profile) |
|---|---|---|---|
| ResNet-50 Training | 100% | 97-99% | 90-95% |
| BERT Inference | 100% | 98-99% | 92-96% |
| CUDA Benchmark | 100% | 97-99% | 91-95% |
Use Cases
1. VDI (Virtual Desktop Infrastructure) on K8s
+--------------------------------------------------+
| Kubernetes Cluster |
| |
| +--------+ +--------+ +--------+ +--------+ |
| | Win VM | | Win VM | | Win VM | | Win VM | |
| | vGPU | | vGPU | | vGPU | | vGPU | |
| +--------+ +--------+ +--------+ +--------+ |
| | | | | |
| +----------+----------+----------+ |
| | |
| +------+------+ |
| | Physical GPU| |
| | (A100/L40) | |
| +-------------+ |
+--------------------------------------------------+
2. Legacy Windows ML Workloads
Windows-based ML applications that cannot be containerized can instead run in KubeVirt VMs with GPU acceleration.
3. Multi-tenant GPU Isolation
Use vGPU to isolate GPU memory and compute resources between tenants at the hardware level.
4. VMware Migration
Migrate GPU VMs from existing VMware environments to KubeVirt to reduce licensing costs.
Troubleshooting
Common Issues and Solutions
| Problem | Cause | Solution |
|---|---|---|
| GPU not visible in VM | permittedDevices not set | Add device to KubeVirt CR |
| VFIO bind failure | IOMMU not enabled | Check BIOS/kernel IOMMU settings |
| vGPU creation failure | vGPU Manager not installed | Check GPU Operator vGPU config |
| nvidia-smi fails in guest | Guest driver not installed | Install driver inside VM |
# Check GPU device status
kubectl get node worker-gpu-01 -o json | jq '.status.allocatable'
# Check VFIO bind status
kubectl logs -n gpu-operator -l app=nvidia-vfio-manager
# Check Sandbox Device Plugin logs
kubectl logs -n gpu-operator -l app=nvidia-sandbox-device-plugin
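When reading the allocatable output, the thing to look for is the resource name the Sandbox Device Plugin registered. The parsing can be sketched against a sample payload (illustrative JSON standing in for the `kubectl get node ... -o json` output; sed-based to avoid a jq dependency):

```shell
# Extract the allocatable GPU count from a node's allocatable block
# (the sample JSON stands in for real kubectl output)
allocatable='{"cpu":"64","memory":"256Gi","nvidia.com/gpu":"2"}'
gpus=$(printf '%s' "$allocatable" | sed -n 's/.*"nvidia\.com\/gpu":"\([0-9]*\)".*/\1/p')
echo "allocatable GPUs: ${gpus:-0}"
```

A count of 0 (or a missing key) usually means the device plugin has not registered the resource, which points back to the VFIO bind or label-mode issues in the table above.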
Conclusion
Combining KubeVirt with GPU Operator provides a powerful way to consolidate VM-based GPU workloads on Kubernetes. Passthrough is ideal when maximum performance is required, while vGPU is suited for GPU sharing and efficiency.
In the next post, we will comprehensively compare QEMU, VirtualBox, VMware, and KubeVirt to analyze each platform's strengths and weaknesses.
Quiz: KubeVirt GPU Knowledge Check
Q1. What node label value enables GPU passthrough mode?
A) container B) vm-passthrough C) gpu-direct D) vfio-passthrough
Answer: B) Set the nvidia.com/gpu.workload.config=vm-passthrough label on the node.
Q2. What is the role of VFIO Manager during GPU passthrough?
A) Install GPU drivers inside the VM B) Bind GPUs to the vfio-pci driver C) Create vGPU instances D) Collect GPU metrics
Answer: B) VFIO Manager unbinds GPUs from the NVIDIA driver and binds them to vfio-pci, enabling direct VM access.
Q3. What does GPU Operator NOT do for VMs?
A) Deploy VFIO Manager B) Deploy Sandbox Device Plugin C) Install GPU drivers inside guest OS D) Automate node labeling
Answer: C) GPU Operator only manages host-level software. Guest OS GPU drivers must be installed manually by the user.
Q4. Which is NOT an advantage of vGPU?
A) Share one GPU across multiple VMs B) Live migration support C) No additional license cost D) Hardware-level GPU isolation
Answer: C) vGPU requires an NVIDIA vGPU license. Passthrough does not require additional licensing.