[Virtualization] 08. KubeVirt + GPU: Leveraging GPU Acceleration in VMs

Introduction
GPU Operator and KubeVirt Integration
- Node Labeling: gpu.workload.config
- Mode Change Workflow
GPU Passthrough Workflow
vGPU Workflow
KubeVirt CR Configuration
- permittedHostDevices Setup
Attaching GPUs in VMI Spec
- GPU Passthrough VMI
- vGPU VMI
Guest OS Driver Installation
- Linux Guest
- Windows Guest
Full Architecture Diagram
Performance Considerations
- Passthrough vs vGPU Performance Comparison
- Benchmark Reference Values
Use Cases
Troubleshooting
- Common Issues and Solutions
Conclusion

Introduction

With KubeVirt running VMs on Kubernetes and GPU Operator automating GPU management, we can now combine both technologies to leverage GPU acceleration inside VMs.

GPU Operator and KubeVirt Integration

GPU Operator supports two modes for providing GPUs to KubeVirt VMs: GPU passthrough and vGPU. A third mode, container, is the default and serves regular Pods.

Node Labeling: gpu.workload.config

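The value of the nvidia.com/gpu.workload.config node label determines the GPU usage mode. The label only takes effect when sandbox workloads are enabled in the GPU Operator (sandboxWorkloads.enabled=true in its Helm values).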

Label Value	Description	Target
container	Default. Standard container GPU	Pod
vm-passthrough	Full GPU passthrough to VM	KubeVirt VM
vm-vgpu	vGPU instances for VM	KubeVirt VM

# Set node to VM passthrough mode
kubectl label node worker-gpu-01 \
  nvidia.com/gpu.workload.config=vm-passthrough --overwrite

# Set node to vGPU mode
kubectl label node worker-gpu-01 \
  nvidia.com/gpu.workload.config=vm-vgpu --overwrite

# Restore to default container mode
kubectl label node worker-gpu-01 \
  nvidia.com/gpu.workload.config=container --overwrite
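
When several nodes are switched between modes from a script, it can help to validate the mode before calling kubectl. The following is a minimal Python sketch under that assumption. The helper name `label_command` is hypothetical; only the label key and the three values come from the table above.

```python
# Hypothetical helper: builds the kubectl invocation shown above.
VALID_MODES = ("container", "vm-passthrough", "vm-vgpu")
LABEL_KEY = "nvidia.com/gpu.workload.config"


def label_command(node: str, mode: str) -> list[str]:
    """Return the kubectl argv that sets the workload mode on a node."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    return ["kubectl", "label", "node", node, f"{LABEL_KEY}={mode}", "--overwrite"]


# Print the command instead of running it. To apply it against a real
# cluster, pass the list to subprocess.run(..., check=True).
print(" ".join(label_command("worker-gpu-01", "vm-passthrough")))
```

Rejecting unknown values up front avoids labeling a node with a typo that the GPU Operator would not recognize.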

Mode Change Workflow

Node label change
      |
      v
GPU Operator detects change
      |
      v
Clean up existing GPU software stack
      |
      v
Deploy components for new mode
      |
      +-- container: Driver + Toolkit + Device Plugin
      |
      +-- vm-passthrough: VFIO Manager + Sandbox Device Plugin
      |
      +-- vm-vgpu: vGPU Manager + vGPU Device Manager + Sandbox Device Plugin

GPU Passthrough Workflow

GPU passthrough assigns an entire physical GPU directly to a VM, providing near-native performance.

Step 1: Enable IOMMU

IOMMU (Intel VT-d or AMD-Vi) must be enabled in both the firmware (BIOS/UEFI) and the kernel.

# For Intel CPUs, add these kernel parameters in /etc/default/grub
# GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# For AMD CPUs
# GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"

# Regenerate the GRUB config (Debian/Ubuntu) and reboot
sudo update-grub
sudo reboot

# Verify IOMMU is enabled
dmesg | grep -i iommu
# Example output on Intel: DMAR: IOMMU enabled
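
The dmesg check can be complemented by looking at the kernel command line and at the IOMMU groups the kernel exposes under /sys/kernel/iommu_groups. The sketch below is one possible way to script that, not an official check; the function names and the "intel"/"amd" keys are made up for illustration, and the parameters tested are the ones used in this post.

```python
from pathlib import Path


def iommu_params_present(cmdline: str, cpu_vendor: str) -> bool:
    """Check a kernel command line for the IOMMU parameter used above.

    cpu_vendor is "intel" or "amd" (keys chosen for this sketch).
    """
    wanted = {"intel": "intel_iommu=on", "amd": "amd_iommu=on"}[cpu_vendor]
    return wanted in cmdline.split()


def iommu_groups_exist(sysfs_root: str = "/sys/kernel/iommu_groups") -> bool:
    """Return True if the kernel exposes at least one IOMMU group."""
    root = Path(sysfs_root)
    return root.is_dir() and any(root.iterdir())


# Only meaningful on a Linux host; skipped elsewhere.
proc_cmdline = Path("/proc/cmdline")
if proc_cmdline.exists():
    cmdline = proc_cmdline.read_text()
    print("intel_iommu=on present:", iommu_params_present(cmdline, "intel"))
    print("IOMMU groups exist:", iommu_groups_exist())
```

An empty or missing iommu_groups directory usually means the IOMMU is disabled in firmware or in the kernel, regardless of what the command line says.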

Step 2: VFIO Manager DaemonSet

GPU Operator automatically deploys the VFIO Manager.

+------------------------------------------+
|            Worker Node                   |
|  +------------------------------------+  |
|  | VFIO Manager (DaemonSet)           |  |
|  |                                    |  |
|  |  1. Load vfio-pci kernel module    |  |
|  |  2. Unbind GPU from host driver    |  |
|  |  3. Bind GPU to vfio-pci           |  |
|  |                                    |  |
|  +------------------------------------+  |
|                                          |
|  GPU state:                              |
|  nvidia driver --> vfio-pci driver       |
+------------------------------------------+

The VFIO Manager performs the following operations in order.

1. Loads the vfio-pci kernel module
2. Unbinds GPU devices from the NVIDIA driver
3. Binds GPUs to the vfio-pci driver
4. As a result, the VM can access the GPU directly
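
To make the unbind and bind steps concrete, the sketch below computes the sysfs writes that rebind a single PCI device to vfio-pci through the kernel's generic driver_override mechanism. It assumes the vfio-pci module is already loaded (for example with modprobe vfio-pci). This illustrates the mechanism only; it is not taken from the VFIO Manager's source, which may work differently, and the function name is hypothetical. The plan is only computed, never executed.

```python
def vfio_bind_plan(pci_address: str) -> list[tuple[str, str]]:
    """Return (sysfs path, value to write) pairs for rebinding one device.

    pci_address is a full PCI address such as "0000:3b:00.0".
    Nothing is written; the caller decides whether to apply the plan as root.
    """
    device = f"/sys/bus/pci/devices/{pci_address}"
    return [
        # Detach the device from whatever driver currently owns it.
        (f"{device}/driver/unbind", pci_address),
        # Tell the PCI core that only vfio-pci may claim this device.
        (f"{device}/driver_override", "vfio-pci"),
        # Ask the kernel to probe the device again, which binds vfio-pci.
        ("/sys/bus/pci/drivers_probe", pci_address),
    ]


for path, value in vfio_bind_plan("0000:3b:00.0"):
    print(f"echo {value} > {path}")
```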

Step 3: Sandbox Device Plugin

The Sandbox Device Plugin discovers passthrough-capable GPUs and registers them with kubelet.

+------------------------------------------+
| Sandbox Device Plugin (DaemonSet)        |
|                                          |
| 1. Discover GPU devices bound to VFIO    |
| 2. Register as nvidia.com/gpu resource   |
| 3. KubeVirt VMs can request GPUs         |
+------------------------------------------+

vGPU Workflow

vGPU splits a single physical GPU across multiple VMs.

vGPU Manager Driver Deployment

+------------------------------------------+
|            Worker Node                   |
|  +------------------------------------+  |
|  | vGPU Manager (DaemonSet)           |  |
|  |                                    |  |
|  |  - Install vGPU host driver        |  |
|  |  - Create Mediated Devices or      |  |
|  |    SR-IOV VFs                      |  |
|  +------------------------------------+  |
|                                          |
|  +------------------------------------+  |
|  | Sandbox Device Plugin              |  |
|  |  - Discover vGPU devices           |  |
|  |  - Register as nvidia.com/VGPU_TYPE|  |
|  +------------------------------------+  |
+------------------------------------------+

vGPU Implementation by GPU Architecture

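GPU Architecture	vGPU Implementation
Pre-Ampere (V100, etc.)	Mediated Devices (MDEV)
Ampere (A100)	SR-IOV Virtual Functions (exposed through MDEV on older host kernels); time-sliced or MIG-backed vGPU
Hopper and later (H100)	SR-IOV Virtual Functions; time-sliced or MIG-backed vGPU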

vGPU Type Examples

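Examples of MIG-backed vGPU profiles on an A100 40GB are listed below. Check the NVIDIA vGPU documentation for the full list that applies to your GPU and driver release.

vGPU Type	Framebuffer	Max Instances per GPU
A100-1-5C	5GB	7
A100-2-10C	10GB	3
A100-3-20C	20GB	2
A100-4-20C	20GB	1
A100-7-40C	40GB	1
A100-1-5CME	5GB (with media extensions)	1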
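
The per-GPU instance limit translates directly into VM density. As a rough planning sketch, assuming every GPU on a node is configured with the same profile, the arithmetic looks like this. The function names are hypothetical, and the two limits are copied from the first two rows of the table above.

```python
import math

# Max vGPU instances per physical GPU, from two rows of the table above.
MAX_INSTANCES = {"A100-1-5C": 7, "A100-2-10C": 3}


def vms_per_node(profile: str, gpus_per_node: int) -> int:
    """Upper bound on single-vGPU VMs one node can host with this profile."""
    return MAX_INSTANCES[profile] * gpus_per_node


def nodes_needed(profile: str, gpus_per_node: int, vm_count: int) -> int:
    """Nodes required to host vm_count VMs, one vGPU each."""
    return math.ceil(vm_count / vms_per_node(profile, gpus_per_node))


print(vms_per_node("A100-1-5C", 4))       # 7 instances x 4 GPUs
print(nodes_needed("A100-2-10C", 4, 25))  # 12 VMs fit per node
```

CPU and memory limits on the node will often cap density before the vGPU count does, so treat the result as an upper bound.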

KubeVirt CR Configuration

permittedHostDevices Setup

GPU devices must be explicitly permitted in the KubeVirt CR before VMs can use them.

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: '10DE:20B5'
          resourceName: 'nvidia.com/gpu'
          externalResourceProvider: true
      mediatedDevices:
        - mdevNameSelector: 'NVIDIA A100-1-5C'
          resourceName: 'nvidia.com/NVIDIA_A100-1-5C'
          externalResourceProvider: true

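The pciVendorSelector is the GPU's PCI vendor:device ID pair (10DE is NVIDIA's vendor ID). The resourceName must match the resource that the Sandbox Device Plugin actually advertises on the node, which may be a model-specific name rather than nvidia.com/gpu, so confirm it in the node's allocatable resources. You can find the PCI ID with the following command.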

# Check GPU PCI ID
lspci -nn | grep NVIDIA
# Example output (lspci prints the ID in lowercase hex):
# 3b:00.0 3D controller [0302]: NVIDIA Corporation A100 [10de:20b5]
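
If many GPU models are involved, the selector can be extracted from the lspci output programmatically. The following is a small sketch of that idea. The function names are hypothetical, the sample line mirrors the one above, and the returned dictionary simply follows the pciHostDevices fields used in this post.

```python
import re

# Matches a bracketed vendor:device pair such as [10de:20b5].
# The class code, e.g. [0302], has no colon and is therefore skipped.
_PCI_ID = re.compile(r"\[([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})\]")


def pci_selector(lspci_line: str) -> str:
    """Extract the vendor:device ID from one `lspci -nn` line, uppercased."""
    match = _PCI_ID.search(lspci_line)
    if match is None:
        raise ValueError("no [vendor:device] ID found in line")
    vendor, device = match.groups()
    return f"{vendor}:{device}".upper()


def pci_host_device(lspci_line: str, resource_name: str) -> dict:
    """Build one pciHostDevices entry in the shape used in the CR above."""
    return {
        "pciVendorSelector": pci_selector(lspci_line),
        "resourceName": resource_name,
        "externalResourceProvider": True,
    }


sample = "3b:00.0 3D controller [0302]: NVIDIA Corporation A100 [10de:20b5]"
print(pci_host_device(sample, "nvidia.com/gpu"))
```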

Attaching GPUs in VMI Spec

GPU Passthrough VMI

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-passthrough-vm
spec:
  running: true
  template:
    metadata:
      labels:
        app: gpu-vm
    spec:
      domain:
        cpu:
          cores: 8
        memory:
          guest: 32Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          gpus:
            - name: gpu1
              deviceName: nvidia.com/gpu
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          dataVolume:
            name: ubuntu-gpu-dv

vGPU VMI

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4
        memory:
          guest: 16Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          gpus:
            - name: vgpu1
              deviceName: nvidia.com/NVIDIA_A100-1-5C
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          dataVolume:
            name: ubuntu-vgpu-dv
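
A common mistake is requesting a deviceName in the VM that is not listed under permittedHostDevices in the KubeVirt CR. The sketch below shows one way to cross-check the two manifests once they are loaded as Python dictionaries (for example from `kubectl get ... -o json`). The function names are hypothetical, and the dictionaries are trimmed-down stand-ins for the manifests above.

```python
def permitted_resources(kubevirt_cr: dict) -> set[str]:
    """Collect every resourceName listed under permittedHostDevices."""
    config = kubevirt_cr.get("spec", {}).get("configuration", {})
    devices = config.get("permittedHostDevices", {})
    names: set[str] = set()
    for kind in ("pciHostDevices", "mediatedDevices"):
        for entry in devices.get(kind, []):
            names.add(entry["resourceName"])
    return names


def unpermitted_gpus(vm: dict, kubevirt_cr: dict) -> list[str]:
    """Return GPU deviceNames the VM requests that the CR does not permit."""
    devices = vm["spec"]["template"]["spec"]["domain"]["devices"]
    allowed = permitted_resources(kubevirt_cr)
    return [gpu["deviceName"] for gpu in devices.get("gpus", [])
            if gpu["deviceName"] not in allowed]


# Trimmed-down example data; the 2-10C request is deliberately not permitted.
kubevirt_cr = {"spec": {"configuration": {"permittedHostDevices": {
    "pciHostDevices": [{"resourceName": "nvidia.com/gpu"}],
    "mediatedDevices": [{"resourceName": "nvidia.com/NVIDIA_A100-1-5C"}],
}}}}
vm = {"spec": {"template": {"spec": {"domain": {"devices": {"gpus": [
    {"name": "vgpu1", "deviceName": "nvidia.com/NVIDIA_A100-2-10C"},
]}}}}}}

print(unpermitted_gpus(vm, kubevirt_cr))
```

An empty list means every requested GPU resource is permitted; it does not guarantee that a node currently has that resource available.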

Guest OS Driver Installation

GPU Operator does NOT install drivers inside VMs. You must install them directly in the guest OS.

Linux Guest

# Connect to VM console
virtctl console gpu-passthrough-vm

# Install kernel headers for the running kernel (Ubuntu)
sudo apt-get update
sudo apt-get install -y linux-headers-$(uname -r)

# Install the driver from Ubuntu's package repositories.
# The NVIDIA Container Toolkit repository and key are not needed for this step.
# For a vGPU VM, install the NVIDIA vGPU guest driver instead.
sudo apt-get install -y nvidia-driver-550

# Reboot, reconnect to the console, then verify
sudo reboot
nvidia-smi
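
For automated checks inside the guest, nvidia-smi can emit machine-readable output with `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`. The parser below is a minimal sketch for that output. The function name is hypothetical, and the sample line is illustrative rather than captured from a real VM.

```python
def parse_gpu_query(output: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`.

    Each line looks like "<name>, <number> MiB".
    """
    gpus = []
    for line in output.strip().splitlines():
        # Split on the last comma so a comma in the name would not break parsing.
        name, memory = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "memory_mib": int(memory.split()[0])})
    return gpus


# Illustrative sample, not real output from this setup.
sample = "NVIDIA A100-PCIE-40GB, 40960 MiB\n"
print(parse_gpu_query(sample))
```

An empty result after the reboot suggests that either the device is not attached to the VM or the guest driver did not load.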

Windows Guest

1. Download the driver from NVIDIA
   - Passthrough: standard GPU driver
   - vGPU: NVIDIA vGPU guest driver (distributed with the licensed vGPU software)

2. Connect to the Windows VM via RDP or VNC

3. Run the driver installer

4. Reboot and verify the GPU in Device Manager

Full Architecture Diagram

+-------------------------------------------------------------------+
|                        Kubernetes Cluster                         |
|                                                                   |
|  +---------------------------+  +------------------------------+  |
|  | Container Node            |  | GPU Node (vm-passthrough)    |  |
|  | gpu.workload.config:      |  | gpu.workload.config:         |  |
|  |   container               |  |   vm-passthrough             |  |
|  |                           |  |                              |  |
|  | +-------+ +-------+       |  | +-------------------------+  |  |
|  | | Pod   | | Pod   |       |  | | VFIO Manager            |  |  |
|  | | GPU:1 | | GPU:1 |       |  | | (vfio-pci bind)         |  |  |
|  | +-------+ +-------+       |  | +-------------------------+  |  |
|  |                           |  |                              |  |
|  | NVIDIA Driver             |  | +-------------------------+  |  |
|  | Container Toolkit         |  | | Sandbox Device Plugin   |  |  |
|  | Device Plugin             |  | +-------------------------+  |  |
|  +---------------------------+  |                              |  |
|                                 | +--------+  +--------+       |  |
|                                 | |  VM 1  |  |  VM 2  |       |  |
|                                 | | GPU:1  |  | GPU:1  |       |  |
|                                 | +--------+  +--------+       |  |
|                                 +------------------------------+  |
+-------------------------------------------------------------------+

Performance Considerations

Passthrough vs vGPU Performance Comparison

Factor	Passthrough	vGPU
GPU Utilization	100% (dedicated)	Shared (by profile)
Performance Overhead	Roughly 1-3%	Roughly 5-10%
GPU Memory	Full access	Limited by profile
VM Density	1 GPU = 1 VM	1 GPU = multiple VMs
Live Migration	Limited	Supported by NVIDIA vGPU on some platforms; verify for your KubeVirt version
License	No additional	vGPU license required

Benchmark Reference Values

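The following are rough reference values for an A100 GPU, expressed relative to bare metal. Treat them as indicative only and measure your own workload before making sizing decisions.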

Workload	Bare Metal	Passthrough	vGPU (Full Profile)
ResNet-50 Training	100%	97-99%	90-95%
BERT Inference	100%	98-99%	92-96%
CUDA Benchmark	100%	97-99%	91-95%
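
The percentages in the table are throughput relative to bare metal. When you run your own benchmark, the same normalization can be computed as below. This is a simple arithmetic sketch; the function names are hypothetical and the throughput numbers are made up for illustration.

```python
def relative_performance(baseline: float, measured: float) -> float:
    """Throughput of a VM run as a percentage of the bare-metal baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return round(measured / baseline * 100, 1)


def overhead_percent(baseline: float, measured: float) -> float:
    """Virtualization overhead as a percentage of the baseline."""
    return round(100 - relative_performance(baseline, measured), 1)


# Made-up throughput numbers (for example, images per second).
bare_metal = 2500.0
passthrough = 2450.0
print(relative_performance(bare_metal, passthrough))
print(overhead_percent(bare_metal, passthrough))
```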

Use Cases

1. VDI (Virtual Desktop Infrastructure) on K8s

+--------------------------------------------------+
| Kubernetes Cluster                               |
|                                                  |
| +--------+ +--------+ +--------+ +--------+      |
| | Win VM | | Win VM | | Win VM | | Win VM |      |
| | vGPU   | | vGPU   | | vGPU   | | vGPU   |      |
| +--------+ +--------+ +--------+ +--------+      |
|      |          |          |          |          |
|      +----------+----------+----------+          |
|                     |                            |
|              +------+------+                     |
|              | Physical GPU|                     |
|              | (A100/L40)  |                     |
|              +-------------+                     |
+--------------------------------------------------+

2. Legacy Windows ML Workloads

Run Windows-based ML applications that cannot be containerized in KubeVirt VMs with GPU acceleration.

3. Multi-tenant GPU Isolation

Use vGPU to give each tenant its own slice of GPU memory. With MIG-backed vGPU, compute resources are also partitioned at the hardware level.

4. VMware Migration

Migrate GPU VMs from existing VMware environments to KubeVirt to reduce hypervisor licensing costs. NVIDIA vGPU licensing still applies when vGPU is used.

Troubleshooting

Common Issues and Solutions

Problem	Cause	Solution
GPU not visible in VM	permittedHostDevices not set	Add device to KubeVirt CR
VFIO bind failure	IOMMU not enabled	Check BIOS/kernel IOMMU settings
vGPU creation failure	vGPU Manager not installed	Check GPU Operator vGPU config
nvidia-smi fails in guest	Guest driver not installed	Install driver inside VM

# Check GPU device status
kubectl get node worker-gpu-01 -o json | jq '.status.allocatable'

# Check VFIO bind status
# (label values can differ between GPU Operator versions;
#  list them with: kubectl get pods -n gpu-operator --show-labels)
kubectl logs -n gpu-operator -l app=nvidia-vfio-manager

# Check Sandbox Device Plugin logs
kubectl logs -n gpu-operator -l app=nvidia-sandbox-device-plugin
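
When checking many nodes, the allocatable output can be filtered down to the NVIDIA resources only. The sketch below does this for the JSON printed by `kubectl get node <name> -o json`. The function name is hypothetical and the sample document is a trimmed-down, made-up node object.

```python
import json


def nvidia_allocatable(node_json: str) -> dict[str, int]:
    """Return the nvidia.com/* allocatable resources of one node."""
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    return {name: int(count) for name, count in allocatable.items()
            if name.startswith("nvidia.com/")}


# Made-up sample; a real node object has many more fields.
sample = json.dumps({"status": {"allocatable": {
    "cpu": "64",
    "memory": "527951512Ki",
    "nvidia.com/gpu": "2",
}}})
print(nvidia_allocatable(sample))
```

If the result is empty on a node labeled for VM workloads, the Sandbox Device Plugin has not registered any devices yet, so its logs are the next place to look.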

Conclusion

Combining KubeVirt with GPU Operator provides a powerful way to consolidate VM-based GPU workloads on Kubernetes. Passthrough is ideal when maximum performance is required, while vGPU is suited for GPU sharing and efficiency.

In the next post, we will comprehensively compare QEMU, VirtualBox, VMware, and KubeVirt to analyze each platform's strengths and weaknesses.

Quiz: KubeVirt GPU Knowledge Check

Q1. What node label value enables GPU passthrough mode?

A) container B) vm-passthrough C) gpu-direct D) vfio-passthrough

Answer: B) Set the nvidia.com/gpu.workload.config=vm-passthrough label on the node.

Q2. What is the role of VFIO Manager during GPU passthrough?

A) Install GPU drivers inside the VM B) Bind GPUs to the vfio-pci driver C) Create vGPU instances D) Collect GPU metrics

Answer: B) VFIO Manager unbinds GPUs from the NVIDIA driver and binds them to vfio-pci, enabling direct VM access.

Q3. What does GPU Operator NOT do for VMs?

A) Deploy VFIO Manager B) Deploy Sandbox Device Plugin C) Install GPU drivers inside guest OS D) Automate node labeling

Answer: C) GPU Operator only manages host-level software. Guest OS GPU drivers must be installed manually by the user.

Q4. Which is NOT an advantage of vGPU?

A) Share one GPU across multiple VMs B) Live migration support C) No additional license cost D) Hardware-level GPU isolation

Answer: C) vGPU requires an NVIDIA vGPU license. Passthrough does not require additional licensing.