KVM, cgroup, namespace, tap, netlink: The Kernel Technologies Behind KubeVirt

Introduction

To deeply understand KubeVirt, reading the Go controller code alone is not enough. The system is possible only because the Linux kernel and host runtime provide the underlying primitives. KubeVirt did not create a new hypervisor -- it ties existing kernel features and QEMU/libvirt into the Kubernetes resource model.

This post organizes the kernel technologies that form this foundation.

1. KVM: Why VMs Can Run "Fast"

The most critical piece is /dev/kvm. pkg/virt-handler/node-labeller/util/util.go even defines KVMPath = "/dev/kvm" as a constant -- a small detail that shows how central hardware virtualization acceleration is to KubeVirt.

Emulation is possible with QEMU alone (TCG), but performance suffers significantly. With KVM, guest code runs directly on the CPU in a hardware-assisted guest mode (Intel VT-x / AMD-V), leaving QEMU to handle mainly I/O emulation and other VM exits.

In other words, for KubeVirt to function as a production VM platform, /dev/kvm access must be available.

2. Namespace: Why a Pod Can Be a VM Execution Sandbox

A Pod is not just a deployment bundle -- it is a set of namespaces.

  • Network namespace
  • Mount namespace
  • PID namespace

QEMU is ultimately a Linux process. Therefore, it can run within these namespace boundaries. This is the most realistic meaning of "VMs run on top of Pods."

Looking at KubeVirt's virt-chroot code, you can find functions that enter specific mount namespaces to perform mount or SELinux operations. KubeVirt actively uses namespaces not as an abstract concept but as a practical operational tool.

3. cgroup: VMs Must Also Be Subject to Host Resource Constraints

Being a VM does not exempt a workload from host resource constraints. QEMU and its related threads are ordinary Linux processes and must be subject to cgroup limits.

Looking at cmd/virt-chroot/cgroup.go, pkg/virt-handler/cgroup/*, and vm.go, you can see that KubeVirt pays attention to cgroup v1 and v2 differences, device allow lists, cpuset, and block device rules.

In other words, KubeVirt does not stop at the "Pod request" level but aims to apply precise cgroup resource constraints to the actual VM execution process.

4. TAP: The Key Device Connecting Guest NIC to Host Network

TAP devices are the representative backend connecting the virtual NIC seen by the guest to the host or Pod-side network. cmd/virt-chroot/tap-device-maker.go and pkg/network/setup/podnic.go demonstrate this well.

From the guest's perspective, a virtual NIC like virtio-net is visible, but on the host side, TAP devices, bridges, and NAT rules connect the packets.

In other words, TAP is the practical gateway between the guest and Pod namespace network.
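A TAP device is created by opening /dev/net/tun and issuing the TUNSETIFF ioctl -- essentially the job of a tap-device-maker. A hedged sketch with our own helper names, using the amd64 value of TUNSETIFF; actually creating the device requires CAP_NET_ADMIN in the current network namespace:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

const (
	iffTap    = 0x0002     // IFF_TAP: carry full Ethernet frames
	iffNoPI   = 0x1000     // IFF_NO_PI: no extra packet-information header
	tunSetIff = 0x400454ca // TUNSETIFF ioctl request (linux/if_tun.h, amd64)
)

// ifReq mirrors the struct ifreq layout TUNSETIFF expects:
// a 16-byte interface name followed by a 16-bit flags field.
type ifReq struct {
	Name  [16]byte
	Flags uint16
	pad   [22]byte
}

// newIfReq builds the TUNSETIFF request for a TAP device with the given name.
func newIfReq(name string) (*ifReq, error) {
	var req ifReq
	if len(name) >= len(req.Name) {
		return nil, fmt.Errorf("interface name %q too long", name)
	}
	copy(req.Name[:], name)
	req.Flags = iffTap | iffNoPI
	return &req, nil
}

// createTap opens /dev/net/tun and attaches a TAP interface to the fd.
func createTap(name string) (*os.File, error) {
	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		return nil, err
	}
	req, err := newIfReq(name)
	if err != nil {
		f.Close()
		return nil, err
	}
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		uintptr(tunSetIff), uintptr(unsafe.Pointer(req))); errno != 0 {
		f.Close()
		return nil, errno
	}
	return f, nil
}

func main() {
	if _, err := createTap("tap-demo0"); err != nil {
		fmt.Println("tap creation failed (needs CAP_NET_ADMIN):", err)
		return
	}
	fmt.Println("tap-demo0 created")
}
```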

5. netlink: The Control Channel for Kernel Network Objects

The reason KubeVirt network code heavily uses netlink is that bridges, addresses, link states, TAP devices, and MTU settings are all kernel network objects.

For example:

  • TAP creation
  • MTU configuration
  • Bridge address calculation
  • Masquerade gateway and guest IP calculation

These operations are not simple file edits but kernel network state manipulations.

In other words, KubeVirt networking is not just a "YAML to CNI" problem -- it is Linux network stack manipulation inside namespaces.
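One of the operations listed above, masquerade gateway and guest IP calculation, is pure computation and easy to illustrate. A sketch assuming the conventional layout (gateway = first usable address, guest = second; KubeVirt's masquerade binding defaults to the 10.0.2.0/24 network) -- the helper names are ours:

```go
package main

import (
	"fmt"
	"net"
)

// masqueradeAddrs derives the gateway and guest addresses from a
// masquerade CIDR: the gateway takes the first usable host address
// and the guest takes the second.
func masqueradeAddrs(cidr string) (gw, guest net.IP, err error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, nil, err
	}
	return nextIP(ipnet.IP, 1), nextIP(ipnet.IP, 2), nil
}

// nextIP returns base + n, treating the IPv4 address as a big-endian integer.
func nextIP(base net.IP, n int) net.IP {
	ip := make(net.IP, net.IPv4len)
	copy(ip, base.To4())
	for i := len(ip) - 1; i >= 0 && n > 0; i-- {
		sum := int(ip[i]) + n
		ip[i] = byte(sum & 0xff)
		n = sum >> 8
	}
	return ip
}

func main() {
	gw, guest, _ := masqueradeAddrs("10.0.2.0/24")
	fmt.Println("gateway:", gw, "guest:", guest) // gateway: 10.0.2.1 guest: 10.0.2.2
}
```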

6. nftables: The Actual Data Plane for Masquerade Binding

pkg/network/setup/netpod/masquerade/masquerade.go configures nftables rules for masquerade binding. This is an important point.

Many users think of NAT abstractly, but in practice:

  • NAT table
  • Prerouting
  • Postrouting
  • DNAT
  • SNAT
  • Masquerade

These rules enter the kernel packet path.

In other words, KubeVirt's guest egress and some inbound models rest on kernel packet filtering and NAT capabilities, not userspace magic.
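As an illustration, a masquerade ruleset has roughly the following shape. This is a hedged sketch in standard nft syntax, not KubeVirt's exact generated rules; the guest address 10.0.2.2 and the interface name eth0 are assumptions for the example:

```
table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat;
        # Inbound traffic hitting the Pod interface is DNATed to the guest.
        iifname "eth0" dnat to 10.0.2.2
    }
    chain postrouting {
        type nat hook postrouting priority srcnat;
        # Guest egress leaves with the Pod's address.
        oifname "eth0" masquerade
    }
}
```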

7. VFIO: The Foundation for SR-IOV and Device Passthrough

Looking at docs/network/sriov.md and the virtwrap/device/hostdevice code, SR-IOV and device passthrough rely on the VFIO model. To pass a device directly to a guest, the host kernel must be able to safely delegate the device to a userspace hypervisor.

Therefore, SR-IOV is not a simple network feature -- it is a set of kernel and hardware features that must all align:

  • IOMMU
  • VFIO
  • Device plugin
  • libvirt hostdev configuration

8. seccomp and Device Allow Lists Are Also Kernel Issues

virt-handler also installs seccomp profiles. In the hotplug block device path, cgroup device rules are adjusted. This means the system is designed not as "QEMU is a simple privileged process that can do everything" but as one that allows only necessary syscalls and device access.

In other words, kernel primitives are essential not only for performance but also for security boundaries.

Why Kubernetes Alone Cannot Do This

Kubernetes places Pods and prepares volumes and networking. However:

  • How to use /dev/kvm
  • How to create TAP devices
  • How to connect libvirt and QEMU
  • How to move guest memory and device state

These virtualization-specific problems are not solved by Kubernetes itself. KubeVirt fills precisely this gap. However, the foundation remains Linux kernel features.

Common Misconceptions

Misconception 1: KubeVirt adds a virtualization engine inside Kubernetes

No. The virtualization engine itself is KVM, QEMU, and libvirt. KubeVirt integrates them with Kubernetes.

Misconception 2: Running on top of Pods means kernel technologies are not very important

The opposite is true. Even with the Pod wrapper, actual execution happens on kernel namespaces, cgroups, KVM, netlink, and nftables.

Misconception 3: Networking and devices are mostly userspace logic

No. The important parts continually touch kernel objects and kernel policies.

Key Points for Operators

  • VM problems are often not Kubernetes problems but host kernel problems.
  • You should look at /dev/kvm, cpuset, cgroup device rules, TAP, and nftables state together.
  • SR-IOV and VFIO require hardware, BIOS, kernel, CNI, and libvirt to all align.

Conclusion

The reason KubeVirt can implement VMs on top of Pods is not because Kubernetes does everything, but because the Linux kernel already provides powerful virtualization and isolation primitives. KVM enables CPU acceleration, namespaces and cgroups provide execution boundaries, TAP and netlink enable guest network connectivity, nftables enables NAT, and VFIO enables device passthrough. KubeVirt is a system that combines precisely these primitives on top of the Kubernetes resource model.

In the next post, we will examine the security boundaries that sit on top of these kernel boundaries -- KubeVirt's security model related to SELinux, seccomp, and device access.