
SELinux, seccomp, Device Access: How KubeVirt Maintains Security Boundaries

Introduction

KubeVirt runs QEMU inside a Pod, but it does not grant that Pod unrestricted access to host resources. In fact, reading the KubeVirt codebase closely, there is more code dedicated to security boundaries than to controllers, because sensitive surfaces like /dev/kvm, TAP devices, VFIO, migration sockets, and mount namespaces all have to be handled.

This post examines how KubeVirt establishes security boundaries. Three things are central:

  • Aligning SELinux contexts.
  • Managing syscall allowances with seccomp.
  • Narrowing device access through cgroup and device plugin layers.

1. Why KubeVirt's Security Model Is More Demanding

Regular Pods often need only network and storage access. In contrast, KubeVirt must:

  • Access hardware virtualization devices.
  • Create TAP devices.
  • Move memory and device state between nodes during migration.
  • Manage guest disks, cloud-init, sockets, and QEMU processes together.

Running a single VM means touching host kernel features deeply. That is why KubeVirt does not settle for "one privileged Pod" but splits security boundaries into multiple layers.

2. SELinux: Even Within the Same Pod, the Correct Label Must Match

That KubeVirt takes SELinux seriously is evident from the API types alone. VirtualMachineInstanceStatus and MigrationState record the actual selinuxContext. This means SELinux is not just environmental information but an execution condition that migration and host-side helpers must reproduce.

Looking at pkg/virt-controller/watch/migration/migration.go, when creating the target migration Pod, the source-side SELinux context is read and applied to the target Pod; by default, the target is given the same SELinux level as the source. This design reflects the operational reality that the target Pod must be able to access the same files and sockets as the source.

Especially when RWX volumes or shared state are involved, a mismatched SELinux level is not just a warning; it is a cause of migration failure.
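To make the mechanics concrete, here is a minimal sketch of pulling the MLS level out of a full SELinux context string so it can be placed into a target Pod's securityContext. The parsing is simplified and the function name is illustrative; the real logic lives in KubeVirt's migration controller code.

```go
package main

import (
	"fmt"
	"strings"
)

// extractLevel pulls the MLS level (e.g. "s0:c123,c456") out of a full
// SELinux context of the form user:role:type:level. The level itself may
// contain a colon, so only the first three fields are split off.
// Simplified sketch; not KubeVirt's exact implementation.
func extractLevel(ctx string) (string, error) {
	parts := strings.SplitN(ctx, ":", 4)
	if len(parts) != 4 {
		return "", fmt.Errorf("unexpected SELinux context: %q", ctx)
	}
	return parts[3], nil
}

func main() {
	src := "system_u:system_r:container_t:s0:c123,c456"
	level, err := extractLevel(src)
	if err != nil {
		panic(err)
	}
	// The target pod spec would carry this level in its
	// securityContext.seLinuxOptions so it can reach the same files.
	fmt.Println(level) // s0:c123,c456
}
```

The key detail is that the category pair (here c123,c456) is what grants access to the shared volumes, so copying only the type without the level would still fail.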

3. SELinux Also Affects Network Helper Execution

SELinux is not used only for migration Pods. Looking at pkg/network/driver/virtchroot/tap.go, there is an AddTapDeviceWithSELinuxLabel path when creating TAP devices. Here, helper commands are executed based on the SELinux label of the launcher PID.

The actual core of this behavior is in pkg/virt-handler/selinux/context_executor.go.

  • Reads the current label of the target PID.
  • Preserves the original label of the current process.
  • Switches to the desired label just before helper execution.
  • Restores the original label after execution completes.

KubeVirt does not stop at "a host helper does the work." It also ensures the helper executes in the SELinux context it needs, and then puts the original context back.

The reason is simple. For TAP creation or namespace-internal operations to succeed, plain root privileges may not be enough; the label context must also be correct.
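The save/switch/restore pattern above can be sketched as follows. To stay runnable without a real SELinux host, the sketch abstracts label access behind an interface; the actual code in pkg/virt-handler/selinux/context_executor.go works against the process attribute files under /proc.

```go
package main

import "fmt"

// labelStore abstracts reading and writing a process SELinux label so the
// save/switch/restore pattern can be shown without an SELinux-enabled host.
type labelStore interface {
	Get() string
	Set(label string)
}

type memStore struct{ label string }

func (m *memStore) Get() string      { return m.label }
func (m *memStore) Set(label string) { m.label = label }

// runWithLabel switches to targetLabel for the duration of fn and restores
// the original label afterwards, even if fn returns an error.
func runWithLabel(self labelStore, targetLabel string, fn func() error) error {
	orig := self.Get()    // 1. preserve the original label
	self.Set(targetLabel) // 2. switch just before helper execution
	defer self.Set(orig)  // 3. restore after execution completes
	return fn()
}

func main() {
	self := &memStore{label: "system_u:system_r:spc_t:s0"}
	launcherLabel := "system_u:system_r:container_t:s0:c1,c2"
	_ = runWithLabel(self, launcherLabel, func() error {
		fmt.Println("helper runs as:", self.Get())
		return nil
	})
	fmt.Println("restored to:", self.Get())
}
```

The defer guarantees the restore step even when the helper fails, which matters for a long-lived daemon like virt-handler that must not leak an elevated label into later operations.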

4. seccomp: Syscalls Are Not Left Wide Open Either

pkg/virt-handler/seccomp/seccomp.go installs KubeVirt-specific seccomp profiles under the kubelet root. Looking at the installation location, it creates seccomp/kubevirt/kubevirt.json under the host's kubelet management directory.

A particularly notable syscall here is userfaultfd. While building on the base profile, KubeVirt explicitly allows this syscall. The comments explain why: it is needed for post-copy migration.

This point is important.

  • Normally, syscalls are kept as close to the base profile as possible.
  • But specific stages of live migration require additional syscalls.
  • So KubeVirt does not "give up security for functionality" but precisely opens only the necessary syscalls.

In other words, seccomp in KubeVirt is not just a compliance setting; it is an adjustment layer that satisfies live migration feature requirements and security requirements at the same time.
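The "extend the base profile, don't loosen it" idea can be shown with a small sketch that appends userfaultfd to an allow list before serializing the profile. The struct here is a minimal subset loosely modeled on the OCI runtime-spec seccomp JSON, not KubeVirt's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SyscallRule is a minimal subset of an OCI-style seccomp syscall entry.
type SyscallRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

// Profile is a minimal subset of an OCI-style seccomp profile.
type Profile struct {
	DefaultAction string        `json:"defaultAction"`
	Syscalls      []SyscallRule `json:"syscalls"`
}

func main() {
	// Start from a restrictive base: deny by default, allow a known set.
	base := Profile{
		DefaultAction: "SCMP_ACT_ERRNO",
		Syscalls: []SyscallRule{
			{Names: []string{"read", "write", "mmap"}, Action: "SCMP_ACT_ALLOW"},
		},
	}
	// Post-copy live migration needs userfaultfd, so it is allowed
	// explicitly instead of weakening the default action.
	base.Syscalls = append(base.Syscalls, SyscallRule{
		Names:  []string{"userfaultfd"},
		Action: "SCMP_ACT_ALLOW",
	})
	out, _ := json.MarshalIndent(base, "", "  ")
	fmt.Println(string(out))
}
```

A profile like this is what ends up installed as seccomp/kubevirt/kubevirt.json under the kubelet root, where Pods can reference it by a localhost profile path.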

5. cgroup Device Access: Limiting the Scope of Devices QEMU Can Use

The fact that VM execution is a QEMU process is also important from a security perspective. Device access control must ultimately be process-based.

Looking at pkg/virt-handler/vm.go, KubeVirt handles device controllers and cgroup managers together. Also, cmd/virt-chroot/cgroup.go enters host cgroup paths to set actual resources through runc's cgroup manager. Support for both v1 and v2 is notable.

What this layer does is roughly:

  • Reflects only the devices needed by the VM in the allow list.
  • Applies actual kernel constraints matching cgroup v1 and v2 differences.
  • Reflects device access rules on the host side, not just CPU and memory.

In other words, KubeVirt does not rely solely on Pod spec resource requests but separately adjusts the cgroup boundaries that the actual VM process encounters on the host.
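As a rough illustration of what an allow-list entry looks like at the kernel boundary, the sketch below renders a device rule in cgroup v1 devices.allow syntax. The major:minor pair 10:232 is the conventional /dev/kvm node; under cgroup v2 the same triple is compiled into an eBPF program by the cgroup manager (runc, in KubeVirt's case) rather than written to a file.

```go
package main

import "fmt"

// deviceRule mirrors the shape of a cgroup device allow entry.
type deviceRule struct {
	Type   rune   // 'c' for a char device, 'b' for a block device
	Major  int64  // device major number
	Minor  int64  // device minor number
	Access string // subset of "rwm": read, write, mknod
}

// v1Line renders the rule in cgroup v1 devices.allow syntax.
func v1Line(r deviceRule) string {
	return fmt.Sprintf("%c %d:%d %s", r.Type, r.Major, r.Minor, r.Access)
}

func main() {
	// /dev/kvm is a char device on the misc major (10), minor 232.
	kvm := deviceRule{Type: 'c', Major: 10, Minor: 232, Access: "rwm"}
	fmt.Println(v1Line(kvm)) // c 10:232 rwm
}
```

Keeping this list down to exactly the devices a VM declares is what prevents a compromised QEMU from opening unrelated host device nodes.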

6. Device Plugin and Permanent Host Devices

Looking at the initialization code in pkg/virt-handler/vm.go, KubeVirt uses a permanent host device plugin concept for hypervisor devices. This is a structure for managing devices like /dev/kvm as node-level manageable resources.

Thanks to this structure, instead of a privileged container directly searching host devices, KubeVirt:

  • Exposes what devices exist on a node
  • Makes the scheduler consider those resources
  • Connects the actual launcher to use those devices

From this perspective, the device plugin is not just a performance feature but also a security feature. It allows explicit management of which VMs receive which host devices.
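Because /dev/kvm can be shared by many VMs, a permanent device plugin advertises a fixed-size pool of virtual device IDs rather than a single device, and the kubelet schedules against the resulting resource (KubeVirt uses names under devices.kubevirt.io). The pool size and ID format below are illustrative, not KubeVirt's exact values.

```go
package main

import "fmt"

// kvmDeviceIDs fabricates a pool of virtual device IDs for /dev/kvm.
// Each ID represents one schedulable "slot" on the shared device; the
// device plugin reports these to the kubelet in its ListAndWatch stream.
func kvmDeviceIDs(poolSize int) []string {
	ids := make([]string, 0, poolSize)
	for i := 0; i < poolSize; i++ {
		ids = append(ids, fmt.Sprintf("kvm-%d", i))
	}
	return ids
}

func main() {
	ids := kvmDeviceIDs(3)
	fmt.Println(ids) // [kvm-0 kvm-1 kvm-2]
}
```

The security effect is indirect but real: a launcher Pod only sees /dev/kvm because the scheduler granted it one of these slots, not because the container was made privileged enough to go find the device itself.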

7. Why VFIO and Host Device Passthrough Are Sensitive

SR-IOV and PCI host device passthrough pass devices to the guest through the virtwrap/device/hostdevice family and the VFIO model. This creates much stronger host dependencies than regular virtual NICs.

Therefore, such configurations directly affect migration feasibility. The API types explicitly define reasons like HostDeviceNotLiveMigratable, SEVNotLiveMigratable, and SecureExecutionNotLiveMigratable.

This means KubeVirt reveals at the API level that the moment you enable security or hardware-specific features, some flexibility must be sacrificed as a trade-off.

8. Even with Privileged Helpers, "Limited Privileged" Is the Goal

Looking at KubeVirt code, virt-handler, virt-launcher, and virt-chroot divide responsibilities.

  • Cluster-side declarations and coordination are handled by the controller.
  • Node-local privileged work is handled by virt-handler and helpers.
  • Actual VM execution is handled by virt-launcher with libvirt and QEMU.

Not everything is done by a single container. Responsibilities are separated, and host-level helpers are called only when needed.

This design is not perfect least privilege, but it distributes privileges along functional boundaries.

9. Live Migration and Security Boundaries Are Connected

Many users see live migration only as a performance and availability feature. But in practice, it is also a security context reproduction problem.

  • Can the target Pod have the same SELinux level as the source?
  • Are the syscalls needed for post-copy allowed?
  • Can migration sockets and state files be accessed?
  • Are devices and volumes prepared identically on the target?

If any of these are misaligned, migration breaks. In KubeVirt, security settings are not add-on features but prerequisites for migration.
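The checklist above can be modeled as an explicit preflight check that names each blocker. The field and message names are illustrative; KubeVirt encodes equivalent conditions across VMI status, the migration controller, and virt-handler rather than in one struct like this.

```go
package main

import "fmt"

// migrationPreflight models the security-context checks that must all
// hold before a live migration can proceed. Illustrative sketch only.
type migrationPreflight struct {
	SELinuxLevelMatches bool // target Pod carries the source's level
	UserfaultfdAllowed  bool // required for post-copy migration
	SocketsReachable    bool // migration sockets and state files
	DevicesPrepared     bool // devices and volumes ready on the target
}

// blockers returns a human-readable reason for every failed check.
func (p migrationPreflight) blockers() []string {
	var out []string
	if !p.SELinuxLevelMatches {
		out = append(out, "SELinux level mismatch")
	}
	if !p.UserfaultfdAllowed {
		out = append(out, "userfaultfd denied by seccomp")
	}
	if !p.SocketsReachable {
		out = append(out, "migration sockets unreachable")
	}
	if !p.DevicesPrepared {
		out = append(out, "target devices/volumes not prepared")
	}
	return out
}

func main() {
	p := migrationPreflight{
		SELinuxLevelMatches: true,
		UserfaultfdAllowed:  false, // e.g. a hardened custom seccomp profile
		SocketsReachable:    true,
		DevicesPrepared:     true,
	}
	fmt.Println(p.blockers())
}
```

Reading migration failures through a lens like this makes it clear that most of them are security-boundary misalignments, not transport problems.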

Key Points for Operators

  • A VM platform is not complete just because /dev/kvm access works.
  • SELinux context is directly connected to migration success.
  • seccomp can conflict with advanced features like post-copy, so it must be understood at the profile level.
  • cgroup device rules and device plugins handle both security and scheduling simultaneously.

Conclusion

KubeVirt runs QEMU inside a Pod, but the internals are far from simple. SELinux aligns execution contexts for helpers and migration targets, seccomp allows only necessary syscalls, and cgroups with device plugins manage device access scope. Ultimately, KubeVirt's security model confronts the fact that "VMs are also Linux processes" head-on and precisely adjusts the points where those processes meet the kernel.

In the next post, we will follow how these internal states are exposed to operators through VMI status, guest agent, domain stats, metrics, and debugging paths.