SELinux, seccomp, Device Access: How KubeVirt Maintains Security Boundaries

Introduction
1. Why KubeVirt's Security Model Is More Demanding
2. SELinux: Even Within the Same Pod, the Correct Label Must Match
3. SELinux Also Affects Network Helper Execution
4. seccomp: Syscalls Are Not Left Wide Open Either
5. cgroup Device Access: Limiting the Scope of Devices QEMU Can Use
6. Device Plugin and Permanent Host Devices
7. Why VFIO and Host Device Passthrough Are Sensitive
8. Even with Privileged Helpers, "Limited Privileged" Is the Goal
9. Live Migration and Security Boundaries Are Connected
Key Points for Operators
Conclusion

Introduction

KubeVirt runs QEMU inside a Pod, but it does not give that process unrestricted access to host resources. In fact, a close reading of the KubeVirt source shows that much of the code dealing with security boundaries is trickier than the controller code, because it has to handle sensitive paths such as /dev/kvm, TAP, VFIO, migration sockets, and mount namespaces.

This post examines how KubeVirt establishes security boundaries. Three things are central:

Aligning SELinux contexts.
Managing syscall allowances with seccomp.
Narrowing device access through cgroup and device plugin layers.

1. Why KubeVirt's Security Model Is More Demanding

Regular Pods often need only network and storage access. In contrast, KubeVirt must:

Access hardware virtualization devices.
Create TAP devices.
Move memory and device state between nodes during migration.
Manage guest disks, cloud-init, sockets, and QEMU processes together.

Running even a single VM means reaching deep into host kernel features. That is why KubeVirt does not settle for "one privileged Pod" and instead splits its security boundaries into several layers.

2. SELinux: Even Within the Same Pod, the Correct Label Must Match

That KubeVirt takes SELinux seriously is evident from the API types alone. VirtualMachineInstanceStatus and MigrationState record the actual selinuxContext. This means SELinux is not just environmental information but an execution condition that migration and host-side helpers must reproduce.

pkg/virt-controller/watch/migration/migration.go shows that, when the target migration Pod is created, the source-side SELinux context can be read and applied to the target Pod. By default, the target is given the same level as the source. This design reflects the operational reality that "the target Pod must also be able to access the same files and sockets."

When RWX volumes or shared state are involved in particular, an SELinux level mismatch is not merely a warning; it causes the migration to fail.
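To make the "same level" requirement concrete, here is a minimal sketch of how two SELinux contexts could be compared by level. It assumes the usual user:role:type:level string layout; the function names selinuxLevel and sameLevel and the sample contexts are made up for illustration and are not taken from the KubeVirt source.

```go
package main

import (
	"fmt"
	"strings"
)

// selinuxLevel returns the level part of a context of the form
// user:role:type:level, e.g. "system_u:system_r:container_t:s0:c12,c34".
// The level itself may contain colons, so only the first three are split off.
func selinuxLevel(context string) (string, error) {
	parts := strings.SplitN(context, ":", 4)
	if len(parts) != 4 || parts[3] == "" {
		return "", fmt.Errorf("context %q has no level field", context)
	}
	return parts[3], nil
}

// sameLevel reports whether two contexts carry an identical level.
func sameLevel(source, target string) (bool, error) {
	s, err := selinuxLevel(source)
	if err != nil {
		return false, err
	}
	t, err := selinuxLevel(target)
	if err != nil {
		return false, err
	}
	return s == t, nil
}

func main() {
	// Hypothetical contexts for a source and a target launcher Pod.
	source := "system_u:system_r:container_t:s0:c12,c34"
	target := "system_u:system_r:container_t:s0:c56,c78"

	ok, err := sameLevel(source, target)
	if err != nil {
		fmt.Println("cannot compare:", err)
		return
	}
	fmt.Println("levels match:", ok)
}
```

With differing category pairs, as in the sample above, files labeled for the source would not be accessible from the target, which is the mismatch the default behavior avoids.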

3. SELinux Also Affects Network Helper Execution

SELinux is not used only for migration Pods. pkg/network/driver/virtchroot/tap.go contains an AddTapDeviceWithSELinuxLabel path for creating TAP devices, in which the helper command is executed under the SELinux label of the launcher PID.

The core of this behavior is in pkg/virt-handler/selinux/context_executor.go, which does the following:

Reads the current label of the target PID.
Preserves the original label of the current process.
Switches to the desired label just before helper execution.
Restores the original label after execution completes.

KubeVirt does not stop at "a host helper does the work." It also reproduces the SELinux context in which that helper must execute.

The reason is simple. Root privileges alone may not be enough for TAP creation or for work inside a namespace to succeed; the helper may also have to run under the correct label.
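The four steps above follow a save, switch, run, restore pattern. The sketch below illustrates that pattern only; it is not the code in context_executor.go. The labeler interface, fakeLabeler, and runWithLabel are hypothetical names, and the label is kept in memory so the example runs on a machine without SELinux.

```go
package main

import (
	"errors"
	"fmt"
)

// labeler is a hypothetical abstraction over reading and setting
// the SELinux label of the current process.
type labeler interface {
	Current() (string, error)
	Set(label string) error
}

// fakeLabeler keeps the label in memory so the sketch runs anywhere.
type fakeLabeler struct{ label string }

func (f *fakeLabeler) Current() (string, error) { return f.label, nil }

func (f *fakeLabeler) Set(label string) error {
	if label == "" {
		return errors.New("empty label")
	}
	f.label = label
	return nil
}

// runWithLabel switches to the desired label, runs fn, and restores the
// original label afterwards, even when fn fails.
func runWithLabel(l labeler, desired string, fn func() error) (err error) {
	original, err := l.Current()
	if err != nil {
		return fmt.Errorf("reading original label: %w", err)
	}
	if err := l.Set(desired); err != nil {
		return fmt.Errorf("switching to %q: %w", desired, err)
	}
	defer func() {
		// A failed restore is reported together with fn's own error.
		if restoreErr := l.Set(original); restoreErr != nil {
			err = errors.Join(err, restoreErr)
		}
	}()
	return fn()
}

func main() {
	l := &fakeLabeler{label: "system_u:system_r:spc_t:s0"}
	// In the real flow this label would be read from the launcher PID.
	launcherLabel := "system_u:system_r:container_t:s0:c12,c34"

	err := runWithLabel(l, launcherLabel, func() error {
		fmt.Println("helper runs as:", l.label)
		return nil
	})
	fmt.Println("restored to:", l.label, "error:", err)
}
```

Using defer for the restore step means the original label comes back on every exit path. A real implementation would presumably also need to pin the goroutine to one OS thread, since process attributes of this kind are set per thread on Linux.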

4. seccomp: Syscalls Are Not Left Wide Open Either

pkg/virt-handler/seccomp/seccomp.go installs a KubeVirt-specific seccomp profile under the kubelet root: it creates seccomp/kubevirt/kubevirt.json under the kubelet-managed directory on the host.

A particularly notable syscall here is userfaultfd. While building on the base profile, KubeVirt explicitly allows this syscall. The comments explain why: it is needed for post-copy migration.

This point is important.

Normally, syscalls are kept as close to the base profile as possible.
But specific stages of live migration require additional syscalls.
So KubeVirt does not "give up security for functionality" but precisely opens only the necessary syscalls.

In other words, seccomp in KubeVirt is not just a compliance setting. It is an adjustment layer that has to satisfy the feature requirements of live migration and the security requirements at the same time.
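The idea of "base profile plus one explicitly allowed syscall" can be sketched as follows. The JSON layout (defaultAction, syscalls, names, action) is the one commonly used for container seccomp profiles. The three-syscall base profile is a tiny stand-in rather than the real default, and the type and function names are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// syscallRule and profile mirror the JSON layout commonly used for
// container seccomp profiles.
type syscallRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

type profile struct {
	DefaultAction string        `json:"defaultAction"`
	Syscalls      []syscallRule `json:"syscalls"`
}

// withAllowed returns a copy of base with one extra allow rule appended.
// The base profile itself is left unchanged.
func withAllowed(base profile, names ...string) profile {
	out := profile{DefaultAction: base.DefaultAction}
	out.Syscalls = append(out.Syscalls, base.Syscalls...)
	out.Syscalls = append(out.Syscalls, syscallRule{Names: names, Action: "SCMP_ACT_ALLOW"})
	return out
}

// allows reports whether the profile explicitly allows the named syscall.
func allows(p profile, name string) bool {
	for _, rule := range p.Syscalls {
		if rule.Action != "SCMP_ACT_ALLOW" {
			continue
		}
		for _, n := range rule.Names {
			if n == name {
				return true
			}
		}
	}
	return false
}

func main() {
	// Stand-in base profile: deny by default, allow a handful of syscalls.
	base := profile{
		DefaultAction: "SCMP_ACT_ERRNO",
		Syscalls: []syscallRule{
			{Names: []string{"read", "write", "exit_group"}, Action: "SCMP_ACT_ALLOW"},
		},
	}

	p := withAllowed(base, "userfaultfd")
	out, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("userfaultfd allowed:", allows(p, "userfaultfd"))
}
```

Keeping the addition as a separate rule, instead of editing the base list, makes it easy to see exactly which syscalls were opened beyond the default.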

5. cgroup Device Access: Limiting the Scope of Devices QEMU Can Use

The fact that a running VM is a QEMU process also matters for security: device access control ultimately has to be applied per process.

In pkg/virt-handler/vm.go, KubeVirt handles the device controller and the cgroup manager together. In addition, cmd/virt-chroot/cgroup.go enters the host cgroup path and sets the actual resources through runc's cgroup manager. It is also notable that both cgroup v1 and v2 are supported.

What this layer does is roughly:

Reflects only the devices needed by the VM in the allow list.
Applies actual kernel constraints matching cgroup v1 and v2 differences.
Reflects device access rules on the host side, not just CPU and memory.

In other words, KubeVirt does not rely solely on Pod spec resource requests but separately adjusts the cgroup boundaries that the actual VM process encounters on the host.
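As an illustration of what a device allow list looks like at the kernel boundary, the sketch below models a few rules and renders them in the "type major:minor access" format accepted by the cgroup v1 devices.allow file. cgroup v2 has no such file and enforces device rules through an attached eBPF program instead, which is one reason the two versions need separate handling. The deviceRule type and both functions are hypothetical; the device numbers are the ones /dev/kvm (10:232) and /dev/net/tun (10:200) typically have on Linux.

```go
package main

import (
	"fmt"
	"strings"
)

// deviceRule is a hypothetical, simplified device allow-list entry.
type deviceRule struct {
	Type        rune   // 'c' for character device, 'b' for block device
	Major       int64  // device major number
	Minor       int64  // device minor number
	Permissions string // any of r (read), w (write), m (mknod)
}

// v1Entry renders the rule in the cgroup v1 devices.allow format.
func (r deviceRule) v1Entry() string {
	return fmt.Sprintf("%c %d:%d %s", r.Type, r.Major, r.Minor, r.Permissions)
}

// permits reports whether some rule covers the requested access to a device.
func permits(rules []deviceRule, typ rune, major, minor int64, access rune) bool {
	for _, r := range rules {
		if r.Type == typ && r.Major == major && r.Minor == minor &&
			strings.ContainsRune(r.Permissions, access) {
			return true
		}
	}
	return false
}

func main() {
	// Only the devices this VM needs are listed; everything else stays denied.
	rules := []deviceRule{
		{Type: 'c', Major: 10, Minor: 232, Permissions: "rw"}, // /dev/kvm
		{Type: 'c', Major: 10, Minor: 200, Permissions: "rw"}, // /dev/net/tun
	}

	for _, r := range rules {
		fmt.Println(r.v1Entry())
	}
	fmt.Println("kvm read allowed:", permits(rules, 'c', 10, 232, 'r'))
	fmt.Println("kvm mknod allowed:", permits(rules, 'c', 10, 232, 'm'))
}
```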

6. Device Plugin and Permanent Host Devices

The initialization code in pkg/virt-handler/vm.go shows that KubeVirt has a permanent host device plugin concept for hypervisor devices. It exists so that devices such as /dev/kvm can be treated as resources managed at the node level.

Thanks to this structure, instead of a privileged container directly searching host devices, KubeVirt:

Exposes what devices exist on a node
Makes the scheduler consider those resources
Connects the actual launcher to use those devices

From this perspective, the device plugin is not just a performance feature but also a security feature. It allows explicit management of which VMs receive which host devices.
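The expose, schedule, connect flow can be reduced to simple bookkeeping over named resources. The sketch below is a toy model of that accounting, not the Kubernetes scheduler or the device plugin API. The resource name devices.kubevirt.io/kvm is assumed to be the name under which the KVM device is advertised, and fits and allocate are hypothetical helpers.

```go
package main

import "fmt"

// kvmResource is the extended resource name assumed for the KVM device.
const kvmResource = "devices.kubevirt.io/kvm"

// fits reports whether every requested resource is still free on the node.
func fits(free, requested map[string]int64) bool {
	for name, want := range requested {
		if free[name] < want {
			return false
		}
	}
	return true
}

// allocate subtracts the request from the node's free pool.
// It returns false and leaves the pool untouched when the request does not fit.
func allocate(free, requested map[string]int64) bool {
	if !fits(free, requested) {
		return false
	}
	for name, want := range requested {
		free[name] -= want
	}
	return true
}

func main() {
	// A node that advertises two KVM "slots" (the number is arbitrary here).
	free := map[string]int64{kvmResource: 2}
	launcher := map[string]int64{kvmResource: 1}

	for i := 1; i <= 3; i++ {
		ok := allocate(free, launcher)
		fmt.Printf("launcher %d scheduled: %v, free: %d\n", i, ok, free[kvmResource])
	}
}
```

The point of the model is that a launcher never receives a device it did not request through a named resource, which is what makes the assignment auditable.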

7. Why VFIO and Host Device Passthrough Are Sensitive

SR-IOV and PCI host device passthrough pass devices to the guest through the virtwrap/device/hostdevice family and the VFIO model. This creates much stronger host dependencies than regular virtual NICs.

Therefore, such configurations directly affect migration feasibility. The API types explicitly define reasons like HostDeviceNotLiveMigratable, SEVNotLiveMigratable, and SecureExecutionNotLiveMigratable.

In other words, KubeVirt makes it visible at the API level that enabling a security feature or a hardware-specific feature costs some flexibility in return.
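A minimal sketch of how such reasons could be derived from a VM's configuration is shown below. Only the three reason strings come from the API types mentioned above; the vmFeatures struct and the notMigratableReasons function are hypothetical simplifications, not KubeVirt's own logic.

```go
package main

import "fmt"

// vmFeatures is a hypothetical, simplified view of a VM's configuration.
type vmFeatures struct {
	HostDevices     int  // number of passthrough host devices
	SEV             bool // AMD SEV memory encryption enabled
	SecureExecution bool // IBM Secure Execution enabled
}

// notMigratableReasons returns the reasons that block live migration,
// or nil when none of the modeled features is in use.
func notMigratableReasons(f vmFeatures) []string {
	var reasons []string
	if f.HostDevices > 0 {
		reasons = append(reasons, "HostDeviceNotLiveMigratable")
	}
	if f.SEV {
		reasons = append(reasons, "SEVNotLiveMigratable")
	}
	if f.SecureExecution {
		reasons = append(reasons, "SecureExecutionNotLiveMigratable")
	}
	return reasons
}

func main() {
	vm := vmFeatures{HostDevices: 1, SEV: true}
	if reasons := notMigratableReasons(vm); len(reasons) > 0 {
		fmt.Println("not live-migratable:", reasons)
	} else {
		fmt.Println("live-migratable")
	}
}
```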

8. Even with Privileged Helpers, "Limited Privileged" Is the Goal

In the KubeVirt code, virt-handler, virt-launcher, and virt-chroot split the responsibilities among themselves.

Cluster-side declarations and coordination are handled by the controller.
Node-local privileged work is handled by virt-handler and helpers.
Actual VM execution is handled by virt-launcher with libvirt and QEMU.

Not everything is done by a single container. Responsibilities are separated, and host-level helpers are called only when needed.

This design is not perfect least privilege, but it does move toward distributing privileges along functional boundaries.

9. Live Migration and Security Boundaries Are Connected

Many users see live migration only as a performance and availability feature. In practice, it is also a problem of reproducing the source's security context on the target.

Can the target Pod have the same SELinux level as the source?
Are the syscalls needed for post-copy allowed?
Can migration sockets and state files be accessed?
Are devices and volumes prepared identically on the target?

If any of these are misaligned, migration breaks. In KubeVirt, security settings are not add-on features but prerequisites for migration.

Key Points for Operators

A VM platform is not complete just because /dev/kvm access works.
SELinux context is directly connected to migration success.
seccomp can conflict with advanced features like post-copy, so it must be understood at the profile level.
cgroup device rules and device plugins handle both security and scheduling simultaneously.

Conclusion

KubeVirt runs QEMU inside a Pod, but the internals are far from simple. SELinux aligns execution contexts for helpers and migration targets, seccomp allows only necessary syscalls, and cgroups with device plugins manage device access scope. Ultimately, KubeVirt's security model confronts the fact that "VMs are also Linux processes" head-on and precisely adjusts the points where those processes meet the kernel.

In the next post, we will follow how these internal states are exposed to operators through VMI status, guest agent, domain stats, metrics, and debugging paths.