VMI Status, Metrics, Guest Agent, Debugging: How KubeVirt Exposes Internal State

Introduction
1. VMI Status Is the Most Important Operational Surface
2. Phase Alone Is Insufficient -- Conditions Must Be Checked Together
3. activePods Is Especially Important During Migration
4. Network Status Combines Pod Annotation and Guest Information
5. Guest Agent Is the Window into Guest Internal Information
6. Domain Stats Is the Middle Layer Between Host and Guest Observability
7. Prometheus Metrics Largely Come from virt-handler
8. What Is Reduced When Guest Agent Is Absent
9. Debugging Must Separate Control Plane, Node, and Guest
10. Status Does Not Always Immediately Reflect Reality
Key Points for Operators
Conclusion

Introduction

The hardest question when operating KubeVirt is this: "At which layers is this VM actually alive right now?" The Pod may be Running while the guest has hung. The guest may be healthy while a migration is about to fail. That is why KubeVirt collects state from several layers rather than one:

Kubernetes object state
libvirt domain state
Guest internal information reported by the guest agent
Network status and migration status
Prometheus metrics

This post examines how these observation layers connect.

1. VMI Status Is the Most Important Operational Surface

The VirtualMachineInstanceStatus type in staging/src/kubevirt.io/api/core/v1/types.go carries much of what operators want to see, including:

phase
conditions
interfaces
guestOSInfo
migrationState
qosClass
activePods
selinuxContext
memory
currentCPUTopology

Reading just this type reveals that KubeVirt does not view state as simply "on or off." VM state is the combined result of Kubernetes phase, guest internal information, migration progress, and network interface status.
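As a rough illustration of reading these fields together, the sketch below summarizes a VMI status from JSON shaped like the output of `kubectl get vmi <name> -o json`. The sample document is hand-written with made-up values, and `summarize_status` is a hypothetical helper, not part of KubeVirt.

```python
import json

# Hand-written sample shaped like a VMI object; all values are made up.
SAMPLE_VMI = json.loads("""
{
  "status": {
    "phase": "Running",
    "nodeName": "node-a",
    "conditions": [{"type": "Ready", "status": "True"}],
    "interfaces": [{"name": "default", "ipAddress": "10.0.0.5"}],
    "guestOSInfo": {"prettyName": "Fedora Linux 39"},
    "activePods": {"uid-1": "node-a"}
  }
}
""")


def summarize_status(vmi: dict) -> dict:
    """Collect the status fields an operator usually checks first."""
    status = vmi.get("status", {})
    migration = status.get("migrationState")
    return {
        "phase": status.get("phase", "Unknown"),
        "conditions": {c["type"]: c["status"] for c in status.get("conditions", [])},
        "ips": [i["ipAddress"] for i in status.get("interfaces", []) if "ipAddress" in i],
        "guest_os": status.get("guestOSInfo", {}).get("prettyName"),
        # A migrationState that exists and is not completed means "in flight".
        "migrating": bool(migration) and not migration.get("completed", False),
        "active_pods": status.get("activePods", {}),
    }


print(summarize_status(SAMPLE_VMI))
```

The point of the helper is that no single key answers "is the VM fine?"; the summary keeps phase, conditions, guest data, and migration data side by side.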

2. Phase Alone Is Insufficient -- Conditions Must Be Checked Together

phase summarizes the high-level flow. Values like Pending, Scheduling, Scheduled, Running, Succeeded, Failed, and Unknown show the general direction.

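But actual operational decisions come from conditions and their reason and message fields. The API types predefine names for both. In the list below, the first five are condition types and the last five are reasons that explain why a VMI is not live-migratable: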

LiveMigratable
StorageLiveMigratable
MigrationRequired
EvictionRequested
DataVolumesReady
DisksNotLiveMigratable
InterfaceNotLiveMigratable
HostDeviceNotLiveMigratable
SEVNotLiveMigratable
SecureExecutionNotLiveMigratable

KubeVirt does not just say "it cannot be done." The reasons why live migration is not possible are standardized as named constants in the API types.
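A minimal sketch of using those standardized names: the function below looks for a LiveMigratable condition whose status is "False" and returns its reason and message. The helper name `migration_blocker` and the sample message text are illustrative assumptions; only the condition type and the reason name come from the lists above.

```python
from typing import Optional


def migration_blocker(vmi: dict) -> Optional[str]:
    """Return 'reason: message' when LiveMigratable is False, else None."""
    for cond in vmi.get("status", {}).get("conditions", []):
        if cond.get("type") == "LiveMigratable" and cond.get("status") == "False":
            return f'{cond.get("reason", "")}: {cond.get("message", "")}'
    return None


# Hand-written sample; the message is a placeholder, not real KubeVirt output.
vmi = {
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "Ready", "status": "True"},
            {
                "type": "LiveMigratable",
                "status": "False",
                "reason": "DisksNotLiveMigratable",
                "message": "example message",
            },
        ],
    }
}

print(migration_blocker(vmi))
```

Note that the phase in the sample is still Running: the blocker is only visible through the condition, which is exactly why phase alone is not enough.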

3. activePods Is Especially Important During Migration

VirtualMachineInstanceStatus.ActivePods is a mapping of pod UIDs to node names. As noted in the comments, during migration, multiple Pods can be associated with a single VMI simultaneously.

This field helps answer "which virt-launcher Pods currently belong to this VMI, and on which nodes?" On its own it does not say which Pod is the source and which is the target, so read it together with migrationState, which records the source and target nodes. In practice, most confusion about migration timing starts here: what looks like a single VM has, from the control plane's perspective, a brief window in which both the source and target launchers exist.

In other words, activePods is an easily overlooked but key field in migration debugging.
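One way to read the two fields together is sketched below: each entry in activePods is labeled by comparing its node name with the sourceNode and targetNode recorded in migrationState. The helper `label_active_pods` is hypothetical, and the sketch assumes the source and target nodes differ.

```python
def label_active_pods(status: dict) -> dict:
    """Map each active pod UID to 'source', 'target', or 'unknown' by node name."""
    migration = status.get("migrationState") or {}
    roles = {}
    for uid, node in status.get("activePods", {}).items():
        if node == migration.get("sourceNode"):
            roles[uid] = "source"
        elif node == migration.get("targetNode"):
            roles[uid] = "target"
        else:
            # No migration recorded, or the pod sits on an unrelated node.
            roles[uid] = "unknown"
    return roles


# Hand-written status with two launcher pods alive at the same time.
status = {
    "activePods": {"uid-old": "node-a", "uid-new": "node-b"},
    "migrationState": {"sourceNode": "node-a", "targetNode": "node-b"},
}

print(label_active_pods(status))
```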

4. Network Status Combines Pod Annotation and Guest Information

As pkg/network/controllers/vmi.go shows, the interfaces list in VMI status is assembled from more than one source:

The Multus network status on the Pod is read to obtain pod interface names
Primary and secondary interfaces are worked out
Existing status entries that are not in the spec are preserved

The API type VirtualMachineInstanceNetworkInterface contains:

Guest IP
MAC
Network name
Pod interface name
VM internal interface name
Info source

In particular, infoSource distinguishes whether information came from the guest-agent, domain, or multus-status. Thanks to this design, operators can determine "whether this IP is a value reported from inside the guest or a value reported by CNI."
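The sketch below groups reported IP addresses by origin so that guest-reported and Multus-reported values can be compared. It assumes the JSON field names infoSource and ipAddresses, and that infoSource may list several origins separated by commas; `ips_by_source` is a hypothetical helper and the sample data is made up.

```python
def ips_by_source(interfaces: list) -> dict:
    """Group reported IPs under every origin named in each interface's infoSource."""
    grouped = {}
    for iface in interfaces:
        raw = iface.get("infoSource", "")
        sources = [s.strip() for s in raw.split(",") if s.strip()]
        # Interfaces without an infoSource are kept under an explicit bucket.
        for source in sources or ["unreported"]:
            grouped.setdefault(source, []).extend(iface.get("ipAddresses", []))
    return grouped


interfaces = [
    {"name": "default", "infoSource": "domain, guest-agent", "ipAddresses": ["10.0.0.5"]},
    {"name": "secondary", "infoSource": "multus-status", "ipAddresses": []},
]

print(ips_by_source(interfaces))
```

In the sample, the secondary interface is known only from Multus and has no guest-reported address, which is the kind of gap this grouping makes visible.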

5. Guest Agent Is the Window into Guest Internal Information

The DomainManager interface in pkg/virt-launcher/virtwrap/manager.go has quite a few guest-related methods:

GetGuestInfo
GetUsers
GetFilesystems
GetGuestOSInfo
GuestPing

This is an important signal. KubeVirt considers libvirt and QEMU level state alone insufficient, and separately collects information from inside the OS via the guest agent.

pkg/virt-handler/rest/lifecycle.go receives this data through the launcher client and exposes it as API responses. In other words, the guest information operators see has passed through the following chain, listed from the API side down to the guest:

virt-handler REST endpoint
Launcher client RPC
virt-launcher internal domain manager
QEMU guest agent
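Operators typically reach this chain through the Kubernetes API rather than by calling virt-handler directly. The sketch below only builds request paths and makes no network calls. It assumes the aggregated API group subresources.kubevirt.io/v1 and the subresource names guestosinfo, userlist, and filesystemlist; both are assumptions that should be checked against the API reference for the KubeVirt version in use. The helper name `guest_info_path` is hypothetical.

```python
# Assumed subresource names; verify against your KubeVirt version.
GUEST_SUBRESOURCES = ("guestosinfo", "userlist", "filesystemlist")


def guest_info_path(namespace: str, name: str, subresource: str) -> str:
    """Build the (assumed) API path for a guest-agent-backed VMI subresource."""
    if subresource not in GUEST_SUBRESOURCES:
        raise ValueError(f"unsupported subresource: {subresource}")
    return (
        f"/apis/subresources.kubevirt.io/v1/namespaces/{namespace}"
        f"/virtualmachineinstances/{name}/{subresource}"
    )


for sub in GUEST_SUBRESOURCES:
    print(guest_info_path("default", "vm-a", sub))
```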

6. Domain Stats Is the Middle Layer Between Host and Guest Observability

The same DomainManager interface also has GetDomainStats and GetDomainDirtyRateStats. These pull domain-level statistics reported by libvirt, independently of the guest agent.

Much of this layer stays visible even when the agent inside the guest is not responding:

CPU usage
Memory state
Block I/O
Network traffic
Dirty page rate

In other words, the guest agent reports what is happening inside the guest, while domain stats report the execution facts the hypervisor observes. The two do not compete; they complement each other.
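Domain-level counters are usually cumulative, so a usable figure comes from the difference between two samples. As a worked example, the sketch below turns two cumulative CPU-time readings into a utilization ratio. It assumes CPU time is reported in nanoseconds; the function and parameter names are hypothetical and do not mirror KubeVirt's own structs.

```python
def cpu_utilization(prev_cpu_ns: int, curr_cpu_ns: int, interval_s: float, vcpus: int) -> float:
    """Fraction of available vCPU time used between two cumulative samples."""
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    used_seconds = (curr_cpu_ns - prev_cpu_ns) / 1e9
    return used_seconds / (interval_s * vcpus)


# 3 s of CPU time consumed over a 10 s window on a 2-vCPU guest:
# 3 / (10 * 2) = 0.15
print(cpu_utilization(10_000_000_000, 13_000_000_000, 10.0, 2))
```

None of this needs the guest agent: the hypervisor can account for CPU time even when the guest OS is not answering.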

7. Prometheus Metrics Largely Come from virt-handler

pkg/monitoring/metrics/virt-handler/domainstats contains collectors that convert domain statistics such as CPU, memory, block, and vCPU into Prometheus metrics.

This structure is quite practical.

The closest point to the actual VM process is the node.
Collecting domain stats is easiest from the node.
So metrics export also lives close to virt-handler.

In other words, KubeVirt's observability gathers more execution facts in the node-local agent than in the central controllers.
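To make the output side concrete, here is a small sketch that parses lines in the Prometheus text exposition format, as a scrape of a metrics endpoint would return them. The sample lines, including the metric names and label sets, are illustrative and hand-written; check the names against the KubeVirt version actually deployed. The parser is deliberately simple: it skips lines with a trailing timestamp and does not handle escaped quotes in label values.

```python
import re

# Hand-written sample in Prometheus text format; names and values are illustrative.
SAMPLE = "\n".join([
    "# TYPE kubevirt_vmi_memory_available_bytes gauge",
    'kubevirt_vmi_memory_available_bytes{name="vm-a",namespace="default"} 2.0e+09',
    'kubevirt_vmi_network_receive_bytes_total{name="vm-a",interface="default"} 1024',
])

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list:
    """Return (metric_name, labels, value) tuples for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```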

8. What Is Reduced When Guest Agent Is Absent

A VM still boots without a guest agent. But the meaningful information available to operators shrinks considerably:

Guest internal user list
Filesystem list
OS pretty name
Interface names and some guest IP information

In other words, the guest agent is not a boot dependency but an extension layer that enriches operational visibility and automation.

So when "the Pod looks healthy but nothing inside the VM is visible," check whether the guest agent is installed and connected before anything else.
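A quick first check can be scripted against the VMI status. The sketch below assumes that KubeVirt reports agent connectivity through a condition of type AgentConnected; the helper name `agent_connected` and the sample objects are hypothetical.

```python
def agent_connected(vmi: dict) -> bool:
    """True only if an AgentConnected condition is present with status 'True'."""
    for cond in vmi.get("status", {}).get("conditions", []):
        if cond.get("type") == "AgentConnected":
            return cond.get("status") == "True"
    # No such condition at all is treated as "not connected".
    return False


with_agent = {"status": {"phase": "Running", "conditions": [
    {"type": "Ready", "status": "True"},
    {"type": "AgentConnected", "status": "True"},
]}}
without_agent = {"status": {"phase": "Running", "conditions": [
    {"type": "Ready", "status": "True"},
]}}

print(agent_connected(with_agent), agent_connected(without_agent))
```

Both samples are Running and Ready; only the second would come back with empty user, filesystem, and OS details.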

9. Debugging Must Separate Control Plane, Node, and Guest

The most common mistake when looking at KubeVirt problems is mixing layers. It is better to split them as follows.

What to look at in the control plane

VMI phase
conditions
migrationState
activePods
Events and migration CR status

What to look at on the node

virt-handler logs
virt-launcher logs
libvirt domain state
Domain stats
Pod network and TAP state

What to look at in the guest

QEMU guest agent response status
Guest OS info
Users
Filesystems
Actual service health

In other words, KubeVirt debugging comes down to knowing which layer's truth you are looking at.
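The layer split above can be turned into a simple triage order. The sketch below is purely illustrative: the signal names are hypothetical booleans an operator would fill in from the checks listed for each layer, and walking the layers from control plane to guest is an assumed ordering, not something KubeVirt prescribes.

```python
# Hypothetical signal names, grouped by the layer they belong to.
LAYER_CHECKS = [
    ("control plane", ["phase_running", "ready_condition"]),
    ("node", ["launcher_pod_running", "domain_running"]),
    ("guest", ["agent_connected", "service_healthy"]),
]


def first_suspect_layer(signals: dict) -> str:
    """Return the first layer with a failing or missing signal."""
    for layer, keys in LAYER_CHECKS:
        failed = [key for key in keys if not signals.get(key, False)]
        if failed:
            return f"{layer}: {', '.join(failed)}"
    return "no failing signal"


signals = {
    "phase_running": True,
    "ready_condition": True,
    "launcher_pod_running": True,
    "domain_running": True,
    "agent_connected": False,
    "service_healthy": False,
}

print(first_suspect_layer(signals))
```

Keeping the signals labeled by layer is the useful part; the exact checks will differ per environment.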

10. Status Does Not Always Immediately Reflect Reality

As the API type comment states explicitly, VirtualMachineInstanceStatus may trail the actual state of the system. This is a very important operational point.

Status is updated by way of informers, controllers, the launcher, libvirt, and the guest agent, so for brief moments:

The Pod may have already changed but status is delayed
The migration target is up but phase still has the old value
The guest agent is dead but the domain shows Running

In other words, KubeVirt does not offer a single strongly consistent view; operators have to combine several observation surfaces to reach a judgment.
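Because status can trail reality, automation should poll until the expected state appears instead of trusting a single read. The helper below is a generic sketch: `wait_for_status`, its parameters, and the fake status sequence are all hypothetical, and `fetch` stands in for whatever call retrieves the VMI status in a given environment.

```python
import time
from typing import Callable


def wait_for_status(
    fetch: Callable[[], dict],
    predicate: Callable[[dict], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch() until predicate(status) holds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch()
        if predicate(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("status did not converge before the deadline")
        sleep(interval_s)


# Fake status source that lags before reporting Running.
observed = iter([{"phase": "Scheduling"}, {"phase": "Scheduled"}, {"phase": "Running"}])

result = wait_for_status(
    fetch=lambda: next(observed),
    predicate=lambda s: s.get("phase") == "Running",
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(result)
```

The sleep function is injectable so the same helper can be exercised without real delays.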

Key Points for Operators

phase alone is insufficient. conditions, reason, and migrationState must be checked together.
activePods is important for reading source and target Pods during migration.
Network status is the combined result of Multus, domain, and guest-agent information.
Guest agent and domain stats are not substitutes but complements.

Conclusion

KubeVirt's observability is built by combining information from multiple layers rather than from a single state value. VMI status shows the current state from the Kubernetes resource perspective, the guest agent reveals what is happening inside the guest, and domain stats together with Prometheus metrics expose the actual execution data plane. Operating KubeVirt is therefore less about asking "is the VM up?" and more about working out which signal broke at which layer.

In the next post, we will use this observation model to organize actual failure modes such as drain, eviction, migration failure, and non-migratable conditions.