VMI Status, Metrics, Guest Agent, Debugging: How KubeVirt Exposes Internal State

Introduction

When operating KubeVirt, the hardest question is: "Is this VM actually alive right now, and at which layer?" The Pod may be Running while the guest has stopped. The guest may be alive while a migration is on the verge of failing. That is why KubeVirt collects state from multiple layers, not just one.

  • Kubernetes object state
  • libvirt domain state
  • Guest internal information reported by the guest agent
  • Network status and migration status
  • Prometheus metrics

This post examines how these observation layers connect.

1. VMI Status Is the Most Important Operational Surface

The VirtualMachineInstanceStatus type in staging/src/kubevirt.io/api/core/v1/types.go carries much of the information operators want to see:

  • phase
  • conditions
  • interfaces
  • guestOSInfo
  • migrationState
  • qosClass
  • activePods
  • selinuxContext
  • memory
  • currentCPUTopology

Reading just this type reveals that KubeVirt does not view state as simply "on or off." VM state is the combined result of Kubernetes phase, guest internal information, migration progress, and network interface status.
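
To make this concrete, here is a minimal sketch of pulling those fields out of a VMI object. The field names follow the API type above; the sample payload is invented for illustration, and real data would come from something like `kubectl get vmi <name> -o json`:

```python
# Sketch: summarize the operationally relevant fields of a VMI status.
# The sample payload is invented; field names mirror VirtualMachineInstanceStatus.

sample_vmi = {
    "status": {
        "phase": "Running",
        "conditions": [{"type": "LiveMigratable", "status": "True"}],
        "interfaces": [{"ipAddress": "10.0.2.2", "name": "default"}],
        "guestOSInfo": {"prettyName": "Fedora Linux 39"},
        "activePods": {"uid-1234": "node-a"},
    }
}

def summarize(vmi: dict) -> dict:
    """Pull out the fields an operator usually checks first."""
    status = vmi.get("status", {})
    return {
        "phase": status.get("phase"),
        "conditions": {c["type"]: c["status"] for c in status.get("conditions", [])},
        "nodes": sorted(set(status.get("activePods", {}).values())),
        "guest_os": status.get("guestOSInfo", {}).get("prettyName"),
    }

print(summarize(sample_vmi))
```

Even this toy summary shows why a single "on/off" field would be lossy: four of the five answers come from four different sources.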

2. Phase Alone Is Insufficient -- Conditions Must Be Checked Together

phase summarizes the high-level flow. Values like Pending, Scheduling, Scheduled, Running, Succeeded, Failed, and Unknown show the general direction.

But actual operational decisions come from conditions, reason, and message. The API types predefine conditions and reasons such as:

  • LiveMigratable
  • StorageLiveMigratable
  • MigrationRequired
  • EvictionRequested
  • DataVolumesReady
  • DisksNotLiveMigratable
  • InterfaceNotLiveMigratable
  • HostDeviceNotLiveMigratable
  • SEVNotLiveMigratable
  • SecureExecutionNotLiveMigratable

KubeVirt does not just say "it cannot be done" -- it standardizes, in the type system, the reasons why live migration is not possible.
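
A quick sketch of how an operator (or a script) can surface that standardized reason. The condition type is from the API; the sample status below is invented:

```python
# Sketch: decide whether a VMI is live-migratable and, if not, report why,
# using the LiveMigratable condition plus its reason/message fields.

def migratability(status: dict) -> tuple[bool, str]:
    """Return (migratable, reason) from the LiveMigratable condition."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "LiveMigratable":
            if cond.get("status") == "True":
                return True, ""
            # KubeVirt standardizes the reason, e.g. DisksNotLiveMigratable
            return False, f'{cond.get("reason", "?")}: {cond.get("message", "")}'
    return False, "LiveMigratable condition not reported"

blocked = {
    "conditions": [{
        "type": "LiveMigratable",
        "status": "False",
        "reason": "DisksNotLiveMigratable",
        "message": "cannot migrate VMI with non-shared PVCs",
    }]
}
ok, why = migratability(blocked)
print(ok, why)
```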

3. activePods Is Especially Important During Migration

VirtualMachineInstanceStatus.ActivePods is a mapping of pod UIDs to node names. As noted in the comments, during migration, multiple Pods can be associated with a single VMI simultaneously.

This field is important for reading "which virt-launcher Pod is currently the source and which is the target." In practice, migration timing confusion almost always starts here: from the control plane's perspective, what you thought was a single VM has a brief window in which both the source and target launchers exist at once.

In other words, activePods is a hidden key field in migration debugging.
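
Combining activePods with migrationState makes the source/target split readable. A sketch with invented sample data (field names follow the API type):

```python
# Sketch: classify the Pods in activePods as migration source or target by
# matching their node against migrationState's sourceNode/targetNode.

status = {
    "activePods": {"uid-src": "node-a", "uid-tgt": "node-b"},
    "migrationState": {"sourceNode": "node-a", "targetNode": "node-b"},
}

def classify_pods(status: dict) -> dict:
    mig = status.get("migrationState") or {}
    roles = {}
    for uid, node in status.get("activePods", {}).items():
        if node == mig.get("targetNode"):
            roles[uid] = "target"
        elif node == mig.get("sourceNode"):
            roles[uid] = "source"
        else:
            roles[uid] = "unknown"
    return roles

print(classify_pods(status))
```

Outside a migration window, migrationState is typically absent and every active Pod classifies as "unknown" here, which is itself a useful signal that no migration is in flight.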

4. Network Status Combines Pod Annotation and Guest Information

Looking at pkg/network/controllers/vmi.go, VMI status interfaces do not come from just one source.

  • Pod Multus network status is read for pod interface names
  • Primary and secondary interfaces are calculated
  • Existing status entries not in the spec are also preserved

The API type VirtualMachineInstanceNetworkInterface contains:

  • Guest IP
  • MAC
  • Network name
  • Pod interface name
  • VM internal interface name
  • Info source

In particular, infoSource distinguishes whether information came from the guest-agent, domain, or multus-status. Thanks to this design, operators can determine "whether this IP is a value reported from inside the guest or a value reported by CNI."
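
infoSource is reported as a comma-separated string, so a small grouping helper makes the provenance visible at a glance. The sample interface list below is invented:

```python
# Sketch: group VMI interface entries by infoSource so a guest-agent-reported
# IP can be told apart from a CNI (multus-status) reported one.

interfaces = [
    {"name": "default", "ipAddress": "10.0.2.2", "infoSource": "domain, guest-agent"},
    {"name": "net1", "ipAddress": "192.168.1.5", "infoSource": "multus-status"},
]

def by_source(interfaces: list) -> dict:
    grouped: dict = {}
    for iface in interfaces:
        # infoSource is a comma-separated list of sources for this entry
        for src in iface.get("infoSource", "").split(", "):
            grouped.setdefault(src, []).append(iface["name"])
    return grouped

print(by_source(interfaces))
```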

5. Guest Agent Is the Window into Guest Internal Information

Looking at the DomainManager interface in pkg/virt-launcher/virtwrap/manager.go, there are quite a few guest-related methods.

  • GetGuestInfo
  • GetUsers
  • GetFilesystems
  • GetGuestOSInfo
  • GuestPing

This is an important signal. KubeVirt considers libvirt and QEMU level state alone insufficient, and separately collects information from inside the OS via the guest agent.

pkg/virt-handler/rest/lifecycle.go receives this data through the launcher client and exposes it as API responses. In other words, the guest information operators see ultimately passes through:

  • virt-handler REST endpoint
  • Launcher client RPC
  • virt-launcher internal domain manager
  • QEMU guest agent
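
From the operator side, the end of that chain surfaces as VMI subresources. As an illustration, and assuming the subresources API group is subresources.kubevirt.io/v1 (the group virtctl talks to), the request path for guest OS info can be built like this; actually issuing the request needs a configured Kubernetes client:

```python
# Sketch: build the (assumed) subresource path through which guest OS info
# is served. Hypothetical helper for illustration only.

def guestosinfo_path(namespace: str, vmi: str) -> str:
    return (
        f"/apis/subresources.kubevirt.io/v1"
        f"/namespaces/{namespace}/virtualmachineinstances/{vmi}/guestosinfo"
    )

print(guestosinfo_path("default", "my-vm"))
```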

6. Domain Stats Is the Middle Layer Between Host and Guest Observability

The same DomainManager interface also has GetDomainStats and GetDomainDirtyRateStats. This means it pulls domain-level statistics reported by libvirt separately from the guest agent.

This layer provides a lot of information visible even when the guest agent inside the guest does not respond.

  • CPU usage
  • Memory state
  • Block I/O
  • Network traffic
  • Dirty page rate

In short, the guest agent tells you what things mean inside the guest, while domain stats tell you the execution facts the hypervisor observes. They are complementary, not competing.
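
That complementarity can be encoded directly into a health verdict. A minimal sketch, where the two boolean inputs stand in for "libvirt reports the domain running" and "the guest agent answers a ping":

```python
# Sketch: combine hypervisor-level and guest-level signals into one verdict.
# Neither signal alone is sufficient; the combination is what's actionable.

def vm_health(domain_running: bool, agent_responding: bool) -> str:
    if domain_running and agent_responding:
        return "healthy: hypervisor and guest both report"
    if domain_running:
        return "degraded: domain runs, guest agent silent (check agent/guest OS)"
    return "down: domain not running"

print(vm_health(domain_running=True, agent_responding=False))
```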

7. Prometheus Metrics Largely Come from virt-handler

Looking at pkg/monitoring/metrics/virt-handler/domainstats, there are collectors that convert domain statistics like CPU, memory, block, and vcpu into Prometheus metrics.

This structure is quite practical.

  • The closest point to the actual VM process is the node.
  • Collecting domain stats is easiest from the node.
  • So metrics export is also attached close to virt-handler.

Put differently, in KubeVirt's observability model it is the node-local agent, not the central controller, that collects most of the execution facts.
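
What virt-handler ultimately serves is Prometheus text exposition. A sketch of filtering the VMI-scoped series out of a scrape; the metric name in the sample is illustrative, not a complete or authoritative list:

```python
# Sketch: pick kubevirt_vmi_* series out of a Prometheus text exposition.
# SAMPLE_EXPOSITION is invented; real data comes from scraping virt-handler.

SAMPLE_EXPOSITION = """\
# HELP kubevirt_vmi_cpu_usage_seconds_total Total CPU time used by the VMI.
kubevirt_vmi_cpu_usage_seconds_total{name="my-vm",namespace="default"} 1234.5
process_cpu_seconds_total 99.0
"""

def vmi_metrics(text: str) -> dict:
    metrics = {}
    for line in text.splitlines():
        # Skip comments and non-VMI series; keep "name{labels}" -> value
        if line.startswith("kubevirt_vmi_"):
            name_labels, value = line.rsplit(" ", 1)
            metrics[name_labels] = float(value)
    return metrics

print(vmi_metrics(SAMPLE_EXPOSITION))
```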

8. What Is Reduced When Guest Agent Is Absent

The VM does not fail to start without a guest agent. But the meaningful information available to operators is significantly reduced.

  • Guest internal user list
  • Filesystem list
  • OS pretty name
  • Interface names and some guest IP information

In other words, the guest agent is not a required boot dependency but an extension layer that enriches operational visibility and automation.

Therefore, when the Pod looks normal but VM internals are not visible, suspect guest agent installation and connectivity first.
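
The fastest programmatic check is the AgentConnected condition on the VMI (a condition type KubeVirt reports when the guest agent is reachable). A sketch with an invented sample status:

```python
# Sketch: check whether the guest agent is connected before debugging
# "missing" guest information any further.

def agent_connected(status: dict) -> bool:
    return any(
        c.get("type") == "AgentConnected" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )

no_agent = {"conditions": [{"type": "Ready", "status": "True"}]}
print(agent_connected(no_agent))  # False: install/start qemu-guest-agent
```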

9. Debugging Must Separate Control Plane, Node, and Guest

The most common mistake when looking at KubeVirt problems is mixing layers. It is better to split them as follows.

What to look at in the control plane

  • VMI phase
  • conditions
  • migrationState
  • activePods
  • Events and migration CR status

What to look at on the node

  • virt-handler logs
  • virt-launcher logs
  • libvirt domain state
  • Domain stats
  • Pod network and TAP state

What to look at in the guest

  • QEMU guest agent response status
  • Guest OS info
  • Users
  • Filesystems
  • Actual service health

In other words, KubeVirt debugging is ultimately the work of distinguishing "which layer's truth am I looking at?"

10. Status Does Not Always Immediately Reflect Reality

As explicitly noted in the API type comments, VirtualMachineInstanceStatus can lag behind the actual system state. This is a very important operational point.

Because status is updated through informers, controllers, launcher, libvirt, and guest agent, in very brief moments:

  • The Pod may have already changed but status is delayed
  • The migration target is up but phase still has the old value
  • The guest agent is dead but the domain shows Running

In other words, KubeVirt does not offer strong consistency across these layers; sound judgment requires combining multiple observation surfaces.

Key Points for Operators

  • phase alone is insufficient. conditions, reason, and migrationState must be checked together.
  • activePods is important for reading source and target Pods during migration.
  • Network status is the combined result of Multus, domain, and guest-agent information.
  • Guest agent and domain stats are not substitutes but complements.

Conclusion

KubeVirt's observability is built by combining information from multiple layers, not a single state value. VMI status shows the current state from a Kubernetes resource perspective, the guest agent reveals meaning inside the guest, and domain stats with Prometheus metrics allow observing the actual execution data plane. Therefore, operating KubeVirt is less about asking "is the VM up" and more about distinguishing "which signal broke at which layer."

In the next post, we will use this observation model to organize actual failure modes such as drain, eviction, migration failure, and non-migratable conditions.