Eviction, Drain, Migration Failure Modes: How KubeVirt Handles Failures

Introduction

From an operational perspective, the real difficulty of KubeVirt reveals itself not when launching a VM, but when handling failures. A failed Pod can often be resolved by simply rescheduling it, but a VM requires considering memory state, disks, network sessions, and guest execution context together. That is why drain, eviction, and migration failures are the scenarios that best expose KubeVirt's internal design.

In this post, we examine what KubeVirt considers a failure, and how it standardizes and exposes those failures.

1. KubeVirt Does Not Immediately Kill on Drain

Looking at the EvictionStrategy field in the API types, it defines which strategy to use during a node drain. KubeVirt does not treat drain as a simple Kubernetes eviction event but as an event requiring VM-specific policy decisions.

The reason is clear.

  • If migratable, it is better to move the VM first.
  • If non-migratable, suspension or deferral may be needed.
  • If the VM is owned by an external controller, a different handling model may be required.

In other words, drain in KubeVirt is not "emptying one Pod" but the question "how should this VM be evacuated?"
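The branching above can be sketched as a small decision function. The EvictionStrategy values below mirror the KubeVirt API (None, LiveMigrate, LiveMigrateIfPossible, External), but the decision logic itself is an illustrative simplification, not the actual virt-controller code.

```go
package main

import "fmt"

// EvictionStrategy values mirror the KubeVirt API; the rest is a sketch.
type EvictionStrategy string

const (
	EvictionStrategyNone                  EvictionStrategy = "None"
	EvictionStrategyLiveMigrate           EvictionStrategy = "LiveMigrate"
	EvictionStrategyLiveMigrateIfPossible EvictionStrategy = "LiveMigrateIfPossible"
	EvictionStrategyExternal              EvictionStrategy = "External"
)

// decideDrainAction maps (strategy, migratability) to a coarse drain action.
func decideDrainAction(strategy EvictionStrategy, migratable bool) string {
	switch strategy {
	case EvictionStrategyLiveMigrate:
		if migratable {
			return "migrate"
		}
		return "block-eviction" // drain stalls until the VM is migratable or shut down
	case EvictionStrategyLiveMigrateIfPossible:
		if migratable {
			return "migrate"
		}
		return "evict" // fall back to plain pod eviction
	case EvictionStrategyExternal:
		return "signal-external-controller" // mark EvictionRequested and wait
	default: // None or unset
		return "evict"
	}
}

func main() {
	fmt.Println(decideDrainAction(EvictionStrategyLiveMigrate, false))
}
```

Note how the same drain event fans out into four very different outcomes depending on strategy and migratability.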

2. Operators Must Check EvictionRequested and EvacuationNodeName

VirtualMachineInstanceStatus has EvacuationNodeName, and EvictionRequested is defined among the condition types. This means eviction is not merely left in event logs; it remains as a structured signal in the VMI status.

Operators need to check these values for the following reasons:

  • Whether drain has started
  • Which node is being evacuated from
  • Whether a migration should follow
  • Whether an external controller needs to take follow-up action

In other words, KubeVirt exposes drain in a status-first manner.
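Those status checks can be expressed as a short helper. The struct below is a simplified stand-in for the real types in kubevirt.io/api; the field and condition names follow the post, everything else is illustrative.

```go
package main

import "fmt"

// Simplified stand-in for the VMI status fields discussed above.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

type VMIStatus struct {
	EvacuationNodeName string
	Conditions         []Condition
}

// drainInProgress reports whether eviction has been requested and, if
// so, which node the VMI is being evacuated from.
func drainInProgress(s VMIStatus) (bool, string) {
	for _, c := range s.Conditions {
		if c.Type == "EvictionRequested" && c.Status == "True" {
			return true, s.EvacuationNodeName
		}
	}
	return false, ""
}

func main() {
	s := VMIStatus{
		EvacuationNodeName: "node-a",
		Conditions:         []Condition{{Type: "EvictionRequested", Status: "True"}},
	}
	ok, node := drainInProgress(s)
	fmt.Println(ok, node)
}
```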

3. Non-Migratable Reasons Are Recorded in Advance

The KubeVirt API types define a large number of non-migratable reasons.

  • DisksNotLiveMigratable
  • InterfaceNotLiveMigratable
  • HotplugNotLiveMigratable
  • VirtIOFSNotLiveMigratable
  • HostDeviceNotLiveMigratable
  • SEVNotLiveMigratable
  • SecureExecutionNotLiveMigratable
  • TDXNotLiveMigratable
  • HypervPassthroughNotLiveMigratable
  • PersistentReservationNotLiveMigratable

The strength of this design is that failures are not surfaced only as runtime errors after the fact. KubeVirt tells you in advance, via condition reasons, why this VM cannot be migrated.

Operators can read the approximate limitations from the API status alone, before triggering a migration and waiting for failure logs.
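Reading that pre-judgment signal amounts to scanning conditions for LiveMigratable=False. The reason strings below are the ones listed above; the Condition struct is a simplified stand-in for the real API type.

```go
package main

import "fmt"

// Simplified stand-in for a VMI condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// nonMigratableReason returns the recorded reason when the LiveMigratable
// condition is False, before any migration is even attempted.
func nonMigratableReason(conds []Condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "LiveMigratable" && c.Status == "False" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	conds := []Condition{{Type: "LiveMigratable", Status: "False", Reason: "HostDeviceNotLiveMigratable"}}
	if reason, blocked := nonMigratableReason(conds); blocked {
		fmt.Println("migration blocked:", reason)
	}
}
```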

4. MigrationState Is the Central Axis for Failure Analysis

VirtualMachineInstanceMigrationState contains a wealth of information needed for failure analysis.

  • Source node and source pod
  • Target node and target pod
  • Sync address
  • Direct migration ports
  • Completed status
  • Failed status
  • Abort requested status
  • Abort status
  • Failure reason
  • Current migration mode

This structure reveals that KubeVirt does not view migration as a simple boolean state. Migration is a distributed protocol where source and target change over time, and failures can occur at multiple stages.
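A simplified mirror of that structure shows why it supports triage better than a boolean would. The real VirtualMachineInstanceMigrationState in kubevirt.io/api carries these fields and more; summarize is an illustrative helper, not KubeVirt code.

```go
package main

import "fmt"

// Simplified mirror of VirtualMachineInstanceMigrationState.
type MigrationState struct {
	SourceNode, SourcePod string
	TargetNode, TargetPod string
	Mode                  string // "PreCopy" or "PostCopy"
	Completed             bool
	Failed                bool
	FailureReason         string
	AbortRequested        bool
	AbortStatus           string
}

// summarize collapses the state into a one-line triage verdict.
func summarize(m MigrationState) string {
	switch {
	case m.Failed:
		return fmt.Sprintf("failed (%s -> %s, mode=%s): %s",
			m.SourceNode, m.TargetNode, m.Mode, m.FailureReason)
	case m.AbortRequested && m.AbortStatus != "Succeeded":
		return "abort in progress"
	case m.Completed:
		return "completed"
	default:
		return "running"
	}
}

func main() {
	fmt.Println(summarize(MigrationState{
		SourceNode: "node-a", TargetNode: "node-b",
		Mode: "PreCopy", Failed: true, FailureReason: "target pod terminated",
	}))
}
```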

5. Pre-copy Failures Have Recovery Potential, but Post-copy Is More Dangerous

As seen in the previous post, pre-copy progressively transfers data while the source retains the original memory. In contrast, post-copy starts execution on the target first and retrieves needed pages from the source later.

Therefore, post-copy failures are much more dangerous. In pkg/virt-handler/vm.go, there is formatIrrecoverableErrorMessage, and when a post-copy failure causes the domain to enter a paused state, the message "VMI is irrecoverable due to failed post-copy migration" is generated.

This is a very strong signal. It does not just mean the migration job failed -- it means the running VM state itself may have collapsed into an unrecoverable state.

In other words, post-copy is a powerful tool for moving busy workloads, but the cost of failure is also greater.
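The asymmetry between the two modes can be captured in a check like the following. The message matches the one quoted above; the function itself is a simplification of the logic in pkg/virt-handler/vm.go, reduced to an assumed three-input sketch.

```go
package main

import "fmt"

// irrecoverableMessage flags the combination that virt-handler treats as
// fatal: a failed post-copy migration that leaves the domain paused.
func irrecoverableMessage(mode string, migrationFailed, domainPaused bool) (string, bool) {
	if mode == "PostCopy" && migrationFailed && domainPaused {
		return "VMI is irrecoverable due to failed post-copy migration", true
	}
	// Pre-copy failures leave the source memory intact, so they are retryable.
	return "", false
}

func main() {
	msg, fatal := irrecoverableMessage("PostCopy", true, true)
	fmt.Println(fatal, msg)
}
```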

6. Abort Is Also a State Machine

VirtualMachineInstanceMigrationState has AbortRequested and AbortStatus, and the abort state is separately modeled as Succeeded, Failed, and Aborting.

This design is realistic. Migration abort does not end instantly just by pressing a button.

  • Whether the abort is still possible at this stage
  • Whether the target has already received the handoff
  • How far storage or network side effects have progressed

These factors affect the outcome. KubeVirt treats abort not as a simple API cancellation but as a separate state machine.
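The abort lifecycle can be sketched as a tiny state machine. The status values (Aborting, Succeeded, Failed) mirror the API constants; the single transition rule below is illustrative, standing in for the real checks on handoff progress and side effects.

```go
package main

import "fmt"

// AbortStatus values mirror the KubeVirt API constants.
type AbortStatus string

const (
	AbortInProgress AbortStatus = "Aborting"
	AbortSucceeded  AbortStatus = "Succeeded"
	AbortFailed     AbortStatus = "Failed"
)

// resolveAbort decides the terminal abort status: once the target has
// taken over execution, the migration can no longer be rolled back.
func resolveAbort(handoffDone bool) AbortStatus {
	if handoffDone {
		return AbortFailed // too late: the target already owns the VM
	}
	return AbortSucceeded
}

func main() {
	fmt.Println(resolveAbort(false), resolveAbort(true))
}
```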

7. Migration Failures Occur in Both the Control Plane and Data Plane

Failure causes can be broadly divided into two categories.

Control plane failures

  • Target Pod fails to schedule
  • Cannot satisfy the appropriate node selector
  • Blocked by quota or policy
  • Blocked due to utility volumes or backups

Data plane failures

  • Dirty page rate is too high for pre-copy to converge
  • Source-target synchronization breaks after post-copy transition
  • Migration socket or proxy path issues
  • Domain preparation on the target is delayed

In other words, a single "migration failed" message is insufficient -- you need to distinguish which plane the failure occurred in.
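A first-pass triage of that distinction can be automated by keyword. The matched substrings below are examples chosen for this sketch, not KubeVirt's actual error strings.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyFailure sorts a failure reason into one of the two planes
// described above, defaulting to the data plane when no scheduling or
// policy keyword matches.
func classifyFailure(reason string) string {
	controlPlane := []string{"unschedulable", "quota", "node selector", "policy"}
	for _, s := range controlPlane {
		if strings.Contains(strings.ToLower(reason), s) {
			return "control-plane"
		}
	}
	return "data-plane"
}

func main() {
	fmt.Println(classifyFailure("target pod unschedulable"))
	fmt.Println(classifyFailure("dirty page rate too high, pre-copy did not converge"))
}
```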

8. Secure Features Often Conflict with Migration Flexibility

As already apparent at the API level, features like SEV, Secure Execution, TDX, and host device passthrough frequently conflict with migration constraints.

This is not coincidental. These features generally require one of the following:

  • Strong coupling with specific host hardware
  • Special protection of guest memory
  • Use of device state that is difficult to reproduce externally

In other words, the more you strengthen security or maximize hardware performance, the easier it becomes to conflict with the characteristic of "zero-downtime movement to any node."

9. Multiple Pod Existence During Migration Makes Failure Analysis Harder

The importance of ActivePods becomes clearer in failure modes. During migration, the source and target launcher Pods briefly coexist, making it easy to confuse which Pod is the actual current source when examining logs and status.

When analyzing failures, you should cross-reference at least the following:

  • VMI's activePods
  • Migration CR phase
  • Target pod name
  • Source pod name
  • VMI migrationState

Without cross-referencing this information, it is easy to mistake logs from an already cleaned up Pod as the current problem.
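The cross-referencing step can be sketched like this. In the real API, activePods maps pod UID to node name; here we key by pod name for readability, and the node names are illustrative.

```go
package main

import "fmt"

// currentSourcePod picks, out of the coexisting launcher pods, the one
// running on the migration's source node.
func currentSourcePod(activePods map[string]string, sourceNode string) (string, bool) {
	for pod, node := range activePods {
		if node == sourceNode {
			return pod, true
		}
	}
	return "", false
}

func main() {
	active := map[string]string{
		"virt-launcher-vm1-abc": "node-a", // source
		"virt-launcher-vm1-def": "node-b", // migration target
	}
	pod, ok := currentSourcePod(active, "node-a")
	fmt.Println(ok, pod)
}
```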

10. Drain Strategy Should Vary Based on Workload Characteristics

The same eviction strategy should not be applied to all VMs. For example:

  • Nearly stateless test VMs
  • Performance-sensitive VMs using SR-IOV and host devices
  • General workload VMs with RWX volumes and live migration capability
  • Memory write-intensive VMs where post-copy allowance is important

These have entirely different failure costs and acceptable responses.

In other words, drain strategy should not be an infrastructure default but an operational policy tailored to VM characteristics.
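One way to make that operational policy explicit is a per-class mapping. The class names and strategy assignments below are illustrative assumptions; only the strategy strings mirror the API values.

```go
package main

import "fmt"

// drainPolicy encodes drain strategy per workload class instead of one
// cluster-wide default. Class names here are hypothetical.
var drainPolicy = map[string]string{
	"stateless-test":   "None",        // cheap to lose, just evict
	"sriov-hostdevice": "External",    // an external controller decides
	"general-rwx":      "LiveMigrate", // standard zero-downtime move
	"memory-intensive": "LiveMigrate", // pair with post-copy allowance in a migration policy
}

// strategyFor looks up the class, with an illustrative fallback.
func strategyFor(class string) string {
	if s, ok := drainPolicy[class]; ok {
		return s
	}
	return "LiveMigrateIfPossible"
}

func main() {
	fmt.Println(strategyFor("sriov-hostdevice"))
}
```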

Key Points for Operators

  • Drain is a VM evacuation strategy issue, not Pod removal.
  • You must look at EvictionRequested, EvacuationNodeName, and MigrationState together.
  • Non-migratable reasons are pre-judgment signals, not post-failure logs.
  • Post-copy failures can lead to irrecoverable states and require much greater caution.

Conclusion

KubeVirt does not hide failures but structures them as state machines. Drain is connected to eviction strategies, live migration feasibility is revealed in advance via condition reasons, and actual migration progress and failure reasons accumulate in MigrationState. The fact that post-copy failures are separated as irrecoverable demonstrates that KubeVirt views VM failures differently from simple Pod restart issues.

In the next post, we will wrap up the series by organizing a source code reading map that shows the order in which to read the KubeVirt source code to understand the entire structure most quickly.