Chaos and Order


Introduction

An Operator is not a component you deploy once and forget. Its CRD schema evolves, its controller logic changes constantly through bug fixes and new features, and the number of workloads it manages grows from hundreds to thousands. The catch is that all of this change happens "on top of production that is already running." A misconfigured upgrade of a database Operator can shake hundreds of DB instances at once, and a botched CRD storage version change can leave you unable to read existing CRs at all.

This article assumes the 2026 Kubebuilder ecosystem (Kubernetes 1.36 / Go 1.26 support, controller-runtime v0.24.x, controller-tools v0.21.x) and covers practical patterns for upgrading and migrating an Operator with near-zero downtime. It goes beyond simply re-running `kubectl apply`: it weaves together rolling the controller itself, evolving the CRD schema, phasing the transition of managed workloads, preserving the ability to roll back, and responding when something goes wrong.

Here is the scope of this article up front.

1. Upgrading the Operator itself - Rolling the controller Deployment, leader election, graceful shutdown

2. CRD schema evolution - Multi-version CRDs, conversion webhook, storage version

3. Safe managed workload rollout - Phased/partitioned rollout, observedGeneration, status conditions

4. Rollback strategy - Downgrade pitfalls, irreversible conversion, field loss

5. Multi-version operation - Serving multiple API versions, deprecation policy

6. Canary Operator - Deploying a new version to a subset of namespaces

7. Version compatibility - K8s 1.36 / Go 1.26, controller-runtime v0.24.x, version skew

8. Large-scale CR migration - One-time batch job, storage version migrator, rate limit

9. Incident response - When something breaks mid-rollout

10. Checklist - Pre/post deployment items

Why Operator Upgrades Differ from Ordinary App Upgrades

For a typical stateless web application, a rolling update of the Deployment is all you need. Old Pods are replaced by new ones, traffic flows to the new version, and you are done. Operators are different.

First, an Operator **keeps its state externally.** The real state lives in the CRs (Custom Resources) stored in etcd and in the actual workloads they manage. The controller Pod itself is nearly stateless, but the data the controller interprets (the CRD schema) changes.

Second, an Operator **reconciles continuously.** When a new controller version comes up, it begins re-reconciling every existing CR. If the logic changed, every workload can be affected simultaneously.

Third, **a CRD is a cluster-wide resource.** Changing a CRD affects every namespace and every CR that uses it. It is not isolated per namespace.

The table below summarizes the difference.

| Aspect | Stateless app | Operator |
| --- | --- | --- |
| State location | External DB | CRs in etcd + managed workloads |
| Upgrade unit | Deployment roll | Controller + CRD + managed workloads |
| Schema-change blast radius | Confined to the app | Cluster-wide CRD |
| Concurrency | Many replicas in parallel | Only one leader reconciles |
| Rollback difficulty | Revert the image | Storage version, data loss to consider |

Because of these differences, an Operator upgrade must roll three axes — "controller," "schema," and "managed workloads" — separately yet consistently.

1. Rolling the Controller Itself Safely

Leader election: only one active reconciler

Even with multiple replicas, two controllers reconciling the same CR at once cause conflicts. controller-runtime provides leader election out of the box. Run two or three replicas for HA, but only one leader actually reconciles while the others stand by.

```go
package main

import (
	"crypto/tls"
	"flag"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

// scheme holds the built-in types; register the operator's own API types here too.
var scheme = runtime.NewScheme()

func init() {
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
}

// durationPtr returns a pointer to d; the lease options take *time.Duration.
func durationPtr(d time.Duration) *time.Duration { return &d }

func main() {
	var enableLeaderElection bool
	var probeAddr string
	flag.BoolVar(&enableLeaderElection, "leader-elect", true,
		"Enable leader election for controller manager.")
	flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081",
		"The address the probe endpoint binds to.")
	flag.Parse()

	ctrl.SetLogger(zap.New(zap.UseDevMode(false)))
	setupLog := ctrl.Log.WithName("setup")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme: scheme,
		Metrics: metricsserver.Options{
			BindAddress:   ":8443",
			SecureServing: true,
			// 2026: kube-rbac-proxy removed. AuthN/AuthZ built into the manager.
			FilterProvider: filters.WithAuthenticationAndAuthorization,
			TLSOpts: []func(*tls.Config){
				func(c *tls.Config) { c.MinVersion = tls.VersionTLS13 },
			},
		},
		HealthProbeBindAddress: probeAddr,
		LeaderElection:         enableLeaderElection,
		LeaderElectionID:       "db-operator.example.com",
		// Tune the lease so a standby takes over quickly if the leader dies.
		LeaseDuration: durationPtr(15 * time.Second),
		RenewDeadline: durationPtr(10 * time.Second),
		RetryPeriod:   durationPtr(2 * time.Second),
		// Release leadership immediately on shutdown to reduce downtime.
		LeaderElectionReleaseOnCancel: true,
	})
	if err != nil {
		setupLog.Error(err, "unable to create manager")
		os.Exit(1)
	}

	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		setupLog.Error(err, "unable to set up health check")
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		setupLog.Error(err, "unable to set up ready check")
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "manager exited with error")
		os.Exit(1)
	}
}
```

The key here is `LeaderElectionReleaseOnCancel: true`. With this option on, the controller releases the leader lease immediately when it receives SIGTERM and shuts down. The standby replica then becomes leader without waiting for `LeaseDuration` and resumes reconciliation. This single line shrinks the reconcile gap during a rolling update from tens of seconds to a few.
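As a rough sanity check on the lease numbers above, the following sketch models the worst-case takeover delay with and without release-on-cancel. It is a simplified model, not the library's algorithm: it assumes the standby retries every `RetryPeriod` and that an unreleased lease must expire before anyone else can take it. `validateLease` and `worstCaseTakeover` are hypothetical helper names, and the ordering rule (LeaseDuration > RenewDeadline > 1.2 x RetryPeriod) follows the validation client-go's leader election applies.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor mirrors the 1.2 multiplier client-go applies to RetryPeriod.
const jitterFactor = 1.2

// validateLease checks LeaseDuration > RenewDeadline > jitterFactor*RetryPeriod.
func validateLease(lease, renew, retry time.Duration) error {
	if lease <= renew {
		return errors.New("LeaseDuration must be greater than RenewDeadline")
	}
	if renew <= time.Duration(jitterFactor*float64(retry)) {
		return errors.New("RenewDeadline must be greater than 1.2 x RetryPeriod")
	}
	return nil
}

// worstCaseTakeover is a simplified model: if the old leader released the
// lease, the standby wins on its next retry; otherwise it first has to wait
// for the lease to expire.
func worstCaseTakeover(lease, retry time.Duration, released bool) time.Duration {
	if released {
		return retry
	}
	return lease + retry
}

func main() {
	lease, renew, retry := 15*time.Second, 10*time.Second, 2*time.Second
	if err := validateLease(lease, renew, retry); err != nil {
		fmt.Println("invalid lease settings:", err)
		return
	}
	fmt.Println("released:    ", worstCaseTakeover(lease, retry, true))
	fmt.Println("not released:", worstCaseTakeover(lease, retry, false))
}
```

With the values from the manager options, the model puts the gap at roughly one retry period when the lease is released, versus the full lease duration plus one retry when it is not.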

Deployment rolling strategy

Because only one active leader works, piling on replicas does not increase throughput. Configure them for availability instead.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: db-operator-system
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: manager
          image: registry.example.com/db-operator:v1.4.0
          args:
            - --leader-elect=true
            - --health-probe-bind-address=:8081
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 512Mi
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
```

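Using `maxUnavailable: 0` together with `maxSurge: 1` ensures the old Pod terminates only after the new Pod becomes Ready. `terminationGracePeriodSeconds: 30` gives in-flight reconciliations time to wind down after SIGTERM. The short `sleep` in `preStop` runs before SIGTERM is sent, which gives Service endpoints (for example the webhook Service) time to drop the terminating Pod and so mitigates races caused by endpoint-propagation delays. Note that the `exec` form needs a shell in the image, so it does not work as written on distroless base images.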

Graceful shutdown

controller-runtime's `ctrl.SetupSignalHandler()` cancels the manager context when it receives SIGTERM/SIGINT. At that point, any in-flight reconciliation should detect the cancellation and exit cleanly. Whenever you make an external call inside a reconcile function (e.g., a DB API or a cloud SDK), always pass `ctx` so the shutdown signal propagates.

```go
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Passing ctx to external calls cancels them at shutdown, so the reconcile exits promptly.
	if err := r.provisioner.EnsureReady(ctx, &db); err != nil {
		if ctx.Err() != nil {
			// Shutting down: return quietly so the next leader reprocesses.
			return ctrl.Result{}, nil
		}
		log.Error(err, "failed to ensure database ready")
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}
```
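To see why passing `ctx` matters, here is a minimal standard-library sketch of a slow external call that is interrupted by cancellation. `slowExternalCall` is a hypothetical stand-in for a DB API or cloud SDK call, and the goroutine that calls `cancel` plays the role of the signal handler; neither is part of controller-runtime.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowExternalCall stands in for a DB API or cloud SDK call (hypothetical).
// It returns early with ctx.Err() if the context is cancelled first.
func slowExternalCall(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// cancelledDuringCall simulates a shutdown signal arriving mid-call and
// reports whether the call observed the cancellation.
func cancelledDuringCall() bool {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond) // the "SIGTERM" arrives shortly after the call starts
		cancel()
	}()
	err := slowExternalCall(ctx, 5*time.Second)
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println("call aborted by cancellation:", cancelledDuringCall())
}
```

A call that ignored `ctx` would instead hold the Pod for the full five seconds, eating into `terminationGracePeriodSeconds`.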

2. CRD Schema Evolution — Multi-Version and Conversion Webhooks

The trickiest part is changing the CRD schema. Adding a field is compatible, but renaming a field or restructuring it breaks compatibility with existing CRs. That is when you reach for multi-version CRDs and conversion webhooks.

served version and storage version

Each version of a CRD has two properties.

- **served**: whether the API accepts requests in this version (usable by kubectl and clients)

- **storage**: the version actually stored in etcd (exactly one is true)

The core rule is this. etcd always stores data in a single storage version, and requests arriving in another version pass through the conversion webhook to be converted into the storage version before being stored. On read, the conversion runs in the opposite direction.

```text
            +-----------------------------------------------+
            |                  API Server                   |
v1alpha1 -->| served:true ---+                              |
v1beta1  -->| served:true ---+-> conversion webhook -> etcd | (storage: v1)
v1       -->| served:true,   |                              |
            | storage:true --+                              |
            +-----------------------------------------------+
```
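The "exactly one storage version" rule is easy to check mechanically before applying a CRD change. The sketch below is a standalone illustration using a simplified, hypothetical `crdVersion` type rather than the real apiextensions types.

```go
package main

import (
	"errors"
	"fmt"
)

// crdVersion mirrors the two flags discussed above (simplified, hypothetical type).
type crdVersion struct {
	Name    string
	Served  bool
	Storage bool
}

// storageVersion returns the single version marked storage:true,
// or an error if there is none or more than one.
func storageVersion(versions []crdVersion) (string, error) {
	found := ""
	for _, v := range versions {
		if !v.Storage {
			continue
		}
		if found != "" {
			return "", errors.New("more than one storage version: " + found + ", " + v.Name)
		}
		found = v.Name
	}
	if found == "" {
		return "", errors.New("no storage version set")
	}
	return found, nil
}

func main() {
	versions := []crdVersion{
		{Name: "v1alpha1", Served: true},
		{Name: "v1beta1", Served: true},
		{Name: "v1", Served: true, Storage: true},
	}
	name, err := storageVersion(versions)
	fmt.Println(name, err)
}
```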

The hub-and-spoke conversion model

If every version converts to and from every other version, conversion functions explode to N x N. Kubebuilder recommends the hub-and-spoke model. Pick one version as the "hub," and every other version (the spokes) implements conversion only against the hub. Converting from v1alpha1 to v1beta1 then follows the path v1alpha1 -> hub -> v1beta1.

```text
v1alpha1 (spoke)            v1 (hub, storage)            v1beta1 (spoke)
      |    ConvertTo(hub)   -->      |      <--   ConvertTo(hub)    |
      |    ConvertFrom(hub) <--      |      -->   ConvertFrom(hub)  |
```
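The saving is simple arithmetic. As a small sketch (the function names are illustrative only), compare the number of one-way conversion functions needed for direct pairwise conversion with the number needed when every spoke converts only to and from the hub:

```go
package main

import "fmt"

// pairwiseConversions counts one-way conversion functions when every version
// converts directly to every other version.
func pairwiseConversions(n int) int {
	if n < 1 {
		return 0
	}
	return n * (n - 1)
}

// hubSpokeConversions counts them when each spoke implements only
// ConvertTo and ConvertFrom against the hub.
func hubSpokeConversions(n int) int {
	if n < 1 {
		return 0
	}
	return 2 * (n - 1)
}

func main() {
	for n := 2; n <= 5; n++ {
		fmt.Printf("versions=%d pairwise=%d hub-and-spoke=%d\n",
			n, pairwiseConversions(n), hubSpokeConversions(n))
	}
}
```

The two counts are equal for two versions and diverge quickly after that, which is why the hub model pays off as soon as a third version appears.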

The hub version (v1 here) only needs a `Hub()` marker method.

```go
package v1

// Hub marks this type as the conversion hub. No further implementation is needed.
func (*Database) Hub() {}
```

The spoke version (v1beta1) implements bidirectional conversion against the hub.

```go
package v1beta1

import (
	"sigs.k8s.io/controller-runtime/pkg/conversion"

	dbv1 "github.com/example/db-operator/api/v1"
)

// ConvertTo converts this spoke (v1beta1) to the hub (v1).
func (src *Database) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*dbv1.Database)
	dst.ObjectMeta = src.ObjectMeta

	// Copy simple fields straight across.
	dst.Spec.Engine = src.Spec.Engine
	dst.Spec.Replicas = src.Spec.Replicas

	// Restructure v1beta1 StorageGB(int) into v1 Storage(struct).
	dst.Spec.Storage = dbv1.StorageSpec{
		SizeGB:    src.Spec.StorageGB,
		ClassName: "standard", // default for the new field
	}

	// Convert status as well.
	dst.Status.Phase = src.Status.Phase
	dst.Status.ObservedGeneration = src.Status.ObservedGeneration
	return nil
}

// ConvertFrom converts the hub (v1) into this spoke (v1beta1).
func (dst *Database) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*dbv1.Database)
	dst.ObjectMeta = src.ObjectMeta

	dst.Spec.Engine = src.Spec.Engine
	dst.Spec.Replicas = src.Spec.Replicas

	// Collapse v1 Storage(struct) back to v1beta1 StorageGB(int).
	// Note: ClassName cannot be represented in v1beta1, so it is lost.
	dst.Spec.StorageGB = src.Spec.Storage.SizeGB

	dst.Status.Phase = src.Status.Phase
	dst.Status.ObservedGeneration = src.Status.ObservedGeneration
	return nil
}
```

Notice the comment in `ConvertFrom`. v1's `ClassName` has no corresponding field in v1beta1, so it is lost during conversion. This irreversibility becomes a decisive pitfall later in the rollback strategy, so record it explicitly when you write the conversion code.

Registering the conversion webhook and the CRD manifest

To enable the webhook, run a webhook server in the manager and register the type.

```go
func (r *Database) SetupWebhookWithManager(mgr ctrl.Manager) error {
	return ctrl.NewWebhookManagedBy(mgr).
		For(r).
		Complete()
}
```

In the CRD manifest, set the `conversion` strategy to Webhook.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
spec:
  group: example.com
  names:
    kind: Database
    plural: databases
  scope: Namespaced
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          namespace: db-operator-system
          path: /convert
  versions:
    - name: v1alpha1
      served: true
      storage: false
      deprecated: true
      deprecationWarning: "example.com/v1alpha1 Database is deprecated; use v1"
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
    - name: v1beta1
      served: true
      storage: false
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
```

The safe order of CRD evolution

The order matters when evolving the schema. The flow for adding a new version is as follows.

```text
Step 1: Add v1 with served:true, storage:false (still storing v1beta1)
        Deploy conversion webhook + verify conversion functions
Step 2: Deploy the new controller version (recognizes v1)
Step 3: Switch the storage version to v1 (v1beta1 storage:false, v1 storage:true)
Step 4: Re-store existing CRs as v1 with the storage version migrator
Step 5: Mark v1alpha1/v1beta1 deprecated, then served:false after a grace period
Step 6: Remove the old versions once nothing references them
```

Before switching the storage version, you must confirm the conversion webhook works reliably. If the webhook fails, all reads and writes for that CRD are blocked, which becomes a direct outage for cluster operations.

3. Safe Phased Rollout of Managed Workloads

When the controller switches to new logic, there is a temptation to change every managed workload at once. But if you do, a bug in the new logic takes everything down simultaneously. It is safer to make the Operator drive a phased (partitioned) rollout itself.

Tracking progress with observedGeneration

`metadata.generation` increments every time the spec changes. The controller records the generation it finished processing in `status.observedGeneration`. Comparing the two tells you whether the controller has caught up with the latest spec: if `observedGeneration` is lower than `generation`, the most recent change has not been processed yet.

```go
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Nothing to do if we already processed the latest generation.
	if db.Status.ObservedGeneration == db.Generation &&
		meta.IsStatusConditionTrue(db.Status.Conditions, "Ready") {
		return ctrl.Result{}, nil
	}

	// ... actual reconcile logic ...

	db.Status.ObservedGeneration = db.Generation
	meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
		Type:               "Ready",
		Status:             metav1.ConditionTrue,
		ObservedGeneration: db.Generation,
		Reason:             "Reconciled",
		Message:            "database is ready",
	})
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```
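The same comparison is useful outside the controller, for example in a script that reports how far an upgrade has progressed across many CRs. The sketch below uses a trimmed-down, hypothetical `status` type and illustrative state names; it only shows the classification logic.

```go
package main

import "fmt"

// status is a trimmed-down stand-in for the CR status (hypothetical type).
type status struct {
	ObservedGeneration int64
	Ready              bool
}

// rolloutState classifies a CR by comparing generation with observedGeneration.
func rolloutState(generation int64, st status) string {
	switch {
	case st.ObservedGeneration < generation:
		return "pending" // spec changed, controller has not caught up
	case !st.Ready:
		return "progressing" // latest spec observed, but not ready yet
	default:
		return "done"
	}
}

func main() {
	fmt.Println(rolloutState(3, status{ObservedGeneration: 2, Ready: true}))
	fmt.Println(rolloutState(3, status{ObservedGeneration: 3, Ready: false}))
	fmt.Println(rolloutState(3, status{ObservedGeneration: 3, Ready: true}))
}
```

Note that a `Ready` condition on its own is not enough: in the first case the CR still reports ready, but for the previous generation.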

Partition-based progressive rollout

When the Operator applies a new workload version, it changes only N% at a time, verifies the result, then proceeds to the next step. Put the rollout policy in the CR spec and have the controller interpret it.

```go
// RolloutPolicy is defined in the CR spec and controls the phased transition.
type RolloutPolicy struct {
	// MaxUnavailable: percentage of instances that may be replaced at once.
	MaxUnavailable int `json:"maxUnavailable"`
	// Partition: instances below this index are not switched to the new version (gradual expansion).
	Partition int `json:"partition"`
	// Paused: if true, stop the rollout and hold the current state.
	Paused bool `json:"paused"`
}

// rolloutManagedPods replaces Pods with desiredImage in small steps.
// It returns true while the rollout still has work to do.
// pods is assumed to be ordered by instance index.
func (r *DatabaseReconciler) rolloutManagedPods(
	ctx context.Context, db *examplev1.Database, desiredImage string,
) (bool, error) {
	// Change nothing while paused.
	if db.Spec.Rollout.Paused {
		return false, nil
	}

	pods, err := r.listManagedPods(ctx, db)
	if err != nil {
		return false, err
	}

	// Walk indices in descending order, targeting only those at or above the partition.
	updating := 0
	maxConcurrent := percentToCount(db.Spec.Rollout.MaxUnavailable, len(pods))

	for i := len(pods) - 1; i >= db.Spec.Rollout.Partition; i-- {
		p := pods[i]
		if !isHealthy(p) {
			// An unhealthy Pod, including one already replaced with the
			// new image, stops the rollout until it recovers.
			return true, nil
		}
		if podImage(p) == desiredImage {
			continue // already on the new version and healthy
		}
		if updating >= maxConcurrent {
			return true, nil // concurrency limit reached; continue next reconcile
		}
		if err := r.recreatePodWithImage(ctx, p, desiredImage); err != nil {
			return false, err
		}
		updating++
	}

	// If nothing was replaced in this pass, every target is on the new version.
	return updating > 0, nil
}
```

There are two keys to this pattern. First, **if any Pod is unhealthy, stop progressing.** A problem in the new version halts at the first Pod, so the damage is localized. Second, **the operator adjusts `Partition`** to expand gradually. Start with a high partition so only a few change, then lower it to 0 once stability is confirmed.
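The rollout function relies on a `percentToCount` helper that the article does not show. One possible implementation is sketched below, together with a helper that lists which indices a given partition exposes. Both are assumptions for illustration: the rounding rule (round down, but never below one Pod for a positive percentage, so the rollout can make progress) is a design choice, not something taken from the source.

```go
package main

import "fmt"

// percentToCount converts a percentage into a Pod count, rounding down but
// never below 1 for a positive percentage, and never above total.
func percentToCount(percent, total int) int {
	if total <= 0 || percent <= 0 {
		return 0
	}
	n := total * percent / 100
	if n < 1 {
		n = 1
	}
	if n > total {
		n = total
	}
	return n
}

// targetIndices lists the indices eligible for the new version at a given
// partition, highest index first.
func targetIndices(total, partition int) []int {
	if partition < 0 {
		partition = 0
	}
	var out []int
	for i := total - 1; i >= partition; i-- {
		out = append(out, i)
	}
	return out
}

func main() {
	fmt.Println(percentToCount(25, 10)) // Pods replaced per pass
	fmt.Println(targetIndices(5, 3))    // only the top two indices are eligible
	fmt.Println(targetIndices(5, 0))    // partition 0: everything is eligible
}
```

Lowering the partition from 3 to 0 widens the eligible set one step at a time, which is the gradual expansion described above.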

Pausing reconciliation

Sometimes you need to stop reconciliation immediately when a problem is detected. A pause switch via annotation is a common pattern.

```go
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// If the operator hits the emergency stop, skip reconciliation.
	if db.Annotations["example.com/reconcile-paused"] == "true" {
		ctrl.LoggerFrom(ctx).Info("reconcile paused by annotation")
		return ctrl.Result{}, nil
	}

	// ...
	return ctrl.Result{}, nil
}
```

One `kubectl annotate database mydb example.com/reconcile-paused=true` stops automatic reconciliation for that CR. It is a safety valve that prevents the controller from continuing to push a bad state during an incident.

4. Rollback Strategy — Recognizing What Cannot Be Undone

When an upgrade goes wrong, you have to roll back. But rolling back an Operator does not end with reverting the controller image to a previous tag. Things that cannot be undone hide in many places.

Pitfall 1: downgrading after raising the storage version

Suppose you raise the CRD storage version from v1beta1 to v1, and some CRs are already stored as v1. If you now revert the controller to an old version that does not know v1, the old controller may be unable to read objects stored as v1 in etcd. If the conversion webhook is still alive to convert v1 to v1beta1, reads still work, but if you rolled back the webhook too, you lose access to the data.

```text
[Correct rollback order when the storage version was raised]
1. Keep the conversion webhook in place (never take it down first)
2. Roll back only the controller to the previous version
3. Carefully decide whether to revert the storage version to one the old controller knows
4. If CRs are already stored as v1, re-store them back to the old version with the migrator
```

Pitfall 2: data loss from removed fields

If the new schema removed a field, that field disappears forever during conversion. Recall how `ClassName` was lost in the earlier `ConvertFrom`. Once an object is converted from v1 to v1beta1 and stored, raising it back to v1 will not restore `ClassName` (it is merely filled with a default).
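The loss is easy to reproduce in isolation. The sketch below uses simplified stand-in structs (`hubSpec`, `spokeSpec`) instead of the real API types, and a made-up class name `fast-ssd`; it converts hub -> spoke -> hub the same way the conversion functions above do and shows that a non-default `ClassName` does not survive.

```go
package main

import "fmt"

// hubSpec and spokeSpec are simplified stand-ins for the v1 and v1beta1 specs.
type hubSpec struct {
	SizeGB    int
	ClassName string
}

type spokeSpec struct {
	StorageGB int
}

// defaultClassName is the value ConvertTo fills in for the new field.
const defaultClassName = "standard"

// toSpoke drops ClassName because the spoke has no field for it.
func toSpoke(h hubSpec) spokeSpec { return spokeSpec{StorageGB: h.SizeGB} }

// toHub can only fill ClassName with the default.
func toHub(s spokeSpec) hubSpec {
	return hubSpec{SizeGB: s.StorageGB, ClassName: defaultClassName}
}

// roundTripHub converts hub -> spoke -> hub.
func roundTripHub(h hubSpec) hubSpec { return toHub(toSpoke(h)) }

func main() {
	original := hubSpec{SizeGB: 100, ClassName: "fast-ssd"} // hypothetical class name
	after := roundTripHub(original)
	fmt.Printf("before: %+v\nafter:  %+v\nlossless: %t\n", original, after, original == after)
}
```

The round trip is lossless only when the original value happened to equal the default, which is exactly the case a casual test tends to use.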

Pitfall 3: conversion is irreversible

Just because conversion is bidirectional does not mean a round trip is always lossless. When you write conversion functions, you must include round-trip tests that verify losslessness.

```go
func TestConversionRoundTrip(t *testing.T) {
	original := &v1beta1.Database{
		Spec: v1beta1.DatabaseSpec{
			Engine:    "postgres",
			Replicas:  3,
			StorageGB: 100,
		},
	}

	// v1beta1 -> v1 -> v1beta1
	hub := &v1.Database{}
	if err := original.ConvertTo(hub); err != nil {
		t.Fatal(err)
	}
	roundTripped := &v1beta1.Database{}
	if err := roundTripped.ConvertFrom(hub); err != nil {
		t.Fatal(err)
	}

	// Fields v1beta1 can represent must be lossless.
	if roundTripped.Spec.StorageGB != original.Spec.StorageGB {
		t.Errorf("StorageGB lost: got %d want %d",
			roundTripped.Spec.StorageGB, original.Spec.StorageGB)
	}
}
```

The most realistic way to guarantee rollback capability is to **separate the storage version switch from the controller upgrade.** Run the controller long enough to confirm stability, then raise the storage version much later. That way a controller rollback is not entangled with the storage version.

5. Running Multiple Versions Concurrently and Deprecation

In reality, different teams create CRs in different API versions. Some CRs were created in v1alpha1, others in v1beta1. Keeping several served versions lets you support them all at once.

Deprecation should never be abrupt. Stage it after the model of Kubernetes' API deprecation policy.

| Stage | State | Behavior |
| --- | --- | --- |
| Normal | served, unmarked | Use freely |
| Discouraged | served + deprecated marker | Warning message on use |
| Stopped serving | served:false | API requests rejected, etcd data retained |
| Removed | deleted from versions | No longer exists |

Setting `deprecated: true` and `deprecationWarning` on the CRD prints a warning for every kubectl command that uses that version. It is the gentlest way to encourage users to migrate. Before flipping to served:false, always confirm that no CR still references that version (or that the storage version migrator has re-stored them all as the latest version).
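For the final "Removed" stage, one concrete signal is the CRD's `status.storedVersions` list, which records the versions that have been used to persist objects. A version that still appears there (or that is the current storage version) should not be deleted from `versions`. The sketch below is a standalone check over plain strings; `safeToRemove` is a hypothetical helper, not an API from any library.

```go
package main

import "fmt"

// safeToRemove reports whether a version can be deleted from the CRD's
// versions list: it must not be the storage version and must not appear in
// status.storedVersions.
func safeToRemove(version, storageVersion string, storedVersions []string) bool {
	if version == storageVersion {
		return false
	}
	for _, v := range storedVersions {
		if v == version {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical status.storedVersions before and after migration.
	fmt.Println(safeToRemove("v1beta1", "v1", []string{"v1beta1", "v1"}))
	fmt.Println(safeToRemove("v1beta1", "v1", []string{"v1"}))
}
```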

6. Canary Operator — Applying a New Version to a Subset

Because a CRD is cluster-wide, canarying is hard, but **the controller's scope of processing can be narrowed to a namespace.** Run the new controller version so it watches only a specific namespace, and let the existing controller handle the rest.

Scope the controller-runtime manager cache to a namespace.

```go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
	Cache: cache.Options{
		// This controller watches only the canary namespace.
		DefaultNamespaces: map[string]cache.Config{
			"team-canary": {},
		},
	},
	// The canary uses a separate leader-election ID so it never conflicts with the existing controller.
	LeaderElection:   true,
	LeaderElectionID: "db-operator-canary.example.com",
})
```

There are two caveats, though. First, the existing controller must stop watching the canary namespace (for example by listing its own namespaces explicitly), otherwise both versions reconcile the same CRs and fight over them. Second, **both controller versions share the same CRD,** so if the new version comes with a CRD schema change, the old controller must also tolerate that schema. Canary is therefore especially well suited to upgrades where "the schema stays the same and only the logic changes." If the schema also changes, it is safer to limit yourself to cases where the new schema is compatible with the old version (such as adding fields).

7. Version Compatibility — K8s, Go, controller-runtime

Here is the version matrix as of 2026.

| Component | Recommended version | Notes |
| --- | --- | --- |
| Kubernetes | 1.36 | Latest supported target |
| Go | 1.26 | Kubebuilder scaffolding baseline |
| controller-runtime | v0.24.x | Manager/client API |
| controller-tools | v0.21.x | CRD/RBAC marker generation |

The most common mistake during an upgrade is bumping the controller-runtime version without aligning the client-go/apimachinery versions. Each minor version of controller-runtime assumes a specific client-go version, so you must align them together in `go.mod`. Also, under Kubernetes' **version skew policy**, the gap between the control plane and the client library is bounded. Attaching an Operator built with a too-old client-go to a recent API server may leave some APIs non-functional.

```text
[Compatibility check order]
1. Align controller-runtime, client-go, apimachinery versions in go.mod
2. Confirm the target cluster's K8s version is within the supported range
3. Check whether you use APIs scheduled for removal (e.g., an old admissionregistration version)
4. Bump controller-tools, regenerate CRDs, and review the schema diff
5. Run integration tests with envtest to catch regressions
```

After an upgrade, always regenerate the CRD manifests and diff them against the existing ones. When the controller-tools version changes, the generated OpenAPI schema can shift subtly, which can introduce unintended validation changes.

8. Large-Scale CR Migration — Safely in Batches

After raising the storage version, you must re-store existing CRs as the new storage version. If you leave them, etcd still holds objects in the old version, which blocks you later when removing the old version. Reading an object once and writing it back (a no-op update) makes the API server re-store it in the storage version.

Touching thousands of CRs at once piles load on the API server and the conversion webhook. Handle it with a one-time migration job that has a rate limit.

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	examplev1 "github.com/example/db-operator/api/v1"
)

// migrateStorageVersion updates every Database CR once so the API server
// re-stores it in the current storage version.
// c should be a direct (uncached) client so that Limit/Continue paging
// is served by the API server.
func migrateStorageVersion(ctx context.Context, c client.Client) error {
	// Limit to 10 updates per second (protect the API server/webhook).
	limiter := rate.NewLimiter(rate.Limit(10), 1)

	var continueToken string
	for {
		var list examplev1.DatabaseList
		opts := []client.ListOption{client.Limit(100)}
		if continueToken != "" {
			opts = append(opts, client.Continue(continueToken))
		}
		if err := c.List(ctx, &list, opts...); err != nil {
			return err
		}

		for i := range list.Items {
			if err := limiter.Wait(ctx); err != nil {
				return err
			}
			db := &list.Items[i]

			// Stamp an annotation so the update is a real write and forces a re-store.
			if db.Annotations == nil {
				db.Annotations = map[string]string{}
			}
			db.Annotations["example.com/storage-migrated-at"] =
				time.Now().UTC().Format(time.RFC3339)

			if err := c.Update(ctx, db); err != nil {
				if apierrors.IsConflict(err) || apierrors.IsNotFound(err) {
					// Changed or deleted concurrently: skip, a rerun picks it up.
					continue
				}
				// Anything else (e.g. a failing conversion webhook) aborts the job.
				return err
			}
		}

		if list.Continue == "" {
			break
		}
		continueToken = list.Continue
	}
	return nil
}
```
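Before running such a job it helps to estimate how long one pass will take at a given rate. The sketch below is back-of-the-envelope arithmetic only: it ignores request latency and retries, so treat the result as a lower bound. `estimateDuration` and `pages` are hypothetical helper names, and the 5000-CR figure is just an example.

```go
package main

import (
	"fmt"
	"time"
)

// estimateDuration returns roughly how long a rate-limited pass over
// `items` objects takes at `perSecond` updates per second.
func estimateDuration(items int, perSecond float64) time.Duration {
	if items <= 0 || perSecond <= 0 {
		return 0
	}
	return time.Duration(float64(items) / perSecond * float64(time.Second))
}

// pages returns how many List calls are needed with the given page size.
func pages(items, pageSize int) int {
	if items <= 0 || pageSize <= 0 {
		return 0
	}
	return (items + pageSize - 1) / pageSize
}

func main() {
	const crs = 5000 // example fleet size
	fmt.Println("duration at 10/s:", estimateDuration(crs, 10))
	fmt.Println("list pages at 100/page:", pages(crs, 100))
}
```

If the estimate is longer than your maintenance window, raise the rate gradually while watching webhook latency rather than removing the limiter.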

Kubernetes also provides a **storage version migrator** that automates this. Creating a `StorageVersionMigration` resource makes the control plane re-store every object of that resource in the storage version. During migration, however, the conversion webhook must stay reliably up, and on large clusters you should proceed while monitoring the load.

A migration job follows these principles.

1. Rate limit is mandatory - Prevent API server/webhook overload (N per second)

2. Pagination - Use Limit + Continue on List to avoid memory blowup

3. Idempotency - Design so a rerun after a crash is safe

4. Tolerate conflicts - Concurrent-update conflicts retry on the next round

5. Progress visibility - Expose processed/failed counts as metrics

6. Webhook health check - Abort immediately if conversion fails

9. Incident Response — When Something Breaks Mid-Rollout

Here is the standard response procedure when a problem is detected during an upgrade. The core is "stop further spread, and undo the reversible things first."

```text
[Incident response playbook]

Symptom detected

1) Stop the rollout immediately
   - reconcile-paused=true annotation on the CR (block spread)
   - Or Rollout.Paused=true to halt the phased transition

2) Assess the blast radius
   - Use status conditions / observedGeneration to see how far it progressed
   - Count workloads switched to the new version

3) Judge reversibility
   - Storage version not yet raised? -> Rolling back the controller image is enough
   - Already raised? -> Roll back carefully while keeping the conversion webhook

4) Roll back the controller
   - Set the Deployment image to the last known-good tag
   - But do not casually take down the CRD/webhook with it

5) Verify data consistency
   - Check whether loss-prone fields (ClassName, etc.) were affected
   - Restore from backup if needed

6) Postmortem
   - Record root causes: missing round-trip test, unchecked webhook health, etc.
```

The most important principle is **do not blindly roll back the CRD and conversion webhook together with the controller.** The controller is nearly stateless and safe to roll back, but the CRD/webhook directly govern access to etcd data, so taking them down incorrectly blocks all CR reads.
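Step 3 of the playbook is a simple branch, and writing it down as code can make a runbook harder to misread under pressure. The sketch below only encodes the ordering described in this article; `rollbackPlan` and the step strings are illustrative, not part of any tool.

```go
package main

import "fmt"

// rollbackPlan returns the ordered rollback steps depending on whether the
// storage version had already been raised when the incident started.
func rollbackPlan(storageVersionRaised bool) []string {
	if !storageVersionRaised {
		return []string{"roll back controller image"}
	}
	return []string{
		"keep conversion webhook running",
		"roll back controller image",
		"decide whether to revert the storage version",
		"re-store CRs in the old version if needed",
	}
}

func main() {
	for i, step := range rollbackPlan(true) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```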

10. Pre/Post Deployment Checklist

[Before the upgrade]

[] Confirm controller-runtime/client-go/apimachinery versions align in go.mod

[] Confirm the target K8s version is within the supported range (1.36)

[] Regenerate CRDs and review the schema diff against the existing ones

[] Pass round-trip tests for conversion functions

[] Pass envtest integration tests

[] Confirm leader election + LeaderElectionReleaseOnCancel are set

[] Confirm Deployment maxUnavailable=0 / maxSurge=1

[] Document the rollback procedure + confirm backups

[] Judge whether the new schema is compatible with the old controller (canary feasibility)

[During the upgrade]

[] Add the new version first as served:true, storage:false

[] Confirm conversion webhook health

[] Confirm leader handover works after the controller roll

[] Expand managed workloads gradually via partition

[] Track progress with status conditions / observedGeneration

[After the upgrade]

[] Switch the storage version separately, only after the controller stabilizes

[] Re-store existing CRs with the storage version migrator (rate limited)

[] Mark old versions deprecated and grant a grace period

[] Confirm metrics (reconcile error rate, webhook latency) are within normal range

[] Move unreferenced old versions to served:false, then remove them

Closing

The essence of an Operator upgrade is "separating three timelines." The controller logic, the CRD schema, and the managed workloads change at different speeds with different degrees of reversibility. Resist the temptation to change them all at once: roll the reversible thing (the controller) first to confirm stability, then apply the irreversible things (storage version, field removal) last and carefully. That is the heart of evolving without downtime.

Three things to remember in particular. First, minimize the reconcile gap during a controller roll with leader election and `LeaderElectionReleaseOnCancel`. Second, the conversion webhook is the lifeline for etcd data access, so never take it down carelessly around a storage version switch. Third, add round-trip tests to every conversion function to catch irreversible loss at build time. Keeping just these three prevents most production upgrade incidents.

References

- [Kubebuilder Book](https://book.kubebuilder.io/)

- [Kubebuilder Multi-Version Tutorial](https://book.kubebuilder.io/multiversion-tutorial/tutorial.html)

- [Operator SDK](https://sdk.operatorframework.io/)

- [controller-runtime (pkg.go.dev)](https://pkg.go.dev/sigs.k8s.io/controller-runtime)

- [Kubernetes: Operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)

- [CRD versioning & conversion](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/)

- [Kubernetes version skew policy](https://kubernetes.io/releases/version-skew-policy/)

- [kubernetes-sigs/kubebuilder (GitHub)](https://github.com/kubernetes-sigs/kubebuilder)

- [kubernetes-sigs/controller-runtime (GitHub)](https://github.com/kubernetes-sigs/controller-runtime)

- [Operator Lifecycle Manager (OLM)](https://olm.operatorframework.io/)