Chaos and Order

Introduction — Why Databases Are the Hardest

Kubernetes handles stateless applications well. Set replicas on a Deployment and Pods come up, die, come back, and it makes no difference which one died. Databases are different.

Database Pods are not equal. One is the primary that accepts writes; the rest are read-only replicas. Delete a Pod and its data may vanish with it. When the primary dies, someone must be promoted — and a wrong promotion splits the data (split-brain). Scaling down is not just reducing Pods; it entails data redistribution. Operating without backups is unthinkable.

All of these procedures are exactly the operational knowledge a human DBA carried in their head. In this article we encode that knowledge into an Operator, step by step, using Kubebuilder and controller-runtime. We will not implement a full production database engine, but the goal is to internalize **the core patterns of operating a stateful system with an Operator**.

For reference, as of 2026 Kubebuilder supports Kubernetes 1.36 / Go 1.26, controller-runtime is on the v0.24 line, and controller-tools on the v0.21 line. The kube-rbac-proxy sidecar of the past has been removed; instead the controller-runtime metrics server's WithAuthenticationAndAuthorization is used.

Design — Define the CR Spec First

Operator development starts not with code but with API design. You must first decide what the user gets to declare. Our Database CR looks like this.

    apiVersion: db.example.com/v1alpha1
    kind: Database
    metadata:
    spec:
      replicas: 3
      version: "16.3"
      storage:
        size: 50Gi
        storageClassName: fast-ssd
      backup:
        enabled: true
        schedule: "0 3 * * *"
        retention: 14
        destination: s3://my-backups/orders-db
    status:
      phase: Running
      primary: orders-db-0
      readyReplicas: 3
      lastBackupTime: "2026-06-15T03:00:12Z"

The design principle is a **small API**. Expose only the minimal knobs the user needs to know: replicas, version, storage, backup. Never expose internal implementation details (the StatefulSet name, the replication protocol) in the spec. Those are the Operator's responsibility, not the user's concern.

In Go we define the types as follows. Kubebuilder markers generate the CRD's OpenAPI schema and validation.

    // DatabaseSpec defines the desired state of Database.
    type DatabaseSpec struct {
        // +kubebuilder:validation:Minimum=1
        // +kubebuilder:validation:Maximum=9
        // +kubebuilder:default=1
        Replicas int32 `json:"replicas"`

        // +kubebuilder:validation:Required
        Version string `json:"version"`

        Storage StorageSpec `json:"storage"`

        // +optional
        Backup *BackupSpec `json:"backup,omitempty"`
    }

    type StorageSpec struct {
        Size             string  `json:"size"`
        StorageClassName *string `json:"storageClassName,omitempty"`
    }

    type BackupSpec struct {
        Enabled     bool   `json:"enabled"`
        Schedule    string `json:"schedule"`
        Retention   int32  `json:"retention"`
        Destination string `json:"destination"`
    }

    // DatabaseStatus defines the observed state of Database.
    type DatabaseStatus struct {
        Phase          string             `json:"phase,omitempty"`
        Primary        string             `json:"primary,omitempty"`
        ReadyReplicas  int32              `json:"readyReplicas,omitempty"`
        LastBackupTime *metav1.Time       `json:"lastBackupTime,omitempty"`
        Conditions     []metav1.Condition `json:"conditions,omitempty"`
    }

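Limiting replicas to 1-9 is deliberate: the upper bound prevents reckless scaling. Note that Minimum and Maximum alone cannot enforce odd counts (which matter for quorum); that needs an additional rule, for example a CEL expression in a +kubebuilder:validation:XValidation marker or a validating webhook. Mistakes you can block at the API level are best blocked at the API.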
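Why odd counts matter can be seen from majority-quorum arithmetic. The sketch below is only an illustration: `quorum` and `tolerated` are hypothetical helper names, and it assumes a simple majority rule, which may not match what a given engine or HA agent actually uses.

```go
package main

import "fmt"

// quorum returns how many members must agree under a simple majority rule.
// The majority rule is an assumption made for this illustration.
func quorum(replicas int) int {
	return replicas/2 + 1
}

// tolerated returns how many members can fail while a majority still survives.
func tolerated(replicas int) int {
	return replicas - quorum(replicas)
}

func main() {
	// Walk the range allowed by the CRD validation (1 to 9).
	for n := 1; n <= 9; n++ {
		fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

Under this rule, going from 3 to 4 members raises the quorum but not the number of tolerated failures, which is why even counts buy little.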

reconcile — The Core Loop

The heart of an Operator is the Reconcile function. It takes the declaration "I want the Database CR to be in this state" and makes reality match it. The most important principle is **idempotency**: calling it any number of times with the same input must yield the same result.

    func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        logger := log.FromContext(ctx)

        // 1. Fetch the target CR
        var db dbv1alpha1.Database
        if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
            // Already deleted: ignore
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        // 2. Handle deletion (finalizer)
        if !db.DeletionTimestamp.IsZero() {
            return r.reconcileDelete(ctx, &db)
        }
        if !controllerutil.ContainsFinalizer(&db, dbFinalizer) {
            controllerutil.AddFinalizer(&db, dbFinalizer)
            if err := r.Update(ctx, &db); err != nil {
                return ctrl.Result{}, err
            }
        }

        // 3. Ensure desired resources in order
        logger.V(1).Info("reconciling child resources")
        if err := r.reconcileHeadlessService(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
        if err := r.reconcileStatefulSet(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
        if err := r.reconcileBackupCronJob(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }

        // 4. Update status
        if err := r.updateStatus(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{}, nil
    }
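The idempotency requirement can be seen in isolation, without a cluster. The following is a minimal sketch in which a plain map stands in for the API server; `store` and `ensure` are hypothetical names, not controller-runtime APIs.

```go
package main

import "fmt"

// store is a stand-in for the API server: object name -> replica count.
type store map[string]int

// ensure creates the object if it is missing, updates it if it differs from
// the desired value, and otherwise does nothing. The returned string names
// the action that was taken.
func ensure(s store, name string, replicas int) string {
	current, ok := s[name]
	switch {
	case !ok:
		s[name] = replicas
		return "created"
	case current != replicas:
		s[name] = replicas
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	s := store{}
	fmt.Println(ensure(s, "orders-db", 3)) // first call creates the object
	fmt.Println(ensure(s, "orders-db", 3)) // repeating the call changes nothing
	fmt.Println(ensure(s, "orders-db", 5)) // a changed input converges to the new value
}
```

However many times `ensure` is called with the same input, the store ends up in the same state, which is the property Reconcile needs.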

Each reconcileXxx function follows the same pattern: "describe the desired object, create it if missing, reconcile it if it differs." controller-runtime's CreateOrUpdate helper expresses this cleanly.

    func (r *DatabaseReconciler) reconcileStatefulSet(ctx context.Context, db *dbv1alpha1.Database) error {
        // Only name and namespace are set here; CreateOrUpdate fetches the rest.
        sts := &appsv1.StatefulSet{
            ObjectMeta: metav1.ObjectMeta{
                Name:      db.Name,
                Namespace: db.Namespace,
            },
        }

        _, err := controllerutil.CreateOrUpdate(ctx, r.Client, sts, func() error {
            // Fill in the desired spec (idempotent)
            sts.Spec.Replicas = &db.Spec.Replicas
            sts.Spec.ServiceName = db.Name + "-headless"
            sts.Spec.Selector = &metav1.LabelSelector{
                MatchLabels: labelsFor(db),
            }
            sts.Spec.Template = podTemplateFor(db)
            sts.Spec.VolumeClaimTemplates = pvcTemplatesFor(db)

            // owner reference: GC the StatefulSet when the CR is deleted
            return controllerutil.SetControllerReference(db, sts, r.Scheme)
        })
        return err
    }

The **owner reference** is key here. Setting the StatefulSet's owner to the Database CR means the Kubernetes garbage collector cleans up the StatefulSet automatically when the CR is deleted. We also watch owned StatefulSets (in controller-runtime, by registering them with Owns(&appsv1.StatefulSet{}) when building the controller) so that StatefulSet change events re-trigger reconcile.
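To make the cascading behaviour concrete, here is a toy model of owner-reference garbage collection. It is a sketch under simplifying assumptions: each object has at most one owner, and `object` and `collect` are hypothetical names, not how the real garbage collector is implemented.

```go
package main

import "fmt"

// object is a toy Kubernetes object; ownerUID is empty for root objects.
type object struct {
	uid      string
	ownerUID string
}

// collect deletes the object with the given UID and, transitively, every
// object whose owner chain leads back to it. It returns how many objects
// were deleted.
func collect(objs map[string]object, uid string) int {
	if _, ok := objs[uid]; !ok {
		return 0
	}
	delete(objs, uid)
	deleted := 1
	// Deleting map entries while ranging over the map is allowed in Go.
	for _, o := range objs {
		if o.ownerUID == uid {
			deleted += collect(objs, o.uid)
		}
	}
	return deleted
}

func main() {
	objs := map[string]object{
		"database":    {uid: "database"},
		"statefulset": {uid: "statefulset", ownerUID: "database"},
		"pod-0":       {uid: "pod-0", ownerUID: "statefulset"},
		"cronjob":     {uid: "cronjob", ownerUID: "database"},
		"unrelated":   {uid: "unrelated"},
	}
	fmt.Println("deleted:", collect(objs, "database"), "remaining:", len(objs))
}
```

Deleting the root removes everything beneath it and nothing else, which is exactly why attaching data-bearing objects to the owner chain deserves care.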

StatefulSet, Service, PVC — Why These Three

When operating a stateful database on Kubernetes, these three resources form the skeleton.

    +------------------------------------------+
    |               Database CR                |
    +--------------------+---------------------+
                         | reconcile
             +-----------+-----------+
             v           v           v
        StatefulSet   Headless    CronJob
         (N Pods)     Service     (backup)
             |           |
             v           v
        stable IDs    DNS records
        db-0, db-1    db-0.db-headless
        db-2 ...      (fixed per-Pod addr)

        VolumeClaimTemplate
          -> dedicated PVC per Pod
          -> data survives Pod restarts

There are three reasons to use a StatefulSet. First, **stable network identity**. Pod names are fixed in order (db-0, db-1), and combined with the headless Service they get fixed DNS like db-0.db-headless. A replica needs this stable address to find the primary.

Second, **ordered creation and deletion**. Under the default OrderedReady Pod management policy, db-1 comes up only after db-0 is Ready. In database cluster bootstrap it is natural to initialize db-0 as primary and have the rest attach.

Third, **stable storage**. VolumeClaimTemplate attaches a dedicated PVC to each Pod, and the same PVC follows even after rescheduling. Data is preserved.
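The stable addresses described above follow a fixed naming scheme, so a replica can compute its peers' names without any discovery step. A minimal sketch, assuming the headless Service is named name+"-headless" as in the StatefulSet code earlier; `podDNSNames` is a hypothetical helper, and the cluster domain ("cluster.local" here) depends on cluster configuration.

```go
package main

import "fmt"

// podDNSNames returns the fully qualified per-Pod DNS names for a StatefulSet
// called name that is governed by a headless Service called name+"-headless".
func podDNSNames(name, namespace, clusterDomain string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		// Pattern: <pod>.<service>.<namespace>.svc.<cluster domain>
		names = append(names, fmt.Sprintf("%s-%d.%s-headless.%s.svc.%s",
			name, i, name, namespace, clusterDomain))
	}
	return names
}

func main() {
	// "prod" is an example namespace, not taken from the article.
	for _, n := range podDNSNames("orders-db", "prod", "cluster.local", 3) {
		fmt.Println(n)
	}
}
```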

A headless Service is one with clusterIP None, providing per-Pod DNS records instead of load balancing.

    apiVersion: v1
    kind: Service
    metadata:
    spec:
      clusterIP: None
      selector:
        app: orders-db
      ports:
        - port: 5432

Keep a separate write Service that targets only the primary by adding a role label to its selector. On failover the Operator moves this label to route traffic to the new primary.
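The label move can be modelled with plain maps. This is a sketch only: the label values "primary" and "replica" and the helpers `promote` and `selected` are assumptions for illustration, and a real Operator would patch Pod labels through the API server rather than mutate memory.

```go
package main

import "fmt"

// pod is a toy model: a name plus its labels.
type pod struct {
	name   string
	labels map[string]string
}

// promote labels newPrimary as role=primary and every other Pod as role=replica.
func promote(pods []pod, newPrimary string) {
	// Demote first so the write Service never selects two Pods at once.
	for _, p := range pods {
		if p.name != newPrimary {
			p.labels["role"] = "replica"
		}
	}
	for _, p := range pods {
		if p.name == newPrimary {
			p.labels["role"] = "primary"
		}
	}
}

// selected returns the Pods whose labels match every key/value pair in
// selector, which is how a Service selector picks its endpoints.
func selected(pods []pod, selector map[string]string) []string {
	var out []string
	for _, p := range pods {
		match := true
		for k, v := range selector {
			if p.labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, p.name)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{"orders-db-0", map[string]string{"app": "orders-db", "role": "primary"}},
		{"orders-db-1", map[string]string{"app": "orders-db", "role": "replica"}},
		{"orders-db-2", map[string]string{"app": "orders-db", "role": "replica"}},
	}
	writeSelector := map[string]string{"app": "orders-db", "role": "primary"}
	promote(pods, "orders-db-1")
	fmt.Println(selected(pods, writeSelector))
}
```

The headless Service keeps selecting all three Pods through the app label alone, while the write Service follows whichever Pod currently carries the primary role.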

Periodic Backups — Creating a CronJob

Backups can be delegated to a CronJob the Operator creates. If spec.backup is enabled, reconcile a backup CronJob; if disabled, delete it.

    func (r *DatabaseReconciler) reconcileBackupCronJob(ctx context.Context, db *dbv1alpha1.Database) error {
        name := db.Name + "-backup"
        cj := &batchv1.CronJob{
            ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: db.Namespace},
        }

        // Clean up the CronJob if backups are off
        if db.Spec.Backup == nil || !db.Spec.Backup.Enabled {
            err := r.Delete(ctx, cj)
            return client.IgnoreNotFound(err)
        }

        _, err := controllerutil.CreateOrUpdate(ctx, r.Client, cj, func() error {
            cj.Spec.Schedule = db.Spec.Backup.Schedule
            cj.Spec.JobTemplate.Spec.Template.Spec = backupPodSpec(db)
            return controllerutil.SetControllerReference(db, cj, r.Scheme)
        })
        return err
    }

The backup Pod runs pg_dump or a physical backup tool and uploads the result to spec.backup.destination (e.g., S3). The retention policy is implemented by the backup script pruning old backups.

An alternative is an in-Operator scheduler instead of relying on CronJob: compute the requeue timing and re-invoke reconcile at the next backup time. But using CronJob is simpler and lets Kubernetes handle scheduling and retries for you, so it is recommended.
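For comparison, the requeue-based alternative boils down to computing the time until the next run. The sketch below handles only a fixed daily time in UTC, such as the "0 3 * * *" schedule from the example CR; parsing general cron expressions is deliberately left out, and `nextDailyRun` and `mustParse` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// nextDailyRun returns the first instant strictly after now that falls on
// hour:minute UTC. It covers daily schedules only, not general cron syntax.
func nextDailyRun(now time.Time, hour, minute int) time.Time {
	now = now.UTC()
	next := time.Date(now.Year(), now.Month(), now.Day(), hour, minute, 0, 0, time.UTC)
	if !next.After(now) {
		next = next.AddDate(0, 0, 1)
	}
	return next
}

func main() {
	now := mustParse("2026-06-15T03:00:12Z")
	next := nextDailyRun(now, 3, 0)
	// A reconciler would return next.Sub(now) as its RequeueAfter value.
	fmt.Println(next.Format(time.RFC3339), next.Sub(now))
}
```

Even this small version shows what the Operator would then own: time zones, missed runs, and retries, all of which a CronJob already handles.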

Failover and Leader Election — Concepts

Handling a dead primary is the hardest part of a database Operator. The core risk is **split-brain**: if the old primary "appears dead" due to a transient network partition and then recovers, two primaries accept writes simultaneously and the data diverges.

The principles of safe failover are as follows.

    Failover procedure (with safeguards)

    1. Monitor primary health (e.g., liveness probe every 5s)
    2. N consecutive failures -> mark "suspect" (no immediate promotion)
    3. Fencing: block writes from the old primary
       - remove old primary from the write Service selector
       - or isolate at the network/storage level
    4. Pick the most up-to-date replica (least replication lag)
    5. Promote that replica to primary
    6. Move the write Service label to the new primary
    7. Reconnect remaining replicas to the new primary
    8. Update status.primary, record an event
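Steps 2 and 4 of the procedure above are simple enough to sketch as pure functions. This is an illustration under assumptions: lag is measured in bytes, and the `replica` type and the helpers `suspect` and `pickCandidate` are hypothetical. It deliberately says nothing about fencing or consensus, which are the hard parts.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a toy view of one standby: whether it is healthy and how far
// behind the primary it is (lower lag means more up to date).
type replica struct {
	name     string
	healthy  bool
	lagBytes int64
}

// suspect reports whether the primary should be marked suspect. It becomes
// true only after threshold consecutive failed checks, never on one failure.
func suspect(consecutiveFailures, threshold int) bool {
	return consecutiveFailures >= threshold
}

// pickCandidate returns the healthy replica with the least lag. Ties keep the
// earlier entry, so the result is deterministic.
func pickCandidate(replicas []replica) (replica, error) {
	best := -1
	for i, r := range replicas {
		if !r.healthy {
			continue
		}
		if best == -1 || r.lagBytes < replicas[best].lagBytes {
			best = i
		}
	}
	if best == -1 {
		return replica{}, errors.New("no healthy replica to promote")
	}
	return replicas[best], nil
}

func main() {
	replicas := []replica{
		{"orders-db-1", true, 4096},
		{"orders-db-2", true, 128},
	}
	if suspect(3, 3) {
		c, err := pickCandidate(replicas)
		fmt.Println(c.name, err)
	}
}
```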

Leader election itself is safer to delegate to a proven consensus mechanism than to implement yourself. Real production Operators use one of:

- **Built-in Raft/consensus library**: systems like etcd or Consul with consensus baked in.

- **A dedicated HA agent like Patroni**: this is how the Zalando Postgres Operator works. Patroni uses etcd/Kubernetes as a distributed lock store to elect the leader.

- **Kubernetes Lease objects**: leveraging Kubernetes's own leader-election mechanism (coordination.k8s.io/Lease).

Implementing split-brain prevention yourself is, frankly, a PhD-thesis-grade problem in distributed systems. Hence the field's consensus: "do not reinvent database failover." Our Operator too should leave consensus to Patroni or the engine's built-in mechanism, and focus on observing that state and adjusting Kubernetes resources (Service labels, etc.).

Version Upgrades — Sequential Rolling

You must not swap all Pods at once during an upgrade. Replace one at a time while preserving availability. The StatefulSet RollingUpdate strategy provides the base framework, but databases add constraints.

    Upgrade order (Postgres-style)

    1. Verify all replicas are healthy (abort if not)
    2. Force one backup (for rollback safety)
    3. Upgrade replicas first (db-2 -> db-1 ...)
       - upgrade one -> verify Ready -> verify caught up -> next
    4. Finally the primary:
       - promote a replica to the new version (planned failover)
       - restart the old primary on the new version, rejoin as replica
    5. Update status.version
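The ordering in steps 3 and 4 above can be written down as a small pure function. A minimal sketch, assuming Pods are named name-ordinal as a StatefulSet does; `upgradeOrder` is a hypothetical helper.

```go
package main

import "fmt"

// upgradeOrder returns Pod names in the order they should be upgraded:
// replicas from the highest ordinal down, and the current primary last.
func upgradeOrder(name string, replicas, primaryOrdinal int) []string {
	order := make([]string, 0, replicas)
	for i := replicas - 1; i >= 0; i-- {
		if i != primaryOrdinal {
			order = append(order, fmt.Sprintf("%s-%d", name, i))
		}
	}
	// The primary goes last, and only if its ordinal is actually in range.
	if primaryOrdinal >= 0 && primaryOrdinal < replicas {
		order = append(order, fmt.Sprintf("%s-%d", name, primaryOrdinal))
	}
	return order
}

func main() {
	fmt.Println(upgradeOrder("db", 3, 0)) // primary is db-0
	fmt.Println(upgradeOrder("db", 3, 1)) // primary moved to db-1 after a failover
}
```

Taking the primary's ordinal as a parameter matters because, after a failover, the primary is no longer necessarily db-0.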

The key is a **health gate between each step**. After bringing up one Pod, do not blindly move on — verify replication has caught up before proceeding. The Operator's reconcile records this progress in status and advances each step via requeue.

    // Pseudo-code that advances the upgrade step by step
    func (r *DatabaseReconciler) reconcileUpgrade(ctx context.Context, db *dbv1alpha1.Database) (ctrl.Result, error) {
        target := db.Spec.Version

        // Find the highest-index out-of-date replica
        pod, found := r.findOutdatedReplica(ctx, db, target)
        if !found {
            return ctrl.Result{}, nil // all up to date
        }

        if !r.isHealthyAndCaughtUp(ctx, db) {
            // still catching up -> recheck shortly
            return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
        }

        if err := r.upgradePod(ctx, pod, target); err != nil {
            return ctrl.Result{}, err
        }

        // handled one -> reconcile again
        return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    }

It is wise to add validation that forbids skipping major versions. A Postgres major-version upgrade requires a separate procedure such as pg_upgrade rather than a simple image swap, so the Operator should reject a direct jump from 16 to 18 and either force going through 17 or require an explicit upgrade procedure.
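A check along those lines might look like the following. It is a sketch under assumptions: versions look like "16.3" with the major number first, and `major` and `validateUpgrade` are hypothetical helpers that could sit in a validating webhook.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// major extracts the leading major number from a version such as "16.3".
func major(version string) (int, error) {
	head, _, _ := strings.Cut(version, ".")
	return strconv.Atoi(head)
}

// validateUpgrade rejects major-version downgrades and jumps of more than one
// major version. Changes within the same major are allowed.
func validateUpgrade(from, to string) error {
	f, err := major(from)
	if err != nil {
		return fmt.Errorf("invalid current version %q: %w", from, err)
	}
	t, err := major(to)
	if err != nil {
		return fmt.Errorf("invalid target version %q: %w", to, err)
	}
	switch {
	case t < f:
		return fmt.Errorf("downgrade from major %d to %d is not supported", f, t)
	case t-f > 1:
		return fmt.Errorf("cannot jump from major %d to %d in one step", f, t)
	}
	return nil
}

func main() {
	fmt.Println(validateUpgrade("16.3", "16.4")) // same major: allowed
	fmt.Println(validateUpgrade("16.3", "18.1")) // skips 17: rejected
}
```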

status — Exposing Health Outward

At the end of each reconcile, update status so cluster users and other tools can know the database's state. The standard pattern is an array of metav1.Condition.

    func (r *DatabaseReconciler) updateStatus(ctx context.Context, db *dbv1alpha1.Database) error {
        var sts appsv1.StatefulSet
        if err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, &sts); err != nil {
            return client.IgnoreNotFound(err)
        }

        db.Status.ReadyReplicas = sts.Status.ReadyReplicas

        if sts.Status.ReadyReplicas == db.Spec.Replicas {
            db.Status.Phase = "Running"
            meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionTrue,
                Reason:  "AllReplicasReady",
                Message: "all database replicas are ready",
            })
        } else {
            db.Status.Phase = "Progressing"
            meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ReplicasNotReady",
                Message: "waiting for replicas to become ready",
            })
        }

        return r.Status().Update(ctx, db)
    }

A crucial principle: **only the Operator writes status.** Users write spec and read status. Using status as an input channel like spec is a common anti-pattern. Also enable the status subresource (via a Kubebuilder marker) so spec and status updates do not collide.

You can define print columns via markers for nicer kubectl output.

    // +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase`
    // +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.status.readyReplicas`
    // +kubebuilder:printcolumn:name="Primary",type=string,JSONPath=`.status.primary`
    // +kubebuilder:subresource:status

Observability — Metrics and Events

An operable Operator makes its own state observable. controller-runtime provides metrics like reconcile count, queue depth, and processing time out of the box. We add domain metrics (e.g., failover count, backup success/failure).

    var (
        failoverTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "database_failover_total",
                Help: "Number of failovers performed per database",
            },
            []string{"database", "namespace"},
        )
    )

    func init() {
        // Register with controller-runtime's registry so the counter is
        // served on the manager's metrics endpoint.
        metrics.Registry.MustRegister(failoverTotal)
    }

Also record significant events as Kubernetes events so users can trace what happened via kubectl describe.

    r.Recorder.Event(db, corev1.EventTypeNormal, "FailoverCompleted",
        fmt.Sprintf("promoted %s to primary", newPrimary))

Operational Scenarios — How It Actually Runs

Let me summarize, as scenarios, what the built Operator faces in the field.

    Scenario 1. Primary Pod lost to node failure
      -> the StatefulSet controller recreates the Pod and the scheduler places it on another node; the PVC follows
      -> meanwhile the HA agent promotes a replica, Operator moves the write Service label
      -> when the old Pod comes up on the new node, it rejoins as a replica

    Scenario 2. Disk full
      -> surface a warning condition in status, emit an event
      -> user increases spec.storage.size -> Operator expands the PVC (if allowed)

    Scenario 3. Load increase needs more read replicas
      -> spec.replicas 3 -> 5 -> StatefulSet scales -> new replica syncs from a base backup

    Scenario 4. Scheduled backup fails (expired S3 credentials)
      -> backup Job fails, status.lastBackupTime not updated
      -> an alert rule detects "no backup for X hours" and pages

The value of an Operator is that these scenarios run without a human's 3 a.m. response. But as in Scenario 4, a **path to call a human when automation fails** (alerting) must always accompany it. Silently failing automation is the most dangerous kind.
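The "no backup for X hours" check from Scenario 4 is small enough to sketch. In practice it would more likely be an alert rule over a metric than Go code; the version below only shows the logic, and `backupStale` and `mustParse` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// backupStale reports whether the last successful backup is older than maxAge.
// A nil lastBackup (no backup ever recorded) also counts as stale.
func backupStale(lastBackup *time.Time, now time.Time, maxAge time.Duration) bool {
	if lastBackup == nil {
		return true
	}
	return now.Sub(*lastBackup) > maxAge
}

func main() {
	last := mustParse("2026-06-15T03:00:12Z")
	now := mustParse("2026-06-16T06:00:00Z")
	// With a daily schedule, 26 hours leaves some slack before paging.
	fmt.Println(backupStale(&last, now, 26*time.Hour))
	fmt.Println(backupStale(nil, now, 26*time.Hour))
}
```

Treating "never backed up" as stale is the important design choice here: a missing timestamp must raise an alert, not silently pass.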

Off-the-Shelf vs. DIY — Trade-offs

By now one thing is clear: **building a proper database Operator is very hard.** So is it even worth building?

| Aspect | DIY | Off-the-shelf (CloudNativePG, etc.) |
| --- | --- | --- |
| Upfront cost | Very high | Low (helm install) |
| Split-brain safety | Must verify yourself (risky) | Years of production hardening |
| Backup/PITR | Build yourself | Built-in, proven |
| Customization | Unlimited | Within provided scope |
| Maintenance | Forever yours | Shared by the community |
| Learning value | Very high | Low |

The conclusion is clear. **For a production database, use an off-the-shelf Operator.** CloudNativePG and the Zalando Postgres Operator for PostgreSQL, and Strimzi for Kafka, have already worked through the pitfalls of split-brain, backup integrity, and version upgrades over years of production use. Reinventing them is almost always a loss.

DIY is worthwhile in two cases. First, **for learning** — nothing teaches the Operator pattern like a database. Second, **an org-specific stateful system with no off-the-shelf solution** — when moving a peculiar internal data store or legacy engine onto Kubernetes.

Pitfalls — What You Will Definitely Hit If You Build One

    [ ] Carelessly deleting PVCs: tie PVCs to the owner reference and the data
        evaporates when the CR is deleted. If you need data retention, exclude
        PVCs from GC and handle them under a separate policy.
    [ ] Blocking in reconcile: synchronously waiting on long work (backups) inside
        reconcile stalls the worker queue. Delegate to a separate Job/CronJob.
    [ ] Idempotency violations: creating a new backup Job on every reconcile call
        multiplies infinitely. Always check "does it already exist" first.
    [ ] Using status as input: relying on status alone to judge split-brain is risky.
    [ ] Infinite requeue: retrying immediately on error causes a storm. Use backoff.
    [ ] Excessive RBAC: a database Operator should not demand cluster-admin.
    [ ] Ignoring version skips: failing to block major-version jumps corrupts data.

The first one — PVCs and data retention — is the most fatal. "I deleted the Operator and production data went with it" is a real incident. Always have a policy that decouples CR deletion from data deletion (e.g., deletionPolicy: Retain).
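One way to express such a policy is a small decision function that the finalizer consults before touching any PVC. This is a sketch under assumptions: `DeletionPolicy` is a hypothetical spec field (the text above only names the value Retain as an example), and the PVC names are made up.

```go
package main

import "fmt"

// DeletionPolicy is a hypothetical spec field controlling what happens to
// data when the Database CR is deleted.
type DeletionPolicy string

const (
	Retain DeletionPolicy = "Retain"
	Delete DeletionPolicy = "Delete"
)

// pvcsToDelete returns the PVCs a finalizer may remove when the CR is deleted.
// Anything other than an explicit Delete keeps the data, so an empty or
// misspelled policy fails safe.
func pvcsToDelete(policy DeletionPolicy, pvcs []string) []string {
	if policy != Delete {
		return nil
	}
	// Copy so the caller's slice is never aliased.
	return append([]string(nil), pvcs...)
}

func main() {
	pvcs := []string{"data-orders-db-0", "data-orders-db-1"}
	fmt.Println(len(pvcsToDelete(Retain, pvcs))) // nothing is deleted
	fmt.Println(len(pvcsToDelete("", pvcs)))     // unset policy also keeps data
	fmt.Println(len(pvcsToDelete(Delete, pvcs))) // only an explicit opt-in deletes
}
```

Making retention the default means that forgetting to set the field costs disk space rather than production data.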

Conclusion

The journey of building a database Operator reveals both the true power and the limits of the reconcile loop. The pattern of reconciling StatefulSet/Service/PVC, delegating backups to a CronJob, and exposing health via status applies directly to any stateful Operator. That is the core muscle to take from this article.

But in the face of the deep swamp of distributed systems (split-brain and data integrity), you must be humble. It is wise to leave production databases to proven off-the-shelf Operators and limit DIY to learning or to in-house systems that genuinely have no alternative. The next article builds on this kind of judgment and lays out the best practices and anti-patterns that separate good Operators from bad ones.

References

- Kubebuilder Book: https://book.kubebuilder.io/

- Operator SDK: https://sdk.operatorframework.io/

- controller-runtime (pkg.go.dev): https://pkg.go.dev/sigs.k8s.io/controller-runtime

- Kubernetes Operator pattern: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

- kubebuilder GitHub: https://github.com/kubernetes-sigs/kubebuilder

- controller-runtime GitHub: https://github.com/kubernetes-sigs/controller-runtime

- StatefulSet docs: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

- CloudNativePG (reference implementation): https://cloudnative-pg.io/

- Zalando Postgres Operator (Patroni-based): https://github.com/zalando/postgres-operator

- Kubernetes leader election (Lease): https://kubernetes.io/docs/concepts/architecture/leases/

- Operator Capability Levels: https://sdk.operatorframework.io/docs/overview/operator-capabilities/