Introduction — Why Databases Are the Hardest
Kubernetes handles stateless applications well. Set replicas on a Deployment and Pods come up, die, come back, and it makes no difference which one died. Databases are different.
Database Pods are not equal. One is the primary that accepts writes; the rest are read-only replicas. Delete a Pod and its data may vanish with it. When the primary dies, someone must be promoted — and a wrong promotion splits the data (split-brain). Scaling down is not just reducing Pods; it entails data redistribution. Operating without backups is unthinkable.
All of these procedures are exactly the operational knowledge a human DBA carried in their head. In this article we encode that knowledge into an Operator, step by step, using Kubebuilder and controller-runtime. We will not implement a full production database engine, but the goal is to internalize **the core patterns of operating a stateful system with an Operator**.
For reference, as of 2026 Kubebuilder supports Kubernetes 1.36 / Go 1.26, controller-runtime is on the v0.24 line, and controller-tools on the v0.21 line. The kube-rbac-proxy sidecar used by older scaffolds has been removed; the metrics endpoint is instead protected with the WithAuthenticationAndAuthorization filter of the controller-runtime metrics server.
Design — Define the CR Spec First
Operator development starts not with code but with API design. You must first decide what the user gets to declare. Our Database CR looks like this.
apiVersion: db.example.com/v1alpha1
kind: Database
metadata:
  name: orders-db
spec:
  replicas: 3
  version: "16.3"
  storage:
    size: 50Gi
    storageClassName: fast-ssd
  backup:
    enabled: true
    schedule: "0 3 * * *"
    retention: 14
    destination: s3://my-backups/orders-db
status:
  phase: Running
  primary: orders-db-0
  readyReplicas: 3
  lastBackupTime: "2026-06-15T03:00:12Z"
The design principle is a **small API**. Expose only the minimal knobs the user needs to know: replicas, version, storage, backup. Never expose internal implementation details (the StatefulSet name, the replication protocol) in the spec. Those are the Operator's responsibility, not the user's concern.
In Go we define the types as follows. Kubebuilder markers generate the CRD's OpenAPI schema and validation.
// DatabaseSpec defines the desired state of Database.
type DatabaseSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=9
	// +kubebuilder:default=1
	Replicas int32 `json:"replicas"`
	// +kubebuilder:validation:Required
	Version string `json:"version"`
	Storage StorageSpec `json:"storage"`
	// +optional
	Backup *BackupSpec `json:"backup,omitempty"`
}

type StorageSpec struct {
	Size             string  `json:"size"`
	StorageClassName *string `json:"storageClassName,omitempty"`
}

type BackupSpec struct {
	Enabled     bool   `json:"enabled"`
	Schedule    string `json:"schedule"`
	Retention   int32  `json:"retention"`
	Destination string `json:"destination"`
}

// DatabaseStatus defines the observed state of Database.
type DatabaseStatus struct {
	Phase          string             `json:"phase,omitempty"`
	Primary        string             `json:"primary,omitempty"`
	ReadyReplicas  int32              `json:"readyReplicas,omitempty"`
	LastBackupTime *metav1.Time       `json:"lastBackupTime,omitempty"`
	Conditions     []metav1.Condition `json:"conditions,omitempty"`
}
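Limiting replicas to 1-9 is deliberate: the upper bound prevents reckless scaling at the API level. Note that Minimum/Maximum alone do not enforce odd counts (useful for quorum); that needs an additional rule, for example a CEL validation rule or a validating webhook. Mistakes you can block at the API level are best blocked at the API.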
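The Minimum and Maximum markers only bound the range. If odd counts should actually be enforced, one option is an extra check in a validating webhook. The following is a minimal, stdlib-only sketch of such a check; validateReplicas is a hypothetical helper name, not something Kubebuilder generates.

```go
package main

import "fmt"

// validateReplicas is a hypothetical helper that a validating webhook could
// call. The 1-9 bounds mirror the CRD markers; the odd-count rule is the
// extra policy that Minimum/Maximum cannot express on their own.
func validateReplicas(n int32) error {
	if n < 1 || n > 9 {
		return fmt.Errorf("replicas must be between 1 and 9, got %d", n)
	}
	if n%2 == 0 {
		return fmt.Errorf("replicas should be odd so a majority is unambiguous, got %d", n)
	}
	return nil
}
```

With this rule, 1, 3, and 5 are accepted while 4 and 11 are rejected. Whether an even count should be a hard error or only a warning is a policy choice for the API owner.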
reconcile — The Core Loop
The heart of an Operator is the Reconcile function. It takes the declaration "I want the Database CR to be in this state" and makes reality match it. The most important principle is **idempotency**: calling it any number of times with the same input must yield the same result.
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Named logger so it does not shadow the log package.
	logger := log.FromContext(ctx)
	// 1. Fetch the target CR
	var db dbv1alpha1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		// Already deleted: ignore
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// 2. Handle deletion (finalizer)
	if !db.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, &db)
	}
	if !controllerutil.ContainsFinalizer(&db, dbFinalizer) {
		controllerutil.AddFinalizer(&db, dbFinalizer)
		if err := r.Update(ctx, &db); err != nil {
			return ctrl.Result{}, err
		}
		logger.Info("added finalizer", "finalizer", dbFinalizer)
	}
	// 3. Ensure desired resources in order
	if err := r.reconcileHeadlessService(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	if err := r.reconcileStatefulSet(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	if err := r.reconcileBackupCronJob(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	// 4. Update status
	if err := r.updateStatus(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
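To make the idempotency requirement concrete, the create-if-missing, update-if-different behavior can be modeled without a cluster. This is a stdlib-only sketch against an in-memory map; objectSpec, ensure, and the result values are hypothetical names for illustration and are not the controller-runtime implementation.

```go
package main

// objectSpec is a stand-in for the part of a Kubernetes object we manage.
type objectSpec struct {
	Replicas int32
	Image    string
}

// result reports what ensure did, loosely analogous to an operation result.
type result string

const (
	resultCreated   result = "created"
	resultUpdated   result = "updated"
	resultUnchanged result = "unchanged"
)

// ensure drives the stored object toward desired. Calling it repeatedly with
// the same desired state changes nothing after the first call, which is the
// idempotency property a reconcile step needs.
func ensure(store map[string]objectSpec, name string, desired objectSpec) result {
	current, ok := store[name]
	switch {
	case !ok:
		store[name] = desired
		return resultCreated
	case current != desired:
		store[name] = desired
		return resultUpdated
	default:
		return resultUnchanged
	}
}
```

The first call for a name creates the entry, a second call with the same desired state reports unchanged, and only a real difference triggers an update.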
Each reconcileXxx function follows the same pattern: "describe the desired object, create it if missing, reconcile it if it differs." controller-runtime's CreateOrUpdate helper expresses this cleanly.
func (r *DatabaseReconciler) reconcileStatefulSet(ctx context.Context, db *dbv1alpha1.Database) error {
	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      db.Name,
			Namespace: db.Namespace,
		},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, sts, func() error {
		// ServiceName, Selector, and VolumeClaimTemplates are immutable once the
		// StatefulSet exists, so set them only on the initial create.
		if sts.CreationTimestamp.IsZero() {
			sts.Spec.ServiceName = db.Name + "-headless"
			sts.Spec.Selector = &metav1.LabelSelector{
				MatchLabels: labelsFor(db),
			}
			sts.Spec.VolumeClaimTemplates = pvcTemplatesFor(db)
		}
		// Fill in the mutable part of the desired spec (idempotent)
		sts.Spec.Replicas = &db.Spec.Replicas
		sts.Spec.Template = podTemplateFor(db)
		// owner reference: GC the StatefulSet when the CR is deleted
		return controllerutil.SetControllerReference(db, sts, r.Scheme)
	})
	return err
}
The **owner reference** is key here. Setting the StatefulSet's owner to the Database CR means Kubernetes's garbage collector cleans up the StatefulSet automatically when the CR is deleted. We also register a watch on owned StatefulSets (Owns(&appsv1.StatefulSet{}) in SetupWithManager) so that StatefulSet change events re-trigger reconcile.
StatefulSet, Service, PVC — Why These Three
When operating a stateful database on Kubernetes, these three resources form the skeleton.
+------------------------------------------+
|               Database CR                |
+-------------------+----------------------+
                    | reconcile
        +-----------+-----------+
        v           v           v
   StatefulSet  Headless     CronJob
    (N Pods)     Service    (backup)
        |           |
        v           v
   stable IDs   DNS records
   db-0, db-1   db-0.db-headless
   db-2 ...     (fixed per-Pod addr)
        |
        v
   VolumeClaimTemplate
   -> dedicated PVC per Pod
   -> data survives Pod restarts
There are three reasons to use a StatefulSet. First, **stable network identity**. Pod names are fixed in order (db-0, db-1), and combined with the headless Service they get fixed DNS like db-0.db-headless. A replica needs this stable address to find the primary.
Second, **ordered creation and deletion**. db-1 comes up only after db-0 is Ready. In database cluster bootstrap it is natural to initialize db-0 as primary and have the rest attach.
Third, **stable storage**. VolumeClaimTemplate attaches a dedicated PVC to each Pod, and the same PVC follows even after rescheduling. Data is preserved.
A headless Service is one with clusterIP None, providing per-Pod DNS records instead of load balancing.
apiVersion: v1
kind: Service
metadata:
  name: orders-db-headless
spec:
  clusterIP: None
  selector:
    app: orders-db
  ports:
    - port: 5432
      name: postgres
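With a headless Service in place, each StatefulSet Pod gets a DNS name of the form pod.service.namespace.svc.cluster-domain. The sketch below computes those names so a replica could be pointed at the primary; podDNSNames is a hypothetical helper, and the "default" namespace and "cluster.local" domain used in the usage note are assumptions about the cluster.

```go
package main

import "fmt"

// podDNSNames returns the per-Pod DNS names a headless Service gives the
// Pods of a StatefulSet: <sts>-<ordinal>.<service>.<namespace>.svc.<domain>.
func podDNSNames(stsName, serviceName, namespace, clusterDomain string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d.%s.%s.svc.%s",
			stsName, i, serviceName, namespace, clusterDomain))
	}
	return names
}
```

For example, podDNSNames("orders-db", "orders-db-headless", "default", "cluster.local", 3) yields orders-db-0.orders-db-headless.default.svc.cluster.local as its first entry.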
Keep a separate write Service that targets only the primary by adding a role label to its selector. On failover the Operator moves this label to route traffic to the new primary.
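The label move can be modeled on plain maps. This is a stdlib-only sketch, not client code: the label values "primary" and "replica" and the moveRoleLabel helper are assumptions for illustration, and in a real cluster each change is a separate API call, so the in-memory ordering here is only a model of the intent.

```go
package main

const (
	roleLabel   = "role"    // label key the write Service selects on
	rolePrimary = "primary" // assumed label value for the writable Pod
	roleReplica = "replica" // assumed label value for read-only Pods
)

// moveRoleLabel demotes any Pod currently labeled primary and then labels
// newPrimary. Demoting first means the model never holds two primaries.
// It returns false and changes nothing if newPrimary is unknown.
func moveRoleLabel(podLabels map[string]map[string]string, newPrimary string) bool {
	if _, ok := podLabels[newPrimary]; !ok {
		return false
	}
	for name, labels := range podLabels {
		if name != newPrimary && labels[roleLabel] == rolePrimary {
			labels[roleLabel] = roleReplica
		}
	}
	if podLabels[newPrimary] == nil {
		podLabels[newPrimary] = map[string]string{}
	}
	podLabels[newPrimary][roleLabel] = rolePrimary
	return true
}
```

After the call, a write Service whose selector includes role=primary matches only the newly promoted Pod.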
Periodic Backups — Creating a CronJob
Backups can be delegated to a CronJob the Operator creates. If spec.backup is enabled, reconcile a backup CronJob; if disabled, delete it.
func (r *DatabaseReconciler) reconcileBackupCronJob(ctx context.Context, db *dbv1alpha1.Database) error {
	name := db.Name + "-backup"
	cj := &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: db.Namespace},
	}
	// Clean up the CronJob if backups are off
	if db.Spec.Backup == nil || !db.Spec.Backup.Enabled {
		err := r.Delete(ctx, cj)
		return client.IgnoreNotFound(err)
	}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, cj, func() error {
		cj.Spec.Schedule = db.Spec.Backup.Schedule
		cj.Spec.JobTemplate.Spec.Template.Spec = backupPodSpec(db)
		return controllerutil.SetControllerReference(db, cj, r.Scheme)
	})
	return err
}
The backup Pod runs pg_dump or a physical backup tool and uploads the result to spec.backup.destination (e.g., S3). The retention policy is implemented by the backup script pruning old backups.
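The pruning step might look like the following sketch. It makes two assumptions that the spec above does not state: retention is read as "keep the N most recent backups", and each backup is identified by an RFC 3339 timestamp. backupsToPrune is a hypothetical helper name.

```go
package main

import (
	"sort"
	"time"
)

// backupsToPrune returns the timestamps of the backups to delete so that only
// the `keep` most recent ones remain. Timestamps must be RFC 3339 strings.
func backupsToPrune(timestamps []string, keep int) ([]string, error) {
	type entry struct {
		raw string
		at  time.Time
	}
	entries := make([]entry, 0, len(timestamps))
	for _, ts := range timestamps {
		at, err := time.Parse(time.RFC3339, ts)
		if err != nil {
			return nil, err
		}
		entries = append(entries, entry{raw: ts, at: at})
	}
	// Sort newest first so everything past index `keep` is a prune candidate.
	sort.Slice(entries, func(i, j int) bool { return entries[i].at.After(entries[j].at) })
	if keep < 0 {
		keep = 0
	}
	if len(entries) <= keep {
		return nil, nil
	}
	pruned := make([]string, 0, len(entries)-keep)
	for _, e := range entries[keep:] {
		pruned = append(pruned, e.raw)
	}
	return pruned, nil
}
```

If retention is meant as a number of days instead, the same structure applies with an age cutoff in place of the index cutoff.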
An alternative is an in-Operator scheduler instead of relying on CronJob: compute the requeue timing and re-invoke reconcile at the next backup time. But using CronJob is simpler and lets Kubernetes handle scheduling and retries for you, so it is recommended.
Failover and Leader Election — Concepts
Handling a dead primary is the hardest part of a database Operator. The core risk is **split-brain**: if the old primary "appears dead" due to a transient network partition and then recovers, two primaries accept writes simultaneously and the data diverges.
The principles of safe failover are as follows.
Failover procedure (with safeguards)
1. Monitor primary health (e.g., liveness probe every 5s)
2. N consecutive failures -> mark "suspect" (no immediate promotion)
3. Fencing: block writes from the old primary
- remove old primary from the write Service selector
- or isolate at the network/storage level
4. Pick the most up-to-date replica (least replication lag)
5. Promote that replica to primary
6. Move the write Service label to the new primary
7. Reconnect remaining replicas to the new primary
8. Update status.primary, record an event
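Steps 2 through 4 of the procedure above can be sketched as pure decision logic. This is an illustration under stated assumptions, not a safe failover implementation: the replica type, the LagBytes unit, and all function names are hypothetical, and the fencing flag stands in for whatever mechanism actually blocks writes on the old primary.

```go
package main

import "errors"

// replica is a hypothetical view of one standby as the Operator observes it.
type replica struct {
	Name     string
	Healthy  bool
	LagBytes int64 // assumed unit: bytes of replication lag
}

// primarySuspect models step 2: only N consecutive failed checks mark the
// primary as suspect, so a single missed probe never triggers promotion.
func primarySuspect(consecutiveFailures, threshold int) bool {
	return threshold > 0 && consecutiveFailures >= threshold
}

// pickPromotionCandidate models step 4: the healthy replica with the least
// lag wins. On a tie the earlier entry in the slice is kept.
func pickPromotionCandidate(replicas []replica) (replica, bool) {
	var best replica
	found := false
	for _, r := range replicas {
		if !r.Healthy {
			continue
		}
		if !found || r.LagBytes < best.LagBytes {
			best, found = r, true
		}
	}
	return best, found
}

// planFailover refuses to name a new primary unless the old one is fenced
// (step 3), which is the safeguard against two writable primaries.
func planFailover(oldPrimaryFenced bool, replicas []replica) (string, error) {
	if !oldPrimaryFenced {
		return "", errors.New("old primary is not fenced; refusing to promote")
	}
	candidate, ok := pickPromotionCandidate(replicas)
	if !ok {
		return "", errors.New("no healthy replica available for promotion")
	}
	return candidate.Name, nil
}
```

The important design choice is that planFailover returns an error rather than a best guess when fencing has not been confirmed; promoting anyway is exactly how split-brain happens.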
Leader election itself is safer to delegate to a proven consensus mechanism than to implement yourself. Real production Operators use one of:
- **Built-in Raft/consensus library**: systems like etcd or Consul with consensus baked in.
- **A dedicated HA agent like Patroni**: this is how the Zalando Postgres Operator works. Patroni uses etcd/Kubernetes as a distributed lock store to elect the leader.
- **Kubernetes Lease objects**: leveraging Kubernetes's own leader-election mechanism (coordination.k8s.io/Lease).
Implementing split-brain prevention yourself is, frankly, PhD-thesis-grade distributed systems work. Hence the field's consensus: "do not reinvent database failover." Our Operator should likewise leave consensus to Patroni or the engine's built-in mechanism, and focus on observing that state and adjusting Kubernetes resources (Service labels, etc.).
Version Upgrades — Sequential Rolling
You must not swap all Pods at once during an upgrade. Replace one at a time while preserving availability. The StatefulSet RollingUpdate strategy provides the base framework, but databases add constraints.
Upgrade order (Postgres-style)
1. Verify all replicas are healthy (abort if not)
2. Force one backup (for rollback safety)
3. Upgrade replicas first (db-2 -> db-1 ...)
- upgrade one -> verify Ready -> verify caught up -> next
4. Finally the primary:
- promote a replica to the new version (planned failover)
- restart the old primary on the new version, rejoin as replica
5. Update status.version
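The ordering in the list above can be sketched as a small function: replicas from the highest ordinal down, with the Pod currently acting as primary last. upgradeOrder and its parameters are hypothetical names, and the sketch assumes the usual StatefulSet naming of name-ordinal.

```go
package main

import "fmt"

// upgradeOrder returns Pod names in the order they should be upgraded:
// replicas from the highest ordinal down, then the current primary last.
func upgradeOrder(name string, replicas, primaryOrdinal int) ([]string, error) {
	if primaryOrdinal < 0 || primaryOrdinal >= replicas {
		return nil, fmt.Errorf("primary ordinal %d out of range for %d replicas",
			primaryOrdinal, replicas)
	}
	order := make([]string, 0, replicas)
	for i := replicas - 1; i >= 0; i-- {
		if i != primaryOrdinal {
			order = append(order, fmt.Sprintf("%s-%d", name, i))
		}
	}
	// The primary goes last, after a planned failover has moved writes away.
	order = append(order, fmt.Sprintf("%s-%d", name, primaryOrdinal))
	return order, nil
}
```

For a three-Pod cluster whose primary is orders-db-0, the order is orders-db-2, orders-db-1, orders-db-0.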
The key is a **health gate between each step**. After bringing up one Pod, do not blindly move on — verify replication has caught up before proceeding. The Operator's reconcile records this progress in status and advances each step via requeue.
// Pseudo-code that advances the upgrade step by step
func (r *DatabaseReconciler) reconcileUpgrade(ctx context.Context, db *dbv1alpha1.Database) (ctrl.Result, error) {
	target := db.Spec.Version
	// Find the highest-index out-of-date replica
	pod, found := r.findOutdatedReplica(ctx, db, target)
	if !found {
		return ctrl.Result{}, nil // all up to date
	}
	if !r.isHealthyAndCaughtUp(ctx, db) {
		// still catching up -> recheck shortly
		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}
	if err := r.upgradePod(ctx, pod, target); err != nil {
		return ctrl.Result{}, err
	}
	// handled one -> reconcile again
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
It is wise to add validation that forbids skipping major versions. A Postgres major-version upgrade changes the on-disk format and requires a separate procedure such as pg_upgrade, so the Operator should reject a "direct jump from 16 to 18" in the rolling path and either force going through 17 or require an explicit upgrade procedure.
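One conservative way to encode this check is sketched below. It takes the stricter reading: the rolling path accepts only changes within the same major version, and any major change is refused so that it goes through an explicit procedure. parseVersion and validateRollingUpgrade are hypothetical helpers, the "major.minor" format follows the "16.3" value used in the example CR, and rejecting downgrades is an added assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "16.3" into major 16 and minor 3. A bare "16" is
// treated as minor 0.
func parseVersion(v string) (int, int, error) {
	majorStr, minorStr, hasMinor := strings.Cut(v, ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil || major < 0 {
		return 0, 0, fmt.Errorf("invalid version %q", v)
	}
	minor := 0
	if hasMinor {
		minor, err = strconv.Atoi(minorStr)
		if err != nil || minor < 0 {
			return 0, 0, fmt.Errorf("invalid version %q", v)
		}
	}
	return major, minor, nil
}

// validateRollingUpgrade allows only same-major, non-decreasing changes.
func validateRollingUpgrade(current, target string) error {
	curMajor, curMinor, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgtMajor, tgtMinor, err := parseVersion(target)
	if err != nil {
		return err
	}
	switch {
	case tgtMajor != curMajor:
		return fmt.Errorf("major version change %d -> %d needs an explicit upgrade procedure, not a rolling update",
			curMajor, tgtMajor)
	case tgtMinor < curMinor:
		return fmt.Errorf("downgrade %s -> %s is not supported", current, target)
	}
	return nil
}
```

A check like this fits naturally in a validating webhook on update, where both the old and the new spec.version are available.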
status — Exposing Health Outward
At the end of each reconcile, update status so cluster users and other tools can know the database's state. The standard pattern is an array of metav1.Condition.
func (r *DatabaseReconciler) updateStatus(ctx context.Context, db *dbv1alpha1.Database) error {
	var sts appsv1.StatefulSet
	if err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, &sts); err != nil {
		return client.IgnoreNotFound(err)
	}
	db.Status.ReadyReplicas = sts.Status.ReadyReplicas
	if sts.Status.ReadyReplicas == db.Spec.Replicas {
		db.Status.Phase = "Running"
		meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
			Type:    "Ready",
			Status:  metav1.ConditionTrue,
			Reason:  "AllReplicasReady",
			Message: "all database replicas are ready",
		})
	} else {
		db.Status.Phase = "Progressing"
		meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
			Type:    "Ready",
			Status:  metav1.ConditionFalse,
			Reason:  "ReplicasNotReady",
			Message: "waiting for replicas to become ready",
		})
	}
	return r.Status().Update(ctx, db)
}
A crucial principle: **only the Operator writes status.** Users write spec and read status. Using status as an input channel like spec is a common anti-pattern. Also enable the status subresource (via a Kubebuilder marker) so spec and status updates do not collide.
You can define print columns via markers for nicer kubectl output.
// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase`
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Primary",type=string,JSONPath=`.status.primary`
// +kubebuilder:subresource:status
Observability — Metrics and Events
An operable Operator makes its own state observable. controller-runtime provides metrics like reconcile count, queue depth, and processing time out of the box. We add domain metrics (e.g., failover count, backup success/failure).
var (
	failoverTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "database_failover_total",
			Help: "Number of failovers performed per database",
		},
		[]string{"database", "namespace"},
	)
)

func init() {
	metrics.Registry.MustRegister(failoverTotal)
}
Also record significant events as Kubernetes events so users can trace what happened via kubectl describe.
r.Recorder.Eventf(db, corev1.EventTypeNormal, "FailoverCompleted",
	"promoted %s to primary", newPrimary)
Operational Scenarios — How It Actually Runs
The scenarios below summarize what the finished Operator faces in production.
Scenario 1. Primary Pod lost to node failure
-> the StatefulSet controller recreates the Pod and the scheduler places it on another node; the PVC follows (given network-attached storage)
-> meanwhile the HA agent promotes a replica, Operator moves the write Service label
-> when the old Pod comes up on the new node, it rejoins as a replica
Scenario 2. Disk full
-> surface a warning condition in status, emit an event
-> user increases spec.storage.size -> Operator expands the PVC (if the StorageClass allows volume expansion)
Scenario 3. Load increase needs more read replicas
-> spec.replicas 3 -> 5 -> StatefulSet scales -> new replica syncs from a base backup
Scenario 4. Scheduled backup fails (expired S3 credentials)
-> backup Job fails, status.lastBackupTime not updated
-> an alert rule detects "no backup for X hours" and pages
The value of an Operator is that these scenarios run without a human's 3 a.m. response. But as in Scenario 4, a **path to call a human when automation fails** (alerting) must always accompany it. Silently failing automation is the most dangerous kind.
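The "no backup for X hours" check from Scenario 4 can be sketched as follows. In practice this would more likely be an alert rule over a metric; the Go version only illustrates the logic. backupOverdue is a hypothetical helper, both timestamps are RFC 3339 strings in the form status.lastBackupTime uses, and treating an empty lastBackupTime as overdue is an assumption.

```go
package main

import "time"

// backupOverdue reports whether the last successful backup is older than
// maxAge at time `now`. An empty lastBackupTime means no backup has ever
// succeeded, which is treated as overdue.
func backupOverdue(lastBackupTime, now string, maxAge time.Duration) (bool, error) {
	nowT, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return false, err
	}
	if lastBackupTime == "" {
		return true, nil
	}
	last, err := time.Parse(time.RFC3339, lastBackupTime)
	if err != nil {
		return false, err
	}
	return nowT.Sub(last) > maxAge, nil
}
```

For a daily schedule, a maxAge slightly above 24 hours (for example 26 hours) tolerates a slow run while still catching a missed one.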
Off-the-Shelf vs. DIY — Trade-offs
By now one thing is clear: **building a proper database Operator is very hard.** So is it even worth building?
| Aspect | DIY | Off-the-shelf (CloudNativePG, etc.) |
| --- | --- | --- |
| Upfront cost | Very high | Low (helm install) |
| Split-brain safety | Must verify yourself (risky) | Years of production hardening |
| Backup/PITR | Build yourself | Built-in, proven |
| Customization | Unlimited | Within provided scope |
| Maintenance | Forever yours | Shared by the community |
| Learning value | Very high | Low |
The conclusion is clear. **For a production database, use an off-the-shelf Operator.** CloudNativePG and the Zalando Postgres Operator (and, for Kafka, Strimzi) have already worked through the pitfalls of split-brain, backup integrity, and version upgrades over years of production use. Reinventing them is almost always a loss.
DIY is worthwhile in two cases. First, **for learning** — nothing teaches the Operator pattern like a database. Second, **an org-specific stateful system with no off-the-shelf solution** — when moving a peculiar internal data store or legacy engine onto Kubernetes.
Pitfalls — What You Will Definitely Hit If You Build One
[ ] Carelessly deleting PVCs — tie PVCs to the owner reference and the data
evaporates when the CR is deleted. If you need data retention, exclude
PVCs from GC and handle them under a separate policy.
[ ] Blocking in reconcile — synchronously waiting on long work (backups) inside
reconcile stalls the worker queue. Delegate to a separate Job/CronJob.
[ ] Idempotency violations — creating a new backup Job on every reconcile call
multiplies infinitely. Always check "does it already exist" first.
[ ] Using status as input — relying on status alone to judge split-brain is risky.
[ ] Infinite requeue — retrying immediately on error causes a storm. Use backoff.
[ ] Excessive RBAC — a database Operator should not demand cluster-admin.
[ ] Ignoring version skips — failing to block major-version jumps corrupts data.
The first one — PVCs and data retention — is the most fatal. "I deleted the Operator and production data went with it" is a real incident. Always have a policy that decouples CR deletion from data deletion (e.g., deletionPolicy: Retain).
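A deletion policy of this kind can be reduced to one small decision in the finalizer. The sketch below is an assumption about how such a field could behave, not an existing API: the deletionPolicy type, its "Retain" and "Delete" values, and shouldDeleteData are hypothetical, and an unset policy defaults to keeping data.

```go
package main

import "fmt"

// deletionPolicy is a hypothetical spec field controlling what happens to
// PVCs when the Database CR is deleted.
type deletionPolicy string

const (
	policyRetain deletionPolicy = "Retain"
	policyDelete deletionPolicy = "Delete"
)

// shouldDeleteData decides whether the finalizer may remove PVCs. An empty
// policy is treated as Retain, and unknown values are rejected rather than
// guessed, so data is never deleted by accident.
func shouldDeleteData(p deletionPolicy) (bool, error) {
	switch p {
	case "", policyRetain:
		return false, nil
	case policyDelete:
		return true, nil
	default:
		return false, fmt.Errorf("unknown deletionPolicy %q", p)
	}
}
```

Defaulting to Retain means that forgetting to set the field costs some orphaned storage rather than production data.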
Conclusion
The journey of building a database Operator reveals both the true power and the limits of the reconcile loop. The pattern of reconciling StatefulSet/Service/PVC, delegating backups to a CronJob, and exposing health via status applies directly to any stateful Operator. That is the core muscle to take from this article.
But in the face of the deep swamp of distributed systems (split-brain, data integrity), humility is in order. It is wise to leave production databases to proven off-the-shelf Operators and limit DIY to learning or to in-house systems that genuinely have no alternative. The next article builds on this kind of judgment and lays out the best practices and anti-patterns that separate good Operators from bad ones.
References
- Kubebuilder Book: https://book.kubebuilder.io/
- Operator SDK: https://sdk.operatorframework.io/
- controller-runtime (pkg.go.dev): https://pkg.go.dev/sigs.k8s.io/controller-runtime
- Kubernetes Operator pattern: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
- kubebuilder GitHub: https://github.com/kubernetes-sigs/kubebuilder
- controller-runtime GitHub: https://github.com/kubernetes-sigs/controller-runtime
- StatefulSet docs: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
- CloudNativePG (reference implementation): https://cloudnative-pg.io/
- Zalando Postgres Operator (Patroni-based): https://github.com/zalando/postgres-operator
- Kubernetes leader election (Lease): https://kubernetes.io/docs/concepts/architecture/leases/
- Operator Capability Levels: https://sdk.operatorframework.io/docs/overview/operator-capabilities/