Operator Best Practices and Anti-Patterns

Introduction — An Operator Is Harder to Tame Than to Build

The first time you build an Operator, you are thrilled just to see the reconcile function work. Run it in production for about six months, though, and you realize that building is only the start; taming is the main event. After an infinite loop hammers the API server, a security audit flags your overly broad RBAC, and a single upgrade shakes your entire workload, you finally grasp what "a well-built Operator" means.

This article lays out that bar. If the previous two articles covered what you can build (a catalog) and how to build it (a database Operator), this one is about how to build it well. We will go through the traits of a good Operator and the anti-patterns to avoid, with code, and then cover security, testing, operations, and maturity.

Four Traits of a Good Operator

1. A Small API

A good Operator gives the user only the minimal knobs. When the spec bloats, users do not know what to fill in, the Operator must validate every combination, and evolving without breaking backward compatibility becomes hard.

# Bad — a bloated API leaking internal implementation
spec:
  statefulSetName: my-db-sts
  podAntiAffinityWeight: 100
  replicationProtocol: streaming
  walSegmentSize: 16MB
  checkpointTimeout: 300s
  # ... dozens of low-level options

# Good — declare intent only
spec:
  replicas: 3
  version: "16.3"
  storage: { size: 50Gi }
  highAvailability: true

The principle is "take only what the user wants; let the Operator decide how." If the user needs to know walSegmentSize, the abstraction has failed. If advanced users need an escape hatch, isolate it in a separate advanced field or annotation, but keep the default path simple.
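As a rough sketch of how the small spec could look as Go types, the block below keeps the intent fields at the top level and puts the escape hatch behind a single optional pointer. The names DatabaseSpec, AdvancedSpec, and ApplyDefaults, and the default values, are assumptions made for illustration; they are not taken from the Operator built earlier in this series.

```go
package main

import "fmt"

// DatabaseSpec is a hypothetical intent-only spec: it states what the user
// wants and leaves the how to the Operator.
type DatabaseSpec struct {
	Replicas         int32       `json:"replicas,omitempty"`
	Version          string      `json:"version"`
	Storage          StorageSpec `json:"storage"`
	HighAvailability bool        `json:"highAvailability,omitempty"`
	// Advanced is the escape hatch; it stays nil on the default path.
	Advanced *AdvancedSpec `json:"advanced,omitempty"`
}

// StorageSpec carries the only storage knob the user has to think about.
type StorageSpec struct {
	Size string `json:"size"`
}

// AdvancedSpec isolates low-level knobs so they do not leak into the default API.
type AdvancedSpec struct {
	WALSegmentSize string `json:"walSegmentSize,omitempty"`
}

// ApplyDefaults fills in what the user left out. The values are illustrative.
func (s *DatabaseSpec) ApplyDefaults() {
	if s.Replicas == 0 {
		if s.HighAvailability {
			s.Replicas = 3
		} else {
			s.Replicas = 1
		}
	}
	if s.Storage.Size == "" {
		s.Storage.Size = "10Gi"
	}
}

func main() {
	spec := DatabaseSpec{Version: "16.3", HighAvailability: true}
	spec.ApplyDefaults()
	fmt.Printf("%+v\n", spec)
}
```

A user who never sets Advanced never has to learn what lives inside it, which is the point of keeping it separate.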

2. Idempotent reconcile

reconcile must yield the same result no matter how many times it is called with the same input. This is the foundation of the controller pattern, because reconcile can be re-invoked at any time for any reason (restart, duplicate events, periodic resync).

// Bad — creates a new resource each call (non-idempotent)
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	job := newBackupJob()             // new, uniquely named Job each time
	r.Create(ctx, job)                // 100 reconciles = 100 Jobs
	return ctrl.Result{}, nil
}

// Good: declare desired state; create if missing, update only on drift
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	desired := desiredBackupJob(req) // name derived from req, never random
	// mutateFn sets the desired fields on the object and must be deterministic too
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, desired, mutateFn)
	return ctrl.Result{}, err
}

A simple idempotency check: "If you call reconcile twice in a row, does the second call change nothing?" If so, it is idempotent. If not, something is making a change every time — the seed of an infinite reconcile loop.
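The check can be expressed in a few lines. This is a minimal sketch against an in-memory map standing in for the cluster; the store type and the reconcile signature are hypothetical and only illustrate the "second call is a no-op" property.

```go
package main

import "fmt"

// store is a stand-in for the cluster: object name -> replica count.
type store map[string]int

// reconcile converges the store toward the desired state and reports whether
// it had to write anything.
func reconcile(s store, name string, replicas int) (changed bool) {
	current, ok := s[name]
	if ok && current == replicas {
		return false // already converged: no write
	}
	s[name] = replicas
	return true
}

func main() {
	cluster := store{}
	first := reconcile(cluster, "orders-db", 3)
	second := reconcile(cluster, "orders-db", 3)
	fmt.Println(first, second) // prints "true false"
}
```

If the second call ever reports a change for unchanged input, that write is what will keep re-triggering the controller.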

3. Observability

If the people running the Operator cannot see inside it, debugging is impossible. A good Operator reveals itself through three channels.

The three channels of observability
 1. status.conditions  — current state and reason, declaratively
 2. Kubernetes events   — a timeline of incidents (kubectl describe)
 3. Prometheus metrics  — reconcile count/latency/errors, domain metrics

In particular, the Reason and Message of status.conditions must be human-readable and specific, for example "Ready=False, Reason=ImagePullBackOff, Message=cannot pull registry.example.com/db:16.3", so that whoever is on call can decide the next action.
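One detail that makes conditions useful is a transition time that moves only when the status actually flips. The sketch below shows that rule with a local Condition struct; it is a stand-in written for this example, not the Kubernetes metav1.Condition type, and setCondition, t0, and t1 are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Condition mirrors the general shape of a status condition.
type Condition struct {
	Type               string
	Status             string // "True", "False" or "Unknown"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// Fixed timestamps so the example is deterministic.
var (
	t0 = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	t1 = t0.Add(time.Minute)
)

// setCondition inserts or updates a condition by Type. The transition time
// moves only when Status changes, so repeated reconciles keep it stable.
func setCondition(conds []Condition, c Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].Status = c.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = c.Reason
		conds[i].Message = c.Message
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

func main() {
	notReady := Condition{
		Type: "Ready", Status: "False", Reason: "ImagePullBackOff",
		Message: "cannot pull registry.example.com/db:16.3",
	}
	conds := setCondition(nil, notReady, t0)
	conds = setCondition(conds, notReady, t1) // same status: time stays at t0
	fmt.Printf("%s=%s, Reason=%s, since %s\n",
		conds[0].Type, conds[0].Status, conds[0].Reason,
		conds[0].LastTransitionTime.Format(time.RFC3339))
}
```

Keeping the timestamp stable also helps idempotency: a reconcile that rewrites it on every pass changes status every time.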

4. Safe Upgrades

Both the Operator's own upgrade and the upgrade of the resources it manages must be safe. Keep CRD schemas backward-compatible and provide conversion webhooks for version translation. Replace managed workloads gradually (canary/rolling), not all at once.

Conditions for a safe upgrade
 - CRD: new fields optional + default, existing field meaning unchanged
 - Multiple API versions coexist (v1alpha1, v1beta1, v1) + conversion
 - Managed workloads: pass a health gate before the next step
 - A rollback path exists (backups, retained previous-version images)
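To illustrate the first two conditions, here is a minimal sketch of converting between two hypothetical API versions. SpecV1alpha1, SpecV1, toV1, and toV1alpha1 are invented for this example; a real conversion webhook would operate on the full typed objects rather than on bare spec structs.

```go
package main

import "fmt"

// SpecV1alpha1 is a hypothetical older version with a flat storage field.
type SpecV1alpha1 struct {
	Replicas    int32
	StorageSize string
}

// SpecV1 is a hypothetical newer version: storage is nested and one new
// optional field has been added.
type SpecV1 struct {
	Replicas         int32
	Storage          Storage
	HighAvailability bool // new and optional; zero value is the default
}

// Storage groups storage settings in the newer version.
type Storage struct{ Size string }

// toV1 upgrades an old object. The new field takes its default, so existing
// objects keep their meaning.
func toV1(in SpecV1alpha1) SpecV1 {
	return SpecV1{Replicas: in.Replicas, Storage: Storage{Size: in.StorageSize}}
}

// toV1alpha1 downgrades for clients that still read the old version.
// HighAvailability has no slot in v1alpha1 and is dropped in this sketch;
// real conversions commonly preserve such fields in an annotation.
func toV1alpha1(in SpecV1) SpecV1alpha1 {
	return SpecV1alpha1{Replicas: in.Replicas, StorageSize: in.Storage.Size}
}

func main() {
	old := SpecV1alpha1{Replicas: 3, StorageSize: "50Gi"}
	roundTripped := toV1alpha1(toV1(old))
	fmt.Println(roundTripped == old)
}
```

A round-trip check like the one in main is a cheap unit test to keep around whenever a new version is introduced.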

Seven Anti-Patterns

Anti-pattern 1: Side-Effect Spam in reconcile

reconcile should be a function that "converges current state to desired state." Yet you often see code inside reconcile that sends emails, calls external APIs, and fires Slack notifications. Since reconcile can be re-invoked dozens of times, these side effects occur repeatedly.

// Bad — a notification bomb on every reconcile
func (r *Reconciler) Reconcile(...) (ctrl.Result, error) {
	r.slackNotify("deployment started!")  // spams Slack on every re-invoke
	// ...
}

External side effects should fire once per state transition, not once per reconcile. Record a marker such as "already notified" in status and trigger only at the transition point. Because the status write can still fail after the notification has gone out, the receiver should tolerate an occasional duplicate.
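A minimal sketch of the transition-only trigger follows. The Status struct, its NotifiedPhase field, and notifyOnTransition are hypothetical names chosen for this example, and the send callback stands in for whatever Slack or email client the Operator would use.

```go
package main

import "fmt"

// Status is a hypothetical status block that remembers what was announced.
type Status struct {
	Phase         string
	NotifiedPhase string // last phase a notification was sent for
}

// notifyOnTransition calls send only when the phase differs from the one
// already announced, then records it. Repeated reconciles stay silent.
func notifyOnTransition(st *Status, phase string, send func(string) error) error {
	st.Phase = phase
	if st.NotifiedPhase == phase {
		return nil
	}
	if err := send("phase changed to " + phase); err != nil {
		return err // marker not set, so the next reconcile retries
	}
	st.NotifiedPhase = phase
	return nil
}

func main() {
	sent := 0
	send := func(msg string) error {
		sent++
		return nil
	}
	st := &Status{}
	for i := 0; i < 100; i++ {
		_ = notifyOnTransition(st, "Deploying", send)
	}
	fmt.Println(sent) // 100 reconciles, 1 notification
}
```

In a real controller the updated status still has to be persisted; until that write succeeds, a restart could send the message again.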

Anti-pattern 2: Using status Like spec

status is where the Operator writes what it observed; spec is where the user writes what they want. If users write values into status, or the Operator reads input from status, the two roles blur into debugging hell.

Correct role separation
  spec   : user writes -> Operator reads (desired)
  status : Operator writes -> user reads (observed)

Broken patterns (avoid)
  - letting users edit status.replicas
  - the Operator trusting status.targetVersion as input

Enabling the status subresource separates spec and status updates, reducing conflicts and permission confusion.

Anti-pattern 3: Infinite Requeue

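Requeuing unconditionally on every error or unmet condition makes the controller retry forever. With a zero or very short delay this becomes a storm in which the controller hammers the API server, which threatens the whole cluster. controller-runtime's default rate limiter does back off a bare Requeue: true, but swallowing the error this way hides the failure from the logs and from the reconcile error metric, so the loop spins unnoticed.

// Bad: swallows the error and retries forever
if err != nil {
	return ctrl.Result{Requeue: true}, nil  // failure is invisible in logs and error metrics
}

// Good: return the error and controller-runtime backs off exponentially
if err != nil {
	return ctrl.Result{}, err  // logged, counted, retried with backoff
}

// Intentional delay should be explicit
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

Principle: return the error and leave retries to controller-runtime's exponential backoff; use RequeueAfter only when nothing is wrong but you want to look again after a known interval. An unconditional Requeue: true is almost always a mistake, and recent controller-runtime releases deprecate the field in favor of RequeueAfter.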
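To make the backoff concrete, the sketch below computes per-item exponential delays. The 5ms base and 1000s cap are assumptions that are believed to match the defaults of client-go's per-item exponential limiter; verify them against the versions you run. The function is a simplified model, not the library's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed defaults for illustration only.
const (
	baseDelay = 5 * time.Millisecond
	maxDelay  = 1000 * time.Second
)

// backoff returns the delay applied after `failures` earlier failures of the
// same item: baseDelay doubled once per failure, capped at maxDelay.
func backoff(failures int) time.Duration {
	d := baseDelay
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for _, n := range []int{0, 1, 2, 5, 10, 20} {
		fmt.Printf("after %2d failures: wait %v\n", n, backoff(n))
	}
}
```

The first retries come within milliseconds, so transient errors recover quickly, while a permanently failing object soon settles at the cap instead of consuming API capacity.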

Anti-pattern 4: Broad RBAC

An Operator demanding cluster-admin or wildcard permissions (*) is one of the most common security anti-patterns. If the Operator is compromised, the whole cluster is compromised with it.

// Bad — all permissions on everything
// +kubebuilder:rbac:groups="*",resources="*",verbs="*"

// Good — only the needed verbs on the needed resources
// +kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch
// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch
// +kubebuilder:rbac:groups=batch,resources=cronjobs,verbs=get;list;watch;create;update;patch;delete

Kubebuilder's RBAC markers let you declare exactly the needed permissions in code and generate the manifest. It is a good tool for enforcing least privilege close to the code. Audit "is this verb really needed" regularly.
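The regular audit can be partly automated. As a minimal sketch, the function below flags wildcard entries in a rule; Rule is a local stand-in for an RBAC policy rule and wildcardFields is a hypothetical helper, so a real check would load the generated role manifest instead.

```go
package main

import "fmt"

// Rule is a local stand-in for an RBAC policy rule.
type Rule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// wildcardFields lists which parts of a rule contain "*".
func wildcardFields(r Rule) []string {
	var out []string
	check := func(name string, vals []string) {
		for _, v := range vals {
			if v == "*" {
				out = append(out, name)
				return
			}
		}
	}
	check("apiGroups", r.APIGroups)
	check("resources", r.Resources)
	check("verbs", r.Verbs)
	return out
}

func main() {
	bad := Rule{APIGroups: []string{"*"}, Resources: []string{"*"}, Verbs: []string{"*"}}
	good := Rule{
		APIGroups: []string{"apps"},
		Resources: []string{"statefulsets"},
		Verbs:     []string{"get", "list", "watch", "create", "update", "patch"},
	}
	fmt.Println("bad rule wildcards: ", wildcardFields(bad))
	fmt.Println("good rule wildcards:", wildcardFields(good))
}
```

Running a check like this in CI against the generated manifest turns "no wildcards" from a review habit into a gate.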

Anti-pattern 5: Cluster-Scope Overuse

A cluster-scoped Operator watching all namespaces is powerful but dangerous. If one tenant's CR triggers a bug, the whole cluster is affected, and RBAC naturally widens. Start namespace-scoped when possible and expand to cluster scope only when truly necessary.

Scope selection guide
 - Single team/namespace target -> prefer namespace scope
 - Multi-tenant platform -> cluster scope + strong isolation needed
 - Limiting watch to specific namespaces also cuts memory/load

Anti-pattern 6: Tight Webhook Coupling

Validating/mutating webhooks are powerful, but if a webhook dies, creation and modification of the related resources are all blocked. The failure of a single Operator Pod can become a single point of failure that paralyzes the cluster's API operations.

Safe webhook design
 - Choose failurePolicy carefully: Fail is safe but risks availability,
   Ignore prioritizes availability but risks bypassing validation
 - Narrow webhook scope via namespaceSelector/objectSelector
 - Exclude the webhook's own namespace (avoid bootstrap deadlock)
 - Run the webhook with multiple replicas for HA

The key is designing so core functionality works even without the webhook. A webhook is a convenience (validation/defaulting), not a hard dependency.

Anti-pattern 7: Uncached Direct Reads Everywhere

controller-runtime's default client serves reads from the informer cache. Bypassing it and querying the API server directly on every reconcile, especially with a List that has no namespace or label selector, makes API server load explode.

// Bad — List all Pods directly from the API server every time
pods := &corev1.PodList{}
r.APIReader.List(ctx, pods)  // bypasses cache + no selector

// Good: from the cache, narrowed by namespace and label selector
pods := &corev1.PodList{}
if err := r.List(ctx, pods,
	client.InNamespace(req.Namespace),
	client.MatchingLabels{"app": req.Name}); err != nil {
	return ctrl.Result{}, err
}

Multi-Tenancy and Security

An Operator built by a platform team is shared by many teams. You must isolate so one tenant cannot affect another's resources.

Threat                                 Defense
Cross-tenant CR interference           Namespace isolation, RBAC limiting CR access
Resource exhaustion (noisy neighbor)   ResourceQuota, LimitRange, caps in the CR
Privilege escalation                   Least privilege for the Operator SA, verify on delegation
Secret exposure                        Never write secrets to status, mask logs
Malicious input                        Reject dangerous specs via webhook/schema validation

In particular, exposing secrets in status, events, or logs is an easy mistake to overlook. Writing a connection string to status for debugging convenience can leak a password to any user whose RBAC allows reading the CR.
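One way to keep debugging convenience without the leak is to redact the password before the string goes anywhere visible. This is a minimal sketch that assumes the connection string is URL-shaped; redactDSN is a hypothetical helper, and the redaction itself comes from the standard library's url.URL.Redacted method.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// redactDSN returns a connection string that is safe to put in status,
// events or logs: the password is replaced and everything else is kept.
func redactDSN(dsn string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		// Do not echo the parse error: it may contain the raw input.
		return "", errors.New("invalid connection string")
	}
	return u.Redacted(), nil
}

func main() {
	safe, err := redactDSN("postgres://app:s3cret@db.example.com:5432/orders")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(safe) // prints "postgres://app:xxxxx@db.example.com:5432/orders"
}
```

Host, port, and database name survive, which is usually all that a person reading status needs.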

Do Not Lend the Operator's Privileges to Users

A common trap a platform Operator falls into is performing a user-requested action with its own powerful service account. This opens a privilege-escalation path where a low-privilege user does, via the Operator, what they cannot do themselves. The safe pattern is to first verify with a SubjectAccessReview "does this user actually have permission for this action."

// Delegated authorization check for the user-requested action
sar := &authzv1.SubjectAccessReview{
	Spec: authzv1.SubjectAccessReviewSpec{
		User: requestingUser,
		ResourceAttributes: &authzv1.ResourceAttributes{
			Namespace: ns,
			Verb:      "create",
			Resource:  "databases",
			Group:     "db.example.com",
		},
	},
}
if err := r.Create(ctx, sar); err != nil {
	return err
}
if !sar.Status.Allowed {
	// user not allowed -> Operator also refuses to do it on their behalf
	return fmt.Errorf("user %s is not allowed to create databases", requestingUser)
}

This pattern matters especially in self-service platforms. The Operator must be a "checkpoint of authority," not a "proxy of authority."

Resource Efficiency — Cache and Concurrency

In large clusters an Operator can easily become a memory hog. Caching all Pods loads tens of thousands of objects into memory.

Efficiency techniques
 - Limit cache scope: watch only specific namespaces/labels
 - Filter events with predicates: don't reconcile uninteresting changes
 - Tune concurrency with MaxConcurrentReconciles (too high storms the API,
   too low delays processing)
 - Narrow List scope with field/label selectors

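Predicates are especially powerful. For example, reconciling only when metadata.generation changes (that is, on spec changes) and ignoring status-only updates drastically reduces unnecessary reconciles. Keep in mind that this predicate also filters out metadata-only changes such as labels and annotations, so do not use it alone if the controller reacts to those.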

// Reconcile only on spec changes (generation increment)
builder.WithPredicates(predicate.GenerationChangedPredicate{})
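The idea behind that predicate can be sketched in a few lines. Meta and specChanged below are hypothetical stand-ins that model only the comparison; the real predicate works on full objects delivered in update events.

```go
package main

import "fmt"

// Meta holds the two fields that matter here. The API server bumps
// metadata.generation on spec changes; with the status subresource enabled,
// status writes change resourceVersion but leave generation alone.
type Meta struct {
	Generation      int64
	ResourceVersion string
}

// specChanged reports whether an update event should trigger a reconcile.
func specChanged(oldObj, newObj Meta) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	statusOnly := specChanged(
		Meta{Generation: 4, ResourceVersion: "100"},
		Meta{Generation: 4, ResourceVersion: "101"},
	)
	specEdit := specChanged(
		Meta{Generation: 4, ResourceVersion: "101"},
		Meta{Generation: 5, ResourceVersion: "102"},
	)
	fmt.Println(statusOnly, specEdit) // prints "false true"
}
```

This is also why a controller that writes status on every pass does not re-trigger itself once the filter is in place.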

Testing Culture

Operators interact with distributed systems, so testing is tricky — and therefore all the more important.

The test pyramid
 1. Unit tests: reconcile logic, helper functions (with a fake client)
 2. envtest: integration with a real API server + etcd
    (provided by controller-runtime, uses the kube-apiserver binary)
 3. e2e: deploy to a kind cluster and verify scenarios
 4. Chaos/fault injection: force Pod deletion, behavior under network partition

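envtest in particular is the core of Operator testing. It spins up a real API server, not a mock, and verifies the whole flow of CR creation -> reconcile -> resource creation. Because it runs only the API server and etcd, with no kubelet and no built-in controllers, Pods never actually start and owner-reference garbage collection does not run; leave those behaviors to e2e tests. Always include an "idempotency test" (verify no change after two reconcile calls) and a "deletion test" (verify finalizer cleanup).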

SRE-Style Operations — Alerts and Runbooks

An Operator is a service too. To be operable it needs SLIs/SLOs, alerts, and runbooks.

Example Operator operational metrics
 - reconcile error rate (controller_runtime_reconcile_errors_total)
 - reconcile latency (controller_runtime_reconcile_time_seconds) and
   queue wait time (workqueue_queue_duration_seconds)
 - workqueue depth (is backlog piling up)
 - domain SLIs (e.g., failover success rate, backup freshness)

Example alert rules
 - reconcile error rate high for 5 minutes -> warning
 - workqueue keeps growing -> suspect controller stall
 - no backup for over 24 hours -> page

Always attach a runbook link to alerts. An on-call engineer paged at 3 a.m. must immediately know "when this alert fires, check this and do that." The exception that the Operator could not automate is exactly where a human intervenes, and the runbook is the bridge.

Debugging — The Order to Look When the Operator Does Nothing

The most common page in operations is "I created the CR but nothing happened." Internalizing a diagnostic order finds the cause quickly most of the time.

 1. Does the CR actually exist with the right spec?
    kubectl get database orders-db -o yaml
 2. What do status.conditions say?
    -> the clue is usually in Reason/Message
 3. Is the Operator Pod alive and reconciling?
    kubectl logs -n operator-system deploy/db-operator
 4. Is RBAC blocking it? (Forbidden logs)
    kubectl auth can-i create statefulsets --as=system:serviceaccount:operator-system:db-operator
 5. Is the workqueue backed up? (check metrics)
    workqueue_depth, controller_runtime_reconcile_errors_total
 6. Event timeline
    kubectl describe database orders-db

The three most common causes: first, insufficient RBAC, so the Operator cannot create resources and the only trace is a Forbidden error in its logs. Second, a predicate that is too strict, so events never trigger reconcile. Third, a stuck finalizer: the cleanup logic keeps returning an error and deletion never finishes. Suspect these three first and you will have covered about half of the cases.

Finalizer Deadlock — When Deletion Never Finishes

If finalizer cleanup logic fails permanently (e.g., erroring while deleting an already-gone external resource), the CR stalls in Terminating. Cleanup logic must be idempotent so it "treats already-gone as success."

func (r *Reconciler) reconcileDelete(ctx context.Context, db *dbv1alpha1.Database) (ctrl.Result, error) {
	// Clean up external resources — already gone is not an error
	if err := r.cleanupExternalBackupBucket(ctx, db); err != nil {
		if !isNotFound(err) {
			return ctrl.Result{}, err // retry only on real errors
		}
	}
	// cleanup done -> remove finalizer -> Kubernetes actually deletes
	controllerutil.RemoveFinalizer(db, dbFinalizer)
	return ctrl.Result{}, r.Update(ctx, db)
}
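The isNotFound helper used above is not defined in the snippet. One possible shape, assuming the external storage client returns a sentinel error when the bucket is missing, is sketched below; ErrNotFound, errTimeout, cleanup, and the two fake delete functions are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrNotFound is assumed to be returned (possibly wrapped) by the
	// external client when the resource does not exist.
	ErrNotFound = errors.New("not found")
	errTimeout  = errors.New("timeout")
)

// isNotFound sees through wrapping done with %w.
func isNotFound(err error) bool {
	return errors.Is(err, ErrNotFound)
}

// cleanup deletes the external resource and treats "already gone" as success,
// so a repeated deletion attempt cannot wedge the finalizer.
func cleanup(deleteBucket func() error) error {
	if err := deleteBucket(); err != nil && !isNotFound(err) {
		return err
	}
	return nil
}

// wrappedGone simulates a delete call against a bucket that no longer exists.
func wrappedGone() error { return fmt.Errorf("delete bucket: %w", ErrNotFound) }

// timeout simulates a transient failure that should be retried.
func timeout() error { return errTimeout }

func main() {
	fmt.Println("already gone ->", cleanup(wrappedGone))
	fmt.Println("timeout      ->", cleanup(timeout))
}
```

Using errors.Is rather than comparing strings keeps the check working when intermediate layers wrap the error with extra context.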

In an emergency you can force-strip the finalizer (empty finalizers via kubectl patch) to proceed with deletion, but this is a last resort that may leave external resources orphaned.

Operator Maturity Roadmap

You can use the Operator Capability Levels, defined by the Operator Framework (a CNCF project), as a roadmap.

Level 1: Basic install
  - Install/configure via CR. Deployable with helm.

Level 2: Seamless upgrades
  - Perform managed-target version upgrades safely.

Level 3: Full lifecycle
  - Automate backup/restore, scaling, failure recovery.

Level 4: Deep insights
  - Observable via metrics/alerts/logs. Provides performance analysis.

Level 5: Auto pilot
  - Autoscaling, auto-tuning, anomaly detection, self-healing.

Not every Operator needs to target Level 5. Most in-house Operators are fine at Levels 2-3. What matters is knowing what level your Operator is at now and what it takes to reach the next.

Comprehensive Checklist

[API design]
[ ] Is the spec small and intent-centric (what)?
[ ] Are new fields optional + default to preserve backward compatibility?
[ ] Is there a multi-API-version + conversion strategy?

[reconcile]
[ ] Idempotent (second of two calls is a no-op)?
[ ] Do external side effects occur exactly once at the transition?
[ ] Are errors returned to rely on backoff (no infinite requeue)?
[ ] Is cleanup guaranteed via finalizers?

[Observability]
[ ] Is status.conditions human-readable?
[ ] Are significant incidents recorded as events?
[ ] Are domain metrics exposed?

[Security]
[ ] Is RBAC least-privilege (no wildcards)?
[ ] Are secrets kept out of status/events/logs?
[ ] Is it namespace-scoped where possible?

[Efficiency]
[ ] Is the cache/watch scope limited?
[ ] Do predicates filter unnecessary reconciles?
[ ] Is MaxConcurrentReconciles set appropriately?

[Testing/operations]
[ ] Is there integration verification via envtest?
[ ] Are there idempotency/deletion tests?
[ ] Are there alert rules and runbooks?

Conclusion

What separates a good Operator from a bad one is not flashy features but fundamentals. Idempotent reconcile, a small API, least-privilege RBAC, and observability: these plain-looking principles decide whether you end up facing 3 a.m. incidents, security audit findings, and endless maintenance debt.

To wrap up the three-part Operator journey: the first article showed a catalog of what you can build with Operators, the second got our hands dirty building a database Operator, and this one set the bar for building well. An Operator is a powerful tool for encoding operational knowledge into code, but never forget that the code is itself another system someone must operate. A well-built Operator reduces operational burden; a poorly built one creates new burden. May this article's checklist help you choose the good side at that fork.
