Understanding the Kubernetes Operator Pattern — Controllers, Reconciliation, and Codifying Operational Knowledge
Author: Youngju Kim (@fjvbn20031)
- Introduction — Moving Operator Knowledge into Code
- What Operators Solve — Stateful Workloads and Day-2 Operations
- The Control Loop and Reconciliation
- CRD + Controller = Operator
- Control Plane Extension Mechanisms — CRD vs. API Aggregation
- Operator Capability Levels — Five Stages of Maturity
- When to Use an Operator, and When Not To
- The Ecosystem — OperatorHub and Frameworks
- The Shift in Thinking from Imperative to Declarative
- The 2026 Context
- The Pattern Seen Through Real Operator Examples
- The Life of a CR — From Creation to Deletion
- Limits of the Pattern and Anti-patterns
- Clearing Up Common Misconceptions
- Conclusion
- References
Introduction — Moving Operator Knowledge into Code
When you first learn Kubernetes, you work with built-in resources like Deployment, Service, and ConfigMap. These have well-defined behavior, so applying YAML makes the cluster converge to your desired state on its own. Set a Deployment's replicas to 3 and three pods stay alive; if one dies, it comes back. This "declare it and it converges" experience is the core appeal of Kubernetes.
But real production environments are full of complex operational knowledge that built-in resources cannot express. Suppose you operate a PostgreSQL cluster. It does not end with spinning up three pods. You must decide which member is primary and which are replicas, promote a replica (failover) when the primary dies, take periodic backups, and follow a strict order during version upgrades. This operational procedure usually lives in an operator's head or in a runbook document.
The Operator pattern moves exactly this operator knowledge into code so that it runs automatically inside the cluster, 24 hours a day. First proposed by CoreOS in 2016, the concept is now a standard way to extend Kubernetes with application-specific automation. This article covers what Operators solve, how the underlying control loop and reconciliation work, and when to adopt or avoid them in practice.
What Operators Solve — Stateful Workloads and Day-2 Operations
Stateless vs. Stateful
Kubernetes' built-in controllers are excellent for stateless workloads. For something like a web server, where any dying pod can simply be replaced by another, a single Deployment is enough. Pods need no identity, and termination order does not matter.
The problem is stateful workloads. In databases, message brokers, distributed caches, and consensus-based systems, each instance has its own identity and state. StatefulSet provides stable network IDs and storage, but it does not know how to bootstrap a cluster, add members, or recover from failures. That is application-specific domain knowledge.
Day-1 vs. Day-2 Operations
Dividing operations along a time axis, we distinguish:
- Day-0: Design and planning
- Day-1: Installation and deployment (a one-time task)
- Day-2: Everything afterward — upgrades, scaling, backup/restore, failure recovery, configuration changes, security patches
Most tools handle Day-1 well. Installing with a Helm chart is easy. The truly hard part is Day-2. Helm renders and applies manifests only when someone runs an install or upgrade; between those runs nothing watches the application. If a human must intervene by hand when the primary dies, that is not automation.
Operators specialize in automating Day-2 operations. A controller continuously observes cluster state, and when it drifts from the desired state, it performs the action a human operator would have taken. Keeping the pager from going off at 3 a.m. is the core value of an Operator.
Comparison: Built-in vs. Helm vs. Operator
| Aspect | Built-in controllers | Helm | Operator |
|---|---|---|---|
| Primary target | Stateless | Packaging/install | Stateful/complex ops |
| Day-1 install | Manual YAML | Automated | Automated |
| Day-2 operations | Limited | Almost none | Core strength |
| Continuous reconciliation | Built-in only | None (one-time template) | Always running |
| Domain knowledge | None | None | Embedded in code |
| Learning curve | Low | Low | High |
The Control Loop and Reconciliation
The Declarative Model: Desired vs. Observed
Most Kubernetes behavior reduces to one simple principle: compare the desired state with the observed state, and close the gap. The user declares the desired state ("it should look like this"); the controller reads the cluster's observed state, compares the two, and acts only if they differ.
This perpetually running compare-and-adjust process is called the control loop, or reconcile loop. In pseudocode:
loop forever:
    desired = read the user-declared desired state
    observed = read the actual cluster state
    if desired != observed:
        take action to close the gap
    wait briefly, or block until an event fires
This simple loop holds up all of Kubernetes. The Deployment controller, the ReplicaSet controller, and the Node controller all follow this pattern. An Operator works on exactly the same principle, differing only in that its target is a user-defined resource.
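The loop can be made concrete with a small in-memory simulation. This is a minimal sketch rather than how any real controller is implemented: `Cluster`, `reconcile`, and `run_until_converged` are hypothetical names, and a real controller reads and writes through the API Server instead of a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for the observed cluster state."""
    pods: list = field(default_factory=list)
    next_id: int = 0

    def start_pod(self) -> None:
        self.pods.append(f"pod-{self.next_id}")
        self.next_id += 1

    def stop_pod(self) -> None:
        self.pods.pop()


def reconcile(desired_replicas: int, cluster: Cluster) -> bool:
    """Take one step toward the desired state; return True if anything changed."""
    observed = len(cluster.pods)
    if observed < desired_replicas:
        cluster.start_pod()
        return True
    if observed > desired_replicas:
        cluster.stop_pod()
        return True
    return False  # desired == observed: nothing to do


def run_until_converged(desired_replicas: int, cluster: Cluster, max_passes: int = 100) -> int:
    """Call reconcile repeatedly; return how many passes made a change."""
    changes = 0
    for _ in range(max_passes):
        if not reconcile(desired_replicas, cluster):
            break
        changes += 1
    return changes


cluster = Cluster()
run_until_converged(3, cluster)   # converge from 0 to 3 pods
cluster.pods.remove("pod-1")      # simulate a pod dying
run_until_converged(3, cluster)   # the loop notices the gap and replaces it
print(cluster.pods)
```

Note that `reconcile` never needs to be told what happened. It only compares counts, which is why the same function handles both the initial rollout and the later pod failure.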
The Flow as an ASCII Diagram
+-------------------+
|    User / Git     |
|   desired state   |
|    (apply CR)     |
+---------+---------+
          | apply
          v
+-------------------+        watch        +-----------------+
|    API Server     |<-------------------+|   Controller    |
|   (etcd store)    |                     |   (reconcile)   |
+---------+---------+                     +--------+--------+
          ^                                        |
          | status update                          | 1. read observed
          | / create-modify resources              | 2. compare desired
          |                                        | 3. close the gap
          +----------------------------------------+
               converging actual cluster state
The key point is that the controller never touches etcd directly; it always reads and writes state through the API Server. The controller is notified of changes via the API Server's watch mechanism and calls reconcile for the object that changed.
Idempotency Is Essential
The most important principle when designing a reconcile function is idempotency: the result must be the same no matter how many times you run it on the same input. Reconcile can be called dozens or hundreds of times on the same object. Watch events may arrive as duplicates, periodic re-runs happen, and retries occur after errors.
Therefore reconcile should be written as "check that X exists in the desired shape, and fix it if not," not "create X." Compare the following pseudocode.
Wrong approach (non-idempotent):
    create Deployment    # second call errors with "already exists"
Correct approach (idempotent):
    desired = buildDesiredDeployment()
    existing = get Deployment
    if not found:
        create desired
    else if existing != desired:
        update to desired
    # if already equal, do nothing
If you fail to be idempotent, the controller throws duplicate-creation errors, falls into an infinite loop, or keeps thrashing resources and destabilizing the cluster.
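The idempotent variant can be exercised against a fake API to see that repeated calls cause exactly one write. This is a sketch under stated assumptions: `FakeAPI` and `ensure_deployment` are hypothetical names, the image tag is illustrative, and real clients compare far richer objects than these small dictionaries.

```python
class AlreadyExists(Exception):
    pass


class FakeAPI:
    """Hypothetical in-memory stand-in for the API Server."""

    def __init__(self):
        self.objects = {}
        self.writes = 0

    def get(self, name):
        return self.objects.get(name)

    def create(self, name, spec):
        if name in self.objects:
            raise AlreadyExists(name)
        self.objects[name] = dict(spec)
        self.writes += 1

    def update(self, name, spec):
        self.objects[name] = dict(spec)
        self.writes += 1


def ensure_deployment(api: FakeAPI, name: str, desired: dict) -> str:
    """Idempotent: create if missing, update if different, otherwise do nothing."""
    existing = api.get(name)
    if existing is None:
        api.create(name, desired)
        return "created"
    if existing != desired:
        api.update(name, desired)
        return "updated"
    return "unchanged"


api = FakeAPI()
desired = {"replicas": 3, "image": "postgres:16"}  # illustrative values
results = [ensure_deployment(api, "orders-db", desired) for _ in range(3)]
print(results, api.writes)
```

Calling `api.create` directly a second time raises `AlreadyExists`, which is the failure mode of the non-idempotent version.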
CRD + Controller = Operator
CRD: Defining a New Resource Type
An Operator consists of two parts. The first is the CRD (CustomResourceDefinition). A CRD registers a new resource type (kind) with the Kubernetes API. For example, define a new kind called PostgresCluster, and users can work with it exactly like a built-in resource:
apiVersion: db.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3
  version: "16"
  storage: 50Gi
status:                 # written by the controller, not by the user
  phase: Running
  primary: orders-db-0
Once a CRD is registered, commands like kubectl get postgrescluster work, RBAC applies, and objects are stored in etcd. In other words, a CRD expands the vocabulary of the Kubernetes API. But a CRD alone does nothing; data is just stored in etcd.
Controller: Breathing Life into the Definition
The second part is the controller. The controller watches the CR (Custom Resource) defined above and, when one is created, modified, or deleted, runs reconcile to produce real behavior. Reading a PostgresCluster and creating the StatefulSets, Services, Secrets, and ConfigMaps it needs, managing primary election, setting up backup CronJobs: all of that lives in the controller code.
In summary, the following equation holds.
Operator = CRD (a new API type) + Controller (reconcile logic for that type)
If the CRD is "the language for saying what you want," the controller is "the operator's brain that makes it happen." Only when these two combine is domain-specific automated operation complete.
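One way to picture the equation: the CR is plain data, and the heart of the controller is a function from that data to the resources that should exist. The sketch below uses the PostgresCluster fields from the example above; the function name `build_children`, the image naming, and the two-Service layout are assumptions made for illustration, not the design of any particular Operator.

```python
def build_children(cr: dict) -> list:
    """Translate one PostgresCluster CR into the child resources it implies."""
    name = cr["metadata"]["name"]
    spec = cr["spec"]
    statefulset = {
        "kind": "StatefulSet",
        "name": name,
        "replicas": spec["replicas"],
        "image": f"postgres:{spec['version']}",  # assumed image naming
        "storage": spec["storage"],
    }
    # One Service for writes (primary) and one for reads (replicas): an assumed layout.
    services = [
        {"kind": "Service", "name": f"{name}-primary"},
        {"kind": "Service", "name": f"{name}-replicas"},
    ]
    return [statefulset] + services


cr = {
    "apiVersion": "db.example.com/v1",
    "kind": "PostgresCluster",
    "metadata": {"name": "orders-db"},
    "spec": {"replicas": 3, "version": "16", "storage": "50Gi"},
}
for child in build_children(cr):
    print(child["kind"], child["name"])
```

Because the function is pure (same CR in, same children out), it can be called on every reconcile and compared against what actually exists.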
Garbage Collection via Owner Reference
Child resources created by the controller (StatefulSets, etc.) usually carry an owner reference. This way, when the parent CR is deleted, Kubernetes' garbage collector automatically cleans up the child resources. Owner references, together with idempotency, are fundamental to Operator design. They let the controller track "what I created" and keep cleanup logic simple.
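The cascading effect of owner references can be simulated in a few lines. This is a simplified sketch: real objects carry a list of `ownerReferences` in their metadata and the real garbage collector supports several deletion policies, whereas here each object has a single hypothetical `owner_uid` field.

```python
def collect_garbage(objects: dict) -> dict:
    """Drop every object whose owner no longer exists, repeating until stable."""
    live = dict(objects)
    changed = True
    while changed:
        changed = False
        for uid, obj in list(live.items()):
            owner = obj.get("owner_uid")
            if owner is not None and owner not in live:
                del live[uid]
                changed = True
    return live


objects = {
    "cr-1": {"kind": "PostgresCluster", "owner_uid": None},
    "sts-1": {"kind": "StatefulSet", "owner_uid": "cr-1"},
    "pod-1": {"kind": "Pod", "owner_uid": "sts-1"},
    "svc-9": {"kind": "Service", "owner_uid": None},  # unrelated, no owner
}
del objects["cr-1"]                  # the user deletes the parent CR
remaining = collect_garbage(objects)
print(sorted(remaining))
```

The controller never lists "things to delete"; removing the parent is enough, and ownership does the rest.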
Control Plane Extension Mechanisms — CRD vs. API Aggregation
There are actually two ways to extend the Kubernetes API. Operators almost always use the first.
1. The CRD Approach
CRDs reuse Kubernetes' built-in storage (etcd) and API processing pipeline as-is. Register only the schema (OpenAPI v3), and Kubernetes handles validation, storage, and watch for you; serving several versions with different schemas additionally requires a conversion webhook. There is no separate API server to operate, which makes this approach far simpler. Nearly every Operator today uses the CRD approach.
2. The API Aggregation Approach
API Aggregation has you run your own extension API server and register it behind the main API Server. You can use your own storage backend and implement more complex validation and conversion logic. But you must operate, authenticate, and scale a separate server, which is a heavy burden. It is used only for special cases like metrics.k8s.io.
Comparison Table
| Aspect | CRD | API Aggregation |
|---|---|---|
| Storage | Shared etcd | Own backend possible |
| Operational burden | Low | High (server ops) |
| Validation | OpenAPI schema + webhooks | Arbitrary code |
| Typical use | Most Operators | Special cases (metrics, etc.) |
| Recommendation | Default choice | Only when truly needed |
Additionally, if you need more validation or default injection, attach admission webhooks (validating/mutating) alongside the CRD. The CRD + webhook combination allows fairly sophisticated extension without API Aggregation.
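The logic inside such webhooks is usually small. The sketch below shows the two roles as plain functions, leaving out the HTTP and AdmissionReview plumbing entirely. The rules themselves (supported versions, default values, storage format) are hypothetical policy chosen for the PostgresCluster example.

```python
import re

SUPPORTED_VERSIONS = {"15", "16"}  # hypothetical policy
STORAGE_PATTERN = re.compile(r"^[1-9][0-9]*(Mi|Gi|Ti)$")


def mutate(spec: dict) -> dict:
    """Mutating step: fill in defaults without touching fields the user set."""
    defaults = {"replicas": 1, "version": "16"}
    return {**defaults, **spec}


def validate(spec: dict) -> list:
    """Validating step: return a list of error messages (empty means admitted)."""
    errors = []
    if not isinstance(spec.get("replicas"), int) or spec["replicas"] < 1:
        errors.append("spec.replicas must be an integer >= 1")
    if spec.get("version") not in SUPPORTED_VERSIONS:
        errors.append("spec.version is not supported")
    if not STORAGE_PATTERN.match(str(spec.get("storage", ""))):
        errors.append("spec.storage must look like 50Gi")
    return errors


def admit(spec: dict):
    """Run mutation first, then validation, the order the API Server uses."""
    mutated = mutate(spec)
    return mutated, validate(mutated)


print(admit({"storage": "50Gi"}))
print(admit({"replicas": 0, "version": "9", "storage": "lots"})[1])
```

Rejecting a bad spec at admission time means reconcile never has to handle it, which keeps the controller code simpler.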
Operator Capability Levels — Five Stages of Maturity
The Operator Framework defines Operator maturity in five levels. It is a good yardstick for gauging where your Operator sits.
| Level | Name | Core capability |
|---|---|---|
| 1 | Basic Install | Automated install, basic config provisioning |
| 2 | Seamless Upgrades | Zero-downtime version upgrades, patch management |
| 3 | Full Lifecycle | Backup/restore, failover, scaling — full lifecycle |
| 4 | Deep Insights | Metrics/alerts/logs, workload analysis exposure |
| 5 | Auto Pilot | Autoscaling, auto-tuning, anomaly detection and self-healing |
- Level 1 (Basic Install): Apply one CR and the app installs. Most Operators start here.
- Level 2 (Seamless Upgrades): Change the version and it performs a rolling upgrade without downtime. For stateful apps, this is harder than it looks.
- Level 3 (Full Lifecycle): Automates real Day-2 operations like backup schedules, automatic failover, and member add/remove.
- Level 4 (Deep Insights): Exposes Prometheus metrics and emits meaningful status conditions and events.
- Level 5 (Auto Pilot): Scales by load, tunes performance, and detects anomalies to recover on its own.
Higher levels bring more value but raise implementation difficulty and maintenance cost sharply. Not every Operator needs to target Level 5.
When to Use an Operator, and When Not To
Good cases to use one
- Complex stateful applications: Databases, message brokers, and distributed storage where bootstrap, failover, and backup logic is domain-specific.
- Repetitive Day-2 operations: When procedures a human used to perform by hand from a runbook are clearly defined and frequently repeated.
- Deploying the same app across many teams/environments: Providing a standardized deployment through a single CR greatly improves consistency.
- When self-healing matters operationally: When you want to reduce the risk of a human needing to intervene during overnight failures.
Cases to avoid one
- Simple stateless apps: Building an Operator for something a Deployment + HPA covers is over-engineering.
- Install once and done: If there are almost no Day-2 operations, a Helm chart is more appropriate.
- No maintenance staff: An Operator is code. As Kubernetes versions rise, you must maintain it accordingly. Build it and neglect it, and it becomes debt.
- A good official Operator already exists: Search OperatorHub before building your own.
A one-line judgment criterion
"If there is a clear procedure an operator would have to get up at 3 a.m. to perform by hand, it repeats, and the value of automation exceeds the maintenance cost, consider an Operator."
Otherwise a simpler tool (Deployment, StatefulSet, Helm, GitOps) is almost always the right answer. Operators are powerful, but not free.
The Ecosystem — OperatorHub and Frameworks
Before building your own, it is important to check whether a proven Operator already exists.
- OperatorHub.io: A catalog of Operators published by the community and vendors. You can find Operators for most major software — PostgreSQL, Kafka, Redis, Prometheus, and more.
- OLM (Operator Lifecycle Manager): A meta-operations tool that installs, upgrades, and manages dependencies for Operators themselves. Think of it as the Operator for Operators.
- Kubebuilder: The de facto standard SDK for building Operators in Go. It provides project scaffolding and code generation on top of controller-runtime.
- Operator SDK: A higher-level tool that wraps Kubebuilder and also supports Helm- and Ansible-based Operators, not just Go.
The next article walks through building an Operator with Kubebuilder, and the one after goes deeper into the reconcile loop.
The Shift in Thinking from Imperative to Declarative
To truly understand Operators, you have to change how you think about automation. Traditional automation scripts are imperative: a list of procedures of the form "create this, configure that, then run this." The problem is that if the script fails midway, the system is left in a half-finished state. Rerun it and you get "already exists" errors or duplicates.
Reconcile is declarative. You declare only "in the end it should look like this," then compare with the current state and converge to that shape. No matter where it fails, the next reconcile runs from the beginning again and just finishes the unfinished parts. Summarizing this difference in a table:
| Aspect | Imperative script | Declarative reconcile |
|---|---|---|
| Expression | "Run this procedure" | "This should be the outcome" |
| After failure | Half-baked state, manual recovery | Next run auto-recovers |
| Re-run | Risky (duplicates/errors) | Safe (idempotent) |
| Drift | Not detected | Auto-corrected |
This shift in thinking is the essence of an Operator: express in code what should be rather than how to do it, and trust a loop that continually closes the gap. Get comfortable with this perspective and all of Kubernetes starts to look like one giant declarative system.
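The "after failure" and "re-run" rows of the table can be reproduced with a toy example. In this sketch both functions try to create the same three resources; the step names and the `Flaky` exception are hypothetical, and `state` is just a list standing in for the cluster.

```python
class Flaky(Exception):
    pass


STEPS = ["secret", "service", "statefulset"]


def imperative_install(state: list, fail_at=None) -> None:
    """A script: run each step in order, assuming a clean slate."""
    for step in STEPS:
        if step == fail_at:
            raise Flaky(step)
        if step in state:
            raise RuntimeError(f"{step} already exists")
        state.append(step)


def declarative_reconcile(state: list, fail_at=None) -> None:
    """For each thing that should exist, create it only if it is missing."""
    for step in STEPS:
        if step in state:
            continue
        if step == fail_at:
            raise Flaky(step)
        state.append(step)


def rerun_after_failure(install) -> str:
    """Fail once at 'service', then run again; report what happened."""
    state = []
    try:
        install(state, fail_at="service")
    except Flaky:
        pass  # first run stopped halfway: only "secret" exists
    try:
        install(state)
        return "converged"
    except RuntimeError as err:
        return f"stuck: {err}"


print(rerun_after_failure(imperative_install))
print(rerun_after_failure(declarative_reconcile))
```

The imperative version trips over its own leftovers, while the declarative version skips what already exists and finishes the rest.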
The 2026 Context
As of 2026, the Operator pattern has fully gone mainstream. Many cloud vendors' managed services are implemented internally as Operators, and platform engineering teams building internal developer platforms (IDPs) use CRDs and Operators to provide self-service abstractions. Kubebuilder supports Kubernetes 1.36 and Go 1.26, and controller-runtime has stabilized on the v0.24.x line.
A particularly notable change is on the security side. In the past, a separate kube-rbac-proxy sidecar was attached to protect the metrics endpoint, but now you use the authentication/authorization middleware provided by controller-runtime (WithAuthenticationAndAuthorization) to protect metrics without an extra container. The operational surface shrinks, making things simpler and safer.
The Pattern Seen Through Real Operator Examples
Rather than abstract explanation, real examples make it clearer what Operators automate. Let us look at a few representative open-source Operators.
Database Operators
A PostgreSQL or MySQL Operator is a textbook example of the Operator pattern. Listing the operations they automate:
- Bootstrapping the primary/replica topology
- Automatic failover and replica promotion on primary failure
- Periodic backups and point-in-time recovery (PITR)
- Zero-downtime minor version upgrades
- Integrated management of a connection pooler (e.g., PgBouncer)
All of this, which an operator would wake up at night to do manually, is replaced by a single CR declaration and the controller's reconcile. This is the value of Level 3 (Full Lifecycle) and above.
Monitoring Operators
The Prometheus Operator is another well-known example. It defines CRDs such as Prometheus, ServiceMonitor, and PrometheusRule, treating monitoring configuration itself as Kubernetes resources. Instead of editing a giant Prometheus config file directly, a developer creates a small ServiceMonitor CR that declares "collect my service's metrics." The Operator gathers these declarations, converts them into the actual Prometheus config, and triggers a reload.
Certificate Operators
cert-manager is an Operator that automates certificate issuance and renewal. Create a CR called Certificate, and the Operator talks to an ACME server (e.g., Let's Encrypt) to issue a certificate and renews it automatically before expiry. This is where time-based re-reconciliation shines (in controller-runtime terms, returning a result with RequeueAfter set). Certificate expiry does not generate a watch event, so the Operator must reconcile again on a schedule to check whether renewal is due.
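The scheduling decision behind time-based reconciliation is simple arithmetic. The sketch below is not cert-manager's implementation: the function name `requeue_after`, the 90-day lifetime, and the 30-day renewal window are assumptions chosen to illustrate how a controller might compute the delay until its next check.

```python
from datetime import datetime, timedelta, timezone


def requeue_after(not_after: datetime, renew_before: timedelta, now: datetime) -> timedelta:
    """Return how long to wait before the next reconcile; zero means renew now."""
    renew_at = not_after - renew_before
    return max(renew_at - now, timedelta(0))


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
expires = now + timedelta(days=90)                     # assumed certificate lifetime
wait = requeue_after(expires, renew_before=timedelta(days=30), now=now)
print(wait.days)
```

A controller would return this duration as its requeue delay, so the next reconcile lands at the moment renewal becomes due rather than relying on an event that never arrives.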
Extracting the common pattern
These examples reveal a recurring common structure.
1. The user declares the "desired outcome" with a small CR
2. The Operator converts it into multiple child resources / external calls
3. It continuously observes and corrects drift
4. Time-dependent work (renewal/backup) is handled by periodic reconcile
In other words, an Operator is an abstraction layer that "compresses complex operational intent into a simple declaration." The user states the outcome, and the Operator takes responsibility for the procedure that produces it.
The Life of a CR — From Creation to Deletion
To understand Operators deeply, it helps to follow what stages a single CR goes through from birth to death. Let us turn the abstract principle into a concrete flow.
1. Creation stage
When a user applies a CR, the API Server validates it and stores it in etcd. At that moment the controller's informer receives a create event via watch and enqueues the object key in the workqueue. The controller calls reconcile, discovers "the child resources this CR wants do not exist yet," and creates the needed Deployment, Service, and so on.
user apply
  -> API Server validate/store
  -> informer create event
  -> reconcile pass 1: create child resources
  -> reconcile pass 2: observe child state, update status
Interestingly, it usually does not finish in a single reconcile. When the first reconcile creates a Deployment, it takes time for that Deployment's pods to become ready. The controller is called again via RequeueAfter or child-resource watches, progressively updating status until readiness is achieved.
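Because the workqueue holds object keys rather than events, a burst of events for the same object collapses into a single pending reconcile. The sketch below shows only that deduplication; the `WorkQueue` class is a hypothetical simplification, and the real client-go workqueue also tracks in-flight items and applies rate limiting.

```python
from collections import deque


class WorkQueue:
    """Hypothetical FIFO queue of object keys that ignores keys already waiting."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key: str) -> None:
        if key not in self._pending:
            self._pending.add(key)
            self._queue.append(key)

    def get(self) -> str:
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._queue)


queue = WorkQueue()
# A burst of events: the CR is created, then its child resources change several times.
for event_key in ["default/orders-db", "default/orders-db", "default/billing-db", "default/orders-db"]:
    queue.add(event_key)

reconciled = []
while len(queue) > 0:
    reconciled.append(queue.get())  # reconcile would read the latest state by key
print(reconciled)
```

This is also why reconcile receives only a key and must read current state itself: the event that triggered it may describe a state that is already out of date.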
2. Steady-state operation stage
Once the CR reaches the desired state, reconcile confirms "nothing to do" and quietly ends. From here on, the controller acts as a watcher. If someone modifies the child Deployment by hand, watch detects it, reconcile runs again, and reverts it to the desired state. Even if pods die due to a node failure, the built-in controller revives the pods, and the Operator maintains domain-level consistency on top of that.
3. Change stage
When the user changes the CR's spec (e.g., replicas from 3 to 5), metadata.generation increases and a new reconcile is triggered. The controller computes the new desired state and updates the child resources. Because reconcile is written idempotently, it simply re-applies the entire desired shape without needing to track what changed item by item.
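A common convention is for the controller to record the generation it last acted on in status, so anyone can tell whether status reflects the latest spec. The sketch below assumes that convention (the `observedGeneration` field) and assumes the status subresource is enabled, so that status writes do not bump the generation. All function names are hypothetical.

```python
def update_object(obj: dict, spec=None, status=None) -> None:
    """Mimic the API Server: only a spec change bumps metadata.generation."""
    if spec is not None and spec != obj["spec"]:
        obj["spec"] = spec
        obj["metadata"]["generation"] += 1
    if status is not None:
        obj["status"] = status


def needs_reconcile(obj: dict) -> bool:
    """True when the controller has not yet acted on the latest spec."""
    return obj["status"].get("observedGeneration") != obj["metadata"]["generation"]


def reconcile(obj: dict) -> None:
    """Converge children to obj['spec'] (omitted), then record what was observed."""
    update_object(obj, status={"observedGeneration": obj["metadata"]["generation"]})


cr = {"metadata": {"generation": 1}, "spec": {"replicas": 3}, "status": {}}
reconcile(cr)
update_object(cr, spec={"replicas": 5})       # the user edits the spec
print(cr["metadata"]["generation"], needs_reconcile(cr))
```

Until the next reconcile runs, the mismatch between the two numbers signals that the reported status still describes the previous spec.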
4. Deletion stage and finalizers
When you delete a CR, there are two paths. Simple child resources linked by owner reference are cleaned up by the garbage collector automatically. But resources outside the cluster (e.g., a cloud load balancer, an external DB user, an S3 bucket) are unknown to Kubernetes and not auto-cleaned. This is where the finalizer appears.
A finalizer is a kind of "deletion lock" attached to an object. As long as a finalizer remains, the object is only marked for deletion (deletionTimestamp) and does not actually disappear. The controller sees the deletionTimestamp, performs external cleanup, and then removes the finalizer. Only when all finalizers are gone is the object actually deleted from etcd.
delete request
  -> deletionTimestamp set (object still exists)
  -> reconcile: perform external resource cleanup
  -> remove finalizer
  -> object actually deleted
Thanks to finalizers, an Operator can guarantee it does not miss "cleanup that must happen on deletion." For an Operator that handles external resources, finalizers are not optional but mandatory.
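The deletion flow can be traced end to end with a small simulation. This is a sketch, not real API Server behavior in full: `Store` is a hypothetical in-memory object store, the finalizer name is made up, and the `external` set stands in for something outside the cluster such as an S3 bucket.

```python
FINALIZER = "db.example.com/cleanup"  # hypothetical finalizer name


class Store:
    """Hypothetical API Server: deletion is deferred while finalizers remain."""

    def __init__(self):
        self.objects = {}

    def delete(self, name: str) -> None:
        self.objects[name]["deletionTimestamp"] = "now"  # mark for deletion
        self._maybe_remove(name)

    def remove_finalizer(self, name: str, finalizer: str) -> None:
        self.objects[name]["finalizers"].remove(finalizer)
        self._maybe_remove(name)

    def _maybe_remove(self, name: str) -> None:
        obj = self.objects[name]
        if obj.get("deletionTimestamp") and not obj["finalizers"]:
            del self.objects[name]


def reconcile(store: Store, name: str, external: set) -> None:
    obj = store.objects.get(name)
    if obj is None:
        return                                   # already gone
    if obj.get("deletionTimestamp"):
        external.discard(name)                   # clean up outside the cluster
        if FINALIZER in obj["finalizers"]:
            store.remove_finalizer(name, FINALIZER)
        return
    if FINALIZER not in obj["finalizers"]:
        obj["finalizers"].append(FINALIZER)      # add the lock before creating anything
    external.add(name)                           # create the external resource


store, buckets = Store(), set()
store.objects["orders-db"] = {"finalizers": []}
reconcile(store, "orders-db", buckets)
store.delete("orders-db")
print("orders-db" in store.objects, buckets)     # still present, bucket not yet cleaned
reconcile(store, "orders-db", buckets)
print("orders-db" in store.objects, buckets)
```

Adding the finalizer before creating the external resource matters: if the order were reversed, a deletion arriving in between would leave the external resource orphaned.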
Limits of the Pattern and Anti-patterns
Operators are powerful, but misused they make a system complex and fragile instead. Here are anti-patterns that show up often.
Anti-pattern 1: Everything as an Operator
The urge to wrap even simple deployments in an Operator is a common mistake. Introduce a CRD and controller for something that ends at Deployment + ConfigMap, and you only add abstraction operators must learn while making debugging harder. Ask "can't I do this with Helm?" first.
Anti-pattern 2: A side-effect bomb in reconcile
Reconcile must be idempotent. So what happens if you send an email inside reconcile, make a non-idempotent request to an external system, or increment a counter? Since reconcile can be called dozens of times, dozens of emails go out and the counter is wrecked. Side effects must be designed idempotently, or if you need a "once only" guarantee, leave a completion marker in status to prevent duplicates.
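The completion-marker idea looks like this in miniature. The sketch assumes hypothetical status fields `lastFailoverID` and `notifiedFailoverID` and a stand-in `send_email` function; none of these come from a real Operator.

```python
sent_emails = []


def send_email(to: str, subject: str) -> None:
    """Hypothetical non-idempotent side effect."""
    sent_emails.append((to, subject))


def reconcile(cr: dict) -> None:
    """Notify once per failover, no matter how many times reconcile runs."""
    status = cr.setdefault("status", {})
    failover_id = status.get("lastFailoverID")
    if failover_id is None:
        return
    if status.get("notifiedFailoverID") == failover_id:
        return  # marker present: already notified
    send_email("oncall@example.com", f"failover {failover_id} completed")
    status["notifiedFailoverID"] = failover_id  # persist the marker in status


cr = {"status": {"lastFailoverID": "f-1"}}
for _ in range(50):
    reconcile(cr)
print(len(sent_emails))
```

In a real cluster the controller can crash between sending and persisting the marker, so this gives at-least-once delivery rather than strictly once. If duplicates are unacceptable, the external system needs its own deduplication key.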
Anti-pattern 3: A giant single reconcile
Cram hundreds of lines of branching into one reconcile and it becomes impossible to reason about or test. Following the lifecycle stages described earlier (create, observe, change, delete), it is better to split the work into small idempotent functions, one per stage.
Anti-pattern 4: Status as the source of truth
Status is just a cache of what the controller observed, not the source of truth. The real state is always in the actual resources (the observed state). Branch on status alone and you make wrong decisions when status is stale.
A pre-adoption checklist
Before deciding to build an Operator, ask yourself the following.
[ ] Does an official/community Operator already exist? (check OperatorHub)
[ ] Is Helm + GitOps really not enough?
[ ] Is the Day-2 operation to automate clearly defined?
[ ] Are there the people and the will to maintain it long term?
[ ] Can you design reconcile idempotently?
If you can confidently answer "yes" to these five questions, adopting an Operator is a good choice. If even one gives you pause, it is worth reconsidering the simpler path once more.
Clearing Up Common Misconceptions
Let us address a few misconceptions that often arise when first encountering Operators.
| Misconception | Reality |
|---|---|
| "An Operator is magical automation" | In the end it is just operational logic a human wrote into the reconcile function. What the code does not know is not automated. |
| "Making a CRD makes it an Operator" | A CRD is just a data definition; without a controller it does nothing. |
| "Build it once and you are done" | It is living code that needs maintenance alongside Kubernetes version upgrades. |
| "Operators must always be written in Go" | Go is standard, but other approaches like Helm, Ansible, and Python (kopf) exist. |
| "Operators are only for stateful workloads" | Mostly, but there are other uses too, like complex config reconciliation or policy enforcement. |
The first misconception is especially important. An Operator is human judgment moved into code, not something smarter than a human. Put faulty operational logic into the reconcile function, and that mistake repeats automatically, 24 hours a day. That is why the quality and testing of Operator code can matter even more than for a regular application.
Conclusion
The essence of the Operator pattern is not flashy technology but the simple idea of "codifying operator knowledge as a reconcile loop over a declarative model." A loop that compares desired and observed to close the gap, the discipline of writing that loop idempotently, and the structure of expanding vocabulary with a CRD and breathing life with a controller — master just these three, and you can see through how any Operator works.
Just do not forget that an Operator is not the answer to every problem. Keep simple things simple, and reach for an Operator only when you truly need to automate complex Day-2 operations — that is the mature choice.
In the articles that follow, we move these concepts into real code. The next article builds your first Operator with Kubebuilder, and the one after digs deep into the internals of the reconcile loop and performance tuning. Now that you understand the principles, this is the best time to harden that understanding by building one with your own hands.
References
- Operator pattern (Kubernetes official docs)
- Extending with CustomResourceDefinitions (Kubernetes official docs)
- Kubebuilder Book
- Operator SDK documentation
- controller-runtime (pkg.go.dev)
- kubernetes-sigs/kubebuilder (GitHub)
- kubernetes-sigs/controller-runtime (GitHub)
- Operator Lifecycle Manager documentation
- OperatorHub.io
- Operator Capability Levels