What You Can Build with Operators — A Real-World Catalog
Author: Youngju Kim (@fjvbn20031)
Introduction — What Does an Operator Actually Automate?
If you have used Kubernetes for any length of time, you have heard the word "Operator" countless times. Yet when asked "what is an Operator," most people give the abstract answer "a CRD plus a controller." The definition is correct, but it does not convey what problem it actually solves.
This article takes the case-study approach instead of definitions. Databases, messaging, caching, monitoring, certificates, secrets, GitOps, backup, service mesh, machine learning — we organize the most widely used Operators by domain and show concretely what operational labor each one replaced with code. After reading this catalog, you will have a tangible sense of "ah, this is what Operators are for."
The essence, stated up front, is this: an Operator is the knowledge of a human operator encoded into software. Take a procedure like "when the Postgres primary dies, promote a replica, update DNS, and verify the backup." Instead of a human waking at 3 a.m. to perform it, a controller performs it automatically in a reconcile loop. That is why stateful systems with complex operational procedures, like databases and message brokers, gain the most value from Operators.
How Operators Work (a 3-Minute Refresher)
Before the cases, let me briefly summarize the working principle every Operator shares.
User declares a CR (Custom Resource)
            |
            v
+------------------------+
| Operator (controller)  |
|   reconcile loop       |
|   desired vs actual    |
+-----------+------------+
            |
            v  create/update/delete what's missing
+------------------------+
| StatefulSet / Service  |
| ConfigMap / PVC / Job  |
| (real Kubernetes objs) |
+------------------------+
The user declares only the desired state: "a Postgres cluster, 3 replicas, version 16, daily backups." The Operator observes the actual state and acts to close the gap. This loop runs idempotently and repeatedly, converging the system toward the declared state.
This model is powerful because failure recovery comes for free. If a replica dies and actual drops to 2, the next reconcile notices the gap and brings it back to 3. No human intervention required.
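The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real Operator is implemented: the `Cluster` class, the `reconcile` function, and the `replica-N` naming are hypothetical stand-ins for the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for the actual cluster state."""
    replicas: set = field(default_factory=set)


def reconcile(desired_count, cluster):
    """Compare desired vs actual and return the list of actions taken.

    Idempotent: once the state matches, another call does nothing.
    """
    actions = []
    # Create whatever is missing, reusing the lowest free replica name.
    index = 0
    while len(cluster.replicas) < desired_count:
        name = f"replica-{index}"
        if name not in cluster.replicas:
            cluster.replicas.add(name)
            actions.append(f"create {name}")
        index += 1
    # Delete whatever is surplus (highest names first, for determinism).
    for name in sorted(cluster.replicas, reverse=True):
        if len(cluster.replicas) <= desired_count:
            break
        cluster.replicas.remove(name)
        actions.append(f"delete {name}")
    return actions


cluster = Cluster()
first = reconcile(3, cluster)          # converges from 0 to 3 replicas
cluster.replicas.discard("replica-1")  # simulate a replica dying
healed = reconcile(3, cluster)         # the next pass closes the gap
noop = reconcile(3, cluster)           # already converged, nothing to do
print(first, healed, noop)
```

Calling `reconcile` again with nothing changed returns an empty action list. That idempotence is what makes it safe to run the loop over and over.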
1. Databases — Where Operators Shine Most
Stateful databases are the killer application for Operators. Backup, restore, failover, scaling, version upgrades — all are delicate and risky operational tasks, which is precisely why automating them yields the greatest value.
CloudNativePG (PostgreSQL)
CloudNativePG, a CNCF Sandbox project, is designed to run Postgres "Kubernetes-natively." Notably it manages Pods directly rather than using a StatefulSet, for finer control during failover.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-prod
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.3
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backups/pg-prod
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
What this single YAML automates: a 1 primary + 2 replica topology, streaming replication setup, automatic failover (replica promotion) on primary failure, continuous backup to S3 (WAL archiving), and a 30-day retention policy. Every procedure once done by hand becomes a declaration.
Zalando Postgres Operator
Zalando built this Operator while running thousands of its own Postgres clusters. It uses Patroni for high availability, and its manifests are concise.
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "acid"
  numberOfInstances: 2
  postgresql:
    version: "16"
  volume:
    size: 10Gi
Vitess (MySQL Sharding)
Vitess horizontally shards MySQL to petabyte scale; it originated at YouTube. The Vitess Operator exposes Vitess-specific concepts — shards, cells, tablets — as CRDs. It shines at scales a single MySQL cannot handle.
2. Messaging — Strimzi (Apache Kafka)
Anyone who has operated Kafka knows it: brokers, ZooKeeper (or KRaft), topics, partitions, rebalancing, rolling upgrades — none of it is easy. Strimzi abstracts all of this into CRDs.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 3.9.0
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
Interestingly, Strimzi embeds Operators inside an Operator: the Topic Operator and User Operator let you declare topics and Kafka users as custom resources too.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000
Now topic creation can be managed via GitOps. "Add a topic" becomes "merge a PR."
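The retention.ms value in the KafkaTopic manifest is a raw millisecond count, which is easy to mistype in a pull request. As a quick sanity check (the helper name `to_retention_ms` is illustrative only), the arithmetic for a seven-day retention can be worked out like this:

```python
from datetime import timedelta

MS_PER_SECOND = 1000


def to_retention_ms(period):
    """Convert a retention period to the integer milliseconds Kafka expects."""
    return int(period.total_seconds()) * MS_PER_SECOND


# 7 days * 24 h * 60 min * 60 s * 1000 ms
retention_ms = to_retention_ms(timedelta(days=7))
print(retention_ms)  # 604800000, the value used in the manifest
```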
3. Caching — Redis Operator
Redis too becomes far more complex once you move beyond a single instance to Sentinel-based HA or Cluster-mode sharding. Several Redis Operators (e.g., Spotahome, OT-CONTAINER-KIT) automate this.
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover
spec:
  sentinel:
    replicas: 3
  redis:
    replicas: 3
    storage:
      persistentVolumeClaim:
        metadata:
          name: redisfailover-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
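Three Sentinels watch the master; on detecting a failure they promote a replica, and the application asks Sentinel for the new master. Sentinel performs the promotion itself, while the Operator deploys the Redis and Sentinel Pods, wires them together, and repairs the topology when it drifts.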
4. Monitoring — Prometheus Operator
The Prometheus Operator is among the most widely used Operators. As the core of kube-prometheus-stack, it lets you declare monitoring targets via CRDs instead of editing Prometheus config files directly.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
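Create a ServiceMonitor and the Operator regenerates Prometheus's scrape configuration automatically. Two details matter: the ServiceMonitor's own labels (here release: kube-prometheus-stack) must match the selector configured on the Prometheus resource, as they do in a default kube-prometheus-stack install, and port refers to a named port on the selected Service. Alert rules are declared via the PrometheusRule CRD.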
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate is high"
Alert rules become version-controlled code, and alert configuration ships alongside application deployments.
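To make the ServiceMonitor selection from earlier in this section concrete, here is a rough sketch of the matching logic. It is a simplification under stated assumptions: the dictionary shapes for Services are invented for illustration, and the real Operator generates Prometheus service-discovery and relabeling configuration rather than resolving targets itself.

```python
def matches(selector, labels):
    """True when every matchLabels pair is present on the object."""
    return all(labels.get(key) == value for key, value in selector.items())


def scrape_targets(service_monitor, services):
    """Return 'service:port' strings for Services the monitor selects."""
    spec = service_monitor["spec"]
    selector = spec["selector"]["matchLabels"]
    wanted_ports = {endpoint["port"] for endpoint in spec["endpoints"]}
    targets = []
    for svc in services:
        if not matches(selector, svc["labels"]):
            continue
        for port in svc["ports"]:
            if port in wanted_ports:
                targets.append(f'{svc["name"]}:{port}')
    return targets


# Mirrors the ServiceMonitor manifest; the Services are hypothetical.
monitor = {
    "spec": {
        "selector": {"matchLabels": {"app": "my-app"}},
        "endpoints": [{"port": "metrics"}],
    }
}
services = [
    {"name": "my-app", "labels": {"app": "my-app", "tier": "web"},
     "ports": ["http", "metrics"]},
    {"name": "other", "labels": {"app": "other"}, "ports": ["metrics"]},
]
targets = scrape_targets(monitor, services)
print(targets)
```

Extra labels on a Service (such as `tier` here) do not prevent a match; only the pairs listed under matchLabels are checked.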
5. Certificates — cert-manager
cert-manager automates the issuance, renewal, and distribution of TLS certificates. It supports issuers like Let's Encrypt (ACME), Vault, and internal CAs, with automatic renewal before expiry as its core value.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-tls
spec:
  secretName: example-tls-secret
  dnsNames:
    - app.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
cert-manager eradicates the classic "forgot to renew the cert and the site went down" incident. Declare a Certificate and the Operator runs the ACME challenge, fills the Secret with the certificate, and renews before expiry.
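The "renew before expiry" decision boils down to a date comparison. The sketch below is illustrative rather than cert-manager's actual code: the function names are hypothetical, and the default of renewing once one third of the lifetime remains is an assumption that should be checked against the documentation for the version in use.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, renew_before=None):
    """Return the moment at which renewal should start.

    If renew_before is not given, assume one third of the certificate
    lifetime (an assumption made for this sketch).
    """
    if renew_before is None:
        renew_before = (not_after - not_before) / 3
    return not_after - renew_before


def needs_renewal(now, not_before, not_after, renew_before=None):
    """True once the renewal moment has been reached."""
    return now >= renewal_time(not_before, not_after, renew_before)


# A 90-day certificate, the lifetime typical of Let's Encrypt.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
renew_at = renewal_time(issued, expires)
print(renew_at.date())
```

Because a reconcile loop re-evaluates this check on every pass, nobody has to remember the expiry date.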
6. Secrets — External Secrets Operator
You cannot store secrets in Git as plaintext. The External Secrets Operator (ESO) syncs secrets from external stores like AWS Secrets Manager, HashiCorp Vault, and GCP Secret Manager into Kubernetes Secrets.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/password
In Git you declare only "which secret to fetch"; the actual value stays in the external vault. The Operator syncs periodically, so rotating a secret externally is reflected automatically.
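The mapping from remote keys to Secret keys can be pictured as below. This is a sketch under assumptions: `remote_store` is a plain dictionary standing in for the external vault, and `build_secret_data` is a hypothetical helper, not ESO's API. A real sync would repeat this on every refreshInterval.

```python
import base64


def build_secret_data(data_spec, remote_store):
    """Map each secretKey to its remote value, base64-encoded as in Secret.data."""
    secret_data = {}
    for item in data_spec:
        value = remote_store[item["remoteRef"]["key"]]
        secret_data[item["secretKey"]] = base64.b64encode(value.encode()).decode()
    return secret_data


# Mirrors the `data` section of the ExternalSecret manifest.
data_spec = [{"secretKey": "password", "remoteRef": {"key": "prod/db/password"}}]

remote_store = {"prod/db/password": "old-value"}   # hypothetical vault contents
before = build_secret_data(data_spec, remote_store)

remote_store["prod/db/password"] = "rotated-value"  # rotation happens externally
after = build_secret_data(data_spec, remote_store)  # next refresh picks it up

print(before != after)
```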
7. GitOps — Argo CD
Argo CD itself works via the Operator pattern. The Application CRD declares "which path of which Git repo to sync to which cluster."
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/my-app-config
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
With selfHeal on, even if someone manually changes a resource in the cluster, Argo CD reverts it to the Git state. Git becomes the single source of truth.
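One way to picture what prune and selfHeal control is a three-way comparison between the manifests in Git and the live cluster. The sketch below is deliberately simplified and is not Argo CD's algorithm: `plan_sync` is a hypothetical function, manifests are reduced to comparable strings, and it does not distinguish drift caused by a new Git commit from drift caused by a manual change.

```python
def plan_sync(git, live, prune, self_heal):
    """Plan actions from two dicts mapping resource name -> manifest."""
    missing = sorted(name for name in git if name not in live)
    drifted = sorted(
        name for name in git if name in live and git[name] != live[name]
    )
    extra = sorted(name for name in live if name not in git)
    return {
        "create": missing,
        "revert": drifted if self_heal else [],  # undo manual changes
        "delete": extra if prune else [],        # remove what Git no longer has
    }


git = {"deploy": "replicas=3", "svc": "port=80", "cm": "v2"}
live = {"deploy": "replicas=5", "svc": "port=80", "debug-pod": "x"}

plan = plan_sync(git, live, prune=True, self_heal=True)
print(plan)
```

With both flags off, the same inputs would only create the missing resource and leave the manual edit and the stray Pod in place.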
8. Backup — Velero
Velero backs up and restores cluster resources and persistent volumes. It provides CRDs like Backup, Restore, and Schedule.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - production
    storageLocation: aws-s3
    ttl: 720h0m0s
This Schedule backs up the production namespace daily at 2 a.m. and keeps each backup for 30 days (ttl: 720h). Velero is a core tool for cluster migration and disaster recovery (DR) scenarios.
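The ttl field is written as a duration in hours, so it is worth checking that 720h really is the retention you intend. The sketch below parses only the simple hours/minutes/seconds form used in the manifest; `parse_ttl` and `expires_at` are hypothetical helpers, not part of Velero.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_ttl(ttl):
    """Parse a duration such as '720h0m0s' (hours, minutes, seconds only)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", ttl)
    if not ttl or match is None:
        raise ValueError(f"unsupported duration: {ttl!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def expires_at(created, ttl):
    """A backup becomes eligible for deletion once created + ttl has passed."""
    return created + parse_ttl(ttl)


ttl = parse_ttl("720h0m0s")
created = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
print(ttl.days, expires_at(created, "720h0m0s").isoformat())
```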
9. Service Mesh — Istio
Istio intercepts traffic via sidecars (or ambient mode) to provide routing, mTLS, and observability. Traffic policy is declared via CRDs like VirtualService, DestinationRule, and Gateway.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
Declare a canary as a 90/10 traffic split and the mesh enforces it. The v1 and v2 subsets must be defined in a DestinationRule for the reviews host. Istiod acts as the controller that translates these resources into Envoy configuration.
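Weighted routing of this kind amounts to a weighted random choice per request. The following is a conceptual sketch only: Envoy's actual load balancing is more involved, and `subset_for_point` and `pick_subset` are names made up for this illustration.

```python
import random


def subset_for_point(routes, point):
    """Map an integer point in [0, total weight) to a subset name."""
    cumulative = 0
    for subset, weight in routes:
        cumulative += weight
        if point < cumulative:
            return subset
    raise ValueError("point is outside the total weight")


def pick_subset(routes, rng):
    """Pick a subset at random, proportionally to its weight."""
    total = sum(weight for _, weight in routes)
    return subset_for_point(routes, rng.randrange(total))


routes = [("v1", 90), ("v2", 10)]  # mirrors the VirtualService weights
rng = random.Random(42)            # seeded so the run is repeatable
picks = [pick_subset(routes, rng) for _ in range(10_000)]
v2_share = picks.count("v2") / len(picks)
print(f"v2 received {v2_share:.1%} of requests")
```

Over many requests the share sent to v2 settles near 10%, though any short window can deviate, which is worth remembering when judging a canary on little traffic.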
10. Machine Learning — Kubeflow
Kubeflow is a platform for running ML workflows on Kubernetes, and it is a collection of Operators. The Training Operator, for instance, handles distributed training jobs via CRDs.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
The Operator handles the master-worker topology of distributed training, environment injection (RANK, WORLD_SIZE, etc.), and restarting failed workers. Data scientists can run distributed training without knowing the infrastructure.
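The environment injection can be pictured as follows for the 1 master + 4 worker job above. Treat the details as assumptions made for illustration: the Pod naming scheme, the default port, and the rank numbering (master 0, workers 1 onward) may differ from what a given Training Operator version does, and `distributed_env` is a hypothetical helper.

```python
def distributed_env(job_name, workers, master_port=23456):
    """Return {pod_name: env} for one master plus `workers` worker Pods."""
    world_size = 1 + workers
    master_addr = f"{job_name}-master-0"  # assumed Pod naming scheme
    common = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),  # assumed default port
        "WORLD_SIZE": str(world_size),
    }
    env = {master_addr: {**common, "RANK": "0"}}
    for index in range(workers):
        # Workers take the ranks after the master.
        env[f"{job_name}-worker-{index}"] = {**common, "RANK": str(index + 1)}
    return env


env = distributed_env("pytorch-dist", workers=4)
for pod, variables in env.items():
    print(pod, variables["RANK"], variables["WORLD_SIZE"])
```

The training script then only needs to read these variables, for example when initializing its process group, without knowing how the Pods were scheduled.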
Comparison Table — What Gets Automated, and How Mature
The Operator Framework's Capability Levels classify maturity into five stages, from Level 1 (basic install) to Level 5 (auto pilot: autoscaling, auto-tuning, anomaly detection).
| Operator | Domain | Key automation | Representative CRD | Maturity tendency |
|---|---|---|---|---|
| CloudNativePG | Database | Failover, backup, PITR | Cluster | Level 4-5 |
| Zalando Postgres | Database | HA (Patroni), replication | postgresql | Level 4 |
| Vitess | Database | Sharding, scale-out | VitessCluster | Level 4-5 |
| Strimzi | Messaging | Brokers, topics, upgrade | Kafka, KafkaTopic | Level 4-5 |
| Redis Operator | Cache | Sentinel failover | RedisFailover | Level 3-4 |
| Prometheus | Monitoring | Scrape config, alerts | ServiceMonitor | Level 4 |
| cert-manager | Certificates | Issuance, auto-renewal | Certificate | Level 5 |
| External Secrets | Secrets | External vault sync | ExternalSecret | Level 4 |
| Argo CD | GitOps | Sync, selfHeal | Application | Level 4 |
| Velero | Backup | Scheduled backup, restore | Schedule, Backup | Level 4 |
| Istio | Service mesh | Traffic, mTLS | VirtualService | Level 4 |
| Kubeflow Training | ML | Distributed training | PyTorchJob | Level 3-4 |
Note that maturity is a tendency, not an absolute grade. The same Operator feels different depending on which features you use.
Ten In-House Operator Ideas Worth Building
Beyond using off-the-shelf Operators, you can encode your organization's own operational knowledge into an Operator. Here are ideas worth building in-house.
1. Onboarding Operator
   - One Tenant CR creates namespace + RBAC + quota + default NetworkPolicy
2. Certificate/Domain Operator
   - On service registration, provision DNS record + cert + Ingress together
3. Cost-label Enforcement Operator
   - Reject or auto-tag workloads missing a cost-center label
4. Nightly Sleep Operator
   - Scale dev namespaces to replicas=0 outside business hours
5. Self-service Database Operator
   - A DevDatabase CR spins up an isolated test DB and deletes it on expiry
6. Secret Rotation Operator
   - Rotate API keys per internal policy and rolling-restart workloads
7. Backup Verification Operator
   - Restore a backup to a throwaway cluster to auto-verify integrity
8. Canary Analysis Operator
   - Evaluate post-deploy metrics to auto-promote or roll back
9. Compliance Scan Operator
   - Surface image vuln-scan results in CR status and block policy violations
10. Multi-cluster Placement Operator
   - A PlacementPolicy CR distributes workloads across clusters
What these ideas share: they are repetitive operational tasks with clear procedures that humans easily get wrong. That is exactly the Operator sweet spot.
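As an example of how small the core decision of such an Operator can be, here is a sketch of the Nightly Sleep idea from the list. Everything in it is hypothetical: the business hours, the weekday rule, and the `desired_replicas` function are assumptions chosen for illustration, and a real controller would also have to remember the original replica count somewhere, such as an annotation.

```python
from datetime import datetime, time


def desired_replicas(now, normal_replicas,
                     start=time(9, 0), end=time(19, 0)):
    """Return the replica count a dev workload should have at `now`.

    Outside business hours (assumed Mon-Fri, 09:00-19:00) scale to zero.
    """
    is_weekday = now.weekday() < 5          # Monday is 0, Sunday is 6
    in_hours = start <= now.time() < end
    return normal_replicas if (is_weekday and in_hours) else 0


monday_morning = datetime(2025, 1, 6, 10, 0)
monday_night = datetime(2025, 1, 6, 23, 0)
saturday = datetime(2025, 1, 4, 10, 0)

print(desired_replicas(monday_morning, 3),
      desired_replicas(monday_night, 3),
      desired_replicas(saturday, 3))
```

Because the function is a pure mapping from time to desired state, it slots directly into a reconcile loop: each pass computes the target and the loop closes the gap.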
Is It Worth Building? — Decision Criteria
Reinventing an existing Operator is waste; building one that isn't worth it becomes maintenance debt. Judge by the following criteria.
[Strong signals to build one]
- The procedure is clear and repetitive (a runbook already exists)
- Humans often make mistakes in it (3 a.m. responses, human error)
- It is an org-specific domain with no off-the-shelf Operator
- It maps naturally to a declarative form (clear desired state)
- It converges via reconcile (you can make it idempotent)
[Signals NOT to build one]
- It is a one-off task (a Job or script suffices)
- The procedure requires human judgment each time (automation is risky)
- A mature off-the-shelf Operator exists (don't reinvent cert-manager)
- There is no state and nothing to reconcile (a webhook/controller suffices)
- The team lacks controller-operation capacity (it becomes debt)
The last point matters most. An Operator is harder to maintain long-term than to build. CRD versioning, keeping up with Kubernetes upgrades, the risk of a reconcile bug mangling resources en masse — an ownerless Operator becomes a time bomb.
What to Check When Choosing an Off-the-Shelf Operator
When adopting rather than building, check the following.
| Check item | Why it matters |
|---|---|
| CNCF/official status | Signal of governance and longevity |
| Release cadence | Does it track new K8s versions quickly? |
| Capability Level | Does it automate failover/backup too? |
| CRD version policy | Has it moved past v1alpha1? |
| Security (RBAC scope) | Does it demand full cluster-admin? |
| Production references | Are there large-scale references? |
Conclusion
An Operator in one sentence is "operational knowledge embedded in a reconcile loop." Database failover, certificate auto-renewal, secret sync — all of it once meant a human waking at 3 a.m., and now a controller does it quietly.
If you take one thing from this catalog, let it be the question: "What is the most frequently repeated, most frequently botched operational procedure in my organization?" If the answer is clear and expressible declaratively, that is the candidate for your first Operator. In the next article, we will build the most challenging of them all — a database Operator — from scratch.
References
- Kubernetes Operator pattern docs: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
- Operator Framework: https://sdk.operatorframework.io/
- OperatorHub.io (Operator catalog): https://operatorhub.io/
- CloudNativePG: https://cloudnative-pg.io/
- Zalando Postgres Operator: https://github.com/zalando/postgres-operator
- Vitess: https://vitess.io/
- Strimzi (Kafka Operator): https://strimzi.io/
- Prometheus Operator: https://prometheus-operator.dev/
- cert-manager: https://cert-manager.io/
- External Secrets Operator: https://external-secrets.io/
- Argo CD: https://argo-cd.readthedocs.io/
- Velero: https://velero.io/
- Istio: https://istio.io/latest/docs/
- Kubeflow Training Operator: https://www.kubeflow.org/docs/components/training/
- Operator Capability Levels: https://sdk.operatorframework.io/docs/overview/operator-capabilities/