Chaos and Order


Introduction — What Does an Operator Actually Automate?

If you have used Kubernetes for any length of time, you have heard the word "Operator" countless times. Yet when asked "what is an Operator," most people give the abstract answer "a CRD plus a controller." The definition is correct, but it does not convey what problem it actually solves.

This article takes the case-study approach instead of definitions. Databases, messaging, caching, monitoring, certificates, secrets, GitOps, backup, service mesh, machine learning — we organize the most widely used Operators by domain and show concretely **what operational labor each one replaced with code**. After reading this catalog, you will have a tangible sense of "ah, this is what Operators are for."

The essence, stated up front, is this: an Operator is **the knowledge of a human operator encoded into software**. Consider a procedure like "when the Postgres primary dies, promote a replica, update DNS, and verify the backup." Instead of a human waking at 3 a.m. to perform it, a controller performs it automatically in a reconcile loop. That is why **stateful systems with complex operational procedures**, like databases and message brokers, gain the most value from Operators.

How Operators Work (a 3-Minute Refresher)

Before the cases, let me briefly summarize the working principle every Operator shares.

    User declares a CR (Custom Resource)
                |
                v
    +------------------------+
    | Operator (controller)  |
    | reconcile loop         |
    | desired vs actual      |
    +-----------+------------+
                |
                v  create/update/delete what's missing
    +------------------------+
    | StatefulSet / Service  |
    | ConfigMap / PVC / Job  |
    | (real Kubernetes objs) |
    +------------------------+

The user declares only the **desired state**: "a Postgres cluster, 3 replicas, version 16, daily backups." The Operator observes the actual state and acts to close the gap. This loop runs idempotently and repeatedly, converging the system toward the declared state.

This model is powerful because **failure recovery comes for free**. If a replica dies and actual drops to 2, the next reconcile notices the gap and brings it back to 3. No human intervention required.
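The loop described above can be sketched in a few lines of Python. This is a toy model, not a real controller: `ClusterState`, the `pg-N` replica names, and the `reconcile` function are all hypothetical, and a real Operator would talk to the Kubernetes API instead of mutating an in-memory set.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Toy stand-in for the real cluster: just a set of running replica names."""
    replicas: set[str] = field(default_factory=set)


def reconcile(desired_replicas: int, state: ClusterState) -> list[str]:
    """Compare desired vs actual and return the actions taken to close the gap."""
    actions = []
    # Create what is missing, reusing the lowest free index.
    index = 0
    while len(state.replicas) < desired_replicas:
        name = f"pg-{index}"
        if name not in state.replicas:
            state.replicas.add(name)
            actions.append(f"create {name}")
        index += 1
    # Delete what is surplus (highest names first, for determinism).
    for name in sorted(state.replicas, reverse=True):
        if len(state.replicas) <= desired_replicas:
            break
        state.replicas.remove(name)
        actions.append(f"delete {name}")
    return actions


state = ClusterState()
print(reconcile(3, state))      # first pass creates three replicas
print(reconcile(3, state))      # second pass does nothing: the loop is idempotent
state.replicas.discard("pg-1")  # simulate a replica dying
print(reconcile(3, state))      # next pass recreates only the missing replica
```

The key property is that `reconcile` never needs to know why the state drifted. A crash, a manual deletion, and a fresh install are all handled by the same comparison.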

1. Databases — Where Operators Shine Most

Stateful databases are the killer application for Operators. Backup, restore, failover, scaling, version upgrades — all are delicate and risky operational tasks, which is precisely why automating them yields the greatest value.

CloudNativePG (PostgreSQL)

CloudNativePG, a CNCF Sandbox project, is designed to run Postgres "Kubernetes-natively." Notably it manages Pods directly rather than using a StatefulSet, for finer control during failover.

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
    spec:
      instances: 3
      imageName: ghcr.io/cloudnative-pg/postgresql:16.3
      storage:
        size: 100Gi
      backup:
        barmanObjectStore:
          destinationPath: s3://my-backups/pg-prod
          s3Credentials:
            accessKeyId:
              key: ACCESS_KEY_ID
            secretAccessKey:
              key: SECRET_ACCESS_KEY
        retentionPolicy: "30d"

What this single YAML automates: a 1 primary + 2 replica topology, streaming replication setup, automatic failover (replica promotion) on primary failure, continuous backup to S3 (WAL archiving), and a 30-day retention policy. Every procedure once done by hand becomes a declaration.
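To make the "automatic failover" item concrete, here is a minimal sketch of the decision a failover controller has to make. It is illustrative only and is not CloudNativePG's actual algorithm: the `Instance` type, the `replay_lsn` field, and the "most WAL applied wins" rule are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    name: str
    role: str        # "primary" or "replica"
    healthy: bool
    replay_lsn: int  # hypothetical: how much WAL this instance has applied


def pick_new_primary(instances: list[Instance]) -> Optional[str]:
    """Return the replica to promote, or None if no failover is needed or possible."""
    primary = next((i for i in instances if i.role == "primary"), None)
    if primary is not None and primary.healthy:
        return None  # the primary is fine, nothing to do
    candidates = [i for i in instances if i.role == "replica" and i.healthy]
    if not candidates:
        return None  # nothing safe to promote; a human has to look at this
    # Prefer the replica that has applied the most WAL, to minimise data loss.
    return max(candidates, key=lambda i: i.replay_lsn).name


cluster = [
    Instance("pg-1", "primary", healthy=False, replay_lsn=1000),
    Instance("pg-2", "replica", healthy=True, replay_lsn=990),
    Instance("pg-3", "replica", healthy=True, replay_lsn=998),
]
print(pick_new_primary(cluster))
```

Even this toy version shows why the procedure is error-prone by hand: the operator must check health, rank candidates, and refuse to act when no safe candidate exists.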

Zalando Postgres Operator

Built by Zalando while running thousands of their own Postgres clusters. It embeds Patroni (an HA solution) and its manifests are concise.

    apiVersion: acid.zalan.do/v1
    kind: postgresql
    metadata:
    spec:
      teamId: "acid"
      numberOfInstances: 2
      postgresql:
        version: "16"
      volume:
        size: 10Gi

Vitess (MySQL Sharding)

Vitess horizontally shards MySQL to petabyte scale; it originated at YouTube. The Vitess Operator exposes Vitess-specific concepts — shards, cells, tablets — as CRDs. It shines at scales a single MySQL cannot handle.

2. Messaging — Strimzi (Apache Kafka)

Anyone who has operated Kafka knows it: brokers, ZooKeeper (or KRaft), topics, partitions, rebalancing, rolling upgrades — none of it is easy. Strimzi abstracts all of this into CRDs.

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
    spec:
      kafka:
        version: 3.9.0
        replicas: 3
        listeners:
          - name: tls
            port: 9093
            type: internal
            tls: true
        config:
          offsets.topic.replication.factor: 3
          min.insync.replicas: 2
        storage:
          type: persistent-claim
          size: 100Gi
      entityOperator:
        topicOperator: {}
        userOperator: {}

Interestingly, Strimzi **embeds Operators inside an Operator**. The Topic Operator and User Operator, deployed through the `entityOperator` section, let you declare topics and users as custom resources too.

    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaTopic
    metadata:
      labels:
        strimzi.io/cluster: my-cluster
    spec:
      partitions: 12
      replicas: 3
      config:
        retention.ms: 604800000

Now topic creation can be managed via GitOps. "Add a topic" becomes "merge a PR."
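What "declaring topics" means for the controller can be sketched as a diff between declared and existing topics. The function below is a simplified illustration, not Strimzi's Topic Operator: `plan_topic_changes` and the topic names are hypothetical. The one Kafka fact it relies on is that a topic's partition count can be increased but not decreased.

```python
def plan_topic_changes(desired: dict[str, dict], actual: dict[str, dict]) -> list[str]:
    """Compare declared topics with existing ones and list the changes to apply."""
    plan = []
    for name, spec in sorted(desired.items()):
        current = actual.get(name)
        if current is None:
            plan.append(f"create {name} partitions={spec['partitions']}")
        elif spec["partitions"] > current["partitions"]:
            plan.append(f"grow {name} {current['partitions']}->{spec['partitions']}")
        elif spec["partitions"] < current["partitions"]:
            # Kafka does not support reducing a topic's partition count.
            plan.append(f"reject {name}: cannot shrink partitions")
    return plan


declared = {"orders": {"partitions": 12}, "events": {"partitions": 6}}
existing = {"events": {"partitions": 3}}
for step in plan_topic_changes(declared, existing):
    print(step)
```

In a GitOps setup, `declared` corresponds to the KafkaTopic resources merged into the repository, and the plan is what the controller would carry out after the merge.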

3. Caching — Redis Operator

Redis too becomes far more complex once you move beyond a single instance to Sentinel-based HA or Cluster-mode sharding. Several Redis Operators (e.g., Spotahome, OT-CONTAINER-KIT) automate this.

    apiVersion: databases.spotahome.com/v1
    kind: RedisFailover
    metadata:
    spec:
      sentinel:
        replicas: 3
      redis:
        replicas: 3
        storage:
          persistentVolumeClaim:
            metadata:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi

Three Sentinels watch the master; on detecting a failure they promote a replica, and the application asks Sentinel for the new master. The Operator owns this entire failover orchestration.
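The Sentinel behaviour described above boils down to a quorum vote followed by a promotion. The sketch below models only that core idea. The quorum of 2 and the "first replica wins" choice are assumptions for the example; real Sentinel ranks candidates by priority and replication offset.

```python
def current_master(master: str, replicas: list[str],
                   down_votes: dict[str, bool], quorum: int) -> str:
    """Return which node should be treated as master after the Sentinels vote."""
    if sum(down_votes.values()) < quorum:
        return master  # not enough Sentinels agree the master is down
    if not replicas:
        raise RuntimeError("master is down and no replica is available")
    # Simplification: promote the first replica in the list.
    return replicas[0]


votes = {"sentinel-0": True, "sentinel-1": True, "sentinel-2": False}
print(current_master("redis-0", ["redis-1", "redis-2"], votes, quorum=2))
```

Running three Sentinels rather than one is what makes the vote meaningful: a single Sentinel that loses its own network connection cannot trigger a failover on its own.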

4. Monitoring — Prometheus Operator

The Prometheus Operator is among the most widely used Operators. As the core of kube-prometheus-stack, it lets you declare monitoring targets via CRDs instead of editing Prometheus config files directly.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        release: kube-prometheus-stack
    spec:
      selector:
        matchLabels:
          app: my-app
      endpoints:
        - port: metrics
          interval: 30s
          path: /metrics

Create a ServiceMonitor and the Operator updates Prometheus's scrape config automatically. Alert rules are declared via the PrometheusRule CRD.
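The selection step can be sketched as plain label matching. This is a simplified model and not the Prometheus Operator's implementation, which generates Kubernetes service-discovery configuration rather than a static list. The `selects` and `scrape_targets` helpers and the service data are hypothetical.

```python
def selects(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def scrape_targets(match_labels: dict[str, str], port_name: str,
                   services: list[dict]) -> list[tuple[str, int]]:
    """Return (service name, port number) for services that match and expose the named port."""
    targets = []
    for svc in services:
        if selects(match_labels, svc["labels"]) and port_name in svc["ports"]:
            targets.append((svc["name"], svc["ports"][port_name]))
    return targets


services = [
    {"name": "my-app", "labels": {"app": "my-app", "tier": "web"}, "ports": {"metrics": 9090}},
    {"name": "other", "labels": {"app": "other"}, "ports": {"metrics": 9090}},
    {"name": "my-app-admin", "labels": {"app": "my-app"}, "ports": {"http": 8080}},
]
print(scrape_targets({"app": "my-app"}, "metrics", services))
```

Note that a Service with extra labels still matches, while a matching Service without a port named `metrics` is skipped. Both cases are common sources of "why is my target missing" questions.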

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
    spec:
      groups:
        - name: my-app
          rules:
            - alert: HighErrorRate
              expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.05
              for: 10m
              labels:
                severity: critical
              annotations:
                summary: "5xx error rate is high"

Alert rules become version-controlled code, and alert configuration ships alongside application deployments.
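The `for: 10m` clause in the rule above is easy to misread, so here is a small sketch of its effect: the condition must hold continuously for the whole duration before the alert fires. This models the idea only and is not Prometheus's evaluation engine. The `alert_state` function and the sample data are hypothetical.

```python
def alert_state(samples: list[tuple[int, float]], threshold: float, for_seconds: int) -> str:
    """samples are (timestamp_seconds, value) in time order; return the state at the last sample."""
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
        else:
            active_since = None    # any dip resets the timer
    if active_since is None:
        return "inactive"
    last_ts = samples[-1][0]
    return "firing" if last_ts - active_since >= for_seconds else "pending"


# One sample per minute for ten minutes, always above the 0.05 threshold.
steady = [(t, 0.10) for t in range(0, 601, 60)]
print(alert_state(steady, threshold=0.05, for_seconds=600))
```

A brief recovery in the middle of the window resets the timer, which is why a flapping error rate may stay in the pending state and never page anyone.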

5. Certificates — cert-manager

cert-manager automates the issuance, renewal, and distribution of TLS certificates. It supports issuers like Let's Encrypt (ACME), Vault, and internal CAs, with automatic renewal before expiry as its core value.

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
    spec:
      secretName: example-tls-secret
      dnsNames:
        - app.example.com
      issuerRef:
        kind: ClusterIssuer

cert-manager largely eliminates the classic "forgot to renew the cert and the site went down" incident. Declare a Certificate and, with an ACME issuer such as Let's Encrypt, cert-manager completes the ACME challenge, writes the issued certificate into the Secret, and renews it before expiry.
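"Renews before expiry" is a date calculation at heart. The sketch below shows one way to compute it. The rule of renewing when one third of the lifetime remains is an assumption for this example, so check cert-manager's `renewBefore` documentation for the actual behaviour of your version. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(not_before: datetime, not_after: datetime,
                 renew_before: Optional[timedelta] = None) -> datetime:
    """Return the moment at which renewal should start."""
    lifetime = not_after - not_before
    if renew_before is None:
        # Assumption for this sketch: renew when one third of the lifetime remains.
        renew_before = lifetime / 3
    return not_after - renew_before


def needs_renewal(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    return now >= renewal_time(not_before, not_after)


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)  # a 90-day certificate
print(renewal_time(issued, expires))
print(needs_renewal(issued + timedelta(days=45), issued, expires))
```

Because the check runs on every reconcile, a missed renewal attempt is simply retried on the next pass instead of silently turning into an expired certificate.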

6. Secrets — External Secrets Operator

You cannot store secrets in Git as plaintext. The External Secrets Operator (ESO) syncs secrets from external stores like AWS Secrets Manager, HashiCorp Vault, and GCP Secret Manager into Kubernetes Secrets.

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
    spec:
      refreshInterval: 1h
      secretStoreRef:
        kind: SecretStore
      target:
      data:
        - secretKey: password
          remoteRef:
            key: prod/db/password

In Git you declare only "which secret to fetch"; the actual value stays in the external vault. The Operator syncs periodically, so rotating a secret externally is reflected automatically.
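The periodic sync can be sketched as "fetch, compare, write only if changed." This is a toy model, not ESO's implementation: the external store is a plain dictionary, and `sync_secret` and the digest bookkeeping are hypothetical.

```python
import hashlib


def sync_secret(remote_value: str, k8s_secret: dict) -> bool:
    """Copy the remote value into the secret if it changed; return True if updated."""
    digest = hashlib.sha256(remote_value.encode()).hexdigest()
    if k8s_secret.get("digest") == digest:
        return False  # already up to date, nothing to write
    k8s_secret["data"] = {"password": remote_value}
    k8s_secret["digest"] = digest
    return True


external_store = {"prod/db/password": "s3cr3t-v1"}  # stand-in for a vault
secret: dict = {}

print(sync_secret(external_store["prod/db/password"], secret))  # first sync writes
print(sync_secret(external_store["prod/db/password"], secret))  # unchanged, no write
external_store["prod/db/password"] = "s3cr3t-v2"                # rotated externally
print(sync_secret(external_store["prod/db/password"], secret))  # picked up on next sync
```

With `refreshInterval: 1h`, a rotation in the external store may take up to an hour to reach the cluster, which is worth keeping in mind when planning rotations.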

7. GitOps — Argo CD

Argo CD itself works via the Operator pattern. The Application CRD declares "which path of which Git repo to sync to which cluster."

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app-config
        targetRevision: main
        path: overlays/prod
      destination:
        server: https://kubernetes.default.svc
        namespace: my-app
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

With selfHeal on, even if someone manually changes a resource in the cluster, Argo CD reverts it to the Git state. Git becomes the single source of truth.
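The effect of `prune` and `selfHeal` can be sketched as a three-way comparison between what Git declares and what is live. This is a deliberately simplified model of the drift case only, not Argo CD's sync engine. `plan_sync` and the resource names are hypothetical.

```python
def plan_sync(git: dict[str, dict], live: dict[str, dict],
              prune: bool, self_heal: bool) -> list[str]:
    """List the actions needed to make the live state match Git."""
    plan = []
    for name in sorted(git):
        if name not in live:
            plan.append(f"create {name}")
        elif live[name] != git[name] and self_heal:
            plan.append(f"revert {name}")  # someone changed it by hand
    if prune:
        # Resources that exist in the cluster but are no longer in Git.
        for name in sorted(set(live) - set(git)):
            plan.append(f"delete {name}")
    return plan


git_state = {"deploy/my-app": {"replicas": 3}, "svc/my-app": {"port": 80}}
live_state = {"deploy/my-app": {"replicas": 5}, "cm/old-config": {"k": "v"}}
print(plan_sync(git_state, live_state, prune=True, self_heal=True))
```

Turning `prune` off leaves orphaned resources in place, and turning `self_heal` off lets manual edits survive, so the two flags together decide how strictly Git is enforced.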

8. Backup — Velero

Velero backs up and restores cluster resources and persistent volumes. It provides CRDs like Backup, Restore, and Schedule.

    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      namespace: velero
    spec:
      schedule: "0 2 * * *"
      template:
        includedNamespaces:
          - production
        storageLocation: aws-s3
        ttl: 720h0m0s

This backs up the production namespace daily at 2 a.m. (the cron expression `0 2 * * *`) and retains each backup for 30 days (a `ttl` of 720 hours). Velero is a core tool for cluster migration and DR scenarios.
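The retention arithmetic is worth checking once by hand: 720 hours divided by 24 is 30 days. The sketch below parses a duration in the `720h0m0s` form and computes when a backup expires. The `parse_ttl` and `expires_at` helpers are hypothetical and handle only the hours/minutes/seconds form shown in the manifest.

```python
import re
from datetime import datetime, timedelta


def parse_ttl(ttl: str) -> timedelta:
    """Parse a duration such as '720h0m0s' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", ttl)
    if not ttl or match is None:
        raise ValueError(f"unsupported ttl: {ttl!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def expires_at(created: datetime, ttl: str) -> datetime:
    """Return the moment a backup created at `created` becomes eligible for deletion."""
    return created + parse_ttl(ttl)


created = datetime(2025, 1, 1, 2, 0)  # a backup taken at 2 a.m.
print(parse_ttl("720h0m0s").days)
print(expires_at(created, "720h0m0s"))
```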

9. Service Mesh — Istio

Istio intercepts traffic via sidecars (or ambient mode) to provide routing, mTLS, and observability. Traffic policy is declared via CRDs like VirtualService, DestinationRule, and Gateway.

    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
    spec:
      hosts:
        - reviews
      http:
        - route:
            - destination:
                host: reviews
                subset: v1
              weight: 90
            - destination:
                host: reviews
                subset: v2
              weight: 10

Declare a canary as a 90/10 traffic split and the mesh enforces it, provided the `v1` and `v2` subsets are defined in a matching DestinationRule. Istiod acts as the controller that translates these CRDs into Envoy configuration.
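A weighted split like the one above is statistical: each request is routed independently, so the observed ratio only approaches 90/10 over many requests. The sketch below simulates that with Python's standard library. It illustrates the concept and is not how Envoy implements weighted clusters. `pick_subset` is a hypothetical helper.

```python
import random
from collections import Counter


def pick_subset(routes: list[tuple[str, int]], rng: random.Random) -> str:
    """Choose one subset at random, in proportion to its weight."""
    subsets = [subset for subset, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(subsets, weights=weights, k=1)[0]


routes = [("v1", 90), ("v2", 10)]
rng = random.Random(42)  # fixed seed so the simulation is repeatable
counts = Counter(pick_subset(routes, rng) for _ in range(10_000))
print({subset: count / 10_000 for subset, count in counts.items()})
```

For a low-traffic service, a 10% canary may see only a handful of requests per hour, so the evaluation window for a canary should be sized by request count rather than by wall-clock time alone.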

10. Machine Learning — Kubeflow

Kubeflow is a platform for running ML workflows on Kubernetes, and it is a collection of Operators. The Training Operator, for instance, handles distributed training jobs via CRDs.

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          template:
            spec:
              containers:
                - name: pytorch
                  image: my-registry/pytorch-train:latest
        Worker:
          replicas: 4
          template:
            spec:
              containers:
                - name: pytorch
                  image: my-registry/pytorch-train:latest

The Operator handles the master-worker topology of distributed training, environment injection (RANK, WORLD_SIZE, etc.), and restarting failed workers. Data scientists can run distributed training without knowing the infrastructure.
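The environment injection can be pictured as follows: for one master and four workers, each Pod receives its own `RANK` plus a shared `WORLD_SIZE` of 5. The sketch below generates those values. It is an illustration rather than the Training Operator's code: the job name, the `<job>-master-0` address format, and the port number are assumptions, while `MASTER_ADDR` and `MASTER_PORT` are the standard variables PyTorch's distributed launcher reads.

```python
def build_env(job_name: str, workers: int, master_port: int = 23456) -> list[dict[str, str]]:
    """Return one environment dict per Pod, master first (rank 0), then workers."""
    world_size = workers + 1                # one master plus the workers
    master_addr = f"{job_name}-master-0"    # hypothetical address format
    envs = []
    for rank in range(world_size):
        envs.append({
            "RANK": str(rank),
            "WORLD_SIZE": str(world_size),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
        })
    return envs


for env in build_env("train-demo", workers=4):
    print(env["RANK"], env["WORLD_SIZE"], env["MASTER_ADDR"])
```

Getting any of these values wrong by hand typically makes the whole job hang at startup, which is why having the controller derive them from the replica counts is valuable.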

Comparison Table — What Gets Automated, and How Mature

The Operator Framework's Capability Levels classify maturity into five stages: Level 1 (Basic Install), Level 2 (Seamless Upgrades), Level 3 (Full Lifecycle), Level 4 (Deep Insights), and Level 5 (Auto Pilot: autoscaling, auto-tuning, anomaly detection).

| --- | --- | --- | --- | --- |

Note that maturity is a tendency, not an absolute grade. The same Operator feels different depending on which features you use.

Ten In-House Operator Ideas Worth Building

Beyond using off-the-shelf Operators, you can encode your organization's own operational knowledge into an Operator. Here are ideas worth building in-house.

1. Onboarding Operator

- One Tenant CR creates namespace + RBAC + quota + default NetworkPolicy

2. Certificate/Domain Operator

- On service registration, provision DNS record + cert + Ingress together

3. Cost-label Enforcement Operator

- Reject or auto-tag workloads missing a cost-center label

4. Nightly Sleep Operator

- Scale dev namespaces to replicas=0 outside business hours

5. Self-service Database Operator

- A DevDatabase CR spins up an isolated test DB and deletes it on expiry

6. Secret Rotation Operator

- Rotate API keys per internal policy and rolling-restart workloads

7. Backup Verification Operator

- Restore a backup to a throwaway cluster to auto-verify integrity

8. Canary Analysis Operator

- Evaluate post-deploy metrics to auto-promote or roll back

9. Compliance Scan Operator

- Surface image vuln-scan results in CR status and block policy violations

10. Multi-cluster Placement Operator

- A PlacementPolicy CR distributes workloads across clusters

What these ideas share: they are **repetitive operational tasks with clear procedures that humans easily get wrong**. That is exactly the Operator sweet spot.
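As an example of how small the core logic of such an Operator can be, here is a sketch of the decision behind idea 4, the Nightly Sleep Operator. The business hours (09:00 to 19:00, Monday to Friday) and the `desired_replicas` function are hypothetical. A real Operator would read them from a CR and patch the workloads accordingly.

```python
from datetime import datetime, time


def desired_replicas(now: datetime, normal_replicas: int,
                     start: time = time(9, 0), end: time = time(19, 0)) -> int:
    """Return the replica count a dev workload should have at the given moment."""
    is_weekday = now.weekday() < 5          # Monday is 0, Sunday is 6
    in_hours = start <= now.time() < end
    return normal_replicas if (is_weekday and in_hours) else 0


print(desired_replicas(datetime(2025, 1, 6, 10, 0), normal_replicas=2))  # Monday morning
print(desired_replicas(datetime(2025, 1, 6, 23, 0), normal_replicas=2))  # Monday night
print(desired_replicas(datetime(2025, 1, 4, 10, 0), normal_replicas=2))  # Saturday
```

The function is pure and depends only on the clock and the declared state, which is what makes it safe to call on every reconcile.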

Is It Worth Building? — Decision Criteria

Reinventing an existing Operator is waste; building one that isn't worth it becomes maintenance debt. Judge by the following criteria.

[Strong signals to build one]

- The procedure is clear and repetitive (a runbook already exists)

- Humans often make mistakes in it (3 a.m. responses, human error)

- It is an org-specific domain with no off-the-shelf Operator

- It maps naturally to a declarative form (clear desired state)

- It converges via reconcile (you can make it idempotent)

[Signals NOT to build one]

- It is a one-off task (a Job or script suffices)

- The procedure requires human judgment each time (automation is risky)

- A mature off-the-shelf Operator exists (don't reinvent cert-manager)

- There is no state and nothing to reconcile (a webhook/controller suffices)

- The team lacks controller-operation capacity (it becomes debt)

The last point matters most. An Operator is harder to **maintain long-term** than to build. CRD versioning, keeping up with Kubernetes upgrades, the risk of a reconcile bug mangling resources en masse — an ownerless Operator becomes a time bomb.

What to Check When Choosing an Off-the-Shelf Operator

When adopting rather than building, check the following.

| Check item | Why it matters |

| --- | --- |

| CNCF/official status | Signal of governance and longevity |

| Release cadence | Does it track new K8s versions quickly? |

| Capability Level | Does it automate failover/backup too? |

| CRD version policy | Has it moved past v1alpha1? |

| Security (RBAC scope) | Does it demand full cluster-admin? |

| Production references | Are there large-scale references? |

Conclusion

An Operator in one sentence is **"operational knowledge embedded in a reconcile loop."** Database failover, certificate auto-renewal, secret sync — all of it once meant a human waking at 3 a.m., and now a controller does it quietly.

If you take one thing from this catalog, let it be the question: "What is the most frequently repeated, most frequently botched operational procedure in my organization?" If the answer is clear and expressible declaratively, that is the candidate for your first Operator. In the next article, we will build the most challenging of them all — a database Operator — from scratch.

References

- Kubernetes Operator pattern docs: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

- Operator Framework: https://sdk.operatorframework.io/

- OperatorHub.io (Operator catalog): https://operatorhub.io/

- CloudNativePG: https://cloudnative-pg.io/

- Zalando Postgres Operator: https://github.com/zalando/postgres-operator

- Vitess: https://vitess.io/

- Strimzi (Kafka Operator): https://strimzi.io/

- Prometheus Operator: https://prometheus-operator.dev/

- cert-manager: https://cert-manager.io/

- External Secrets Operator: https://external-secrets.io/

- Argo CD: https://argo-cd.readthedocs.io/

- Velero: https://velero.io/

- Istio: https://istio.io/latest/docs/

- Kubeflow Training Operator: https://www.kubeflow.org/docs/components/training/

- Operator Capability Levels: https://sdk.operatorframework.io/docs/overview/operator-capabilities/