What You Can Build with Operators — A Real-World Catalog

Introduction — What Does an Operator Actually Automate?

If you have used Kubernetes for any length of time, you have heard the word "Operator" countless times. Yet when asked "what is an Operator," most people give the abstract answer "a CRD plus a controller." The definition is correct, but it does not convey what problem it actually solves.

This article takes a case-study approach instead. It organizes the most widely used Operators by domain (databases, messaging, caching, monitoring, certificates, secrets, GitOps, backup, service mesh, and machine learning) and shows concretely what operational labor each one replaced with code. By the end of this catalog, you should have a tangible sense of what Operators are for.

The essence, stated up front: an Operator is a human operator's knowledge encoded in software. Take a procedure like "when the Postgres primary dies, promote a replica, update DNS, and verify the backup." Instead of a human waking at 3 a.m. to carry it out, a controller performs it automatically in a reconcile loop. That is why stateful systems with complex operational procedures, such as databases and message brokers, gain the most from Operators.

How Operators Work (a 3-Minute Refresher)

Before the cases, let me briefly summarize the working principle every Operator shares.

User declares a CR (Custom Resource)
        |
        v
+------------------------+
|  Operator (controller) |
|  reconcile loop        |
|  desired vs actual     |
+-----------+------------+
        |
        v  create/update/delete what's missing
+------------------------+
| StatefulSet / Service  |
| ConfigMap / PVC / Job  |
| (real Kubernetes objs) |
+------------------------+

The user declares only the desired state: "a Postgres cluster, 3 replicas, version 16, daily backups." The Operator observes the actual state and acts to close the gap. This loop runs idempotently and repeatedly, converging the system toward the declared state.

This model is powerful because failure recovery comes for free. If a replica dies and actual drops to 2, the next reconcile notices the gap and brings it back to 3. No human intervention required.
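To make the loop concrete, here is a minimal sketch in Python. It is not how any real Operator is implemented: the Cluster class is a hypothetical in-memory stand-in for the API server, and the reconcile function and pod naming scheme are assumptions made for illustration. It only shows the shape of "compare desired with actual, act on the difference, repeat."

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for the real cluster state."""
    pods: set[str] = field(default_factory=set)


def reconcile(name: str, desired_replicas: int, cluster: Cluster) -> list[str]:
    """Compare desired vs. actual state and return the actions taken.

    Idempotent: calling it again when nothing has drifted returns [].
    """
    desired = {f"{name}-{i}" for i in range(desired_replicas)}
    actions = []
    for pod in sorted(desired - cluster.pods):   # missing -> create
        cluster.pods.add(pod)
        actions.append(f"create {pod}")
    for pod in sorted(cluster.pods - desired):   # unexpected -> delete
        cluster.pods.remove(pod)
        actions.append(f"delete {pod}")
    return actions


cluster = Cluster()
first = reconcile("pg", 3, cluster)    # initial convergence: three creates
second = reconcile("pg", 3, cluster)   # no drift, so nothing to do
cluster.pods.discard("pg-1")           # simulate a replica dying
healed = reconcile("pg", 3, cluster)   # the next pass closes the gap
```

The second call doing nothing is the idempotency property; the third call recreating only pg-1 is the "recovery for free" behavior described above.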

1. Databases — Where Operators Shine Most

Stateful databases are the killer application for Operators. Backup, restore, failover, scaling, version upgrades — all are delicate and risky operational tasks, which is precisely why automating them yields the greatest value.

CloudNativePG (PostgreSQL)

CloudNativePG, a CNCF Sandbox project, is designed to run Postgres "Kubernetes-natively." Notably it manages Pods directly rather than using a StatefulSet, for finer control during failover.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-prod
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.3
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backups/pg-prod
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"

What this single YAML automates: a 1 primary + 2 replica topology, streaming replication setup, automatic failover (replica promotion) on primary failure, continuous backup to S3 (WAL archiving), and a 30-day retention policy. Every procedure once done by hand becomes a declaration.

Zalando Postgres Operator

Built by Zalando while running thousands of their own Postgres clusters. It embeds Patroni (an HA solution) and its manifests are concise.

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "acid"
  numberOfInstances: 2
  postgresql:
    version: "16"
  volume:
    size: 10Gi

Vitess (MySQL Sharding)

Vitess horizontally shards MySQL to petabyte scale; it originated at YouTube. The Vitess Operator exposes Vitess-specific concepts — shards, cells, tablets — as CRDs. It shines at scales a single MySQL cannot handle.

2. Messaging — Strimzi (Apache Kafka)

Anyone who has operated Kafka knows it: brokers, ZooKeeper (or KRaft), topics, partitions, rebalancing, rolling upgrades — none of it is easy. Strimzi abstracts all of this into CRDs.

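apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 3.9.0
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: persistent-claim
      size: 100Gi
  # ZooKeeper-based clusters need this section; KRaft clusters
  # instead move replicas and storage into KafkaNodePool resources.
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}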

Interestingly, Strimzi embeds Operators inside an Operator. The Topic Operator and User Operator let you declare topics and users as custom resources too.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000

Now topic creation can be managed via GitOps. "Add a topic" becomes "merge a PR."

3. Caching — Redis Operator

Redis too becomes far more complex once you move beyond a single instance to Sentinel-based HA or Cluster-mode sharding. Several Redis Operators (e.g., Spotahome, OT-CONTAINER-KIT) automate this.

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover
spec:
  sentinel:
    replicas: 3
  redis:
    replicas: 3
    storage:
      persistentVolumeClaim:
        metadata:
          name: redisfailover-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

Three Sentinels watch the master; on detecting a failure they promote a replica, and the application asks Sentinel for the new master. The Operator deploys this topology and keeps the Redis and Sentinel configuration consistent, so the whole failover setup is managed for you.
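The failover decision itself can be sketched in a few lines of Python. This is a deliberately simplified model, not the Sentinel implementation: the Replica class and both function names are hypothetical, and real Sentinel additionally needs a majority to authorize the failover and applies further tie-breakers when choosing a replica.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    priority: int      # lower is preferred, as with Redis replica-priority
    repl_offset: int   # how much of the master's stream was received


def master_is_down(votes: list[bool], quorum: int) -> bool:
    """True when at least `quorum` Sentinels report the master unreachable."""
    return sum(votes) >= quorum


def pick_new_master(replicas: list[Replica]) -> Replica:
    """Prefer lower priority, then the most up-to-date replica."""
    return min(replicas, key=lambda r: (r.priority, -r.repl_offset))


votes = [True, True, False]   # 2 of 3 Sentinels cannot reach the master
replicas = [Replica("redis-1", 100, 5_000), Replica("redis-2", 100, 7_500)]
promoted = pick_new_master(replicas) if master_is_down(votes, quorum=2) else None
```

With equal priorities, redis-2 is chosen because it has received more of the replication stream, which minimizes data loss on promotion.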

4. Monitoring — Prometheus Operator

The Prometheus Operator is among the most widely used Operators. As the core of kube-prometheus-stack, it lets you declare monitoring targets via CRDs instead of editing Prometheus config files directly.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Create a ServiceMonitor and the Operator updates Prometheus's scrape config automatically. Alert rules are declared via the PrometheusRule CRD.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate is high"

Alert rules become version-controlled code, and alert configuration ships alongside application deployments.

5. Certificates — cert-manager

cert-manager automates the issuance, renewal, and distribution of TLS certificates. It supports issuers like Let's Encrypt (ACME), Vault, and internal CAs, with automatic renewal before expiry as its core value.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-tls
spec:
  secretName: example-tls-secret
  dnsNames:
    - app.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

cert-manager eradicates the classic "forgot to renew the cert and the site went down" incident. Declare a Certificate and the Operator runs the ACME challenge, fills the Secret with the certificate, and renews before expiry.
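"Renews before expiry" comes down to simple date arithmetic. The sketch below assumes the rule I understand to be cert-manager's documented default, namely renewing once two thirds of the certificate's lifetime has elapsed unless a renewBefore value is set. Treat that rule, and the renewal_time helper itself, as assumptions and check the documentation for your version.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(not_before: datetime, not_after: datetime,
                 renew_before: Optional[timedelta] = None) -> datetime:
    """Return the moment at which renewal should be attempted.

    Assumption: with no explicit renew_before, renew when one third
    of the certificate's lifetime remains.
    """
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3
    return not_after - renew_before


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)        # a typical Let's Encrypt lifetime
renew_at = renewal_time(issued, expires)     # 60 days after issuance
```

For a 90-day certificate this leaves a 30-day window in which renewal can fail and be retried before anything actually expires.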

6. Secrets — External Secrets Operator

You cannot store secrets in Git as plaintext. The External Secrets Operator (ESO) syncs secrets from external stores like AWS Secrets Manager, HashiCorp Vault, and GCP Secret Manager into Kubernetes Secrets.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/password

In Git you declare only "which secret to fetch"; the actual value stays in the external vault. The Operator syncs periodically, so rotating a secret externally is reflected automatically.
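The sync step is conceptually a lookup from each remoteRef to a key in the resulting Secret. Here is a minimal Python sketch of that mapping; the sync_secret function is hypothetical and the dictionary stands in for a real secret store, so none of this reflects ESO's actual code.

```python
def sync_secret(data_specs: list[dict], remote_store: dict[str, str]) -> dict[str, str]:
    """Build the Secret's data by resolving each remoteRef against the store."""
    return {
        spec["secretKey"]: remote_store[spec["remoteRef"]["key"]]
        for spec in data_specs
    }


# Mirrors the `data` section of the ExternalSecret above.
data_specs = [{"secretKey": "password", "remoteRef": {"key": "prod/db/password"}}]

# Hypothetical in-memory stand-in for AWS Secrets Manager.
remote_store = {"prod/db/password": "old-value"}

before = sync_secret(data_specs, remote_store)
remote_store["prod/db/password"] = "rotated-value"   # rotation happens externally
after = sync_secret(data_specs, remote_store)        # the next refreshInterval tick
```

Nothing in Git changed between the two syncs, yet the Kubernetes Secret picks up the rotated value.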

7. GitOps — Argo CD

Argo CD itself works via the Operator pattern. The Application CRD declares "which path of which Git repo to sync to which cluster."

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/my-app-config
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

With selfHeal on, even if someone manually changes a resource in the cluster, Argo CD reverts it to the Git state. Git becomes the single source of truth.

8. Backup — Velero

Velero backs up and restores cluster resources and persistent volumes. It provides CRDs like Backup, Restore, and Schedule.

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - production
    storageLocation: aws-s3
    ttl: 720h0m0s

It backs up the production namespace daily at 2 a.m. and retains each backup for 30 days (the 720h ttl). It is a core tool for cluster migration and DR scenarios.
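The ttl is given in hours, so it is worth checking the arithmetic once. This short Python snippet converts 720h into days and computes when one backup would expire; the creation timestamp is a made-up example and assumes the schedule is evaluated in UTC.

```python
from datetime import datetime, timedelta, timezone

ttl = timedelta(hours=720)   # ttl: 720h0m0s from the Schedule above

# Hypothetical backup created by the 02:00 run on 1 June (UTC assumed).
created = datetime(2025, 6, 1, 2, 0, tzinfo=timezone.utc)
expires = created + ttl      # eligible for garbage collection from here on
```

Here ttl.days is 30, and a backup taken on 1 June at 02:00 expires on 1 July at 02:00.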

9. Service Mesh — Istio

Istio intercepts traffic via sidecars (or ambient mode) to provide routing, mTLS, and observability. Traffic policy is declared via CRDs like VirtualService, DestinationRule, and Gateway.

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10

Declare a canary as a 90/10 traffic split and the mesh implements it; the v1 and v2 subsets must be defined in a matching DestinationRule. Istiod acts as the controller that translates these resources into Envoy configuration.
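What a weighted route means can be shown with cumulative weights. The Python sketch below is illustrative only: choose_subset is a hypothetical helper, and the proxy actually picks a destination at random per request rather than sweeping through values as this example does to keep the result deterministic.

```python
def choose_subset(routes: list[tuple[str, int]], roll: int) -> str:
    """Map a roll in [0, total weight) to a subset by cumulative weight."""
    upper = 0
    for subset, weight in routes:
        upper += weight
        if roll < upper:
            return subset
    raise ValueError("roll must be below the total weight")


routes = [("v1", 90), ("v2", 10)]   # weights from the VirtualService above
picks = [choose_subset(routes, roll) for roll in range(100)]
```

Sweeping every roll from 0 to 99 sends exactly 90 requests to v1 and 10 to v2; with random rolls the split approaches the same ratio over many requests.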

10. Machine Learning — Kubeflow

Kubeflow is a platform for running ML workflows on Kubernetes, and it is a collection of Operators. The Training Operator, for instance, handles distributed training jobs via CRDs.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest

The Operator handles the master-worker topology of distributed training, environment injection (RANK, WORLD_SIZE, etc.), and restarting failed workers. Data scientists can run distributed training without knowing the infrastructure.
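The environment injection is the part that is easiest to picture in code. The following Python sketch generates the variables each replica would need for torch.distributed. The replica_env function is hypothetical, and the master address format and port 23456 are assumptions for illustration rather than a guaranteed description of what the Training Operator sets.

```python
def replica_env(job: str, role: str, index: int, workers: int,
                port: int = 23456) -> dict[str, str]:
    """Return the distributed-training environment for one replica.

    Assumption: the single master is rank 0 and workers follow as 1..N.
    """
    rank = 0 if role == "master" else index + 1
    return {
        "MASTER_ADDR": f"{job}-master-0",   # assumed Service name format
        "MASTER_PORT": str(port),
        "WORLD_SIZE": str(workers + 1),     # workers plus the master
        "RANK": str(rank),
    }


envs = [replica_env("pytorch-dist", "master", 0, workers=4)]
envs += [replica_env("pytorch-dist", "worker", i, workers=4) for i in range(4)]
```

For the PyTorchJob above, this yields a world size of 5 with ranks 0 through 4, which is exactly the bookkeeping a data scientist would otherwise have to wire up by hand.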

Comparison Table — What Gets Automated, and How Mature

The Operator Framework's Operator Capability Levels classify maturity into five stages, from Level 1 (basic install) to Level 5 (auto pilot: autoscaling, auto-tuning, anomaly detection).

| Operator          | Domain       | Key automation            | Representative CRD | Maturity tendency |
|-------------------|--------------|---------------------------|--------------------|-------------------|
| CloudNativePG     | Database     | Failover, backup, PITR    | Cluster            | Level 4-5         |
| Zalando Postgres  | Database     | HA (Patroni), replication | postgresql         | Level 4           |
| Vitess            | Database     | Sharding, scale-out       | VitessCluster      | Level 4-5         |
| Strimzi           | Messaging    | Brokers, topics, upgrade  | Kafka, KafkaTopic  | Level 4-5         |
| Redis Operator    | Cache        | Sentinel failover         | RedisFailover      | Level 3-4         |
| Prometheus        | Monitoring   | Scrape config, alerts     | ServiceMonitor     | Level 4           |
| cert-manager      | Certificates | Issuance, auto-renewal    | Certificate        | Level 5           |
| External Secrets  | Secrets      | External vault sync       | ExternalSecret     | Level 4           |
| Argo CD           | GitOps       | Sync, selfHeal            | Application        | Level 4           |
| Velero            | Backup       | Scheduled backup, restore | Schedule, Backup   | Level 4           |
| Istio             | Service mesh | Traffic, mTLS             | VirtualService     | Level 4           |
| Kubeflow Training | ML           | Distributed training      | PyTorchJob         | Level 3-4         |

Note that maturity is a tendency, not an absolute grade. The same Operator feels different depending on which features you use.

Ten In-House Operator Ideas Worth Building

Beyond using off-the-shelf Operators, you can encode your organization's own operational knowledge into an Operator. Here are ideas worth building in-house.

 1. Onboarding Operator
    - One Tenant CR creates namespace + RBAC + quota + default NetworkPolicy
 2. Certificate/Domain Operator
    - On service registration, provision DNS record + cert + Ingress together
 3. Cost-label Enforcement Operator
    - Reject or auto-tag workloads missing a cost-center label
 4. Nightly Sleep Operator
    - Scale dev namespaces to replicas=0 outside business hours
 5. Self-service Database Operator
    - A DevDatabase CR spins up an isolated test DB and deletes it on expiry
 6. Secret Rotation Operator
    - Rotate API keys per internal policy and rolling-restart workloads
 7. Backup Verification Operator
    - Restore a backup to a throwaway cluster to auto-verify integrity
 8. Canary Analysis Operator
    - Evaluate post-deploy metrics to auto-promote or roll back
 9. Compliance Scan Operator
    - Surface image vuln-scan results in CR status and block policy violations
10. Multi-cluster Placement Operator
    - A PlacementPolicy CR distributes workloads across clusters

What these ideas share: they are repetitive operational tasks with clear procedures that humans easily get wrong. That is exactly the Operator sweet spot.
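As an example of how small the core logic of such an Operator can be, here is a Python sketch of the decision at the heart of the Nightly Sleep Operator (idea 4). The desired_replicas function, the 08:00 to 20:00 window, and the weekday rule are all assumptions made for illustration; a real controller would read them from a CR and apply the result to each Deployment.

```python
from datetime import datetime, time


def desired_replicas(now: datetime, normal: int,
                     start: time = time(8, 0), end: time = time(20, 0)) -> int:
    """Return `normal` on weekdays within [start, end), otherwise 0."""
    is_weekday = now.weekday() < 5        # Monday=0 ... Friday=4
    in_hours = start <= now.time() < end
    return normal if is_weekday and in_hours else 0


monday_morning = desired_replicas(datetime(2025, 6, 2, 10, 0), normal=3)
monday_night = desired_replicas(datetime(2025, 6, 2, 23, 0), normal=3)
saturday = desired_replicas(datetime(2025, 6, 7, 10, 0), normal=3)
```

Because the function is pure and depends only on the clock and the declared replica count, calling it on every reconcile is naturally idempotent.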

Is It Worth Building? — Decision Criteria

Reinventing an existing Operator is waste; building one that isn't worth it becomes maintenance debt. Judge by the following criteria.

[Strong signals to build one]
  - The procedure is clear and repetitive (a runbook already exists)
  - Humans often make mistakes in it (3 a.m. responses, human error)
  - It is an org-specific domain with no off-the-shelf Operator
  - It maps naturally to a declarative form (clear desired state)
  - It converges via reconcile (you can make it idempotent)

[Signals NOT to build one]
  - It is a one-off task (a Job or script suffices)
  - The procedure requires human judgment each time (automation is risky)
  - A mature off-the-shelf Operator exists (don't reinvent cert-manager)
  - There is no state and nothing to reconcile (an admission webhook or a plain controller without a CRD suffices)
  - The team lacks controller-operation capacity (it becomes debt)

The last point matters most. An Operator is harder to maintain long-term than to build. CRD versioning, keeping up with Kubernetes upgrades, the risk of a reconcile bug mangling resources en masse — an ownerless Operator becomes a time bomb.

What to Check When Choosing an Off-the-Shelf Operator

When adopting rather than building, check the following.

| Check item            | Why it matters                          |
|-----------------------|-----------------------------------------|
| CNCF/official status  | Signal of governance and longevity      |
| Release cadence       | Does it track new K8s versions quickly? |
| Capability Level      | Does it automate failover/backup too?   |
| CRD version policy    | Has it moved past v1alpha1?             |
| Security (RBAC scope) | Does it demand full cluster-admin?      |
| Production references | Are there large-scale references?       |

Conclusion

An Operator in one sentence is "operational knowledge embedded in a reconcile loop." Database failover, certificate auto-renewal, secret sync — all of it once meant a human waking at 3 a.m., and now a controller does it quietly.

If you take one thing from this catalog, let it be the question: "What is the most frequently repeated, most frequently botched operational procedure in my organization?" If the answer is clear and expressible declaratively, that is the candidate for your first Operator. In the next article, we will build the most challenging of them all — a database Operator — from scratch.
