Operator Security — Least-Privilege RBAC, Multi-Tenancy, Supply Chain

Introduction — An Operator May Be the Most Dangerous Workload in Your Cluster
Why Operator Privileges Are Dangerous — Start With the Threat Model
Generating Least-Privilege RBAC With Markers
Namespace-Scoped vs Cluster-Scoped — Choosing Role and ClusterRole
- Single-Namespace Manager Setup
Securing the Metrics Endpoint — After kube-rbac-proxy Removal (2026)
- What Changed
- Granting the Scraper Permission
Webhook Security — TLS and cert-manager
- The API Server Calls the Webhook Directly
- Validating Webhook Code and Failure Policy
Preventing Privilege Escalation via CRs — The Most Subtle Risk
- The Shape of the Problem
- Defense Strategies
Multi-Tenant Isolation
- How Do You Draw the Boundary Between Tenants?
- Scoping With OperatorGroup (OLM)
Supply Chain Security — Image Signing and SBOM
Secret Handling
- Principles for an Operator That Handles Secrets
Audit
- What Should Be Recorded
How reconcile Idempotency Relates to Security
Security Checklist
Conclusion
References

Introduction — An Operator May Be the Most Dangerous Workload in Your Cluster

Most workloads running inside a Kubernetes cluster do their job within their own namespace. A web application owns its Pods and Services and serves traffic; a batch job reads and writes data and exits. When such a workload is compromised, the damage is usually confined to that workload and its data.

Operators are different. The essence of an Operator is an automated administrator that manipulates cluster state on your behalf. It watches the domain objects you defined as a CRD (Custom Resource Definition) and, to make them real, it creates Deployments, creates Services, reads Secrets, and sometimes even creates ClusterRoles. Doing this requires powerful privileges. And many Operators run cluster-wide (cluster-scoped).

This is the core security problem. If an Operator's ServiceAccount is compromised, the attacker inherits every privilege that Operator holds. If the Operator can read Secrets in all namespaces, so can the attacker. If the Operator can create ClusterRoleBindings, the attacker can promote themselves to cluster-admin. A single RCE found in one controller Pod can lead straight to a full cluster takeover.

This article covers the hardening you should apply when building an Operator with Kubebuilder (as of 2026: Kubernetes 1.36 / Go 1.26, controller-runtime v0.24.x, controller-tools v0.21.x). It walks through least-privilege RBAC generation, metrics endpoint security, webhook TLS, privilege escalation prevention, multi-tenant isolation, and supply chain security, each with code and manifests.

Why Operator Privileges Are Dangerous — Start With the Threat Model

Security design must always begin with a threat model. Viewing an Operator from an attacker's perspective reveals the following attack paths.

Attack path 1: Controller Pod compromise
  Application vulnerability (RCE) -> shell on the controller Pod
  -> steal ServiceAccount token (/var/run/secrets/...)
  -> exercise all of the Operator's RBAC privileges
  -> e.g., read Secrets in every namespace, create ClusterRoleBindings

Attack path 2: Privilege escalation via CR
  A malicious user creates/edits a CR
  -> the Operator trusts the CR and creates RBAC objects
  -> the user indirectly obtains privileges they could not hold directly

Attack path 3: Supply chain
  Compromise of the Operator image build pipeline or base image
  -> an image containing malicious code is deployed to the cluster
  -> looks like a legitimate Operator but contains a backdoor

Attack path 4: Metrics / webhook endpoints
  Unauthenticated metrics endpoint -> internal information leakage
  Webhook with failurePolicy Ignore or broken certificates -> admission checks are skipped or block the cluster

Blocking each of these four paths is the practical work of Operator security. The defenses all converge on one principle: least privilege. Grant the Operator only the verbs and resources it actually needs, and deny everything else.

Generating Least-Privilege RBAC With Markers

Declare Privileges Right Next to the Code

One of the best things about a Kubebuilder-based Operator is that RBAC is declared as markers right next to the controller code. controller-tools reads those markers and generates Role/ClusterRole manifests. This reduces the common failure where "the code reads a Secret but RBAC lacks the permission, so it blows up at runtime," and conversely, an "RBAC that is far too broad for what the code actually uses" can be caught during code review.

Here are the RBAC markers of a typical Reconciler.

// WidgetReconciler reconciles a Widget custom resource.
type WidgetReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=apps.example.com,resources=widgets,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps.example.com,resources=widgets/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps.example.com,resources=widgets/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=events,verbs=create;patch
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := logf.FromContext(ctx)

	var widget appsv1alpha1.Widget
	if err := r.Get(ctx, req.NamespacedName, &widget); err != nil {
		// NotFound is normal — may be a re-queue of a deleted object
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Compute the desired state and ensure the Deployment
	desired := r.buildDeployment(&widget)
	if err := r.ensureDeployment(ctx, desired); err != nil {
		log.Error(err, "failed to ensure deployment")
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

Running make manifests makes controller-tools read the markers above and generate the following ClusterRole into config/rbac/role.yaml.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: widget-manager-role
rules:
  - apiGroups: ["apps.example.com"]
    resources: ["widgets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps.example.com"]
    resources: ["widgets/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["apps.example.com"]
    resources: ["widgets/finalizers"]
    verbs: ["update"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]

Trimming Verbs Is the Heart of Least Privilege

The most common mistake novice Operator developers make is granting every resource the * wildcard verb, or the full set get/list/watch/create/update/patch/delete, without checking which verbs the code calls. Pick verbs deliberately, per resource.

What the controller does	Verbs needed	Needlessly broad
Reads and watches CRs only	get, list, watch	create, delete
Creates/updates child resources	get, list, watch, create, update, patch	delete (unneeded if GC handles it via ownerReference)
Reports status	get, update, patch (status subresource)	update on the main resource
Records events	create, patch	get, list

The delete verb in particular must be handled with care. Cleanup of child resources can usually be delegated to ownerReference and the garbage collector, so the controller often does not need delete directly. Without delete, you eliminate the scenario where a compromised controller mass-deletes resources. The example markers above still grant delete on Deployments and Services; keep it only if the controller deletes children explicitly.
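The review of verbs described above can be partly automated. The following is a minimal sketch, assuming a hypothetical PolicyRule struct that mirrors only three fields of the real rbac/v1 type; it flags wildcards and delete-style verbs so a reviewer can ask whether each one is justified.

```go
package main

import "fmt"

// PolicyRule is a hypothetical, trimmed-down stand-in for the rbac/v1 PolicyRule type.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// auditRules returns one finding per wildcard or delete-style verb it sees.
func auditRules(rules []PolicyRule) []string {
	var findings []string
	for i, r := range rules {
		for _, v := range r.Verbs {
			switch v {
			case "*":
				findings = append(findings, fmt.Sprintf("rule %d: wildcard verb", i))
			case "delete", "deletecollection":
				findings = append(findings, fmt.Sprintf("rule %d: %s on %v (could ownerReference GC replace it?)", i, v, r.Resources))
			}
		}
		for _, res := range r.Resources {
			if res == "*" {
				findings = append(findings, fmt.Sprintf("rule %d: wildcard resource", i))
			}
		}
	}
	return findings
}

func main() {
	rules := []PolicyRule{
		{APIGroups: []string{"apps"}, Resources: []string{"deployments"},
			Verbs: []string{"get", "list", "watch", "create", "update", "patch", "delete"}},
		{APIGroups: []string{""}, Resources: []string{"events"}, Verbs: []string{"create", "patch"}},
	}
	for _, f := range auditRules(rules) {
		fmt.Println(f)
	}
}
```

A check like this could run in CI against the generated role.yaml once it is parsed into such a structure; the parsing step is left out here.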

Narrowing Scope by Resource Name

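RBAC can restrict permissions to specific object names via resourceNames. For example, if the Operator only needs to modify a single ConfigMap or Lease used for leader election, restrict it to that name instead of granting access to all objects of that kind. The +kubebuilder:rbac marker accepts a resourceNames field for rules generated from controller code; the leader-election Role is normally a hand-maintained manifest, so narrow it there. Two limits apply: create cannot be restricted by resourceNames because the object name is not known at authorization time, and list/watch are only authorized when the client sends a matching metadata.name field selector. The Role below therefore grants create separately and names the Lease only for get/update/patch.

# Leader-election-only Role: can read and modify only the named Lease
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: widget-leader-election-role
  namespace: widget-system
rules:
  # create cannot be restricted by resourceNames
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "update", "patch"]
    resourceNames: ["widget-operator-leader"]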
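One detail of resourceNames is easy to miss: the authorizer compares the rule against the object name carried in the request URL, and a create request carries none. The sketch below is a simplified model of that matching, not the API server's code; the Rule struct and the allows function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified stand-in for one RBAC rule on a single API group.
type Rule struct {
	Verbs         []string
	Resources     []string
	ResourceNames []string
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// allows reports whether the rule permits verb on resource for the object name
// carried by the request. name is "" when the request URL has no object name,
// which is the case for create.
func allows(r Rule, verb, resource, name string) bool {
	if !contains(r.Verbs, verb) || !contains(r.Resources, resource) {
		return false
	}
	if len(r.ResourceNames) == 0 {
		return true // no name restriction on this rule
	}
	return contains(r.ResourceNames, name)
}

func main() {
	named := Rule{
		Verbs:         []string{"get", "create", "update"},
		Resources:     []string{"leases"},
		ResourceNames: []string{"widget-operator-leader"},
	}
	createOnly := Rule{Verbs: []string{"create"}, Resources: []string{"leases"}}

	fmt.Println("update named lease:     ", allows(named, "update", "leases", "widget-operator-leader"))
	fmt.Println("update some other lease:", allows(named, "update", "leases", "other"))
	fmt.Println("create via named rule:  ", allows(named, "create", "leases", ""))
	fmt.Println("create via unnamed rule:", allows(createOnly, "create", "leases", ""))
}
```

In this model a create verb listed next to resourceNames never matches, which is why a name-restricted Role needs a separate, unnamed rule for create.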

Namespace-Scoped vs Cluster-Scoped — Choosing Role and ClusterRole

The first security decision to make when designing an Operator is the scope of its privileges. Kubernetes RBAC offers two kinds of role and two kinds of binding, which combine in three useful ways.

Role + RoleBinding             -> valid only within a specific namespace
ClusterRole + ClusterRoleBinding -> valid cluster-wide
ClusterRole + RoleBinding      -> apply a ClusterRole definition, scoped to one namespace

The last combination is important. You define the permission rules as a ClusterRole, but bind it with a RoleBinding, so the permission applies only to the namespace where the RoleBinding lives. This lets you reuse the same permission definition while narrowing the scope.
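The scoping rule can be stated in a few lines of code. This is a minimal model, assuming a hypothetical Binding struct; it ignores subjects and rules and only answers where a binding takes effect.

```go
package main

import "fmt"

// Binding is a hypothetical stand-in that records only the binding kind and,
// for a RoleBinding, the namespace it lives in.
type Binding struct {
	Kind      string // "RoleBinding" or "ClusterRoleBinding"
	Namespace string // set only for RoleBinding
}

// appliesTo reports whether the binding takes effect for a request in reqNamespace.
// An empty reqNamespace stands for a request against a cluster-scoped resource.
func appliesTo(b Binding, reqNamespace string) bool {
	switch b.Kind {
	case "ClusterRoleBinding":
		return true
	case "RoleBinding":
		// Holds even when the RoleBinding references a ClusterRole.
		return reqNamespace != "" && reqNamespace == b.Namespace
	default:
		return false
	}
}

func main() {
	tenantA := Binding{Kind: "RoleBinding", Namespace: "tenant-a"}
	global := Binding{Kind: "ClusterRoleBinding"}
	for _, ns := range []string{"tenant-a", "tenant-b", ""} {
		fmt.Printf("namespace %q: RoleBinding=%v ClusterRoleBinding=%v\n",
			ns, appliesTo(tenantA, ns), appliesTo(global, ns))
	}
}
```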

Situation	Recommended scope	Reason
The CR manages cluster-scoped resources (e.g., Namespace, StorageClass)	ClusterRole + ClusterRoleBinding	Must cross namespace boundaries
The Operator watches CRs in all namespaces	ClusterRole + ClusterRoleBinding	The watch target spans all namespaces
The Operator manages CRs only in specific namespaces (multi-tenancy)	ClusterRole + per-namespace RoleBinding	Reuse definition + narrow scope
Single-namespace-only Operator	Role + RoleBinding	Narrowest scope, safest

Single-Namespace Manager Setup

The controller-runtime manager can restrict its cache scope to namespaces. This way the Operator does not even read objects from other namespaces, and RBAC can be narrowed accordingly.

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
	// Restrict the cache to specific namespaces — do not watch cluster-wide
	Cache: cache.Options{
		DefaultNamespaces: map[string]cache.Config{
			"widget-system": {},
			"tenant-a":      {},
		},
	},
	Metrics: metricsserver.Options{ // metricsserver = sigs.k8s.io/controller-runtime/pkg/metrics/server
		BindAddress: ":8443",
	},
})
if err != nil {
	setupLog.Error(err, "unable to create manager")
	os.Exit(1)
}

When you scope to namespaces, you can grant permissions through per-namespace Role + RoleBinding instead of a ClusterRole. A narrower permission scope means a smaller blast radius on compromise. You must ask, during the design phase, "Does this Operator really need to see every namespace?"
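Hard-coding the namespace list is rarely practical, so the list usually comes from configuration. The sketch below assumes a hypothetical WATCH_NAMESPACES environment variable holding a comma-separated list; the resulting slice would then be turned into the DefaultNamespaces map. It refuses to fall back to a cluster-wide watch when the list is empty, which is a design choice of this sketch rather than controller-runtime behavior.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// parseWatchNamespaces splits a comma-separated list, trims whitespace,
// drops empty entries and duplicates, and returns the names sorted.
func parseWatchNamespaces(raw string) []string {
	seen := map[string]struct{}{}
	for _, part := range strings.Split(raw, ",") {
		ns := strings.TrimSpace(part)
		if ns != "" {
			seen[ns] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for ns := range seen {
		out = append(out, ns)
	}
	sort.Strings(out)
	return out
}

func main() {
	// WATCH_NAMESPACES is a hypothetical variable name chosen for this example.
	namespaces := parseWatchNamespaces(os.Getenv("WATCH_NAMESPACES"))
	if len(namespaces) == 0 {
		fmt.Println("no namespaces configured: refusing to fall back to a cluster-wide watch")
		return
	}
	fmt.Println("cache restricted to:", namespaces)
}
```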

Securing the Metrics Endpoint — After kube-rbac-proxy Removal (2026)

What Changed

In the past, Kubebuilder scaffolding ran kube-rbac-proxy as a sidecar to protect the metrics endpoint. It placed a proxy in front of the metrics port and validated the incoming request's token with a SubjectAccessReview. But this sidecar added an extra image dependency, extra privileges, and extra operational burden.

As of 2026, Kubebuilder has removed kube-rbac-proxy. Instead, controller-runtime's metrics server provides a WithAuthenticationAndAuthorization filter that performs authentication/authorization itself. Metrics are served over HTTPS, and incoming requests are validated via TokenReview and SubjectAccessReview.

import (
	"crypto/tls"

	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

func main() {
	// ... scheme registration etc. omitted ...

	metricsServerOptions := metricsserver.Options{
		BindAddress:   ":8443",
		SecureServing: true,
		// Authenticate/authorize incoming metrics scrape requests;
		// verify access to the non-resource URL /metrics with SubjectAccessReview
		FilterProvider: filters.WithAuthenticationAndAuthorization,
		TLSOpts: []func(*tls.Config){
			func(c *tls.Config) {
				c.MinVersion = tls.VersionTLS13
			},
		},
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:                 scheme,
		Metrics:                metricsServerOptions,
		HealthProbeBindAddress: ":8081",
		LeaderElection:         true,
		LeaderElectionID:       "widget-operator-leader",
	})
	if err != nil {
		setupLog.Error(err, "unable to create manager")
		os.Exit(1)
	}

	// ... register controllers, mgr.Start(...) ...
}

Granting the Scraper Permission

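Now the side that wants to scrape metrics (e.g., Prometheus's ServiceAccount) must present a bearer token and hold get permission on the /metrics non-resource URL. The Operator's own ServiceAccount, in turn, needs create permission on tokenreviews (authentication.k8s.io) and subjectaccessreviews (authorization.k8s.io) so the filter can perform its checks. The following ClusterRole defines the scraper's permission.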

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: widget-metrics-reader
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: widget-metrics-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: widget-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring

The benefits are clear. With the sidecar gone, the attack surface shrinks; authentication/authorization is handled inside the main binary through controller-runtime's vetted code path; and you can enforce TLS 1.3. Unauthenticated requests are rejected with 401, and unauthorized requests with 403.
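The 401/403 behavior can be illustrated with the standard library alone. The following is a sketch of the general pattern, not controller-runtime's implementation: the authFunc type and the token table are hypothetical stand-ins for the TokenReview and SubjectAccessReview calls.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// authFunc is a hypothetical stand-in for TokenReview plus SubjectAccessReview:
// it reports whether the token is valid and whether its owner may GET /metrics.
type authFunc func(token string) (authenticated, authorized bool)

// protect wraps next so that it is reached only by authenticated, authorized callers.
func protect(check authFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := r.Header.Get("Authorization")
		if !strings.HasPrefix(h, "Bearer ") {
			http.Error(w, "Unauthorized", http.StatusUnauthorized)
			return
		}
		authenticated, authorized := check(strings.TrimPrefix(h, "Bearer "))
		switch {
		case !authenticated:
			http.Error(w, "Unauthorized", http.StatusUnauthorized)
		case !authorized:
			http.Error(w, "Forbidden", http.StatusForbidden)
		default:
			next.ServeHTTP(w, r)
		}
	})
}

// demoCheck is a hard-coded token table used only for this example.
func demoCheck(token string) (bool, bool) {
	switch token {
	case "prometheus-token":
		return true, true
	case "other-sa-token":
		return true, false
	default:
		return false, false
	}
}

func newDemoHandler() http.Handler {
	metrics := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# metrics would be written here")
	})
	return protect(demoCheck, metrics)
}

// scrape issues an in-memory GET /metrics and returns the status code.
func scrape(h http.Handler, token string) int {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newDemoHandler()
	fmt.Println("no token:        ", scrape(h, ""))
	fmt.Println("unauthorized SA: ", scrape(h, "other-sa-token"))
	fmt.Println("prometheus SA:   ", scrape(h, "prometheus-token"))
}
```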

Webhook Security — TLS and cert-manager

The API Server Calls the Webhook Directly

When an Operator provides a ValidatingWebhook or MutatingWebhook, the API server sends admission requests to the controller over HTTPS. This communication must be protected by TLS, and the API server verifies the webhook server certificate against a CA. If the certificate expires or the CA bundle is mismatched, admission fails, and depending on the failure policy, the entire object creation can be blocked.

In practice, cert-manager issues and rotates certificates automatically. The webhook server uses a certificate issued by cert-manager, and the CA bundle is automatically injected into the ValidatingWebhookConfiguration by cert-manager's ca-injector.

# cert-manager Certificate — webhook server TLS certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: widget-serving-cert
  namespace: widget-system
spec:
  dnsNames:
    - widget-webhook-service.widget-system.svc
    - widget-webhook-service.widget-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: widget-selfsigned-issuer
  secretName: widget-webhook-server-cert
  duration: 2160h    # 90 days
  renewBefore: 360h  # renew 15 days before expiry

# Annotate so ca-injector injects the caBundle automatically
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: widget-validating-webhook
  annotations:
    cert-manager.io/inject-ca-from: widget-system/widget-serving-cert
webhooks:
  - name: vwidget.kb.io
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: widget-webhook-service
        namespace: widget-system
        path: /validate-apps-example-com-v1alpha1-widget
    rules:
      - apiGroups: ["apps.example.com"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["widgets"]

Validating Webhook Code and Failure Policy

A validating webhook is a powerful control point to verify that an object does not violate security policy. For example, you can stop a user from submitting a dangerous combination of fields.

func (v *WidgetValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
	widget, ok := obj.(*appsv1alpha1.Widget)
	if !ok {
		return nil, fmt.Errorf("not a Widget object")
	}

	// Security policy: reject requests for privileged mode
	if widget.Spec.Privileged {
		return nil, field.Forbidden(
			field.NewPath("spec", "privileged"),
			"privileged mode is not allowed on this cluster",
		)
	}

	// Security policy: only allow images from trusted registries
	if !strings.HasPrefix(widget.Spec.Image, "registry.internal.example.com/") {
		return nil, field.Invalid(
			field.NewPath("spec", "image"),
			widget.Spec.Image,
			"only internal registry images are allowed",
		)
	}

	return nil, nil
}

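failurePolicy must be chosen carefully from a security standpoint. With Fail, the API server rejects the request when the webhook cannot respond, so the control cannot be bypassed by knocking the webhook over, but degraded webhook availability can block cluster operations. With Ignore, the request is admitted unchecked. For security-control webhooks, use Fail and ensure high availability of the webhook itself (multiple replicas, PodDisruptionBudget). Because the webhook rules above also cover UPDATE, the same checks belong in ValidateUpdate; otherwise a user can create a compliant object and then edit it into a non-compliant one.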
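The difference between failing closed and failing open fits in one function. This is a simplified model of the API server's decision, with the hypothetical names admitted and errWebhookUnavailable; it ignores timeouts, retries, and multiple webhooks.

```go
package main

import (
	"errors"
	"fmt"
)

// errWebhookUnavailable stands for any failure to reach the webhook.
var errWebhookUnavailable = errors.New("webhook unavailable")

// admitted models the outcome of one webhook call.
// callErr is non-nil when the webhook could not be reached or did not answer.
func admitted(failurePolicy string, callErr error, webhookAllowed bool) bool {
	if callErr != nil {
		// Fail closes the door; Ignore lets the request through unchecked.
		return failurePolicy == "Ignore"
	}
	return webhookAllowed
}

func main() {
	for _, policy := range []string{"Fail", "Ignore"} {
		fmt.Printf("%-6s webhook down, request admitted: %v\n",
			policy, admitted(policy, errWebhookUnavailable, false))
	}
}
```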

Preventing Privilege Escalation via CRs — The Most Subtle Risk

The Shape of the Problem

When an Operator creates RBAC objects, a subtle but serious escalation path appears. Consider an Operator that "automatically creates a ServiceAccount and Role per tenant." A user writes the desired permissions into a CR, and the Operator creates a Role accordingly.

Here is the problem. Kubernetes RBAC has privilege escalation prevention rules: a subject that creates or modifies a Role or binding cannot grant permissions it does not itself hold. The only exceptions are the escalate verb on roles or clusterroles, and the bind verb on the role being referenced.

The trouble arises when the Operator's ServiceAccount holds powerful privileges. A user can use the Operator to create, by proxy, a Role that they would be blocked from creating directly by the escalation rules. The user only needs to be able to create a CR; the powerful Operator performs the actual RBAC creation on their behalf.

Normal path (blocked):
  User -> tries to create a RoleBinding directly (granting cluster-admin)
  -> API server: "you do not have cluster-admin, cannot grant" -> denied

Bypass path (dangerous):
  User -> creates a CR (requesting cluster-admin in spec)
  -> Operator (powerful SA) -> trusts the CR and creates the RoleBinding
  -> the user obtains cluster-admin by proxy
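The check the API server applies on the normal path can be modelled as a coverage test: every permission in the requested rules must already be held by the requester. The sketch below is a simplified model with hypothetical names (Rule, held, covers); it ignores resourceNames and non-resource URLs.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified RBAC rule.
type Rule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// has reports whether list grants want, treating "*" in the list as a wildcard.
func has(list []string, want string) bool {
	for _, s := range list {
		if s == "*" || s == want {
			return true
		}
	}
	return false
}

// held reports whether any owned rule grants verb on group/resource.
func held(owned []Rule, group, resource, verb string) bool {
	for _, r := range owned {
		if has(r.APIGroups, group) && has(r.Resources, resource) && has(r.Verbs, verb) {
			return true
		}
	}
	return false
}

// covers reports whether owned grants everything that requested asks for.
// A requested "*" is only covered by an owned "*".
func covers(owned, requested []Rule) bool {
	for _, r := range requested {
		for _, g := range r.APIGroups {
			for _, res := range r.Resources {
				for _, v := range r.Verbs {
					if !held(owned, g, res, v) {
						return false
					}
				}
			}
		}
	}
	return true
}

func main() {
	user := []Rule{{APIGroups: []string{""}, Resources: []string{"pods"}, Verbs: []string{"get", "list"}}}
	operator := []Rule{{APIGroups: []string{"*"}, Resources: []string{"*"}, Verbs: []string{"*"}}}
	requested := []Rule{{APIGroups: []string{"*"}, Resources: []string{"*"}, Verbs: []string{"*"}}}

	fmt.Println("user may grant directly:   ", covers(user, requested))
	fmt.Println("operator may grant by proxy:", covers(operator, requested))
}
```

The same request that fails for the user succeeds for a broadly privileged Operator, which is the whole bypass: the check is applied to whoever sends the API request, not to whoever wrote the CR.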

Defense Strategies

Defending against this risk takes several layers.

First, minimize the Operator's own permission to create RBAC. Ask again whether you truly need to create RBAC objects, and if possible design it so the Operator only binds predefined, fixed Roles. Avoid, as much as possible, a design where the Operator dynamically creates Roles with arbitrary rules.

Second, use a validating webhook to restrict the permissions a CR may request. If a CR requests dangerous verbs (escalate, bind, impersonate), wildcards, or dangerous resources (clusterrolebindings, all secrets), the webhook rejects it. The validator below is a denylist, which is the weaker form: anything it does not name gets through, so it must at least reject the * wildcard, which would otherwise include every forbidden verb. An allowlist of permitted resource and verb pairs is stricter. The same checks must also run in ValidateUpdate.

var forbiddenVerbs = map[string]bool{
	"*":           true, // the wildcard would include every verb below
	"escalate":    true,
	"bind":        true,
	"impersonate": true,
}

var forbiddenResources = map[string]bool{
	"*":                   true,
	"clusterroles":        true,
	"clusterrolebindings": true,
}

func (v *TenantRoleValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
	tr, ok := obj.(*appsv1alpha1.TenantRole)
	if !ok {
		return nil, fmt.Errorf("not a TenantRole object")
	}

	for _, rule := range tr.Spec.Rules {
		for _, verb := range rule.Verbs {
			if forbiddenVerbs[verb] {
				return nil, field.Forbidden(
					field.NewPath("spec", "rules"),
					fmt.Sprintf("verb %q is not allowed in a tenant Role", verb),
				)
			}
		}
		// Block wildcards and cluster-scoped RBAC resources
		for _, res := range rule.Resources {
			if forbiddenResources[res] {
				return nil, field.Forbidden(
					field.NewPath("spec", "rules"),
					fmt.Sprintf("resource %q is not allowed in a tenant Role", res),
				)
			}
		}
	}
	return nil, nil
}
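The validator above rejects what it names; the stricter alternative accepts only what it names. The following is a minimal allowlist sketch, assuming a hypothetical table of permitted resources and verbs and a simplified Rule struct. The table contents are illustrative, not a recommendation.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified form of one requested rule.
type Rule struct {
	Resources []string
	Verbs     []string
}

// allowed maps each permitted resource to its permitted verbs.
// The entries are an assumption for this example.
var allowed = map[string]map[string]bool{
	"configmaps": {"get": true, "list": true, "watch": true},
	"pods":       {"get": true, "list": true, "watch": true},
}

// validate returns an error for the first resource or verb that is not on the allowlist.
// Wildcards are rejected automatically because "*" is not a key in the table.
func validate(rules []Rule) error {
	for i, r := range rules {
		for _, res := range r.Resources {
			verbs, ok := allowed[res]
			if !ok {
				return fmt.Errorf("rules[%d]: resource %q is not on the allowlist", i, res)
			}
			for _, v := range r.Verbs {
				if !verbs[v] {
					return fmt.Errorf("rules[%d]: verb %q on %q is not on the allowlist", i, v, res)
				}
			}
		}
	}
	return nil
}

func main() {
	requests := [][]Rule{
		{{Resources: []string{"pods"}, Verbs: []string{"get", "list"}}},
		{{Resources: []string{"pods"}, Verbs: []string{"*"}}},
		{{Resources: []string{"secrets"}, Verbs: []string{"get"}}},
	}
	for _, req := range requests {
		fmt.Printf("%v -> %v\n", req, validate(req))
	}
}
```

With an allowlist, a new dangerous verb or resource added to Kubernetes later is denied by default instead of slipping past a list nobody updated.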

Third, use aggregated ClusterRoles to eliminate dynamic creation altogether. Define small ClusterRoles in advance, review them, and label them for aggregation; the Operator then selects among these fragments instead of minting new rules. This way only permission fragments a cluster administrator has reviewed get combined, preventing injection of arbitrary permissions.

Fourth, re-emphasize the least privilege of the controller ServiceAccount itself. Restrict the verbs the Operator can hold on rbac.authorization.k8s.io/roles to get/list/watch/create/update/patch, and grant delete or cluster-scoped RBAC permissions only when truly necessary.

Multi-Tenant Isolation

How Do You Draw the Boundary Between Tenants?

In a multi-tenant environment where several teams share one cluster and each uses the Operator's CRs, you must isolate tenants so one cannot affect another's resources. Isolation strategies fall into two broad camps.

Strategy A: single Operator, multiple tenants (shared Operator)
  One Operator instance handles CRs across all tenant namespaces
  Pros: operationally simple, resource-efficient
  Cons: compromise affects all tenants, weak isolation

Strategy B: per-tenant Operator (dedicated Operator)
  An independent Operator instance per tenant namespace
  Pros: strong isolation, minimal blast radius
  Cons: operationally complex, resource overhead

In highly regulated environments or where trust between tenants is low, strategy B (per-tenant Operator) is recommended. Each Operator has its cache and RBAC restricted to its tenant namespace, so even if one tenant's Operator is compromised, others are unaffected.

Scoping With OperatorGroup (OLM)

When using Operator Lifecycle Manager (OLM), the OperatorGroup determines the Operator's watch scope. Specifying the OperatorGroup's targetNamespaces makes the Operator handle only CRs in those namespaces.

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: tenant-a-operatorgroup
  namespace: tenant-a
spec:
  # This Operator watches only the tenant-a namespace
  targetNamespaces:
    - tenant-a

Alongside this, block cross-tenant traffic with network policy, and prevent any one tenant from monopolizing resources with ResourceQuota and LimitRange.

# Restrict inbound to a tenant namespace to the same tenant only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-a-isolation
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: tenant-a

# Per-tenant resource ceilings
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    count/widgets.apps.example.com: "50"

The last line is worth noting. The count of custom resources can also be capped with ResourceQuota, blocking a resource-exhaustion attack where a tenant creates unlimited CRs to overwhelm the Operator.

Supply Chain Security — Image Signing and SBOM

Can You Trust the Operator Image?

The third path in the threat model was the supply chain. No matter how tightly you narrow RBAC, it is useless if the deployed Operator image itself contains a backdoor. As of 2026, the standard tooling for supply chain security is sigstore's cosign (signing), SBOM (Software Bill of Materials), and SLSA provenance.

Sign the image in the build pipeline.

# Sign the image with cosign (keyless — uses OIDC identity)
cosign sign --yes \
  registry.internal.example.com/widget-operator:v1.4.0

# Generate an SBOM
syft registry.internal.example.com/widget-operator:v1.4.0 \
  -o spdx-json > sbom.spdx.json

# Attach the raw SBOM to the image (unsigned; cosign has deprecated
# "attach sbom" in favor of the attestation below)
cosign attach sbom --sbom sbom.spdx.json \
  registry.internal.example.com/widget-operator:v1.4.0

# Publish the SBOM as a signed attestation
cosign attest --yes \
  --predicate sbom.spdx.json --type spdxjson \
  registry.internal.example.com/widget-operator:v1.4.0

Enforce Signature Verification With an Admission Policy

Signing an image is not enough. It only matters if the cluster rejects unsigned images. Place sigstore policy-controller (or Kyverno, OPA Gatekeeper) at the admission stage to verify signatures.

# sigstore policy-controller — allow only images signed by a trusted identity
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: require-signed-operator-images
spec:
  images:
    - glob: "registry.internal.example.com/widget-operator*"
  authorities:
    - keyless:
        identities:
          - issuer: https://token.actions.githubusercontent.com
            subject: https://github.com/example/widget-operator/.github/workflows/release.yml@refs/tags/v1.4.0

policy-controller enforces policies only in namespaces that opt in with the label policy.sigstore.dev/include: "true". In such a namespace, an attempt to create a Pod from a matching image that does not pass signature verification is rejected at admission. If you verify SLSA provenance as well, you can also check that the image was built by a trusted build system from a specific source commit.
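The principle behind admission-time verification is ordinary signature checking over the image digest. The sketch below shows it with the standard library's Ed25519 package. This is a deliberate simplification: cosign keyless signing uses short-lived certificates and a transparency log rather than a long-lived key, and all names here (newKeyPair, signDigest, admit, the digest string) are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// newKeyPair generates a signing key pair using crypto/rand.
func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// signDigest is what the build pipeline does: sign the digest of the pushed image.
func signDigest(priv ed25519.PrivateKey, imageDigest string) []byte {
	return ed25519.Sign(priv, []byte(imageDigest))
}

// admit is what the admission policy does: accept the image only when sig is a
// valid signature by the trusted key over exactly this digest.
func admit(trusted ed25519.PublicKey, imageDigest string, sig []byte) bool {
	return ed25519.Verify(trusted, []byte(imageDigest), sig)
}

func main() {
	pub, priv := newKeyPair()
	digest := "sha256:placeholder-digest-of-widget-operator" // placeholder, not a real digest
	sig := signDigest(priv, digest)

	fmt.Println("signed image:  ", admit(pub, digest, sig))
	fmt.Println("tampered image:", admit(pub, "sha256:placeholder-digest-of-other-image", sig))
	fmt.Println("unsigned image:", admit(pub, digest, nil))
}
```

The signature binds to the digest, not the tag, so an attacker who re-points a tag at a different image cannot reuse an old signature.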

Base Image and Dependencies

Since the Operator binary itself is a statically linked Go binary, prefer a distroless or scratch base. This minimizes the attack surface, and the absence of a shell makes post-compromise lateral movement harder.

# Multi-stage build — final image is distroless
FROM golang:1.26 AS builder
WORKDIR /workspace
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -o manager cmd/main.go

FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/manager .
USER 65532:65532
ENTRYPOINT ["/manager"]

A distroless image that runs as a nonroot user and contains no shell contributes to both supply chain and runtime security.

Secret Handling

Principles for an Operator That Handles Secrets

Many Operators read Secrets, because they deal with database passwords, API keys, TLS certificates, and the like. There are several iron rules for handling Secrets.

First, never write Secrets to logs. A common incident is logging an entire object for debugging and leaving the Secret value in cleartext in the logs. You must explicitly exclude sensitive fields in structured logging.

// Bad: logging the whole Secret — values are exposed
// log.Info("retrieved secret", "secret", secret)

// Good: log metadata only, never the values
log.Info("retrieved secret",
	"name", secret.Name,
	"namespace", secret.Namespace,
	// maps.Keys returns an iterator since Go 1.23, so collect it into a sorted slice
	"keys", slices.Sorted(maps.Keys(secret.Data)), // key names only, exclude values
)
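Relying on every call site to pick safe fields is fragile. One way to make redaction the default is to give the type its own log representation. The sketch below uses the standard library's log/slog and its LogValuer interface; controller-runtime logs through logr, so treat this as an illustration of the idea rather than a drop-in. SecretView, render, and mentions are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"sort"
	"strings"
)

// SecretView is a hypothetical wrapper around the fields of a Secret we care about.
type SecretView struct {
	Name      string
	Namespace string
	Data      map[string][]byte
}

// LogValue implements slog.LogValuer: whenever a SecretView is logged,
// only its name, namespace, and key names are emitted, never the values.
func (s SecretView) LogValue() slog.Value {
	keys := make([]string, 0, len(s.Data))
	for k := range s.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return slog.GroupValue(
		slog.String("name", s.Name),
		slog.String("namespace", s.Namespace),
		slog.Any("keys", keys),
	)
}

// render logs the whole object through a text handler and returns the log line.
func render(s SecretView) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, nil))
	logger.Info("retrieved secret", "secret", s)
	return buf.String()
}

// mentions reports whether the log line contains the given text.
func mentions(line, text string) bool {
	return strings.Contains(line, text)
}

func main() {
	s := SecretView{
		Name:      "db-credentials",
		Namespace: "tenant-a",
		Data:      map[string][]byte{"username": []byte("app"), "password": []byte("hunter2")},
	}
	line := render(s)
	fmt.Print(line)
	fmt.Println("value leaked:", mentions(line, "hunter2"))
}
```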

Second, narrow RBAC on Secrets as much as possible. Do not let the Operator read every Secret in every namespace; where possible, restrict access to Secrets in a specific namespace or with a specific name. Make active use of the resourceNames seen earlier.

Third, enable encryption at rest. This is the cluster operator's responsibility, but the Operator developer should document it as a prerequisite. If Secrets are stored in etcd in cleartext, a single leaked etcd backup exposes every Secret. Envelope encryption via a KMS provider is recommended.

# API server EncryptionConfiguration — encrypt Secrets with a KMS provider
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - kms:
          apiVersion: v2
          name: cluster-kms
          endpoint: unix:///var/run/kmsplugin/socket.sock
      - identity: {}

Fourth, where possible, integrate with an external secret manager (Vault or a cloud provider's secret manager) to keep Secrets outside the cluster, syncing only when needed with a tool like the External Secrets Operator. Ideally the Operator should not hold long-lived credentials directly.

Audit

What Should Be Recorded

Since an Operator is a powerful subject that changes cluster state, its actions should appear in the audit log. Configure the Kubernetes audit policy to record the requests performed by the Operator's ServiceAccount at an appropriate level.

# API server audit policy excerpt — record Operator write operations in detail
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record the Operator SA's RBAC object changes fully at RequestResponse level
  - level: RequestResponse
    users: ["system:serviceaccount:widget-system:widget-controller-manager"]
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Record Secret access at Metadata level (do not log values)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Record other Operator write operations at Request level
  - level: Request
    users: ["system:serviceaccount:widget-system:widget-controller-manager"]
    verbs: ["create", "update", "patch", "delete"]

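In particular, the act of an Operator changing RBAC objects may be a signal of privilege escalation, so it is good to record it fully at RequestResponse level. Conversely, record Secret access only at Metadata level so values do not appear in cleartext in the audit log. Audit rules are evaluated in order and the first match decides the level, which is why the Secret rule sits above the catch-all Request rule; reversing them would write the bodies of the Operator's Secret writes into the audit log. The Operator itself should also record meaningful domain events (e.g., "Widget X reconcile failed") as Kubernetes Events to improve operational visibility.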
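Audit rules are matched top to bottom and the first matching rule sets the level, so the order of the three rules matters. The sketch below models that evaluation with a hypothetical AuditRule struct that matches on user, verb, and resource name only; it is a simplification of the real policy semantics.

```go
package main

import "fmt"

// AuditRule is a hypothetical, simplified audit policy rule.
// An empty list matches anything.
type AuditRule struct {
	Level     string
	Users     []string
	Verbs     []string
	Resources []string
}

const operatorSA = "system:serviceaccount:widget-system:widget-controller-manager"

var writes = []string{"create", "update", "patch", "delete"}

// docPolicy mirrors the three rules of the policy excerpt, in the same order.
var docPolicy = []AuditRule{
	{Level: "RequestResponse", Users: []string{operatorSA}, Verbs: writes,
		Resources: []string{"roles", "rolebindings", "clusterroles", "clusterrolebindings"}},
	{Level: "Metadata", Resources: []string{"secrets"}},
	{Level: "Request", Users: []string{operatorSA}, Verbs: writes},
}

func matches(list []string, v string) bool {
	if len(list) == 0 {
		return true
	}
	for _, s := range list {
		if s == v {
			return true
		}
	}
	return false
}

// levelFor returns the level of the first matching rule, or "None" if no rule matches.
func levelFor(rules []AuditRule, user, verb, resource string) string {
	for _, r := range rules {
		if matches(r.Users, user) && matches(r.Verbs, verb) && matches(r.Resources, resource) {
			return r.Level
		}
	}
	return "None"
}

func main() {
	fmt.Println("operator patches a rolebinding:", levelFor(docPolicy, operatorSA, "patch", "rolebindings"))
	fmt.Println("operator creates a secret:     ", levelFor(docPolicy, operatorSA, "create", "secrets"))
	fmt.Println("operator updates a deployment: ", levelFor(docPolicy, operatorSA, "update", "deployments"))

	// Swapping the last two rules changes the outcome for Secret writes.
	reordered := []AuditRule{docPolicy[0], docPolicy[2], docPolicy[1]}
	fmt.Println("secret write after reordering: ", levelFor(reordered, operatorSA, "create", "secrets"))
}
```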

How reconcile Idempotency Relates to Security

Idempotency looks like a correctness concern rather than a security one, but the two are connected. controller-runtime's reconcile should be designed to be idempotent and desired-state centric.

If reconcile works by "observing the current state and converging to the desired state," then even if someone maliciously or accidentally tampers with a resource, the Operator reverts it on the next reconcile. In other words, the reconcile loop itself becomes a kind of drift correction and a security control. Conversely, if reconcile is written imperatively (run once when an event arrives), tampering goes uncorrected and becomes a security hole.
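The drift-correction property is easy to demonstrate on plain data. In the sketch below, maps of strings stand in for the desired and observed state of a child object; the reconcile function and the field names are hypothetical and far simpler than a real controller, but the two properties carry over: a second run changes nothing, and tampering is reverted.

```go
package main

import "fmt"

// reconcile converges actual toward desired and returns the number of changes made.
// Calling it again on a converged state changes nothing.
func reconcile(desired, actual map[string]string) int {
	changes := 0
	// Create or correct every field the desired state defines.
	for k, want := range desired {
		if got, ok := actual[k]; !ok || got != want {
			actual[k] = want
			changes++
		}
	}
	// Remove fields the desired state does not define.
	for k := range actual {
		if _, ok := desired[k]; !ok {
			delete(actual, k)
			changes++
		}
	}
	return changes
}

func main() {
	desired := map[string]string{
		"image":        "registry.internal.example.com/widget:v1",
		"runAsNonRoot": "true",
	}
	actual := map[string]string{}

	fmt.Println("first run, changes: ", reconcile(desired, actual))
	fmt.Println("second run, changes:", reconcile(desired, actual))

	// Someone tampers with the live object.
	actual["image"] = "attacker.example.com/backdoor:latest"
	actual["privileged"] = "true"
	fmt.Println("after tampering, changes:", reconcile(desired, actual))
	fmt.Println("state restored:", actual)
}
```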

Also, filtering events with predicates reduces unnecessary reconciles, improving resilience against load-based attacks (attempts to overwhelm the Operator with many CR changes).

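// React only to spec changes of the Widget itself, which prevents reconcile storms
// from status-only updates. The predicate is attached to For() rather than to
// WithEventFilter, which would also filter events from owned Deployments and
// weaken drift correction on children.
func (r *WidgetReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.Widget{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.Deployment{}).
		Complete(r)
}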

Finalizers guarantee cleanup on deletion, and this is security-related too. If the Operator created credentials or resources in an external system, it must reliably reclaim them via a finalizer when the CR is deleted, to prevent permission leakage from orphaned resources.

Security Checklist

A single view of items to check before going to production.

[ RBAC least privilege ]
  - All RBAC markers grant only the verbs/resources the code actually uses
  - Removed verbs=* and resources=* wildcards
  - Dropped delete wherever ownerReference garbage collection suffices
  - Scoped the leader-election Lease with resourceNames
  - Used a Role instead of a ClusterRole wherever it suffices

[ Scope / multi-tenancy ]
  - Restricted the manager cache to the needed namespaces
  - If tenant isolation is needed, considered a per-tenant Operator
  - Applied NetworkPolicy / ResourceQuota / LimitRange
  - Put a quota on the CR count too

[ Metrics / webhook ]
  - Applied WithAuthenticationAndAuthorization on the metrics endpoint
  - Removed the kube-rbac-proxy sidecar (2026)
  - Enforced a minimum TLS 1.3 version
  - Rotating webhook certificates automatically with cert-manager
  - Security webhook failurePolicy=Fail + high availability ensured

[ Privilege escalation prevention ]
  - Designed so the Operator does not dynamically create RBAC
  - Whitelisted the permissions a CR may request via a webhook
  - Blocked the escalate/bind/impersonate verbs

[ Supply chain ]
  - Sign images with cosign
  - Generate/attach/sign an SBOM
  - Enforce signature verification with an admission policy
  - Use a distroless/nonroot base image

[ Secret / audit ]
  - Do not log Secrets in cleartext
  - Narrowed Secret RBAC by namespace/name
  - Enabled etcd encryption at rest (KMS)
  - Record the Operator's RBAC changes at RequestResponse in the audit policy

Conclusion

An Operator is the pinnacle of automation that operates a cluster, but its powerful privileges also make it one of the most attractive attack targets. The hardening covered here converges on one principle: give the Operator only the minimum privileges it needs, and defend in depth so those privileges cannot be abused.

Declare permissions next to the code with RBAC markers, trim verbs deliberately, scope to namespaces, protect metrics and webhooks with authentication/TLS, block CR-based privilege escalation with webhooks, sign and verify images, handle Secrets carefully, and record every action in the audit log. Each of these is not hard on its own, but only when combined do they form a defensive line where a single compromise does not lead to a full cluster takeover.

Whenever you design a new Operator, draw the threat model first and apply the checklist above from the start. Security is most robust when it is part of the design rather than bolted on afterward.

References

Kubebuilder Book: https://kubebuilder.io/
Operator SDK: https://sdk.operatorframework.io/
controller-runtime (pkg.go.dev): https://pkg.go.dev/sigs.k8s.io/controller-runtime
Kubernetes Operator pattern: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
Kubernetes RBAC docs: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Admission Controllers: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
controller-tools (RBAC markers): https://github.com/kubernetes-sigs/controller-tools
Operator Lifecycle Manager (OLM): https://olm.operatorframework.io/
sigstore / cosign: https://www.sigstore.dev/
SLSA supply chain integrity: https://slsa.dev/
Kubebuilder GitHub: https://github.com/kubernetes-sigs/kubebuilder
controller-runtime GitHub: https://github.com/kubernetes-sigs/controller-runtime