Chaos and Order

Introduction

When you first write an Operator, you focus on a single question: "is the reconcile loop running?" But once it reaches production, the questions change. "How often is reconcile failing right now?", "Is the queue backing up?", "Why isn't the user's resource being reflected?", "Is this Operator meeting the level of trust (SLO) it promised?"

This article covers observability for Operators built with Kubebuilder (as of 2026: Kubernetes 1.36 / Go 1.26, controller-runtime v0.24.x, controller-tools v0.21.x). Observability is not just about putting up a dashboard; it is the practice of designing every signal that lets you infer the internal state of a system from the outside. We will cover five signals.

The five observability signals

┌──────────────┬────────────────────────────────────────────┐
│ Signal       │ Who is it for                              │
├──────────────┼────────────────────────────────────────────┤
│ Metrics      │ SRE / operators: trends, alerts, SLOs      │
│ Events       │ End users: context via kubectl describe    │
│ Logging      │ Developers / operators: one reconcile flow │
│ Tracing      │ Developers: latency across multiple stages │
│ status/cond. │ End users / controllers: declarative state │
└──────────────┴────────────────────────────────────────────┘

Each signal serves a different audience and purpose. Metrics carry aggregated trends, events carry the context a user sees via kubectl describe, logs carry the detailed flow of a single reconcile, tracing captures distributed latency, and status/conditions express the declarative current state. You must design all five in balance to end up with an operable Operator.

controller-runtime default metrics

controller-runtime exposes a rich set of default metrics through its metrics server with no extra configuration. These metrics are the starting point for Operator observability. Let us look at the most important ones.

reconcile family

controller_runtime_reconcile_total{controller, result}

Count of reconcile calls. The result label distinguishes

success / error / requeue / requeue_after. It forms the

numerator and denominator for success rate.

controller_runtime_reconcile_errors_total{controller}

Number of times Reconcile returned an error. A steady rise

means something is failing repeatedly.

controller_runtime_reconcile_time_seconds{controller}

Histogram of how long one reconcile takes. It produces

_bucket / _sum / _count series used for quantiles

(p50/p95/p99).

controller_runtime_active_workers{controller}

Number of workers currently performing reconcile concurrently.

controller_runtime_max_concurrent_reconciles{controller}

The configured maximum concurrent reconciles

(MaxConcurrentReconciles).

The result label on reconcile_total matters a lot in operations. An abnormally high count of requeue / requeue_after can signal that reconcile is not converging to the desired state and keeps retrying.

workqueue family

The state of the workqueue that triggers reconcile directly reflects the Operator's load and health.

workqueue_depth{name}

Number of items waiting in the queue. Persistently high means

the controller cannot keep up with the incoming event rate.

workqueue_adds_total{name}

Cumulative items added to the queue. Use rate() to see arrival

rate.

workqueue_queue_duration_seconds{name}

Time items spent waiting in the queue (histogram).

workqueue_work_duration_seconds{name}

Time spent processing an item (histogram). The actual reconcile

work time.

workqueue_retries_total{name}

Cumulative retries from the queue. Rises when error-driven

retries are frequent.

workqueue_unfinished_work_seconds{name}

Time that not-yet-processed work has been sitting in the queue.

workqueue_longest_running_processor_seconds{name}

Duration of the longest-running processing task. Useful to

detect a stuck reconcile.

The relationship between these metrics looks like this.

event occurs -> predicate passes -> workqueue.Add()
        │
        │  [workqueue_adds_total]
        ▼
┌──────────────┐
│  workqueue   │ <- [workqueue_depth]
│  (waiting)   │ <- [queue_duration]
└──────┬───────┘
       │ Get()
       ▼
Reconcile() runs
       │ <- [active_workers]
       │ <- [work_duration / reconcile_time]
       ▼
success -> Forget()        error -> AddRateLimited()
[reconcile_total           [reconcile_errors_total
 result=success]            workqueue_retries_total]

Metrics server and security

In controller-runtime v0.24.x, the metrics server is configured through the Metrics field of manager.Options. The kube-rbac-proxy sidecar that Kubebuilder used to scaffold in front of the metrics endpoint is no longer part of the default scaffolding; the metrics server itself now handles authentication and authorization.

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

func newManager(cfg *rest.Config) (manager.Manager, error) {
	return ctrl.NewManager(cfg, ctrl.Options{
		Metrics: metricsserver.Options{
			BindAddress:   ":8443",
			SecureServing: true,
			// Applies authentication and authorization to the
			// metrics endpoint, replacing the old kube-rbac-proxy
			// sidecar.
			FilterProvider: filters.WithAuthenticationAndAuthorization,
		},
	})
}

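WithAuthenticationAndAuthorization validates the ServiceAccount token on incoming requests and checks RBAC permission for the nonResourceURLs path. In other words, for Prometheus to scrape the metrics it needs a ServiceAccount with permission to that path. The filter performs these checks through the TokenReview and SubjectAccessReview APIs, so the Operator's own ServiceAccount must be allowed to create both.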

Adding custom metrics

Default metrics show how the controller behaves, but you must define domain-specific signals yourself. For example: "number of workloads currently managed", "time elapsed since the last sync", or "count of external API call failures".

controller-runtime exposes its own Prometheus registry (metrics.Registry), so if you register a custom collector there, it is exposed on the same /metrics endpoint.

package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Current count of managed resources by phase (gauge)
	managedResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "myoperator_managed_resources",
			Help: "Number of currently managed resources, by phase label",
		},
		[]string{"namespace", "phase"},
	)

	// Counter for external API call outcomes
	externalAPICalls = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "myoperator_external_api_calls_total",
			Help: "Number of external API calls, by result label",
		},
		[]string{"endpoint", "result"},
	)

	// Timestamp gauge to infer elapsed time since last successful sync
	lastSyncTimestamp = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "myoperator_last_successful_sync_timestamp_seconds",
			Help: "Unix timestamp of the last successful sync per resource",
		},
		[]string{"namespace", "name"},
	)

	// Histogram of per-stage reconcile duration
	reconcileStageDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "myoperator_reconcile_stage_duration_seconds",
			Help:    "Duration of each internal reconcile stage",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"stage"},
	)
)

func init() {
	// Register with controller-runtime's global registry so these
	// are exposed on the same /metrics as the default metrics.
	metrics.Registry.MustRegister(
		managedResources,
		externalAPICalls,
		lastSyncTimestamp,
		reconcileStageDuration,
	)
}

Now update these metrics inside reconcile.

func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {

log := logf.FromContext(ctx)

var obj appsv1alpha1.MyResource

if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {

return ctrl.Result{}, client.IgnoreNotFound(err)

}

// Start a per-stage timer

stageTimer := prometheus.NewTimer(

reconcileStageDuration.WithLabelValues("fetch_external"),

)

state, err := r.External.FetchState(ctx, obj.Spec.ID)

stageTimer.ObserveDuration()

if err != nil {

externalAPICalls.WithLabelValues("fetch", "error").Inc()

log.Error(err, "failed to fetch external state")

return ctrl.Result{}, err

}

externalAPICalls.WithLabelValues("fetch", "success").Inc()

// Apply stage that converges to the desired state

applyTimer := prometheus.NewTimer(

reconcileStageDuration.WithLabelValues("apply"),

)

phase, err := r.applyDesiredState(ctx, &obj, state)

applyTimer.ObserveDuration()

if err != nil {

return ctrl.Result{}, err

}

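// Update gauges. Note that Set(1) only marks that this namespace has a

// resource in this phase; it is not a count, and the series for a

// previous phase stays at 1 until it is deleted.

managedResources.WithLabelValues(obj.Namespace, phase).Set(1)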

lastSyncTimestamp.WithLabelValues(obj.Namespace, obj.Name).

SetToCurrentTime()

return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil

}

The pitfall of gauges and deletion handling

The most common mistake with gauges is that the gauge keeps reporting its last value even after the resource is deleted. Any series labeled per resource, such as myoperator_last_successful_sync_timestamp_seconds, adds to cardinality with every new resource and must be removed with DeleteLabelValues on deletion; the same applies to the previous-phase series of myoperator_managed_resources.

// On finalizer or NotFound handling, clean up the series for that label.

managedResources.DeleteLabelValues(obj.Namespace, oldPhase)

lastSyncTimestamp.DeleteLabelValues(obj.Namespace, obj.Name)

If you forget this, "ghost" series for deleted resources linger, cardinality grows without bound, and Prometheus memory usage climbs with it. This is also why putting a high-cardinality value like name into a label deserves careful thought.
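One way to sidestep both the ghost-series problem and the question of what the gauge value actually means is to recompute the per-phase counts from the full list of resources and overwrite the gauge in one pass. The sketch below shows only the counting step, using stand-in types (`resource` and `phaseKey` are hypothetical and not part of the code above). In a real Operator the slice would presumably come from a cached List call, and the result would be written after calling Reset() on the GaugeVec so that series for vanished namespace/phase pairs disappear.

```go
package main

import "fmt"

// resource is a stand-in for the fields of MyResource that matter here.
type resource struct {
	Namespace string
	Phase     string
}

// phaseKey mirrors the gauge's label set: {namespace, phase}.
type phaseKey struct {
	Namespace string
	Phase     string
}

// countByPhase recomputes the gauge values from scratch. Deleted resources
// are simply absent from the input, so they cannot leave a stale count behind.
func countByPhase(items []resource) map[phaseKey]int {
	counts := make(map[phaseKey]int)
	for _, it := range items {
		counts[phaseKey{it.Namespace, it.Phase}]++
	}
	return counts
}

func main() {
	items := []resource{
		{"team-a", "Ready"},
		{"team-a", "Ready"},
		{"team-a", "Pending"},
		{"team-b", "Ready"},
	}
	for k, v := range countByPhase(items) {
		fmt.Printf("namespace=%s phase=%s count=%d\n", k.Namespace, k.Phase, v)
	}
}
```

The trade-off is that a full recount costs one List per update, so it usually belongs in a periodic task or a custom collector rather than in every reconcile.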

Prometheus scraping and ServiceMonitor

Once metrics are exposed, Prometheus must collect them. In an environment using the Prometheus Operator, you declare scrape targets with a ServiceMonitor custom resource.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myoperator-metrics            # placeholder name
  namespace: myoperator-system
  labels:
    app.kubernetes.io/name: myoperator
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - port: https
      scheme: https
      path: /metrics
      interval: 30s
      bearerTokenSecret:
        name: myoperator-metrics-token  # placeholder Secret name
        key: token
      tlsConfig:
        # Skipping verification is tolerable for a demo only;
        # verify the serving certificate in production.
        insecureSkipVerify: true
  namespaceSelector:
    matchNames:
      - myoperator-system

Because the metrics endpoint requires authentication, the ServiceAccount Prometheus uses must be granted the following permission (bound to it with a ClusterRoleBinding).

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: myoperator-metrics-reader   # placeholder name
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get

If you do not use a ServiceMonitor, you can do the same thing with a hand-written Prometheus scrape config that discovers the endpoints through Kubernetes service discovery.

scrape_configs:
  - job_name: myoperator
    scheme: https
    metrics_path: /metrics
    tls_config:
      insecure_skip_verify: true
    authorization:
      type: Bearer
      credentials_file: /var/run/secrets/tokens/metrics-token
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - myoperator-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: https

Grafana dashboards and key PromQL

Collected metrics are queried with PromQL to build dashboards. Here are the queries you should always have as panels when operating an Operator.

reconcile success rate

sum(rate(controller_runtime_reconcile_total{controller="myresource", result!="error"}[5m]))

/

sum(rate(controller_runtime_reconcile_total{controller="myresource"}[5m]))

This becomes the core SLO indicator. The closer to 1, the healthier. Consider alerting if it drops below 0.99.

reconcile error rate

sum(rate(controller_runtime_reconcile_errors_total{controller="myresource"}[5m]))

This shows the absolute rate of errors themselves. A sudden jump from zero suggests a new deployment or an external dependency outage.

reconcile latency quantiles (p95 / p99)

histogram_quantile(

0.95,

sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="myresource"}[5m])) by (le)

)

Change 0.95 to 0.99 for p99. Rising latency suggests a bottleneck in external API calls or the apply stage.

workqueue depth and wait time

workqueue_depth{name="myresource"}

histogram_quantile(

0.95,

sum(rate(workqueue_queue_duration_seconds_bucket{name="myresource"}[5m])) by (le)

)

If depth trends steadily upward, the controller cannot keep up with load. Increase MaxConcurrentReconciles, reduce unnecessary events with predicates, or make reconcile itself lighter.

Data freshness

Using the custom timestamp gauge, you can directly alert on "how long since the last sync".

time() - max by (namespace, name) (myoperator_last_successful_sync_timestamp_seconds)

If this exceeds, say, 900 seconds (15 minutes), the resource has not been reconciled for a while.

Recording Kubernetes Events

If metrics are for SREs, events are for end users. Users see what happened to their resource via kubectl describe. A well-crafted event immediately answers the question "why isn't my resource Ready?"

In controller-runtime you obtain an EventRecorder with mgr.GetEventRecorderFor.

type MyResourceReconciler struct {

client.Client

Scheme *runtime.Scheme

Recorder record.EventRecorder

}

// Inject the Recorder in SetupWithManager.

func (r *MyResourceReconciler) SetupWithManager(mgr ctrl.Manager) error {

r.Recorder = mgr.GetEventRecorderFor("myresource-controller")

return ctrl.NewControllerManagedBy(mgr).

For(&appsv1alpha1.MyResource{}).

Complete(r)

}

Record Normal/Warning events inside reconcile.

func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {

var obj appsv1alpha1.MyResource

if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {

return ctrl.Result{}, client.IgnoreNotFound(err)

}

deployment, err := r.ensureDeployment(ctx, &obj)

if err != nil {

// User-visible Warning event: what failed and why

r.Recorder.Eventf(&obj, corev1.EventTypeWarning, "DeploymentFailed",

"failed to create child Deployment: %v", err)

return ctrl.Result{}, err

}

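// Spec.Replicas is a pointer; guard against nil before dereferencing.

if deployment.Spec.Replicas != nil && deployment.Status.ReadyReplicas == *deployment.Spec.Replicas {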

// Normal event announcing a healthy state

r.Recorder.Eventf(&obj, corev1.EventTypeNormal, "Ready",

"all replicas (%d) are ready", deployment.Status.ReadyReplicas)

}

return ctrl.Result{}, nil

}

Cautions when using events

Events are powerful but harmful when overused. The key principles:

Event recording principles

- Record only on state "transitions". Emitting the same event

every reconcile is noise for both etcd and the user.

- Make Reason a PascalCase word so machines can grep it.

(e.g., DeploymentFailed, Ready, Scaling, QuotaExceeded)

- Make the message a human sentence. Hinting at the next action

is even better.

- Events disappear after the default TTL (about 1 hour). Put

persistent state in status.conditions, not in events.

- Use Warning only when real attention is needed. Overuse dulls it.

Events are garbage-collected after roughly one hour by default (the kube-apiserver --event-ttl setting). The persistent representation of "current state" must therefore live in status, not in events.
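The "record only on transitions" principle can be sketched as a small pure function that compares the previously observed readiness with the new one and returns an event only when they differ. Everything here is a stand-in: the `event` type is hypothetical, and the "NotReady" reason is an invented example rather than something the reconciler above emits.

```go
package main

import "fmt"

// event is a stand-in for what would be passed to an EventRecorder.
type event struct {
	Type   string // "Normal" or "Warning"
	Reason string
}

// eventForTransition returns the event to emit when readiness moves from
// prevReady to nextReady, and false when nothing changed.
func eventForTransition(prevReady, nextReady bool) (event, bool) {
	if prevReady == nextReady {
		return event{}, false
	}
	if nextReady {
		return event{Type: "Normal", Reason: "Ready"}, true
	}
	return event{Type: "Warning", Reason: "NotReady"}, true
}

func main() {
	// Readiness observed over five consecutive reconciles.
	observed := []bool{false, false, true, true, false}
	prev := false
	for i, ready := range observed {
		if ev, ok := eventForTransition(prev, ready); ok {
			fmt.Printf("reconcile %d: emit %s/%s\n", i, ev.Type, ev.Reason)
		}
		prev = ready
	}
}
```

In a real reconciler the previous value would most likely be read from the existing Ready condition in status rather than kept in memory, so the comparison survives controller restarts.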

Structured logging with logr

A Kubebuilder Operator uses the logr interface. The key idea is structured logging with key-value pairs, not string formatting. This lets you emit logs as JSON and search/aggregate by field.

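import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
)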

func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {

// Get the logger from context. controller-runtime already

// injects reconcileID, controller, namespace, name, etc.

log := logf.FromContext(ctx)

var obj appsv1alpha1.MyResource

if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {

return ctrl.Result{}, client.IgnoreNotFound(err)

}

// Bind common fields that follow this reconcile throughout.

log = log.WithValues("generation", obj.Generation, "phase", obj.Status.Phase)

log.Info("starting reconcile")

if err := r.doWork(ctx, &obj); err != nil {

// Errors at Error level. Passing err first makes it structured.

log.Error(err, "failed to perform work", "step", "doWork")

return ctrl.Result{}, err

}

// Use V-levels for detailed debug info, off by default.

log.V(1).Info("computed desired state", "replicas", obj.Spec.Replicas)

log.Info("reconcile complete")

return ctrl.Result{}, nil

}

V-levels and logging policy

logr's V-levels mean "verbosity". V(0) (= Info) is always visible; higher numbers are more verbose and usually kept off.

Logging level guide

Error : reconcile failed and will be retried. Always on.

Info / V(0) : macro flow such as state transitions and

reconcile start/complete.

V(1) : decision rationale -- why this path was taken.

V(2)+ : very detailed flow inside loops, per-item.

What NOT to log

- Secret values, tokens, passwords (never)

- The same "no change" log every reconcile (noise)

- A full dump of a huge object (use WithValues for just the

fields you need)

The essence of good logging is "can you trace one reconcile's flow by reconcileID?" controller-runtime automatically adds reconcileID to the context logger, so filtering by the same ID reconstructs the full story of a single reconcile.
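The "never log secrets" rule is easier to keep when it is enforced in code rather than by review. The sketch below is one possible guard, not part of controller-runtime or logr: a hypothetical `redact` helper that masks the values of sensitive-looking keys in a logr-style key/value list before it is handed to the logger. The list of key fragments is an assumption and would need to match your own naming.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitive lists key fragments whose values must never reach the log.
// The list is illustrative, not exhaustive.
var sensitive = []string{"password", "token", "secret"}

// redact returns a copy of logr-style key/value pairs with the values of
// sensitive keys replaced by a fixed placeholder.
func redact(kv ...any) []any {
	out := make([]any, len(kv))
	copy(out, kv)
	for i := 0; i+1 < len(out); i += 2 {
		key, ok := out[i].(string)
		if !ok {
			continue
		}
		lower := strings.ToLower(key)
		for _, s := range sensitive {
			if strings.Contains(lower, s) {
				out[i+1] = "[REDACTED]"
				break
			}
		}
	}
	return out
}

func main() {
	kv := redact("endpoint", "https://api.example.com", "apiToken", "abc123")
	fmt.Println(kv...)
}
```

With a logr logger this would be used as `log.Info("calling external API", redact("endpoint", url, "apiToken", tok)...)`.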

Distributed tracing with OpenTelemetry

When reconcile operates across multiple external systems (cloud APIs, databases, other controllers), you need tracing to know where latency originates. Instrument reconcile with OpenTelemetry and propagate context.

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var tracer = otel.Tracer("myoperator/controller")

func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Start a root span wrapping the whole reconcile.
	ctx, span := tracer.Start(ctx, "Reconcile",
		trace.WithAttributes(
			attribute.String("resource.namespace", req.Namespace),
			attribute.String("resource.name", req.Name),
		),
	)
	defer span.End()

	var obj appsv1alpha1.MyResource
	if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
		span.RecordError(err)
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Wrap external calls in child spans. Pass ctx through so the
	// parent-child relationship propagates.
	if err := r.fetchExternal(ctx, &obj); err != nil {
		span.SetStatus(codes.Error, "external fetch failed")
		span.RecordError(err)
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

func (r *MyResourceReconciler) fetchExternal(ctx context.Context, obj *appsv1alpha1.MyResource) error {
	// Starting a child span from the same ctx automatically links
	// it to the parent.
	ctx, span := tracer.Start(ctx, "fetchExternal")
	defer span.End()

	span.SetAttributes(attribute.String("external.id", obj.Spec.ID))
	return r.External.Call(ctx, obj.Spec.ID)
}

The essence of tracing is context propagation. If you pass ctx consistently to every call, all spans produced by one reconcile are bundled into a single trace. This lets you reach data-backed conclusions like "80% of p99 latency comes from external API calls".

A single reconcile trace

Reconcile [120ms] ──────────────────────────────────────────

├─ Get(MyResource) [3ms] ─

├─ fetchExternal [95ms] ────────────────────────────

│ └─ HTTP GET cloud-api [92ms] ──────────────────

└─ applyDesiredState [20ms] ───────

└─ Patch(Deployment) [18ms] ──────

Showing state to users with status and conditions

If events are transient occurrences, status.conditions is the persistent, declarative current state. The standard in the Kubernetes ecosystem is to keep an array of metav1.Condition in status. Users read these conditions from kubectl get output, and other controllers act on them.

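import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)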

func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {

var obj appsv1alpha1.MyResource

if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {

return ctrl.Result{}, client.IgnoreNotFound(err)

}

ready, reason, msg := r.evaluateReadiness(ctx, &obj)

// meta.SetStatusCondition updates a condition of the same Type

// or appends it if absent. It manages LastTransitionTime for you.

meta.SetStatusCondition(&obj.Status.Conditions, metav1.Condition{

Type: "Ready",

Status: conditionStatus(ready),

Reason: reason, // PascalCase, machine readable

Message: msg, // human-facing explanation

ObservedGeneration: obj.Generation, // which generation was evaluated

})

// Update the status subresource.

if err := r.Status().Update(ctx, &obj); err != nil {

return ctrl.Result{}, err

}

return ctrl.Result{}, nil

}

func conditionStatus(ok bool) metav1.ConditionStatus {

if ok {

return metav1.ConditionTrue

}

return metav1.ConditionFalse

}

ObservedGeneration is especially important. If a user changes the spec, raising generation, but conditions[].observedGeneration is smaller, the controller has not yet reflected the latest spec. This is the standard signal for judging "has my change been applied?"

status.conditions design principles

- Type is a noun-like capability/state: Ready, Available,

Progressing, Degraded. Following standard meanings improves

tool compatibility.

- Status is only True/False/Unknown.

- Reason is a PascalCase word. Message is a human sentence.

- Always fill ObservedGeneration. It is key to applied-or-not.

- Set conditions idempotently. Do not append every time.

Defining Operator SLOs

Now that we have gathered the signals, it is time to turn them into a "promise". An SLO (Service Level Objective) defines, in numbers, the level of trust the Operator must keep. Here are three SLIs (indicators) suited to Operators.

Three SLIs / example SLOs for Operators

┌────────────────────────┬─────────────────────────────────────────┬──────────────────────┐
│ SLI                    │ Definition                              │ SLO (example)        │
├────────────────────────┼─────────────────────────────────────────┼──────────────────────┤
│ reconcile success rate │ non-error reconciles / total reconciles │ 99.5% / 30d          │
├────────────────────────┼─────────────────────────────────────────┼──────────────────────┤
│ reconcile latency      │ p99 reconcile processing time           │ p99 < 2s (5m window) │
├────────────────────────┼─────────────────────────────────────────┼──────────────────────┤
│ freshness              │ time since last successful sync         │ 95% < 10m            │
└────────────────────────┴─────────────────────────────────────────┴──────────────────────┘

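The success-rate SLI over the 30-day window can be written in PromQL as below. The error budget is the gap between this value and the 99.5% target: once the result falls to 0.995, the budget is spent.

1 - (

sum(rate(controller_runtime_reconcile_errors_total{controller="myresource"}[30d]))

/

sum(rate(controller_runtime_reconcile_total{controller="myresource"}[30d]))

)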
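The arithmetic behind an error budget is worth spelling out once. The sketch below assumes a target strictly below 100% and uses invented counts; the function name `budgetRemaining` is hypothetical. With a 99.5% target, 0.5% of all reconciles in the window are allowed to fail, and the remaining budget is the unspent share of that allowance.

```go
package main

import "fmt"

// budgetRemaining returns the fraction of the error budget still unspent.
// slo is the target success ratio (for example 0.995, and below 1); total
// and failed are reconcile counts over the SLO window. A negative result
// means the budget is overspent.
func budgetRemaining(slo, total, failed float64) float64 {
	if total == 0 {
		return 1 // no reconciles, nothing spent
	}
	allowed := (1 - slo) * total // failures the SLO tolerates
	return 1 - failed/allowed
}

func main() {
	// 200,000 reconciles in 30 days at a 99.5% target tolerate about
	// 1,000 failures; 400 failures have used up 40% of that allowance.
	fmt.Printf("budget remaining: %.0f%%\n", 100*budgetRemaining(0.995, 200000, 400))
}
```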

Defining alerts with PrometheusRule

To alert as you approach an SLO violation, define a PrometheusRule.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myoperator-slo   # placeholder name
  namespace: myoperator-system
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: myoperator.slo
      rules:
        - alert: ReconcileErrorRateHigh
          expr: |
            sum(rate(controller_runtime_reconcile_errors_total{controller="myresource"}[5m]))
            /
            sum(rate(controller_runtime_reconcile_total{controller="myresource"}[5m]))
            > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "MyResource reconcile error rate exceeded 5%"
            description: "Reconcile error rate over the last 5 minutes has stayed above the 5% alert threshold."
        - alert: ReconcileLatencyHigh
          expr: |
            histogram_quantile(0.99,
              sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="myresource"}[5m])) by (le)
            ) > 2
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "MyResource reconcile p99 latency exceeded 2s"
        - alert: ResourceStaleness
          expr: |
            time() - max by (namespace, name) (
              myoperator_last_successful_sync_timestamp_seconds
            ) > 900
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A resource has not synced for over 15 minutes"
            description: "A specific MyResource is violating the SLO freshness target."
        - alert: WorkqueueBacklog
          expr: workqueue_depth{name="myresource"} > 50
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Workqueue backlog has built up"
            description: "The controller is not keeping up with incoming events."

In alert design, the for clause matters. Give it enough duration so alerts do not fire on transient spikes, but balance it so you do not learn about SLO violations too late.
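The effect of the for clause can be sketched as a small simulation. The function below is a simplified model of the pending-then-firing behavior, written for this article and not taken from Prometheus: the alert becomes pending at the first evaluation where the expression is true, fires once it has stayed true for the whole duration, and goes back to inactive on any false evaluation. The name `firesAt` and the sample series are illustrative.

```go
package main

import "fmt"

// firesAt returns the index of the first evaluation at which an alert with
// the given "for" duration would fire, or -1 if it never does. samples[i]
// tells whether the alert expression was true at evaluation i; evaluations
// are interval seconds apart.
func firesAt(samples []bool, interval, forSeconds int) int {
	pendingSince := -1 // index where the current run of true values began
	for i, active := range samples {
		if !active {
			pendingSince = -1 // any false evaluation resets the pending state
			continue
		}
		if pendingSince < 0 {
			pendingSince = i
		}
		if (i-pendingSince)*interval >= forSeconds {
			return i
		}
	}
	return -1
}

func main() {
	sustained := make([]bool, 12)
	for i := range sustained {
		sustained[i] = true
	}
	// Same length, but one healthy evaluation in the middle.
	flapping := append([]bool(nil), sustained...)
	flapping[5] = false

	fmt.Println("sustained breach fires at evaluation:", firesAt(sustained, 60, 600))
	fmt.Println("flapping breach fires at evaluation: ", firesAt(flapping, 60, 600))
}
```

The second case shows the cost of a long for duration: a breach that briefly recovers restarts the clock, so a flapping failure can stay below the alert indefinitely.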

Debugging workflow — "my reconcile isn't running"

The most common and frustrating situation in Operator operations is "I created the resource but nothing happens". This is the case where reconcile itself is not being called, and the cause is scattered across several places, making it tricky to trace. Here is a step-by-step diagnostic runbook.

Runbook: diagnosing "reconcile not triggered"

1) First confirm with metrics whether reconcile is running

- Is controller_runtime_reconcile_total{controller="..."}

increasing?

- Not increasing at all -> the controller is not receiving

events. Go to 2-6 below.

- Increasing but errors_total also rises -> reconcile is being

called but failing. Trace via logs/events (not this runbook).

2) Is the controller manager alive and the leader?

- Is the pod Running? Not crash-looping?

- With leader election: non-leader replicas do not run

reconcile. Check the leader lock Lease.

kubectl get lease -n myoperator-system

- Make sure you are reading the leader pod's logs (another

pod's logs may look "quiet").

3) Is information arriving in the cache (informer / RBAC)?

- Does the controller SA have RBAC to watch/list/get the

target resource? If not, the informer fails silently or

with a permission error.

kubectl auth can-i list myresources \

--as=system:serviceaccount:myoperator-system:controller-manager

- Check the logs for "failed to list" / "forbidden" errors.

4) Is a predicate filtering out the event?

- Does the predicate on WithEventFilter / Owns / Watches in

SetupWithManager pass this change?

- e.g., GenerationChangedPredicate ignores updates that only

change metadata/labels -> a status/annotation-only change

will not trigger reconcile.

- If suspicious, temporarily remove the predicate and retry.

5) Is the event source (watch) actually registered?

- Did you forget the target type in For()/Owns()/Watches()?

- When watching a different namespace / cluster-scoped

resource, is the manager limiting the cache by namespace

(check Cache.DefaultNamespaces)?

6) Did the object actually change?

- Compare generation and status.observedGeneration via

kubectl get myresource -o yaml.

- If resourceVersion is unchanged, the apiserver saw no

change -> there is no event at all.

Decision tree summary

reconcile_total not increasing?

├─ pod/leader issue -> 2)

├─ insufficient RBAC -> 3)

├─ predicate filtering -> 4)

└─ missing watch -> 5)

The key to this runbook is "use metrics (step 1) to first split on whether reconcile is being called". The problem where reconcile never runs and the problem where it runs but fails belong to entirely different causal domains. With well-designed observability, you can make that first branch from data rather than guesswork.
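Step 4 of the runbook is the one that most often surprises people, so here is the predicate logic reduced to its essence. The sketch uses a stand-in `objectMeta` type and a hypothetical `generationChanged` function that mirrors the idea behind GenerationChangedPredicate for update events. It assumes a CRD with the status subresource enabled, where status writes do not bump metadata.generation.

```go
package main

import "fmt"

// objectMeta is a stand-in for the metadata an update event carries.
type objectMeta struct {
	Generation      int64
	ResourceVersion string
}

// generationChanged lets an update through only if metadata.generation
// moved, which is the idea behind GenerationChangedPredicate.
func generationChanged(oldObj, newObj objectMeta) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	before := objectMeta{Generation: 3, ResourceVersion: "1001"}

	// A label, annotation or status-only update bumps resourceVersion
	// but not generation.
	labelEdit := objectMeta{Generation: 3, ResourceVersion: "1002"}
	// A spec edit bumps both.
	specEdit := objectMeta{Generation: 4, ResourceVersion: "1003"}

	fmt.Println("label edit triggers reconcile:", generationChanged(before, labelEdit))
	fmt.Println("spec edit triggers reconcile: ", generationChanged(before, specEdit))
}
```

This is why steps 4 and 6 of the runbook need to be read together: a changed resourceVersion proves the apiserver emitted an event, while an unchanged generation explains why a generation-based predicate dropped it.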

Operational runbook

Finally, here is an operational runbook to reference during normal operation and incidents.

Daily checks (from the dashboard, daily/weekly)

- Is reconcile success rate above the SLO (e.g., 99.5%)?

- Is reconcile p99 latency below target (e.g., 2s)?

- Does workqueue_depth converge near 0?

- Are freshness alerts quiet?

Post-deploy checks

- Does reconcile_errors_total not spike right after deploy?

- Is active_workers not pinned at max_concurrent_reconciles

(saturation signal)?

- Are there no new Errors in the new version's logs?

Incident response

- Error rate spike -> top candidate is rolling back the recent

deploy. Also check external dependency status concurrently.

- Latency spike -> use traces to pinpoint which span is slow.

Suspect external API rate limits / timeouts.

- Workqueue backlog -> raise MaxConcurrentReconciles, block

unnecessary events with predicates, lighten reconcile.

- Only a specific resource is stale -> kubectl describe that

object to check Warning events and conditions. Compare

observedGeneration.

Capacity planning

- Trend of reconcile_total rate as managed resources grow

- Monitor metric cardinality growth (especially name labels)

- Prometheus TSDB memory/disk usage

Closing

Observability is the final step that lifts an Operator from "code that works" to "a system you can operate". To recap the essentials:

- controller-runtime's default metrics (reconcile total/errors/duration, workqueue depth/adds/latency) are a powerful starting point you get for free. Understanding the meaning of each label comes first.

- Define domain-specific signals yourself with custom metrics, but watch out for gauge deletion and label cardinality.

- Events are context for users, logs are flow for developers, tracing is latency analysis, and status/conditions are declarative current state — design your signals split by audience.

- Define SLOs in numbers and wire alerts with PrometheusRule, and you can manage Operator reliability as a promise rather than a guess.

- For "reconcile isn't running", first split on whether it is called using metrics, then methodically check predicate / RBAC / watch / leader election to narrow it down quickly.

A good Operator works quietly, but good observability always distinguishes whether that quiet is "healthy silence" or "the silence of an outage".

References

- [Kubebuilder Book](https://book.kubebuilder.io/)

- [Kubebuilder Book — Metrics](https://book.kubebuilder.io/reference/metrics)

- [controller-runtime (pkg.go.dev)](https://pkg.go.dev/sigs.k8s.io/controller-runtime)

- [kubernetes-sigs/controller-runtime (GitHub)](https://github.com/kubernetes-sigs/controller-runtime)

- [kubernetes-sigs/kubebuilder (GitHub)](https://github.com/kubernetes-sigs/kubebuilder)

- [Operator pattern — Kubernetes Docs](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)

- [Prometheus Documentation](https://prometheus.io/docs/)

- [Grafana Documentation](https://grafana.com/docs/)

- [kube-state-metrics (GitHub)](https://github.com/kubernetes/kube-state-metrics)

- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)

- [go-logr/logr (GitHub)](https://github.com/go-logr/logr)

- [Kubernetes API Reference — Events](https://kubernetes.io/docs/reference/kubernetes-api/)