Building a Kubernetes Operator in Rust (kube-rs)

Introduction — Operational Knowledge as Code
The Operator Pattern — Three Pieces
The Reconcile Loop — The Heart of an Operator
kube-rs — Kubernetes from Rust
The CustomResource Derive — Defining a CRD as Code
The Reconcile Function — The Core Logic
Error Handling and Requeue — Exponential Backoff
Finalizers — A Safety Net for Cleanup
Assembling the Controller — The main Function
Why Rust Instead of Go/kubebuilder
Try It Yourself
Conclusion
References

Introduction — Operational Knowledge as Code

When you first learn Kubernetes, you deploy apps with primitives like Deployment, Service, and ConfigMap. But real operations demand more. Running a database properly means scheduling backups, failing over on incidents, migrating schemas, and expanding storage when it fills up. That procedural knowledge usually lives in someone's head or, at best, in a runbook.

An Operator is exactly that operational knowledge moved into code. Just as a human watches a cluster and repeatedly thinks "the state is A but it should be B, so I'll fix it this way," an operator does the same thing automatically as a controller program. CoreOS crystallized the concept in 2016, and today projects such as the Prometheus Operator and cert-manager, along with operators for many databases, follow the pattern.

This post walks through the principles of the Operator pattern and how to build a real operator with the Rust ecosystem's kube-rs. It also makes the case for why Rust is an appealing choice in this space.

The Operator Pattern — Three Pieces

An operator is conceptually made of three parts.

CRD (Custom Resource Definition): A schema that registers a new kind of resource with Kubernetes. Where the built-in kinds are Pod and Service, a CRD lets you add your own kinds (say, Database or Backup) to the API server. Once a CRD is registered, users can declare that resource in YAML and treat it like any other, running commands such as kubectl get database.
Custom Resource (CR): An actual instance shaped by the CRD's schema. It captures the user's desired state, for example "name is my-db, replicas is 3, storage is 10Gi."
Controller: A program that watches custom resources and drives the actual state toward the desired state. This driving process is the heart of it, and it is called reconciliation.

Underneath this structure is a mindset that runs through all of Kubernetes: declarative control. The user declares only "what they want," and the controller takes responsibility for "how to get there."

The Reconcile Loop — The Heart of an Operator

The core of a controller is the reconcile loop. Its operating principle is astonishingly simple.

  1. Read the desired state (spec)
  2. Observe the actual state
  3. Compute the difference
  4. Move the actual state one step toward the desired state
  5. Go back to 1 (or re-run after some interval)
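
The five steps above can be sketched with nothing but the standard library. This is a toy model, not kube-rs code: the World struct and both functions are hypothetical names, and "actual state" is reduced to a single replica count so the shape of the loop is easy to see.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified view of one managed object.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct World {
    pub desired_replicas: u32, // step 1: the desired state (spec)
    pub actual_replicas: u32,  // step 2: the observed state
}

/// One pass of the loop. Returns true if the pass had to change anything.
pub fn reconcile_once(world: &mut World) -> bool {
    // Step 3: compute the difference.
    match world.actual_replicas.cmp(&world.desired_replicas) {
        Ordering::Equal => false, // already converged, do nothing
        // Step 4: move the actual state one step toward the desired state.
        Ordering::Less => {
            world.actual_replicas += 1;
            true
        }
        Ordering::Greater => {
            world.actual_replicas -= 1;
            true
        }
    }
}

/// Step 5: repeat until a pass makes no change.
/// Returns the number of passes that changed something.
pub fn run_until_converged(world: &mut World) -> u32 {
    let mut passes = 0;
    while reconcile_once(world) {
        passes += 1;
    }
    passes
}
```

Starting from zero replicas with three desired, run_until_converged performs three changing passes and then stops; calling reconcile_once again afterwards changes nothing.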

Two principles matter here.

First, reconcile must be idempotent. Running it any number of times with the same input must yield the same result. Write the reconcile function not as "create this" but as "ensure this state holds." If the desired state already exists, do nothing; if something is missing, fill it in, and that's all.

Second, reconcile is level-triggered, not edge-triggered. It decides based on "what the state is right now" (the level), not "what event just happened" (the edge). Because of this, even if a few events are dropped, or the controller restarts, it converges correctly by looking only at the current state. An operator's robustness comes from exactly this property.
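
Both principles can be illustrated with a small standard-library sketch. The ensure function below is a hypothetical stand-in for a reconcile step: it never asks which events occurred, it only compares the desired set of child names with whatever exists right now, which is why missed events or a restart do not affect the outcome.

```rust
use std::collections::BTreeSet;

/// Make `actual` contain exactly the names in `desired`.
/// Returns how many objects had to be created or deleted.
pub fn ensure(desired: &BTreeSet<String>, actual: &mut BTreeSet<String>) -> usize {
    // Level-triggered: decide from the current state only.
    let missing: Vec<String> = desired.difference(&*actual).cloned().collect();
    let extra: Vec<String> = actual.difference(desired).cloned().collect();
    let changes = missing.len() + extra.len();

    for name in missing {
        actual.insert(name); // fill in what is missing
    }
    for name in &extra {
        actual.remove(name); // remove what should not exist
    }
    changes
}
```

With desired = {a, b} and actual = {b, c}, the first call makes two changes and leaves actual equal to desired. A second call with the same input returns 0, which is the idempotency property in miniature.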

kube-rs — Kubernetes from Rust

kube-rs is the de facto standard crate for building Kubernetes clients and controllers in Rust. It is organized into three main pieces.

kube::Client — the client that talks to the API server
kube::Api — type-safe access to a specific kind of resource (get, list, patch, and so on)
kube::runtime — the higher-level tooling you need to write controllers, such as Controller, watcher, and reflector

Dependencies look roughly like this. Check crates.io for the latest versions.

[dependencies]
kube = { version = "0.99", features = ["runtime", "derive", "client"] }
k8s-openapi = { version = "0.24", features = ["latest"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
schemars = "0.8"
tokio = { version = "1", features = ["full"] }
thiserror = "2"
futures = "0.3"
tracing = "0.1"
tracing-subscriber = "0.3"

k8s-openapi provides the built-in resource types like Pod and Deployment as Rust structs, and schemars is used to auto-generate the CRD schema (JSON Schema).

The CustomResource Derive — Defining a CRD as Code

The most elegant part of kube-rs is that you define a CRD as a Rust struct. Attaching the CustomResource derive macro produces a complete custom resource type and CRD definition from a single spec struct.

use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// The "desired state" of the application we manage
#[derive(CustomResource, Debug, Clone, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "example.com",
    version = "v1",
    kind = "WebApp",
    namespaced,
    status = "WebAppStatus",
    shortname = "wa"
)]
pub struct WebAppSpec {
    /// The container image to run
    pub image: String,
    /// The desired number of replicas
    pub replicas: i32,
}

/// The "observed state" that the controller fills in
#[derive(Debug, Clone, Default, Deserialize, Serialize, JsonSchema)]
pub struct WebAppStatus {
    pub available_replicas: i32,
    pub ready: bool,
}

That one macro does several things. It creates a type called WebApp (which wraps WebAppSpec in a spec field, alongside metadata and an optional status), and calling WebApp::crd() (with the kube::CustomResourceExt trait in scope) yields the CRD manifest you register with the cluster. Because the status attribute is set, the CRD also declares a status subresource, so when the controller updates status it does not conflict with the user's spec.

Dumping the generated CRD as YAML looks roughly like this (abridged: the real output also includes the status schema and the status subresource declaration).

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: webapps.example.com
spec:
  group: example.com
  names:
    kind: WebApp
    plural: webapps
    shortNames: ["wa"]
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image: { type: string }
                replicas: { type: integer }

Users can now declare a custom resource like this.

apiVersion: example.com/v1
kind: WebApp
metadata:
  name: hello
  namespace: default
spec:
  image: nginx:1.27
  replicas: 3

The Reconcile Function — The Core Logic

Now we write the reconcile function, the heart of the controller. The kube-rs Controller calls this function when a watched resource changes (or periodically). The signature is: "take one custom resource and a shared context, return when to be called next (an Action)."

use std::sync::Arc;
use std::time::Duration;
use k8s_openapi::api::apps::v1::Deployment;
use kube::api::{Api, Patch, PatchParams};
use kube::runtime::controller::Action;
use kube::{Client, ResourceExt};

pub struct Context {
    pub client: Client,
}

async fn reconcile(obj: Arc<WebApp>, ctx: Arc<Context>) -> Result<Action, Error> {
    let ns = obj.namespace().unwrap_or_default();
    let name = obj.name_any();
    let deployments: Api<Deployment> = Api::namespaced(ctx.client.clone(), &ns);

    // Build a Deployment that matches the desired state
    let desired = build_deployment(&obj)?;

    // Use server-side apply to "ensure this state holds" (idempotent)
    let pp = PatchParams::apply("webapp-operator").force();
    deployments
        .patch(&name, &pp, &Patch::Apply(&desired))
        .await?;

    tracing::info!(%ns, %name, "reconciled WebApp");

    // On success, re-check periodically after 5 minutes
    Ok(Action::requeue(Duration::from_secs(300)))
}

The key is that it uses Patch::Apply (server-side apply). It says not "create this" but "make the result equal to this manifest," so calling it any number of times is safe. This is the concrete implementation of the idempotency described earlier.

build_deployment is a pure function that constructs a standard Deployment struct from the spec's image and replicas. You just fill in the types provided by k8s-openapi.

Error Handling and Requeue — Exponential Backoff

Reconcile can fail. The API server might briefly stop responding, you might conflict with another controller, or a transient network error might occur. Rust's Result and kube-rs's Action handle this gracefully.

First, define an error type with thiserror.

#[derive(thiserror::Error, Debug)]
pub enum Error {
    #[error("Kube API error: {0}")]
    Kube(#[from] kube::Error),

    #[error("Missing object key")]
    MissingKey,
}

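Then write the error_policy that decides how to retry when reconcile fails. The version below retries after a fixed 10 seconds, which is the simplest starting point; real exponential backoff needs a per-object failure count, which error_policy is not given.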

fn error_policy(_obj: Arc<WebApp>, err: &Error, _ctx: Arc<Context>) -> Action {
    tracing::warn!("reconcile failed: {err}, retrying");
    // On failure, wait briefly and try again
    Action::requeue(Duration::from_secs(10))
}
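
To go from a fixed delay to exponential backoff, the delay has to grow with the number of consecutive failures. Here is a minimal sketch of just that calculation, using only the standard library. The backoff_delay helper is hypothetical and not part of kube-rs, and it assumes you track the attempt count per object yourself (for example in a map held by the shared Context).

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow and checked_mul keep very large attempt counts from overflowing;
    // any overflow simply falls back to the cap.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 10-second base and a 300-second cap, attempts 0, 1, 2, 3 wait 10, 20, 40, and 80 seconds, and every attempt from 5 onward waits the full 300 seconds. Inside error_policy the result would be passed to Action::requeue; the counter should be reset once a reconcile succeeds.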

Action gives you two choices, plus one common pattern built on the first.

Action::requeue(duration) — reconcile again after this duration (periodic re-check or retry).
Action::await_change() — wait until the resource actually changes (when there is nothing to do).
Returning a long requeue after success re-checks the state periodically even without events (drift detection).

When reconcile returns Err, the Controller automatically calls error_policy and retries after the returned interval. This combination absorbs failures as retries naturally.

Finalizers — A Safety Net for Cleanup

When a custom resource is deleted, you often need to clean up related external resources too (a cloud load balancer, an external DB, an object storage bucket, and so on). The problem is that once a resource is deleted, the controller can no longer see its spec. The mechanism that solves this is the finalizer.

A finalizer is a list of strings attached to an object's metadata. As long as that list is non-empty, Kubernetes does not actually delete the object; it only stamps a deletionTimestamp. In other words, the object enters a "deletion pending" state. Only once the controller finishes its cleanup and removes the finalizer does the object actually disappear. This buys you the time to run cleanup logic.
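
The rule described above is small enough to model in a few lines. The sketch below is an in-memory simulation of the API server's behavior, not a call into Kubernetes or kube-rs: Meta and its functions are hypothetical names, and a boolean stands in for the deletionTimestamp.

```rust
/// Hypothetical in-memory stand-in for an object's metadata.
#[derive(Debug, Default)]
pub struct Meta {
    pub finalizers: Vec<String>,
    pub deletion_requested: bool, // stands in for deletionTimestamp being set
}

/// The object disappears only when deletion was requested and no finalizer remains.
pub fn is_gone(meta: &Meta) -> bool {
    meta.deletion_requested && meta.finalizers.is_empty()
}

/// What a delete request does: mark the object, then report whether it is gone.
pub fn request_delete(meta: &mut Meta) -> bool {
    meta.deletion_requested = true;
    is_gone(meta)
}

/// What the controller does once cleanup has succeeded: drop its own finalizer.
pub fn remove_finalizer(meta: &mut Meta, name: &str) -> bool {
    meta.finalizers.retain(|f| f.as_str() != name);
    is_gone(meta)
}
```

An object carrying one finalizer survives request_delete in the "deletion pending" state and only goes away when remove_finalizer is called for that name. An object with no finalizers is gone as soon as deletion is requested.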

kube-rs wraps this pattern with a finalizer helper. It splits handling into two branches: apply (create/update) and cleanup (delete).

use kube::runtime::finalizer::{finalizer, Event as FinalizerEvent};

async fn reconcile(obj: Arc<WebApp>, ctx: Arc<Context>) -> Result<Action, Error> {
    let ns = obj.namespace().unwrap_or_default();
    let api: Api<WebApp> = Api::namespaced(ctx.client.clone(), &ns);

    finalizer(&api, "webapp.example.com/cleanup", obj, |event| async {
        match event {
            // On create or update: normal reconcile
            FinalizerEvent::Apply(app) => apply(app, ctx.clone()).await,
            // On delete: clean up external resources, then drop the finalizer
            FinalizerEvent::Cleanup(app) => cleanup(app, ctx.clone()).await,
        }
    })
    .await
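    // Caveat: this collapses every finalizer failure into MissingKey and drops
    // the cause; a dedicated variant wrapping the finalizer error would keep it.
    .map_err(|_| Error::MissingKey)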
}

Because this helper handles adding and removing the finalizer for you, you only fill in "what to do on apply" and "what to clean up on delete." The finalizer is removed and the object actually deleted only after the Cleanup branch completes successfully. If cleanup fails, the finalizer stays, the object stays, and no resource is leaked.

Assembling the Controller — The main Function

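Finally, we wire it all together with a Controller and run it. The Controller watches both its target (our WebApp) and the child resources the operator owns (here, Deployment). Even when a child resource changes (someone touches the Deployment), reconcile is triggered again. This is the source of self-healing. The link between child and parent is the owner reference: owns maps a changed Deployment back to its WebApp through metadata.ownerReferences, so build_deployment has to set one on the Deployment it builds.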

use futures::StreamExt;
use kube::runtime::watcher::Config;
use kube::runtime::Controller;

#[tokio::main]
async fn main() -> Result<(), kube::Error> {
    tracing_subscriber::fmt::init();
    let client = Client::try_default().await?;

    let webapps: Api<WebApp> = Api::all(client.clone());
    let deployments: Api<Deployment> = Api::all(client.clone());
    let ctx = Arc::new(Context { client });

    Controller::new(webapps, Config::default())
        .owns(deployments, Config::default())
        .run(reconcile, error_policy, ctx)
        .for_each(|res| async move {
            match res {
                Ok((obj, _action)) => tracing::info!("reconciled {:?}", obj.name),
                Err(e) => tracing::warn!("reconcile error: {e}"),
            }
        })
        .await;

    Ok(())
}

Internally, the Controller keeps a cache of the target resource via a watcher and reflector, and when a relevant event arrives it enqueues that object onto a work queue and calls reconcile. Even when many events arrive at once, reconciles for the same object are serialized and duplicates are merged, so you rarely have to worry about concurrency yourself.
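
The "duplicates are merged" behavior can be pictured with a small deduplicating queue. This is only a model of the idea using the standard library, not kube-rs's actual internals: WorkQueue is a hypothetical type, and keys are assumed to be "namespace/name" strings.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical queue of object keys; each key is waiting at most once.
#[derive(Debug, Default)]
pub struct WorkQueue {
    order: VecDeque<String>,
    pending: HashSet<String>,
}

impl WorkQueue {
    /// Enqueue a key. Returns false if it was already waiting (the event is merged).
    pub fn push(&mut self, key: &str) -> bool {
        if self.pending.insert(key.to_string()) {
            self.order.push_back(key.to_string());
            true
        } else {
            false
        }
    }

    /// Take the next key to reconcile, oldest first.
    pub fn pop(&mut self) -> Option<String> {
        let key = self.order.pop_front()?;
        self.pending.remove(&key);
        Some(key)
    }

    /// Number of distinct keys currently waiting.
    pub fn len(&self) -> usize {
        self.order.len()
    }
}
```

Pushing "default/hello" twice leaves a single entry, so a burst of events for one object results in one reconcile. Because reconcile is level-triggered, that single run sees the latest state and nothing is lost by merging.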

Why Rust Instead of Go/kubebuilder

Kubernetes itself is written in Go, and the mainstream operator frameworks (controller-runtime, kubebuilder, Operator SDK) are Go-based too. For ecosystem maturity and volume of examples, Go still dominates. Even so, there are reasons Rust is appealing.

Small resource footprint. An operator is a process that lives inside the cluster and runs continuously. A Rust binary has no garbage collector, so its memory usage is small and stable, and the container image can be shrunk to a few megabytes with static linking. If you deploy operators across hundreds of clusters, those savings add up. The absence of intermittent GC pauses (stop-the-world) also helps latency-sensitive controllers.
Memory safety and strong types. Reconcile logic is subtle code that juggles the state of multiple resources. Rust's ownership model and Result-based error handling catch null dereferences, data races, and unhandled errors at compile time to a large degree. That "if it compiles, it mostly works" feeling is especially valuable in operational code.
Modeling state with expressive types. Rust's enums and pattern matching are great for expressing a resource's state transitions (for example Pending, Provisioning, Ready, Failed) precisely in the type system. You can make invalid state combinations impossible to represent.
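
As a sketch of that last point, here is how the example phases might be modeled. The Phase enum and its methods are hypothetical and use only the standard library. The rule that a failed resource stays failed until something else resets it is an assumption of this sketch, not something the pattern requires.

```rust
/// Hypothetical lifecycle phases for a managed resource.
/// Data lives only on the variants where it makes sense: a Ready resource
/// cannot carry a failure reason, and a Failed one always has one.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Provisioning { ready: u32 },
    Ready,
    Failed { reason: String },
}

impl Phase {
    /// Advance the phase from a fresh observation of ready replicas.
    pub fn observe(self, desired: u32, ready: u32) -> Phase {
        match self {
            // Assumption: failure is sticky until reset elsewhere.
            Phase::Failed { reason } => Phase::Failed { reason },
            _ if ready >= desired => Phase::Ready,
            _ => Phase::Provisioning { ready },
        }
    }

    /// Human-readable summary; the match must cover every variant to compile.
    pub fn describe(&self) -> String {
        match self {
            Phase::Pending => "pending".to_string(),
            Phase::Provisioning { ready } => format!("provisioning ({ready} ready)"),
            Phase::Ready => "ready".to_string(),
            Phase::Failed { reason } => format!("failed: {reason}"),
        }
    }
}
```

If a new variant is added later, every exhaustive match such as describe stops compiling until it handles the new case, which is the compile-time safety net the paragraph above refers to.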

Of course, there are trade-offs. The learning curve is steep, compile times are long, and examples are not as plentiful as Go's. If your team already knows Go and needs to churn out operators quickly, kubebuilder is pragmatic. Conversely, if operator efficiency and robustness matter, especially in resource-constrained edge or large multi-cluster environments, Rust and kube-rs are a strong choice.

Try It Yourself

An operator ultimately runs on top of Kubernetes' networking, scheduling, and resource model. Getting a feel for that foundation makes it far clearer what an operator is reconciling. If you want to experiment with how pods, services, and network policies connect inside a cluster, you can explore it visually with this site's Kubernetes Network Lab. And since what an operator deploys is ultimately containers, if you are curious how containers implement isolation and resource limits, you can examine the underlying mechanics in the Container Lab.

Conclusion

An Operator is a controller that encodes operational knowledge as code, and its heart is an idempotent reconcile loop that repeatedly narrows the gap between desired and actual state. You register a new kind of resource with Kubernetes via a CRD, and a controller watches it to keep the world in the shape you asked for.

The Rust ecosystem's kube-rs supports this pattern remarkably smoothly. You define a CRD as a type with the CustomResource derive macro, handle resources type-safely with Api, assemble the watcher and work queue with Controller, express requeue and backoff with Action, and handle cleanup-on-delete safely with the finalizer helper.

The reason to choose Rust ultimately comes down to how well it fits the nature of the operator workload. A process that lives in the cluster and runs continuously wants a small footprint and predictable performance, and subtle state-reconciliation logic wants strong types and memory safety. If you want to build a robust operator, Rust and kube-rs are a combination worth taking seriously.

References

kube-rs official site: https://kube.rs/
kube-rs on GitHub: https://github.com/kube-rs/kube
Kubernetes — Operator pattern: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
Kubernetes — Custom Resources: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
Kubernetes — Finalizers: https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/
k8s-openapi crate: https://docs.rs/k8s-openapi/
This site's Kubernetes Network Lab: /tools/k8s-network-lab
This site's Container Lab: /tools/container-lab