Introduction

There is a wide gap between someone who has written an Operator once and someone who has deployed that Operator safely to dozens of clusters and upgraded it without downtime. Writing the reconcile logic of a controller is only the beginning. The real difficulty emerges when you try to answer questions like "Does this controller really behave as intended?", "Will deploying a new version break existing workloads?", and "Can a user install and upgrade it with a single click from OperatorHub?"

This article takes a deep look at the two pillars that guarantee Operator reliability: testing and distribution. On the testing side, we walk through the test pyramid that moves from unit reconcile tests to envtest-based integration tests and finally to e2e tests on kind. On the distribution side, we cover the OLM (Operator Lifecycle Manager) concepts of ClusterServiceVersion, bundle, and catalog, along with channels, upgrade graphs, and distribution through OperatorHub.

As of 2026, Kubebuilder supports Kubernetes 1.36 and Go 1.26, running on controller-runtime v0.24.x and controller-tools v0.21.x. The kube-rbac-proxy that was previously used to protect metrics has been removed, and the recommended approach is now to use the authentication and authorization filter provided directly by controller-runtime. The examples in this article are written against this latest stack.

The Test Pyramid

Operator testing is most efficient when structured as a variant of the traditional test pyramid adapted to the Kubernetes environment. The bottom is fast and cheap but less realistic, while the top is slow and expensive but much closer to the real production environment.

            /\              e2e (kind / real cluster)
           /  \             - few, slow, high cost
          /----\            - full flow, real kubelet, CNI
         /      \
        /        \          envtest integration
       /          \         - medium count, real apiserver+etcd
      /            \        - reconcile <-> API interaction
     /--------------\
    /                \      unit reconcile tests
   /                  \     - many, fast, low cost
  /                    \    - fake client, logic-level checks
 /______________________\

The roles of the three layers are clearly separated.

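| Layer | Environment | Test count | Speed and cost | What it verifies |
| --- | --- | --- | --- | --- |
| Unit reconcile tests | Fake client, no apiserver | Many | Fast, low cost | Logic-level checks on reconcile branches |
| envtest integration | Real apiserver + etcd, no kubelet | Medium | Moderate | Reconcile <-> API interaction |
| e2e | kind or a real cluster | Few | Slow, high cost | Full flow with a real kubelet and CNI |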

The core principle is "catch bugs at the lowest layer that can catch them." If a failure only reproduces in e2e, the cost is very high. Therefore you should push as much logic as possible down to the unit and envtest levels, and reserve e2e for integration scenarios that genuinely require a real cluster.

Setting Up envtest

envtest is a testing tool provided by controller-runtime that launches real kube-apiserver and etcd binaries locally and attaches the controller on top of them. The kubelet, scheduler, and kube-controller-manager are not run, so Pods are never actually scheduled or started, and built-in controllers such as the garbage collector never act. Every interaction with the API server, however, uses real API semantics. This means you can verify behaviors such as CRD validation, admission, watch events, and whether owner references are set correctly, while the actual cascading deletion of dependents has to be checked in e2e.

The envtest binaries are downloaded with the setup-envtest tool.

# Install envtest binaries for a specific Kubernetes version

go run sigs.k8s.io/controller-runtime/tools/setup-envtest@latest use 1.36.x

# Expose the install path as an environment variable

export KUBEBUILDER_ASSETS="$(go run sigs.k8s.io/controller-runtime/tools/setup-envtest@latest use 1.36.x -p path)"

When scaffolded with Kubebuilder, a `suite_test.go` is generated that builds the test suite on top of Ginkgo and Gomega.

package controller

import (

"context"

"path/filepath"

"testing"

. "github.com/onsi/ginkgo/v2"

. "github.com/onsi/gomega"

"k8s.io/client-go/kubernetes/scheme"

"k8s.io/client-go/rest"

ctrl "sigs.k8s.io/controller-runtime"

"sigs.k8s.io/controller-runtime/pkg/client"

"sigs.k8s.io/controller-runtime/pkg/envtest"

logf "sigs.k8s.io/controller-runtime/pkg/log"

"sigs.k8s.io/controller-runtime/pkg/log/zap"

webappv1 "example.com/guestbook-operator/api/v1"

)

var (

cfg *rest.Config

k8sClient client.Client

testEnv *envtest.Environment

ctx context.Context

cancel context.CancelFunc

)

func TestControllers(t *testing.T) {

RegisterFailHandler(Fail)

RunSpecs(t, "Controller Suite")

}

var _ = BeforeSuite(func() {

logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(true)))

ctx, cancel = context.WithCancel(context.TODO())

By("bootstrapping test environment")

testEnv = &envtest.Environment{

CRDDirectoryPaths: []string{filepath.Join("..", "..", "config", "crd", "bases")},

ErrorIfCRDPathMissing: true,

}

var err error

cfg, err = testEnv.Start()

Expect(err).NotTo(HaveOccurred())

Expect(cfg).NotTo(BeNil())

Expect(webappv1.AddToScheme(scheme.Scheme)).To(Succeed())

k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})

Expect(err).NotTo(HaveOccurred())

Expect(k8sClient).NotTo(BeNil())

mgr, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme.Scheme})

Expect(err).NotTo(HaveOccurred())

err = (&GuestbookReconciler{

Client: mgr.GetClient(),

Scheme: mgr.GetScheme(),

}).SetupWithManager(mgr)

Expect(err).NotTo(HaveOccurred())

go func() {

defer GinkgoRecover()

Expect(mgr.Start(ctx)).To(Succeed())

}()

})

var _ = AfterSuite(func() {

cancel()

By("tearing down the test environment")

Expect(testEnv.Stop()).To(Succeed())

})

The key thing this suite demonstrates is that `testEnv.Start()` brings up a real apiserver and etcd and returns their connection info via `cfg`. After starting the manager and registering the real controller, integration tests can be written by creating objects in an environment where the real reconcile loop is running and observing the results.

Below is an example integration test running on top of envtest. It verifies that creating a Guestbook resource causes the controller to create a Deployment with the requested replica count and an owner reference pointing back at the Guestbook.

var _ = Describe("Guestbook controller", func() {

const (

resourceName = "test-guestbook"

namespace = "default"

)

Context("When reconciling a Guestbook resource", func() {

It("should create a Deployment and mark it Ready", func() {

By("creating the Guestbook object")

gb := &webappv1.Guestbook{

ObjectMeta: metav1.ObjectMeta{

Name: resourceName,

Namespace: namespace,

},

Spec: webappv1.GuestbookSpec{

Replicas: 3,

Image: "nginx:1.27",

},

}

Expect(k8sClient.Create(ctx, gb)).To(Succeed())

By("checking that the controller creates a Deployment")

deployKey := types.NamespacedName{Name: resourceName, Namespace: namespace}

createdDeploy := &appsv1.Deployment{}

Eventually(func() error {

return k8sClient.Get(ctx, deployKey, createdDeploy)

}, time.Second*10, time.Millisecond*250).Should(Succeed())

Expect(*createdDeploy.Spec.Replicas).To(Equal(int32(3)))

By("verifying that the owner reference is set correctly")

Expect(createdDeploy.OwnerReferences).To(HaveLen(1))

Expect(createdDeploy.OwnerReferences[0].Kind).To(Equal("Guestbook"))

})

})

})

The `Eventually` block reflects the asynchronous nature of the controller. Immediately after creating the object, the Deployment may not exist yet, so the test polls repeatedly for a certain period until the condition is satisfied. This pattern appears almost everywhere in envtest integration tests.
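To make the polling behavior concrete, here is a minimal standard-library sketch of what such a retry loop does. It is not Gomega's implementation: `pollUntil`, `appearsAfter`, and `converges` are hypothetical names introduced only for this illustration, and the "Deployment" is simulated by a counter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil is a hypothetical helper (not part of Gomega). It calls check
// every interval until check returns nil or the timeout elapses.
func pollUntil(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// appearsAfter simulates an object that only becomes visible on the n-th poll,
// the way a Deployment shows up some time after the Guestbook is created.
func appearsAfter(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls < n {
			return errors.New("deployment not found yet")
		}
		return nil
	}
}

// converges wires the two helpers together using millisecond arguments.
func converges(attemptsNeeded, timeoutMillis int) bool {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	return pollUntil(timeout, 5*time.Millisecond, appearsAfter(attemptsNeeded)) == nil
}

func main() {
	fmt.Println("appears on 3rd poll:", converges(3, 1000))
	fmt.Println("never appears:", converges(1000000, 50))
}
```

The timeout bounds how long a broken controller can stall the suite, and the interval trades test latency against API load, which is the same trade-off the two duration arguments to `Eventually` express.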

Unit Reconcile Tests

Verifying every branch with envtest makes tests slow. Pure logic branches inside the reconcile function are better verified quickly with a fake client. The controller-runtime fake client emulates an in-memory object store, so it can exercise Get/Create/Update/List without launching an apiserver.

package controller

import (

"context"

"testing"

appsv1 "k8s.io/api/apps/v1"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

"k8s.io/apimachinery/pkg/types"

"k8s.io/client-go/kubernetes/scheme"

"sigs.k8s.io/controller-runtime/pkg/client/fake"

"sigs.k8s.io/controller-runtime/pkg/reconcile"

webappv1 "example.com/guestbook-operator/api/v1"

)

func TestReconcile_CreatesDeployment(t *testing.T) {

s := scheme.Scheme

_ = webappv1.AddToScheme(s)

_ = appsv1.AddToScheme(s)

gb := &webappv1.Guestbook{

ObjectMeta: metav1.ObjectMeta{Name: "gb", Namespace: "ns"},

Spec: webappv1.GuestbookSpec{Replicas: 2, Image: "nginx:1.27"},

}

cl := fake.NewClientBuilder().

WithScheme(s).

WithObjects(gb).

Build()

r := &GuestbookReconciler{Client: cl, Scheme: s}

_, err := r.Reconcile(context.Background(), reconcile.Request{

NamespacedName: types.NamespacedName{Name: "gb", Namespace: "ns"},

})

if err != nil {

t.Fatalf("reconcile failed: %v", err)

}

got := &appsv1.Deployment{}

if err := cl.Get(context.Background(),

types.NamespacedName{Name: "gb", Namespace: "ns"}, got); err != nil {

t.Fatalf("Deployment was not created: %v", err)

}

if *got.Spec.Replicas != 2 {

t.Errorf("expected replicas 2, got %d", *got.Spec.Replicas)

}

}

These unit tests run in milliseconds, so you can write dozens of them for edge cases (for example, an already-existing Deployment, a replica mismatch, or a deletion-in-progress state) at little cost. Keep in mind that the fake client does not run admission webhooks or CRD schema validation the way a real API server does, so behaviors such as CRD validation must be confirmed in envtest.
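One way to keep those edge-case tests cheap is to pull the branching decision out of `Reconcile` into a pure function and drive it with a table. The sketch below assumes such a refactoring. `decide` and the `action` values are hypothetical names, not part of controller-runtime or the Guestbook controller shown above.

```go
package main

import "fmt"

// action is what the reconciler should do next for the owned Deployment.
type action string

const (
	actionCreate action = "create"
	actionUpdate action = "update"
	actionNone   action = "none"
	actionSkip   action = "skip" // the owner is being deleted
)

// decide is a hypothetical pure helper. existing is nil when no Deployment
// was found; deleting is true when the owner has a deletion timestamp.
func decide(existing *int32, desired int32, deleting bool) action {
	switch {
	case deleting:
		return actionSkip
	case existing == nil:
		return actionCreate
	case *existing != desired:
		return actionUpdate
	default:
		return actionNone
	}
}

func int32Ptr(v int32) *int32 { return &v }

func main() {
	cases := []struct {
		name     string
		existing *int32
		desired  int32
		deleting bool
		want     action
	}{
		{"no deployment yet", nil, 2, false, actionCreate},
		{"replica mismatch", int32Ptr(1), 2, false, actionUpdate},
		{"already in sync", int32Ptr(2), 2, false, actionNone},
		{"deletion in progress", int32Ptr(2), 2, true, actionSkip},
	}
	for _, c := range cases {
		got := decide(c.existing, c.desired, c.deleting)
		fmt.Printf("%-22s got=%-6s ok=%t\n", c.name, got, got == c.want)
	}
}
```

Each new edge case becomes one more row in the table, and the fake-client tests can then focus on whether the chosen action is applied correctly.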

OLM Concepts: CSV, Bundle, Catalog

Once testing is done, it is time for distribution. OLM (Operator Lifecycle Manager) is the component that declaratively manages installation, upgrade, dependencies, and permissions of Operators in a cluster. To understand OLM you need to know four core concepts.

| Concept | Role |

| --- | --- |

| ClusterServiceVersion (CSV) | A manifest describing one version of the Operator. Includes deployment, RBAC, CRD ownership, install modes, and version metadata |

| bundle | A packaging unit bundling one CSV with its CRDs and metadata |

| bundle image | An OCI container image holding the bundle content |

| catalog | An image indexing many bundles and their channel info. OLM reads the list of installable Operators from here |

The CSV is the central document in the OLM world. Here is an example with only the key fields.

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: guestbook-operator.v0.2.0
  namespace: placeholder
spec:
  displayName: Guestbook Operator
  version: 0.2.0
  replaces: guestbook-operator.v0.1.0
  maturity: stable
  installModes:
    - type: OwnNamespace
      supported: true
    - type: SingleNamespace
      supported: true
    - type: MultiNamespace
      supported: false
    - type: AllNamespaces
      supported: true
  customresourcedefinitions:
    owned:
      - name: guestbooks.webapp.example.com
        version: v1
        kind: Guestbook
  install:
    strategy: deployment
    spec:
      deployments:
        - name: guestbook-controller-manager
          spec:
            replicas: 1

A bundle typically has the following directory structure.

bundle/
|-- manifests/
|   |-- guestbook-operator.clusterserviceversion.yaml
|   |-- webapp.example.com_guestbooks.yaml
|-- metadata/
|   |-- annotations.yaml
bundle.Dockerfile    (generated next to the bundle/ directory, in the project root)

Building this bundle into an OCI image and pushing it to a registry produces a bundle image, and indexing many bundle images together produces a catalog image. The operator-sdk and opm tools automate this process.

# Scaffold the CSV and generate bundle manifests
operator-sdk generate kustomize manifests
make bundle VERSION=0.2.0

# Build and push the bundle image
make bundle-build bundle-push BUNDLE_IMG=quay.io/example/guestbook-bundle:v0.2.0

# Build the catalog image (indexing the bundles).
# Note: opm index add produces the older SQLite-based index format, which
# recent opm releases deprecate in favor of file-based catalogs.
opm index add \
  --bundles quay.io/example/guestbook-bundle:v0.2.0 \
  --tag quay.io/example/guestbook-catalog:latest

Channels and the Upgrade Graph

In OLM, upgrades are expressed through channels and an upgrade graph. A channel is a release stream such as stable, alpha, or fast, and within each channel a graph defines how versions connect. There are three mechanisms for defining this graph.

| Field | Meaning |

| --- | --- |

| replaces | States which immediately prior version this version replaces. Forms a linear upgrade path |

| skips | A list of versions that may be skipped. Bypasses a defective intermediate version |

| skipRange | A semver range to skip many versions at once. For example, at least 0.1.0 and below 0.5.0 |

Here is a visualization of the stable channel upgrade graph.

stable channel upgrade graph

v0.1.0 <--replaces-- v0.2.0 <--replaces-- v0.3.0
                     skips: v0.2.1 (defective)

v0.4.0 <--replaces-- v0.5.0   (skipRange ">=0.1.0 <0.5.0")

On install: OLM picks the channel head (latest).
On upgrade: OLM either steps along the replaces chain
            or jumps straight to the head via skipRange.

skipRange is especially useful. If you only use a replaces chain, a v0.1.0 user must go through every intermediate version sequentially to reach v0.5.0, but with skipRange they can jump straight to the latest version. That said, every intermediate migration must remain compatible, so it must be designed carefully.

# Specify skipRange via a CSV annotation
metadata:
  annotations:
    olm.skipRange: '>=0.1.0 <0.5.0'
spec:
  replaces: guestbook-operator.v0.4.0
  skips:
    - guestbook-operator.v0.2.1
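The three mechanisms combine into a single question per catalog entry: can an installed version move directly to this entry? The following sketch models that check under simplifying assumptions. It is not OLM's resolver: `entry`, `canUpgradeFrom`, and the helpers are hypothetical, versions are plain MAJOR.MINOR.PATCH triples, and only the two-clause range form `>=LOW <HIGH` is understood.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a simplified MAJOR.MINOR.PATCH triple (no pre-release tags).
type version [3]int

func parseVersion(s string) (version, error) {
	var v version
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("bad component %q in %q", p, s)
		}
		v[i] = n
	}
	return v, nil
}

// less reports whether a sorts strictly before b.
func less(a, b version) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// inRange supports only the two-clause form ">=LOW <HIGH" (an assumption of
// this sketch); any other range syntax is treated as "not in range".
func inRange(v version, rng string) bool {
	f := strings.Fields(rng)
	if len(f) != 2 || !strings.HasPrefix(f[0], ">=") || !strings.HasPrefix(f[1], "<") {
		return false
	}
	low, errLow := parseVersion(strings.TrimPrefix(f[0], ">="))
	high, errHigh := parseVersion(strings.TrimPrefix(f[1], "<"))
	if errLow != nil || errHigh != nil {
		return false
	}
	return !less(v, low) && less(v, high)
}

// entry is a hypothetical, reduced view of one CSV in a channel.
type entry struct {
	name      string
	replaces  string
	skips     []string
	skipRange string
}

// canUpgradeFrom reports whether installed (a CSV name such as
// "guestbook-operator.v0.1.0") may move directly to e.
func canUpgradeFrom(e entry, installed string) bool {
	if installed == e.replaces {
		return true
	}
	for _, s := range e.skips {
		if s == installed {
			return true
		}
	}
	i := strings.LastIndex(installed, ".v")
	if e.skipRange == "" || i < 0 {
		return false
	}
	v, err := parseVersion(installed[i+2:])
	return err == nil && inRange(v, e.skipRange)
}

// head mirrors the CSV fragment from the article.
var head = entry{
	name:      "guestbook-operator.v0.5.0",
	replaces:  "guestbook-operator.v0.4.0",
	skips:     []string{"guestbook-operator.v0.2.1"},
	skipRange: ">=0.1.0 <0.5.0",
}

func main() {
	for _, v := range []string{"v0.0.9", "v0.1.0", "v0.2.1", "v0.4.0", "v0.5.0"} {
		installed := "guestbook-operator." + v
		fmt.Printf("%s -> %s: %t\n", installed, head.name, canUpgradeFrom(head, installed))
	}
}
```

A check like this can run in CI against every previously published version, so that a typo in `replaces` or an off-by-one range bound is caught before the bundle is pushed.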

OperatorHub Distribution

OperatorHub.io is a public catalog of community Operators. Registering your Operator there makes it discoverable and installable on any cluster that has OLM installed. The registration process is to submit the bundle manifests as a PR to the community-operators repository.

OperatorHub distribution flow

developer --PR--> community-operators repo
                    |  CI: bundle validation, CSV lint,
                    |  install-mode checks, upgrade path checks
                    v
                  merge --> reflected in catalog image
                    |
                    v
          OLM on user cluster --> shown in OperatorHub UI
                    |
                    v
          installed by creating a Subscription

From the user's perspective, a single Subscription resource declares installation and automatic upgrades.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: guestbook-operator
  namespace: operators
spec:
  name: guestbook-operator
  channel: stable
  source: operatorhubio-catalog
  sourceNamespace: olm
  installPlanApproval: Automatic

Setting `installPlanApproval` to Manual holds the upgrade until an administrator explicitly approves it, even when a new version appears in the channel. Manual is often preferred in production.

Upgrade Safety

Upgrades are the most dangerous moment in Operator operations. If a new controller version reconciles existing CRD objects incorrectly, or if the CRD schema changes incompatibly, running workloads can be damaged. The principles for safe upgrades are as follows.

- Always change CRD schemas in a backward-compatible way. Adding fields is safe, but removing fields or changing types must be handled through a new API version and a conversion webhook.

- When serving multiple API versions simultaneously, use a conversion webhook to guarantee conversion between the stored version and the served versions.

- Keep an e2e test for each upgrade path. That is, create objects with the previous version, swap in the new version controller, and verify that the objects still reconcile correctly.

- When using skipRange, verify that the migrations of all skipped intermediate versions are cumulatively compatible.

// Conversion webhook: example of v1beta1 <-> v1 conversion

func (src *GuestbookV1Beta1) ConvertTo(dstRaw conversion.Hub) error {

dst := dstRaw.(*GuestbookV1)

dst.ObjectMeta = src.ObjectMeta

dst.Spec.Replicas = src.Spec.Count // absorb a renamed field

dst.Spec.Image = src.Spec.Image

return nil

}

func (dst *GuestbookV1Beta1) ConvertFrom(srcRaw conversion.Hub) error {

src := srcRaw.(*GuestbookV1)

dst.ObjectMeta = src.ObjectMeta

dst.Spec.Count = src.Spec.Replicas

dst.Spec.Image = src.Spec.Image

return nil

}
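A cheap safeguard for conversions like this is a round-trip test: convert to the hub version and back, then check that nothing was lost. The sketch below uses simplified stand-in structs rather than the real API types, so `guestbookV1Beta1Spec`, `guestbookV1Spec`, `toHub`, `fromHub`, and `roundTrips` are all hypothetical names.

```go
package main

import "fmt"

// Simplified stand-ins for the two served versions. Only the spec fields
// touched by the conversion above are modeled.
type guestbookV1Beta1Spec struct {
	Count int32
	Image string
}

type guestbookV1Spec struct {
	Replicas int32
	Image    string
}

// toHub mirrors ConvertTo: the old Count field becomes Replicas.
func toHub(src guestbookV1Beta1Spec) guestbookV1Spec {
	return guestbookV1Spec{Replicas: src.Count, Image: src.Image}
}

// fromHub mirrors ConvertFrom: Replicas is mapped back to Count.
func fromHub(src guestbookV1Spec) guestbookV1Beta1Spec {
	return guestbookV1Beta1Spec{Count: src.Replicas, Image: src.Image}
}

// roundTrips reports whether spoke -> hub -> spoke preserves the spec.
func roundTrips(in guestbookV1Beta1Spec) bool {
	return fromHub(toHub(in)) == in
}

func main() {
	samples := []guestbookV1Beta1Spec{
		{},
		{Count: 1, Image: "nginx:1.27"},
		{Count: 3, Image: "nginx:1.27"},
	}
	for _, s := range samples {
		fmt.Printf("%+v round-trips: %t\n", s, roundTrips(s))
	}
}
```

If a later version adds a field that exists only in the hub, this test starts failing for the reverse direction, which is the signal to carry the extra data in an annotation or to stop serving the old version.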

Least-Privilege RBAC

Operators tend to request broad permissions across the entire cluster. From a security standpoint, however, a controller should only have the verbs it actually needs on the resources it actually needs. In Kubebuilder you declare permissions with RBAC marker comments, and controller-tools reads them to generate Role and ClusterRole manifests.

// Declare RBAC markers right above the reconcile function

//+kubebuilder:rbac:groups=webapp.example.com,resources=guestbooks,verbs=get;list;watch;create;update;patch;delete

//+kubebuilder:rbac:groups=webapp.example.com,resources=guestbooks/status,verbs=get;update;patch

//+kubebuilder:rbac:groups=webapp.example.com,resources=guestbooks/finalizers,verbs=update

//+kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete

func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {

// reconcile logic

return ctrl.Result{}, nil

}

Running `make manifests` against these markers generates a ClusterRole like the following.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
  - apiGroups: ['webapp.example.com']
    resources: ['guestbooks']
    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
  - apiGroups: ['webapp.example.com']
    resources: ['guestbooks/finalizers']
    verbs: ['update']
  - apiGroups: ['webapp.example.com']
    resources: ['guestbooks/status']
    verbs: ['get', 'update', 'patch']
  - apiGroups: ['apps']
    resources: ['deployments']
    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
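Generated rules are easy to audit mechanically. As a rough sketch, assuming the rules have already been parsed into a simplified struct, a small lint can flag any rule that uses a wildcard. `policyRule` and `wildcardRules` are hypothetical names for this illustration and only mirror three fields of a real RBAC rule.

```go
package main

import "fmt"

// policyRule mirrors only the three RBAC rule fields used in this sketch.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// hasWildcard reports whether any value is the RBAC wildcard "*".
func hasWildcard(values []string) bool {
	for _, v := range values {
		if v == "*" {
			return true
		}
	}
	return false
}

// wildcardRules returns the indexes of rules that use "*" in their
// API groups, resources, or verbs.
func wildcardRules(rules []policyRule) []int {
	var flagged []int
	for i, r := range rules {
		if hasWildcard(r.APIGroups) || hasWildcard(r.Resources) || hasWildcard(r.Verbs) {
			flagged = append(flagged, i)
		}
	}
	return flagged
}

func main() {
	rules := []policyRule{
		{
			APIGroups: []string{"webapp.example.com"},
			Resources: []string{"guestbooks"},
			Verbs:     []string{"get", "list", "watch"},
		},
		{
			// An overly broad rule that the lint should catch.
			APIGroups: []string{"apps"},
			Resources: []string{"*"},
			Verbs:     []string{"get"},
		},
	}
	fmt.Println("rules with wildcards:", wildcardRules(rules))
}
```

Wiring a check of this kind into CI turns least privilege from a review convention into a failing build.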

Practical guidelines for least-privilege design are as follows.

- Avoid wildcard verbs and wildcard resources. List only the verbs you explicitly need.

- If a namespace scope is sufficient, use a Role instead of a ClusterRole.

- Metrics endpoint protection was previously handled with a kube-rbac-proxy sidecar, but as of 2026 that sidecar has been removed, and you protect it directly with controller-runtime's `WithAuthenticationAndAuthorization` filter.

// Protect the metrics server with the authn/authz filter

mgr, err := ctrl.NewManager(cfg, ctrl.Options{

Scheme: scheme,

Metrics: metricsserver.Options{

BindAddress: ":8443",

SecureServing: true,

FilterProvider: filters.WithAuthenticationAndAuthorization,

},

})

With this, the metrics endpoint is authenticated and authorized through TokenReview and SubjectAccessReview without a separate sidecar container. As a result the Pod structure becomes simpler and there are fewer components to manage.
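Conceptually, the filter is an HTTP middleware that authenticates the bearer token first and authorizes the caller second. The standard-library sketch below illustrates that order of checks only. It is not controller-runtime's implementation: `authenticator` and `authorizer` are hypothetical stand-ins for the TokenReview and SubjectAccessReview calls, and the tokens, user names, and metric line are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// authenticator and authorizer are hypothetical stand-ins for the
// TokenReview and SubjectAccessReview round trips to the API server.
type authenticator func(token string) (user string, ok bool)
type authorizer func(user, path string) bool

// withAuthnAuthz rejects requests without a valid bearer token (401) and
// requests from users who may not read the path (403).
func withAuthnAuthz(next http.Handler, authn authenticator, authz authorizer) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token, found := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !found || token == "" {
			http.Error(w, "missing bearer token", http.StatusUnauthorized)
			return
		}
		user, ok := authn(token)
		if !ok {
			http.Error(w, "invalid token", http.StatusUnauthorized)
			return
		}
		if !authz(user, r.URL.Path) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one GET /metrics request with the given token through the
// middleware and returns the HTTP status code.
func statusFor(token string) int {
	metrics := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "reconcile_total 42") // placeholder metric line
	})
	authn := func(token string) (string, bool) {
		users := map[string]string{"prom-token": "prometheus", "guest-token": "guest"}
		user, ok := users[token]
		return user, ok
	}
	authz := func(user, path string) bool {
		return user == "prometheus" && path == "/metrics"
	}
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	rec := httptest.NewRecorder()
	withAuthnAuthz(metrics, authn, authz).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	for _, token := range []string{"", "bogus", "guest-token", "prom-token"} {
		fmt.Printf("token=%q -> HTTP %d\n", token, statusFor(token))
	}
}
```

In a real cluster the scraper's ServiceAccount additionally needs RBAC permission to read the metrics path, otherwise the authorization step denies it.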

Multi-Namespace / Multi-Tenant Operation

The scope in which an Operator operates is determined by OLM install modes. The `installModes` field of the CSV declares which modes are supported, and the user selects one through the OperatorGroup (its target namespaces) in the namespace where the Subscription is created.

| Install mode | Watch scope | Use case |

| --- | --- | --- |

| OwnNamespace | Only the namespace where the Operator is installed | Single team, most isolated |

| SingleNamespace | A single specified namespace | Separate the Operator from its workloads |

| MultiNamespace | Several explicitly named namespaces | Selectively manage only some tenants |

| AllNamespaces | The entire cluster | Platform-level shared Operator |

In a multi-tenant environment this choice is directly tied to the security boundary. AllNamespaces mode is convenient, but because the controller holds cluster-wide permissions, the reconcile logic must strictly enforce namespace isolation so that one tenant's CR cannot affect another tenant.

Watch scope by install mode

OwnNamespace     SingleNamespace     AllNamespaces
+----------+     +----------+        +-------------------+
| [op][cr] |     |   [op]   |        | +----++----++---+ |
+----------+     +----+-----+        | |ns-a||ns-b||...| |
                      | watch        | +----++----++---+ |
                      v              +---------+---------+
                 +----------+                  |
                 |   ns-b   |           op watches all
                 |   [cr]   |
                 +----------+

In the controller code it is best to limit the manager's cache scope according to the install mode. For example, in SingleNamespace mode, configure the manager to cache only that namespace.

mgr, err := ctrl.NewManager(cfg, ctrl.Options{

Scheme: scheme,

Cache: cache.Options{

DefaultNamespaces: map[string]cache.Config{

"team-a-workloads": {},

},

},

})

With this, the controller does not unnecessarily watch the entire cluster, which reduces memory usage and API load and makes the permission boundary clearer.
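Hard-coding the namespace works for an example, but the watch scope usually comes from the environment. The sketch below assumes the common convention of a comma-separated `WATCH_NAMESPACE` variable in which an empty value means "all namespaces". `parseWatchNamespaces` is a hypothetical helper, and its result would still have to be copied into the `DefaultNamespaces` map by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// parseWatchNamespaces splits a comma-separated namespace list, trimming
// whitespace and dropping empty and duplicate entries. A nil result means
// "no restriction", i.e. watch the whole cluster.
func parseWatchNamespaces(raw string) []string {
	var namespaces []string
	seen := map[string]bool{}
	for _, ns := range strings.Split(raw, ",") {
		ns = strings.TrimSpace(ns)
		if ns == "" || seen[ns] {
			continue
		}
		seen[ns] = true
		namespaces = append(namespaces, ns)
	}
	return namespaces
}

func main() {
	// In a real manager the input would come from os.Getenv("WATCH_NAMESPACE").
	for _, raw := range []string{"", "team-a-workloads", "team-a, team-b,team-a"} {
		got := parseWatchNamespaces(raw)
		if len(got) == 0 {
			fmt.Printf("%q -> all namespaces\n", raw)
			continue
		}
		fmt.Printf("%q -> %v\n", raw, got)
	}
}
```

Keeping the parsing separate from manager construction makes the scope decision unit-testable without starting a manager.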

Production Checklist

Here are the items to check just before deployment.

- Tests: Do the unit reconcile tests cover all major branches? Do the envtest integration tests verify CRD validation and owner references? Does e2e confirm actual workload startup?

- Upgrades: Does the e2e upgrade test from the previous version to the new version pass? Does the conversion webhook handle all stored versions? Are all migrations within the skipRange range compatible?

- RBAC: Are there no wildcard permissions? Is a ClusterRole truly necessary, or is a Role sufficient? Are metrics protected by the authn/authz filter?

- OLM: Do the CSV's replaces/skips/skipRange form the intended upgrade graph? Do the installModes match the actual supported scope?

- Observability: Does the controller expose metrics like reconcile error rate, queue depth, and reconcile latency? Does it record appropriate events and conditions in the status?

- Resilience: Is leader election enabled? Does the controller converge to a consistent state even after restarts (idempotency)?

- Resources: Does the controller Pod have appropriate requests/limits? Does cache memory stay bounded on large clusters?
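The idempotency item in the checklist above can be stated as a testable property: running reconcile a second time against an already converged state performs no writes. The following sketch illustrates the property with an in-memory map standing in for the cluster. `cluster` and `reconcileOnce` are hypothetical names, not part of any Operator library.

```go
package main

import "fmt"

// cluster is an in-memory stand-in for observed state: name -> replicas.
type cluster map[string]int32

// reconcileOnce drives one object toward the desired replica count and
// reports whether it had to write anything.
func reconcileOnce(c cluster, name string, desired int32) bool {
	current, exists := c[name]
	if exists && current == desired {
		return false // already converged, nothing to do
	}
	c[name] = desired
	return true
}

func main() {
	c := cluster{}
	fmt.Println("first pass wrote:", reconcileOnce(c, "gb", 3))
	// Simulates a controller restart: the same request is processed again.
	fmt.Println("second pass wrote:", reconcileOnce(c, "gb", 3))
	fmt.Println("final replicas:", c["gb"])
}
```

The same assertion carries over to a fake-client test: call `Reconcile` twice and check that the second call leaves the stored objects unchanged.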

Conclusion

The value of an Operator does not lie merely in writing a working reconcile loop. It lies in owning the entire lifecycle: verifying that controller reliably, and packaging it so that users can install and upgrade it safely. The test pyramid lets you catch bugs at the fastest and cheapest layer possible, while OLM and bundle packaging make deployment and upgrades declarative and reproducible.

In particular, the 2026 stack has become simpler than before. With the kube-rbac-proxy sidecar gone, metrics protection has been integrated into controller-runtime, and the envtest and operator-sdk toolchains have matured, making it easier to automate testing and bundling. Treat the principles covered in this article as a checklist, and may your Operator grow beyond the lab into a component trusted in production.

References

- Kubebuilder Book: https://kubebuilder.io/

- Operator SDK Documentation: https://sdk.operatorframework.io/

- controller-runtime (Go reference): https://pkg.go.dev/sigs.k8s.io/controller-runtime

- Kubernetes Operator pattern: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

- Kubebuilder source: https://github.com/kubernetes-sigs/kubebuilder

- controller-runtime source: https://github.com/kubernetes-sigs/controller-runtime

- Operator Lifecycle Manager: https://olm.operatorframework.io/

- OperatorHub.io: https://operatorhub.io/