Cluster API Deep Dive — Managing Clusters as Kubernetes Resources
Author: Youngju Kim (@fjvbn20031)

- Introduction — The Single-Cluster Era Is Over
- Why Existing Approaches Hit Their Limits
- The Core Philosophy of Cluster API — Kubernetes Managing Kubernetes
- Architecture Dissection — Management Cluster vs Workload Cluster
- The Provider Model — The Substance of the Abstraction
- Hands-On — Running a Cluster Factory on Your Laptop with CAPD
- Operations Deep Dive — Upgrades, Scaling, Auto-Remediation
- Combining with GitOps — Governing a Cluster Fleet from Git
- Pivot, Backup/DR, Version Skew — Operating the Management Cluster Itself
- Limits and Reality — What to Know Before Adopting
- Troubleshooting — The Order to Look When a Machine Is Stuck
- Adoption Checklist
- Closing Thoughts
- References
Introduction — The Single-Cluster Era Is Over
In the early days of Kubernetes adoption, an organization ran one cluster, maybe two or three. The situation today is completely different. Clusters now multiply along several axes at once: separation by environment (dev/stage/prod), by region, and by regulatory domain (even more so in network-segregated financial environments), plus dedicated clusters per tenant and edge clusters deployed into stores, factories, and vehicles. Dozens is the baseline; telcos and retail-edge operators run hundreds to thousands of clusters.
The moment you cross ten clusters, the following questions become daily reality.
- How many days does it take to create one new cluster? Is the procedure documented, or does it live only in one engineer's head?
- What Kubernetes version is cluster number 37 running? Can you see the version distribution of the whole fleet on one screen?
- When a control plane node dies, who recovers it, when, and how?
- How many weeks does it take to move the entire fleet from v1.31 to v1.32?
If you cannot answer these questions with confidence, cluster lifecycle management is a bottleneck for your organization. Cluster API (CAPI for short), the subject of this article, is the Kubernetes community's official answer to this problem. The core idea is simple yet radical: declare the cluster itself as a Kubernetes resource, and let controllers make that declaration real.
Why Existing Approaches Hit Their Limits
Console Clicking (ClickOps)
Creating a cluster by clicking buttons in a cloud console is the fastest way to make one. But it is not reproducible. To build an identical cluster three months later, you depend on screenshots and memory, and audit trails are hard to produce. Configuration drift between clusters is inevitable, and nobody can explain why "this one cluster is different."
Terraform / OpenTofu
As infrastructure-as-code, Terraform is an excellent tool, but it has structural weaknesses for cluster lifecycle.
- It only acts at execution time. Once terraform apply finishes, state observation ends too. If someone deletes a node group in the console, nobody notices until the next apply. There is no continuous reconcile.
- The state file is a single point of failure. State locking, backend management, and recovering from corrupted state are operational burdens in themselves.
- Node-level lifecycle is hard to express. A declaration like "replace this node if it is NotReady for 10 minutes" does not fit Terraform's language.
- At the scale of hundreds of clusters, workspace/module splitting, plan duration, and drift detection all become painful.
kubeadm Scripts
kubeadm is the de facto standard low-level tool for cluster bootstrapping, but it is strictly a tool for "putting one node into a cluster." VM provisioning, LB setup, certificate renewal, node replacement, and upgrade sequencing are all left to your shell scripts. Over time, those scripts become secret rituals nobody dares to touch.
The common defect of all three approaches is the same: there is no agent continuously converging declaration and reality. The exact problem Kubernetes solved for pods remained unsolved for clusters themselves.
The Core Philosophy of Cluster API — Kubernetes Managing Kubernetes
Cluster API is an official subproject of Kubernetes SIG Cluster Lifecycle. Its philosophy can be summarized in three sentences.
- Declarative API. Clusters, machines, and control planes are all defined as custom resources, backed by CRDs (CustomResourceDefinitions). Declare "3 workers, version v1.32.2" and you are done.
- Controller reconcile loops. Just as the Deployment controller maintains pod counts, CAPI controllers maintain machine counts and versions. When reality deviates from the declaration (node failure, manual deletion), they automatically converge it back.
- Provider abstraction. Whether AWS, vSphere, or bare metal, infrastructure differences are hidden behind provider plugins. The user experience remains the same kubectl and YAML.
The analogy goes like this: pods have Deployments; clusters have Cluster API. Just as a ReplicaSet stamps out pods, a MachineSet stamps out machines (node VMs); just as a Deployment rolls pods, a MachineDeployment rolls nodes. It lifts the mental model Kubernetes operators already know up to the cluster level.
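The reconcile idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not CAPI's controller code (the real controllers are written in Go against the Kubernetes API): the function `reconcile` and the `machine-N` naming scheme are hypothetical and exist only for illustration.

```python
from itertools import count


def reconcile(desired_replicas, machines, name_seq):
    """One pass of a toy MachineSet-style reconcile.

    machines: set of machine names that currently exist (mutated in place).
    Returns the list of (action, name) steps taken to match the declaration.
    """
    actions = []
    # Too few machines: create until the declared count is reached.
    while len(machines) < desired_replicas:
        name = f"machine-{next(name_seq)}"
        machines.add(name)
        actions.append(("create", name))
    # Too many machines: delete the surplus.
    while len(machines) > desired_replicas:
        name = max(machines)
        machines.remove(name)
        actions.append(("delete", name))
    return actions


seq = count(1)
fleet = set()
print(reconcile(3, fleet, seq))   # first pass creates three machines
fleet.discard("machine-2")        # simulate a manual deletion or node loss
print(reconcile(3, fleet, seq))   # next pass creates one replacement
print(reconcile(3, fleet, seq))   # already converged, so no actions
```

The point of the sketch is the shape of the loop: the function never asks what happened, only how reality differs from the declaration, which is why a manual deletion and a node failure are handled the same way.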
Another crucial principle is immutable infrastructure. CAPI does not "fix" machines. When configuration or version changes, it creates new machines and discards old ones. SSHing into a live node to upgrade packages is an act outside the model.
Architecture Dissection — Management Cluster vs Workload Cluster
There are two kinds of clusters in the CAPI world.
- Management cluster: the cluster where CAPI controllers and CRDs are installed. The declarations (YAML) of other clusters are stored and reconciled here.
- Workload cluster: the target cluster produced by the management cluster. Real applications run here. The workload cluster itself is unaware that CAPI exists.
+----------------------------------------------------------------------+
|                          Management Cluster                          |
|                                                                      |
|  +----------------+  +----------------+  +------------------------+  |
|  | CAPI Core      |  | Bootstrap      |  | Control Plane          |  |
|  | Controller     |  | Provider       |  | Provider               |  |
|  | (Cluster,      |  | (CABPK:        |  | (KCP: KubeadmControl-  |  |
|  |  Machine, MS,  |  |  KubeadmConfig)|  |  Plane controller)     |  |
|  |  MD, MHC)      |  +----------------+  +------------------------+  |
|  +----------------+                                                  |
|  +-------------------------------+                                   |
|  | Infrastructure Provider       |   Declarations stored in etcd:    |
|  | (CAPA/CAPZ/CAPV/CAPD/Metal3)  |   Cluster, MachineDeployment,     |
|  +-------------------------------+   KubeadmControlPlane ...         |
+----------------------------------------------------------------------+
          |                         |                         |
          |  provision / reconcile  |                         |
          v                         v                         v
  +----------------+        +----------------+        +----------------+
  | Workload       |        | Workload       |        | Workload       |
  | Cluster A      |        | Cluster B      |        | Cluster C      |
  | (prod-seoul)   |        | (prod-tokyo)   |        | (edge-store-7) |
  +----------------+        +----------------+        +----------------+
Core CRD Relationship Map
CAPI CRDs are cleanly divided by role. Let us look at the relationships as a diagram first.
+-----------+
| Cluster | umbrella resource for the whole cluster
+-----+-----+
|
+------------------+-------------------+
| controlPlaneRef | infrastructureRef
v v
+----------------------+ +---------------------------+
| KubeadmControlPlane | | (Infra)Cluster |
| (replicas, version) | | e.g. DockerCluster, |
+----------+-----------+ | AWSCluster: VPC/LB |
| +---------------------------+
| creates/manages
v
+---------+ 1:1 +------------------------+
| Machine |------------- | (Infra)Machine |
+---------+ | e.g. DockerMachine, |
^ | AWSMachine: a VM |
| +------------------------+
| creates ^ 1:1
+--------+-------+ |
| MachineSet | <-- template refs: (Infra)MachineTemplate
+--------+-------+ KubeadmConfigTemplate
^
| creates / rolling replace
+--------+----------------+ +---------------------+
| MachineDeployment | | MachineHealthCheck |
| (replicas, version, | | detect unhealthy |
| rolling strategy) | | machines → replace |
+-------------------------+ +---------------------+
The responsibilities of each resource are summarized below.
| Resource | Analogy (pod world) | Responsibility |
|---|---|---|
| Cluster | Umbrella, somewhat like a Namespace | Top-level resource binding cluster network CIDRs and control plane / infra references |
| Machine | Pod | Declaration for one node. Immutable — replaced when spec changes |
| MachineSet | ReplicaSet | Maintains replica count of identical machines |
| MachineDeployment | Deployment | Orchestrates rolling updates of MachineSets |
| KubeadmControlPlane | Closest to a StatefulSet | Manages control plane machines, etcd membership, certificates, versions |
| KubeadmConfig | cloud-init generator | Generates kubeadm init/join configuration executed at machine boot |
| MachineHealthCheck | livenessProbe plus auto-replace | Detects unhealthy nodes and triggers remediation |
KubeadmControlPlane (KCP) is especially important. Because of etcd quorum, control plane nodes cannot be swapped as casually as workers. During an upgrade, the KCP controller automates the sequence of adding one new control plane machine, joining it as an etcd member, safely removing the old machine from etcd, and then discarding it. A task that makes you sweat when done by hand finishes with a one-line declaration.
The Provider Model — The Substance of the Abstraction
The CAPI core knows nothing about infrastructure. Actual VM creation belongs to infrastructure providers, node initialization script generation to bootstrap providers, and control plane orchestration to control plane providers.
Infrastructure Providers
| Provider | Target infrastructure | Maturity / notes |
|---|---|---|
| CAPA | AWS (EC2, EKS) | Mature. Supports EKS managed control planes |
| CAPZ | Azure (VM, AKS) | Mature. Supports AKS managed topologies |
| CAPG | GCP (GCE, GKE) | Stable, but narrower feature breadth than CAPA/CAPZ |
| CAPV | vSphere | The de facto standard choice for on-prem virtualization |
| CAPO | OpenStack | Active in telco / private cloud |
| CAPD | Docker containers | Dev/test/CI only. Never production |
| Metal3 | Bare metal (Ironic-based) | Controls physical server power/images via BMC |
| BYOH | Reuse existing hosts | Enrolls hosts that already have an OS as nodes |
Bootstrap / Control Plane Providers
kubeadm is the default (CABPK + KCP), but the ecosystem is wider.
| Provider | Distribution | Characteristics |
|---|---|---|
| kubeadm (default) | Vanilla Kubernetes | Most mature, the reference implementation |
| k3s | k3s | Lightweight edge environments. Maintained by the k3s-io community |
| RKE2 | RKE2 | Rancher family, security-hardened distribution |
| Talos | Talos Linux | Immutable OS with no SSH, managed only via API. Sidero Labs |
| Managed CP | EKS/AKS/GKE | Managed control plane resources built into infra providers |
The relationship with managed Kubernetes is covered separately below, but the headline is this: with resources like CAPA's AWSManagedControlPlane, even an EKS control plane can become the target of a CAPI declaration.
Hands-On — Running a Cluster Factory on Your Laptop with CAPD
CAPD (Cluster API Provider Docker) treats Docker containers as "machines," so you can experience the full flow on a laptop without a cloud account. You need kind, Docker, clusterctl, and kubectl.
Step 1 — Prepare the Management Cluster and Run clusterctl init
# Create a kind cluster that lets CAPD use the Docker socket
cat > kind-mgmt.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: capi-mgmt
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /var/run/docker.sock
    containerPath: /var/run/docker.sock
EOF
kind create cluster --config kind-mgmt.yaml
# Enable the ClusterClass feature, then install CAPI + CAPD
export CLUSTER_TOPOLOGY=true
clusterctl init --infrastructure docker
clusterctl init installs cert-manager, the CAPI core, the kubeadm bootstrap/control-plane providers, and CAPD into the management cluster. Verify the installation:
kubectl get pods -A | grep -E "capi|capd|cert-manager"
# capd-system / capi-kubeadm-bootstrap-system /
# capi-kubeadm-control-plane-system / capi-system Running means healthy
Step 2 — The Full Workload Cluster Declaration YAML
You could generate a template with clusterctl generate cluster, but for learning purposes we will look at the core resources directly. Below is the full manifest for a cluster with 1 control plane node and 2 workers.
# dev-cluster-01.yaml: the Cluster resource is the top-level umbrella
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: dev-cluster-01
  namespace: default
  labels:
    env: dev
    region: local
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
    serviceDomain: cluster.local
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: dev-cluster-01-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster
    name: dev-cluster-01
---
# Infra cluster: in CAPD this represents the LB container etc.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DockerCluster
metadata:
  name: dev-cluster-01
  namespace: default
---
# Infra template for control plane machines
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DockerMachineTemplate
metadata:
  name: dev-cluster-01-cp
  namespace: default
spec:
  template:
    spec:
      extraMounts:
      - containerPath: /var/run/docker.sock
        hostPath: /var/run/docker.sock
---
# KubeadmControlPlane: the control plane declaration
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: dev-cluster-01-cp
  namespace: default
spec:
  replicas: 1
  version: v1.31.4
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerMachineTemplate
      name: dev-cluster-01-cp
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        certSANs: [localhost, 127.0.0.1, 0.0.0.0, host.docker.internal]
    initConfiguration:
      nodeRegistration:
        criSocket: unix:///var/run/containerd/containerd.sock
    joinConfiguration:
      nodeRegistration:
        criSocket: unix:///var/run/containerd/containerd.sock
---
# Infra template for worker machines
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DockerMachineTemplate
metadata:
  name: dev-cluster-01-md-0
  namespace: default
spec:
  template:
    spec: {}
---
# Bootstrap (join) config template for worker machines
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: dev-cluster-01-md-0
  namespace: default
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          criSocket: unix:///var/run/containerd/containerd.sock
---
# MachineDeployment: the worker pool declaration
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: dev-cluster-01-md-0
  namespace: default
spec:
  clusterName: dev-cluster-01
  replicas: 2
  selector:
    matchLabels: null
  template:
    spec:
      clusterName: dev-cluster-01
      version: v1.31.4
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: dev-cluster-01-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: dev-cluster-01-md-0
Note that the examples in this article use the widely deployed v1beta1 API. The latest CAPI release lines are transitioning to v1beta2, so check the API version of the release you run before applying.
kubectl apply -f dev-cluster-01.yaml
Step 3 — Watching Convergence
# View overall state as a tree (the single most useful command)
clusterctl describe cluster dev-cluster-01
# Track individual resources
kubectl get cluster,machinedeployment,machineset,machine
kubectl get kubeadmcontrolplane
docker ps # you will see the "machine" containers CAPD created
Watching a Machine transition through Pending → Provisioning → Provisioned → Running feels exactly like the pod lifecycle.
Step 4 — Obtain the kubeconfig and Install a CNI
Nodes in the new cluster are NotReady because there is no CNI yet. This is normal.
clusterctl get kubeconfig dev-cluster-01 > dev-01.kubeconfig
kubectl --kubeconfig dev-01.kubeconfig get nodes
# STATUS NotReady — expected before CNI installation
# Install Calico (Cilium, Flannel, anything works)
kubectl --kubeconfig dev-01.kubeconfig apply -f \
https://raw.githubusercontent.com/projectcalico/calico/v3.29.1/manifests/calico.yaml
kubectl --kubeconfig dev-01.kubeconfig get nodes
# Shortly afterwards, all nodes Ready
This is the moment of realization: you just created a cluster with a single kubectl apply. Want a hundred of them? Apply a hundred YAML sets. And those YAMLs can live in Git.
Operations Deep Dive — Upgrades, Scaling, Auto-Remediation
How Rolling Upgrades Work — Replace Nodes, Never Patch Them
CAPI upgrades are not OS-patch style but machine-replacement style. Change the version field and the controller creates new-version machines, joins them, drains the old ones, and discards them.
# Control plane upgrade: change only the KCP version
kubectl patch kubeadmcontrolplane dev-cluster-01-cp --type merge \
-p '{"spec":{"version":"v1.32.2"}}'
# Worker upgrade: change the MachineDeployment version
kubectl patch machinedeployment dev-cluster-01-md-0 --type merge \
-p '{"spec":{"template":{"spec":{"version":"v1.32.2"}}}}'
The internal sequence looks like this.
KCP upgrade (with replicas=3)
1. Create 1 new v1.32 control plane machine (4 total)
2. Join with kubeadm join --control-plane, etcd members = 4
3. Pick 1 old v1.31 machine → etcd member remove → drain → delete (3 total)
4. Repeat 1-3 until no old machines remain
5. Quorum is preserved throughout: etcd membership goes 3 → 4 → 3 in each of the three rounds
MachineDeployment upgrade (RollingUpdate strategy)
1. Create a new-version MachineSet
2. Add new machines per maxSurge/maxUnavailable, drain and delete old ones
3. Repeat until the old MachineSet reaches replicas = 0
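The control plane sequence above can be checked with a little arithmetic. The following Python sketch is a toy model, not the KCP controller: `quorum` and `kcp_rolling_upgrade` are hypothetical helper names, and the model assumes every join and removal succeeds. It traces etcd membership through a scale-up-then-remove rollout and shows that the cluster can tolerate one member failure at every step.

```python
def quorum(members):
    """Votes etcd needs to commit a write: a strict majority of members."""
    return members // 2 + 1


def kcp_rolling_upgrade(replicas):
    """Return (step, etcd_members, quorum) tuples for the rollout.

    Each round adds one new control plane machine, then removes one old one,
    until no old machines remain.
    """
    old, new = replicas, 0
    trace = [("start", old + new, quorum(old + new))]
    while old > 0:
        new += 1  # new control plane machine joins etcd
        trace.append(("join new", old + new, quorum(old + new)))
        old -= 1  # old member removed from etcd, then drained and deleted
        trace.append(("remove old", old + new, quorum(old + new)))
    return trace


for step, members, needed in kcp_rolling_upgrade(3):
    spare = members - needed
    print(f"{step:<10} members={members} quorum={needed} tolerates={spare} failure(s)")
```

With three replicas the membership alternates between 3 and 4, and the quorum requirement alternates between 2 and 3, so one spare vote is available throughout. This is also why the order matters: removing an old member before the new one has joined would drop a three-node control plane to two members with no tolerance for failure.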
The upgrade ordering rule follows the Kubernetes version skew policy exactly: control plane first, workers later. Since the kubelet may lag kube-apiserver by up to three minor versions (as of Kubernetes v1.28; earlier releases allowed two), you have room to roll worker upgrades out in stages.
Scaling
# Workers 2 → 5
kubectl scale machinedeployment dev-cluster-01-md-0 --replicas=5
# Control plane 1 → 3 (switch to HA)
kubectl scale kubeadmcontrolplane dev-cluster-01-cp --replicas=3
If you need autoscaling, the cluster-autoscaler supports a CAPI provider mode that scales MachineDeployments instead of cloud node groups.
MachineHealthCheck — Automatic Node Failure Recovery
A MachineHealthCheck (MHC) watches node conditions and, when an unhealthy condition persists past its timeout, marks the machine for remediation: the machine is deleted and its owner (for example, a MachineSet) creates a replacement.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: dev-cluster-01-worker-mhc
  namespace: default
spec:
  clusterName: dev-cluster-01
  # Safety valve so too many machines are not replaced at once
  maxUnhealthy: 40%
  # Unhealthy if the machine fails to join as a node within this window
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: dev-cluster-01-md-0
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
maxUnhealthy is a critical safety mechanism. For example, when a CNI outage makes every node NotReady, it prevents the catastrophe of MHC replacing all machines. Once the unhealthy ratio exceeds the threshold, remediation stops and waits for a human.
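The safety valve is easiest to understand as a small calculation. The sketch below is a simplified Python model, not the MHC controller's code: the helper names are hypothetical, and it assumes that a percentage is resolved against the number of selected machines and rounded down. Verify the rounding behavior against the CAPI release you run.

```python
def max_unhealthy_allowed(max_unhealthy, total):
    """Resolve a maxUnhealthy value ("40%" or a plain integer) to a machine count.

    Assumption: percentages are rounded down.
    """
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        return total * int(max_unhealthy[:-1]) // 100
    return int(max_unhealthy)


def remediation_allowed(max_unhealthy, total, unhealthy):
    """Remediate only while the unhealthy count stays within the limit."""
    return unhealthy <= max_unhealthy_allowed(max_unhealthy, total)


# Five workers with maxUnhealthy: 40% means at most two may be unhealthy.
print(remediation_allowed("40%", total=5, unhealthy=1))  # single node failure
print(remediation_allowed("40%", total=5, unhealthy=5))  # fleet-wide CNI outage
```

Under the rounding-down assumption, small pools deserve attention: 40% of the two workers in the lab cluster resolves to zero, which would block remediation of even a single failed node. For small pools an absolute number may be the clearer choice.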
ClusterClass — Turning Clusters into Templates
Managing seven YAML documents per cluster becomes torture at even ten clusters. ClusterClass defines a "class (template)" of cluster, and individual clusters become thin declarations that only fill in variables.
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: standard-dev
  namespace: default
spec:
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: standard-dev-cp
    machineInfrastructure:
      ref:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: standard-dev-cp-machine
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerClusterTemplate
      name: standard-dev-infra
  workers:
    machineDeployments:
    - class: default-worker
      template:
        bootstrap:
          ref:
            apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
            kind: KubeadmConfigTemplate
            name: standard-dev-worker-bootstrap
        infrastructure:
          ref:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
            kind: DockerMachineTemplate
            name: standard-dev-worker-machine
  variables:
  - name: workerReplicas
    required: true
    schema:
      openAPIV3Schema:
        type: integer
        default: 2
Now a single cluster shrinks to this. The topology field is the key.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: dev-cluster-02
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  topology:
    class: standard-dev
    version: v1.31.4
    controlPlane:
      replicas: 1
    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        replicas: 3
    variables:
    - name: workerReplicas
      value: 3
The real power of managed topologies is propagation of class changes. Modify the machine template in a ClusterClass, and every cluster using that class follows along (per policy) via rolling replacement. Standardizing hundreds of clusters runs on this mechanism.
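Once clusters are this thin, stamping them out for a fleet becomes a small templating job. The following Python sketch is one possible approach, not part of CAPI or clusterctl: `topology_cluster` is a hypothetical helper, the field values mirror the `dev-cluster-02` example, and the output is JSON, which is also valid YAML.

```python
import json


def topology_cluster(name, cluster_class, version, workers, labels=None):
    """Build a thin, topology-based Cluster declaration as a plain dict."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": "default", "labels": labels or {}},
        "spec": {
            "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
            "topology": {
                "class": cluster_class,
                "version": version,
                "controlPlane": {"replicas": 1},
                "workers": {
                    "machineDeployments": [
                        {"class": "default-worker", "name": "md-0", "replicas": workers}
                    ]
                },
            },
        },
    }


# Hypothetical fleet: (cluster name, worker count)
fleet = [("dev-cluster-02", 3), ("dev-cluster-03", 2)]
for name, workers in fleet:
    manifest = topology_cluster(name, "standard-dev", "v1.31.4", workers, {"env": "dev"})
    print(json.dumps(manifest, indent=2))
```

Each rendered document could be written to its own file in a Git repository, so that the per-cluster difference is reduced to a name, a label set, and a replica count.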
Machine Image Management — image-builder
In production you use golden images with kubeadm and kubelet pre-baked. The Kubernetes SIG-maintained image-builder project builds AWS AMIs, Azure images, vSphere OVAs, and Raw/QCOW2 (bare metal) images on top of Packer + Ansible.
git clone https://github.com/kubernetes-sigs/image-builder.git
cd image-builder/images/capi
# Example: build an Ubuntu 22.04 + K8s v1.31 OVA for vSphere
make build-node-ova-vsphere-ubuntu-2204
Image versioning is node versioning. Pin "K8s version + OS patch level" into the image tag, and every node in a cluster boots from the same known image instead of drifting apart through in-place package updates. This is the foundation of immutable infrastructure.
Combining with GitOps — Governing a Cluster Fleet from Git
CAPI resources are, in the end, ordinary Kubernetes YAML, so syncing them into the management cluster with Argo CD or Flux completes GitOps for clusters.
+-----------+     push      +-----------+     sync      +---------------------+
| Platform  | ------------> | Git       | ------------> | Management Cluster  |
| Team      |   PR review   | fleet-    |  Argo CD /    | CAPI controllers    |
|           |               | repo      |  Flux         | reconcile           |
+-----------+               +-----------+               +----------+----------+
                                                                   |
                                           create/upgrade/repair   v
                                                     +---------------------------+
                                                     | Workload Clusters (fleet) |
                                                     +---------------------------+
An example repository layout:
fleet-repo/
├── clusterclasses/
│ ├── standard-prod.yaml
│ └── standard-edge.yaml
├── clusters/
│ ├── prod/
│ │ ├── prod-seoul-01.yaml # thin topology-based declarations
│ │ └── prod-tokyo-01.yaml
│ ├── stage/
│ │ └── stage-seoul-01.yaml
│ └── edge/
│ ├── store-0001.yaml
│ └── store-0002.yaml
└── addons/
├── cni/ # default addons for new clusters
└── monitoring/
The operational effects of this pattern are powerful.
- Cluster creation and changes must pass PR review. The audit trail is the Git history itself.
- Version upgrades become standardized as "a PR that changes the version field in YAML."
- To auto-deploy addons like CNI and monitoring into new clusters, use the Argo CD ApplicationSet cluster generator or CAPI ClusterResourceSets.
Naming and Label Strategy at Fleet Scale
Across tens to hundreds of clusters, consistent naming and labels become the handles for automation.
Naming convention example: <purpose>-<region>-<serial>
prod-seoul-01, stage-tokyo-01, edge-store-0042
Recommended labels (set on the Cluster resource):
env: prod | stage | dev
region: seoul | tokyo | ...
tier: core | edge
team: payments | search | ...
upgrade-wave: "1" | "2" | "3" # controls upgrade waves
The upgrade-wave label is especially useful. Apply the new version to wave 1 (internal clusters) first, let it bake for a few days, then spread to waves 2 and 3 — an operation you can automate with label selectors.
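As an illustration of how the label drives ordering, here is a minimal Python sketch. It is not tied to any CAPI client library: `plan_waves` is a hypothetical helper and the fleet is a hard-coded dict standing in for the labels you would read from Cluster resources.

```python
from collections import defaultdict


def plan_waves(clusters):
    """Group cluster names by their upgrade-wave label, ordered by wave number."""
    waves = defaultdict(list)
    for name, labels in clusters.items():
        # Label values are strings in Kubernetes, so convert before sorting.
        waves[int(labels["upgrade-wave"])].append(name)
    return [(wave, sorted(names)) for wave, names in sorted(waves.items())]


fleet = {
    "stage-tokyo-01": {"env": "stage", "upgrade-wave": "1"},
    "prod-seoul-01": {"env": "prod", "upgrade-wave": "2"},
    "edge-store-0042": {"env": "prod", "upgrade-wave": "3"},
    "edge-store-0043": {"env": "prod", "upgrade-wave": "3"},
}
for wave, names in plan_waves(fleet):
    print(f"wave {wave}: {', '.join(names)}")
```

Against a live management cluster the same grouping is available through an ordinary label selector, for example `kubectl get clusters -A -l upgrade-wave=1`.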
Pivot, Backup/DR, Version Skew — Operating the Management Cluster Itself
Chicken and Egg — Who Creates the Management Cluster
The common bootstrap pattern looks like this.
- Create a temporary local kind cluster and run clusterctl init
- From the temporary cluster, create the "real" management cluster (cloud/on-prem) as a workload cluster
- Move all CAPI resources to the new cluster with clusterctl move (pivot)
- The management cluster then manages itself (self-hosted)
# Install providers on the new cluster, then move the resources
clusterctl init --kubeconfig mgmt-real.kubeconfig --infrastructure aws
clusterctl move --to-kubeconfig mgmt-real.kubeconfig
clusterctl move transfers each Cluster together with its whole ownership chain of objects and the associated secrets (kubeconfig, CA, etcd certificates). Reconciliation is paused during the move, and the workload clusters themselves keep running unaffected.
If the Management Cluster Dies, Do Workloads Die Too
No. This is an important virtue of CAPI design. Workload clusters operate fully independently without the management cluster. What you lose is lifecycle operations (creation/upgrade/auto-remediation). You still need a DR plan.
- If the source of truth for CAPI resources is Git (GitOps), the primary recovery path is building a new management cluster and re-syncing. However, cluster secrets (CA and so on) are not in Git, so backups are mandatory.
- Periodically back up the CAPI namespaces of the management cluster (resources plus secrets) with tools like Velero.
- Rehearse the recovery. "We have backups" and "we can restore" are different propositions.
Version Skew and Upgrade Order
The versions of the management machinery itself are also under management.
Upgrade order (top to bottom)
1. Kubernetes version of the management cluster
(check the range supported by your CAPI release)
2. The clusterctl binary
3. CAPI core + providers: clusterctl upgrade plan / apply
4. Kubernetes versions of workload clusters (rolling by wave)
- Inside each cluster: control plane → workers
- No skipping minor versions (1.30 → 1.32 not allowed; go via 1.31)
Also check CAPI contract versions (the compatibility contract with infrastructure providers). Upgrading only the core while neglecting providers leads to incidents where reconciliation stops. clusterctl upgrade plan computes the compatibility matrix, so always look at the plan first.
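The two version rules for workload clusters (no skipped minors, kubelet lag within the skew limit) are simple enough to encode as a pre-flight check. This Python sketch is illustrative only: the function names are hypothetical, only forward upgrades within one major version are modelled, and the three-minor kubelet limit is taken from the skew discussion earlier in this article.

```python
def parse_minor(version):
    """'v1.31.4' -> (1, 31)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current, target):
    """Minor versions to pass through, one at a time, from current to target."""
    (cmaj, cmin), (tmaj, tmin) = parse_minor(current), parse_minor(target)
    if cmaj != tmaj or tmin < cmin:
        raise ValueError("only forward upgrades within one major version are modelled")
    return [f"{cmaj}.{m}" for m in range(cmin + 1, tmin + 1)]


def kubelet_skew_ok(apiserver, kubelet, max_lag=3):
    """The kubelet may be older than kube-apiserver by max_lag minors, never newer."""
    (amaj, amin), (kmaj, kmin) = parse_minor(apiserver), parse_minor(kubelet)
    return amaj == kmaj and 0 <= amin - kmin <= max_lag


print(upgrade_path("v1.30.8", "v1.32.2"))     # the minors to step through
print(kubelet_skew_ok("v1.32.2", "v1.31.4"))  # workers one minor behind
print(kubelet_skew_ok("v1.31.4", "v1.32.2"))  # workers ahead of the control plane
```

A check like this could run in the PR pipeline of the fleet repository, rejecting a version bump that jumps more than one minor or that leaves workers outside the supported skew.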
Limits and Reality — What to Know Before Adopting
Learning Curve and Required Operational Maturity
CAPI has many abstraction layers. When something breaks, you must be able to ride the debugging chain down from Cluster → KCP → Machine → InfraMachine → cloud-init logs. For teams unfamiliar with Kubernetes controller patterns (owner references, conditions, finalizers), it is a steep hill.
Uneven Provider Maturity
CAPA/CAPZ/CAPV have many large-scale production references, but some providers have thin feature breadth and documentation. Before adopting, verify the provider's release cadence, issue response speed, and ClusterClass support.
The Relationship with Managed Kubernetes (EKS/AKS/GKE)
"We use EKS anyway — do we need CAPI?" is a fair question. The answer depends on the shape of your organization.
- If you run only three EKS clusters, Terraform or eksctl is probably sufficient.
- But if your fleet mixes EKS, on-prem vSphere, and edge bare metal, CAPI is close to the only option that covers everything with a single API. Managed control planes can also be declared as CAPI resources, via CAPA's AWSManagedControlPlane or CAPZ's AKS support.
Comparison with Crossplane / Terraform
| Aspect | Cluster API | Crossplane | Terraform |
|---|---|---|---|
| Primary purpose | Dedicated to K8s cluster lifecycle | General cloud resource composition | General IaC |
| Operating model | Always-on controller reconcile | Always-on controller reconcile | Apply at execution time |
| Node/machine abstraction | First-class (Machine etc.) | None (managed K8s focused) | Indirect via modules |
| Control plane orchestration | KCP automates down to etcd | Delegated to managed CPs | Build it yourself |
| Auto-remediation | MachineHealthCheck built in | Drift correction level | None |
| Non-cluster resources (DBs etc.) | Out of scope | Strength | Strength |
| State store | etcd (K8s native) | etcd | State file |
These three are often a division of labor rather than rivals. It is common to layer them: "VPC/IAM with Terraform, clusters with CAPI, the databases clusters use with Crossplane."
Which Organizations Does It Fit — A Decision Guide
Q1. You manage fewer than 5 clusters with no growth plans
→ CAPI is overkill. Managed K8s + IaC is enough.
Q2. You have 10+ clusters, or growth toward tenants/edge is coming
→ Strong CAPI candidate. Especially if cluster creation must be self-service.
Q3. Your estate mixes on-prem (vSphere/bare metal) or multi-cloud
→ CAPI's strongest use case. The single declarative model shines most here.
Q4. Does the team have K8s controller/CRD operating experience
→ If not, build muscle first at small scale (CAPD labs, dev clusters).
Q5. Does a platform team exist
→ CAPI is a platform team tool. Without a dedicated owner it rots unattended.
Troubleshooting — The Order to Look When a Machine Is Stuck
A machine stuck in Provisioning is the most common symptom in CAPI operations. Internalize the diagnostic order and you will find the cause within 20 minutes most of the time.
# 1. Big picture: see where it stopped, as a tree
clusterctl describe cluster dev-cluster-01 --show-conditions all
# 2. Machine conditions and events
kubectl describe machine dev-cluster-01-md-0-xxxxx
kubectl get events --field-selector involvedObject.kind=Machine
# 3. Infra machine (provider side) state — did the VM actually start
kubectl describe dockermachine dev-cluster-01-md-0-xxxxx
# On AWS: kubectl describe awsmachine ...
# 4. Was the bootstrap secret created (cloud-init data)
kubectl get secret | grep dev-cluster-01-md-0
# 5. Controller logs — layer by layer
kubectl logs -n capi-system deploy/capi-controller-manager
kubectl logs -n capi-kubeadm-bootstrap-system \
deploy/capi-kubeadm-bootstrap-controller-manager
kubectl logs -n capi-kubeadm-control-plane-system \
deploy/capi-kubeadm-control-plane-controller-manager
kubectl logs -n capd-system deploy/capd-controller-manager
# 6. cloud-init logs inside the machine (access differs per infra)
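# VM-based providers: log in to the machine, then
# cat /var/log/cloud-init-output.log (or journalctl -u kubelet)
# CAPD: docker exec into the container and check journalctl -u kubelet;
# CAPD emulates cloud-init, so bootstrap errors surface in the capd-controller-manager logs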
A list of frequently encountered causes:
| Symptom | Common cause |
|---|---|
| InfraMachine never appears | Provider not installed, credential secret error, quota exceeded |
| VM started but never joins | Image lacks kubeadm/kubelet, network blocked toward the CP endpoint |
| Stuck on the first CP machine | LB/endpoint not created, certSANs mismatch, etcd failed to start |
| Nodes stay NotReady | CNI not installed (normal phase), CNI misconfiguration |
| Stuck mid-upgrade | Drain blocked by PDBs, insufficient capacity for maxSurge |
| Deletion never finishes | Waiting on finalizers — check provider logs for infra deletion failure |
In particular, drain deadlock caused by PodDisruptionBudgets is a regular upgrade issue. A PDB whose minAvailable equals the replica count means the drain never completes. Setting nodeDrainTimeout lets the process force ahead after a set period.
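The deadlock follows directly from how a PodDisruptionBudget with minAvailable is evaluated. The Python sketch below is a simplified model with hypothetical function names; it ignores maxUnavailable-style budgets and percentage values.

```python
def allowed_disruptions(healthy_pods, min_available):
    """How many pods a minAvailable PDB lets an eviction remove right now."""
    return max(0, healthy_pods - min_available)


def drain_can_progress(healthy_pods, min_available):
    """A drain can evict a pod only if at least one disruption is allowed."""
    return allowed_disruptions(healthy_pods, min_available) > 0


# Three replicas with minAvailable: 3 leaves no room for any eviction.
print(drain_can_progress(healthy_pods=3, min_available=3))
# Three replicas with minAvailable: 2 lets the drain evict one pod at a time.
print(drain_can_progress(healthy_pods=3, min_available=2))
```

When the first case applies, the eviction is refused on every retry, so the node never empties and the rollout waits until either the PDB is relaxed or the drain timeout expires.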
Adoption Checklist
[ ] Does your cluster count / growth plan justify adopting CAPI
[ ] Did you vet the maturity of your infra provider (CAPA/CAPV/Metal3 etc.)
[ ] Has the whole team experienced the full flow via a CAPD-based local lab
[ ] Did you build a golden image pipeline (image-builder)
[ ] Did you templatize the standard cluster shape with ClusterClass
[ ] Did you configure the MachineHealthCheck maxUnhealthy safety valve
[ ] Did you define the GitOps repo structure and PR review policy
[ ] Did you document naming/label/upgrade-wave strategies
[ ] Did you finish management cluster backup (Velero etc.) and a restore drill
[ ] Do you have a regular upgrade calendar driven by clusterctl upgrade plan
[ ] Did you write a runbook for stuck-machine troubleshooting
[ ] Did you agree PDB/nodeDrainTimeout policies with workload teams
Closing Thoughts
The essence of Cluster API is not a "cluster creation tool" but a replacement of the operating model for clusters. In the world of console clicks and scripts, a cluster was a handcrafted artifact. In the CAPI world, a cluster is an ordinary resource — defined by declaration and guarded by controllers, like pods stamped out by a Deployment. Dead nodes get replaced, version bumps converge via rolling replacement, and every change lands in Git history.
Of course it is not free. There is the learning curve of the abstraction layers, the judgment needed about the provider ecosystem, and a new operational subject called the management cluster. For organizations with five clusters or fewer, it may be overkill. But if your destiny is dozens of clusters soon — and most platform organizations cannot escape that destiny — it is a technology that pays compound interest the earlier you learn it. Start today with a 30-minute lab on your laptop using kind and CAPD.
References
- Cluster API official documentation (The Cluster API Book): https://cluster-api.sigs.k8s.io/
- Cluster API GitHub repository: https://github.com/kubernetes-sigs/cluster-api
- Quick Start (CAPD lab): https://cluster-api.sigs.k8s.io/user/quick-start
- ClusterClass concept documentation: https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/
- clusterctl reference: https://cluster-api.sigs.k8s.io/clusterctl/overview
- Cluster API Provider AWS (CAPA): https://cluster-api-aws.sigs.k8s.io/
- Cluster API Provider Azure (CAPZ): https://capz.sigs.k8s.io/
- Cluster API Provider GCP (CAPG): https://github.com/kubernetes-sigs/cluster-api-provider-gcp
- Cluster API Provider vSphere (CAPV): https://github.com/kubernetes-sigs/cluster-api-provider-vsphere
- Cluster API Provider OpenStack (CAPO): https://github.com/kubernetes-sigs/cluster-api-provider-openstack
- Metal3 (bare metal provider): https://metal3.io/
- Kubernetes image-builder: https://image-builder.sigs.k8s.io/
- kubeadm official documentation: https://kubernetes.io/docs/reference/setup-tools/kubeadm/
- Kubernetes version skew policy: https://kubernetes.io/releases/version-skew-policy/
- kind official documentation: https://kind.sigs.k8s.io/
- Argo CD official documentation: https://argo-cd.readthedocs.io/
- Flux official documentation: https://fluxcd.io/
- Crossplane official site: https://www.crossplane.io/