
Introduction

Even in an era where managed Kubernetes offerings (EKS, GKE, AKS) have become the de facto standard, plenty of organizations still need to build Kubernetes directly on on-premises bare metal. If anything, demand has been growing again — driven by the GPU farm boom and the broader cloud repatriation trend.

The problem is that "installing Kubernetes yourself" is a much bigger job than it sounds. kubeadm only solves control plane bootstrapping; everything around it — OS preparation, container runtime installation, etcd topology, CNI deployment, load balancers, certificate renewal, upgrade orchestration — remains the operator's responsibility. The tool that fills this gap is Kubespray.

In this article we will dig deep, code-first, into what Kubespray actually is and how it works internally, how to design inventories and variables for a production cluster, and how day-2 operations (adding nodes, upgrades, backups, hardening) map onto playbooks. At the end, we will position Kubespray against Cluster API and offer decision criteria for when to use which. The reference versions are the Kubespray v2.28 line and Kubernetes v1.32 line, as of the first half of 2026.

Why On-Prem Kubernetes Still Matters

Despite cloud-first strategies being the norm, on-prem and bare-metal Kubernetes is not going away, for clear reasons.

First, data sovereignty and regulation. Network separation requirements in Korean financial regulations, national security guidelines for public institutions, and GDPR data residency requirements in Europe all directly govern where workloads physically run. For banks running workloads that integrate with the core banking ledger inside internal networks, or public-sector environments where external and internal networks are physically separated, public cloud is often simply not an option.

Second, GPU farms. As more organizations build their own LLM training and inference infrastructure, it has become common to tie tens or hundreds of GPU servers together with Kubernetes. There is a clear break-even point where owning expensive GPUs and maximizing utilization beats renting them by the hour in the cloud, in total cost of ownership terms.

Third, cost. For workloads with predictable traffic and resource usage above a certain scale, a private data center or colocation becomes cheaper even after accounting for depreciation. Egress traffic charges and block storage costs are the classic triggers that push teams toward repatriation.

Fourth, latency and special hardware. Edge computing on factory lines, trading systems that require colocation with exchanges, and telco NFV environments using SR-IOV or DPDK all presuppose direct control over physical infrastructure.

In these environments, "how do we install and maintain Kubernetes" immediately becomes the next question. Let us first compare the candidate answers.

The Tooling Landscape — What Do You Install With?

On-prem Kubernetes build tools fall into roughly five camps.


One-line decision guides:

- Manual kubeadm is the best textbook for understanding Kubernetes internals, but repeatable builds across dozens of machines need an automation layer.

- kOps is strong for self-managed AWS clusters but has effectively no bare-metal story.

- RKE2 and k3s are the easiest to install with good security defaults, but they are their own distributions rather than the upstream kubeadm path, and they pull you into the Rancher ecosystem.

- Talos is the most radical and attractive approach — an immutable OS without even SSH — but adoption is hard in organizations that must follow existing OS standards (security agents, asset management, and so on).

- Kubespray precisely fits the most common enterprise requirement: "we already have a standard Linux build, and we need to install upstream Kubernetes on top of it with fine-grained control." For organizations already using Ansible, the learning curve is gentle.

The protagonist of this article is that last option: Kubespray.

What Kubespray Really Is — Ansible Playbooks Wrapping kubeadm

Kubespray is a CNCF project under kubernetes-sigs. At its core, it is "a collection of Ansible playbooks and roles that build production-grade Kubernetes clusters." There is no magic proprietary engine — it orchestrates proven components with Ansible.

- Control plane bootstrapping calls kubeadm internally. A Kubespray cluster is a kubeadm cluster, inheriting the kubeadm certificate scheme and upgrade path as-is.

- Its roles cover the full span: OS preparation (swap off, kernel modules, sysctl), containerd installation, etcd cluster setup, CNI deployment (Calico, Cilium, flannel, and more), and add-ons (CoreDNS, MetalLB, ingress-nginx, etc.).

- The support matrix is broad: many Linux distributions, multiple CNIs, air-gapped environments, GPU nodes, and various topologies (stacked or external etcd), all selected through variables.

The end-to-end flow in ASCII:

+--------------------------------------------------------------------+
| Ansible control node (operator workstation or bastion)             |
|                                                                    |
| inventory/prod/                                                    |
|  ├─ hosts.yaml                ← node list and groups (topology)    |
|  └─ group_vars/                                                    |
|      ├─ all/all.yml           ← global vars (LB, proxy, registry)  |
|      └─ k8s_cluster/                                               |
|          ├─ k8s-cluster.yml   ← versions, CIDRs, CNI, proxy mode   |
|          └─ addons.yml        ← ingress, metallb, cert-manager     |
|                                                                    |
| ansible-playbook cluster.yml                                       |
|         │                                                          |
|         ▼                                                          |
| [role execution order]                                             |
| 1. kubernetes/preinstall    → OS checks, sysctl, modules, swap     |
| 2. container-engine         → install containerd / crio            |
| 3. download                 → fetch binaries and images (mirror)   |
| 4. etcd                     → etcd cluster and certificates        |
| 5. kubernetes/control-plane → kubeadm init / join (CP nodes)       |
| 6. kubernetes/node          → kubelet config, kubeadm join         |
| 7. network_plugin           → apply Calico / Cilium manifests      |
| 8. kubernetes-apps          → CoreDNS, add-ons, MetalLB, etc.      |
+--------------------------------------------------------------------+
          │ SSH (become: root)
          ▼
+-------------------+ +-------------------+ +-------------------+
+-------------------+ +-------------------+ +-------------------+
+-------------------+ +-------------------+ +-------------------+

Two key insights. First, Kubespray is a procedural execution tool, not a declarative controller. It converges state only at the moment you run the playbook; if you do not run it, it does nothing. Second, the inventory plus group_vars are the definition of the cluster shape, so managing this directory in Git is the starting point of operations.

Prerequisites — Nodes, Network, Access

Node requirements

For production, the following is recommended.

- Control plane: 3 nodes (odd number to preserve quorum), minimum 2 vCPU / 4GB RAM, recommended 4 vCPU / 8GB or more. If etcd is co-located, disks must be local SSD/NVMe. etcd is extremely sensitive to fsync latency; a slow disk translates directly into cluster-wide instability.

- Workers: size to your workloads, accounting for systemReserved and kubeReserved for kubelet and system daemons.

- OS: distributions in the Kubespray support matrix such as Ubuntu 22.04/24.04 LTS, RHEL 9, Rocky 9. Standardizing OS and kernel versions across all nodes drastically lowers troubleshooting cost.

- Common conditions: swap disabled (Kubespray handles this, but check the permanent fstab setting), unique hostname, MAC, and product_uuid per node, time synchronization (chrony), and the br_netfilter and overlay kernel modules.

containerd is the default container runtime; leave it unless you have a specific reason. CRI-O is selectable, but containerd has the broadest support.

Network planning — CIDR design

These values are effectively impossible to change after the build, so this is where you must be most careful. Three ranges must never overlap with each other or with your corporate network.

Node network       : 10.10.0.0/24     (physical server IPs, from corporate IPAM)
Pod CIDR           : 10.233.64.0/18   (kube_pods_subnet, default)
Service CIDR       : 10.233.0.0/18    (kube_service_addresses, default)
Per-node pod range : /24              (kube_network_node_prefix)
  → about 110 pods max per node, up to 64 nodes with a /18

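The sizing formula is simple: subtract the pod CIDR prefix length from the per-node prefix length, and 2 raised to that difference is the maximum node count. With a /18 pod range and /24 per node, that is 2^(24-18) = 2^6, i.e. a 64-node ceiling. If you are planning for 300 nodes, you need at least 2^9 = 512 per-node blocks, so decide before the build to widen the pod range to /15 at /24 per node, or to combine a /16 pod range with a /25 per-node prefix. A /16 at /24 stops at 256 nodes, and shrinking only the per-node prefix to /25 on a /18 stops at 128. A widened pod range must also stay clear of the service CIDR: 10.233.0.0/16, for instance, would swallow the default 10.233.0.0/18 service range. Finally, overlap with the corporate network silently cuts connectivity to whatever internal systems live in that range, so register these CIDRs formally in your IPAM documentation together with the network team.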
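The arithmetic above is easy to get wrong under time pressure, so it can help to script it. The following is a minimal sketch using only Python's standard `ipaddress` module; the function names (`max_nodes`, `required_pod_prefix`, `find_overlaps`) are illustrative helpers, not part of Kubespray, and the 100-node target is an arbitrary example.

```python
import ipaddress


def max_nodes(pod_cidr: str, node_prefix: int) -> int:
    """Number of per-node blocks of size /node_prefix that fit in pod_cidr."""
    pod_net = ipaddress.ip_network(pod_cidr)
    if node_prefix < pod_net.prefixlen:
        raise ValueError("per-node prefix must not be shorter than the pod CIDR prefix")
    return 2 ** (node_prefix - pod_net.prefixlen)


def required_pod_prefix(target_nodes: int, node_prefix: int) -> int:
    """Longest pod CIDR prefix that still holds target_nodes per-node blocks."""
    if target_nodes < 1:
        raise ValueError("target_nodes must be at least 1")
    bits = (target_nodes - 1).bit_length()  # ceil(log2(target_nodes)) without floats
    return node_prefix - bits


def find_overlaps(ranges: dict) -> list:
    """Return every pair of named ranges that overlap each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


# The ranges from the plan above.
plan = {
    "node": "10.10.0.0/24",
    "pod": "10.233.64.0/18",
    "service": "10.233.0.0/18",
}

print(max_nodes(plan["pod"], 24))     # node ceiling for the default layout
print(required_pod_prefix(100, 24))   # pod prefix needed for 100 nodes at /24 each
print(find_overlaps(plan))            # pairs of ranges that collide, if any
```

In practice you would add your corporate ranges to `plan` before calling `find_overlaps`, since those are the collisions that are hardest to notice after the build.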

Control plane HA — a VIP in front of the API server

Even with three control plane nodes, if clients (kubelet, kubectl, CI) point at a single node IP, that node is a single point of failure. There are three remedies.

1. kube-vip: runs as a static pod on the control plane nodes and advertises a VIP via ARP (or BGP). No external appliance needed — the simplest option on bare metal. Kubespray supports it directly via the kube_vip_enabled variable.

2. HAProxy + keepalived: two dedicated LB nodes run HAProxy spreading port 6443 across the three CP backends, with keepalived VRRP failing over the VIP. Traditional and battle-tested; a good fit where a network team owns the LB layer.

3. External hardware LB: if you already have L4 appliances (F5, etc.), use them.

Kubespray also has a built-in mechanism: the loadbalancer_apiserver_localhost option (enabled by default) runs an nginx-proxy static pod on each non-control-plane node, balancing from localhost across the three control plane nodes. Internal cluster components therefore get a degree of HA without an external LB, but kubectl and CI access from outside the cluster still needs the VIP.

SSH and sudo

Ansible connects to every node over SSH and works as root, so prepare the following.

# Generate a key on the control node and distribute it to all six nodes
ssh-keygen -t ed25519 -f ~/.ssh/kubespray_ed25519
for h in 10.10.0.11 10.10.0.12 10.10.0.13 10.10.0.21 10.10.0.22 10.10.0.23; do
  ssh-copy-id -i ~/.ssh/kubespray_ed25519.pub deploy@"$h"
done

# Grant passwordless sudo to the deploy account (on each node)
echo 'deploy ALL=(ALL) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/deploy
sudo chmod 0440 /etc/sudoers.d/deploy

Air-gapped environment preparation — overview

In a network-separated environment, three classes of artifacts normally fetched from the internet must be moved to internal mirrors.

1. Container images → internal registry (Harbor, etc.)

2. Binary files (kubeadm, kubelet, etcd, CNI plugins, crictl, etc.) → internal HTTP server

3. OS packages (containerd dependencies, etc.) → internal yum/apt repository

Kubespray ships scripts under contrib/offline that extract the required image and file lists. Generate the lists in an internet-connected staging zone, populate your mirrors, and in the air-gapped inventory override the download URL variables to point at internal mirrors. The concrete variables are covered in the customization section below.

Hands-On — The Full Cluster Build Flow

Step 1: Prepare Kubespray and create the inventory

# Always check out a release tag; never build production from master
git clone --branch v2.28.0 https://github.com/kubernetes-sigs/kubespray.git
cd kubespray

# Use a dedicated virtualenv to avoid Ansible version conflicts
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt

# Copy the sample inventory to create the production inventory
cp -rfp inventory/sample inventory/prod

Pinning to a release tag is not just a nicety. Each Kubespray release fixes a matrix of supported Kubernetes minor versions and component versions; combinations outside that matrix are untested territory.

Step 2: hosts.yaml — defining the topology

A standard layout with three control plane nodes (etcd co-located) and three workers:

# inventory/prod/hosts.yaml
all:
  hosts:
    cp1:
      ansible_host: 10.10.0.11
      ip: 10.10.0.11          # internal IP for kubelet/etcd to bind
      access_ip: 10.10.0.11   # IP other nodes use to reach this node
    cp2:
      ansible_host: 10.10.0.12
      ip: 10.10.0.12
      access_ip: 10.10.0.12
    cp3:
      ansible_host: 10.10.0.13
      ip: 10.10.0.13
      access_ip: 10.10.0.13
    worker1:
      ansible_host: 10.10.0.21
      ip: 10.10.0.21
    worker2:
      ansible_host: 10.10.0.22
      ip: 10.10.0.22
    worker3:
      ansible_host: 10.10.0.23
      ip: 10.10.0.23
  children:
    kube_control_plane:
      hosts:
        cp1:
        cp2:
        cp3:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        cp1:
        cp2:
        cp3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Topology design points:

- You can separate the etcd group onto three dedicated nodes (stacked vs. external etcd). External etcd is more stable for large clusters (hundreds of nodes) or heavy API server load, but the managed node count grows to six. For clusters under 50 nodes, stacked is usually plenty.

- Make a habit of setting ip and access_ip explicitly. On multi-NIC servers, Ansible auto-detecting the wrong interface IP — and etcd binding to the wrong network — is a classic failure.

- The group names (kube_control_plane, kube_node, etcd, k8s_cluster) are reserved words referenced by Kubespray roles; do not rename them.
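The design points above lend themselves to an automated pre-flight check. Below is a minimal sketch of what such a check could look like; it is an assumption about how one might validate the inventory, not something Kubespray ships. It works on the inventory as a plain dictionary, and the function name `validate_inventory` and the sample data are hypothetical.

```python
REQUIRED_GROUPS = ("kube_control_plane", "kube_node", "etcd", "k8s_cluster")


def validate_inventory(inv: dict) -> list:
    """Return a list of human-readable problems found in a hosts.yaml-shaped dict."""
    problems = []
    all_section = inv.get("all") or {}
    hosts = all_section.get("hosts") or {}
    children = all_section.get("children") or {}

    # The reserved group names must exist exactly as Kubespray expects them.
    for group in REQUIRED_GROUPS:
        if group not in children:
            problems.append(f"missing reserved group: {group}")

    # Every host referenced by a group must be defined under all.hosts.
    for group, body in children.items():
        for name in (body or {}).get("hosts") or {}:
            if name not in hosts:
                problems.append(f"{group}: host {name} is not defined under all.hosts")

    # etcd needs an odd member count to keep quorum through a single failure.
    etcd_members = (children.get("etcd") or {}).get("hosts") or {}
    if len(etcd_members) % 2 == 0:
        problems.append(f"etcd has {len(etcd_members)} members; use an odd number")

    # Explicit 'ip' avoids binding to the wrong interface on multi-NIC servers.
    for name, hostvars in hosts.items():
        if not (hostvars or {}).get("ip"):
            problems.append(f"{name}: 'ip' is not set explicitly")

    return problems


# Deliberately flawed sample: two etcd members, and worker1 has no 'ip'.
inventory = {
    "all": {
        "hosts": {
            "cp1": {"ansible_host": "10.10.0.11", "ip": "10.10.0.11"},
            "cp2": {"ansible_host": "10.10.0.12", "ip": "10.10.0.12"},
            "worker1": {"ansible_host": "10.10.0.21"},
        },
        "children": {
            "kube_control_plane": {"hosts": {"cp1": None, "cp2": None}},
            "kube_node": {"hosts": {"worker1": None}},
            "etcd": {"hosts": {"cp1": None, "cp2": None}},
            "k8s_cluster": {"children": {"kube_control_plane": None, "kube_node": None}},
        },
    }
}

for problem in validate_inventory(inventory):
    print(problem)
```

To run this against a real file you would load hosts.yaml into a dictionary first, for example with PyYAML's `yaml.safe_load` (a third-party package that is normally present wherever Ansible is installed).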

Step 3: Dissecting group_vars — k8s-cluster.yml

This file determines the identity of the cluster. The essential variables:

# inventory/prod/group_vars/k8s_cluster/k8s-cluster.yml

# Kubernetes version: only within the range supported by this Kubespray release
kube_version: "1.32.5"

# Network plugin: calico, cilium, flannel, kube-ovn, etc.
kube_network_plugin: calico

# CIDRs: unchangeable after the build; must not overlap with corporate ranges
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_network_node_prefix: 24

# kube-proxy mode: ipvs recommended (beats iptables at thousands of services)
kube_proxy_mode: ipvs

# Container runtime
container_manager: containerd

# DNS
dns_mode: coredns
enable_nodelocaldns: true     # node-local DNS cache; near-mandatory at scale

# Cluster name (internal domain)
cluster_name: cluster.local

# Copy the admin kubeconfig to the control node after the build
kubeconfig_localhost: true

# Automatic certificate renewal (systemd timer attempts renewal monthly)
auto_renew_certificates: true

Decision criteria per variable:

- kube_network_plugin: the default, Calico, is a proven choice with BGP-based routing and network policy. Choose Cilium if you need the eBPF data plane, Hubble observability, or L7 policy. Crucially, this choice is effectively permanent — swapping the CNI on a running cluster is not a scenario Kubespray supports.

- kube_proxy_mode: ipvs avoids the linear scan cost of iptables mode when service counts grow large. Note that with Cilium in kube-proxy replacement mode you can even remove kube-proxy entirely.

- enable_nodelocaldns: a node-local cache absorbs pod DNS queries, reducing CoreDNS load and conntrack contention. Any operator who has suffered DNS timeouts will want this on by default.

The API server endpoint (VIP) is defined in the all-group variables.

# inventory/prod/group_vars/all/all.yml

# When using an external LB (HAProxy+keepalived or a hardware LB)
apiserver_loadbalancer_domain_name: "k8s-api.prod.internal"
loadbalancer_apiserver:
  address: 10.10.0.100        # VIP
  port: 6443

# Per-node nginx-proxy local LB (default on; HA for internal components)
loadbalancer_apiserver_localhost: true

With kube-vip, no separate LB nodes are needed:

# Control plane VIP via kube-vip (ARP mode)
# Note: with kube_proxy_mode: ipvs, ARP mode also needs kube_proxy_strict_arp: true
kube_vip_enabled: true
kube_vip_controlplane_enabled: true
kube_vip_arp_enabled: true
kube_vip_address: 10.10.0.100
loadbalancer_apiserver:
  address: 10.10.0.100
  port: 6443

Step 4: Dissecting group_vars — addons.yml

Select the base add-ons that go on top of the cluster.

# inventory/prod/group_vars/k8s_cluster/addons.yml

helm_enabled: true
metrics_server_enabled: true

# Ingress controller
ingress_nginx_enabled: true
ingress_nginx_host_network: false

# LoadBalancer service implementation for bare metal: MetalLB
metallb_enabled: true
metallb_speaker_enabled: true
metallb_config:
  address_pools:
    primary:
      ip_range:
        - 10.10.0.150-10.10.0.180
      auto_assign: true
  layer2:
    - primary

# Certificate automation
cert_manager_enabled: true

# If simple local PVs are sufficient
local_path_provisioner_enabled: true

On bare metal there is no cloud provider to satisfy LoadBalancer-type services, so MetalLB takes that role. L2 mode is simple to configure but funnels traffic through a single node; BGP mode requires peering with your ToR switches but achieves true distribution. If you can negotiate with the network team, BGP mode is worth evaluating. Also note a widely used strategy: let Kubespray install add-ons only for the initial bootstrap, then manage their lifecycle directly with Helm/GitOps afterwards — Kubespray add-on variables are convenient but give you little freedom in chart version selection.
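A common L2-mode mistake is a MetalLB pool that collides with an address already in use, such as the API server VIP or a node IP, or that falls outside the node subnet where ARP announcements would actually be answered. The sketch below checks both conditions with the standard `ipaddress` module. `check_pool` and the `reserved` mapping are hypothetical names for illustration, and the "pool should sit inside the node subnet" rule is the typical L2 setup rather than a hard MetalLB requirement.

```python
import ipaddress


def parse_range(text: str):
    """Parse 'a.b.c.d-e.f.g.h' into a (start, end) pair of address objects."""
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise ValueError(f"not a start-end range: {text!r}")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    if end < start:
        raise ValueError(f"range end precedes start: {text!r}")
    return start, end


def check_pool(ip_range: str, node_network: str, reserved: dict) -> list:
    """Return problems with a MetalLB L2 pool given the node subnet and known IPs."""
    start, end = parse_range(ip_range)
    net = ipaddress.ip_network(node_network)
    problems = []
    if start not in net or end not in net:
        problems.append(f"pool {ip_range} is not fully inside {node_network}")
    for label, ip in reserved.items():
        if start <= ipaddress.ip_address(ip) <= end:
            problems.append(f"pool {ip_range} contains {label} ({ip})")
    return problems


# Addresses taken from the inventory and VIP used in this article.
reserved = {
    "api-vip": "10.10.0.100",
    "cp1": "10.10.0.11",
    "cp2": "10.10.0.12",
    "cp3": "10.10.0.13",
    "worker1": "10.10.0.21",
    "worker2": "10.10.0.22",
    "worker3": "10.10.0.23",
}

print(check_pool("10.10.0.150-10.10.0.180", "10.10.0.0/24", reserved))
```

The same check is worth repeating against the DHCP scope and any statically assigned appliances on that subnet, which usually live in the network team's IPAM rather than in the inventory.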

Step 5: Execution — cluster.yml

# Pre-flight connectivity check
ansible -i inventory/prod/hosts.yaml all -m ping \
  --private-key ~/.ssh/kubespray_ed25519 -u deploy -b

# The main run: roughly 20 to 40 minutes for 6 nodes
ansible-playbook -i inventory/prod/hosts.yaml \
  --private-key ~/.ssh/kubespray_ed25519 -u deploy \
  --become --become-user=root \
  cluster.yml

The stages scrolling by match the role order in the architecture diagram above. Three segments deserve attention: failures in the download role point to network/proxy/mirror problems; a stall in the etcd role points to inter-node 2379/2380 connectivity or the ip variable; failure at the health check after kubeadm init in the control-plane role usually means VIP or certificate SAN misconfiguration.

Step 6: Verification

# With kubeconfig_localhost: true, admin.conf is copied under the inventory
export KUBECONFIG=$PWD/inventory/prod/artifacts/admin.conf

kubectl get nodes -o wide
# NAME      STATUS   ROLES           VERSION   INTERNAL-IP
# cp1       Ready    control-plane   v1.32.5   10.10.0.11
# cp2       Ready    control-plane   v1.32.5   10.10.0.12
# cp3       Ready    control-plane   v1.32.5   10.10.0.13
# worker1   Ready    <none>          v1.32.5   10.10.0.21
# ...

# Control plane pods and CNI state
kubectl get pods -n kube-system

# etcd members and health (on cp1)
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.0.11:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-cp1.pem \
  --key=/etc/ssl/etcd/ssl/node-cp1-key.pem \
  endpoint health --cluster

# Smoke test: deploy → expose → DNS → delete
kubectl create deployment nginx --image=nginx --replicas=3
kubectl expose deployment nginx --port=80
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never \
  -- nslookup nginx.default.svc.cluster.local
kubectl delete deployment nginx svc/nginx

Beyond this, it is safest to consider production acceptance criteria met only after a VIP failover test (force-stop one CP node and verify kubectl keeps working) and a drain test of one worker.

Day-2 Operations as Playbooks

The real value of Kubespray is that day-2 operations are all codified as playbooks.

Adding nodes — scale.yml

# 1) Add worker4 to hosts.yaml, then
# 2) run scale targeting only the new node
ansible-playbook -i inventory/prod/hosts.yaml \
  --become --limit=worker4 \
  scale.yml

scale.yml skips the steps in cluster.yml that reconfigure existing nodes and performs only the preparation and kubeadm join of the new node, making it fast and safe. One caveat: even with the limit option, Ansible still needs facts from other hosts such as the etcd group. If fact gathering fails, the playbook breaks — so make refreshing facts for all nodes a habit first.

# Refresh facts for all nodes before any limit run
ansible-playbook -i inventory/prod/hosts.yaml --become playbooks/facts.yml

Adding a control plane node requires cluster.yml, not scale.yml, and do not forget to refresh certificate SANs and the LB backend list afterwards.

Removing nodes — remove-node.yml

# Drain → remove from cluster → clean the node, all in one
ansible-playbook -i inventory/prod/hosts.yaml \
  --become \
  -e node=worker3 \
  remove-node.yml

remove-node.yml cordons and drains the target node, runs kubectl delete node, and cleans up kubelet and the runtime on the node side. If the node is already dead and unreachable, pass reset_nodes=false to skip node-side cleanup and remove only the cluster metadata. After removal, delete the node from hosts.yaml as well to keep the inventory matching reality. Skipping this synchronization is how inventory drift begins.

Upgrades — upgrade-cluster.yml

The operation that demands the most care. Principles first:

1. Never skip minor versions. You cannot jump from 1.30 straight to 1.32. Per the kubeadm and Kubernetes version skew policy, you must step through 1.30 → 1.31 → 1.32, switching the Kubespray release tag at each step to match.

2. Record and maintain the pairing of Kubespray version and cluster version. Your inventory repository should state plainly: "this cluster runs 1.31.4, deployed with tag v2.27.1."

3. An etcd backup before any upgrade is non-negotiable.
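The first two principles can be turned into a small planning helper that refuses to skip a minor version and reminds you which Kubespray tag goes with each hop. This is a sketch under stated assumptions: `upgrade_path` and `plan_upgrade` are hypothetical names, and the tag table contains only the two pairings mentioned in this article, so treat it as a placeholder to be filled from the Kubespray release notes.

```python
# Illustrative pairing taken from the examples in this article; verify it
# against the Kubespray release notes before relying on it.
KUBESPRAY_TAG_FOR_MINOR = {
    "1.31": "v2.27.1",
    "1.32": "v2.28.0",
}


def _major_minor(version: str):
    """'v1.32.5' or '1.32' -> (1, 32)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def upgrade_path(current: str, target: str) -> list:
    """Every minor version that must be visited, in order, ending at target."""
    cur_major, cur_minor = _major_minor(current)
    tgt_major, tgt_minor = _major_minor(target)
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


def plan_upgrade(current: str, target: str) -> list:
    """Pair each hop with the Kubespray tag to check out for it."""
    return [
        (minor, KUBESPRAY_TAG_FOR_MINOR.get(minor, "unknown: check release notes"))
        for minor in upgrade_path(current, target)
    ]


for minor, tag in plan_upgrade("1.30.8", "1.32.5"):
    print(f"upgrade to {minor}.x using Kubespray {tag}")
```

Each hop in the printed plan is a full upgrade-cluster.yml run, with its own etcd backup beforehand.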

# Example: 1.31.x → 1.32.x
cd kubespray && git checkout v2.28.0
pip install -r requirements.txt   # the required Ansible version changes too

ansible-playbook -i inventory/prod/hosts.yaml \
  --become \
  -e kube_version=1.32.5 \
  upgrade-cluster.yml

How upgrade-cluster.yml behaves is the key to zero downtime.

- Control plane and etcd are upgraded first, one node at a time.

- Workers proceed drain → upgrade kubelet/runtime → uncordon, with the number of nodes processed concurrently controlled by the serial variable (default 20%). Set serial=1 to be conservative.

- Drain behavior is tuned with drain_grace_period, drain_timeout, and drain_retries. An overly tight PodDisruptionBudget makes the drain wait forever, so auditing PDBs before an upgrade is mandatory.
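The PDB audit mentioned in the last point can be partly automated. A PodDisruptionBudget reports how many pods may currently be evicted in its `status.disruptionsAllowed` field; when that is zero, a drain touching those pods will wait. The sketch below filters the JSON produced by `kubectl get pdb -A -o json` for such budgets. The function name `blocking_pdbs` and the sample data are illustrative.

```python
import json


def blocking_pdbs(pdb_list: dict) -> list:
    """Return 'namespace/name' for every PDB that currently allows no disruption."""
    blocked = []
    for item in pdb_list.get("items", []):
        meta = item.get("metadata", {})
        status = item.get("status", {})
        if status.get("disruptionsAllowed", 0) == 0:
            blocked.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return blocked


# Trimmed-down sample of what `kubectl get pdb -A -o json` returns.
sample = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "prod", "name": "api-pdb"},
     "status": {"disruptionsAllowed": 1, "currentHealthy": 3, "desiredHealthy": 2}},
    {"metadata": {"namespace": "prod", "name": "db-pdb"},
     "status": {"disruptionsAllowed": 0, "currentHealthy": 1, "desiredHealthy": 1}}
  ]
}
""")

for name in blocking_pdbs(sample):
    print(f"drain blocker: {name}")
```

Against a live cluster you could pipe the kubectl output into a script that calls `json.load(sys.stdin)` and then this function, and run it as a gate before starting the upgrade.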

# One node at a time, with generous drain timeouts
ansible-playbook -i inventory/prod/hosts.yaml --become \
  -e kube_version=1.32.5 \
  -e serial=1 \
  -e drain_timeout=600s \
  -e drain_grace_period=120 \
  upgrade-cluster.yml

Application-side requirements for zero downtime must be covered too: at least two replicas, sensible PDBs, preStop hooks and graceful shutdown, and identifying drain blockers like single-replica StatefulSets in advance.

reset.yml — the last resort

# Tears the cluster off the nodes completely, including etcd data on disk
ansible-playbook -i inventory/prod/hosts.yaml --become reset.yml

reset.yml removes Kubernetes, etcd, CNI configuration, and container data from the nodes. Accidents really do happen where someone meaning to "reset just a few nodes" runs it without limit on a production cluster. There is a confirmation prompt, but wiring it into CI with auto-approval is absolutely forbidden.

etcd backup and restore

Kubespray does not back up etcd for you; build the discipline yourself.

# Backup (on an etcd node; run periodically via cron/systemd timer)
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snap-20260613.db \
  --endpoints=https://10.10.0.11:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-cp1.pem \
  --key=/etc/ssl/etcd/ssl/node-cp1-key.pem

# Verify integrity
ETCDCTL_API=3 etcdctl --write-out=table snapshot status /backup/etcd-snap-20260613.db

Always ship snapshot files to storage outside the cluster, and run a quarterly restore rehearsal (snapshot → etcdctl snapshot restore → start etcd on a fresh data directory) to prove the backups are actually restorable. A surprising number of organizations have backups but have never once rehearsed a restore.

Customization — Things You Will Inevitably Meet in Production

Certificate management

Control plane certificates in a kubeadm cluster are valid for one year by default. In Kubespray, two variables manage this.

# A systemd timer (k8s-certs-renew.timer) renews near-expiry certs monthly
auto_renew_certificates: true

# Force regeneration in a playbook run when needed
-e force_certificate_regeneration=true

With auto_renew_certificates on, you prevent the classic "one year later the whole cluster cannot authenticate" incident. Note that the CA itself (10-year default) and any certificates installed on external LBs are separate concerns. Pair this with expiry monitoring (an x509 exporter, for example).
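If a full exporter is not in place yet, even a small script gives early warning. The sketch below parses the line printed by `openssl x509 -enddate -noout -in <cert>` (of the form `notAfter=Jun 13 00:00:00 2027 GMT`) and returns the number of whole days left. It uses only the standard library; `days_until_expiry` is a hypothetical helper name and the dates are made up for the example.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(enddate_line: str, now: datetime) -> int:
    """Whole days between `now` and the notAfter date in an openssl -enddate line."""
    _, sep, not_after = enddate_line.strip().partition("=")
    if not sep:
        raise ValueError(f"unexpected openssl output: {enddate_line!r}")
    # cert_time_to_seconds understands the 'Mon DD HH:MM:SS YYYY GMT' format
    # that OpenSSL prints, independent of the current locale.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


line = "notAfter=Jun 13 00:00:00 2027 GMT"
today = datetime(2026, 6, 13, tzinfo=timezone.utc)
remaining = days_until_expiry(line, today)
print(f"{remaining} days left")
if remaining < 30:
    print("WARNING: certificate expires within 30 days")
```

On a control plane node, `kubeadm certs check-expiration` prints the same information for all kubeadm-managed certificates at once; the script form is mainly useful for certificates kubeadm does not know about, such as those on an external LB.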

Private registry and air-gap details

The heart of an air-gapped build is the variable set that redirects every download path to internal mirrors.

# inventory/prod/group_vars/all/offline.yml

registry_host: "harbor.internal:443/k8s-mirror"
files_repo: "https://mirror.internal/kubespray-files"

# Point all container image repositories at the internal registry
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
github_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"

# Point binary download URLs at the internal file server
kubeadm_download_url: "{{ files_repo }}/kubeadm/{{ kube_version }}/kubeadm"
kubelet_download_url: "{{ files_repo }}/kubelet/{{ kube_version }}/kubelet"
kubectl_download_url: "{{ files_repo }}/kubectl/{{ kube_version }}/kubectl"
etcd_download_url: "{{ files_repo }}/etcd/etcd-{{ etcd_version }}-linux-{{ host_architecture }}.tar.gz"
cni_download_url: "{{ files_repo }}/cni/cni-plugins-linux-{{ host_architecture }}-{{ cni_version }}.tgz"
crictl_download_url: "{{ files_repo }}/crictl/crictl-{{ crictl_version }}-linux-{{ host_architecture }}.tar.gz"
runc_download_url: "{{ files_repo }}/runc/{{ runc_version }}/runc.{{ host_architecture }}"
containerd_download_url: "{{ files_repo }}/containerd/containerd-{{ containerd_version }}-linux-{{ host_architecture }}.tar.gz"

# If the registry uses a private CA, register trust in containerd
containerd_registries_mirrors:
  - prefix: "harbor.internal"
    mirrors:
      - host: "https://harbor.internal"
        capabilities: ["pull", "resolve"]
        skip_verify: false

In the staging zone, use generate_list.sh under contrib/offline to extract the full file and image list required by the release, and manage-offline-container-images.sh to pull and push images to the internal registry in bulk. One operational tip: make the mirror-population job itself a CI pipeline. Updating the lists by hand at every upgrade — and missing something — is the single most common cause of failed air-gapped upgrades.
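One way to catch a missed artifact before the upgrade window is to diff the required image list against what the internal registry actually holds. The sketch below assumes the mirror stores each image under the same path minus its original registry host (so `registry.k8s.io/pause:3.10` becomes `<mirror prefix>/pause:3.10`), which is a layout assumption you should confirm for your own registry. `missing_images` and `strip_registry` are hypothetical names, and the image tags are illustrative.

```python
def strip_registry(image: str) -> str:
    """Drop the first path segment: 'registry.k8s.io/pause:3.10' -> 'pause:3.10'.

    Assumes fully qualified image names that start with a registry host.
    """
    head, sep, rest = image.partition("/")
    return rest if sep else head


def missing_images(required: list, mirrored: list, mirror_prefix: str) -> list:
    """Return required images that have no counterpart under mirror_prefix."""
    prefix = mirror_prefix.rstrip("/") + "/"
    have = {m[len(prefix):] for m in mirrored if m.startswith(prefix)}
    return sorted(img for img in required if strip_registry(img) not in have)


# Required list, e.g. as extracted in the staging zone (tags are made up).
required = [
    "registry.k8s.io/kube-apiserver:v1.32.5",
    "registry.k8s.io/pause:3.10",
    "quay.io/calico/node:v3.29.3",
]

# What the internal registry reports as present.
mirrored = [
    "harbor.internal:443/k8s-mirror/kube-apiserver:v1.32.5",
    "harbor.internal:443/k8s-mirror/pause:3.10",
]

for image in missing_images(required, mirrored, "harbor.internal:443/k8s-mirror"):
    print(f"not mirrored yet: {image}")
```

Wired into the mirror-population pipeline as a final step, a non-empty result can fail the job so the gap is found in CI rather than halfway through the download role.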

Injecting sysctl and kernel tuning

You can manage node OS tuning in a separate Ansible role, but injecting it via Kubespray variables keeps the configuration in one place.

# inventory/prod/group_vars/k8s_cluster/k8s-cluster.yml

additional_sysctl:
  - name: net.core.somaxconn
    value: 65535
  - name: net.ipv4.tcp_max_syn_backlog
    value: 65535
  - name: fs.inotify.max_user_instances
    value: 8192
  - name: fs.inotify.max_user_watches
    value: 1048576

# Make kubelet reserve node resources
kube_reserved: true
kube_memory_reserved: 512Mi
kube_cpu_reserved: 200m
system_reserved: true
system_memory_reserved: 1Gi
system_cpu_reserved: 500m

The inotify limits are a bottleneck you will inevitably hit with log collectors and dense pod packing; raise them early.

Extra manifests and GitOps integration

Kubespray has no good generic hook for injecting arbitrary manifests. The practical pattern is clear: limit the responsibility of Kubespray to "from node OS up to the CNI," and manage everything above it (monitoring, logging, detailed ingress configuration, applications) with a GitOps tool such as Argo CD or Flux. Have the final step of the build pipeline kubectl-apply a single Argo CD bootstrap manifest, and from then on Git is the source of truth for everything inside the cluster.

Production Hardening

CIS Benchmark and kube-bench

CIS Kubernetes Benchmark conformance is a recurring requirement in financial and public-sector security reviews. Kubespray defaults do not fully satisfy CIS, so the recommended cycle is: run kube-bench after the build, identify the gaps, and correct them via variables. Frequently adjusted items:

# API server hardening
kube_apiserver_request_timeout: 120s
kube_apiserver_enable_admission_plugins:
  - NodeRestriction
  - AlwaysPullImages
  - EventRateLimit
kube_apiserver_admission_event_rate_limits:
  limit_1:
    type: Namespace
    qps: 50
    burst: 100
    cache_size: 2000
kube_profiling: false

# kubelet hardening
kubelet_protect_kernel_defaults: true
kubelet_event_record_qps: 1
kubelet_streaming_connection_idle_timeout: 5m
kubelet_make_iptables_util_chains: true

# API server anonymous auth: kubeadm disables it for the kubelet, but the
# API server itself leaves it on by default, so set it explicitly. Test this
# in staging first; health checks and token-based joins may rely on it.
kube_api_anonymous_auth: false

Rather than trying to silence every kube-bench finding, the realistic approach is to weigh the impact of each item on your workloads (AlwaysPullImages increases registry load, for example) and document the rationale for exceptions.

Audit log

kubernetes_audit: true
audit_log_path: /var/log/kubernetes/audit/audit.log
audit_log_maxage: 30          # retention days
audit_log_maxbackups: 10
audit_log_maxsize: 100        # MB

# To use a custom policy instead of the default,
# define rules in audit_policy_custom_rules

Audit logs are effectively the only primary evidence in a security incident investigation, so the full package includes a pipeline shipping them to a central log system rather than leaving them on node-local disk.

Encryption at rest for Secrets

Encrypt Secrets that would otherwise sit in etcd as plaintext.

kube_encrypt_secret_data: true

# The default provider is secretbox; switchable to aescbc and others
kube_encryption_algorithm: "secretbox"
kube_encryption_resources: [secrets]

This is the minimum defense ensuring that a leaked etcd snapshot does not expose Secret plaintext. The limitation remains that the key itself lives on control plane node disks; for stricter requirements, evaluate external KMS integration.

Pod Security Admission defaults

After the removal of PodSecurityPolicy, PSA is the standard. Kubespray variables can set cluster-wide defaults.

kube_pod_security_use_default: true
kube_pod_security_default_enforce: baseline   # applied to new namespaces
kube_pod_security_default_audit: restricted
kube_pod_security_default_warn: restricted
kube_pod_security_exempt_namespaces:
  - kube-system

Raising enforce straight to restricted can mass-reject existing workloads, so the safe sequence is to surface violations via audit/warn first, then ratchet enforce up in stages.

CI/CD Integration and Operating at Scale

Inventory in Git, execution in pipelines

A recommended repository structure:

k8s-clusters/                  ← internal Git repository
├─ README.md                   ← cluster-to-Kubespray version table
├─ clusters/
│  ├─ prod-seoul/
│  │  ├─ hosts.yaml
│  │  └─ group_vars/ ...
│  └─ stage-seoul/
│     ├─ hosts.yaml
│     └─ group_vars/ ...
└─ pipelines/
   └─ run-kubespray.yaml       ← CI definition

Pull the Kubespray codebase itself via a submodule or an in-pipeline git clone (pinned to a tag), and route all inventory changes through PR review. A skeleton of the execution pipeline:

# GitLab CI example: conceptual skeleton
stages: [lint, diff, apply]

lint:
  stage: lint
  script:
    - ansible-lint clusters/prod-seoul || true
    - python3 scripts/validate_inventory.py clusters/prod-seoul

apply-prod:
  stage: apply
  when: manual                 # production must require manual approval
  script:
    - git clone --branch v2.28.0 --depth 1
      https://github.com/kubernetes-sigs/kubespray.git
    # Run from the Kubespray root so its ansible.cfg and roles are picked up
    - cd kubespray
    - pip install -r requirements.txt
    - ansible-playbook -i ../clusters/prod-seoul/hosts.yaml
      --become cluster.yml
  environment: prod-seoul

Idempotency — its uses and its limits

Thanks to Ansible idempotency, rerunning cluster.yml converges only what changed, and resuming from a failure point is often as simple as rerunning the same command. But know the limits precisely.

- Removing an item from variables does not retract the existing configuration on nodes. Disabling an add-on, for example, does not necessarily delete already-deployed resources. This is the essential limitation of a procedural tool.

- Check mode (dry run) is unreliable with Kubespray, because many tasks depend on the real results of earlier tasks. Do not expect a Terraform-style "review the diff and approve" workflow; substitute it with applying to a staging cluster first.

- Playbook runs can collide with concurrent changes (manual kubectl operations and the like), so align execution windows with change freeze windows as an operational discipline.
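Because removed or disabled variables are not retracted on the nodes, it can help to flag them at review time so that someone plans the manual cleanup. The sketch below compares two revisions of a variables dictionary; the helper names are hypothetical and the variable names in the example are used for illustration only.

```python
def removed_keys(old: dict, new: dict, prefix: str = "") -> list:
    """List dotted paths that exist in old but are absent from new."""
    removed = []
    for key, old_value in old.items():
        path = f"{prefix}{key}"
        if key not in new:
            removed.append(path)
        elif isinstance(old_value, dict) and isinstance(new[key], dict):
            # Recurse into nested mappings.
            removed.extend(removed_keys(old_value, new[key], prefix=f"{path}."))
    return removed


def disabled_flags(old: dict, new: dict) -> list:
    """List top-level flags that flipped from True to False."""
    return sorted(k for k, v in old.items() if v is True and new.get(k) is False)


# Illustrative revisions of group_vars, already parsed into dictionaries.
old_vars = {"metallb_enabled": True, "ingress_nginx_enabled": True, "kube_network_plugin": "calico"}
new_vars = {"ingress_nginx_enabled": False, "kube_network_plugin": "calico"}

print("removed:", removed_keys(old_vars, new_vars))
print("disabled:", disabled_flags(old_vars, new_vars))
```

Anything this reports is a candidate for a manual cleanup step, since rerunning the playbook alone will not remove what was already deployed.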

Tips for large fleets — parallelism and partial runs

# ansible.cfg
# Comments sit on their own lines: Ansible treats an inline "#" as part of the value.
[defaults]
# default is 5; raise it for dozens of nodes
forks = 50
strategy = linear

[ssh_connection]
# fewer SSH round trips; very noticeable
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=30m
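A point that is easy to miss in ansible.cfg: according to the Ansible configuration documentation, only a semicolon introduces an inline comment, so an inline "#" stays in the value. The sketch below reproduces that rule with Python's configparser. Treating `inline_comment_prefixes=(";",)` as equivalent to Ansible's own parsing is an assumption of this sketch, and `suspicious_values` is a hypothetical helper.

```python
import configparser

# Sample config text written for this demonstration.
SAMPLE = """
[defaults]
forks = 50 # inline hash comment
timeout = 30 ; inline semicolon comment
# full-line hash comment
strategy = linear
"""


def parse_like_ansible(text: str) -> configparser.ConfigParser:
    # Assumption: only ";" is an inline comment marker, as the Ansible docs describe.
    parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(text)
    return parser


def suspicious_values(parser: configparser.ConfigParser) -> list:
    """Return (section, key, value) triples whose value still contains ' #'."""
    return [
        (section, key, value)
        for section in parser.sections()
        for key, value in parser.items(section)
        if " #" in value
    ]


parser = parse_like_ansible(SAMPLE)
for section, key, value in suspicious_values(parser):
    print(f"[{section}] {key} = {value!r}")
```

Running a check like this in the lint stage catches values that silently absorbed a trailing comment.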

For a sense of timing: a fresh 6-node build takes 20 to 40 minutes, a 50-node build takes over an hour, and upgrade time scales almost linearly with node count because of drain time. Two tools help with partial runs, --limit and tags:

# Specific nodes only: always refresh facts first
ansible-playbook -i inventory/prod/hosts.yaml --become playbooks/facts.yml
ansible-playbook -i inventory/prod/hosts.yaml --become \
  --limit=worker7 cluster.yml

# Specific component tags only, e.g. roll out a CoreDNS config change
ansible-playbook -i inventory/prod/hosts.yaml --become \
  --tags=coredns cluster.yml

# Skip the download stage to shorten repeat runs
ansible-playbook -i inventory/prod/hosts.yaml --become \
  --skip-tags=download cluster.yml

Tags and limit are powerful but risk skipping inter-task dependencies, so pair them with the rule of running the same tag combination on staging first.
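One way to make the "refresh facts first" rule hard to forget is to generate partial-run commands from a small helper instead of typing them. This is a sketch only: `partial_run_commands` is a hypothetical function, and it uses just the flags and playbook paths shown above.

```python
import shlex


def partial_run_commands(inventory, limit=None, tags=None, skip_tags=None):
    """Return the ansible-playbook invocations (as argv lists) for a partial run."""
    base = ["ansible-playbook", "-i", inventory, "--become"]
    commands = []
    if limit:
        # A limited run needs facts for every host, so gather them first.
        commands.append(base + ["playbooks/facts.yml"])
    main = list(base)
    if limit:
        main.append(f"--limit={limit}")
    if tags:
        main.append(f"--tags={','.join(tags)}")
    if skip_tags:
        main.append(f"--skip-tags={','.join(skip_tags)}")
    main.append("cluster.yml")
    commands.append(main)
    return commands


for argv in partial_run_commands("inventory/prod/hosts.yaml", limit="worker7"):
    print(shlex.join(argv))
```

Because the helper returns the same argument lists for staging and production apart from the inventory path, it also supports the rule of running an identical tag combination on staging first.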

Kubespray vs. Cluster API — and Coexistence

Cluster API (CAPI) is a Kubernetes SIG Cluster Lifecycle project that declares Kubernetes clusters themselves as Kubernetes resources (Cluster, MachineDeployment, and so on), with controllers in a management cluster continuously converging actual state toward the declared state. Its philosophy is the polar opposite of Kubespray's.

| Aspect | Kubespray | Cluster API |
| --- | --- | --- |
| Paradigm | Procedural: converges only when run | Declarative: controllers converge continuously |
| Required infra | Just an Ansible control node | Management cluster + infrastructure provider |
| Bare metal | Anywhere SSH works | Needs Metal3 (IPMI/Redfish) or the BYOH provider |
| Node recovery | Manual (rerun playbooks) | Automatic recreation via MachineHealthCheck |
| Many clusters | Repeat runs per inventory | Strong at fleet management |
| OS customization | Very flexible (applies onto existing OS) | Requires an image build pipeline |
| Learning curve | Gentle for Ansible users | Requires the CRD and controller model |

Decision criteria:

- If you have a single-digit number of clusters, manually provisioned servers, and in-house Ansible skills, Kubespray is simple and sufficient.

- If you must stamp out dozens of clusters and can automate bare-metal lifecycle via IPMI/Redfish, CAPI plus Metal3 wins long-term.

- There is a realistic hybrid too. The CAPI management cluster itself must be bootstrapped somewhere — the chicken-and-egg problem — and a common pattern is to build that first cluster with Kubespray and manage the rest of the fleet with CAPI. Gradual migration, keeping existing Kubespray clusters while building new ones with CAPI, is also common.
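The criteria above can be written down as a small function, which is sometimes useful for making a team's assumptions explicit. This is a rough sketch, not a rule: the thresholds (fewer than 10 for "single-digit", 20 or more for "dozens") and the name `recommend_tool` are assumptions introduced here.

```python
def recommend_tool(cluster_count: int, bmc_automation: bool, ansible_skills: bool) -> str:
    """Map the three decision criteria to a suggested starting point."""
    # Assumption: "dozens" is read as 20 or more clusters.
    if cluster_count >= 20 and bmc_automation:
        return "cluster-api+metal3"
    # Assumption: "single-digit" is read as fewer than 10 clusters,
    # with servers provisioned by hand (no IPMI/Redfish automation).
    if cluster_count < 10 and ansible_skills and not bmc_automation:
        return "kubespray"
    # Everything in between: Kubespray for the first (management) cluster,
    # Cluster API for the clusters built afterwards.
    return "hybrid"


print(recommend_tool(cluster_count=3, bmc_automation=False, ansible_skills=True))
print(recommend_tool(cluster_count=40, bmc_automation=True, ansible_skills=False))
print(recommend_tool(cluster_count=12, bmc_automation=True, ansible_skills=True))
```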

Common Pitfalls and Troubleshooting

The pitfall list

1. Inventory drift: if someone changes node configuration by hand, or hosts.yaml is not updated after remove-node, the next playbook run behaves unexpectedly. The cure is discipline: inventory changes only via PR, manual node changes forbidden.

2. Conflicts with OS patching: unattended-upgrades bumping containerd on its own, or a kernel upgrade plus reboot dropping kernel module settings, are frequent incidents. Hold Kubernetes-related packages and perform OS patching as a controlled procedure with drains.

3. CNI cannot be changed: changing kube_network_plugin and rerunning cluster.yml will wreck the cluster. If a CNI swap is truly needed, the orthodox path is a new cluster build plus workload migration.

4. Kubespray and Ansible version mismatch: each release requires a specific Ansible version range. Make it routine to create a fresh virtualenv per release and reinstall from requirements.txt.

5. Running cluster.yml against an existing cluster from a different Kubespray version tag: this triggers unintended component upgrades, which is why the per-cluster version table matters.

6. NTP drift: skewed clocks across nodes destabilize certificate validation and etcd. Include chrony configuration in the preinstall checklist.
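Pitfall 5 can be guarded against mechanically by comparing the checked-out Kubespray tag with the per-cluster version table before any playbook runs. The sketch below assumes the table is available as a dictionary; `VERSION_TABLE` and `assert_tag_matches` are hypothetical names, and the table contents are examples.

```python
# Hypothetical per-cluster version table (cluster name -> Kubespray release tag).
VERSION_TABLE = {
    "prod-seoul": "v2.28.0",
    "stage-seoul": "v2.28.0",
}


def assert_tag_matches(cluster: str, checked_out_tag: str, version_table: dict) -> str:
    """Raise if the checked-out Kubespray tag differs from the recorded one."""
    expected = version_table.get(cluster)
    if expected is None:
        raise KeyError(f"{cluster} is missing from the version table")
    if checked_out_tag != expected:
        raise RuntimeError(
            f"{cluster} is recorded at {expected} but the checkout is {checked_out_tag}; "
            "update the table through review before running cluster.yml"
        )
    return expected


# The caller could obtain the tag with `git describe --tags --exact-match`
# inside the Kubespray checkout; here it is passed in directly.
print(assert_tag_matches("prod-seoul", "v2.28.0", VERSION_TABLE))
```

An intentional upgrade then becomes a two-step change: update the table in a reviewed PR, then run the upgrade from the matching tag.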

Resuming after failure, and logs

# Rerun with verbose logs: idempotency lets completed tasks pass unchanged
ansible-playbook -i inventory/prod/hosts.yaml --become cluster.yml -vvv \
  2>&1 | tee /tmp/kubespray-run.log

# First-look points on the node side
journalctl -u kubelet -f      # why kubelet fails to start
journalctl -u containerd -f   # runtime/registry problems
crictl ps -a                  # control plane static pod states
ls /etc/kubernetes/manifests/ # kubeadm static pod manifests

Empirically, eight out of ten failures are "an environmental difference on one specific node." Check the failed task name and node, reproduce the same operation manually on that node, and the cause surfaces quickly. After fixing it, rerun the whole playbook from the start to confirm convergence.
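Finding the failed task name and node in a long verbose log is easier with a small parser. The sketch below assumes Ansible's default output format, where each task starts with a `TASK [...]` banner and failures appear as `fatal: [host]` or `failed: [host]` lines. The sample log is invented for this example, and `failed_tasks` is a hypothetical helper.

```python
import re

# Banner printed before each task by the default Ansible output.
TASK_RE = re.compile(r"^TASK \[(?P<task>.+)\] \*+")
# Failure lines: "fatal: [host]: FAILED! => ..." or "failed: [host] (item=...) => ..."
FAIL_RE = re.compile(r"^(?:fatal|failed): \[(?P<host>[^\]]+)\]")


def failed_tasks(log_text: str) -> list:
    """Return (host, task name) pairs for every failure line in the log."""
    failures = []
    current_task = None
    for line in log_text.splitlines():
        task_match = TASK_RE.match(line)
        if task_match:
            current_task = task_match.group("task")
            continue
        fail_match = FAIL_RE.match(line)
        if fail_match:
            failures.append((fail_match.group("host"), current_task))
    return failures


# Invented sample, not output from a real run.
SAMPLE_LOG = """\
TASK [example-role : prepare directories] ********************
ok: [worker1]
ok: [worker7]
TASK [example-role : install package] ************************
ok: [worker1]
fatal: [worker7]: FAILED! => {"changed": false, "msg": "example error"}
"""

for host, task in failed_tasks(SAMPLE_LOG):
    print(f"{host}: {task}")
```

Pointed at the file written by tee (for example by reading /tmp/kubespray-run.log), the output gives the node and task to reproduce by hand.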

Production checklist

[ ] Inventory and group_vars managed in Git via PR review

[ ] Kubespray release tag to kube_version mapping documented

[ ] 3 CP nodes + etcd quorum; etcd on SSD/NVMe

[ ] API server VIP (kube-vip or HAProxy+keepalived) failover test passed

[ ] Pod/Service CIDRs registered in corporate IPAM, no overlaps

[ ] Periodic etcd snapshots + offsite copies + quarterly restore rehearsal

[ ] auto_renew_certificates on + certificate expiry monitoring

[ ] kubernetes_audit on + central log shipping

[ ] kube_encrypt_secret_data on

[ ] PSA defaults applied (enforce baseline or higher)

[ ] kube-bench results and exception rationale documented

[ ] Upgrade runbook exists (no version skipping, serial, PDB audit)

[ ] Process to validate identical changes on a staging cluster first

[ ] OS patch policy and holds on Kubernetes packages

Closing

Kubespray is not a flashy tool. Instead of inventing a new abstraction, it layers proven Ansible automation on top of the upstream kubeadm standard, offering the most realistic path to "turn a pile of generic Linux servers into production Kubernetes." The price is the operational discipline that every procedural tool demands. For organizations that control the inventory through Git, maintain a version mapping table, validate on staging first, and never skip backups and rehearsals, Kubespray will serve reliably for years.

Conversely, when the number of clusters grows and fleet management becomes the essential problem, that is the time to evaluate a move to Cluster API. Even then, Kubespray will quite likely still be the tool that bootstraps your first management cluster. For teams starting their on-prem Kubernetes journey, I recommend the learning path of doing kubeadm manually end-to-end once, then automating with Kubespray. You need to know what the tool does for you in order to fix things yourself when the tool fails.

References

- Kubespray official documentation: https://kubespray.io/

- Kubespray GitHub repository: https://github.com/kubernetes-sigs/kubespray

- Creating a cluster with kubeadm (Kubernetes official): https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/

- Kubernetes version skew policy: https://kubernetes.io/releases/version-skew-policy/

- etcd official documentation (operations, backup): https://etcd.io/docs/

- Calico official documentation: https://docs.tigera.io/calico/latest/about/

- Cilium official documentation: https://docs.cilium.io/

- MetalLB official documentation: https://metallb.universe.tf/

- kube-vip official documentation: https://kube-vip.io/

- cert-manager official documentation: https://cert-manager.io/docs/

- Pod Security Admission (Kubernetes official): https://kubernetes.io/docs/concepts/security/pod-security-admission/

- Encrypting Confidential Data at Rest (Kubernetes official): https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/

- kube-bench (Aqua Security): https://github.com/aquasecurity/kube-bench

- CIS Kubernetes Benchmark: https://www.cisecurity.org/benchmark/kubernetes

- Cluster API official documentation: https://cluster-api.sigs.k8s.io/

- Metal3 (bare metal provider): https://metal3.io/

- Ansible official documentation: https://docs.ansible.com/