Kubespray Deep Dive — Building Production On-Prem Kubernetes with Ansible

Introduction

Even in an era where managed Kubernetes offerings (EKS, GKE, AKS) have become the de facto standard, plenty of organizations still need to build Kubernetes directly on on-premises bare metal. If anything, demand has been growing again — driven by the GPU farm boom and the broader cloud repatriation trend.

The problem is that "installing Kubernetes yourself" is a much bigger job than it sounds. kubeadm only solves control plane bootstrapping; everything around it — OS preparation, container runtime installation, etcd topology, CNI deployment, load balancers, certificate renewal, upgrade orchestration — remains the operator's responsibility. The tool that fills this gap is Kubespray.

In this article we will dig deep, code-first, into what Kubespray actually is and how it works internally, how to design inventories and variables for a production cluster, and how day-2 operations (adding nodes, upgrades, backups, hardening) map onto playbooks. At the end, we will position Kubespray against Cluster API and offer decision criteria for when to use which. The reference versions are the Kubespray v2.28 line and Kubernetes v1.32 line, as of the first half of 2026.

Why On-Prem Kubernetes Still Matters

Despite cloud-first strategies being the norm, on-prem and bare-metal Kubernetes is not going away, for clear reasons.

First, data sovereignty and regulation. Network separation requirements in Korean financial regulations, national security guidelines for public institutions, and GDPR data residency requirements in Europe all directly govern where workloads physically run. For banks running workloads that integrate with the core banking ledger inside internal networks, or public-sector environments where external and internal networks are physically separated, public cloud is often simply not an option.

Second, GPU farms. As more organizations build their own LLM training and inference infrastructure, it has become common to tie tens or hundreds of GPU servers together with Kubernetes. There is a clear break-even point where owning expensive GPUs and maximizing utilization beats renting them by the hour in the cloud, in total cost of ownership terms.

Third, cost. For workloads with predictable traffic and resource usage above a certain scale, a private data center or colocation becomes cheaper even after accounting for depreciation. Egress traffic charges and block storage costs are the classic triggers that push teams toward repatriation.

Fourth, latency and special hardware. Edge computing on factory lines, trading systems that require colocation with exchanges, and telco NFV environments using SR-IOV or DPDK all presuppose direct control over physical infrastructure.

In these environments, "how do we install and maintain Kubernetes" immediately becomes the next question. Let us first compare the candidate answers.

The Tooling Landscape — What Do You Install With?

On-prem Kubernetes build tools fall into roughly five camps.

| Tool | Approach | Target OS | HA control plane | Air-gap support | Best fit |
|---|---|---|---|---|---|
| Manual kubeadm | Step-by-step CLI bootstrap | Generic Linux | Build it yourself | Build it yourself | Learning, small scale, full custom |
| Kubespray | Ansible playbooks orchestrating kubeadm | Ubuntu, RHEL, Rocky, Debian, etc. | Built in (multi-CP + LB options) | Officially supported (offline mirrors) | Generic bare metal/VMs, heterogeneous OS, fine-grained customization |
| kOps | Cluster lifecycle CLI | Cloud-centric (AWS, GCE) | Built in | Limited | Self-managed cloud; poor fit for on-prem |
| RKE2 / k3s | Opinionated distribution (single binary) | Generic Linux | Built in | Good | Edge, security focus (FIPS), Rancher ecosystem |
| Talos Linux | Immutable OS purpose-built for Kubernetes | Talos-only OS | Built in | Good | Greenfield where you control the OS, maximum security |

One-line decision guides:

  • Manual kubeadm is the best textbook for understanding Kubernetes internals, but repeatable builds across dozens of machines need an automation layer.
  • kOps is strong for self-managed AWS clusters but has effectively no bare-metal story.
  • RKE2 and k3s are the easiest to install with good security defaults, but they are their own distributions rather than the upstream kubeadm path, and they pull you into the Rancher ecosystem.
  • Talos is the most radical and attractive approach — an immutable OS without even SSH — but adoption is hard in organizations that must follow existing OS standards (security agents, asset management, and so on).
  • Kubespray precisely fits the most common enterprise requirement: "we already have a standard Linux build, and we need to install upstream Kubernetes on top of it with fine-grained control." For organizations already using Ansible, the learning curve is gentle.

The protagonist of this article is that last option: Kubespray.

What Kubespray Really Is — Ansible Playbooks Wrapping kubeadm

Kubespray is a CNCF project under kubernetes-sigs. At its core, it is "a collection of Ansible playbooks and roles that build production-grade Kubernetes clusters." There is no magic proprietary engine — it orchestrates proven components with Ansible.

  • Control plane bootstrapping calls kubeadm internally. A Kubespray cluster is a kubeadm cluster, inheriting the kubeadm certificate scheme and upgrade path as-is.
  • Its roles cover the full span: OS preparation (swap off, kernel modules, sysctl), containerd installation, etcd cluster setup, CNI deployment (Calico, Cilium, flannel, and more), and add-ons (CoreDNS, MetalLB, ingress-nginx, etc.).
  • The support matrix is broad: many Linux distributions, multiple CNIs, air-gapped environments, GPU nodes, and various topologies (stacked or external etcd), all selected through variables.

The end-to-end flow in ASCII:

+--------------------------------------------------------------------+
| Ansible control node (operator workstation or bastion)             |
|                                                                    |
|  inventory/prod/                                                   |
|   ├─ hosts.yaml            ← node list and groups (topology)       |
|   └─ group_vars/                                                   |
|       ├─ all/all.yml       ← global vars (LB, proxy, registry)     |
|       └─ k8s_cluster/                                              |
|           ├─ k8s-cluster.yml  ← versions, CIDRs, CNI, proxy mode   |
|           └─ addons.yml       ← ingress, metallb, cert-manager     |
|                                                                    |
|  ansible-playbook cluster.yml                                      |
|        │                                                           |
|        ▼                                                           |
|  [role execution order]                                            |
|   1. kubernetes/preinstall  → OS checks, sysctl, modules, swap     |
|   2. container-engine       → install containerd / crio           |
|   3. download               → fetch binaries and images (mirror)   |
|   4. etcd                   → etcd cluster and certificates        |
|   5. kubernetes/control-plane → kubeadm init / join (CP nodes)     |
|   6. kubernetes/node        → kubelet config, kubeadm join         |
|   7. network_plugin         → apply Calico / Cilium manifests      |
|   8. kubernetes-apps        → CoreDNS, add-ons, MetalLB, etc.      |
+--------------------------------------------------------------------+
        │ SSH (become: root)
+-------------------+  +-------------------+  +-------------------+
|  cp1 (CP+etcd)    |  |  cp2 (CP+etcd)    |  |  cp3 (CP+etcd)    |
+-------------------+  +-------------------+  +-------------------+
+-------------------+  +-------------------+  +-------------------+
|  worker1          |  |  worker2          |  |  workerN ...      |
+-------------------+  +-------------------+  +-------------------+

Two key insights. First, Kubespray is a procedural execution tool, not a declarative controller. It converges state only at the moment you run the playbook; if you do not run it, it does nothing. Second, the inventory plus group_vars are the definition of the cluster shape, so managing this directory in Git is the starting point of operations.

Prerequisites — Nodes, Network, Access

Node requirements

For production, the following is recommended.

  • Control plane: 3 nodes (odd number to preserve quorum), minimum 2 vCPU / 4GB RAM, recommended 4 vCPU / 8GB or more. If etcd is co-located, disks must be local SSD/NVMe. etcd is extremely sensitive to fsync latency; a slow disk translates directly into cluster-wide instability.
  • Workers: size to your workloads, accounting for systemReserved and kubeReserved for kubelet and system daemons.
  • OS: distributions in the Kubespray support matrix such as Ubuntu 22.04/24.04 LTS, RHEL 9, Rocky 9. Standardizing OS and kernel versions across all nodes drastically lowers troubleshooting cost.
  • Common conditions: swap disabled (Kubespray handles this, but check the permanent fstab setting), unique hostname, MAC, and product_uuid per node, time synchronization (chrony), and the br_netfilter and overlay kernel modules.
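
The "odd number" rule for control plane and etcd members follows from majority quorum arithmetic. A minimal sketch of that arithmetic, with illustrative function names that are not part of Kubespray or etcd:

```python
def quorum(members: int) -> int:
    """Votes needed for a majority in a cluster of `members` etcd nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still has quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Three and four members both tolerate exactly one failure, which is why a fourth member adds cost and replication traffic without adding resilience.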

containerd is the default container runtime; leave it unless you have a specific reason. CRI-O is selectable, but containerd has the broadest support.

Network planning — CIDR design

These values are effectively impossible to change after the build, so this is where you must be most careful. Three ranges must never overlap with each other or with your corporate network.

Node network    : 10.10.0.0/24    (physical server IPs — from corporate IPAM)
Pod CIDR        : 10.233.64.0/18  (kube_pods_subnet — default)
Service CIDR    : 10.233.0.0/18   (kube_service_addresses — default)

Per-node pod range: /24 (kube_network_node_prefix)
→ room for the default 110-pod limit per node; a /18 holds 64 such /24 blocks, so at most 64 nodes

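The sizing formula is simple: subtract the pod CIDR prefix length from the per-node prefix length, and 2 raised to that difference is the maximum node count. With a /18 pod range and /24 per node, 24 minus 18 is 6, so the ceiling is 2 to the 6th power, i.e. 64 nodes. If you are planning for 300 nodes you need at least nine bits of difference (2 to the 9th power is 512): for example a /15 pod range with /24 per node, or a /16 with /25 per node (a /25 still leaves 126 usable addresses, above the default 110-pod limit). Widening only to /16 (256 nodes) or shrinking only to /25 (128 nodes) is not enough. Decide this before the build. Also, overlap with the corporate network silently cuts connectivity to whatever internal systems live in that range, so register these CIDRs formally in your IPAM documentation together with the network team.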
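
The node ceiling and the overlap rule are both easy to check mechanically before the build. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the CIDRs are the example values from this section.

```python
import ipaddress


def max_nodes(pod_cidr: str, node_prefix: int) -> int:
    """Number of per-node blocks of size /node_prefix that fit inside pod_cidr."""
    net = ipaddress.ip_network(pod_cidr)
    if node_prefix < net.prefixlen:
        raise ValueError("per-node prefix must not be shorter than the pod CIDR prefix")
    return 2 ** (node_prefix - net.prefixlen)


def overlapping_pairs(named_cidrs: dict) -> list:
    """Return (name, name) pairs whose networks overlap."""
    items = [(name, ipaddress.ip_network(cidr)) for name, cidr in named_cidrs.items()]
    clashes = []
    for i, (name_a, net_a) in enumerate(items):
        for name_b, net_b in items[i + 1:]:
            if net_a.overlaps(net_b):
                clashes.append((name_a, name_b))
    return clashes


if __name__ == "__main__":
    print(max_nodes("10.233.64.0/18", 24))   # 64
    print(max_nodes("10.232.0.0/15", 24))    # 512
    print(overlapping_pairs({
        "nodes": "10.10.0.0/24",
        "pods": "10.233.64.0/18",
        "services": "10.233.0.0/18",
    }))                                      # []
```

Adding the corporate ranges exported from IPAM to the dictionary turns the same check into a pre-build gate.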

Control plane HA — a VIP in front of the API server

Even with three control plane nodes, if clients (kubelet, kubectl, CI) point at a single node IP, that node is a single point of failure. There are three remedies.

  1. kube-vip: runs as a static pod on the control plane nodes and advertises a VIP via ARP (or BGP). No external appliance needed — the simplest option on bare metal. Kubespray supports it directly via the kube_vip_enabled variable.
  2. HAProxy + keepalived: two dedicated LB nodes run HAProxy spreading port 6443 across the three CP backends, with keepalived VRRP failing over the VIP. Traditional and battle-tested; a good fit where a network team owns the LB layer.
  3. External hardware LB: if you already have L4 appliances (F5, etc.), use them.

Kubespray also has a built-in mechanism: the loadbalancer_apiserver_localhost option (on by default when no external loadbalancer_apiserver is defined) runs an nginx-proxy static pod on every worker node, balancing from localhost to the three control plane nodes. Internal cluster components therefore get a degree of HA without an external LB, but kubectl and CI access from outside the cluster still needs the VIP.

SSH and sudo

Ansible connects to every node over SSH and works as root, so prepare the following.

# Generate a key on the control node and distribute it to all six nodes
ssh-keygen -t ed25519 -f ~/.ssh/kubespray_ed25519
for h in 10.10.0.11 10.10.0.12 10.10.0.13 10.10.0.21 10.10.0.22 10.10.0.23; do
  ssh-copy-id -i ~/.ssh/kubespray_ed25519.pub deploy@"$h"
done

# Grant passwordless sudo to the deploy account (on each node)
echo 'deploy ALL=(ALL) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/deploy
sudo chmod 0440 /etc/sudoers.d/deploy
sudo visudo -cf /etc/sudoers.d/deploy   # syntax check before closing the session

Air-gapped environment preparation — overview

In a network-separated environment, three classes of artifacts normally fetched from the internet must be moved to internal mirrors.

  1. Container images → internal registry (Harbor, etc.)
  2. Binary files (kubeadm, kubelet, etcd, CNI plugins, crictl, etc.) → internal HTTP server
  3. OS packages (containerd dependencies, etc.) → internal yum/apt repository

Kubespray ships scripts under contrib/offline that extract the required image and file lists. Generate the lists in an internet-connected staging zone, populate your mirrors, and in the air-gapped inventory override the download URL variables to point at internal mirrors. The concrete variables are covered in the customization section below.

Hands-On — The Full Cluster Build Flow

Step 1: Prepare Kubespray and create the inventory

# Always check out a release tag — never build production from master
git clone --branch v2.28.0 https://github.com/kubernetes-sigs/kubespray.git
cd kubespray

# Use a dedicated virtualenv to avoid Ansible version conflicts
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt

# Copy the sample inventory to create the production inventory
cp -rfp inventory/sample inventory/prod

Pinning to a release tag is not just a nicety. Each Kubespray release fixes a matrix of supported Kubernetes minor versions and component versions; combinations outside that matrix are untested territory.

Step 2: hosts.yaml — defining the topology

A standard layout with three control plane nodes (etcd co-located) and three workers:

# inventory/prod/hosts.yaml
all:
  hosts:
    cp1:
      ansible_host: 10.10.0.11
      ip: 10.10.0.11          # internal IP for kubelet/etcd to bind
      access_ip: 10.10.0.11   # IP other nodes use to reach this node
    cp2:
      ansible_host: 10.10.0.12
      ip: 10.10.0.12
      access_ip: 10.10.0.12
    cp3:
      ansible_host: 10.10.0.13
      ip: 10.10.0.13
      access_ip: 10.10.0.13
    worker1:
      ansible_host: 10.10.0.21
      ip: 10.10.0.21
    worker2:
      ansible_host: 10.10.0.22
      ip: 10.10.0.22
    worker3:
      ansible_host: 10.10.0.23
      ip: 10.10.0.23
  children:
    kube_control_plane:
      hosts:
        cp1:
        cp2:
        cp3:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        cp1:
        cp2:
        cp3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Topology design points:

  • You can separate the etcd group onto three dedicated nodes (stacked vs. external etcd). External etcd is more stable for large clusters (hundreds of nodes) or heavy API server load, but the managed node count grows to six. For clusters under 50 nodes, stacked is usually plenty.
  • Make a habit of setting ip and access_ip explicitly. On multi-NIC servers, Ansible auto-detecting the wrong interface IP — and etcd binding to the wrong network — is a classic failure.
  • The group names (kube_control_plane, kube_node, etcd, k8s_cluster) are reserved words referenced by Kubespray roles; do not rename them.
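
The design points above lend themselves to an automated pre-flight check. Below is a minimal sketch of such a validator, operating on the inventory already parsed into a dictionary; the function names and the specific rules are illustrative assumptions, not something Kubespray ships.

```python
REQUIRED_GROUPS = ("kube_control_plane", "kube_node", "etcd", "k8s_cluster")


def group_hosts(inventory: dict, group: str) -> set:
    """Host names listed directly under all.children.<group>.hosts."""
    children = inventory.get("all", {}).get("children") or {}
    hosts = (children.get(group) or {}).get("hosts") or {}
    return set(hosts)


def validate_inventory(inventory: dict) -> list:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    all_section = inventory.get("all", {})
    defined = all_section.get("hosts") or {}
    children = all_section.get("children") or {}

    # Reserved group names must exist exactly as Kubespray expects them.
    for group in REQUIRED_GROUPS:
        if group not in children:
            problems.append(f"missing group: {group}")

    # An even etcd member count adds no fault tolerance.
    etcd = group_hosts(inventory, "etcd")
    if len(etcd) % 2 == 0:
        problems.append(f"etcd group has {len(etcd)} members; use an odd number")

    # Every grouped host must be defined under all.hosts.
    for group in ("kube_control_plane", "kube_node", "etcd"):
        for host in sorted(group_hosts(inventory, group) - set(defined)):
            problems.append(f"{group} references undefined host: {host}")

    # Explicit ip avoids wrong-interface auto-detection on multi-NIC servers.
    for name, hostvars in sorted(defined.items()):
        if not (hostvars or {}).get("ip"):
            problems.append(f"{name}: 'ip' is not set explicitly")
    return problems
```

With PyYAML available (it is installed alongside Ansible), the dictionary can come from `yaml.safe_load(open("inventory/prod/hosts.yaml"))`, and a non-empty result can fail the CI job.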

Step 3: Dissecting group_vars — k8s-cluster.yml

This file determines the identity of the cluster. The essential variables:

# inventory/prod/group_vars/k8s_cluster/k8s-cluster.yml

# Kubernetes version — only within the range supported by this Kubespray release
kube_version: "1.32.5"

# Network plugin: calico, cilium, flannel, kube-ovn, etc.
kube_network_plugin: calico

# CIDRs — unchangeable after the build; must not overlap with corporate ranges
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_network_node_prefix: 24

# kube-proxy mode: ipvs recommended (beats iptables at thousands of services)
kube_proxy_mode: ipvs

# Container runtime
container_manager: containerd

# DNS
dns_mode: coredns
enable_nodelocaldns: true        # node-local DNS cache — near-mandatory at scale

# Cluster name (internal domain)
cluster_name: cluster.local

# Copy the admin kubeconfig to the control node after the build
kubeconfig_localhost: true

# Automatic certificate renewal (systemd timer attempts renewal monthly)
auto_renew_certificates: true

Decision criteria per variable:

  • kube_network_plugin: the default, Calico, is a proven choice with BGP-based routing and network policy. Choose Cilium if you need the eBPF data plane, Hubble observability, or L7 policy. Crucially, this choice is effectively permanent — swapping the CNI on a running cluster is not a scenario Kubespray supports.
  • kube_proxy_mode: ipvs avoids the linear scan cost of iptables mode when service counts grow large. Note that with Cilium in kube-proxy replacement mode you can even remove kube-proxy entirely.
  • enable_nodelocaldns: a node-local cache absorbs pod DNS queries, reducing CoreDNS load and conntrack contention. Any operator who has suffered DNS timeouts will want this on by default.

The API server endpoint (VIP) is defined in the all-group variables.

# inventory/prod/group_vars/all/all.yml

# When using an external LB (HAProxy+keepalived or a hardware LB)
apiserver_loadbalancer_domain_name: "k8s-api.prod.internal"
loadbalancer_apiserver:
  address: 10.10.0.100   # VIP
  port: 6443

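# Per-node nginx-proxy local LB. It defaults to true only when loadbalancer_apiserver
# is undefined; verify how your Kubespray release resolves the endpoint when both are set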
loadbalancer_apiserver_localhost: true

With kube-vip, no separate LB nodes are needed:

# Control plane VIP via kube-vip (ARP mode)
kube_vip_enabled: true
kube_vip_controlplane_enabled: true
kube_vip_arp_enabled: true
kube_proxy_strict_arp: true      # needed for ARP mode when kube-proxy runs in ipvs
kube_vip_address: 10.10.0.100
loadbalancer_apiserver:
  address: 10.10.0.100
  port: 6443

Step 4: Dissecting group_vars — addons.yml

Select the base add-ons that go on top of the cluster.

# inventory/prod/group_vars/k8s_cluster/addons.yml

helm_enabled: true
metrics_server_enabled: true

# Ingress controller
ingress_nginx_enabled: true
ingress_nginx_host_network: false

# LoadBalancer service implementation for bare metal — MetalLB
metallb_enabled: true
metallb_speaker_enabled: true
metallb_config:
  address_pools:
    primary:
      ip_range:
        - 10.10.0.150-10.10.0.180
      auto_assign: true
  layer2:
    - primary

# Certificate automation
cert_manager_enabled: true

# If simple local PVs are sufficient
local_path_provisioner_enabled: true

On bare metal there is no cloud provider to satisfy LoadBalancer-type services, so MetalLB takes that role. L2 mode is simple to configure but funnels traffic through a single node; BGP mode requires peering with your ToR switches but achieves true distribution. If you can negotiate with the network team, BGP mode is worth evaluating. Also note a widely used strategy: let Kubespray install add-ons only for the initial bootstrap, then manage their lifecycle directly with Helm/GitOps afterwards — Kubespray add-on variables are convenient but give you little freedom in chart version selection.
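
Because an L2 pool shares the node subnet, it is worth checking that the range stays inside that subnet and does not collide with node IPs or the API VIP. A minimal sketch under those assumptions follows; the helper names are hypothetical, and only the "start-end" range form is handled.

```python
import ipaddress


def pool_addresses(ip_range: str) -> list:
    """Expand a 'start-end' range into individual addresses."""
    start_text, end_text = ip_range.split("-")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    if end < start:
        raise ValueError("range end is below range start")
    return [ipaddress.ip_address(i) for i in range(int(start), int(end) + 1)]


def pool_conflicts(ip_range: str, node_subnet: str, reserved: list) -> list:
    """List addresses that fall outside the subnet or collide with reserved IPs."""
    subnet = ipaddress.ip_network(node_subnet)
    taken = {ipaddress.ip_address(ip) for ip in reserved}
    problems = []
    for addr in pool_addresses(ip_range):
        if addr not in subnet:
            problems.append(f"{addr} is outside {subnet}")
        if addr in taken:
            problems.append(f"{addr} is already in use")
    return problems


if __name__ == "__main__":
    pool = "10.10.0.150-10.10.0.180"
    print(len(pool_addresses(pool)))  # 31
    print(pool_conflicts(pool, "10.10.0.0/24", ["10.10.0.11", "10.10.0.100"]))  # []
```

The pool size is also the upper bound on how many LoadBalancer services the cluster can expose at once, which makes it a number worth agreeing on with the network team.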

Step 5: Execution — cluster.yml

# Pre-flight connectivity check
ansible -i inventory/prod/hosts.yaml all -m ping \
  --private-key ~/.ssh/kubespray_ed25519 -u deploy -b

# The main run — roughly 20 to 40 minutes for 6 nodes
ansible-playbook -i inventory/prod/hosts.yaml \
  --private-key ~/.ssh/kubespray_ed25519 -u deploy \
  --become --become-user=root \
  cluster.yml

The stages scrolling by match the role order in the architecture diagram above. Three segments deserve attention: failures in the download role point to network/proxy/mirror problems; a stall in the etcd role points to inter-node 2379/2380 connectivity or the ip variable; failure at the health check after kubeadm init in the control-plane role usually means VIP or certificate SAN misconfiguration.

Step 6: Verification

# With kubeconfig_localhost: true, admin.conf is copied under the inventory
export KUBECONFIG=$PWD/inventory/prod/artifacts/admin.conf

kubectl get nodes -o wide
# NAME      STATUS   ROLES           VERSION   INTERNAL-IP
# cp1       Ready    control-plane   v1.32.5   10.10.0.11
# cp2       Ready    control-plane   v1.32.5   10.10.0.12
# cp3       Ready    control-plane   v1.32.5   10.10.0.13
# worker1   Ready    <none>          v1.32.5   10.10.0.21
# ...

# Control plane pods and CNI state
kubectl get pods -n kube-system

# etcd members and health (on cp1)
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.0.11:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-cp1.pem \
  --key=/etc/ssl/etcd/ssl/node-cp1-key.pem \
  endpoint health --cluster

# Smoke test: deploy → expose → DNS → delete
kubectl create deployment nginx --image=nginx --replicas=3
kubectl expose deployment nginx --port=80
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never \
  -- nslookup nginx.default.svc.cluster.local
kubectl delete deployment/nginx service/nginx

Beyond this, it is safest to consider production acceptance criteria met only after a VIP failover test (force-stop one CP node and verify kubectl keeps working) and a drain test of one worker.

Day-2 Operations as Playbooks

The real value of Kubespray is that day-2 operations are all codified as playbooks.

Adding nodes — scale.yml

# 1) Add worker4 to hosts.yaml, then
# 2) run scale targeting only the new node
ansible-playbook -i inventory/prod/hosts.yaml \
  --become --limit=worker4 \
  scale.yml

scale.yml skips the steps in cluster.yml that reconfigure existing nodes and performs only the preparation and kubeadm join of the new node, making it fast and safe. One caveat: even with the limit option, Ansible still needs facts from other hosts such as the etcd group. If fact gathering fails, the playbook breaks — so make refreshing facts for all nodes a habit first.

# Refresh facts for all nodes before any limit run
ansible-playbook -i inventory/prod/hosts.yaml --become playbooks/facts.yml

Adding a control plane node requires cluster.yml, not scale.yml, and do not forget to refresh certificate SANs and the LB backend list afterwards.

Removing nodes — remove-node.yml

# Drain → remove from cluster → clean the node, all in one
ansible-playbook -i inventory/prod/hosts.yaml \
  --become \
  -e node=worker3 \
  remove-node.yml

remove-node.yml cordons and drains the target node, runs kubectl delete node, and cleans up kubelet and the runtime on the node side. If the node is already dead and unreachable, pass reset_nodes=false to skip node-side cleanup and remove only the cluster metadata. After removal, delete the node from hosts.yaml as well to keep the inventory matching reality. Skipping this synchronization is how inventory drift begins.
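
Inventory drift is straightforward to detect by comparing the host names in hosts.yaml with the node names the cluster reports. A minimal sketch, assuming node names were collected with `kubectl get nodes -o name`; both function names are illustrative.

```python
def parse_node_names(kubectl_output: str) -> list:
    """Turn `kubectl get nodes -o name` output (one node/<name> per line) into names."""
    names = []
    for line in kubectl_output.splitlines():
        line = line.strip()
        if line:
            names.append(line.split("/", 1)[-1])
    return names


def inventory_drift(inventory_hosts, cluster_nodes) -> dict:
    """Report hosts present on only one side of the inventory/cluster comparison."""
    inv, live = set(inventory_hosts), set(cluster_nodes)
    return {
        "only_in_inventory": sorted(inv - live),
        "only_in_cluster": sorted(live - inv),
    }


if __name__ == "__main__":
    live = parse_node_names("node/cp1\nnode/worker1\nnode/worker4\n")
    print(inventory_drift(["cp1", "worker1", "worker3"], live))
    # {'only_in_inventory': ['worker3'], 'only_in_cluster': ['worker4']}
```

Run on a schedule, a non-empty result on either side is a signal to reconcile hosts.yaml before the next playbook run.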

Upgrades — upgrade-cluster.yml

The operation that demands the most care. Principles first:

  1. Never skip minor versions. You cannot jump from 1.30 straight to 1.32. Per the kubeadm and Kubernetes version skew policy, you must step through 1.30 → 1.31 → 1.32, switching the Kubespray release tag at each step to match.
  2. Record and maintain the pairing of Kubespray version and cluster version. Your inventory repository should state plainly: "this cluster runs 1.31.4, deployed with tag v2.27.1."
  3. An etcd backup before any upgrade is non-negotiable.

# Example: 1.31.x → 1.32.x
cd kubespray && git checkout v2.28.0
pip install -r requirements.txt   # the required Ansible version changes too

ansible-playbook -i inventory/prod/hosts.yaml \
  --become \
  -e kube_version=1.32.5 \
  upgrade-cluster.yml
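
The no-skip rule from the principles above can be turned into a small planning helper that lists every minor version to pass through. This is a sketch with an illustrative function name; which Kubespray tag pairs with each step still has to come from the release notes.

```python
def minor_upgrade_path(current: str, target: str) -> list:
    """List the minor versions to step through, one at a time, from current to target."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    print(minor_upgrade_path("1.30.4", "1.32.5"))  # ['1.31', '1.32']
```

Each entry in the result is one full upgrade-cluster.yml run, preceded by its own etcd backup.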

How upgrade-cluster.yml behaves is the key to zero downtime.

  • Control plane and etcd are upgraded first, one node at a time.
  • Workers proceed drain → upgrade kubelet/runtime → uncordon, with the number of nodes processed concurrently controlled by the serial variable (default 20%). Set serial=1 to be conservative.
  • Drain behavior is tuned with drain_grace_period, drain_timeout, and drain_retries. An overly tight PodDisruptionBudget makes the drain wait forever, so auditing PDBs before an upgrade is mandatory.

# One node at a time, with generous drain timeouts
ansible-playbook -i inventory/prod/hosts.yaml --become \
  -e kube_version=1.32.5 \
  -e serial=1 \
  -e drain_timeout=600s \
  -e drain_grace_period=120 \
  upgrade-cluster.yml
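
To see what a given serial value means for a concrete worker count, the batching can be modeled in a few lines. This sketch assumes that a percentage is truncated to a whole number of hosts and never drops below one, which is how Ansible is understood to behave; treat that rounding rule as an assumption and confirm it against the Ansible documentation for your version.

```python
def batch_size(serial, host_count: int) -> int:
    """Translate a serial value (int or 'N%') into hosts per batch, minimum one."""
    if isinstance(serial, str) and serial.endswith("%"):
        pct = int(serial[:-1])
        return int(pct / 100.0 * host_count) or 1
    size = int(serial)
    if size < 1:
        raise ValueError("this sketch only models serial values of 1 or more")
    return size


def batches(hosts: list, serial) -> list:
    """Split hosts into consecutive batches of the computed size."""
    size = batch_size(serial, len(hosts))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


if __name__ == "__main__":
    workers = [f"worker{i}" for i in range(1, 13)]
    for group in batches(workers, "20%"):
        print(group)  # six batches of two workers each
```

The batch size is also the number of nodes drained at the same moment, so it should be compared against the spare capacity the remaining workers can absorb.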

Application-side requirements for zero downtime must be covered too: at least two replicas, sensible PDBs, preStop hooks and graceful shutdown, and identifying drain blockers like single-replica StatefulSets in advance.
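
Part of the PDB audit can be automated by looking for budgets that currently allow no disruptions at all. A minimal sketch, assuming the input is the JSON printed by `kubectl get pdb -A -o json`; the function name is illustrative.

```python
import json


def blocking_pdbs(pdb_list_json: str) -> list:
    """Return namespace/name of PDBs whose status allows zero disruptions right now."""
    blocked = []
    for item in json.loads(pdb_list_json).get("items", []):
        allowed = item.get("status", {}).get("disruptionsAllowed", 0)
        if allowed == 0:
            meta = item["metadata"]
            blocked.append(f"{meta['namespace']}/{meta['name']}")
    return sorted(blocked)


if __name__ == "__main__":
    sample = json.dumps({"items": [
        {"metadata": {"namespace": "shop", "name": "cart"}, "status": {"disruptionsAllowed": 0}},
        {"metadata": {"namespace": "shop", "name": "web"}, "status": {"disruptionsAllowed": 1}},
    ]})
    print(blocking_pdbs(sample))  # ['shop/cart']
```

Every entry in the result is a workload whose pods would stall a drain until the timeout, so it needs either more replicas or a relaxed budget before the upgrade window.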

reset.yml — the last resort

# Tears the cluster off the nodes completely — including etcd data on disk
ansible-playbook -i inventory/prod/hosts.yaml --become reset.yml

reset.yml removes Kubernetes, etcd, CNI configuration, and container data from the nodes. Accidents really do happen where someone meaning to "reset just a few nodes" runs it without --limit against a production inventory. There is a confirmation prompt, but wiring the playbook into CI with auto-approval is absolutely forbidden.

etcd backup and restore

Kubespray does not back up etcd for you; build the discipline yourself.

# Backup (on an etcd node — run periodically via cron/systemd timer)
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snap-20260613.db \
  --endpoints=https://10.10.0.11:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-cp1.pem \
  --key=/etc/ssl/etcd/ssl/node-cp1-key.pem

# Verify integrity
sudo ETCDCTL_API=3 etcdctl --write-out=table snapshot status /backup/etcd-snap-20260613.db

Always ship snapshot files to storage outside the cluster, and run a quarterly restore rehearsal (snapshot → etcdctl snapshot restore → start etcd on a fresh data directory) to prove the backups are actually restorable. A surprising number of organizations have backups but have never once rehearsed a restore.

Customization — Things You Will Inevitably Meet in Production

Certificate management

Control plane certificates in a kubeadm cluster are valid for one year by default. In Kubespray, two variables manage this.

# A systemd timer (k8s-certs-renew.timer) renews near-expiry certs monthly
auto_renew_certificates: true

# Force regeneration in a playbook run when needed
# -e force_certificate_regeneration=true

With auto_renew_certificates on, you prevent the classic "one year later the whole cluster cannot authenticate" incident. Note that the CA itself (10-year default) and any certificates installed on external LBs are separate concerns. Pair this with expiry monitoring (an x509 exporter, for example).
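
Where a full exporter is not yet in place, a small script can turn the output of `openssl x509 -noout -enddate -in <cert>` into a day count for alerting. A minimal sketch with an illustrative function name; it assumes the usual `notAfter=Mon DD HH:MM:SS YYYY GMT` output format and an English locale.

```python
from datetime import datetime


def days_until_expiry(enddate_line: str, now: datetime) -> int:
    """Parse an `openssl x509 -noout -enddate` line and return whole days left.

    `now` must be a naive datetime expressed in UTC.
    """
    value = enddate_line.strip().split("=", 1)[1]
    not_after = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    return (not_after - now).days


if __name__ == "__main__":
    line = "notAfter=Jun 13 12:00:00 2027 GMT"
    print(days_until_expiry(line, datetime(2027, 6, 3, 12, 0, 0)))  # 10
```

Alerting at a generous threshold, for example 30 days, leaves room to investigate if the renewal timer has silently stopped working.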

Private registry and air-gap details

The heart of an air-gapped build is the variable set that redirects every download path to internal mirrors.

# inventory/prod/group_vars/all/offline.yml
registry_host: "harbor.internal:443/k8s-mirror"
files_repo: "https://mirror.internal/kubespray-files"

# Point all container image repositories at the internal registry
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
github_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"

# Point binary download URLs at the internal file server
kubeadm_download_url: "{{ files_repo }}/kubeadm/{{ kube_version }}/kubeadm"
kubelet_download_url: "{{ files_repo }}/kubelet/{{ kube_version }}/kubelet"
kubectl_download_url: "{{ files_repo }}/kubectl/{{ kube_version }}/kubectl"
etcd_download_url: "{{ files_repo }}/etcd/etcd-{{ etcd_version }}-linux-{{ host_architecture }}.tar.gz"
cni_download_url: "{{ files_repo }}/cni/cni-plugins-linux-{{ host_architecture }}-{{ cni_version }}.tgz"
crictl_download_url: "{{ files_repo }}/crictl/crictl-{{ crictl_version }}-linux-{{ host_architecture }}.tar.gz"
runc_download_url: "{{ files_repo }}/runc/{{ runc_version }}/runc.{{ host_architecture }}"
containerd_download_url: "{{ files_repo }}/containerd/containerd-{{ containerd_version }}-linux-{{ host_architecture }}.tar.gz"

# Registry endpoint settings for containerd. With skip_verify: false, the
# registry's private CA must already be trusted on every node.
containerd_registries_mirrors:
  - prefix: "harbor.internal:443"
    mirrors:
      - host: "https://harbor.internal:443"
        capabilities: ["pull", "resolve"]
        skip_verify: false

In the staging zone, use generate_list.sh under contrib/offline to extract the full file and image list required by the release, and manage-offline-container-images.sh to pull and push images to the internal registry in bulk. One operational tip: make the mirror-population job itself a CI pipeline. Updating the lists by hand at every upgrade — and missing something — is the single most common cause of failed air-gapped upgrades.
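
When building the mirror pipeline, a recurring detail is mapping each upstream image reference to its location in the internal registry. The sketch below assumes the mirror keeps the upstream repository path and only swaps the registry host; that layout and the function name are assumptions, so check them against how your push script actually names images.

```python
def mirror_reference(image: str, registry_host: str) -> str:
    """Rewrite an upstream image reference so that it lives under registry_host."""
    first, sep, rest = image.partition("/")
    # The leading component is a registry host only if it looks like one.
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_host else image
    return f"{registry_host.rstrip('/')}/{path}"


if __name__ == "__main__":
    target = "harbor.internal:443/k8s-mirror"
    for ref in ("registry.k8s.io/kube-apiserver:v1.32.5", "quay.io/calico/node:v3.0.0"):
        print(ref, "->", mirror_reference(ref, target))
```

Feeding the generated image list through a function like this gives the pipeline an explicit source-to-destination table that can be diffed between releases.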

Injecting sysctl and kernel tuning

You can manage node OS tuning in a separate Ansible role, but injecting it via Kubespray variables keeps the configuration in one place.

# inventory/prod/group_vars/k8s_cluster/k8s-cluster.yml
additional_sysctl:
  - name: net.core.somaxconn
    value: 65535
  - name: net.ipv4.tcp_max_syn_backlog
    value: 65535
  - name: fs.inotify.max_user_instances
    value: 8192
  - name: fs.inotify.max_user_watches
    value: 1048576

# Make kubelet reserve node resources
kube_reserved: true
kube_memory_reserved: 512Mi
kube_cpu_reserved: 200m
system_reserved: true
system_memory_reserved: 1Gi
system_cpu_reserved: 500m

The inotify limits are a bottleneck you will inevitably hit with log collectors and dense pod packing; raise them early.
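
The effect of the reservation variables on schedulable capacity can be estimated with the usual allocatable formula: capacity minus kube-reserved, minus system-reserved, minus the hard eviction threshold. A minimal sketch follows; the 100Mi eviction default and the helper names are assumptions to adjust to your kubelet configuration.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '512Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def allocatable_memory(capacity: str, kube_reserved: str, system_reserved: str,
                       eviction_hard: str = "100Mi") -> int:
    """Bytes left for pods after reservations and the eviction threshold."""
    return (to_bytes(capacity) - to_bytes(kube_reserved)
            - to_bytes(system_reserved) - to_bytes(eviction_hard))


if __name__ == "__main__":
    left = allocatable_memory("64Gi", "512Mi", "1Gi")
    print(left // 2**20, "Mi")  # 63900 Mi
```

On small nodes the same reservations take a much larger share, which is a useful sanity check before applying one set of values to every node class.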

Extra manifests and GitOps integration

Kubespray has no good generic hook for injecting arbitrary manifests. The practical pattern is clear: limit the responsibility of Kubespray to "from node OS up to the CNI," and manage everything above it (monitoring, logging, detailed ingress configuration, applications) with a GitOps tool such as Argo CD or Flux. Have the final step of the build pipeline kubectl-apply a single Argo CD bootstrap manifest, and from then on Git is the source of truth for everything inside the cluster.

Production Hardening

CIS Benchmark and kube-bench

CIS Kubernetes Benchmark conformance is a recurring requirement in financial and public-sector security reviews. Kubespray defaults do not fully satisfy CIS, so the recommended cycle is: run kube-bench after the build, identify the gaps, and correct them via variables. Frequently adjusted items:

# API server hardening
kube_apiserver_request_timeout: 120s
kube_apiserver_enable_admission_plugins:
  - NodeRestriction
  - AlwaysPullImages
  - EventRateLimit
kube_apiserver_admission_event_rate_limits:
  limit_1:
    type: Namespace
    qps: 50
    burst: 100
    cache_size: 2000
kube_profiling: false

# kubelet hardening
kubelet_protect_kernel_defaults: true
kubelet_event_record_qps: 1
kubelet_streaming_connection_idle_timeout: 5m
kubelet_make_iptables_util_chains: true

# Anonymous auth is on by default in kube-apiserver; turning it off also blocks
# unauthenticated health probes (external LB checks), so test before enforcing
kube_api_anonymous_auth: false

Rather than trying to silence every kube-bench finding, the realistic approach is to weigh the impact of each item on your workloads (AlwaysPullImages increases registry load, for example) and document the rationale for exceptions.

Audit log

kubernetes_audit: true
audit_log_path: /var/log/kubernetes/audit/audit.log
audit_log_maxage: 30        # retention days
audit_log_maxbackups: 10
audit_log_maxsize: 100      # MB
# To use a custom policy instead of the default,
# define rules in audit_policy_custom_rules

Audit logs are effectively the only primary evidence in a security incident investigation, so the full package includes a pipeline shipping them to a central log system rather than leaving them on node-local disk.

Encryption at rest for Secrets

Encrypt Secrets that would otherwise sit in etcd as plaintext.

kube_encrypt_secret_data: true
# The default provider is secretbox — switchable to aescbc and others
kube_encryption_algorithm: "secretbox"
kube_encryption_resources: [secrets]

This is the minimum defense ensuring that a leaked etcd snapshot does not expose Secret plaintext. The limitation remains that the key itself lives on control plane node disks; for stricter requirements, evaluate external KMS integration.

Pod Security Admission defaults

After the removal of PodSecurityPolicy, PSA is the standard. Kubespray variables can set cluster-wide defaults.

kube_pod_security_use_default: true
kube_pod_security_default_enforce: baseline   # applies to namespaces without their own PSA labels
kube_pod_security_default_audit: restricted
kube_pod_security_default_warn: restricted
kube_pod_security_exemptions_namespaces:
  - kube-system

Raising enforce straight to restricted can mass-reject existing workloads, so the safe sequence is to surface violations via audit/warn first, then ratchet enforce up in stages.
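The staged ratchet can be sketched per namespace using the standard PSA namespace labels (pod-security.kubernetes.io/enforce, audit, and warn), which override the cluster-wide defaults. This is an illustrative sketch: the ratchet rule and the violation count are assumptions, and in practice the count would come from audit annotations or warnings collected at the next level up.

```python
LEVELS = ["privileged", "baseline", "restricted"]
LABEL_PREFIX = "pod-security.kubernetes.io/"


def psa_labels(enforce, audit="restricted", warn="restricted"):
    """Build the namespace labels for one PSA stage."""
    for level in (enforce, audit, warn):
        if level not in LEVELS:
            raise ValueError(f"unknown Pod Security level: {level}")
    return {
        LABEL_PREFIX + "enforce": enforce,
        LABEL_PREFIX + "audit": audit,
        LABEL_PREFIX + "warn": warn,
    }


def ratchet(current_enforce, open_violations):
    """Raise enforce by one level only when the next level shows no violations."""
    index = LEVELS.index(current_enforce)
    if index == len(LEVELS) - 1 or open_violations > 0:
        return current_enforce
    return LEVELS[index + 1]


if __name__ == "__main__":
    # Stage 1: enforce baseline while audit/warn surface restricted violations.
    print(psa_labels("baseline"))
    # Still 3 workloads violating restricted: stay at baseline.
    print(ratchet("baseline", open_violations=3))
    # Violations cleaned up: move enforce to restricted.
    print(psa_labels(ratchet("baseline", open_violations=0)))
```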

CI/CD Integration and Operating at Scale

Inventory in Git, execution in pipelines

A recommended repository structure:

k8s-clusters/                  ← internal Git repository
├─ README.md                   ← cluster-to-Kubespray version table
├─ clusters/
│   ├─ prod-seoul/
│   │   ├─ hosts.yaml
│   │   └─ group_vars/ ...
│   └─ stage-seoul/
│       ├─ hosts.yaml
│       └─ group_vars/ ...
└─ pipelines/
    └─ run-kubespray.yaml      ← CI definition

Pull the Kubespray codebase itself via a submodule or an in-pipeline git clone (pinned to a tag), and route all inventory changes through PR review. A skeleton of the execution pipeline:

# GitLab CI example — conceptual skeleton
stages: [lint, diff, apply]

lint:
  stage: lint
  script:
    - ansible-lint clusters/prod-seoul   # let failures block; set allow_failure on the job if lint is only advisory
    - python3 scripts/validate_inventory.py clusters/prod-seoul

apply-prod:
  stage: apply
  when: manual            # production must require manual approval
  script:
    - git clone --branch v2.28.0 --depth 1
        https://github.com/kubernetes-sigs/kubespray.git
    - pip install -r kubespray/requirements.txt
    - ansible-playbook -i clusters/prod-seoul/hosts.yaml
        --become kubespray/cluster.yml
  environment: prod-seoul
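The pipeline calls scripts/validate_inventory.py without showing it. The following is a hypothetical sketch of what such a check could contain, not the actual script. It assumes hosts.yaml has already been parsed into a dict (for example with a YAML loader) and that the groups kube_control_plane, etcd, and kube_node sit under all.children, as in the usual Kubespray inventory layout. The sample inventory is made up.

```python
"""Hypothetical inventory sanity checks for a Kubespray hosts.yaml."""

REQUIRED_GROUPS = ("kube_control_plane", "etcd", "kube_node")


def group_hosts(inventory, group):
    """Return the sorted host names listed under all.children.<group>.hosts."""
    children = (inventory.get("all") or {}).get("children") or {}
    hosts = (children.get(group) or {}).get("hosts") or {}
    return sorted(hosts)


def validate(inventory):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for group in REQUIRED_GROUPS:
        if not group_hosts(inventory, group):
            errors.append(f"group '{group}' has no hosts")
    etcd = group_hosts(inventory, "etcd")
    if etcd and len(etcd) % 2 == 0:
        errors.append(
            f"etcd has an even member count ({len(etcd)}); use an odd number for quorum"
        )
    return errors


# Made-up inventory, shaped like a parsed hosts.yaml (empty host entries parse to None).
SAMPLE_INVENTORY = {
    "all": {
        "children": {
            "kube_control_plane": {"hosts": {"cp1": None, "cp2": None, "cp3": None}},
            "etcd": {"hosts": {"cp1": None, "cp2": None, "cp3": None}},
            "kube_node": {"hosts": {"worker1": None, "worker2": None}},
        }
    }
}

if __name__ == "__main__":
    problems = validate(SAMPLE_INVENTORY)
    print("inventory OK" if not problems else "\n".join(problems))
```

Checks like these are cheap to run on every PR and catch the structural mistakes that would otherwise surface only partway through a playbook run.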

Idempotency — its uses and its limits

Thanks to Ansible idempotency, rerunning cluster.yml converges only what changed, and resuming from a failure point is often as simple as rerunning the same command. But know the limits precisely.

  • Removing an item from variables does not retract the existing configuration on nodes. Disabling an add-on, for example, does not necessarily delete already-deployed resources. This is the essential limitation of a procedural tool.
  • Check mode (dry run) is unreliable with Kubespray, because many tasks depend on the real results of earlier tasks. Do not expect a Terraform-style "review the diff and approve" workflow; instead, apply the change to a staging cluster first.
  • Playbook runs can collide with concurrent changes (manual kubectl operations and the like), so align execution windows with change freeze windows as an operational discipline.

Tips for large fleets — parallelism and partial runs

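# ansible.cfg
# Note: ansible.cfg accepts "#" comments only at the start of a line;
# an inline "#" after a value becomes part of the value.
[defaults]
# default is 5; raise it for dozens of nodes
forks = 50
strategy = linear
[ssh_connection]
# fewer SSH round trips; very noticeable
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=30m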

For a sense of timing: a fresh 6-node build takes 20 to 40 minutes, a 50-node build over an hour, and upgrades scale almost linearly with node count due to drain time. Two tools for partial runs:

# Specific nodes only — always refresh facts first
ansible-playbook -i inventory/prod/hosts.yaml --become playbooks/facts.yml
ansible-playbook -i inventory/prod/hosts.yaml --become \
  --limit=worker7 cluster.yml

# Specific component tags only — e.g. roll out a CoreDNS config change
ansible-playbook -i inventory/prod/hosts.yaml --become \
  --tags=coredns cluster.yml

# Skip the download stage to shorten repeat runs
ansible-playbook -i inventory/prod/hosts.yaml --become \
  --skip-tags=download cluster.yml

Tags and limit are powerful, but they can skip tasks that later tasks depend on, so make it a rule to run the same tag and limit combination on staging first.

Kubespray vs. Cluster API — and Coexistence

Cluster API (CAPI) is a Kubernetes SIG Cluster Lifecycle project that declares Kubernetes clusters themselves as Kubernetes resources (Cluster, MachineDeployment, and so on), with controllers in a management cluster continuously converging actual state toward the declared state. Its philosophy is the polar opposite of Kubespray's.

| Aspect | Kubespray | Cluster API |
| --- | --- | --- |
| Paradigm | Procedural: converges only when run | Declarative: controllers converge continuously |
| Required infra | Just an Ansible control node | Management cluster + infrastructure provider |
| Bare metal | Anywhere SSH works | Needs Metal3 (IPMI/Redfish) or the BYOH provider |
| Node recovery | Manual (rerun playbooks) | Automatic recreation via MachineHealthCheck |
| Many clusters | Repeat runs per inventory | Strong at fleet management |
| OS customization | Very flexible (applies onto existing OS) | Requires an image build pipeline |
| Learning curve | Gentle for Ansible users | Requires the CRD and controller model |

Decision criteria:

  • If you have a single-digit number of clusters, manually provisioned servers, and in-house Ansible skills, Kubespray is simple and sufficient.
  • If you must stamp out dozens of clusters and can automate bare-metal lifecycle via IPMI/Redfish, CAPI plus Metal3 wins long-term.
  • There is a realistic hybrid too. The CAPI management cluster itself must be bootstrapped somewhere — the chicken-and-egg problem — and a common pattern is to build that first cluster with Kubespray and manage the rest of the fleet with CAPI. Gradual migration, keeping existing Kubespray clusters while building new ones with CAPI, is also common.

Common Pitfalls and Troubleshooting

The pitfall list

  1. Inventory drift: if someone changes node configuration by hand, or hosts.yaml is not updated after remove-node, the next playbook run behaves unexpectedly. The cure is discipline: inventory changes only via PR, manual node changes forbidden.
  2. Conflicts with OS patching: unattended-upgrades bumping containerd on its own, or a kernel upgrade plus reboot dropping kernel module settings, are frequent incidents. Hold Kubernetes-related packages and perform OS patching as a controlled procedure with drains.
  3. CNI cannot be changed: changing kube_network_plugin and rerunning cluster.yml will wreck the cluster. If a CNI swap is truly needed, the orthodox path is a new cluster build plus workload migration.
  4. Kubespray and Ansible version mismatch: each release requires a specific Ansible version range. Make it routine to create a fresh virtualenv per release and reinstall requirements.txt.
  5. Running cluster.yml against an existing cluster from a different version tag: unintended component upgrades happen. This is why the per-cluster version table matters.
  6. NTP drift: skewed clocks across nodes destabilize certificate validation and etcd. Include chrony configuration in the preinstall checklist.
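Pitfall 5 lends itself to a mechanical guard: before running cluster.yml, compare the Kubespray tag the pipeline is about to use with the tag recorded for that cluster in the version table. The sketch below is hypothetical. The table contents, the allow_upgrade switch, and the function name are assumptions for illustration, and the table would normally be parsed from the repository README or a dedicated file.

```python
# Hypothetical per-cluster version table (cluster name -> Kubespray tag last applied).
VERSION_TABLE = {
    "prod-seoul": "v2.28.0",
    "stage-seoul": "v2.28.0",
}


def check_kubespray_tag(cluster, requested_tag, table, allow_upgrade=False):
    """Return 'ok' or 'upgrade'; raise if the tag differs without explicit intent."""
    if cluster not in table:
        raise KeyError(f"cluster '{cluster}' is missing from the version table")
    recorded = table[cluster]
    if requested_tag == recorded:
        return "ok"
    if allow_upgrade:
        # An intentional upgrade: the caller is expected to update the table afterwards.
        return "upgrade"
    raise ValueError(
        f"{cluster} was built with {recorded} but the pipeline uses {requested_tag}; "
        "rerunning cluster.yml could change component versions unintentionally"
    )


if __name__ == "__main__":
    print(check_kubespray_tag("prod-seoul", "v2.28.0", VERSION_TABLE))
    try:
        check_kubespray_tag("prod-seoul", "v2.27.0", VERSION_TABLE)
    except ValueError as err:
        print("blocked:", err)
```

Running a guard like this as the first step of the apply job turns the version table from documentation into an enforced precondition.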

Resuming after failure, and logs

# Rerun with verbose logs — idempotency lets completed tasks pass unchanged
ansible-playbook -i inventory/prod/hosts.yaml --become cluster.yml -vvv \
  2>&1 | tee /tmp/kubespray-run.log

# First-look points on the node side
journalctl -u kubelet -f          # why kubelet fails to start
journalctl -u containerd -f      # runtime/registry problems
crictl ps -a                      # control plane static pod states
ls /etc/kubernetes/manifests/     # kubeadm static pod manifests

Empirically, eight out of ten failures are "an environmental difference on one specific node." Check the failed task name and node, reproduce the same operation manually on that node, and the cause surfaces quickly. After fixing it, rerun the whole playbook from the start to confirm convergence.
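Finding "the failed task name and node" in a long -vvv log is easy to script. The sketch below scans a saved run log, such as the /tmp/kubespray-run.log written above, for Ansible's standard "TASK [...]" headers and "fatal: [host]: FAILED!" or "UNREACHABLE!" lines, and reports which host failed in which task. The sample log is made up, and other failure formats (for example per-item loop failures) are deliberately not handled.

```python
import re
from collections import Counter

TASK_RE = re.compile(r"^TASK \[(?P<task>.*)\] \*+\s*$")
FATAL_RE = re.compile(r"^fatal: \[(?P<host>[^\]]+)\]: (?P<kind>FAILED|UNREACHABLE)!")


def find_failures(log_text):
    """Return a list of (host, task, kind) for every fatal line in the log."""
    failures = []
    current_task = None
    for line in log_text.splitlines():
        task = TASK_RE.match(line)
        if task:
            current_task = task.group("task")
            continue
        fatal = FATAL_RE.match(line)
        if fatal:
            # Delegated tasks print "[host -> delegate]"; keep the inventory host.
            host = fatal.group("host").split(" -> ")[0]
            failures.append((host, current_task, fatal.group("kind")))
    return failures


# Made-up log excerpt for illustration.
SAMPLE_LOG = """\
TASK [kubernetes/preinstall : Example check] ***********************************
ok: [cp1]
ok: [worker1]
fatal: [worker7]: FAILED! => {"changed": false, "msg": "example failure"}

TASK [container-engine/containerd : Example step] ******************************
changed: [cp1]
changed: [worker1]
"""

if __name__ == "__main__":
    found = find_failures(SAMPLE_LOG)
    for host, task, kind in found:
        print(f"{host}: {kind} in task '{task}'")
    # If one host dominates the count, suspect an environmental difference on it.
    print(Counter(host for host, _, _ in found).most_common())
```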

Production checklist

[ ] Inventory and group_vars managed in Git via PR review
[ ] Kubespray release tag to kube_version mapping documented
[ ] 3 CP nodes + etcd quorum; etcd on SSD/NVMe
[ ] API server VIP (kube-vip or HAProxy+keepalived) failover test passed
[ ] Pod/Service CIDRs registered in corporate IPAM, no overlaps
[ ] Periodic etcd snapshots + offsite copies + quarterly restore rehearsal
[ ] auto_renew_certificates on + certificate expiry monitoring
[ ] kubernetes_audit on + central log shipping
[ ] kube_encrypt_secret_data on
[ ] PSA defaults applied (enforce baseline or higher)
[ ] kube-bench results and exception rationale documented
[ ] Upgrade runbook exists (no version skipping, serial, PDB audit)
[ ] Process to validate identical changes on a staging cluster first
[ ] OS patch policy and holds on Kubernetes packages

Closing

Kubespray is not a flashy tool. Instead of inventing a new abstraction, it layers proven Ansible automation on top of the upstream kubeadm standard, offering the most realistic path to "turn a pile of generic Linux servers into production Kubernetes." The price is the operational discipline that is the fate of any procedural tool. For organizations that control the inventory through Git, maintain a version mapping table, validate on staging first, and never skip backups and rehearsals, Kubespray will serve reliably for years.

Conversely, when the number of clusters grows and fleet management becomes the essential problem, that is the time to evaluate a move to Cluster API. Even then, Kubespray will quite likely still be the tool that bootstraps your first management cluster. For teams starting their on-prem Kubernetes journey, I recommend the learning path of doing kubeadm manually end-to-end once, then automating with Kubespray. You need to know what the tool does for you in order to fix things yourself when the tool fails.

References