Kubernetes Internals Complete Guide — etcd, API Server, Controller, Scheduler, kubelet, CRI/CNI/CSI Deep Dive (2025)

Introduction — Kubernetes Is Not Just a Tool

kubectl apply -f deployment.yaml. One command, and a container is deployed somewhere in the cluster, restarted if it dies, load-balanced behind a Service, and upgraded via rolling updates. That is the promise of Kubernetes.

Behind that promise: etcd's distributed consensus, the API Server's watch protocol, the Scheduler's bin-packing, the Controller's reconciliation loop, kubelet's Pod lifecycle, and the CRI/CNI/CSI plugin interfaces. All collaborating to produce the abstraction called "declarative infrastructure."

This article covers everything from Borg's history to 2025's sidecar-less service mesh and eBPF networking. It is the final part of the Linux Infrastructure series.


1. What Kubernetes Is

Kubernetes is a distributed system that automatically converges a cluster to a user-declared desired state.

Key words: declarative, convergence, distributed.

1.1 Declarative

The user declares "what should be," not "how." Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25

1.2 Convergence

When desired state differs from actual state, controllers do work to close the gap — the reconciliation loop. Delete a Pod, a controller recreates it. Change an image, rolling update.

1.3 Distributed

The control plane is distributed (etcd replicates state via Raft), and workloads are spread across the data plane. One node dying does not stop the cluster.


2. History — Borg to Kubernetes 1.30

  • Borg (2003+): Google's internal cluster manager. Concepts: Job, Task, Cell, Borgmaster, Borglet. Handles millions of tasks per minute at 90%+ utilization.
  • Omega (2013): Next-gen Borg with shared-state model and multiple schedulers. Architectural inspiration for Kubernetes.
  • Kubernetes (2014): Led by Joe Beda, Brendan Burns, Craig McLuckie. Go, open source, REST resources. 1.0 released 2015.
  • CNCF (2015): Kubernetes as the flagship project.
  • 1.5: StatefulSet and CRI introduced. 1.8: RBAC stable. 1.24: Docker shim removed. 1.30: Pod scheduling readiness and AppArmor stable.

Managed offerings: EKS, GKE, AKS, DOKS, LKE. Distributions: OpenShift, Rancher, Tanzu, k3s, kubeadm.

★ Insight ─────────────────────────────────────

  • Borg-Kubernetes asymmetry: Google runs both; Borg has not been migrated to Kubernetes because it fits Google's internal workloads better. Open-sourcing a system does not mean sharing its code.
  • Kubernetes simplified Omega: single source of truth (etcd) + everyone watches, instead of shared-state complexity.
  • alpha -> beta -> stable takes years. Always check a feature's stability level before production use.
─────────────────────────────────────────────────

3. Architecture — Control Plane and Data Plane

3.1 Control Plane

  • etcd: distributed KV, cluster state.
  • API Server (kube-apiserver): REST interface, the only path to etcd.
  • Controller Manager: runs controllers; the heart of reconciliation.
  • Scheduler: assigns Pods to nodes.
  • Cloud Controller Manager: integrates with cloud providers.

3.2 Data Plane

  • kubelet: per-node agent; Pod lifecycle.
  • kube-proxy: per-node network proxy; Service implementation.
  • Container runtime: containerd, CRI-O.

3.3 Example Flow — Pod Creation

  1. kubectl POSTs YAML to API Server.
  2. API Server authenticates, runs admission, persists to etcd.
  3. Scheduler watches, picks a node, updates Pod.
  4. kubelet on that node sees the Pod and calls CRI.
  5. CRI starts the container; CNI wires the network.
  6. kubelet reports status back to API Server.

Key principle: no direct component-to-component traffic — everything goes through API Server. Loose coupling, single source of truth, single audit log. Downside: API Server and etcd can be bottlenecks.


4. etcd — The Foundation

4.1 What It Is

Distributed KV store. Created by CoreOS (name = /etc + d). Now a CNCF project.

Features: distributed (Raft), linearizable, watch, transactions, TTL leases.

4.2 Raft

One leader handles writes and replicates log entries to followers; a majority (⌊n/2⌋ + 1 nodes) must acknowledge before an entry commits. A 3-node cluster therefore tolerates 1 failure, a 5-node cluster tolerates 2.

4.3 In Kubernetes

Every object (Pod, Service, Secret) stored under keys like:

/registry/pods/default/my-pod
/registry/services/default/my-service

Encoded as protobuf (migrated from JSON).

4.4 Watch

Clients subscribe to a key prefix and receive a stream of changes. This is the foundation of the API Server's own watch mechanism, which the Scheduler, controllers, and kubelet all rely on.
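
A minimal sketch of that subscription using the etcd Go client (go.etcd.io/etcd/client/v3); the endpoint and prefix here are illustrative:

package main

import (
    "context"
    "fmt"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Connect to a (hypothetical) local etcd endpoint.
    cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // Stream every change under the Pods prefix -- conceptually what
    // the API Server does to serve its own watch API.
    for resp := range cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix()) {
        for _, ev := range resp.Events {
            fmt.Printf("%s %s\n", ev.Type, ev.Kv.Key) // PUT or DELETE
        }
    }
}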

4.5 Limits

  • Object size cap (~1.5MB).
  • Around 10K writes/sec ceiling.
  • All data mmap'd in memory.
  • Rotating many Secrets at once generates heavy write and watch load.

4.6 Operations

Take regular etcdctl snapshot save backups, run etcdctl defrag, and monitor leader-change frequency and disk fsync latency. A mismanaged etcd breaks the entire cluster.


5. API Server — The Hub

5.1 Responsibilities

REST API, authentication, authorization, admission control, watch distribution, etcd access.

5.2 REST Model

GET    /api/v1/namespaces/default/pods
POST   /api/v1/namespaces/default/pods
PUT    /api/v1/namespaces/default/pods/my-pod
DELETE /api/v1/namespaces/default/pods/my-pod

API groups: core/v1, apps/v1, batch/v1, networking.k8s.io/v1, rbac.authorization.k8s.io/v1.

5.3 Authentication

X.509 certs, ServiceAccount tokens, OIDC bearer tokens, webhooks. Multiple mechanisms can coexist.

5.4 Authorization — RBAC

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
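
A Role grants nothing until it is bound to a subject. A matching RoleBinding (the ServiceAccount name is illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: read-pods
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

ClusterRole and ClusterRoleBinding follow the same shape cluster-wide.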

5.5 Admission Control

Mutating admission (fills in defaults, injects sidecars) and validating admission (rejects invalid objects). Built-ins include LimitRanger, ResourceQuota, PodSecurity, ServiceAccount. Dynamic webhooks power cert-manager, the Istio injector, etc.

5.6 Watch and Informer Pattern

GET /api/v1/pods?watch=true
< event: ADDED, pod: ...
< event: MODIFIED, pod: ...

Standard client pattern (list + watch, called an informer):

informer := factory.Core().V1().Pods().Informer()
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    func(obj interface{}) { /* ... */ },
    UpdateFunc: func(oldObj, newObj interface{}) { /* ... */ },
    DeleteFunc: func(obj interface{}) { /* ... */ },
})
informer.Run(stopCh)

5.7 API Extension

CRD (most common) or API aggregation (metrics-server style).
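
A minimal CRD sketch (the group and kind are illustrative). Once applied, the API Server serves /apis/example.com/v1/... endpoints for it, and kubectl get crontabs works like any built-in resource:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: CronTab
    plural: crontabs
    singular: crontab
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string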


6. Scheduler — Matching Pods to Nodes

6.1 Flow

Watch unscheduled Pods -> filter nodes -> score candidates -> pick highest -> update spec.nodeName.

6.2 Filter Plugins

NodeResourcesFit, NodeAffinity, InterPodAffinity, TaintToleration, VolumeBinding, PodTopologySpread.

6.3 Score Plugins

NodeResourcesBalancedAllocation, NodeResourcesFit (with the LeastAllocated scoring strategy), InterPodAffinity.

6.4 Plugin Interface

type FilterPlugin interface {
    Filter(ctx context.Context, state *CycleState, pod *Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
    Score(ctx context.Context, state *CycleState, pod *Pod, nodeName string) (int64, *Status)
}

6.5 Bin Packing

Scheduling is online bin packing solved greedily, so placement is locally good but not globally optimal. The Descheduler can later evict Pods to improve overall fit.

6.6 Custom Scheduler

Run multiple schedulers; select per Pod with spec.schedulerName.
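
For example (the scheduler name is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  schedulerName: my-custom-scheduler
  containers:
  - name: app
    image: nginx:1.25

Pods that omit spec.schedulerName keep using the default scheduler.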


7. Controller Manager — Convergence Engine

7.1 The Pattern

for {
    desired := getDesiredState() // what was declared (e.g. replicas: 3)
    actual := getActualState()   // what actually exists (e.g. 2 running Pods)
    if !reflect.DeepEqual(desired, actual) {
        reconcile(desired, actual) // create, update, or delete to close the gap
    }
}

7.2 Built-in Controllers

Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, Node, Service, Endpoint, Namespace, PersistentVolume.

7.3 Properties

  • Idempotent: running reconcile once or many times converges to the same result.
  • Self-healing: retries on failure.
  • Observable: all changes via API Server.
  • Tradeoff: eventual consistency, not instant.

7.4 Operators

Custom controllers on top of CRDs. The backbone of Kubernetes's extensibility (PostgreSQL, Kafka, Cassandra operators).
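
A minimal reconcile sketch in the controller-runtime style (sigs.k8s.io/controller-runtime); the MyApp type and its API package are hypothetical:

package controllers

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    examplev1 "example.com/myapp/api/v1" // hypothetical CRD package
)

// MyAppReconciler reconciles a hypothetical MyApp custom resource.
type MyAppReconciler struct {
    client.Client
}

func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var app examplev1.MyApp
    if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
        // The object was deleted between the event and this call.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // Compare desired state (app.Spec) with what exists and close the gap,
    // e.g. ensure a Deployment with app.Spec.Replicas is present.
    return ctrl.Result{RequeueAfter: time.Minute}, nil // converge again later
}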

★ Insight ─────────────────────────────────────

  • Reconciliation's roots are in Borg: imperative, RPC-driven systems crumble under partial failures; declarative state plus continuous convergence automates recovery.
  • Operators make Kubernetes a general distributed-systems platform, not just a container orchestrator.
  • Eventual consistency is intentional: strong consistency would make the system slower and more complex.
─────────────────────────────────────────────────

8. kubelet — The Node Agent

8.1 Responsibilities

Watch Pods assigned to this node, start/stop containers via CRI, configure network via CNI, mount volumes via CSI, run probes, report status.

8.2 Pod Lifecycle

Pod phases: Pending -> Running -> Succeeded or Failed, with Unknown when the node stops reporting. ContainerCreating is a container-level status surfaced while the Pod is still Pending.

8.3 Probes

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Liveness (container restarted on failure), Readiness (Pod removed from Service endpoints on failure), Startup (runs first and gates the other two until the app has started).
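
Readiness probes use the same fields. For slow-starting apps, a startup probe with a generous failure budget keeps liveness from killing the container prematurely (values illustrative):

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10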

8.4 Resource Limits

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

Requests used by scheduler. Limits enforced by kubelet via cgroups.

8.5 Eviction

When a node is under resource pressure, kubelet evicts Pods in QoS order: BestEffort first (no requests or limits set), Burstable next (requests set below limits), Guaranteed last (requests equal to limits).

8.6 Static Pods

Pods defined by YAML in /etc/kubernetes/manifests/, managed directly by kubelet, used to bootstrap control-plane components.


9. kube-proxy — Service Implementation

9.1 What a Service Is

Stable endpoint for a group of Pods.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 8080

9.2 Modes

  • iptables (default): DNAT rules per Service/Endpoint. Simple but scales poorly.
  • IPVS: kernel L4 load balancer. Faster at scale, supports rr/least-conn.
  • eBPF (Cilium): replaces kube-proxy entirely with BPF maps.

Sample iptables rules:

-A KUBE-SERVICES -d 10.96.123.45/32 -p tcp --dport 80 -j KUBE-SVC-XXX
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.5 -j KUBE-SEP-AAA
-A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 10.0.1.10:8080

9.3 Service Types

ClusterIP, NodePort, LoadBalancer, ExternalName.

9.4 EndpointSlice

Replaces old Endpoints: splits large Services into small slices for watch efficiency.
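
The shape of a slice (addresses illustrative); note the label tying it back to its Service:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-abc1
  labels:
    kubernetes.io/service-name: my-service
addressType: IPv4
endpoints:
- addresses: ["10.0.1.10"]
  conditions:
    ready: true
ports:
- port: 8080
  protocol: TCP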


10. CRI, CNI, CSI — Plugin Interfaces

10.1 CRI

gRPC between kubelet and container runtime.

service RuntimeService {
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
}

Implementations: containerd, CRI-O. (Docker shim removed in 1.24.)

10.2 CNI

ADD <network> <container>
DEL <network> <container>
CHECK <network> <container>

CNI plugin binaries live under /opt/cni/bin/; their network configuration lives under /etc/cni/net.d/. Plugins: bridge, calico, flannel, cilium, weave.

10.3 CSI

gRPC storage plugin interface with Identity, Controller (CreateVolume, attach), and Node (mount) services. Drivers: AWS EBS, GCE PD, Azure Disk, Ceph, Longhorn.
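
The shape mirrors CRI. A fragment of the CSI spec's gRPC services:

service Controller {
    rpc CreateVolume(CreateVolumeRequest) returns (CreateVolumeResponse);
    rpc ControllerPublishVolume(ControllerPublishVolumeRequest) returns (ControllerPublishVolumeResponse);
}

service Node {
    rpc NodePublishVolume(NodePublishVolumeRequest) returns (NodePublishVolumeResponse);
}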

10.4 Elegance

Standard API + pluggable implementations. No vendor lock-in; Kubernetes itself is the standard.


11. Core Objects

  • Pod: smallest deploy unit; containers share namespaces.
  • ReplicaSet: N identical Pods from a template.
  • Deployment: ReplicaSet plus rolling update and rollback.
  • StatefulSet: ordered Pods with stable hostnames and PVs.
  • DaemonSet: one Pod per node.
  • Job / CronJob: batch and scheduled batch.
  • ConfigMap / Secret: configuration and sensitive data (base64, optional at-rest encryption).

Deployment rolling-update example (selector and Pod template elided for brevity):

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

12. Networking

12.1 Kubernetes Network Model

Every Pod gets a unique IP, all Pods can reach all Pods and all nodes without NAT, and the IP a Pod sees for itself is the same IP others use to reach it.

12.2 Service Discovery

CoreDNS resolves my-service.my-namespace.svc.cluster.local to a ClusterIP. Headless Services (clusterIP: None) return Pod IPs directly — useful for StatefulSet clients.
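
A headless Service is just clusterIP: None (names and port illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-db
spec:
  clusterIP: None
  selector:
    app: db
  ports:
  - port: 5432

DNS then returns one A record per ready Pod instead of a single virtual IP.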

12.3 Ingress / Gateway API

Ingress is the L7 entry. Gateway API is its more expressive successor (stable 2023).

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  parentRefs:
  - name: my-gateway
  hostnames: ["api.example.com"]
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: api-service
      port: 80

12.4 NetworkPolicy

Pod traffic is default-allow; once any NetworkPolicy selects a Pod, everything not explicitly allowed is denied. Enforcement is the CNI plugin's job; Calico and Cilium support it well.
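
A common starting point is an explicit default-deny for ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress

The empty podSelector selects every Pod in the namespace; with no ingress rules listed, all inbound traffic is denied until other policies allow it.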


13. Storage

  • PV: cluster-level storage resource.
  • PVC: user request.
  • StorageClass: dynamic provisioner.

Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  storageClassName: gp3
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

StatefulSet uses volumeClaimTemplates so each Pod gets its own PV.
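
For example (size and class illustrative):

volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: gp3
    resources:
      requests:
        storage: 10Gi

Each replica gets a claim named data-<statefulset>-<ordinal> that survives Pod rescheduling.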


14. Security

14.1 RBAC

Role/ClusterRole plus RoleBinding/ClusterRoleBinding.

14.2 ServiceAccount

Identity for workloads. Token mounted at /var/run/secrets/kubernetes.io/serviceaccount/token.

14.3 Pod Security Standards

Levels: privileged, baseline, restricted. Enforced via namespace labels:

metadata:
  labels:
    pod-security.kubernetes.io/enforce: restricted

PSP removed in 1.25; PSS is the standard.

14.4 Secrets

Kubernetes Secret is base64, not encryption. Enable etcd encryption-at-rest, or use Vault / AWS Secrets Manager / Sealed Secrets.
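
Enabling at-rest encryption is an API Server configuration (key material illustrative):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}

The file is passed via --encryption-provider-config; keeping identity as the last provider leaves older, unencrypted records readable during migration.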


15. Limits and Pain Points

Complexity (a steep learning curve), etcd pressure in large clusters, tangled networking, control-plane cost, and weak multi-tenancy (namespace isolation is not a hard security boundary).


16. Alternatives

  • Nomad: simpler, lighter, handles non-containers.
  • Docker Swarm: nearly deprecated.
  • Mesos: historical.
  • ECS: AWS lock-in.
  • Cloud Run / Lambda: serverless.
  • k3s: edge.

Choose Kubernetes when you need a standard, scale, and multi-cloud.


17. Future

Sidecar-less service mesh (Cilium leading), Gateway API replacing Ingress, WebAssembly workloads (krustlet, wasmCloud), confidential computing (SGX, SEV), KubeVirt for VMs, sched_ext enabling per-workload scheduling.


18. Conclusion

Kubernetes is not a simple tool — it is a decade of lessons from Borg distilled into the de-facto standard for distributed infrastructure. Every major cloud exposes it, every new infra project assumes it, every cloud-native architecture inherits its model. Deep understanding is now required literacy for infrastructure engineers.

This article closes the Linux Infrastructure series: boot, scheduler, io_uring, memory, eBPF, containers, and now orchestration. Together they paint the full picture of how code lives on a modern cloud-native stack.


Appendix — FAQ

Do I need Kubernetes for small apps? Usually no. Docker Compose, systemd, or a PaaS is simpler.

Self-hosted or managed? Managed (EKS/GKE/AKS) for small teams. Self-hosted only when a dedicated infra team exists.

Is running etcd really risky? Yes — backup, monitoring, defrag, upgrades all require care. Managed Kubernetes automates this.

Run databases on Kubernetes? Possible via well-built operators, but managed RDS is often more attractive.

Tools beyond kubectl? helm, kustomize, k9s, stern, kubectx/kubens, lens.