Kubernetes Internals Complete Guide — etcd, API Server, Controller, Scheduler, kubelet, CRI/CNI/CSI Deep Dive (2025)
Introduction — Kubernetes Is Not Just a Tool
kubectl apply -f deployment.yaml. One command, and a container is deployed somewhere in the cluster, restarted if it dies, load-balanced behind a Service, and replaced through rolling updates. That is Kubernetes's promise.
Behind that promise: etcd's distributed consensus, the API Server's watch protocol, the Scheduler's bin-packing, the Controller's reconciliation loop, kubelet's Pod lifecycle, and the CRI/CNI/CSI plugin interfaces. All collaborating to produce the abstraction called "declarative infrastructure."
This article covers everything from Borg's history to 2025's sidecar-less service mesh and eBPF networking. It is the final part of the Linux Infrastructure series.
1. What Kubernetes Is
Kubernetes is a distributed system that automatically converges a cluster to a user-declared desired state.
Key words: declarative, convergence, distributed.
1.1 Declarative
The user declares "what should be," not "how." Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
1.2 Convergence
When desired state differs from actual state, controllers do work to close the gap — the reconciliation loop. Delete a Pod, a controller recreates it. Change an image, rolling update.
1.3 Distributed
The control plane is distributed (etcd via Raft). So is the data plane. One node dying does not stop the cluster.
2. History — Borg to Kubernetes 1.30
- Borg (2003+): Google's internal cluster manager. Concepts: Job, Task, Cell, Borgmaster, Borglet. Handles millions of tasks per minute at 90%+ utilization.
- Omega (2013): Next-gen Borg with shared-state model and multiple schedulers. Architectural inspiration for Kubernetes.
- Kubernetes (2014): Led by Joe Beda, Brendan Burns, Craig McLuckie. Go, open source, REST resources. 1.0 released 2015.
- CNCF (2015): Kubernetes as the flagship project.
- 1.5: StatefulSet, CRI introduced. 1.7: RBAC stable. 1.24: Docker shim removed. 1.30: Pod scheduling readiness, AppArmor stable.
Managed offerings: EKS, GKE, AKS, DOKS, LKE. Distributions: OpenShift, Rancher, Tanzu, k3s, kubeadm.
★ Insight ─────────────────────────────────────
- Borg-Kubernetes asymmetry: Google runs both; Borg is not migrated to Kubernetes because it fits Google's workloads better. Open-sourcing does not mean sharing the same code.
- Kubernetes simplified Omega: single source of truth (etcd) + everyone watches, instead of shared-state complexity.
- alpha -> beta -> stable takes years. Always check stability level before production use.
─────────────────────────────────────────────────
3. Architecture — Control Plane and Data Plane
3.1 Control Plane
- etcd: distributed KV, cluster state.
- API Server (kube-apiserver): REST interface, the only path to etcd.
- Controller Manager: runs controllers; the heart of reconciliation.
- Scheduler: assigns Pods to nodes.
- Cloud Controller Manager: integrates with cloud providers.
3.2 Data Plane
- kubelet: per-node agent; Pod lifecycle.
- kube-proxy: per-node network proxy; Service implementation.
- Container runtime: containerd, CRI-O.
3.3 Example Flow — Pod Creation
- kubectl POSTs YAML to API Server.
- API Server authenticates, runs admission, persists to etcd.
- Scheduler watches, picks a node, updates Pod.
- kubelet on that node sees the Pod and calls CRI.
- CRI starts the container; CNI wires the network.
- kubelet reports status back to API Server.
Key principle: no direct component-to-component traffic — everything goes through API Server. Loose coupling, single source of truth, single audit log. Downside: API Server and etcd can be bottlenecks.
4. etcd — The Foundation
4.1 What It Is
Distributed KV store. Created by CoreOS (name = /etc + d). Now a CNCF project.
Features: distributed (Raft), linearizable, watch, transactions, TTL leases.
4.2 Raft
One leader handles writes, replicates log entries to followers; majority ack commits. A 3-node cluster tolerates 1 failure, 5-node tolerates 2.
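The majority rule can be computed directly. A tiny Go sketch (the names quorum and faultTolerance are illustrative) shows why a 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2, and why a 4-node cluster buys nothing over 3:

```go
package main

import "fmt"

// quorum is the number of members that must acknowledge a write
// before a Raft leader may commit it: a strict majority.
func quorum(members int) int { return members/2 + 1 }

// faultTolerance is how many members can fail while the cluster
// can still reach quorum.
func faultTolerance(members int) int { return members - quorum(members) }

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("%d members: quorum %d, tolerates %d failures\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that faultTolerance(4) == faultTolerance(3) == 1: the extra even-numbered member raises the quorum without raising tolerance, which is why etcd clusters are sized 3 or 5.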
4.3 In Kubernetes
Every object (Pod, Service, Secret) stored under keys like:
/registry/pods/default/my-pod
/registry/services/default/my-service
Encoded as protobuf (migrated from JSON).
4.4 Watch
Clients subscribe to a prefix and receive change streams. This is the foundation of API Server's watch that Scheduler, Controllers, and kubelet all use.
4.5 Limits
- Object size cap (~1.5MB).
- Around 10K writes/sec ceiling.
- All data mmap'd in memory.
- Secret rotation is expensive.
4.6 Operations
Regular etcdctl snapshot save, etcdctl defrag, watch leader change frequency, disk fsync latency. Mismanaged etcd breaks the entire cluster.
5. API Server — The Hub
5.1 Responsibilities
REST API, authentication, authorization, admission control, watch distribution, etcd access.
5.2 REST Model
GET /api/v1/namespaces/default/pods
POST /api/v1/namespaces/default/pods
PUT /api/v1/namespaces/default/pods/my-pod
DELETE /api/v1/namespaces/default/pods/my-pod
API groups: core/v1, apps/v1, batch/v1, networking.k8s.io/v1, rbac.authorization.k8s.io/v1.
5.3 Authentication
X.509 certs, ServiceAccount tokens, OIDC bearer tokens, webhooks. Multiple mechanisms can coexist.
5.4 Authorization — RBAC
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
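To see how such a rule is evaluated, here is a toy Go evaluator (simplified types, not the real k8s.io/api/rbac/v1 structures) that checks a request tuple against the pod-reader rules:

```go
package main

import "fmt"

// Rule is a toy model of an RBAC rule: allowed verbs on resources
// in API groups. Real RBAC also supports resourceNames, wildcards, etc.
type Rule struct {
	APIGroups, Resources, Verbs []string
}

// contains matches an exact entry or the "*" wildcard.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s || v == "*" {
			return true
		}
	}
	return false
}

// allowed reports whether any rule permits the (group, resource, verb) request.
func allowed(rules []Rule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	podReader := []Rule{{
		APIGroups: []string{""}, // "" is the core API group
		Resources: []string{"pods"},
		Verbs:     []string{"get", "watch", "list"},
	}}
	fmt.Println(allowed(podReader, "", "pods", "get"))    // true
	fmt.Println(allowed(podReader, "", "pods", "delete")) // false
}
```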
5.5 Admission Control
Mutating (fills defaults) and validating (rejects invalid). Built-ins include LimitRanger, ResourceQuota, PodSecurity, ServiceAccount. Dynamic webhooks power cert-manager, Istio injector, etc.
5.6 Watch and Informer Pattern
GET /api/v1/pods?watch=true
< event: ADDED, pod: ...
< event: MODIFIED, pod: ...
Standard client pattern (list + watch, called an informer):
informer := factory.Core().V1().Pods().Informer()
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    func(obj interface{}) { /* ... */ },
    UpdateFunc: func(oldObj, newObj interface{}) { /* ... */ },
    DeleteFunc: func(obj interface{}) { /* ... */ },
})
informer.Run(stopCh)
5.7 API Extension
CRD (most common) or API aggregation (metrics-server style).
6. Scheduler — Matching Pods to Nodes
6.1 Flow
Watch unscheduled Pods -> filter nodes -> score candidates -> pick highest -> update spec.nodeName.
6.2 Filter Plugins
NodeResourcesFit, NodeAffinity, PodAffinity, TaintToleration, VolumeBinding, PodTopologySpread.
6.3 Score Plugins
NodeResourcesBalancedAllocation, NodeResourcesFit (with its LeastAllocated scoring strategy), InterPodAffinity.
6.4 Plugin Interface
type FilterPlugin interface {
    Filter(ctx context.Context, state *CycleState, pod *Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
    Score(ctx context.Context, state *CycleState, pod *Pod, nodeName string) (int64, *Status)
}
6.5 Bin Packing
A greedy bin-packing problem. Not optimal. Descheduler can evict Pods to improve fit.
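The filter-then-score cycle can be simulated in a few lines of self-contained Go. This is a sketch with a single CPU dimension and made-up types, not the real scheduling framework: filter mimics NodeResourcesFit, score mimics least-allocated scoring:

```go
package main

import "fmt"

// Node tracks allocatable and already-requested CPU in millicores.
type Node struct {
	Name                string
	AllocatableMilliCPU int64
	RequestedMilliCPU   int64
}

// filter keeps only nodes with room for the Pod's request (NodeResourcesFit-like).
func filter(nodes []Node, reqMilliCPU int64) []Node {
	var fit []Node
	for _, n := range nodes {
		if n.AllocatableMilliCPU-n.RequestedMilliCPU >= reqMilliCPU {
			fit = append(fit, n)
		}
	}
	return fit
}

// score gives more free CPU a higher score, 0-100 (least-allocated-like).
func score(n Node, reqMilliCPU int64) int64 {
	free := n.AllocatableMilliCPU - n.RequestedMilliCPU - reqMilliCPU
	return free * 100 / n.AllocatableMilliCPU
}

// schedule picks the highest-scoring feasible node, "" if none fit.
func schedule(nodes []Node, reqMilliCPU int64) string {
	best, bestScore := "", int64(-1)
	for _, n := range filter(nodes, reqMilliCPU) {
		if s := score(n, reqMilliCPU); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}

func main() {
	nodes := []Node{
		{"node-a", 2000, 1800}, // only 200m free
		{"node-b", 2000, 500},  // 1500m free
	}
	fmt.Println(schedule(nodes, 300)) // node-b
}
```

The greedy nature is visible here: each Pod is placed against the current snapshot, with no lookahead for the Pods that come next.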
6.6 Custom Scheduler
Run multiple schedulers; select per Pod with spec.schedulerName.
7. Controller Manager — Convergence Engine
7.1 The Pattern
for {
    desired := getDesiredState()
    actual := getActualState()
    if desired != actual {
        reconcile(desired, actual)
    }
}
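A runnable version of the loop, with a hypothetical in-memory cluster struct standing in for the API Server (real controllers read and write through the API Server and a work queue), converges one replica per step:

```go
package main

import "fmt"

// cluster is a toy stand-in for state stored behind the API Server.
type cluster struct {
	desired int // replicas the user declared
	actual  int // replicas that currently exist
}

// reconcile performs one convergence step and reports what it did.
// Repeating it once the states match changes nothing: it is safe to retry.
func (c *cluster) reconcile() string {
	switch {
	case c.actual < c.desired:
		c.actual++
		return "created pod"
	case c.actual > c.desired:
		c.actual--
		return "deleted pod"
	default:
		return "in sync"
	}
}

func main() {
	c := &cluster{desired: 3, actual: 0}
	for c.actual != c.desired {
		fmt.Println(c.reconcile())
	}
	fmt.Println(c.reconcile()) // in sync
}
```

Deleting a "pod" (decrementing actual) simply makes the next iteration recreate it, which is exactly the self-healing behavior described above.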
7.2 Built-in Controllers
Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, Node, Service, Endpoint, Namespace, PersistentVolume.
7.3 Properties
- Idempotent: running the same reconcile again once states match changes nothing, so retries are always safe.
- Self-healing: retries on failure.
- Observable: all changes via API Server.
- Tradeoff: eventual consistency, not instant.
7.4 Operators
Custom controllers on top of CRDs. The backbone of Kubernetes's extensibility (PostgreSQL, Kafka, Cassandra operators).
★ Insight ─────────────────────────────────────
- Reconciliation originated in Borg: RPC-based imperative systems crumble under partial failures. Declarative state + eternal convergence automates recovery.
- Operators make Kubernetes a general distributed-systems platform, not just a container orchestrator.
- Eventual consistency is intentional: strong consistency would make the system slower and more complex.
─────────────────────────────────────────────────
8. kubelet — The Node Agent
8.1 Responsibilities
Watch Pods assigned to this node, start/stop containers via CRI, configure network via CNI, mount volumes via CSI, run probes, report status.
8.2 Pod Lifecycle
Pod phases: Pending -> Running -> Succeeded/Failed (Unknown when the node stops reporting). ContainerCreating is a container status reason shown while the Pod is still Pending, not a phase of its own.
8.3 Probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Liveness (container restarted on failure), Readiness (Pod removed from Service endpoints on failure), Startup (runs first; liveness and readiness are held off until it succeeds).
8.4 Resource Limits
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
Requests used by scheduler. Limits enforced by kubelet via cgroups.
8.5 Eviction
When a node is under pressure, kubelet evicts in order: BestEffort first, Burstable next, Guaranteed last.
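The QoS class that drives this ordering is derived from requests and limits. A simplified single-container Go sketch of the rules (real kubelet evaluates every container in the Pod; millicores and MiB are plain ints here, 0 meaning unset):

```go
package main

import "fmt"

// qosClass: Guaranteed when both resources have requests equal to limits,
// Burstable when anything is set at all, BestEffort when nothing is set.
func qosClass(reqCPU, limCPU, reqMem, limMem int) string {
	if reqCPU > 0 && reqMem > 0 && reqCPU == limCPU && reqMem == limMem {
		return "Guaranteed"
	}
	if reqCPU > 0 || limCPU > 0 || reqMem > 0 || limMem > 0 {
		return "Burstable"
	}
	return "BestEffort"
}

func main() {
	fmt.Println(qosClass(500, 500, 256, 256)) // Guaranteed
	fmt.Println(qosClass(100, 500, 128, 256)) // Burstable
	fmt.Println(qosClass(0, 0, 0, 0))         // BestEffort
}
```

The resources block in 8.4 (requests below limits) therefore yields a Burstable Pod: evicted before Guaranteed ones, after BestEffort ones.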
8.6 Static Pods
Pods defined by YAML in /etc/kubernetes/manifests/, managed directly by kubelet, used to bootstrap control-plane components.
9. kube-proxy — Service Implementation
9.1 What a Service Is
Stable endpoint for a group of Pods.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 8080
9.2 Modes
- iptables (default): DNAT rules per Service/Endpoint. Simple but scales poorly.
- IPVS: kernel L4 load balancer. Faster at scale, supports rr/least-conn.
- eBPF (Cilium): replaces kube-proxy entirely with BPF maps.
Sample iptables rules:
-A KUBE-SERVICES -d 10.96.123.45/32 -p tcp --dport 80 -j KUBE-SVC-XXX
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.5 -j KUBE-SEP-AAA
-A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 10.0.1.10:8080
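The --probability values follow a fixed pattern: with N endpoints, rule i (0-based) matches 1/(N-i) of the traffic that reaches it, so every endpoint ends up with 1/N of the total, and the last rule is an unconditional jump (shown here as probability 1). A small Go sketch (ruleProbabilities is an illustrative name):

```go
package main

import "fmt"

// ruleProbabilities returns the per-rule match probability kube-proxy
// writes for n endpoints: 1/n, then 1/(n-1) of the remainder, and so on.
func ruleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := 0; i < n; i++ {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

func main() {
	fmt.Println(ruleProbabilities(2)) // [0.5 1]
	fmt.Println(ruleProbabilities(4))
}
```

This explains the 0.5 in the rules above: with two endpoints, half the traffic is peeled off by the first rule and the rest falls through to the second.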
9.3 Service Types
ClusterIP, NodePort, LoadBalancer, ExternalName.
9.4 EndpointSlice
Replaces old Endpoints: splits large Services into small slices for watch efficiency.
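The slicing itself is plain chunking; the default cap is 100 endpoints per slice, so a change to one backend rewrites one small object instead of one giant Endpoints blob. A Go sketch (sliceEndpoints is an illustrative name, not an API):

```go
package main

import "fmt"

// sliceEndpoints splits a Service's endpoint list into chunks of at
// most maxPerSlice, mirroring how the EndpointSlice controller caps
// slice size.
func sliceEndpoints(endpoints []string, maxPerSlice int) [][]string {
	var slices [][]string
	for len(endpoints) > 0 {
		n := maxPerSlice
		if len(endpoints) < n {
			n = len(endpoints)
		}
		slices = append(slices, endpoints[:n])
		endpoints = endpoints[n:]
	}
	return slices
}

func main() {
	eps := make([]string, 250)
	for i := range eps {
		eps[i] = fmt.Sprintf("10.0.0.%d", i)
	}
	fmt.Println(len(sliceEndpoints(eps, 100))) // 3 slices: 100 + 100 + 50
}
```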
10. CRI, CNI, CSI — Plugin Interfaces
10.1 CRI
gRPC between kubelet and container runtime.
service RuntimeService {
  rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
  rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
}
Implementations: containerd, CRI-O. (Docker shim removed in 1.24.)
10.2 CNI
ADD <network> <container>
DEL <network> <container>
CHECK <network> <container>
CNI plugin configuration lives under /etc/cni/net.d/; the plugin binaries themselves are installed under /opt/cni/bin/. Plugins: bridge, calico, flannel, cilium, weave.
10.3 CSI
gRPC storage plugin interface with Identity, Controller (CreateVolume, attach), and Node (mount) services. Drivers: AWS EBS, GCE PD, Azure Disk, Ceph, Longhorn.
10.4 Elegance
Standard API + pluggable implementations. No vendor lock-in; Kubernetes itself is the standard.
11. Core Objects
- Pod: smallest deploy unit; containers share namespaces.
- ReplicaSet: N identical Pods from a template.
- Deployment: ReplicaSet plus rolling update and rollback.
- StatefulSet: ordered Pods with stable hostnames and PVs.
- DaemonSet: one Pod per node.
- Job / CronJob: batch and scheduled batch.
- ConfigMap / Secret: configuration and sensitive data (base64, optional at-rest encryption).
Deployment example:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
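The invariant these two fields enforce is simple arithmetic: the pod count stays within [replicas - maxUnavailable, replicas + maxSurge] for the whole rollout. Sketched in Go (integer values; the real fields also accept percentages):

```go
package main

import "fmt"

// rolloutBounds returns the minimum ready pods and maximum total pods
// a RollingUpdate may hit, given replicas, maxSurge, and maxUnavailable.
func rolloutBounds(replicas, maxSurge, maxUnavailable int) (minReady, maxTotal int) {
	return replicas - maxUnavailable, replicas + maxSurge
}

func main() {
	minReady, maxTotal := rolloutBounds(3, 1, 0)
	fmt.Println(minReady, maxTotal) // 3 4
}
```

So maxSurge: 1, maxUnavailable: 0 means: briefly run a 4th pod, wait for it to become ready, then delete an old one, never dropping below 3 ready replicas (zero-downtime, at the cost of temporary extra capacity).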
12. Networking
12.1 Kubernetes Network Model
Every Pod gets a unique IP; all Pods can reach all Pods and nodes without NAT; and the IP a Pod sees itself as is the same IP others use to reach it.
12.2 Service Discovery
CoreDNS resolves my-service.my-namespace.svc.cluster.local to a ClusterIP. Headless Services (clusterIP: None) return Pod IPs directly — useful for StatefulSet clients.
12.3 Ingress / Gateway API
Ingress is the L7 entry. Gateway API is its more expressive successor (stable 2023).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
spec:
  parentRefs:
  - name: my-gateway
  hostnames: ["api.example.com"]
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: api-service
      port: 80
12.4 NetworkPolicy
Pods are default-allow until a NetworkPolicy selects them; once one does, any traffic the policy does not permit is denied in that direction. Enforcement is the CNI's job; Calico and Cilium support it well.
13. Storage
- PV: cluster-level storage resource.
- PVC: user request.
- StorageClass: dynamic provisioner.
Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  storageClassName: gp3
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
StatefulSet uses volumeClaimTemplates so each Pod gets its own PV.
14. Security
14.1 RBAC
Role/ClusterRole plus RoleBinding/ClusterRoleBinding.
14.2 ServiceAccount
Identity for workloads. Token mounted at /var/run/secrets/kubernetes.io/serviceaccount/token.
14.3 Pod Security Standards
Levels: privileged, baseline, restricted. Enforced via namespace labels:
metadata:
  labels:
    pod-security.kubernetes.io/enforce: restricted
PSP removed in 1.25; PSS is the standard.
14.4 Secrets
Kubernetes Secret is base64, not encryption. Enable etcd encryption-at-rest, or use Vault / AWS Secrets Manager / Sealed Secrets.
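To make the point concrete: decoding a Secret value requires no key at all, only the standard base64 alphabet. A Go sketch (the helper names are illustrative):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeSecret produces what you see in a Secret's data field;
// decodeSecret reverses it. No key is involved anywhere.
func encodeSecret(plain string) string {
	return base64.StdEncoding.EncodeToString([]byte(plain))
}

func decodeSecret(stored string) string {
	b, err := base64.StdEncoding.DecodeString(stored)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	stored := encodeSecret("s3cr3t-password")
	fmt.Println(stored)               // czNjcjN0LXBhc3N3b3Jk
	fmt.Println(decodeSecret(stored)) // s3cr3t-password
}
```

Anyone with read access to the Secret object, an etcd snapshot, or an unencrypted backup can recover the plaintext this way, which is why at-rest encryption or an external secret store matters.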
15. Limits and Pain Points
Complexity (steep learning curve), etcd pressure in large clusters, tangled networking, control-plane cost, weak multi-tenancy (namespace isolation is not strong).
16. Alternatives
- Nomad: simpler, lighter, handles non-containers.
- Docker Swarm: nearly deprecated.
- Mesos: historical.
- ECS: AWS lock-in.
- Cloud Run / Lambda: serverless.
- k3s: edge.
Choose Kubernetes when you need a standard, scale, and multi-cloud.
17. Future
Sidecar-less service mesh (Cilium leading), Gateway API replacing Ingress, WebAssembly workloads (krustlet, wasmCloud), confidential computing (SGX, SEV), KubeVirt for VMs, sched_ext enabling per-workload scheduling.
18. Conclusion
Kubernetes is not a simple tool — it is a decade of lessons from Borg distilled into the de-facto standard for distributed infrastructure. Every major cloud exposes it, every new infra project assumes it, every cloud-native architecture inherits its model. Deep understanding is now required literacy for infrastructure engineers.
This article closes the Linux Infrastructure series: boot, scheduler, io_uring, memory, eBPF, containers, and now orchestration. Together they paint the full picture of how code lives on a modern cloud-native stack.
Appendix — FAQ
Do I need Kubernetes for small apps? Usually no. Docker Compose, systemd, or a PaaS is simpler.
Self-hosted or managed? Managed (EKS/GKE/AKS) for small teams. Self-hosted only when a dedicated infra team exists.
Is running etcd really risky? Yes — backup, monitoring, defrag, upgrades all require care. Managed Kubernetes automates this.
Run databases on Kubernetes? Possible via well-built operators, but managed RDS is often more attractive.
Tools beyond kubectl? helm, kustomize, k9s, stern, kubectx/kubens, lens.