Cilium Datapath Architecture — Inside a Cluster Without kube-proxy
Author: Youngju Kim (@fjvbn20031)
Introduction
In a traditional Kubernetes cluster running kube-proxy in iptables mode, the first packet of a service connection can be matched against thousands of iptables rules before it is DNATed to a pod. In a cluster with more than five thousand services, a single rule update can take seconds, and the kernel pays a near-linear lookup cost for each new connection. Cilium replaces this path with eBPF programs and hash maps, so service lookups stay close to O(1) regardless of how many services exist.
Cilium became a CNCF Graduated project in October 2023 and has since rapidly established itself as the standard CNI in the managed Kubernetes world. GKE Dataplane V2 is built on Cilium, AKS offers Azure CNI Powered by Cilium, and on EKS it has become common to install Cilium directly and remove kube-proxy entirely. In this article we dissect the eBPF datapath by following a packet on its journey, and cover everything an operator needs to know: kube-proxy replacement, the identity security model, routing modes, IPAM, installation, verification, and troubleshooting.
Where Cilium Sits — More Than a CNI
On the surface Cilium is a CNI plugin, but the territory it actually covers is much broader.
| Area | Capability | What it replaces |
|---|---|---|
| Pod networking | IP allocation, pod-to-pod routing | Legacy CNIs (flannel, calico, etc.) |
| Service load balancing | ClusterIP, NodePort, LoadBalancer | kube-proxy |
| Network policy | L3/L4/L7 policy, DNS policy | NetworkPolicy plus a separate proxy |
| Observability | Flow visibility (Hubble) | Separate monitoring agents |
| Multi-cluster | ClusterMesh | Parts of a service mesh |
| Encryption | WireGuard, IPsec | Separate overlay solutions |
The common foundation of all these capabilities is eBPF. eBPF lets you run verifier-approved programs at hook points inside the kernel without recompiling it or loading modules. From a networking standpoint, the important hooks are XDP (driver level) and tc (traffic control, the L2/L3 ingress and egress points).
Dissecting the eBPF Datapath — A Packet on Its Journey
Let us follow a packet arriving from outside via NodePort until it reaches a pod.
[External client]
|
v
+---------------------------------------------------------------+
| Node (Linux kernel) |
| |
| NIC receive |
| | |
| v |
| [XDP hook] ──────── bpf_xdp.o |
| | - NodePort acceleration, DDoS filtering, LB (driver lvl) |
| | - DROP/TX(hairpin)/PASS decided here |
| v |
| [tc ingress hook] ── bpf_host.o (cil_from_netdev) |
| | - Service translation: ClusterIP/NodePort -> backend IP |
| | (lookups in cilium_lb4_services_v2, cilium_lb4_backends)|
| | - conntrack entry (cilium_ct4_global map) |
| | - ipcache lookup: destination IP -> identity |
| v |
| Routing/redirect (bpf_redirect_peer skips the veth pair) |
| | |
| v |
| [Pod lxc interface] ── bpf_lxc.o (cil_to_container) |
| | - Policy evaluation: src identity + dst port/proto match |
| | (per-endpoint cilium_policy_v2 map) |
| v |
| [Pod network namespace -> application socket] |
+---------------------------------------------------------------+
The key points to note:
- XDP runs before the NIC driver converts the packet into an sk_buff. Because no buffer has been allocated yet, it is the fastest hook, and Cilium uses it for NodePort/LoadBalancer acceleration (loadBalancer.acceleration=native) and the standalone L4LB mode.
- The tc hook is the heart of the Cilium datapath. Service translation, conntrack, policy evaluation, and tunnel encapsulation all happen here.
- bpf_redirect_peer delivers the packet from the host-side veth directly to the peer device inside the pod namespace, crossing the namespace boundary in one step without rescheduling a softirq.
- Policy evaluation is based not on IPs but on identity (explained below).
Comparison with the iptables Path
This is the path the same NodePort request takes in a kube-proxy (iptables mode) cluster.
NIC -> netfilter PREROUTING
-> KUBE-SERVICES chain (linear walk, one rule set per service)
-> KUBE-SVC-XXXX (probabilistic DNAT branch, one per backend)
-> KUBE-SEP-XXXX (DNAT executed)
-> conntrack (nf_conntrack table)
-> FORWARD chain -> CNI bridge/routing -> veth -> pod
| Aspect | iptables (kube-proxy) | eBPF (Cilium) |
|---|---|---|
| Service lookup | Linear walk through rule chains | Hash map, O(1) |
| Rule updates | Full table rewrite (slower with more services) | Incremental per-entry map updates |
| Backend selection | statistic module probability branches | Random or Maglev consistent hashing |
| conntrack | nf_conntrack (global, lock contention) | BPF-map based dedicated CT |
| Policy expression | IP/CIDR based | Identity based (label semantics preserved) |
| L7 awareness | Not possible | Possible via Envoy integration |
With a few hundred services the perceived difference is small, but beyond a few thousand, iptables update latency and first-packet latency surface as operational problems. The eBPF datapath addresses both axes: a service lookup is a constant-time hash lookup, and an update touches only the map entries that changed.
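The lookup-cost row of the table can be made concrete with a toy model. The sketch below is not netfilter or Cilium code; it only contrasts an ordered rule list with a keyed map, and the service addresses and backend names are made up for illustration.

```python
# Toy model: the same 5,000 services stored as an ordered rule chain
# and as a hash map keyed by (ip, port). All names here are hypothetical.

def build_services(n):
    """Return n made-up ((cluster_ip, port), backend) pairs."""
    return [((f"10.96.{i // 256}.{i % 256}", 443), f"backend-{i}") for i in range(n)]

def chain_lookup(chain, key):
    """iptables-style: compare against each rule in order and count comparisons."""
    for steps, (match, target) in enumerate(chain, start=1):
        if match == key:
            return target, steps
    return None, len(chain)

def map_lookup(table, key):
    """Hash-map-style: one keyed lookup, independent of table size."""
    return table.get(key), 1

services = build_services(5000)
chain = services            # ordered list of rules
table = dict(services)      # keyed map
last_key = services[-1][0]  # worst case for the chain: the last rule

print(chain_lookup(chain, last_key))
print(map_lookup(table, last_key))
```

The chain needs one comparison per rule ahead of the match, while the map answers in a single step whether it holds five services or five thousand.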
How kube-proxy Replacement Works
Cilium kube-proxy replacement (KPR) implements the Kubernetes service abstraction as two layers of eBPF maps.
Service lookup flow (simplified)
Destination 10.96.0.10:443 (ClusterIP)
|
v
cilium_lb4_services_v2 map
key: (IP, port, scope)
value: (backend count, rev_nat index, flags)
|
v
cilium_lb4_backends_v3 map
key: backend id
value: (pod IP, port, state)
|
v
Perform DNAT + record reverse mapping in cilium_lb4_reverse_nat
The agent watches Service/EndpointSlice objects on the Kubernetes API and applies only the deltas to the maps. Because the programs and maps live in the kernel rather than in the agent process, existing connections and already-programmed service translations keep working even if the agent is down briefly; only new changes wait until it returns.
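The two-step lookup in the flow above can be sketched with ordinary dictionaries. This is a simplified model under stated assumptions, not Cilium's map layout: the real services map is keyed with more fields, and here the backend ids are stored as a plain list and the connection tracking that remembers the reverse-NAT index is left out.

```python
import random

# Hypothetical in-memory stand-ins for the maps named in the text.
services = {   # role of cilium_lb4_services_v2: (ip, port) -> service entry
    ("10.96.0.10", 443): {"backend_ids": [1, 2], "rev_nat_index": 7},
}
backends = {   # role of cilium_lb4_backends_v3: backend id -> (pod ip, port)
    1: ("10.0.1.23", 8443),
    2: ("10.0.2.40", 8443),
}
reverse_nat = {  # role of cilium_lb4_reverse_nat: index -> original service address
    7: ("10.96.0.10", 443),
}

def translate(dst_ip, dst_port, rng=random):
    """Return the DNAT target for a service address, or None if it is not a service."""
    svc = services.get((dst_ip, dst_port))
    if svc is None:
        return None  # not a service address: the packet is forwarded unchanged
    backend_id = rng.choice(svc["backend_ids"])   # random backend selection
    pod_ip, pod_port = backends[backend_id]
    return {"dnat_to": (pod_ip, pod_port), "rev_nat_index": svc["rev_nat_index"]}

def reverse_translate(rev_nat_index):
    """Restore the service address on the reply path."""
    return reverse_nat[rev_nat_index]

result = translate("10.96.0.10", 443)
print(result)
print(reverse_translate(result["rev_nat_index"]))
```

Adding or removing a backend in this model means touching one entry in `backends` and one list in `services`, which is the incremental update behaviour the paragraph describes.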
DSR and Maglev
When NodePort traffic arrives at a node that hosts no backend, an extra hop to another node occurs. In the default SNAT mode the response travels back through the ingress node, but in DSR (Direct Server Return) mode the backend pod replies to the client directly, cutting the return-path hop and bandwidth cost while also preserving the client source IP.
Maglev consistent hashing is the L4 load balancer algorithm published by Google; it ensures that most existing flows keep mapping to the same backend even as backends are added or removed. In setups where multiple nodes receive the same VIP traffic via ECMP, backend selection stays consistent across nodes, minimizing connection breakage during node failures.
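A minimal sketch of the table population step from the published Maglev algorithm helps show why most flows keep their backend. It is not Cilium's implementation: the SHA-256 based hash helper, the backend names, and the small table size of 251 are assumptions chosen for readability.

```python
import hashlib

def _h(name, salt):
    """Deterministic 64-bit hash; a stand-in for the two hash functions Maglev needs."""
    digest = hashlib.sha256(f"{salt}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_table(backends, m):
    """Populate a Maglev lookup table of prime size m."""
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * m
    filled = 0
    while True:
        # Backends take turns; each claims the next free slot of its own permutation.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table

def pick(table, flow_tuple):
    """Map a flow (for example a 5-tuple) to a backend via the table."""
    return table[_h(repr(flow_tuple), "flow") % len(table)]

M = 251  # small prime for the demo; production tables are far larger
before = build_table(["b1", "b2", "b3", "b4", "b5"], M)
after = build_table(["b1", "b2", "b3", "b4"], M)
moved = sum(1 for x, y in zip(before, after) if x != y)
print(f"{moved}/{M} slots changed after removing b5")
print(pick(before, ("198.51.100.7", 40000, "10.96.0.10", 443, "tcp")))
```

Because every node that builds the table from the same backend list and hash inputs arrives at the same table, two nodes receiving the same flow via ECMP pick the same backend without sharing state.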
# helm values - KPR + DSR + Maglev example
kubeProxyReplacement: true
loadBalancer:
  mode: dsr
  algorithm: maglev
  acceleration: native # XDP acceleration (supported NICs only)
maglev:
  tableSize: 16381 # prime number larger than backends x 100 recommended
Keep in mind that DSR works most naturally in native routing mode; combining it with tunnel mode requires Geneve-based DSR dispatch (loadBalancer.dsrDispatch=geneve together with tunnelProtocol=geneve).
The Identity-Based Security Model
The core of the Cilium security model is: decide by identity, not by IP.
Pod labels identity (number)
--------------------------- ----------------
app=frontend, env=prod ---> identity 51234
app=backend, env=prod ---> identity 60917
(reserved) host ---> 1
(reserved) world ---> 2
(reserved) remote-node ---> 6
ipcache map: IP/CIDR -> identity
10.0.1.23/32 -> 51234
10.0.2.40/32 -> 60917
0.0.0.0/0 -> 2 (world)
The sequence of operations:
- When a pod is created, the agent extracts its security-relevant label set and assigns the same numeric identity to identical label sets (global agreement via kvstore or CRDs).
- The mapping between the pod IP and its identity is propagated to the cilium_ipcache map on every node.
- When a packet is sent, the source identity is carried as metadata (in the tunnel header or as a packet mark), or recovered on the receiving side via an ipcache lookup.
- The policy map of the receiving endpoint answers the question — can identity 51234 reach TCP 8080 — in O(1).
The advantage of this model is that policy semantics survive IP churn. If a pod is rescheduled and its IP changes, the identity stays the same as long as the labels do, and the policy map needs no modification. Only the ipcache entry is updated.
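The sequence above can be sketched as three small lookups. Everything in this model is illustrative: the identity numbers start at a made-up value, the ipcache is an exact-match dictionary rather than a prefix map, and the policy map is a plain set.

```python
WORLD = 2  # reserved identity for anything outside the cluster

identities = {}     # frozenset of labels -> numeric identity
_next_id = [1000]   # hypothetical starting number for allocated identities

def identity_for(labels):
    """Identical label sets always receive the same numeric identity."""
    key = frozenset(labels.items())
    if key not in identities:
        identities[key] = _next_id[0]
        _next_id[0] += 1
    return identities[key]

ipcache = {}  # pod IP -> identity (exact match only in this sketch)

def lookup_identity(ip):
    """Unknown addresses fall back to the world identity."""
    return ipcache.get(ip, WORLD)

frontend = identity_for({"app": "frontend", "env": "prod"})
backend = identity_for({"app": "backend", "env": "prod"})

# Policy map of the backend endpoint: (source identity, protocol, port) entries.
backend_policy = {(frontend, "tcp", 8080)}

def allowed(src_ip, proto, port):
    """One set-membership test answers the policy question."""
    return (lookup_identity(src_ip), proto, port) in backend_policy

ipcache["10.0.1.23"] = frontend
print(allowed("10.0.1.23", "tcp", 8080))

# The frontend pod is rescheduled and gets a new IP: only the ipcache changes.
del ipcache["10.0.1.23"]
ipcache["10.0.3.77"] = identity_for({"app": "frontend", "env": "prod"})
print(allowed("10.0.3.77", "tcp", 8080))
print(allowed("10.0.1.23", "tcp", 8080))
```

`backend_policy` is never modified after the reschedule, yet the new IP is allowed and the stale IP, which now resolves to the world identity, is denied.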
# Inspect identities and the ipcache directly
# (recent releases document the in-pod CLI as cilium-dbg; use that name if cilium is not found)
kubectl -n kube-system exec ds/cilium -- cilium identity list
kubectl -n kube-system exec ds/cilium -- cilium bpf ipcache list
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
Tunneling vs Native Routing
There are two main ways for pod-to-pod packets to cross node boundaries.
Tunnel mode (VXLAN/Geneve) Native routing
+--------+ encapsulate +--------+ +--------+ plain IP +--------+
| node A | ===========> | node B | | node A | ---------> | node B |
+--------+ UDP 8472 +--------+ +--------+ router/BGP +--------+
Pod packet wrapped in outer header Network must route the PodCIDRs
| Criterion | VXLAN | Geneve | Native routing (BGP/cloud) |
|---|---|---|---|
| Network requirement | Only UDP 8472 between nodes | Only UDP 6081 between nodes | Underlay must route PodCIDRs |
| Overhead | About 50 bytes encapsulation | About 50 bytes or more (variable options) | None |
| MTU impact | Yes (MTU reduction needed) | Yes | None |
| DSR friendliness | Limited | Good for carrying DSR options | Most natural |
| Operational difficulty | Low | Low | Requires BGP or cloud routing knowledge |
| Suitable environment | Uncontrollable underlay | Where metadata extension matters | On-prem BGP, ENI and other cloud-native setups |
A rough summary of the decision criteria:
- If you cannot control the underlay network (shared corporate network, plain L3 environment) → start with tunnel mode
- If you can peer BGP with your ToR switches on-prem → native routing plus the BGP Control Plane
- If you use ENI or Azure IPAM on AWS/Azure → native routing is the baseline assumption
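The MTU row of the comparison table reduces to simple arithmetic. The sketch below uses the roughly 50-byte encapsulation figure from the table, which assumes an IPv4 underlay; the function name and the optional Geneve options parameter are illustrative, and encryption overhead is deliberately not modelled.

```python
# Encapsulation overhead in bytes, taken from the comparison table above
# (IPv4 underlay assumed; Geneve grows with any options it carries).
ENCAP_OVERHEAD = {"native": 0, "vxlan": 50, "geneve": 50}

def pod_mtu(underlay_mtu, mode, geneve_options=0):
    """Largest pod MTU that still fits in one underlay frame after encapsulation."""
    if mode not in ENCAP_OVERHEAD:
        raise ValueError(f"unknown routing mode: {mode}")
    overhead = ENCAP_OVERHEAD[mode]
    if mode == "geneve":
        overhead += geneve_options
    return underlay_mtu - overhead

for mode in ("native", "vxlan", "geneve"):
    print(mode, pod_mtu(1500, mode))
```

On a standard 1500-byte underlay this leaves 1450 bytes for pod traffic in tunnel mode, which is why the MTU has to be settled before the routing mode is rolled out.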
# Native routing + BGP example (helm values)
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8
autoDirectNodeRoutes: true # install direct node routes when on the same L2
bgpControlPlane:
  enabled: true # use CiliumBGPPeeringPolicy/ClusterConfig
IPAM Modes
Where and how pod IPs are allocated is also part of the datapath design.
| IPAM mode | Allocator | Characteristics |
|---|---|---|
| cluster-pool (default) | Cilium operator | Per-node PodCIDRs distributed via CiliumNode CRD, flexible pool sizes |
| kubernetes | kube-controller-manager | Uses Node.spec.podCIDR, compatible with existing cluster conventions |
| eni | AWS ENI | Pods get VPC-native IPs, routing mode must be native |
| azure | Azure IPAM | Foundation of Azure CNI Powered by Cilium |
| multi-pool | Cilium operator | Different pools per namespace or node |
In cluster-pool mode, sizing the pool too small initially can block pod scheduling due to CIDR exhaustion when nodes are added, so clusterPoolIPv4PodCIDRList should be generously sized with future node counts in mind.
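A quick way to sanity-check pool sizing is to count how many per-node CIDRs a pool can hand out. The helper names below are hypothetical, and the per-node figure is the raw address count of the block; the number of IPs actually usable by pods is somewhat lower.

```python
import ipaddress

def node_capacity(pool_cidrs, mask_size):
    """Number of per-node PodCIDRs of size /mask_size that fit in the pool."""
    total = 0
    for cidr in pool_cidrs:
        net = ipaddress.ip_network(cidr)
        if mask_size < net.prefixlen:
            raise ValueError(f"/{mask_size} does not fit inside {cidr}")
        total += 2 ** (mask_size - net.prefixlen)
    return total

def addresses_per_node(mask_size):
    """Raw IPv4 address count of one per-node block."""
    return 2 ** (32 - mask_size)

print(node_capacity(["10.128.0.0/12"], 24))  # nodes the pool can serve
print(addresses_per_node(24))                # addresses in each node's block
```

A /12 pool split into /24 blocks serves 4096 nodes, whereas a /16 pool with the same mask size would be exhausted at 256 nodes.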
Installation and Upgrades in Practice
Kernel Requirements
| Feature | Minimum kernel |
|---|---|
| Base datapath | 4.19 or later (practically 5.4 or later recommended) |
| WireGuard encryption | 5.6 or later |
| XDP acceleration | NIC driver support plus 5.x recommended |
| BIG TCP, netkit and other recent features | 6.x series |
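As of 2026 the major distributions (RHEL 9, Ubuntu 22.04/24.04) all ship 5.14 or later, so this is usually a non-issue, but older on-prem nodes must be checked with uname -r. The minimum kernel also rises between Cilium releases, so confirm it against the system requirements page of the version you install.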
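The uname -r check can be scripted across a fleet. The sketch below compares only the major and minor numbers against the minimums from the table above; the feature names are labels invented for this example, and distribution kernels with backported features (RHEL in particular) may support more than their version number suggests.

```python
import platform
import re

# Minimums copied from the kernel requirements table; keys are illustrative labels.
MINIMUMS = {"base": (4, 19), "recommended": (5, 4), "wireguard": (5, 6)}

def parse_release(release):
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def meets(release, feature):
    """True if the kernel release is at or above the minimum for the feature."""
    return parse_release(release) >= MINIMUMS[feature]

def report(release=None):
    """Check one release string; defaults to the local kernel (same value as uname -r)."""
    release = release or platform.release()
    return {feature: meets(release, feature) for feature in MINIMUMS}

# Sample release string, not taken from a real node.
print(report("5.4.0-150-generic"))
```

Calling `report()` with no argument on a node checks that node's own kernel.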
Helm Installation Example
helm repo add cilium https://helm.cilium.io/
helm repo update
# Assumes the cluster was created without kube-proxy, or it has been removed
helm install cilium cilium/cilium \
--version 1.18.5 \
--namespace kube-system \
-f values.yaml
# values.yaml — kube-proxy replacement + Hubble enabled baseline
kubeProxyReplacement: true
k8sServiceHost: api.mycluster.internal # mandatory with KPR: point at the API server directly
k8sServicePort: 6443
routingMode: tunnel
tunnelProtocol: vxlan
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.128.0.0/12
    clusterPoolIPv4MaskSize: 24
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
operator:
  replicas: 2
prometheus:
  enabled: true
The reason for specifying k8sServiceHost matters. Without kube-proxy, Cilium itself handles ClusterIP translation, but the agent must first reach the API server to learn about services at all. It cannot do that through the kubernetes.default ClusterIP, because nothing translates that address until the agent is running. Pointing it at the real API server address avoids this chicken-and-egg problem.
Using cilium-cli
# Comprehensive installation health check
cilium status --wait
# Cluster-wide connectivity test (deploys test pods automatically)
cilium connectivity test
# Verify configuration
cilium config view | grep -i kube-proxy
Upgrade Procedure
# 1) Always read the upgrade guide in the release notes (no skipping minor versions)
# 2) Pre-flight check: pre-pull new images and verify CRD compatibility
helm install cilium-preflight cilium/cilium --version 1.18.5 \
--namespace kube-system \
--set preflight.enabled=true \
--set agent=false --set operator.enabled=false
# 3) After preflight is confirmed healthy, remove it and run the actual upgrade
helm delete cilium-preflight --namespace kube-system
helm upgrade cilium cilium/cilium --version 1.18.5 \
--namespace kube-system -f values.yaml
# 4) Watch the rolling restart
kubectl -n kube-system rollout status ds/cilium
During an upgrade, the eBPF programs and maps stay in the kernel even while agent pods restart, so existing traffic should not be interrupted. However, major changes that alter map layouts can trigger temporary recreation, which is why reading the release notes is mandatory.
Verifying the kube-proxy Replacement
You must verify with your own eyes that KPR is actually working.
# 1) Confirm KPR mode - look for "True"
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -A3 KubeProxyReplacement
# 2) Confirm services landed in the eBPF maps
kubectl -n kube-system exec ds/cilium -- cilium service list
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list
# 3) Confirm kube-proxy is truly gone and no iptables remnants exist (run iptables-save on each node as root)
kubectl -n kube-system get ds kube-proxy 2>&1 || echo "no kube-proxy - good"
iptables-save | grep -c KUBE-SVC || echo "no KUBE-SVC chains - good"
# 4) Real connectivity test
kubectl run probe --image=curlimages/curl --rm -it --restart=Never -- \
curl -s -o /dev/null -w "%{http_code}\n" http://my-service.default.svc.cluster.local
Especially when migrating an existing cluster off kube-proxy, leftover iptables rules can conflict with the eBPF path. Your plan must include a step to clean up residual KUBE-* chains on every node after deleting the kube-proxy DaemonSet (node reboot or an iptables flush).
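The residual-chain check can be automated by parsing saved iptables-save output collected from each node. This sketch relies only on the chain declaration lines, which begin with a colon; the sample text is made up, and because components other than kube-proxy may also own KUBE- chains, a hit is a prompt to inspect the node rather than proof of a conflict.

```python
def residual_kube_chains(iptables_save_output):
    """Return the sorted names of chains starting with KUBE- in iptables-save output."""
    chains = set()
    for line in iptables_save_output.splitlines():
        # Chain declarations look like ":NAME POLICY [packets:bytes]".
        if line.startswith(":KUBE-"):
            chains.add(line[1:].split()[0])
    return sorted(chains)

# Hypothetical excerpt of iptables-save output from a node that still has remnants.
sample = """\
*nat
:PREROUTING ACCEPT [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-ABCDEFGHIJKLMNOP - [0:0]
-A PREROUTING -j KUBE-SERVICES
COMMIT
"""

print(residual_kube_chains(sample))
```

An empty list from every node is the state to aim for before relying on the eBPF path alone.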
The Performance Angle — Why It Gets Faster
Benchmark numbers are highly environment-dependent, so understanding where the structural gains come from matters more.
- Lookup complexity: iptables walks rules proportionally to their count; eBPF hash maps are constant time. The gap widens as services and policies grow.
- Shorter path length: bpf_redirect_peer saves a softirq cycle when crossing the veth pair, and in host-routing mode the upper netfilter stack is skipped entirely.
- Update cost: in clusters with frequent deployments, full iptables rewrites cause CPU spikes and rule application delays, while incremental map updates are nearly free.
- XDP: driver-level processing decides LB/drop before sk_buff allocation, raising NodePort packet throughput substantially.
There are costs too. Enabling L7 policy routes the affected traffic through userspace Envoy, adding latency, and tunnel mode brings encapsulation overhead plus MTU reduction. The operational key is not "everything gets faster" but knowing which feature costs what when you turn it on.
Troubleshooting Basics
# Observe live datapath events (including drop reasons)
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop
kubectl -n kube-system exec ds/cilium -- cilium monitor --type policy-verdict
# Endpoint details (policy state, identity)
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
kubectl -n kube-system exec ds/cilium -- cilium endpoint get 1234
# Query conntrack/NAT maps directly
kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global | head
kubectl -n kube-system exec ds/cilium -- cilium bpf nat list | head
# Collect a full diagnostic bundle (for issue reports)
cilium sysdump
# or per node
kubectl -n kube-system exec ds/cilium -- cilium-bugtool
The drop reason codes from cilium monitor are your first clue. Frequently seen reasons:
| Drop reason | Common cause |
|---|---|
| Policy denied | Not allowed by policy — recheck identity and port |
| CT: Map insertion failed | conntrack map full — tune bpf-ct-global-tcp-max |
| Unsupported L3 protocol | Non-IP traffic — confirm whether intended |
| Stale or unroutable IP | ipcache mismatch — check agent/node synchronization |
| Missed tail call | Program load mismatch — restart agent, check for mixed versions |
Operational Cautions
- Version compatibility: Cilium minor versions only support sequential upgrades. Do not jump from 1.16 to 1.18. Also check the Kubernetes version support matrix.
- Policy migration: when migrating from Calico and similar, most existing NetworkPolicy objects work as-is, but subtle behavioral differences from the identity model (especially around ipBlock and node IP handling) must be validated in staging first.
- Reserved identities: without understanding reserved identities like host, remote-node, world, and kube-apiserver, you will hit incidents where node-originated traffic or health checks get blocked by policy.
- Resource limits: in large clusters, review the default BPF map sizes (ct, ipcache, lb maps) and adjust agent memory requests accordingly.
- Managed environment constraints: with GKE Dataplane V2 or AKS Cilium mode, some helm values are fixed by the cloud provider. Do not expect the same freedom as a self-managed installation.
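The sequential-upgrade rule from the first bullet can be encoded in a runbook helper. The function name is hypothetical and the sketch only handles versions within the same major series.

```python
def upgrade_path(current, target):
    """List the minor versions to step through, one at a time, from current to target."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]

print(upgrade_path("1.16", "1.18"))
print(upgrade_path("1.18.2", "1.18.5"))
```

Going from 1.16 to 1.18 yields two separate upgrades, through 1.17 and then 1.18, while a patch-level move within 1.18 yields an empty list because no intermediate minor version is involved.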
Adoption Checklist
- Are node kernels 5.4 or later (5.6 or later if WireGuard is needed)
- Is the routing mode decision (tunnel vs native) documented with its rationale
- Does the PodCIDR pool accommodate three years of node growth
- Are k8sServiceHost/Port set when kubeProxyReplacement is enabled
- If tunnel mode, is UDP 8472 (VXLAN) or 6081 (Geneve) open between nodes
- Is the MTU calculation done (accounting for encapsulation/encryption overhead)
- Does cilium connectivity test pass all items
- Is there a procedure to clean residual iptables chains after removing kube-proxy
- Are cilium status, BPF map utilization, and drop counts wired into monitoring
- Is the preflight procedure for upgrades part of the runbook
Closing
The essence of Cilium is pushing the semantics of Kubernetes networking — services, labels, policies — directly down into kernel data structures. In the iptables era, the meaning carried by labels was lost in translation to IP rules; in the eBPF datapath it is preserved all the way to the kernel in the form of identities. Once you understand this structure, kube-proxy replacement, policy evaluation, and troubleshooting all connect into a single picture. In the next article we will cover the network policies built on top of this — L3 through L7 and DNS — with hands-on YAML.
References
- Cilium official documentation: https://docs.cilium.io/
- Cilium kube-proxy replacement guide: https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/
- Cilium routing modes documentation: https://docs.cilium.io/en/stable/network/concepts/routing/
- eBPF official site: https://ebpf.io/
- Linux kernel BPF documentation: https://www.kernel.org/doc/html/latest/bpf/
- Kubernetes Service documentation: https://kubernetes.io/docs/concepts/services-networking/service/
- CNCF Cilium graduation announcement: https://www.cncf.io/announcements/2023/10/11/cloud-native-computing-foundation-announces-cilium-graduation/
- VXLAN RFC 7348: https://datatracker.ietf.org/doc/html/rfc7348
- Geneve RFC 8926: https://datatracker.ietf.org/doc/html/rfc8926
- Cilium GitHub repository: https://github.com/cilium/cilium
- Envoy proxy documentation: https://www.envoyproxy.io/docs