Cilium Datapath Architecture — Inside a Cluster Without kube-proxy

Introduction
Where Cilium Sits — More Than a CNI
Dissecting the eBPF Datapath — A Packet on Its Journey
- Comparison with the iptables Path
How kube-proxy Replacement Works
- DSR and Maglev
The Identity-Based Security Model
Tunneling vs Native Routing
IPAM Modes
Installation and Upgrades in Practice
Verifying the kube-proxy Replacement
The Performance Angle — Why It Gets Faster
Troubleshooting Basics
Operational Cautions
Adoption Checklist
Closing
References

Introduction

In a traditional Kubernetes cluster running kube-proxy in iptables mode, a single service request may be matched against thousands of iptables rules before it reaches a pod. In a cluster with more than five thousand services, a single rule update can take seconds, and the kernel pays a near-linear lookup cost for the first packet of every connection. Cilium replaces this path with eBPF programs and hash maps, processing packets with close to O(1) lookups regardless of how many services exist.

Cilium became a CNCF Graduated project in October 2023 and has since become a common default CNI in managed Kubernetes. GKE Dataplane V2 is built on Cilium, AKS offers Azure CNI Powered by Cilium, and on EKS it has become common to install Cilium directly and remove kube-proxy entirely. In this article we dissect the eBPF datapath by following a packet on its journey, then cover what an operator needs to run it: kube-proxy replacement, the identity security model, routing modes, IPAM, installation, verification, and troubleshooting.

Where Cilium Sits — More Than a CNI

On the surface Cilium is a CNI plugin, but the territory it actually covers is much broader.

Area	Capability	What it replaces
Pod networking	IP allocation, pod-to-pod routing	Legacy CNIs (flannel, calico, etc.)
Service load balancing	ClusterIP, NodePort, LoadBalancer	kube-proxy
Network policy	L3/L4/L7 policy, DNS policy	NetworkPolicy plus a separate proxy
Observability	Flow visibility (Hubble)	Separate monitoring agents
Multi-cluster	ClusterMesh	Parts of a service mesh
Encryption	WireGuard, IPsec	Separate overlay solutions

The common foundation of all these capabilities is eBPF. eBPF lets you run verifier-approved programs at hook points inside the kernel without recompiling it or loading modules. From a networking standpoint, the important hooks are XDP (driver level) and tc (traffic control, the L2/L3 ingress and egress points).

Dissecting the eBPF Datapath — A Packet on Its Journey

Let us follow a packet arriving from outside via NodePort until it reaches a pod.

[External client]
      |
      v
+---------------------------------------------------------------+
| Node (Linux kernel)                                           |
|                                                               |
|  NIC receive                                                  |
|   |                                                           |
|   v                                                           |
|  [XDP hook] ──────── bpf_xdp.o                                |
|   |  - NodePort acceleration, DDoS filtering, LB (driver lvl) |
|   |  - DROP/TX(hairpin)/PASS decided here                     |
|   v                                                           |
|  [tc ingress hook] ── bpf_host.o (cil_from_netdev)            |
|   |  - Service translation: ClusterIP/NodePort -> backend IP  |
|   |    (lookups in cilium_lb4_services_v2 and                 |
|   |     cilium_lb4_backends_v3)                               |
|   |  - conntrack entry (cilium_ct4_global map)                |
|   |  - ipcache lookup: destination IP -> identity             |
|   v                                                           |
|  Routing/redirect (bpf_redirect_peer skips the veth pair)     |
|   |                                                           |
|   v                                                           |
|  [Pod lxc interface] ── bpf_lxc.o (cil_to_container)          |
|   |  - Policy evaluation: src identity + dst port/proto match |
|   |    (per-endpoint cilium_policy_v2 map)                    |
|   v                                                           |
|  [Pod network namespace -> application socket]                |
+---------------------------------------------------------------+

The key points to note:

XDP runs in the NIC driver's receive path, before the kernel allocates an sk_buff for the packet. Because it avoids that allocation it is the fastest hook, and Cilium uses it for NodePort/LoadBalancer acceleration (loadBalancer.acceleration=native) and the standalone L4LB mode.
The tc hook is the heart of the Cilium datapath. Service translation, conntrack, policy evaluation, and tunnel encapsulation all happen here.
bpf_redirect_peer hands the packet straight to the pod-side peer of the veth pair, crossing the namespace boundary in one step instead of queueing the packet on the host-side veth and waiting for another softirq pass.
Policy evaluation is based not on IPs but on identity (explained below).

Comparison with the iptables Path

This is the path the same NodePort request takes in a kube-proxy (iptables mode) cluster.

NIC -> netfilter PREROUTING
        -> KUBE-SERVICES chain (linear walk, one rule set per service)
        -> KUBE-SVC-XXXX (probabilistic DNAT branch, one per backend)
        -> KUBE-SEP-XXXX (DNAT executed)
      -> conntrack (nf_conntrack table)
      -> FORWARD chain -> CNI bridge/routing -> veth -> pod

Aspect	iptables (kube-proxy)	eBPF (Cilium)
Service lookup	Linear walk through rule chains	Hash map, O(1)
Rule updates	Full table rewrite (slower with more services)	Incremental per-entry map updates
Backend selection	statistic module probability branches	Random or Maglev consistent hashing
conntrack	nf_conntrack (global, lock contention)	BPF-map based dedicated CT
Policy expression	IP/CIDR based	Identity based (label semantics preserved)
L7 awareness	Not possible	Possible via Envoy integration

With a few hundred services the perceived difference is small, but beyond a few thousand, the iptables update latency and first-packet latency surface as operational problems. The eBPF datapath keeps both costs flat: a lookup is a constant-time hash probe, and an update touches only the entries that changed.
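
The lookup-cost row of the table can be illustrated with a toy model. This is only a sketch: the rule list, the dictionary, and the function names are assumptions made for illustration, and neither structure mirrors actual kernel code. It counts comparisons for a linear rule walk versus a single hashed probe.

```python
# Toy model: a linear rule walk versus a hash-map lookup.
# All names here are hypothetical; neither structure mirrors kernel code.

def build_services(n):
    """Return n fake (ClusterIP, port) keys mapped to a backend label."""
    return {(f"10.96.{i // 256}.{i % 256}", 80): f"backend-{i}" for i in range(n)}


def linear_lookup(rules, key):
    """Walk the rule list in order, as a chain of match rules would."""
    comparisons = 0
    for rule_key, target in rules:
        comparisons += 1
        if rule_key == key:
            return target, comparisons
    return None, comparisons


def hash_lookup(table, key):
    """One hashed probe, independent of the table size (on average)."""
    return table.get(key), 1


services = build_services(5000)
rules = list(services.items())
last_key = rules[-1][0]

print(linear_lookup(rules, last_key))  # worst case: walks all 5000 rules
print(hash_lookup(services, last_key))
```

The worst case for the linear walk grows with the number of services, while the hashed probe stays at one step. Real iptables chains and BPF hash maps have more moving parts, but the scaling behavior is the point.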

How kube-proxy Replacement Works

Cilium kube-proxy replacement (KPR) implements the Kubernetes service abstraction as two layers of eBPF maps.

Service lookup flow (simplified)

  Destination 10.96.0.10:443 (ClusterIP)
        |
        v
  cilium_lb4_services_v2  map
    key:   (IP, port, scope)
    value: (backend count, rev_nat index, flags)
        |
        v
  cilium_lb4_backends_v3  map
    key:   backend id
    value: (pod IP, port, state)
        |
        v
  Perform DNAT + record reverse mapping in cilium_lb4_reverse_nat

The agent watches Service/EndpointSlice objects on the Kubernetes API and applies only the deltas to the maps. Because the data plane lives entirely in the kernel, existing connections and existing service translations keep working even if the agent is down briefly.
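
The two-level lookup and the delta-only updates can be modeled with plain dictionaries. The following is a minimal sketch under stated assumptions: the field names, the `translate` and `apply_delta` helpers, and the way a backend is picked are all hypothetical simplifications, not the real map layout.

```python
# Simplified model of the two-level service lookup described above.
# Field names and the slot-selection detail are assumptions for illustration.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    ip: str
    port: int


# Level 1: frontend (IP, port) -> ordered backend ids plus a reverse-NAT index.
services = {("10.96.0.10", 443): {"backend_ids": [7, 8, 9], "rev_nat": 1}}
# Level 2: backend id -> pod address.
backends = {7: Backend("10.0.1.23", 8443), 8: Backend("10.0.2.40", 8443),
            9: Backend("10.0.3.11", 8443)}
# Reverse NAT: index -> original frontend, used to rewrite replies.
reverse_nat = {1: ("10.96.0.10", 443)}


def translate(dst_ip, dst_port, rng=random):
    """Return (backend, rev_nat index), or None if dst is not a service."""
    svc = services.get((dst_ip, dst_port))
    if svc is None or not svc["backend_ids"]:
        return None
    backend_id = rng.choice(svc["backend_ids"])  # random backend selection
    return backends[backend_id], svc["rev_nat"]


def apply_delta(frontend, added=(), removed=()):
    """Incremental update: touch only the entries that changed."""
    ids = services[frontend]["backend_ids"]
    for backend_id, backend in added:
        backends[backend_id] = backend
        ids.append(backend_id)
    for backend_id in removed:
        ids.remove(backend_id)
        del backends[backend_id]


backend, rev = translate("10.96.0.10", 443)
print(backend, "reply source rewritten to", reverse_nat[rev])
```

Note that `apply_delta` never rebuilds the whole table, which is the property that distinguishes this design from a full iptables rewrite.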

DSR and Maglev

When NodePort traffic arrives at a node that hosts no backend, an extra hop to another node occurs. In the default SNAT mode the response travels back through the ingress node, but in DSR (Direct Server Return) mode the node hosting the backend pod replies to the client directly, cutting the return-path hop and bandwidth cost while also preserving the client source IP.

Maglev consistent hashing is the L4 load balancer algorithm published by Google; it ensures that most existing flows keep mapping to the same backend even as backends are added or removed. In setups where multiple nodes receive the same VIP traffic via ECMP, backend selection stays consistent across nodes, minimizing connection breakage during node failures.

# helm values - KPR + DSR + Maglev example
kubeProxyReplacement: true
loadBalancer:
  mode: dsr
  algorithm: maglev
  acceleration: native     # XDP acceleration (supported NICs only)
maglev:
  tableSize: 16381         # prime number larger than backends x 100 recommended

Keep in mind that DSR works most naturally in native routing mode; combining it with tunnel mode requires separate options such as Geneve DSR.
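
To make the consistent-hashing claim concrete, here is a small sketch of Maglev lookup-table population, following the permutation-based scheme from the published algorithm. The hash function (SHA-256 with a salt), the backend names, and the demo table size are assumptions chosen for readability; this is not Cilium's implementation.

```python
# Sketch of Maglev lookup-table population (after the published algorithm).
# Hash choice, backend names, and table size are assumptions for illustration.
import hashlib


def _hash(name, salt):
    digest = hashlib.sha256(f"{salt}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends, size):
    """Fill a table of prime `size` so each backend gets a near-equal share."""
    offsets = [_hash(b, "offset") % size for b in backends]
    skips = [_hash(b, "skip") % (size - 1) + 1 for b in backends]
    next_rank = [0] * len(backends)
    table = [None] * size
    filled = 0
    while filled < size:
        for i, backend in enumerate(backends):
            # Walk this backend's preference list to its first free slot.
            slot = (offsets[i] + next_rank[i] * skips[i]) % size
            while table[slot] is not None:
                next_rank[i] += 1
                slot = (offsets[i] + next_rank[i] * skips[i]) % size
            table[slot] = backend
            next_rank[i] += 1
            filled += 1
            if filled == size:
                break
    return table


def pick(table, flow):
    """Map a flow tuple to a backend; every node computes the same answer."""
    return table[_hash(repr(flow), "flow") % len(table)]


SIZE = 251  # small prime for the demo; the Helm example above uses 16381
before = build_table(["pod-a", "pod-b", "pod-c", "pod-d", "pod-e"], SIZE)
after = build_table(["pod-a", "pod-b", "pod-c", "pod-d"], SIZE)
kept = [i for i in range(SIZE) if before[i] != "pod-e"]
moved = sum(1 for i in kept if before[i] != after[i])
print(f"slots that changed owner among surviving backends: {moved}/{len(kept)}")
```

Because the table depends only on the backend set, two nodes that see the same backends build the same table and pick the same backend for a given flow. A prime table size matters here: it guarantees that each backend's preference list visits every slot.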

The Identity-Based Security Model

The core of the Cilium security model is: decide by identity, not by IP.

Pod labels                         identity (number)
---------------------------       ----------------
app=frontend, env=prod      --->  identity 51234
app=backend,  env=prod      --->  identity 60917
(reserved) host             --->  1
(reserved) world            --->  2
(reserved) remote-node      --->  6

ipcache map: IP/CIDR -> identity
  10.0.1.23/32 -> 51234
  10.0.2.40/32 -> 60917
  0.0.0.0/0    -> 2 (world)

The sequence of operations:

When a pod is created, the agent extracts its security-relevant label set and assigns the same numeric identity to identical label sets (global agreement via kvstore or CRDs).
The mapping between the pod IP and its identity is propagated to the cilium_ipcache map on every node.
When a packet is sent, the source identity is carried as metadata (in the tunnel header or as a packet mark), or recovered on the receiving side via an ipcache lookup.
The policy map of the receiving endpoint answers the question — can identity 51234 reach TCP 8080 — in O(1).

The advantage of this model is that policy semantics survive IP churn. If a pod is rescheduled and its IP changes, the identity stays the same as long as the labels do, and the policy map needs no modification. Only the ipcache entry is updated.
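
The sequence above, and the claim that rescheduling touches only the ipcache, can be sketched as a toy model. The helper names, the starting identity number, and the shape of the policy map are assumptions for illustration; the real agent allocates identities cluster-wide and stores policy in per-endpoint BPF maps.

```python
# Toy model of identity allocation, the ipcache, and a per-endpoint policy map.
# Identity numbers and helper names are illustrative assumptions.
import ipaddress

identities = {}   # frozenset of labels -> numeric identity
ipcache = {}      # pod IP -> identity
next_identity = 256  # hypothetical start of the non-reserved range in this toy


def allocate_identity(labels):
    """Identical label sets always share one identity."""
    global next_identity
    key = frozenset(labels.items())
    if key not in identities:
        identities[key] = next_identity
        next_identity += 1
    return identities[key]


def schedule_pod(ip, labels):
    ipcache[ipaddress.ip_address(ip)] = allocate_identity(labels)


def allowed(policy_map, src_ip, dport, proto="TCP"):
    """Resolve source IP to identity, then do one policy-map lookup."""
    identity = ipcache.get(ipaddress.ip_address(src_ip), 2)  # 2 = world
    return (identity, dport, proto) in policy_map


frontend = {"app": "frontend", "env": "prod"}
schedule_pod("10.0.1.23", frontend)
# Policy of the backend endpoint: frontend identity may reach TCP 8080.
backend_policy = {(allocate_identity(frontend), 8080, "TCP")}

print(allowed(backend_policy, "10.0.1.23", 8080))  # True
# Reschedule: the old IP is released and a new one is learned.
del ipcache[ipaddress.ip_address("10.0.1.23")]
schedule_pod("10.0.7.99", frontend)
print(allowed(backend_policy, "10.0.7.99", 8080))  # True, policy map untouched
print(allowed(backend_policy, "10.0.1.23", 8080))  # False, old IP is now "world"
```

The policy set is never modified during the reschedule; only the IP-to-identity mapping changes.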

# Inspect identities and the ipcache directly
# (recent releases also ship the in-pod CLI under the name cilium-dbg)
kubectl -n kube-system exec ds/cilium -- cilium identity list
kubectl -n kube-system exec ds/cilium -- cilium bpf ipcache list
kubectl -n kube-system exec ds/cilium -- cilium endpoint list

Tunneling vs Native Routing

There are two main ways for pod-to-pod packets to cross node boundaries.

Tunnel mode (VXLAN/Geneve)              Native routing
+--------+  encapsulate +--------+      +--------+  plain IP  +--------+
| node A | ===========> | node B |      | node A | ---------> | node B |
+--------+  UDP 8472    +--------+      +--------+ router/BGP +--------+
 Pod packet wrapped in outer header      Network must route the PodCIDRs

Criterion	VXLAN	Geneve	Native routing (BGP/cloud)
Network requirement	Only UDP 8472 between nodes	Only UDP 6081 between nodes	Underlay must route PodCIDRs
Overhead	About 50 bytes encapsulation	About 50 bytes or more (variable options)	None
MTU impact	Yes (MTU reduction needed)	Yes	None
DSR friendliness	Limited	Good for carrying DSR options	Most natural
Operational difficulty	Low	Low	Requires BGP or cloud routing knowledge
Suitable environment	Uncontrollable underlay	Where metadata extension matters	On-prem BGP, ENI and other cloud-native setups
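
The overhead and MTU rows translate into simple arithmetic. The sketch below uses the standard fixed header sizes; the function names are hypothetical, and the Geneve option length is a parameter because it varies with what metadata is carried.

```python
# Back-of-the-envelope pod MTU for tunnel mode over IPv4 or IPv6 underlays.
# Header sizes are the standard fixed sizes; Geneve options are variable.
INNER_ETHERNET = 14
UDP = 8
VXLAN_HEADER = 8
GENEVE_BASE_HEADER = 8
OUTER_IP = {"ipv4": 20, "ipv6": 40}


def tunnel_overhead(protocol, underlay="ipv4", geneve_options=0):
    """Bytes added in front of the pod's IP packet by encapsulation."""
    if protocol == "vxlan":
        tunnel = VXLAN_HEADER
    elif protocol == "geneve":
        tunnel = GENEVE_BASE_HEADER + geneve_options
    else:
        raise ValueError(f"unknown tunnel protocol: {protocol}")
    return OUTER_IP[underlay] + UDP + tunnel + INNER_ETHERNET


def pod_mtu(underlay_mtu, protocol, **kwargs):
    return underlay_mtu - tunnel_overhead(protocol, **kwargs)


print(pod_mtu(1500, "vxlan"))                      # 1450
print(pod_mtu(1500, "geneve", geneve_options=8))   # 1442
print(pod_mtu(9001, "vxlan"))                      # 8951
```

On a standard 1500-byte underlay, VXLAN therefore leaves 1450 bytes for pod traffic, which is where the "about 50 bytes" figure in the table comes from.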

A rough summary of the decision criteria:

If you cannot control the underlay network (shared corporate network, plain L3 environment) → start with tunnel mode
If you can peer BGP with your ToR switches on-prem → native routing plus the BGP Control Plane
If you use ENI or Azure IPAM on AWS/Azure → native routing is the baseline assumption

# Native routing + BGP example (helm values)
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8
autoDirectNodeRoutes: true     # install direct node routes when on the same L2
bgpControlPlane:
  enabled: true                # peers are then defined in CiliumBGPPeeringPolicy or CiliumBGPClusterConfig resources

IPAM Modes

Where and how pod IPs are allocated is also part of the datapath design.

IPAM mode	Allocator	Characteristics
cluster-pool (default)	Cilium operator	Per-node PodCIDRs distributed via CiliumNode CRD, flexible pool sizes
kubernetes	kube-controller-manager	Uses Node.spec.podCIDR, compatible with existing cluster conventions
eni	AWS ENI	Pods get VPC-native IPs, routing mode must be native
azure	Azure IPAM	Foundation of Azure CNI Powered by Cilium
multi-pool	Cilium operator	Different pools per namespace or node

In cluster-pool mode, sizing the pool too small initially can block pod scheduling due to CIDR exhaustion when nodes are added, so clusterPoolIPv4PodCIDRList should be generously sized with future node counts in mind.
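
A quick capacity check helps when sizing the pool. This is a rough sketch that assumes one per-node CIDR of a fixed mask size carved from each pool; the `pool_capacity` helper is hypothetical, and the per-node figure is an upper bound, since the agent is expected to reserve a few addresses on each node for its own use.

```python
# Estimate how many nodes a cluster-pool layout can hold.
# Assumes one per-node CIDR of `node_mask_size` carved from each pool.
import ipaddress


def pool_capacity(pool_cidrs, node_mask_size):
    """Return (max nodes, approximate pod IPs per node)."""
    nodes = 0
    for cidr in pool_cidrs:
        network = ipaddress.ip_network(cidr)
        if node_mask_size < network.prefixlen:
            raise ValueError(f"/{node_mask_size} is larger than pool {cidr}")
        nodes += 2 ** (node_mask_size - network.prefixlen)
    # Rough estimate: subtract network and broadcast addresses.
    ips_per_node = 2 ** (32 - node_mask_size) - 2
    return nodes, ips_per_node


print(pool_capacity(["10.128.0.0/12"], 24))   # (4096, 254)
print(pool_capacity(["10.244.0.0/16"], 24))   # (256, 254)
```

A /16 pool with /24 per node caps the cluster at 256 nodes, which is the kind of ceiling that is easy to hit a year after installation.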

Installation and Upgrades in Practice

Kernel Requirements

Feature	Minimum kernel
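Base datapath	5.10 or later for recent releases such as 1.18 (older releases accepted 4.19 or 5.4; check the system requirements page for your version)
WireGuard encryption	5.6 or later
XDP acceleration	NIC driver support plus 5.x recommended
BIG TCP, netkit and other recent features	5.19 to 6.x depending on the feature (BIG TCP for IPv4 needs 6.3, netkit needs 6.8)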

As of 2026 the major distributions (RHEL 9, Ubuntu 22.04/24.04) all ship 5.14 or later, so this is usually a non-issue, but older on-prem nodes must be checked with uname -r.

Helm Installation Example

helm repo add cilium https://helm.cilium.io/
helm repo update

# Assumes the cluster was created without kube-proxy, or it has been removed
helm install cilium cilium/cilium \
  --version 1.18.5 \
  --namespace kube-system \
  -f values.yaml

# values.yaml — kube-proxy replacement + Hubble enabled baseline
kubeProxyReplacement: true
k8sServiceHost: api.mycluster.internal   # mandatory with KPR: point at the API server directly
k8sServicePort: 6443

routingMode: tunnel
tunnelProtocol: vxlan

ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.128.0.0/12
    clusterPoolIPv4MaskSize: 24

hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true

operator:
  replicas: 2

prometheus:
  enabled: true

Specifying k8sServiceHost matters for a concrete reason. Without kube-proxy, Cilium itself handles ClusterIP translation, but the agent must reach the API server before it can program any service maps, so at startup nothing yet translates the kubernetes.default ClusterIP. Pointing the agent at the real API server address avoids this chicken-and-egg problem.

Using cilium-cli

# Comprehensive installation health check
cilium status --wait

# Cluster-wide connectivity test (deploys test pods automatically)
cilium connectivity test

# Verify configuration
cilium config view | grep -i kube-proxy

Upgrade Procedure

# 1) Always read the upgrade guide in the release notes (no skipping minor versions)
# 2) Pre-flight check: pre-pull new images and verify CRD compatibility
helm install cilium-preflight cilium/cilium --version 1.18.5 \
  --namespace kube-system \
  --set preflight.enabled=true \
  --set agent=false --set operator.enabled=false

# 3) After preflight is confirmed healthy, run the actual upgrade
helm upgrade cilium cilium/cilium --version 1.18.5 \
  --namespace kube-system -f values.yaml

# 4) Watch the rolling restart
kubectl -n kube-system rollout status ds/cilium

During an upgrade, the eBPF programs and maps stay in the kernel even while agent pods restart, so existing traffic should not be interrupted. However, major changes that alter map layouts can trigger temporary recreation, which is why reading the release notes is mandatory.

Verifying the kube-proxy Replacement

You must verify with your own eyes that KPR is actually working.

# 1) Confirm KPR mode - look for "True"
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -A3 KubeProxyReplacement

# 2) Confirm services landed in the eBPF maps
kubectl -n kube-system exec ds/cilium -- cilium service list
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list

# 3) Confirm kube-proxy is truly gone and no iptables remnants exist
kubectl -n kube-system get ds kube-proxy 2>&1 || echo "no kube-proxy - good"
# Run the next line on each node as root, not from your workstation
iptables-save | grep -c KUBE-SVC || echo "no KUBE-SVC chains - good"

# 4) Real connectivity test
kubectl run probe --image=curlimages/curl --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://my-service.default.svc.cluster.local

Leftover iptables rules can conflict with the eBPF path, especially when migrating an existing cluster off kube-proxy. After deleting the kube-proxy DaemonSet, your plan must include a step to clean up residual KUBE-* chains on every node, for example by rebooting the node or by removing the kube-proxy rules from iptables.
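
When many nodes need checking, the grep can be replaced by a small parser that lists exactly which chains remain. The sketch below assumes the usual `iptables-save` layout in which chain declarations start with a colon; the sample text is made up for illustration, and the `residual_kube_chains` helper is hypothetical.

```python
# Scan `iptables-save` output for chains left behind by kube-proxy.
# The sample text is made up for illustration; feed in real output from a node.
import re

CHAIN_DECL = re.compile(r"^:(KUBE-[A-Za-z0-9_-]+)\s")


def residual_kube_chains(iptables_save_output):
    """Return the sorted set of KUBE-* chain names declared in any table."""
    chains = set()
    for line in iptables_save_output.splitlines():
        match = CHAIN_DECL.match(line)
        if match:
            chains.add(match.group(1))
    return sorted(chains)


sample = """\
*nat
:PREROUTING ACCEPT [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
-A PREROUTING -j KUBE-SERVICES
COMMIT
*filter
:FORWARD ACCEPT [0:0]
:CILIUM_FORWARD - [0:0]
COMMIT
"""
print(residual_kube_chains(sample))
```

Treat the result as a prompt for investigation rather than a verdict: some KUBE-* chains may be created by components other than kube-proxy, such as the kubelet, so a non-empty list does not always mean the cleanup failed.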

The Performance Angle — Why It Gets Faster

Benchmark numbers are highly environment-dependent, so understanding where the structural gains come from matters more.

Lookup complexity: iptables walks rules proportionally to their count; eBPF hash maps are constant time. The gap widens as services and policies grow.
Shorter path length: bpf_redirect_peer saves a softirq cycle when crossing the veth pair, and in host-routing mode the upper netfilter stack is skipped entirely.
Update cost: in clusters with frequent deployments, full iptables rewrites cause CPU spikes and rule application delays, while incremental map updates are nearly free.
XDP: driver-level processing decides LB/drop before sk_buff allocation, raising NodePort packet throughput substantially.

There are costs too. Enabling L7 policy routes the affected traffic through userspace Envoy, adding latency, and tunnel mode brings encapsulation overhead plus MTU reduction. The operational key is not "everything gets faster" but knowing which feature costs what when you turn it on.

Troubleshooting Basics

# Observe live datapath events (including drop reasons)
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop
kubectl -n kube-system exec ds/cilium -- cilium monitor --type policy-verdict

# Endpoint details (policy state, identity)
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
kubectl -n kube-system exec ds/cilium -- cilium endpoint get 1234

# Query conntrack/NAT maps directly
kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global | head
kubectl -n kube-system exec ds/cilium -- cilium bpf nat list | head

# Collect a full diagnostic bundle (for issue reports)
cilium sysdump
# or per node
kubectl -n kube-system exec ds/cilium -- cilium-bugtool

The drop reason codes from cilium monitor are your first clue. Frequently seen reasons:

Drop reason	Common cause
Policy denied	Not allowed by policy — recheck identity and port
CT: Map insertion failed	conntrack map full — tune bpf-ct-global-tcp-max
Unsupported L3 protocol	Non-IP traffic — confirm whether intended
Stale or unroutable IP	ipcache mismatch — check agent/node synchronization
Missed tail call	Program load mismatch — restart agent, check for mixed versions
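
During an incident it helps to see which drop reasons dominate before reading individual lines. The following sketch tallies reasons from captured monitor output. It assumes each drop line contains the reason in parentheses right after the word "drop"; the sample lines are invented in that shape, so adjust the pattern if your version prints something different.

```python
# Tally drop reasons from captured `cilium monitor --type drop` output.
# Assumption: each drop line looks like "... drop (Reason) ...".
import re
from collections import Counter

DROP_REASON = re.compile(r"\bdrop \(([^)]+)\)")


def tally_drop_reasons(lines):
    """Count how often each drop reason appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = DROP_REASON.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, shaped like monitor output for illustration only.
sample = [
    "xx drop (Policy denied) flow 0x0 to endpoint 1234, identity 51234->60917: "
    "10.0.1.23:40000 -> 10.0.2.40:8080 tcp SYN",
    "xx drop (Policy denied) flow 0x0 to endpoint 1234, identity 2->60917: "
    "203.0.113.9:51000 -> 10.0.2.40:8080 tcp SYN",
    "xx drop (CT: Map insertion failed) flow 0x0 to endpoint 1234",
    "some unrelated line",
]
for reason, count in tally_drop_reasons(sample).most_common():
    print(f"{count:5d}  {reason}")
```

In practice the input would come from a file or from standard input rather than a hard-coded list.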

Operational Cautions

Version compatibility: Cilium minor versions only support sequential upgrades. Do not jump from 1.16 to 1.18. Also check the Kubernetes version support matrix.
Policy migration: when migrating from Calico and similar, most existing NetworkPolicy objects work as-is, but subtle behavioral differences from the identity model (especially around ipBlock and node IP handling) must be validated in staging first.
Reserved identities: without understanding reserved identities like host, remote-node, world, and kube-apiserver, you will hit incidents where node-originated traffic or health checks get blocked by policy.
Resource limits: in large clusters, review the default BPF map sizes (ct, ipcache, lb maps) and adjust agent memory requests accordingly.
Managed environment constraints: with GKE Dataplane V2 or AKS Cilium mode, some helm values are fixed by the cloud provider. Do not expect the same freedom as a self-managed installation.

Adoption Checklist

Closing

The essence of Cilium is pushing the semantics of Kubernetes networking — services, labels, policies — directly down into kernel data structures. In the iptables era, the meaning carried by labels was lost in translation to IP rules; in the eBPF datapath it is preserved all the way to the kernel in the form of identities. Once you understand this structure, kube-proxy replacement, policy evaluation, and troubleshooting all connect into a single picture. In the next article we will cover the network policies built on top of this — L3 through L7 and DNS — with hands-on YAML.

References

Cilium official documentation: https://docs.cilium.io/
Cilium kube-proxy replacement guide: https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/
Cilium routing modes documentation: https://docs.cilium.io/en/stable/network/concepts/routing/
eBPF official site: https://ebpf.io/
Linux kernel BPF documentation: https://www.kernel.org/doc/html/latest/bpf/
Kubernetes Service documentation: https://kubernetes.io/docs/concepts/services-networking/service/
CNCF Cilium graduation announcement: https://www.cncf.io/announcements/2023/10/11/cloud-native-computing-foundation-announces-cilium-graduation/
VXLAN RFC 7348: https://datatracker.ietf.org/doc/html/rfc7348
Geneve RFC 8926: https://datatracker.ietf.org/doc/html/rfc8926
Cilium GitHub repository: https://github.com/cilium/cilium
Envoy proxy documentation: https://www.envoyproxy.io/docs