Cilium eBPF Datapath Deep Dive: Packet Processing Pipeline
Overview

The Cilium eBPF datapath is a high-performance network pipeline that processes packets within the Linux kernel. This post provides a detailed analysis of the entire journey a packet takes when being sent or received by a Pod.

1. Packet Flow Overview

1.1 Ingress Packet Flow (External -> Pod)

External Network
    |
    v
[Physical NIC] eth0
    |
    v
[XDP Program] (optional)
  - NodePort acceleration
  - DDoS mitigation
  - Pre-filtering
    |
    v
[tc ingress: from-netdev]
  - Source Identity lookup (ipcache)
  - Tunnel decapsulation (VXLAN/Geneve)
  - NodePort DNAT
    |
    v
[Routing Decision]
  - Local Pod? -> cilium_host
  - Remote node? -> Tunnel or direct routing
    |
    v
[cilium_host: to-host]
  - Host firewall policy check
    |
    v
[tc egress: to-container] (lxc*)
  - Destination Identity check
  - Ingress policy check
  - L3/L4 filtering
  - Conntrack entry create/update
  - L7 redirect (to Envoy if needed)
    |
    v
[Pod Network Namespace]

1.2 Egress Packet Flow (Pod -> External)

[Pod Network Namespace]
    |
    v
[tc ingress: from-container] (lxc*)
  - Source endpoint identification
  - Egress policy check
  - Service DNAT (kube-proxy replacement)
  - Conntrack lookup/create
  - L7 redirect (if needed)
    |
    v
[Routing Decision]
  - Same-node Pod? -> Direct delivery
  - Remote Pod? -> Tunnel or direct routing
  - External? -> SNAT (masquerade)
    |
    v
[tc egress: to-netdev] (eth0)
  - Apply SNAT
  - Tunnel encapsulation (VXLAN/Geneve)
    |
    v
[Physical NIC] eth0
    |
    v
External Network

2. BPF Program Details

2.1 from-container (Pod Egress)

The first BPF program that processes packets leaving a Pod:

// Conceptual from-container processing flow (simplified)
// Actual code: bpf/bpf_lxc.c - handle_xgress()

int from_container(struct __sk_buff *skb) {
    // 1. Parse packet (L2/L3/L4 headers)
    // 2. Identify source endpoint
    // 3. Check egress policy
    //    - Identity-based L3/L4 policy
    //    - CIDR policy
    //    - L7 policy -> Envoy redirect
    // 4. Service load balancing
    //    - ClusterIP DNAT
    //    - NodePort DNAT
    // 5. Conntrack processing
    // 6. Routing decision
    //    - Same node: tail call to-container
    //    - Remote node: tunnel encap or direct routing
    //    - External: SNAT
    return TC_ACT_OK; // or TC_ACT_SHOT (drop)
}

Key processing stages:

| Stage | Description | BPF Map Used |
| --- | --- | --- |
| Packet parsing | Extract L2/L3/L4 headers | - |
| Endpoint identification | Identify source Pod | cilium_lxc |
| Policy check | Egress policy matching | cilium_policy |
| Service LB | ClusterIP/NodePort DNAT | cilium_lb4_services |
| Conntrack | Connection state tracking | cilium_ct4_global |
| Routing | Determine destination node | cilium_ipcache, cilium_tunnel_map |

2.2 to-container (Pod Ingress)

The BPF program that processes packets arriving at a Pod:

// Conceptual to-container processing flow
// Actual code: bpf/bpf_lxc.c - handle_ingress()

int to_container(struct __sk_buff *skb) {
    // 1. Parse packet
    // 2. Source Identity lookup
    //    - Source IP -> Identity mapping from ipcache
    // 3. Ingress policy check
    //    - Identity-based L3/L4 policy
    //    - L7 policy -> Envoy redirect
    // 4. Conntrack update
    // 5. Reverse NAT (for response packets)
    // 6. Deliver to Pod
    return TC_ACT_OK;
}

2.3 from-overlay / to-overlay

Packet processing through the overlay network (VXLAN/Geneve):

VXLAN receive flow:
[Physical NIC] -> [VXLAN decapsulation] -> [from-overlay BPF]
  - Extract inner Identity (Geneve TLV or source IP based)
  - Destination endpoint lookup
  - Policy check
  - Deliver to target Pod

VXLAN send flow:
[from-container BPF] -> [Routing: remote node] -> [to-overlay BPF]
  - Include Identity info in tunnel header
  - VXLAN/Geneve encapsulation
  - Forward to physical NIC

3. Connection Tracking Implementation

3.1 Conntrack Structure

Cilium implements its own eBPF-based connection tracking:

// Conntrack key structure (conceptual)
struct ct_key {
    __u32 src_ip;
    __u32 dst_ip;
    __u16 src_port;
    __u16 dst_port;
    __u8  protocol;    // TCP, UDP, ICMP
    __u8  direction;   // ingress, egress
};

// Conntrack value structure
struct ct_entry {
    __u64 rx_packets;
    __u64 rx_bytes;
    __u64 tx_packets;
    __u64 tx_bytes;
    __u32 lifetime;
    __u16 rev_nat_index;  // reverse NAT index
    __u16 src_sec_id;     // source Identity
    __u32 flags;
};

3.2 Conntrack State Machine

TCP connection state tracking:

SYN -> [NEW] -> SYN-ACK/ACK -> [ESTABLISHED] -> FIN -> [CLOSING] -> [CLOSED]

Timeouts per state:
- NEW: 60 seconds
- ESTABLISHED: 6 hours (TCP), 60 seconds (UDP)
- CLOSING: 10 seconds

3.3 Conntrack Usage

# Query conntrack table
cilium bpf ct list global

# Example TCP connection output:
# TCP IN  10.244.1.5:34567 -> 10.96.0.1:443
#   Expires: 21590s Identity: 48291
#   RxPackets: 142  RxBytes: 15234
#   TxPackets: 138  TxBytes: 12890
#   Flags: rx+tx established

# Count conntrack entries
cilium bpf ct list global | wc -l

# Filter conntrack for specific IP
cilium bpf ct list global | grep "10.244.1.5"

3.4 Conntrack GC (Garbage Collection)

Expired conntrack entries are periodically cleaned up:

Agent internal GC loop:
1. Iterate all entries in the BPF map
2. Identify entries with expired timeouts
3. Delete expired entries
4. Also delete associated NAT entries
5. Update metrics

GC interval: Configurable (default ~12 seconds)

4. NAT Engine

4.1 SNAT (Source NAT / Masquerade)

Translates source IP to node IP for traffic from Pods to outside the cluster:

SNAT processing flow:

Pod (10.244.1.5:34567) -> External (8.8.8.8:443)
    |
    v
[from-container BPF]
  - Check if destination is outside cluster
  - Determine if SNAT is needed
    |
    v
[SNAT Engine]
  - Translate source IP to node IP (10.244.1.5 -> 192.168.1.100)
  - Translate source port to ephemeral port
  - Store NAT mapping in BPF map
    |
    v
Node IP (192.168.1.100:50123) -> External (8.8.8.8:443)

Response packet (Reverse SNAT):
External (8.8.8.8:443) -> Node IP (192.168.1.100:50123)
    |
    v
[to-netdev BPF]
  - Look up original source in NAT mapping
  - Restore destination to original Pod IP/port
    |
    v
External (8.8.8.8:443) -> Pod (10.244.1.5:34567)

4.2 DNAT (Service Load Balancing)

Translates Kubernetes service IPs to actual backend Pod IPs:

Service DNAT flow:

Client (10.244.1.5) -> Service (10.96.0.10:80)
    |
    v
[from-container BPF: Service LB]
  - Look up service in cilium_lb4_services
  - Select backend (round-robin, Maglev, etc.)
  - Translate destination IP/port to backend
    |
    v
Client (10.244.1.5) -> Backend Pod (10.244.2.10:8080)

4.3 Maglev Consistent Hashing

Maglev is a consistent hashing algorithm developed by Google, used in Cilium for service load balancing:

Maglev hashing process:

1. Build a prime-sized lookup table (the Maglev paper uses 65537; Cilium's default table size is 16381)
   - Each backend maps to multiple positions in the table
   - Uniform distribution regardless of backend count

2. Compute packet hash
   - 5-tuple hash: src_ip + dst_ip + src_port + dst_port + protocol
   - Hash value determines table index

3. Select backend
   - backend = table[hash % table_size]

Impact on backend changes:
- Adding/removing 1 backend -> Only ~1/N connections remapped
- Most existing connections maintained

# Check Maglev configuration
cilium config | grep maglev
cilium config | grep maglev

# Check service backends
cilium bpf lb list

# Service Maglev lookup table
cilium bpf lb maglev list

4.4 Processing by Service Type

ClusterIP:
  - DNAT in from-container
  - Socket-level LB (bpf_sock) or tc level

NodePort:
  - DNAT at XDP or tc ingress (from-netdev)
  - DSR mode: Direct response packet delivery
  - SNAT mode: Response through original node

LoadBalancer:
  - Same as NodePort + ExternalIP mapping
  - IP allocated by LB-IPAM

ExternalTrafficPolicy: Local
  - Only process if local backends exist
  - Preserve client source IP

5. DSR (Direct Server Return) Mode

5.1 DSR Operation

In DSR mode, response packets are returned directly to the client instead of passing back through the node that originally received the request:

DSR flow:

1. Client -> Node A (NodePort)
   [Node A: DNAT + set DSR option]
   - Translate destination to backend Pod (Node B)
   - Encode original service IP/port in IP options

2. Node A -> Node B (backend Pod)
   [Node B: Process DSR option]
   - Restore original service IP/port from IP options
   - Store reverse NAT info in conntrack

3. Node B (backend Pod) -> Client
   [Node B: Reverse NAT]
   - Translate source IP to service IP
   - Respond directly without going through Node A

Normal mode (SNAT):
  Client -> Node A -> Node B -> Node A -> Client
  (2 extra hops)

DSR mode:
  Client -> Node A -> Node B -> Client
  (response is direct)

5.2 DSR Implementation Methods

IPv4 DSR:
  - Pass original address via IP Options or Geneve TLV
  - Minimize additional packet overhead

IPv6 DSR:
  - Use Segment Routing Header (SRH) or Extension Header
  - Native IPv6 support

6. eBPF Tail Calls: Program Chaining

6.1 Tail Call Mechanism

eBPF programs are subject to instruction and complexity limits enforced by the verifier, so complex processing is split across multiple programs chained via tail calls:

Tail call chain example:

[from-container]
    |
    tail_call -> [IPv4 policy check]
                    |
                    tail_call -> [Service LB]
                                    |
                                    tail_call -> [NAT processing]
                                                    |
                                                    tail_call -> [Forward]

6.2 Tail Call Map

// Tail call map structure (conceptual)
// BPF_MAP_TYPE_PROG_ARRAY
struct bpf_map tail_call_map = {
    .type = BPF_MAP_TYPE_PROG_ARRAY,
    .max_entries = 64,
    // index -> BPF program FD mapping
};

// Tail call invocation
// tail_call(skb, &tail_call_map, CILIUM_CALL_IPV4_FROM_LXC);

6.3 Key Tail Call Points in Cilium

CILIUM_CALL_IPV4_FROM_LXC      = 0   // IPv4 from-container entry
CILIUM_CALL_IPV4_CT_INGRESS    = 4   // IPv4 conntrack ingress
CILIUM_CALL_IPV4_CT_EGRESS     = 5   // IPv4 conntrack egress
CILIUM_CALL_IPV4_NODEPORT_NAT  = 13  // NodePort NAT
CILIUM_CALL_IPV4_NODEPORT_DSR  = 14  // NodePort DSR
CILIUM_CALL_IPV4_ENCAP         = 15  // Tunnel encapsulation
CILIUM_CALL_SEND_ICMP_UNREACH  = 18  // Send ICMP Unreachable
CILIUM_CALL_SRV6_ENCAP         = 23  // SRv6 encapsulation

7. Socket-Level Load Balancing

7.1 Socket LB Overview

Socket-level load balancing bypasses iptables and tc-level NAT by translating service IPs to backend IPs directly at the connect() system call:

Traditional approach (iptables/tc):
connect(serviceIP) -> [kernel network stack] -> [NAT] -> [conntrack] -> backend

Socket LB approach:
connect(serviceIP) -> [BPF sock_ops] -> connect(backendIP)
  - No NAT needed
  - No conntrack entries needed
  - Network stack overhead eliminated

7.2 BPF Socket Program Types

cgroup/connect4:     Intercept connect() (IPv4)
cgroup/connect6:     Intercept connect() (IPv6)
cgroup/sendmsg4:     Intercept UDP sendmsg() (IPv4)
cgroup/sendmsg6:     Intercept UDP sendmsg() (IPv6)
cgroup/recvmsg4:     Intercept UDP recvmsg() (IPv4)
cgroup/recvmsg6:     Intercept UDP recvmsg() (IPv6)
cgroup/getpeername4: Intercept getpeername() (IPv4)
cgroup/getpeername6: Intercept getpeername() (IPv6)

7.3 Socket LB Advantages

Performance comparison:

tc-level LB:
  - NAT performed per packet
  - Conntrack entries required
  - Additional CPU cycles

Socket-level LB:
  - Translation only once per connection
  - No conntrack needed
  - Minimal CPU overhead
  - Original service IP preserved (getpeername)

7.4 Socket LB Limitations

Socket LB is not applied for:
  - Traffic entering from external via NodePort
  - hostNetwork Pod traffic
  - Certain kube-proxy compatibility modes
  - Services with L7 policies (Envoy redirect needed)

In these cases, tc-level fallback processing is used

8. Packet Drops and Monitoring

8.1 Drop Reason Codes

Cilium provides detailed reason codes when packets are dropped:

# Monitor dropped packets
cilium monitor --type drop

# Example output:
# xx drop (Policy denied) flow ...
# xx drop (Invalid source ip) flow ...
# xx drop (CT: Map insertion failed) flow ...

Key drop reasons:

| Code | Description |
| --- | --- |
| Policy denied | Denied by network policy |
| Invalid source ip | Source IP failed validation |
| CT: Map insertion failed | Conntrack map insertion failed |
| No mapping for NAT | No NAT mapping found for the packet |
| Unknown L3 target | Unknown L3 destination |
| Stale or unroutable IP | Stale or unroutable IP |
| Authentication required | mTLS authentication required |
| Service backend not found | No service backend available |

8.2 Packet Tracing

# Real-time packet trace
cilium monitor --type trace

# Monitor traffic for a specific endpoint only
cilium monitor --type trace --from-endpoint 1234

# Policy verdict monitoring
cilium monitor --type policy-verdict

# Debug-level monitoring
cilium monitor --type debug

9. Performance Optimization Techniques

9.1 XDP Acceleration

Performance by XDP mode:
  - XDP native (driver-integrated): Best performance
  - XDP generic (software): Compatibility first
  - XDP offload (NIC hardware): Specific NICs only

XDP use cases:
  - NodePort acceleration
  - LoadBalancer acceleration
  - DDoS mitigation (packet drop)
  - Pre-filtering

9.2 BIG TCP

BIG TCP operation:
  - GRO (Generic Receive Offload) aggregates received packets into batches beyond the traditional 64KB limit
  - The stack processes these large packets internally, reducing per-packet overhead
  - GSO (Generic Segmentation Offload) re-segments them on transmit
  - Internal processing is optimized without changing the on-wire MTU

9.3 eBPF Program Optimization

Compile-time optimization:
  - Exclude unused features from compilation
  - Endpoints without policies: minimal BPF code
  - IPv4-only environments: remove IPv6 code

Runtime optimization:
  - BPF map prefetching
  - Inline function usage
  - Eliminate unnecessary conditional branches
  - Native code execution via JIT compilation

10. Datapath Debugging

10.1 BPF Program Status Check

# List all BPF programs
bpftool prog list

# BPF programs for specific interface
tc filter show dev lxc1234 ingress
tc filter show dev lxc1234 egress

# BPF program statistics
bpftool -j prog show id 42

# BPF map statistics
bpftool -j map show id 10

10.2 Performance Metrics

# Datapath-related metrics
cilium metrics list | grep datapath

# Key metrics:
# cilium_datapath_conntrack_gc_runs_total
# cilium_datapath_conntrack_gc_entries
# cilium_bpf_map_ops_total
# cilium_drop_count_total
# cilium_forward_count_total

Summary

Cilium's eBPF datapath achieves high performance through these core design principles:

  • In-Kernel Processing: All packet processing in kernel space, eliminating user space transition overhead
  • Tail Call Chaining: Complex logic split across multiple BPF programs for modularity
  • Custom Conntrack: High-performance BPF map-based connection tracking instead of Linux netfilter conntrack
  • Socket-Level LB: Service load balancing without NAT for minimal overhead
  • DSR Mode: Eliminates unnecessary network hops, reducing response latency
  • XDP Acceleration: Ultra-fast packet processing at the network driver level