eBPF Deep Dive — The VM, Verifier, XDP, and CO-RE That Made the Linux Kernel Programmable (2025)
Author: Youngju Kim (@fjvbn20031)
TL;DR
- eBPF is a VM that runs sandboxed bytecode inside the Linux kernel. You can observe and modify kernel behavior without building kernel modules.
- 11-register, 64-bit RISC-like ISA. clang -target bpf compiles C to bytecode, then the kernel JITs it to native code.
- The Verifier guarantees safety. Before loading, symbolic execution proves: no infinite loops, pointer bounds checked, no uninitialized reads.
- Maps are shared structures readable from kernel and user space. Hash, Array, LRU, LPM Trie, Ring Buffer, Per-CPU — dozens of variants.
- Program types decide the hook: kprobe, tracepoint, XDP, TC, LSM, cgroup_skb, and more.
- CO-RE + BTF solves the struct-layout drift problem. Compile once, run on any kernel.
- XDP runs right after the NIC driver, pushing tens of millions of packets per second per core. DPDK-class performance without leaving the kernel.
- In production: Cilium (K8s networking/security), Pixie (observability), Falco (runtime security), bpftrace (ad-hoc tracing), Katran (L4 LB).
1. Why eBPF — A Short History
1.1 The Old Dilemma
Extending the Linux kernel gave you three options:
- Upstream patch: years to merge; small features rejected.
- Loadable kernel module: a crash panics the box; rebuild per kernel version.
- User-space path: syscalls/ioctls; context switches dominate.
For an L4 load balancer: HAProxy pays per-packet user-kernel switches; IPVS is fast but risky; DPDK hits tens of millions of pps but burns a full core and bypasses the kernel stack. eBPF opened a fourth path: run your sandboxed code inside the kernel, with safety proven by the verifier.
1.2 Classic BPF (1992)
Steven McCanne and Van Jacobson built the Berkeley Packet Filter for tcpdump: filter packets in-kernel with a 32-bit, two-register ISA. 10-20x faster captures. This is cBPF.
1.3 Alexei Starovoitov's 2014 Rewrite
Alexei Starovoitov introduced eBPF: 11 registers, 64-bit, maps, helper functions, JIT, a much stronger verifier, and hooks beyond networking. "Kernel-module power, user-space safety."
1.4 2025 Ecosystem
Meta's Katran, Netflix FlowLogs, GKE networking, Cloudflare DDoS, Cilium (de facto K8s CNI), RHEL 9 shipping bpftrace/bcc/libbpf. eBPF Foundation sits under the Linux Foundation.
2. The eBPF Virtual Machine
2.1 Instruction Format
Fixed 64-bit (8-byte) instructions:
struct bpf_insn {
    __u8  code;       // 8 bits: opcode
    __u8  dst_reg:4;  // 4 bits: destination register
    __u8  src_reg:4;  // 4 bits: source register
    __s16 off;        // 16 bits: signed offset
    __s32 imm;        // 32 bits: signed immediate
};
2.2 Registers
R0 : return value
R1-R5: function arguments (C ABI)
R6-R9: callee-saved
R10  : frame pointer (read-only; the only path to stack memory)
2.3 Example
int add(int a, int b) { return a + b; }
Compiled with clang -target bpf -O2:
; 0000000000000000 <add>:
; 0: bf 10 00 00 00 00 00 00 r0 = r1 ; R0 = a
; 1: 0f 20 00 00 00 00 00 00 r0 += r2 ; R0 += b
; 2: 95 00 00 00 00 00 00 00 exit
2.4 JIT
The kernel JITs bytecode on load (x86_64, ARM64, RISC-V, MIPS, PowerPC, s390x, SPARC). Enabled by default on modern kernels:
sysctl net.core.bpf_jit_enable # 1 = enabled
Performance approaches native C.
2.5 Instruction Limits and Bounded Loops
- Pre-5.2: 4,096 insns per program.
- 5.2+: 1M insns, bounded by a verifier complexity budget.
- 5.3+: bounded loops allowed if the verifier can prove an upper bound.
for (int i = 0; i < 64; i++) { /* OK */ }
#pragma unroll
for (int i = 0; i < 4; i++) { /* always OK */ }
for (int i = 0; i < x; i++) { /* rejected: x has no provable upper bound */ }
3. The Verifier — eBPF's Heart
3.1 What It Proves
- Termination: every path reaches exit.
- Memory safety: pointer access stays in bounds.
- Initialization: no uninitialized reads.
- Type safety: scalars never dereferenced as pointers.
- Complexity bound: verification itself terminates in reasonable time.
On failure, bpf() returns -EINVAL with a detailed log.
3.2 Symbolic Execution
The verifier tracks value ranges per register and forks at every branch:
int x = get_pid(); // x: [0, 4194304]
if (x > 100) {
// in this branch: x in [101, 4194304]
use(x);
}
It tracks s32_min/max, u32_min/max, s64_min/max, and pointer kind plus offset.
3.3 Pointer Type System
PTR_TO_CTX : program context
PTR_TO_PACKET : packet data
PTR_TO_PACKET_END : packet end
PTR_TO_MAP_KEY : map key
PTR_TO_MAP_VALUE : map value
PTR_TO_STACK : stack memory
PTR_TO_SOCKET : socket
SCALAR_VALUE : non-pointer
PTR_TO_MAP_VALUE requires a NULL check before deref:
void *val = bpf_map_lookup_elem(&my_map, &key);
if (!val) return 0;
*(int *)val = 42;
3.4 Packet Bounds Checks
SEC("xdp")
int drop_port_80(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end)  // required
        return XDP_PASS;
    if (eth->h_proto == bpf_htons(ETH_P_IP)) { /* ... */ }
    return XDP_PASS;
}
3.5 Reading Verifier Errors
Logs are long but regular:
17: invalid access to packet, off=14 size=2, R1(id=0,off=14,r=0)
R1 offset is outside of the packet
Fix: add a data + N > data_end check before the access.
3.6 Complexity Budget
If too many paths are explored, loading fails with "BPF program is too complex." The budget is 1,000,000 verified instructions on 5.2+. Mitigations: prefer bounded loops over full unrolling, mark helpers __always_inline, and rely on state pruning (improved in 4.19+).
4. Maps
4.1 Why
- 512-byte stack limit.
- No direct sharing between programs.
- Need user-space I/O.
Maps are typed KV stores managed by the kernel and accessible from both sides.
4.2 Common Types
| Type | Use |
|---|---|
| BPF_MAP_TYPE_HASH | per-PID stats |
| BPF_MAP_TYPE_ARRAY | counters, config |
| BPF_MAP_TYPE_PERCPU_HASH | lock-free counters |
| BPF_MAP_TYPE_LRU_HASH | connection tracking |
| BPF_MAP_TYPE_LPM_TRIE | routing tables |
| BPF_MAP_TYPE_STACK_TRACE | profiling |
| BPF_MAP_TYPE_RINGBUF | events to user (5.8+) |
| BPF_MAP_TYPE_PROG_ARRAY | tail calls |
| BPF_MAP_TYPE_DEVMAP | XDP redirect |
| BPF_MAP_TYPE_SK_STORAGE | per-socket state |
| BPF_MAP_TYPE_TASK_STORAGE | per-task state |
4.3 Declaration (libbpf)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u64);
    __uint(max_entries, 10240);
} pid_counts SEC(".maps");
4.4 Access from the Program
u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 *count = bpf_map_lookup_elem(&pid_counts, &pid);
if (!count) {
    u64 one = 1;
    /* BPF_NOEXIST: if another CPU inserted first, this update fails
       instead of resetting the counter. */
    bpf_map_update_elem(&pid_counts, &pid, &one, BPF_NOEXIST);
} else {
    __sync_fetch_and_add(count, 1);
}
4.5 Per-CPU Maps
Per-CPU maps avoid cache-line contention by giving each CPU its own slot. Sum across CPUs in user space:
int ncpus = libbpf_num_possible_cpus();
u64 vals[ncpus];
bpf_map_lookup_elem(fd, &key, vals);
u64 total = 0;
for (int i = 0; i < ncpus; i++) total += vals[i];
4.6 Ring Buffer
Linux 5.8+, MPSC, back-pressure aware, event-loss counters:
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

struct event {
    u32 pid;
    char comm[16];  /* TASK_COMM_LEN */
};

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx) {
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}
5. Program Types and Hooks
5.1 Kprobe / Kretprobe
Hook kernel function entry/exit. Works on almost any kernel function, but names/signatures drift across versions.
SEC("kprobe/do_sys_openat2")
int kprobe_openat(struct pt_regs *ctx) {
    const char *filename = (const char *)PT_REGS_PARM2(ctx);
    char buf[256];
    bpf_probe_read_user_str(buf, sizeof(buf), filename);
    return 0;
}
5.2 Tracepoint
Stable ABI events explicitly defined by kernel developers. Listed under /sys/kernel/debug/tracing/events.
5.3 Fentry / Fexit (BPF Trampoline, 5.5+)
Faster than kprobe. Replaces a 5-byte NOP at function entry with a call.
SEC("fentry/tcp_v4_connect")
int BPF_PROG(connect_entry, struct sock *sk) { return 0; }
SEC("fexit/tcp_v4_connect")
int BPF_PROG(connect_exit, struct sock *sk, int ret) { return 0; }
5.4 XDP
Runs right after the NIC driver RX, before sk_buff allocation.
SEC("xdp")
int xdp_drop_tcp_80(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP) return XDP_PASS;

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end) return XDP_PASS;
    if (tcp->dest == bpf_htons(80)) return XDP_DROP;

    return XDP_PASS;
}
Return codes: XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT, XDP_ABORTED.
5.5 TC BPF
Runs at the qdisc level (egress + ingress). Used by Cilium for identity-based filtering and L7 mark-and-route prep.
5.6 cgroup_skb / cgroup_sock
Per-cgroup network control, used by K8s namespace isolation.
5.7 LSM BPF (5.7+)
Security policies implemented dynamically via eBPF, no recompile needed:
SEC("lsm/file_open")
int BPF_PROG(block_secret_file, struct file *file) {
    char name[32];
    bpf_probe_read_kernel_str(name, sizeof(name),
                              file->f_path.dentry->d_iname);
    /* No libc strcmp in BPF: use the bpf_strncmp helper (5.17+). */
    if (bpf_strncmp(name, sizeof("secret"), "secret") == 0)
        return -EACCES;
    return 0;
}
5.8 Sockops
Hooks into socket lifecycle. Cilium uses this to translate connect(svc_ip) into a direct backend connection, skipping iptables entirely.
6. CO-RE — Compile Once, Run Everywhere
6.1 The Problem
Struct field offsets change across kernel versions. A prebuilt .o reads the wrong bytes on another kernel. BCC's fix was runtime LLVM compilation — hundreds of MB and seconds per load.
6.2 BTF + CO-RE
BTF stores the kernel's type info at /sys/kernel/btf/vmlinux. CO-RE marks field accesses as relocatable so libbpf rewrites offsets at load time against the target kernel.
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_sendmsg")
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    u16 sport;
    BPF_CORE_READ_INTO(&sport, sk, __sk_common.skc_num);
    return 0;
}
6.3 Generating vmlinux.h
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
No kernel headers needed.
6.4 Feature Detection
/* comm_size is an illustrative field name */
if (bpf_core_field_exists(task->comm_size)) {
    /* new kernel */
} else {
    /* old kernel */
}
6.5 What CO-RE Changed
Containers ship a single multi-MB binary (libbpf + CO-RE .bpf.o) instead of hundreds of MB of LLVM + kernel headers. Cilium, Falco, Pixie, Inspektor Gadget all rely on it.
7. XDP — Ultra-Fast Packet Processing
7.1 Position
NIC -> XDP (right after driver RX) -> sk_buff alloc -> TC ingress -> routing -> TC egress -> driver TX -> NIC
XDP runs before sk_buff allocation, saving hundreds of nanoseconds per packet.
7.2 Modes
- Native XDP: driver-integrated, fastest.
- Generic XDP: kernel emulation, slower.
- Offloaded XDP: runs on SmartNIC hardware.
7.3 Numbers
| Method | pps (single core) | Latency |
|---|---|---|
| Kernel stack (iptables) | ~1 Mpps | ~10 us |
| XDP Native | ~24 Mpps | ~1 us |
| DPDK | ~30 Mpps | ~0.5 us |
7.4 DDoS Filtering (Cloudflare-style)
LPM-trie blacklists plus per-source SYN-rate maps, dropped at XDP_DROP. Cloudflare has absorbed 5-10 Tbps floods at the NIC edge this way.
7.5 Katran (Meta L4 LB)
- DSR (Direct Server Return).
- Maglev consistent hashing.
- Millions of pps per host.
8. Cilium
8.1 K8s Networking Pain
iptables mode scales O(n) with service count. Even IPVS still pays for sk_buff and netfilter hooks.
8.2 Cilium's Bypass
Cilium attaches eBPF to veth-pair TC hooks and handles service resolution, policy, LB, and encryption directly, skipping iptables.
8.3 Identity-Based Policy
Labels map to 16-bit identities. Policy is an O(1) map lookup on (src_identity, dst_identity, port) regardless of how many policies exist.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
8.4 Replacing kube-proxy
Sockops intercepts connect(svc_ip), performs DNAT via O(1) map lookup, and connects straight to the backend. For same-node pods, sockmap redirect bypasses the network stack entirely.
9. Observability: bpftrace, bcc, Pixie
9.1 bpftrace
DTrace-like one-liners:
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
bpftrace -e '
kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/ {
@latency = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
9.2 bcc
Python harness that embeds BPF C. Needs runtime LLVM; being phased out in favor of libbpf + CO-RE.
9.3 libbpf + skeleton
Production-grade. Single binary, CO-RE-portable:
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/syscalls/sys_enter_openat")
int handle_openat(void *ctx) {
    u32 *e = bpf_ringbuf_reserve(&events, sizeof(u32), 0);
    if (!e) return 0;
    *e = bpf_get_current_pid_tgid() >> 32;
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
9.4 Pixie
DaemonSet uses uprobes on libssl/libc/libgrpc to decode HTTP/gRPC/SQL in-cluster, without code changes.
10. Falco — Runtime Security
Falco watches every syscall via eBPF and fires on rule matches:
- rule: Read sensitive file untrusted
  desc: Detect reading sensitive files
  condition: >
    sensitive_files and open_read
    and proc_name_exists
    and not proc.name in (trusted_procs)
  output: >
    Sensitive file opened (user=%user.name
    command=%proc.cmdline file=%fd.name)
  priority: WARNING
Moving from kernel module to eBPF made it deployable on any K8s node with sub-1 percent overhead.
11. Limits and Caveats
11.1 Verifier Walls
Complex logic must be split via tail calls or bpf_loop (5.17+).
11.2 GPL Constraint
Most useful helpers are GPL-only:
char LICENSE[] SEC("license") = "GPL";
Without it, the verifier refuses the load.
11.3 Overhead
XDP microbenchmarks:
- Empty: ~24 Mpps
- 1 map lookup: ~18 Mpps
- 3 lookups + logic: ~10 Mpps
11.4 Debugging Is Hard
No GDB, bpf_printk goes to trace_pipe, no stack traces. Use bpftool prog dump jited, read verifier logs carefully, and drive tests via BPF_PROG_TEST_RUN.
11.5 Observer Effect
Every uprobe/kprobe hit adds per-call overhead (uprobes in particular can cost microseconds). Pre-filter in BPF and ship only interesting events to user space via the ring buffer.
12. Security
12.1 Is eBPF Safe?
The verifier enforces memory safety, not intent. Side channels (Spectre-like), verifier DoS, and kernel-memory leaks by privileged users are real concerns — mitigated by kernel lockdown and CAP_BPF.
12.2 Capabilities
- CAP_SYS_ADMIN: legacy broad access.
- CAP_BPF (5.8+): scoped BPF operations.
- CAP_PERFMON: tracing.
- CAP_NET_ADMIN: network BPF.
12.3 Unprivileged BPF
Disabled by default after Spectre (kernel.unprivileged_bpf_disabled=1). Effectively dead in 2025.
13. Benchmarking and Tuning
13.1 XDP
ip link set dev eth0 xdp obj drop.o sec xdp
ip -s link show eth0
bpftool prog show + bpftool prog profile, plus perf stat -e cycles,instructions.
13.2 Verifier Profiling
bpftool -d prog load prog.o /sys/fs/bpf/prog 2>&1 | tee verifier.log
The -d flag makes bpftool print the full libbpf and verifier log.
13.3 XDP Tuning
- Batch bounds checks across contiguous fields.
- Avoid bpf_xdp_adjust_head when possible.
- Grow the NIC RX ring.
- Pin NIC IRQs to specific cores; use RPS/RFS where appropriate.
- Use huge pages to cut TLB misses.
14. Tail Calls and BPF-to-BPF Calls
14.1 Tail Calls
Jump between programs via a BPF_MAP_TYPE_PROG_ARRAY:
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, u32);
    __type(value, u32);
} prog_array SEC(".maps");

SEC("xdp")
int entry(struct xdp_md *ctx) {
    bpf_tail_call(ctx, &prog_array, NEXT_PROG);
    return XDP_PASS;  /* reached only if the tail call fails */
}
Stack is lost across tail calls; max chain depth 33.
14.2 BPF-to-BPF Calls (4.16+)
Normal function calls. No recursion allowed.
15. Future and Trends
15.1 eBPF for Windows
Microsoft's port is maturing; it aims to let the same eBPF bytecode target both kernels.
15.2 sched_ext (Linux 6.12)
Write the process scheduler itself in eBPF, replacing CFS/EEVDF with custom policies:
SEC("struct_ops/scx_example_enqueue")
void BPF_STRUCT_OPS(enqueue, struct task_struct *p, u64 enq_flags) {
scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}
15.3 Rust via aya
use aya_bpf::{macros::xdp, programs::XdpContext};

#[xdp]
pub fn my_xdp(ctx: XdpContext) -> u32 {
    match unsafe { try_my_xdp(ctx) } {
        Ok(ret) => ret,
        Err(_) => xdp_action::XDP_PASS,
    }
}
15.4 Service Mesh
Cilium Service Mesh uses eBPF for L4 + identity + mTLS and per-node envoy for L7, eliminating per-pod sidecars — about 70 percent overhead reduction.
16. Learning Path
- Read ebpf.io and the official "What is eBPF."
- Use bpftrace one-liners.
- Work through BCC's tools/ directory.
- Move to libbpf-bootstrap and skeletons.
- Read Cilium bpf/, Katran, Tetragon.
Books: "Learning eBPF" (Liz Rice), "Linux Observability with BPF" (Calavera & Fontana). Conferences: eBPF Summit, LPC BPF track, KubeCon.
17. Summary Cheat Sheet
eBPF Cheat Sheet
- VM: 11 regs, 64-bit RISC-like, JIT, 1M insn limit
- Verifier: symbolic execution, memory safety, pointer typing
- Maps: Hash, Array, LRU, LPM trie, Ring buffer, Per-CPU
- Program types: kprobe, tracepoint, fentry/fexit, XDP, TC, cgroup_skb, LSM, sockops
- CO-RE: BTF relocations, single binary, all kernels
- Production: Cilium, Katran, Pixie, Falco, bpftrace
- Tools: bpftool, libbpf, bcc, aya (Rust)
18. Quiz
Q1. What's the most common reason the verifier rejects a program?
A. Missing bounds checks before pointer access or dereferencing a map lookup without a NULL check. In XDP, look for invalid access to packet, off=N size=M or R1 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL.
Q2. What does CO-RE solve?
A. Struct-offset drift between kernels. BTF stores each kernel's type info; libbpf rewrites field offsets at load time so one binary runs on many kernels.
Q3. Why is XDP much faster than iptables?
A. iptables runs after sk_buff allocation and walks netfilter chains. XDP runs right after the NIC driver, before the sk_buff exists, executes JIT-compiled native code, and skips rule-tree traversal.
Q4. Why are per-CPU maps faster than a global hash?
A. No cache-line contention. Each CPU writes its own slot instead of bouncing a shared line between cores. User space sums slots when reading.
Q5. How does Cilium replace kube-proxy without iptables?
A. A sockops program hooks connect(), looks up the backend in an O(1) BPF map, rewrites the destination, and connects straight to the backend. For same-node pods, sockmap redirect bypasses the network stack entirely.
Q6. Downside of tail calls?
A. Stack is lost — locals don't survive the call. Share state via maps. Max chain depth is 33.
Q7. How do you hook a user-space function?
A. uprobe or uretprobe. Pixie hooks libssl SSL_read/write to see HTTPS payloads post-TLS termination.
If you enjoyed this, check out:
- "Linux Network Stack Deep Dive"
- "Cilium Architecture Complete Guide"
- "io_uring Deep Dive"
- "DPDK vs XDP Compared"