
eBPF Deep Dive — The VM, Verifier, XDP, and CO-RE That Made the Linux Kernel Programmable (2025)


TL;DR

  • eBPF is a VM that runs sandboxed bytecode inside the Linux kernel. You can observe and modify kernel behavior without building kernel modules.
  • 11-register 64-bit RISC-like ISA. clang -target bpf compiles C to bytecode, then the kernel JITs it to native.
  • The Verifier guarantees safety. Before loading, symbolic execution proves: no infinite loops, pointer bounds checked, no uninitialized reads.
  • Maps are shared structures readable from kernel and user space. Hash, Array, LRU, LPM Trie, Ring Buffer, Per-CPU — dozens of variants.
  • Program types decide the hook: kprobe, tracepoint, XDP, TC, LSM, cgroup_skb, and more.
  • CO-RE + BTF solves the struct-layout drift problem. Compile once, run on any kernel.
  • XDP runs right after the NIC driver, pushing tens of millions of packets per second per core. DPDK-class performance without leaving the kernel.
  • In production: Cilium (K8s networking/security), Pixie (observability), Falco (runtime security), bpftrace (ad-hoc tracing), Katran (L4 LB).

1. Why eBPF — A Short History

1.1 The Old Dilemma

Extending the Linux kernel gave you three options:

  1. Upstream patch: years to merge; small features rejected.
  2. Loadable kernel module: a crash panics the box; rebuild per kernel version.
  3. User-space path: syscalls/ioctls; context switches dominate.

For an L4 load balancer: HAProxy pays per-packet user-kernel switches; IPVS is fast but risky; DPDK hits tens of millions of pps but burns a full core and bypasses the kernel stack. eBPF opened a fourth path: run your sandboxed code inside the kernel, with safety proven by the verifier.

1.2 Classic BPF (1992)

Van Jacobson and Steven McCanne built the Berkeley Packet Filter for tcpdump: filter packets in-kernel, 32-bit ISA, two registers. 10-20x faster captures. This is cBPF.

1.3 Alexei Starovoitov's 2014 Rewrite

Alexei Starovoitov introduced eBPF: 11 registers, 64-bit, maps, helper functions, JIT, a much stronger verifier, and hooks beyond networking. "Kernel-module power, user-space safety."

1.4 2025 Ecosystem

Meta's Katran, Netflix FlowLogs, GKE networking, Cloudflare DDoS, Cilium (de facto K8s CNI), RHEL 9 shipping bpftrace/bcc/libbpf. eBPF Foundation sits under the Linux Foundation.


2. The eBPF Virtual Machine

2.1 Instruction Format

Fixed 64-bit (8-byte) instructions:

struct bpf_insn {
    __u8    code;       // 8 bits: opcode
    __u8    dst_reg:4;  // 4 bits: destination register
    __u8    src_reg:4;  // 4 bits: source register
    __s16   off;        // 16 bits: signed offset
    __s32   imm;        // 32 bits: signed immediate
};

2.2 Registers

R0  : return value
R1-R5: function arguments (C ABI)
R6-R9: callee-saved
R10 : frame pointer (read-only, stack access)

R10 is read-only and the only path to stack memory.

2.3 Example

int add(int a, int b) { return a + b; }

Compiled with clang -target bpf -O2:

; 0000000000000000 <add>:
;    0:  bf 10 00 00 00 00 00 00   r0 = r1       ; R0 = a
;    1:  0f 20 00 00 00 00 00 00   r0 += r2      ; R0 += b
;    2:  95 00 00 00 00 00 00 00   exit

2.4 JIT

The kernel JITs bytecode on load (x86_64, ARM64, RISC-V, MIPS, PowerPC, s390x, SPARC). Enabled by default on modern kernels:

sysctl net.core.bpf_jit_enable  # 1 = enabled

Performance approaches native C.

2.5 Instruction Limits and Bounded Loops

  • Pre-5.2: 4,096 insns per function.
  • 5.2+: 1M insns, bounded by a verifier complexity budget.
  • 5.3+: bounded loops allowed if the verifier can prove an upper bound.
for (int i = 0; i < 64; i++) { /* OK */ }

#pragma unroll
for (int i = 0; i < 4; i++) { /* always OK */ }

for (int i = 0; i < x; i++) { /* rejected: x has no provable upper bound */ }

3. The Verifier — eBPF's Heart

3.1 What It Proves

  1. Termination: every path reaches exit.
  2. Memory safety: pointer access stays in bounds.
  3. Initialization: no uninitialized reads.
  4. Type safety: scalars never dereferenced as pointers.
  5. Complexity bound: verification itself terminates in reasonable time.

On failure, the bpf() syscall fails (typically -EINVAL or -EACCES) and emits a detailed log.

3.2 Symbolic Execution

The verifier tracks value ranges per register and forks at every branch:

int x = get_pid();   // x: [0, 4194304]
if (x > 100) {
    // in this branch: x in [101, 4194304]
    use(x);
}

It tracks s32_min/max, u32_min/max, s64_min/max, and pointer kind plus offset.

3.3 Pointer Type System

PTR_TO_CTX        : program context
PTR_TO_PACKET     : packet data
PTR_TO_PACKET_END : packet end
PTR_TO_MAP_KEY    : map key
PTR_TO_MAP_VALUE  : map value
PTR_TO_STACK      : stack memory
PTR_TO_SOCKET     : socket
SCALAR_VALUE      : non-pointer

PTR_TO_MAP_VALUE requires a NULL check before deref:

void *val = bpf_map_lookup_elem(&my_map, &key);
if (!val) return 0;
*(int *)val = 42;

3.4 Packet Bounds Checks

SEC("xdp")
int drop_port_80(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if (data + sizeof(*eth) > data_end)   // required
        return XDP_PASS;
    if (eth->h_proto == bpf_htons(ETH_P_IP)) { /* ... */ }
    return XDP_PASS;
}

3.5 Reading Verifier Errors

Logs are long but regular:

17: invalid access to packet, off=14 size=2, R1(id=0,off=14,r=0)
R1 offset is outside of the packet

Fix: add a data + N > data_end check before the access.

3.6 Complexity Budget

If too many paths are explored, you get "BPF program is too complex." Budget: 1,000,000 verified instructions on 5.2+. Mitigations: prefer bounded loops over unroll, __always_inline helpers, rely on state pruning (4.19+).


4. Maps

4.1 Why

  • 512-byte stack limit.
  • No direct sharing between programs.
  • Need user-space I/O.

Maps are typed KV stores managed by the kernel and accessible from both sides.

4.2 Common Types

Type                        Use
BPF_MAP_TYPE_HASH           per-PID stats
BPF_MAP_TYPE_ARRAY          counters, config
BPF_MAP_TYPE_PERCPU_HASH    lock-free counters
BPF_MAP_TYPE_LRU_HASH       connection tracking
BPF_MAP_TYPE_LPM_TRIE       routing tables
BPF_MAP_TYPE_STACK_TRACE    profiling
BPF_MAP_TYPE_RINGBUF        events to user (5.8+)
BPF_MAP_TYPE_PROG_ARRAY     tail calls
BPF_MAP_TYPE_DEVMAP         XDP redirect
BPF_MAP_TYPE_SK_STORAGE     per-socket state
BPF_MAP_TYPE_TASK_STORAGE   per-task state

4.3 Declaration (libbpf)

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u64);
    __uint(max_entries, 10240);
} pid_counts SEC(".maps");

4.4 Access from the Program

u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 *count = bpf_map_lookup_elem(&pid_counts, &pid);
if (!count) {
    u64 one = 1;
    bpf_map_update_elem(&pid_counts, &pid, &one, BPF_ANY);
} else {
    __sync_fetch_and_add(count, 1);
}

4.5 Per-CPU Maps

Per-CPU maps avoid cache-line contention by giving each CPU its own slot. Sum across CPUs in user space:

int ncpus = libbpf_num_possible_cpus();
u64 vals[ncpus];
bpf_map_lookup_elem(fd, &key, vals);
u64 total = 0;
for (int i = 0; i < ncpus; i++) total += vals[i];

4.6 Ring Buffer

Linux 5.8+, MPSC, back-pressure aware, event-loss counters:

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx) {
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}

5. Program Types and Hooks

5.1 Kprobe / Kretprobe

Hook kernel function entry/exit. Works on almost any kernel function, but names/signatures drift across versions.

SEC("kprobe/do_sys_openat2")
int kprobe_openat(struct pt_regs *ctx) {
    const char *filename = (const char *)PT_REGS_PARM2(ctx);
    char buf[256];
    bpf_probe_read_user_str(buf, sizeof(buf), filename);
    return 0;
}

5.2 Tracepoint

Stable ABI events explicitly defined by kernel developers. Listed under /sys/kernel/debug/tracing/events.

5.3 Fentry / Fexit (BPF Trampoline, 5.5+)

Faster than kprobe. Replaces a 5-byte NOP at function entry with a call.

SEC("fentry/tcp_v4_connect")
int BPF_PROG(connect_entry, struct sock *sk) { return 0; }

SEC("fexit/tcp_v4_connect")
int BPF_PROG(connect_exit, struct sock *sk, int ret) { return 0; }

5.4 XDP

Runs right after the NIC driver RX, before sk_buff allocation.

SEC("xdp")
int xdp_drop_tcp_80(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP) return XDP_PASS;
    if (ip->ihl < 5) return XDP_PASS;  // reject malformed header length

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end) return XDP_PASS;

    if (tcp->dest == bpf_htons(80)) return XDP_DROP;
    return XDP_PASS;
}

Return codes: XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT, XDP_ABORTED.

5.5 TC BPF

Runs at the qdisc level (egress + ingress). Used by Cilium for identity-based filtering and L7 mark-and-route prep.

5.6 cgroup_skb / cgroup_sock

Per-cgroup network control, used by K8s namespace isolation.

5.7 LSM BPF (5.7+)

Security policies implemented dynamically via eBPF, no recompile needed:

SEC("lsm/file_open")
int BPF_PROG(block_secret_file, struct file *file) {
    char name[256];
    bpf_probe_read_kernel_str(name, sizeof(name),
        file->f_path.dentry->d_iname);
    /* no strcmp in BPF programs; bpf_strncmp is a helper since 5.17 */
    if (bpf_strncmp(name, sizeof("secret"), "secret") == 0)
        return -EACCES;
    return 0;
}

5.8 Sockops

Hooks into socket lifecycle. Cilium uses this to translate connect(svc_ip) into a direct backend connection, skipping iptables entirely.


6. CO-RE — Compile Once, Run Everywhere

6.1 The Problem

Struct field offsets change across kernel versions. A prebuilt .o reads the wrong bytes on another kernel. BCC's fix was runtime LLVM compilation — hundreds of MB and seconds per load.

6.2 BTF + CO-RE

BTF stores the kernel's type info at /sys/kernel/btf/vmlinux. CO-RE marks field accesses as relocatable so libbpf rewrites offsets at load time against the target kernel.

#include <vmlinux.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_sendmsg")
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    u16 sport;
    BPF_CORE_READ_INTO(&sport, sk, __sk_common.skc_num);
    return 0;
}

6.3 Generating vmlinux.h

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

No kernel headers needed.

6.4 Feature Detection

if (bpf_core_field_exists(task->comm_size)) {
    /* new kernel */
} else {
    /* old kernel */
}

6.5 What CO-RE Changed

Containers ship a single multi-MB binary (libbpf + CO-RE .bpf.o) instead of hundreds of MB of LLVM + kernel headers. Cilium, Falco, Pixie, Inspektor Gadget all rely on it.


7. XDP — Ultra-Fast Packet Processing

7.1 Position

NIC -> XDP (right after driver RX) -> sk_buff alloc -> TC ingress -> routing -> TC egress -> driver TX -> NIC

XDP runs before sk_buff allocation, saving hundreds of nanoseconds per packet.

7.2 Modes

  1. Native XDP: driver-integrated, fastest.
  2. Generic XDP: kernel emulation, slower.
  3. Offloaded XDP: runs on SmartNIC hardware.

7.3 Numbers

Method                    pps (single core)   Latency
Kernel stack (iptables)   ~1 Mpps             ~10 us
XDP Native                ~24 Mpps            ~1 us
DPDK                      ~30 Mpps            ~0.5 us

7.4 DDoS Filtering (Cloudflare-style)

LPM-trie blacklists plus per-source SYN-rate maps, dropped at XDP_DROP. Cloudflare has absorbed 5-10 Tbps floods at the NIC edge this way.

7.5 Katran (Meta L4 LB)

  • DSR (Direct Server Return).
  • Maglev consistent hashing.
  • Millions of pps per host.

8. Cilium

8.1 K8s Networking Pain

iptables mode scales O(n) with service count. Even IPVS still pays for sk_buff and netfilter hooks.

8.2 Cilium's Bypass

Cilium attaches eBPF to veth-pair TC hooks and handles service resolution, policy, LB, and encryption directly, skipping iptables.

8.3 Identity-Based Policy

Labels map to 16-bit identities. Policy is an O(1) map lookup on (src_identity, dst_identity, port) regardless of how many policies exist.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP

8.4 Replacing kube-proxy

Sockops intercepts connect(svc_ip), performs DNAT via O(1) map lookup, and connects straight to the backend. For same-node pods, sockmap redirect bypasses the network stack entirely.


9. Observability: bpftrace, bcc, Pixie

9.1 bpftrace

DTrace-like one-liners:

bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

bpftrace -e '
kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/ {
    @latency = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

9.2 bcc

Python harness that embeds BPF C. Needs runtime LLVM; being phased out in favor of libbpf + CO-RE.

9.3 libbpf + skeleton

Production-grade. Single binary, CO-RE-portable:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/syscalls/sys_enter_openat")
int handle_openat(void *ctx) {
    u32 *e = bpf_ringbuf_reserve(&events, sizeof(u32), 0);
    if (!e) return 0;
    *e = bpf_get_current_pid_tgid() >> 32;
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

9.4 Pixie

DaemonSet uses uprobes on libssl/libc/libgrpc to decode HTTP/gRPC/SQL in-cluster, without code changes.


10. Falco — Runtime Security

Falco watches every syscall via eBPF and fires on rule matches:

- rule: Read sensitive file untrusted
  desc: Detect reading sensitive files
  condition: >
    sensitive_files and open_read
    and proc_name_exists
    and not proc.name in (trusted_procs)
  output: >
    Sensitive file opened (user=%user.name
    command=%proc.cmdline file=%fd.name)
  priority: WARNING

Moving from kernel module to eBPF made it deployable on any K8s node with sub-1 percent overhead.


11. Limits and Caveats

11.1 Verifier Walls

Complex logic must be split via tail calls or bpf_loop (5.17+).

11.2 GPL Constraint

Most useful helpers are GPL-only:

char LICENSE[] SEC("license") = "GPL";

Without it, the verifier refuses the load.

11.3 Overhead

XDP microbenchmarks:

  • Empty: ~24 Mpps
  • 1 map lookup: ~18 Mpps
  • 3 lookups + logic: ~10 Mpps

11.4 Debugging Is Hard

No GDB, bpf_printk goes to trace_pipe, no stack traces. Use bpftool prog dump jited, read verifier logs carefully, and drive tests via BPF_PROG_TEST_RUN.

11.5 Observer Effect

Every kprobe/uprobe hit adds overhead on the traced path — tens to hundreds of nanoseconds for kprobes, and noticeably more for uprobes, which trap into the kernel. Pre-filter in BPF and ship only interesting events to user space via the ring buffer.


12. Security

12.1 Is eBPF Safe?

The verifier enforces memory safety, not intent. Side channels (Spectre-like), verifier DoS, and kernel-memory leaks by privileged users are real concerns — mitigated by kernel lockdown and CAP_BPF.

12.2 Capabilities

  • CAP_SYS_ADMIN: legacy broad access.
  • CAP_BPF (5.8+): scoped BPF.
  • CAP_PERFMON: tracing.
  • CAP_NET_ADMIN: network BPF.

12.3 Unprivileged BPF

Disabled by default after Spectre (kernel.unprivileged_bpf_disabled=1). Effectively dead in 2025.


13. Benchmarking and Tuning

13.1 XDP

ip link set dev eth0 xdp obj drop.o sec xdp
ip -s link show eth0

bpftool prog show + bpftool prog profile, plus perf stat -e cycles,instructions.

13.2 Verifier Profiling

bpftool prog load prog.o /sys/fs/bpf/prog \
    log_level 2 log_size 4194304 2>&1 | tee verifier.log

13.3 XDP Tuning

  1. Batch bounds checks across contiguous fields.
  2. Avoid bpf_xdp_adjust_head.
  3. Grow the NIC RX ring (ethtool -G).
  4. Pin NIC IRQs to dedicated cores; steer flows with RPS/RFS.
  5. Use huge pages to cut TLB misses.

14. Tail Calls and BPF-to-BPF Calls

14.1 Tail Calls

Jump between programs via a BPF_MAP_TYPE_PROG_ARRAY:

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, u32);
    __type(value, u32);
} prog_array SEC(".maps");

SEC("xdp")
int entry(struct xdp_md *ctx) {
    bpf_tail_call(ctx, &prog_array, NEXT_PROG);
    return XDP_PASS;
}

Stack is lost across tail calls; max chain depth 33.

14.2 BPF-to-BPF Calls (4.16+)

Normal function calls. No recursion allowed.


15. Frontiers

15.1 eBPF for Windows

Microsoft's port is in beta; it consumes the same eBPF ELF objects, with the goal of source and bytecode compatibility with Linux.

15.2 sched_ext (Linux 6.12)

Write the process scheduler itself in eBPF, replacing CFS/EEVDF with custom policies:

SEC("struct_ops/scx_example_enqueue")
void BPF_STRUCT_OPS(enqueue, struct task_struct *p, u64 enq_flags) {
    scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

15.3 Rust via aya

use aya_bpf::{macros::xdp, programs::XdpContext};

#[xdp]
pub fn my_xdp(ctx: XdpContext) -> u32 {
    match unsafe { try_my_xdp(ctx) } {
        Ok(ret) => ret,
        Err(_) => xdp_action::XDP_PASS,
    }
}

15.4 Service Mesh

Cilium Service Mesh uses eBPF for L4 + identity + mTLS and per-node envoy for L7, eliminating per-pod sidecars — about 70 percent overhead reduction.


16. Learning Path

  1. Read ebpf.io and the official "What is eBPF."
  2. Use bpftrace one-liners.
  3. Work through BCC's tools/ directory.
  4. Move to libbpf-bootstrap and skeletons.
  5. Read Cilium bpf/, Katran, Tetragon.

Books: "Learning eBPF" (Liz Rice), "Linux Observability with BPF" (Calavera & Fontana). Conferences: eBPF Summit, LPC BPF track, KubeCon.


17. Summary Cheat Sheet

eBPF Cheat Sheet
- VM: 11 regs, 64-bit RISC-like, JIT, 1M insn limit
- Verifier: symbolic execution, memory safety, pointer typing
- Maps: Hash, Array, LRU, LPM trie, Ring buffer, Per-CPU
- Program types: kprobe, tracepoint, fentry/fexit, XDP, TC, cgroup_skb, LSM, sockops
- CO-RE: BTF relocations, single binary, all kernels
- Production: Cilium, Katran, Pixie, Falco, bpftrace
- Tools: bpftool, libbpf, bcc, aya (Rust)

18. Quiz

Q1. What's the most common reason the verifier rejects a program?

A. Missing bounds checks before pointer access or dereferencing a map lookup without a NULL check. In XDP, look for invalid access to packet, off=N size=M; for maps, R1 invalid mem access 'map_value_or_null'.

Q2. What does CO-RE solve?

A. Struct-offset drift between kernels. BTF stores each kernel's type info; libbpf rewrites field offsets at load time so one binary runs on many kernels.

Q3. Why is XDP much faster than iptables?

A. iptables runs after sk_buff allocation and walks netfilter chains. XDP runs right after the NIC driver, before the sk_buff exists, executes JIT-compiled native code, and skips rule-tree traversal.

Q4. Why are per-CPU maps faster than a global hash?

A. No cache-line contention. Each CPU writes its own slot instead of bouncing a shared line between cores. User space sums slots when reading.

Q5. How does Cilium replace kube-proxy without iptables?

A. A sockops program hooks connect(), looks up the backend in an O(1) BPF map, rewrites the destination, and connects straight to the backend. For same-node pods, sockmap redirect bypasses the network stack entirely.

Q6. Downside of tail calls?

A. Stack is lost — locals don't survive the call. Share state via maps. Max chain depth is 33.

Q7. How do you hook a user-space function?

A. uprobe or uretprobe. Pixie hooks libssl SSL_read/write to see HTTPS payloads post-TLS termination.


If you enjoyed this, check out:

  • "Linux Network Stack Deep Dive"
  • "Cilium Architecture Complete Guide"
  • "io_uring Deep Dive"
  • "DPDK vs XDP Compared"