
eBPF Deep Dive — The VM, Verifier, XDP, and CO-RE That Made the Linux Kernel Programmable (2025)


TL;DR

  • eBPF is a VM that runs sandboxed bytecode inside the Linux kernel. You can observe and modify kernel behavior without building kernel modules.
  • 11-register 64-bit RISC-like ISA. clang -target bpf compiles C to bytecode, then the kernel JITs it to native.
  • The Verifier guarantees safety. Before loading, symbolic execution proves: no infinite loops, pointer bounds checked, no uninitialized reads.
  • Maps are shared structures readable from kernel and user space. Hash, Array, LRU, LPM Trie, Ring Buffer, Per-CPU — dozens of variants.
  • Program types decide the hook: kprobe, tracepoint, XDP, TC, LSM, cgroup_skb, and more.
  • CO-RE + BTF solves the struct-layout drift problem. Compile once, run on any kernel.
  • XDP runs right after the NIC driver, pushing tens of millions of packets per second per core. DPDK-class performance without leaving the kernel.
  • In production: Cilium (K8s networking/security), Pixie (observability), Falco (runtime security), bpftrace (ad-hoc tracing), Katran (L4 LB).

1. Why eBPF — A Short History

1.1 The Old Dilemma

Extending the Linux kernel gave you three options:

  1. Upstream patch: years to merge; small features rejected.
  2. Loadable kernel module: a crash panics the box; rebuild per kernel version.
  3. User-space path: syscalls/ioctls; context switches dominate.

For an L4 load balancer: HAProxy pays per-packet user-kernel switches; IPVS is fast but risky; DPDK hits tens of millions of pps but burns a full core and bypasses the kernel stack. eBPF opened a fourth path: run your sandboxed code inside the kernel, with safety proven by the verifier.

1.2 Classic BPF (1992)

Van Jacobson and Steven McCanne built the Berkeley Packet Filter for tcpdump: filter packets in-kernel, 32-bit ISA, two registers. 10-20x faster captures. This is cBPF.

1.3 Alexei Starovoitov's 2014 Rewrite

Alexei Starovoitov introduced eBPF: 11 registers, 64-bit, maps, helper functions, JIT, a much stronger verifier, and hooks beyond networking. "Kernel-module power, user-space safety."

1.4 2025 Ecosystem

Meta's Katran, Netflix FlowLogs, GKE networking, Cloudflare DDoS, Cilium (de facto K8s CNI), RHEL 9 shipping bpftrace/bcc/libbpf. eBPF Foundation sits under the Linux Foundation.


2. The eBPF Virtual Machine

2.1 Instruction Format

Fixed 64-bit (8-byte) instructions:

struct bpf_insn {
    __u8    code;       // 8 bits: opcode
    __u8    dst_reg:4;  // 4 bits: destination register
    __u8    src_reg:4;  // 4 bits: source register
    __s16   off;        // 16 bits: signed offset
    __s32   imm;        // 32 bits: signed immediate
};

2.2 Registers

R0  : return value
R1-R5: function arguments (C ABI)
R6-R9: callee-saved
R10 : frame pointer (read-only, stack access)

R10 is read-only and the only path to stack memory.

2.3 Example

int add(int a, int b) { return a + b; }

Compiled with clang -target bpf -O2:

; 0000000000000000 <add>:
;    0:  bf 10 00 00 00 00 00 00   r0 = r1       ; R0 = a
;    1:  0f 20 00 00 00 00 00 00   r0 += r2      ; R0 += b
;    2:  95 00 00 00 00 00 00 00   exit

2.4 JIT

The kernel JITs bytecode on load (x86_64, ARM64, RISC-V, MIPS, PowerPC, s390x, SPARC). Enabled by default on modern kernels:

sysctl net.core.bpf_jit_enable  # 1 = enabled

Performance approaches native C.

2.5 Instruction Limits and Bounded Loops

  • Pre-5.2: 4,096 insns per function.
  • 5.2+: 1M insns, bounded by a verifier complexity budget.
  • 5.3+: bounded loops allowed if the verifier can prove an upper bound.
for (int i = 0; i < 64; i++) { /* OK */ }

#pragma unroll
for (int i = 0; i < 4; i++) { /* always OK */ }

for (int i = 0; i < x; i++) { /* rejected: x has no provable upper bound */ }

3. The Verifier — eBPF's Heart

3.1 What It Proves

  1. Termination: every path reaches exit.
  2. Memory safety: pointer access stays in bounds.
  3. Initialization: no uninitialized reads.
  4. Type safety: scalars never dereferenced as pointers.
  5. Complexity bound: verification itself terminates in reasonable time.

On failure, the bpf() syscall fails (typically -EINVAL or -EACCES) and emits a detailed log.

3.2 Symbolic Execution

The verifier tracks value ranges per register and forks at every branch:

int x = get_pid();   // x: [0, 4194304]
if (x > 100) {
    // in this branch: x in [101, 4194304]
    use(x);
}

It tracks s32_min/max, u32_min/max, s64_min/max, and pointer kind plus offset.

3.3 Pointer Type System

PTR_TO_CTX        : program context
PTR_TO_PACKET     : packet data
PTR_TO_PACKET_END : packet end
PTR_TO_MAP_KEY    : map key
PTR_TO_MAP_VALUE  : map value
PTR_TO_STACK      : stack memory
PTR_TO_SOCKET     : socket
SCALAR_VALUE      : non-pointer

PTR_TO_MAP_VALUE requires a NULL check before deref:

void *val = bpf_map_lookup_elem(&my_map, &key);
if (!val) return 0;
*(int *)val = 42;

3.4 Packet Bounds Checks

SEC("xdp")
int drop_port_80(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if (data + sizeof(*eth) > data_end)   // required
        return XDP_PASS;
    if (eth->h_proto == bpf_htons(ETH_P_IP)) { /* ... */ }
    return XDP_PASS;
}

3.5 Reading Verifier Errors

Logs are long but regular:

17: invalid access to packet, off=14 size=2, R1(id=0,off=14,r=0)
R1 offset is outside of the packet

Fix: add a data + N > data_end check before the access.

3.6 Complexity Budget

If too many paths are explored, you get "BPF program is too complex." Budget: 1,000,000 verified instructions on 5.2+. Mitigations: prefer bounded loops over unroll, __always_inline helpers, rely on state pruning (4.19+).


4. Maps

4.1 Why

  • 512-byte stack limit.
  • No direct sharing between programs.
  • Need user-space I/O.

Maps are typed KV stores managed by the kernel and accessible from both sides.

4.2 Common Types

Type                        Use
BPF_MAP_TYPE_HASH           per-PID stats
BPF_MAP_TYPE_ARRAY          counters, config
BPF_MAP_TYPE_PERCPU_HASH    lock-free counters
BPF_MAP_TYPE_LRU_HASH       connection tracking
BPF_MAP_TYPE_LPM_TRIE       routing tables
BPF_MAP_TYPE_STACK_TRACE    profiling
BPF_MAP_TYPE_RINGBUF        events to user (5.8+)
BPF_MAP_TYPE_PROG_ARRAY     tail calls
BPF_MAP_TYPE_DEVMAP         XDP redirect
BPF_MAP_TYPE_SK_STORAGE     per-socket state
BPF_MAP_TYPE_TASK_STORAGE   per-task state

4.3 Declaration (libbpf)

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u64);
    __uint(max_entries, 10240);
} pid_counts SEC(".maps");

4.4 Access from the Program

u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 *count = bpf_map_lookup_elem(&pid_counts, &pid);
if (!count) {
    u64 one = 1;
    bpf_map_update_elem(&pid_counts, &pid, &one, BPF_ANY);
} else {
    __sync_fetch_and_add(count, 1);
}

4.5 Per-CPU Maps

Per-CPU maps avoid cache-line contention by giving each CPU its own slot. Sum across CPUs in user space:

int ncpus = libbpf_num_possible_cpus();
u64 vals[ncpus];
bpf_map_lookup_elem(fd, &key, vals);
u64 total = 0;
for (int i = 0; i < ncpus; i++) total += vals[i];

4.6 Ring Buffer

Linux 5.8+, MPSC, back-pressure aware, event-loss counters:

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx) {
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}

5. Program Types and Hooks

5.1 Kprobe / Kretprobe

Hook kernel function entry/exit. Works on almost any kernel function, but names/signatures drift across versions.

SEC("kprobe/do_sys_openat2")
int kprobe_openat(struct pt_regs *ctx) {
    const char *filename = (const char *)PT_REGS_PARM2(ctx);
    char buf[256];
    bpf_probe_read_user_str(buf, sizeof(buf), filename);
    return 0;
}

5.2 Tracepoint

Stable ABI events explicitly defined by kernel developers. Listed under /sys/kernel/debug/tracing/events.

5.3 Fentry / Fexit (BPF Trampoline, 5.5+)

Faster than kprobe. Replaces a 5-byte NOP at function entry with a call.

SEC("fentry/tcp_v4_connect")
int BPF_PROG(connect_entry, struct sock *sk) { return 0; }

SEC("fexit/tcp_v4_connect")
int BPF_PROG(connect_exit, struct sock *sk, int ret) { return 0; }

5.4 XDP

Runs right after the NIC driver RX, before sk_buff allocation.

SEC("xdp")
int xdp_drop_tcp_80(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP) return XDP_PASS;
    if (ip->ihl < 5) return XDP_PASS;  // reject malformed header length

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end) return XDP_PASS;

    if (tcp->dest == bpf_htons(80)) return XDP_DROP;
    return XDP_PASS;
}

Return codes: XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT, XDP_ABORTED.

5.5 TC BPF

Runs at the qdisc level (egress + ingress). Used by Cilium for identity-based filtering and L7 mark-and-route prep.

5.6 cgroup_skb / cgroup_sock

Per-cgroup network control, used by K8s namespace isolation.

5.7 LSM BPF (5.7+)

Security policies implemented dynamically via eBPF, no recompile needed:

SEC("lsm/file_open")
int BPF_PROG(block_secret_file, struct file *file) {
    char name[256];
    bpf_probe_read_kernel_str(name, sizeof(name),
        file->f_path.dentry->d_iname);
    /* no strcmp in BPF programs; bpf_strncmp is a helper since 5.17 */
    if (bpf_strncmp(name, sizeof("secret"), "secret") == 0)
        return -EACCES;
    return 0;
}

5.8 Sockops

Hooks into socket lifecycle. Cilium uses this to translate connect(svc_ip) into a direct backend connection, skipping iptables entirely.


6. CO-RE — Compile Once, Run Everywhere

6.1 The Problem

Struct field offsets change across kernel versions. A prebuilt .o reads the wrong bytes on another kernel. BCC's fix was runtime LLVM compilation — hundreds of MB and seconds per load.

6.2 BTF + CO-RE

BTF stores the kernel's type info at /sys/kernel/btf/vmlinux. CO-RE marks field accesses as relocatable so libbpf rewrites offsets at load time against the target kernel.

#include <vmlinux.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_sendmsg")
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    u16 sport;
    BPF_CORE_READ_INTO(&sport, sk, __sk_common.skc_num);
    return 0;
}

6.3 Generating vmlinux.h

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

No kernel headers needed.

6.4 Feature Detection

if (bpf_core_field_exists(task->comm_size)) {
    /* new kernel */
} else {
    /* old kernel */
}

6.5 What CO-RE Changed

Containers ship a single multi-MB binary (libbpf + CO-RE .bpf.o) instead of hundreds of MB of LLVM + kernel headers. Cilium, Falco, Pixie, Inspektor Gadget all rely on it.


7. XDP — Ultra-Fast Packet Processing

7.1 Position

NIC -> XDP (right after driver RX) -> sk_buff alloc -> TC ingress -> routing -> TC egress -> driver TX -> NIC

XDP runs before sk_buff allocation, saving hundreds of nanoseconds per packet.

7.2 Modes

  1. Native XDP: driver-integrated, fastest.
  2. Generic XDP: kernel emulation, slower.
  3. Offloaded XDP: runs on SmartNIC hardware.

7.3 Numbers

Method                    pps (single core)   Latency
Kernel stack (iptables)   ~1 Mpps             ~10 us
XDP Native                ~24 Mpps            ~1 us
DPDK                      ~30 Mpps            ~0.5 us

7.4 DDoS Filtering (Cloudflare-style)

LPM-trie blacklists plus per-source SYN-rate maps, dropped at XDP_DROP. Cloudflare has absorbed 5-10 Tbps floods at the NIC edge this way.

7.5 Katran (Meta L4 LB)

  • DSR (Direct Server Return).
  • Maglev consistent hashing.
  • Millions of pps per host.

8. Cilium

8.1 K8s Networking Pain

iptables mode scales O(n) with service count. Even IPVS still pays for sk_buff and netfilter hooks.

8.2 Cilium's Bypass

Cilium attaches eBPF to veth-pair TC hooks and handles service resolution, policy, LB, and encryption directly, skipping iptables.

8.3 Identity-Based Policy

Labels map to 16-bit identities. Policy is an O(1) map lookup on (src_identity, dst_identity, port) regardless of how many policies exist.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP

8.4 Replacing kube-proxy

Sockops intercepts connect(svc_ip), performs DNAT via O(1) map lookup, and connects straight to the backend. For same-node pods, sockmap redirect bypasses the network stack entirely.


9. Observability: bpftrace, bcc, Pixie

9.1 bpftrace

DTrace-like one-liners:

bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

bpftrace -e '
kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/ {
    @latency = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

9.2 bcc

Python harness that embeds BPF C. Needs runtime LLVM; being phased out in favor of libbpf + CO-RE.

9.3 libbpf + skeleton

Production-grade. Single binary, CO-RE-portable:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/syscalls/sys_enter_openat")
int handle_openat(void *ctx) {
    u32 *e = bpf_ringbuf_reserve(&events, sizeof(u32), 0);
    if (!e) return 0;
    *e = bpf_get_current_pid_tgid() >> 32;
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

9.4 Pixie

DaemonSet uses uprobes on libssl/libc/libgrpc to decode HTTP/gRPC/SQL in-cluster, without code changes.


10. Falco — Runtime Security

Falco watches every syscall via eBPF and fires on rule matches:

- rule: Read sensitive file untrusted
  desc: Detect reading sensitive files
  condition: >
    sensitive_files and open_read
    and proc_name_exists
    and not proc.name in (trusted_procs)
  output: >
    Sensitive file opened (user=%user.name
    command=%proc.cmdline file=%fd.name)
  priority: WARNING

Moving from kernel module to eBPF made it deployable on any K8s node with sub-1 percent overhead.


11. Limits and Caveats

11.1 Verifier Walls

Complex logic must be split via tail calls or bpf_loop (5.17+).

11.2 GPL Constraint

Most useful helpers are GPL-only:

char LICENSE[] SEC("license") = "GPL";

Without it, the verifier refuses the load.

11.3 Overhead

XDP microbenchmarks:

  • Empty: ~24 Mpps
  • 1 map lookup: ~18 Mpps
  • 3 lookups + logic: ~10 Mpps

11.4 Debugging Is Hard

No GDB, bpf_printk goes to trace_pipe, no stack traces. Use bpftool prog dump jited, read verifier logs carefully, and drive tests via BPF_PROG_TEST_RUN.

11.5 Observer Effect

Every kprobe/uprobe hit adds overhead on the traced path — tens to hundreds of nanoseconds for kprobes, and noticeably more for uprobes, which trap into the kernel. Pre-filter in BPF and ship only interesting events to user space via the ring buffer.


12. Security

12.1 Is eBPF Safe?

The verifier enforces memory safety, not intent. Side channels (Spectre-like), verifier DoS, and kernel-memory leaks by privileged users are real concerns — mitigated by kernel lockdown and CAP_BPF.

12.2 Capabilities

  • CAP_SYS_ADMIN: legacy broad access.
  • CAP_BPF (5.8+): scoped BPF.
  • CAP_PERFMON: tracing.
  • CAP_NET_ADMIN: network BPF.

12.3 Unprivileged BPF

Disabled by default after Spectre (kernel.unprivileged_bpf_disabled=1). Effectively dead in 2025.


13. Benchmarking and Tuning

13.1 XDP

ip link set dev eth0 xdp obj drop.o sec xdp
ip -s link show eth0

bpftool prog show + bpftool prog profile, plus perf stat -e cycles,instructions.

13.2 Verifier Profiling

bpftool prog load prog.o /sys/fs/bpf/prog \
    log_level 2 log_size 4194304 2>&1 | tee verifier.log

13.3 XDP Tuning

  1. Batch bounds checks across contiguous fields.
  2. Avoid bpf_xdp_adjust_head.
  3. Grow the NIC RX ring (ethtool -G).
  4. Pin NIC IRQs to dedicated cores; steer flows with RPS/RFS.
  5. Use huge pages to cut TLB misses.

14. Tail Calls and BPF-to-BPF Calls

14.1 Tail Calls

Jump between programs via a BPF_MAP_TYPE_PROG_ARRAY:

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, u32);
    __type(value, u32);
} prog_array SEC(".maps");

SEC("xdp")
int entry(struct xdp_md *ctx) {
    bpf_tail_call(ctx, &prog_array, NEXT_PROG);
    return XDP_PASS;
}

Stack is lost across tail calls; max chain depth 33.

14.2 BPF-to-BPF Calls (4.16+)

Normal function calls. No recursion allowed.


15. Frontiers

15.1 eBPF for Windows

Microsoft's port is in beta; it consumes the same eBPF ELF objects, with the goal of source and bytecode compatibility with Linux.

15.2 sched_ext (Linux 6.12)

Write the process scheduler itself in eBPF, replacing CFS/EEVDF with custom policies:

SEC("struct_ops/scx_example_enqueue")
void BPF_STRUCT_OPS(enqueue, struct task_struct *p, u64 enq_flags) {
    scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

15.3 Rust via aya

use aya_bpf::{macros::xdp, programs::XdpContext};

#[xdp]
pub fn my_xdp(ctx: XdpContext) -> u32 {
    match unsafe { try_my_xdp(ctx) } {
        Ok(ret) => ret,
        Err(_) => xdp_action::XDP_PASS,
    }
}

15.4 Service Mesh

Cilium Service Mesh uses eBPF for L4 + identity + mTLS and per-node envoy for L7, eliminating per-pod sidecars — about 70 percent overhead reduction.


16. Learning Path

  1. Read ebpf.io and the official "What is eBPF."
  2. Use bpftrace one-liners.
  3. Work through BCC's tools/ directory.
  4. Move to libbpf-bootstrap and skeletons.
  5. Read Cilium bpf/, Katran, Tetragon.

Books: "Learning eBPF" (Liz Rice), "Linux Observability with BPF" (Calavera & Fontana). Conferences: eBPF Summit, LPC BPF track, KubeCon.


17. Summary Cheat Sheet

eBPF Cheat Sheet
- VM: 11 regs, 64-bit RISC-like, JIT, 1M insn limit
- Verifier: symbolic execution, memory safety, pointer typing
- Maps: Hash, Array, LRU, LPM trie, Ring buffer, Per-CPU
- Program types: kprobe, tracepoint, fentry/fexit, XDP, TC, cgroup_skb, LSM, sockops
- CO-RE: BTF relocations, single binary, all kernels
- Production: Cilium, Katran, Pixie, Falco, bpftrace
- Tools: bpftool, libbpf, bcc, aya (Rust)

18. Quiz

Q1. What's the most common reason the verifier rejects a program?

A. Missing bounds checks before pointer access or dereferencing a map lookup without a NULL check. In XDP, look for invalid access to packet, off=N size=M; for maps, R1 invalid mem access 'map_value_or_null'.

Q2. What does CO-RE solve?

A. Struct-offset drift between kernels. BTF stores each kernel's type info; libbpf rewrites field offsets at load time so one binary runs on many kernels.

Q3. Why is XDP much faster than iptables?

A. iptables runs after sk_buff allocation and walks netfilter chains. XDP runs right after the NIC driver, before the sk_buff exists, executes JIT-compiled native code, and skips rule-tree traversal.

Q4. Why are per-CPU maps faster than a global hash?

A. No cache-line contention. Each CPU writes its own slot instead of bouncing a shared line between cores. User space sums slots when reading.

Q5. How does Cilium replace kube-proxy without iptables?

A. A sockops program hooks connect(), looks up the backend in an O(1) BPF map, rewrites the destination, and connects straight to the backend. For same-node pods, sockmap redirect bypasses the network stack entirely.

Q6. Downside of tail calls?

A. Stack is lost — locals don't survive the call. Share state via maps. Max chain depth is 33.

Q7. How do you hook a user-space function?

A. uprobe or uretprobe. Pixie hooks libssl SSL_read/write to see HTTPS payloads post-TLS termination.


If you enjoyed this, check out:

  • "Linux Network Stack Deep Dive"
  • "Cilium Architecture Complete Guide"
  • "io_uring Deep Dive"
  • "DPDK vs XDP Compared"