eBPF Complete Guide — A Tiny VM Inside the Kernel: Verifier, JIT, CO-RE, Maps, Attach Points, XDP, LSM, sched_ext (2025)


Intro — eBPF is Linux's New Nervous System

People hearing about eBPF for the first time usually react with "Wait, that's actually possible?" A userspace program injects a small piece of code at arbitrary points inside the kernel, the code safely reads kernel data structures, and returns results back to userspace. On top of that, the code must pass a verifier before it runs, so bad code cannot bring down the kernel.

This would have sounded insane 30 years ago. Yet in 2025, Linux extends monitoring, security, networking, and even scheduling on exactly this model. Cilium rewrote all of Kubernetes networking with eBPF. Falco and Tetragon do runtime security with eBPF. Datadog's system metrics agent depends on eBPF. Linux 6.12 extended eBPF into the scheduler itself (sched_ext).

This article is for everyone — from those hearing about eBPF for the first time to those who have already used BCC tools. Starting from cBPF's tiny ISA of 1992, through Alexei Starovoitov's first eBPF patches in 2013, the magic of the verifier, JIT compilation, CO-RE, and every attach point of modern Linux — all in 1,400 lines.

This article is a sister piece to the Linux Internals Series. While the series covers "what the kernel does," this article covers "how users can extend the kernel."


1. History — From cBPF to eBPF

1.1 1992 — Berkeley Packet Filter

In 1992, Steven McCanne and Van Jacobson published a new packet-filtering mechanism for BSD. The previous packet filter (CSPF) evaluated boolean expression trees and was very slow. Their new model was a simple virtual machine:

  • A 32-bit accumulator (A) and index register (X)
  • 16 32-bit scratch memory slots
  • Simple instructions to read data from a packet or compare with slots
  • Jump instructions to express decision trees

This model was a huge success on BSD and was soon ported to Linux and Solaris. tcpdump and libpcap run on top of it. The expression tcp port 80 is internally compiled into around 20 cBPF instructions.

cBPF practically did not change for 30 years. It was simple, it worked, and it was sufficient for the narrow domain of packet filtering.

1.2 2013 — Alexei Starovoitov's First Patch

In 2013, Alexei Starovoitov, then at PLUMgrid, submitted a large patch to LKML. The title: "extended BPF". Key changes:

  • 32-bit → 64-bit registers
  • 2 → 11 registers (R0-R10)
  • Function call instructions
  • More arithmetic/bit operations
  • An ISA very close to x86_64 — JIT compilation is nearly 1:1

Reactions were initially skeptical. "Why extend BPF? Isn't it for packet filtering?" But Alexei's vision was bigger — a general interface for small, safe code that userspace can run inside the kernel.

1.3 2014 — Linux 3.18 Mainline

In September 2014, the first eBPF patch was merged in Linux 3.18. Initially it was for socket filters (a direct successor to cBPF). But new attach points were added in nearly every release:

  • 2014 (3.18): Basic infrastructure, socket filter
  • 2015 (4.1): kprobe attach, tc (traffic control) attach
  • 2016 (4.7): tracepoint attach
  • 2016 (4.8): XDP attach
  • 2017 (4.10): cgroup attach, perf_event attach
  • 2018 (4.18): BTF introduced (foundation of CO-RE), socket lookup
  • 2020 (5.7): LSM attach (KRSI)
  • 2020 (5.8): BPF_RINGBUF
  • 2021 (5.13): stack trace BPF helper
  • 2022 (5.17): bpf_loop helper
  • 2023 (6.4): BPF stack walking improvements
  • 2024 (6.12): sched_ext mainline merge

Today's eBPF is integrated with almost every kernel subsystem. A small VM originally made for packet filtering 30 years ago has changed parts of the very operating model of Linux.

★ Insight ─────────────────────────────────────

  • The name confusion: "eBPF" is not the official name. Kernel code just calls it "BPF". Outside, "eBPF" (short for "extended BPF") is more common. The old BPF is distinguished as "classic BPF" or "cBPF".
  • Why a "packet filter" became a general VM: The key insight was that the model of "running verified, safe code inside the kernel" applies not just to packet filtering but to almost all kernel extensions. eBPF took cBPF's properties of "self-termination guarantee + memory safety" and brought them to a richer ISA.
  • Almost nonexistent outside Linux: BSD has some eBPF ports (BSD with DTrace finds eBPF less compelling), and Windows has had a separate project called ebpf-for-windows since 2021. But saying "eBPF is essentially Linux" is almost correct.
─────────────────────────────────────────────────

2. The eBPF Virtual Machine — ISA and Registers

2.1 Eleven 64-bit Registers

The eBPF VM is very simple:

  • R0: function return value, program exit value
  • R1 to R5: function arguments (1-5)
  • R6 to R9: callee-saved
  • R10: stack frame pointer (read-only)

This interface is intentionally very similar to the x86_64 calling convention. JIT compilation becomes nearly 1:1.

2.2 Stack

Each program has a 512-byte stack. R10 points to the top of this region, and slots are addressed at negative offsets from it (fp-8, fp-16, ...). It looks small, but it is kept small because the verifier has to track every byte of usage.

2.3 Instruction Encoding

eBPF instructions are fixed at 64 bits:

+--------+-----+-----+--------+-----+
|   op   | dst | src | offset | imm |
| 8 bits |  4  |  4  |   16   |  32 |
+--------+-----+-----+--------+-----+
  • op: 8-bit opcode
  • dst/src: 4-bit register indices
  • offset: 16-bit signed offset (jumps, memory accesses)
  • imm: 32-bit immediate value

2.4 Instruction Categories

Core categories:

  1. ALU: arithmetic/logic operations (32-bit and 64-bit)
  2. Memory: load/store
  3. Branch: conditional jumps
  4. Call: helper function calls
  5. Exit: program exit

Example:

BPF_ST_MEM(BPF_W, BPF_REG_10, -8, 0)          // *(u32 *)(fp - 8) = 0 (the key)
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10)          // R2 = R10 (frame pointer)
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8)         // R2 = fp - 8 (&key)
BPF_LD_MAP_FD(BPF_REG_1, map_fd)              // R1 = map fd
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
             BPF_FUNC_map_lookup_elem)        // R0 = value pointer or NULL
BPF_EXIT_INSN()                               // return R0

2.5 A 200-line ISA

The core of the eBPF ISA — its opcodes and encodings — can be written down in about 200 lines of C (see include/uapi/linux/bpf_common.h and the instruction-class definitions in bpf.h). It is very small. Yet a tremendous amount of work can be done with this small ISA.


3. The Lifecycle of an eBPF Program

3.1 Writing — from C to BPF Bytecode

The typical flow:

// hello.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/sys_open")
int hello(struct pt_regs *ctx) {
    char fmt[] = "Hello from kprobe!\n";
    bpf_trace_printk(fmt, sizeof(fmt));
    return 0;
}

Compile:

clang -O2 -target bpf -c hello.bpf.c -o hello.bpf.o

-target bpf is the key. Clang's BPF backend compiles into eBPF instructions. The output is an ELF file containing BPF bytecode.

3.2 Loading — the bpf() System Call

You extract the bytecode from the ELF and send it to the kernel via the bpf(BPF_PROG_LOAD, ...) system call. libbpf encapsulates this process:

struct bpf_object *obj = bpf_object__open("hello.bpf.o");
bpf_object__load(obj);  // verify + JIT
struct bpf_program *prog = bpf_object__find_program_by_name(obj, "hello");
bpf_program__attach(prog);  // attach to kprobe

3.3 The Verifier

When bpf(BPF_PROG_LOAD, ...) is called, the kernel first runs the verifier. The verifier checks whether the program is safe:

  • Are all memory accesses valid?
  • Do all jumps go to defined locations?
  • Are there no infinite loops?
  • Are helper calls made in the appropriate context?
  • Is stack usage within limits?

The verifier proceeds in several passes; the next section covers how it works in detail.

3.4 JIT Compilation

Once verification passes, the JIT compiler compiles BPF bytecode into native machine code. On x86_64, the mapping is nearly 1:1. ARM64 is similar.

JIT is optional but is enabled on almost all modern systems (net.core.bpf_jit_enable=1). It is about 10-100x faster than the interpreter.

3.5 Attach

The JIT-compiled program is connected to a specific attach point. For example, the SEC name kprobe/sys_open means "put a kprobe on the entry of the sys_open function." Depending on the attach type, libbpf wires this up through a perf event or the bpf() BPF_LINK_CREATE command (BPF_PROG_ATTACH is used for cgroup-style attach points).

From this point on, the BPF program runs every time sys_open is called.


4. The Verifier — The Real Magic of eBPF

4.1 What It Guarantees

The verifier statically guarantees the following:

  1. Memory safety: all loads/stores access valid regions
  2. Type safety: pointers are not treated as integers
  3. Termination guarantee: no infinite loops (or explicit bounds)
  4. Stack safety: no stack overflow
  5. Call safety: helper calls are made with appropriate arguments and context

It guarantees all of this through static analysis without actually running the code. This is a very hard problem.

4.2 How It Works — Abstract Interpretation

The core algorithm of the verifier is abstract interpretation. It simulates every possible execution path, tracking the register/stack state at each point as an "abstract value."

Examples of abstract values:

  • R0 = SCALAR_VALUE, range [0, 100]
  • R1 = PTR_TO_MAP_VALUE, off 0..16
  • R2 = PTR_TO_PACKET, off 14..1500
  • R3 = NOT_INIT

These abstract values are updated as each instruction is processed. At branches, both paths are explored.

4.3 Avoiding Path Explosion — Pruning

Naively exploring every path would cause exponential blowup. The verifier remembers states it has already verified (its explored-states list) and prunes a path when it reaches an equivalent state again.

This is very effective, but complex programs may still take 10+ seconds to verify. When the number of processed instructions across all explored states exceeds the complexity limit (one million), the verifier gives up with -E2BIG.

4.4 Memory Access Verification

int *p = bpf_map_lookup_elem(&my_map, &key);
*p = 42;  // verification fails!

This code fails verification. bpf_map_lookup_elem can return NULL, but the code dereferences it without a NULL check. Correct code:

int *p = bpf_map_lookup_elem(&my_map, &key);
if (p) {
    *p = 42;  // OK
}

The verifier sees the if (p) check and, inside that branch, tracks that p is not NULL. That is how it knows the *p access is safe.

4.5 Termination Guarantee

Loops are the verifier's headache. Initially, eBPF banned loops entirely. All loops had to be unrolled at compile time.

From 5.3, bounded loops are allowed. If the verifier can prove the loop count is finite, it passes:

for (int i = 0; i < 10; i++) {
    /* ... the verifier walks the iterations and proves termination */
}

(Before 5.3, the workaround was #pragma unroll, which makes clang unroll the loop at compile time so that no backward jump survives.)

From 5.17, the bpf_loop helper was added, enabling more flexible loops:

static int callback(__u32 idx, void *data) {
    /* ... */
    return 0;  // 0 = continue, 1 = stop
}
bpf_loop(1000, callback, &my_data, 0);

4.6 Stack Usage

Each program has a 512-byte stack. Putting a large struct on the stack quickly exceeds the limit.

SEC("kprobe/sys_open")
int hello(struct pt_regs *ctx) {
    char buf[600];  // verification fails! stack limit exceeded
    return 0;
}

Alternative: use a BPF_MAP_TYPE_PERCPU_ARRAY as a "large scratch region."

4.7 Debugging the Verifier

When the verifier fails, it produces very long error messages. You can get the full log by loading with a raised verifier log level (for example bpftool's -d flag, or libbpf's kernel_log_level open option). The verifier's abstract state is printed alongside each instruction, line by line:

0: (b7) r1 = 0
   R1_w=0 R10=fp0
1: (61) r2 = *(u32 *)(r1 +0)
   R1 invalid mem access 'inv'
processed 2 insns ...

The ability to read these logs is the core skill of an eBPF developer.


5. BPF Maps — Communicating with Userspace

5.1 Why They Are Needed

A BPF program's run is ephemeral: all local state disappears when the program returns. To persist data or share it with userspace, you need a separate mechanism. That is BPF Maps.

Maps are key-value stores. BPF programs access them via helper functions, and userspace accesses them via system calls.

5.2 Map Types (17+)

Core types:

Type                              Purpose
BPF_MAP_TYPE_HASH                 Hash table (most common)
BPF_MAP_TYPE_ARRAY                Fixed-size array, index-based
BPF_MAP_TYPE_PERCPU_HASH          Per-CPU hash (lock-free)
BPF_MAP_TYPE_PERCPU_ARRAY         Per-CPU array
BPF_MAP_TYPE_LRU_HASH             Hash with LRU eviction
BPF_MAP_TYPE_LPM_TRIE             Longest-prefix-match trie (for routing)
BPF_MAP_TYPE_PROG_ARRAY           BPF program array (for tail calls)
BPF_MAP_TYPE_PERF_EVENT_ARRAY     For perf event output
BPF_MAP_TYPE_RINGBUF              Newer ring buffer (5.8+)
BPF_MAP_TYPE_QUEUE                FIFO
BPF_MAP_TYPE_STACK                LIFO
BPF_MAP_TYPE_SK_STORAGE           Per-socket storage
BPF_MAP_TYPE_TASK_STORAGE         Per-task storage
BPF_MAP_TYPE_INODE_STORAGE        Per-inode storage
BPF_MAP_TYPE_CGROUP_STORAGE       Per-cgroup storage
BPF_MAP_TYPE_BLOOM_FILTER         Bloom filter
BPF_MAP_TYPE_USER_RINGBUF         User → kernel ringbuf (6.1+)

5.3 Hash Map Example

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, __u64);
} counter_map SEC(".maps");

SEC("kprobe/sys_open")
int count_opens(struct pt_regs *ctx) {
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 *count = bpf_map_lookup_elem(&counter_map, &pid);
    if (count) {
        __sync_fetch_and_add(count, 1);
    } else {
        __u64 init = 1;
        bpf_map_update_elem(&counter_map, &pid, &init, BPF_ANY);
    }
    return 0;
}

This program increments a per-PID counter on each sys_open call. In userspace, you query the same map with bpf_map__lookup_elem.

5.4 PERCPU Variants — Lock-free Counters

BPF_MAP_TYPE_PERCPU_HASH has a separate hash table per CPU. No locks are needed — only BPF programs on the same CPU access that CPU's map.

When userspace reads a PERCPU map, it receives values from all CPUs. Aggregating them is userspace's responsibility.

PERCPU maps are very useful for counters, statistics, and histograms. There is no lock contention, so they are very fast.

5.5 RINGBUF — The New Event Output

Traditionally, when BPF programs sent events to userspace, they used BPF_MAP_TYPE_PERF_EVENT_ARRAY. This is a per-CPU perf ring buffer with a separate ring for each CPU.

BPF_MAP_TYPE_RINGBUF, introduced in 5.8, is more elegant:

  • A single shared ring buffer (no distribution across CPUs)
  • Uses less memory
  • BPF programs can directly reserve/commit variable-size events

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/sched/sched_process_exec")
int on_exec(struct trace_event_raw_sched_process_exec *ctx) {
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;

    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    bpf_ringbuf_submit(e, 0);
    return 0;
}

Most new BPF tools use ringbuf. The perf event array is kept for compatibility.


6. Helper Functions — 200+ Kernel Interfaces

6.1 What Is a Helper

BPF programs cannot call arbitrary kernel functions. Instead, they can only call a verified set of "helper functions." Helpers are safe interfaces that the kernel explicitly exposes.

There were around 100 helpers as of 5.0; by the 6.x kernels the count far exceeds 200.

6.2 Core Helpers

The most frequently used ones:

Helper                                        Purpose
bpf_map_lookup_elem                           Read a value from a map
bpf_map_update_elem                           Write a value to a map
bpf_map_delete_elem                           Delete a value from a map
bpf_get_current_pid_tgid                      Current PID/TID
bpf_get_current_uid_gid                       Current UID/GID
bpf_get_current_comm                          Current process name
bpf_get_current_task                          Current task_struct pointer
bpf_ktime_get_ns                              Monotonic time (nanoseconds)
bpf_trace_printk                              Debug printf (/sys/kernel/debug/tracing/trace_pipe)
bpf_perf_event_output                         Output event to perf event array
bpf_ringbuf_reserve / bpf_ringbuf_submit      ringbuf output
bpf_get_stack / bpf_get_stackid               Stack trace
bpf_probe_read_kernel / bpf_probe_read_user   Safe memory reads
bpf_skb_load_bytes                            Read bytes from a packet
bpf_redirect                                  Packet redirection
bpf_xdp_adjust_head                           XDP packet header adjustment
bpf_jiffies64                                 Current jiffies
bpf_send_signal                               Send a signal (5.3+)

6.3 Helper Safety

Each helper specifies its argument types. The verifier checks argument types at call time.

For example, the signature of bpf_map_lookup_elem:

void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

The verifier checks whether map is a pointer to a map of a type like BPF_MAP_TYPE_HASH and whether key points to memory at least as large as that map's key_size.

6.4 GPL vs non-GPL Helpers

Some helpers are GPL-only. BPF programs with non-GPL licenses cannot call them. bpf_trace_printk is one such helper (being a debug aid).

char LICENSE[] SEC("license") = "GPL";  // can use GPL helpers

Most commercial BPF tools are GPL.


7. Attach Points — Where You Can Hook

The real power of eBPF comes from its variety of attach points. Each attach point has its own "context" (arguments) and set of helpers.

7.1 kprobe / kretprobe

Can attach to the entry/return of nearly any kernel function.

SEC("kprobe/vfs_read")
int on_vfs_read(struct pt_regs *ctx) {
    /* ... */
    return 0;
}

At the entry point, you access arguments via pt_regs. At the return point, you access the return value via PT_REGS_RC(ctx).

Pros: can hook nearly any kernel function. Cons: function signatures may vary across kernel versions. CO-RE mitigates this somewhat.

7.2 fentry / fexit (BPF Trampoline)

A faster alternative introduced in 5.5. kprobe works by patching a breakpoint (INT3) into the target, whereas fentry/fexit uses a BPF trampoline built on the ftrace function-entry pad, so the program is called directly. Roughly 10x lower overhead.

SEC("fentry/vfs_read")
int BPF_PROG(on_vfs_read_entry, struct file *file, char *buf, size_t count) {
    /* direct access to arguments, thanks to BTF */
    return 0;
}

7.3 tracepoint

Stable trace points pre-embedded in kernel code. Unlike kprobes, they do not break when function names change.

SEC("tp/sched/sched_process_exec")
int on_exec(struct trace_event_raw_sched_process_exec *ctx) {
    /* direct access to tracepoint arguments */
    return 0;
}

All available tracepoints can be listed under /sys/kernel/debug/tracing/events/.

7.4 raw_tracepoint

A faster version of tracepoint. It receives tracepoint arguments as-is without decoding them — slightly more coding burden, but faster.

7.5 uprobe / uretprobe

Can also attach to userspace functions. For example, attach to libc's malloc to trace every call.

SEC("uprobe//usr/lib/libc.so.6:malloc")
int on_malloc(struct pt_regs *ctx) {
    size_t size = PT_REGS_PARM1(ctx);
    /* ... */
    return 0;
}

Very powerful but expensive — each call triggers a user→kernel trap.

7.6 perf_event

Attach to perf events (CPU cycles, cache misses, etc.). The foundation of CPU profiling.

SEC("perf_event")
int sample(struct bpf_perf_event_data *ctx) {
    /* called every N cycles */
    return 0;
}

7.7 XDP — eXpress Data Path

Attaches at the very front of the network interface's receive path. Called right after the driver, before a packet is converted into an sk_buff. Very fast.

SEC("xdp")
int xdp_prog(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    if (data + sizeof(struct ethhdr) > data_end)
        return XDP_PASS;

    struct ethhdr *eth = data;
    if (eth->h_proto == bpf_htons(ETH_P_IP)) {
        /* IP packet */
    }
    return XDP_PASS;  // or XDP_DROP, XDP_TX, XDP_REDIRECT
}

XDP actions:

  • XDP_PASS: forward to the normal network stack
  • XDP_DROP: drop
  • XDP_TX: retransmit back out the same NIC
  • XDP_REDIRECT: send to another NIC
  • XDP_ABORTED: error

XDP is very popular for DDoS protection, load balancing, and packet rewriting. Cloudflare uses XDP extensively in its infrastructure.

7.8 tc (Traffic Control)

Attaches at a slightly later stage than XDP. Since the sk_buff has already been built, it has richer information (cgroup, socket, etc.). Not as fast as XDP but more flexible.

Attach via tc-bpf:

tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj my_prog.bpf.o

7.9 cgroup hooks

Can attach to specific operations of all processes within a cgroup. For example, filtering connect calls for every process in a cgroup:

SEC("cgroup/connect4")
int restrict_connect(struct bpf_sock_addr *ctx) {
    if (ctx->user_port == bpf_htons(22)) {
        return 0;  // block SSH
    }
    return 1;
}

The foundation of container security policies.

7.10 LSM hooks (KRSI)

Introduced in 5.7. BPF programs can be attached to all Linux Security Module hooks. You can replace or augment SELinux/AppArmor with BPF.

SEC("lsm/file_open")
int BPF_PROG(check_file_open, struct file *file, int ret) {
    /* can inspect file opens and deny them */
    return -EPERM;  // or 0 = OK
}

Tetragon operates on this model.

7.11 sched_ext (6.12+)

The newest attach point. Allows userspace to write scheduling policies in BPF. Covered in detail in the Linux scheduler article.

7.12 socket lookup, sock_ops, sk_msg

Can attach at various stages of socket processing. Cilium's sidecar-less service mesh leverages this.


8. CO-RE — Compile Once, Run Everywhere

8.1 The Problem

eBPF programs often read kernel data structures (task_struct, sk_buff, etc.). But the layouts of these structs differ across kernel versions and compile options. If the machine you built on differs from the machine you run on, things break.

Old BCC solved this with "compile at runtime." It required clang and kernel headers installed on every machine, recompiling every run. Slow, disk-hungry, and unfit for production.

8.2 BTF — BPF Type Format

BTF is a lightweight debug-info format that embeds metadata of kernel data structures. A simplified version of DWARF. From 5.2, the kernel exposes its own BTF at /sys/kernel/btf/vmlinux.

BTF contains the field names and offsets of every kernel struct. With it, userspace tools can figure out "where is task_struct->mm in this kernel."

8.3 How CO-RE Works

CO-RE builds on BTF to let BPF programs run across different kernel versions.

#include <vmlinux.h>          // header generated from the host kernel's BTF
#include <bpf/bpf_core_read.h>

SEC("kprobe/sys_open")
int hello(struct pt_regs *ctx) {
    struct task_struct *task = (void *)bpf_get_current_task();
    pid_t pid = BPF_CORE_READ(task, pid);  // macro magic
    /* ... */
    return 0;
}

The BPF_CORE_READ macro does not hard-code the field offset at compile time. Instead, it records a relocation in the ELF that says "resolve this field against BTF."

At runtime, libbpf processes those relocations — it looks at the host kernel's BTF and fills in the actual offsets. The same BPF ELF works on both 5.10 and 6.5 kernels.

8.4 Generating vmlinux.h

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

This file is about 4MB and defines every kernel struct. BPF code includes it.

8.5 Field Existence Checks

Another CO-RE feature: checking whether a field exists. Lets you respond flexibly when fields are newly added or removed.

if (bpf_core_field_exists(task->cgroups)) {
    /* kernel has this field */
} else {
    /* kernel does not */
}

8.6 Impact

Thanks to CO-RE, BPF tools have become truly portable. Modern tools like Falco, Cilium, Tetragon, and Pixie all use CO-RE. A binary built once works on any kernel (given that BTF is available).


9. Userspace Tools — libbpf, BCC, bpftrace

9.1 BCC — The Old Way

BCC (BPF Compiler Collection) is the oldest BPF toolkit. It makes writing BPF programs easy via Python/Lua wrappers.

from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello!\\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="sys_open", fn_name="hello")
b.trace_print()

Problems: compiles on every run. Requires clang + kernel headers. Large disk footprint. Unfit for production.

BCC is still used in many tools and has great value as an example archive (there are over 200 tools in /usr/share/bcc/tools/).

9.2 libbpf — The Modern Way

libbpf is a C library that encapsulates loading/attaching BPF programs. It supports CO-RE.

#include <bpf/libbpf.h>

int main() {
    struct bpf_object *obj = bpf_object__open_file("hello.bpf.o", NULL);
    bpf_object__load(obj);
    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "hello");
    bpf_program__attach(prog);

    while (1) sleep(1);
    return 0;
}

Build once and deploy. Datadog, Cilium, and Tetragon are all libbpf-based.

9.3 bpftrace — A DSL

The fastest entry-level tool. You write BPF programs in a one-liner via an awk-like DSL.

# count every sys_open call
bpftrace -e 'kprobe:sys_open { @[comm] = count(); }'

# latency distribution of vfs_read (histogram)
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
    @lat = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# page fault tracking with stack traces
bpftrace -e 'tracepoint:exceptions:page_fault_user { @[ustack] = count(); }'

bpftrace uses libbpf under the hood. It lowers the DSL to BPF bytecode through LLVM and loads the result with libbpf.

9.4 Which Tool to Use

  • Ad hoc diagnostics: bpftrace
  • Production tools / long-running agents: libbpf (C/Rust/Go)
  • Reference / reusing existing BCC tools: BCC

Most new BPF code is migrating to libbpf. BCC is increasingly being relegated to the role of "example archive."


10. Case 1 — bpftrace One-liner Diagnostics

Real-world examples showcasing the power of bpftrace:

10.1 Which process reads the most from disk

bpftrace -e '
tracepoint:block:block_rq_issue { @[comm] = sum(args->bytes); }
'

10.2 Who makes the most syscalls

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

10.3 Track TCP retransmissions

bpftrace -e '
kprobe:tcp_retransmit_skb {
    @[comm] = count();
}
'

10.4 Which function takes the longest

bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
    $duration = nsecs - @start[tid];
    @hist = hist($duration / 1000);
    delete(@start[tid]);
}'

10.5 Full context on OOM kill

bpftrace -e '
kprobe:oom_kill_process {
    printf("OOM kill: comm=%s pid=%d ustack=%s\n",
        comm, pid, ustack);
}'

Each of these one-liners does work that once required a dedicated tool. That is a new baseline for operations debugging.


11. Case 2 — XDP DDoS Defense

11.1 Scenario

You are under a UDP flood attack. Tens of millions of packets per second pour in and the NIC is overwhelmed. The defense code must run before a packet is converted into an sk_buff.

11.2 The XDP Program

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __type(key, struct bpf_lpm_trie_key);
    __type(value, __u32);
    __uint(max_entries, 1024);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} blacklist SEC(".maps");

SEC("xdp")
int xdp_drop(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;

    /* look up the IP in the LPM trie */
    struct {
        __u32 prefixlen;
        __u32 addr;
    } key = { .prefixlen = 32, .addr = ip->saddr };

    if (bpf_map_lookup_elem(&blacklist, &key)) {
        return XDP_DROP;  // drop immediately if blacklisted
    }

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

11.3 Attach

ip link set dev eth0 xdpgeneric obj xdp_drop.bpf.o sec xdp

11.4 Results

This program drops packets before they are converted into sk_buffs. Roughly 10x faster (numbers like 24Mpps vs 2.5Mpps). Cloudflare uses very similar patterns in its infrastructure.

Packets XDP drops have almost no impact on system metrics — no sk_buff allocation, so no memory use and almost no CPU.


12. Case 3 — Cilium's Kubernetes Networking

12.1 The Vision

Cilium is a project that rewrites Kubernetes networking/security/observability with eBPF. It completely replaces iptables-based kube-proxy.

12.2 What Is Different

Traditional Kubernetes networking:

  • Tens of thousands of iptables rules (scales with the number of services)
  • Every new service updates iptables on every node
  • Packet processing goes through conntrack, NAT, and routing
  • Genuinely slow in large clusters

Cilium's model:

  • Service/endpoint info stored in BPF maps
  • Packet processing is done directly by BPF programs
  • Almost no need for iptables
  • Consistent performance even in large clusters

12.3 Sidecar-less Service Mesh

Cilium leverages sock_ops and sk_msg to intercept L7 traffic without a sidecar proxy. The traditional Istio/Linkerd model puts an Envoy sidecar in every pod, which is expensive in memory/CPU/latency.

Cilium's sidecarless model runs one (or zero) proxy per node and redirects traffic to that proxy with BPF.

12.4 Tetragon — Security

Another tool by the Cilium team. Uses LSM hooks and tracepoints to monitor all container activity. You can know in real time "this container just read /etc/passwd."

Similar to traditional Falco but deeper. On policy violation, it can send a signal or block immediately.

★ Insight ─────────────────────────────────────

  • New companies made by eBPF: Isovalent (the Cilium company, acquired by Cisco in 2024), Polar Signals (continuous profiling), Pixie (observability), Groundcover, Levitate Security. All new categories made possible by eBPF.
  • The end of the iptables era: In Kubernetes infrastructure, iptables is now treated as legacy. nftables replaced some of it, but the real future is eBPF. Cilium is becoming the de facto standard.
  • The meaning of a sidecar-less mesh: In Kubernetes clusters, sidecar proxies often consume more than 30% of node memory. Replacing them with eBPF frees that memory. That is a huge cost difference.
─────────────────────────────────────────────────

13. Case 4 — Falco Runtime Security

Falco is a runtime security tool by Sysdig. It detects abnormal behavior in containers and sends alerts. Examples: containers reading /etc/shadow, attempts to escalate to root, suspicious shell spawns, etc.

Traditionally it used a sysdig kernel module, but recent versions migrated to eBPF. It intercepts every syscall and feeds them to a rule engine.

# Example Falco rule
- rule: Read sensitive file
  desc: An attempt to read sensitive file
  condition: open_read and sensitive_files
  output: Sensitive file opened (user=%user.name file=%fd.name)
  priority: WARNING

Thanks to eBPF, no kernel module is needed and it works across any kernel version.


14. Case 5 — bpftune Automatic Tuning

A tool by Oracle. It monitors system metrics with eBPF and automatically adjusts sysctl values. Examples:

  • TCP connections time out frequently → lower tcp_keepalive_time
  • Memory pressure occurs frequently → adjust vm.swappiness
  • Disk IO bottleneck → increase readahead size

Traditionally, system tuning was done manually by humans. bpftune automates this in a data-driven way. Without eBPF, you would have had to run a separate tool for every metric.


15. Security — The Dangers of eBPF

As powerful as eBPF is, it is dangerous in the wrong hands.

15.1 BPF Permissions

Loading a BPF program typically requires CAP_BPF (5.8+) and CAP_PERF_MON or CAP_NET_ADMIN. Previously it required CAP_SYS_ADMIN (near-equivalent to root).

15.2 Bypassing the Verifier

The verifier is static analysis and not 100% perfect. Several past CVEs allowed tricking the verifier into enabling arbitrary memory reads/writes:

  • CVE-2022-23222: flaw in BPF pointer-arithmetic verification
  • CVE-2021-45402: flaw in 32-bit branch verification
  • CVE-2021-3490: flaw in ALU32 boundary tracking

Such flaws are patched quickly when discovered, but as the verifier grows more complex, the chance of new flaws also rises.

15.3 unprivileged_bpf_disabled

Linux can block normal users from using BPF by default:

echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled

Most distros turn this on by default. Normal users cannot load BPF programs.

15.4 BPF LSM and BPF Self-Protection

The existence of BPF LSM means that BPF can govern BPF. Policies like "deny BPF program loads in this cgroup" can themselves be expressed in BPF.
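As an illustration, the idea can be sketched as a small LSM program attached to the lsm/bpf hook, which fires on every bpf() syscall. This is a sketch, not a drop-in policy: it assumes a CO-RE build (a generated vmlinux.h plus libbpf headers), and the program name is illustrative.

```c
// Sketch: deny further BPF program loads via a BPF LSM hook.
// Assumes a CO-RE build (vmlinux.h generated by bpftool) and libbpf headers.
#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/bpf")
int BPF_PROG(deny_prog_load, int cmd, union bpf_attr *attr,
             unsigned int size, int ret)
{
    /* An earlier LSM program already rejected this call. */
    if (ret != 0)
        return ret;

    /* Block new program loads; leave map operations etc. alone. */
    if (cmd == BPF_PROG_LOAD)
        return -EPERM;
    return 0;
}
```

A real policy would typically scope the denial, e.g. by checking bpf_get_current_cgroup_id() or the calling task, rather than blocking loads system-wide.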

15.5 Side Channels

BPF programs can leak privileged information. Research has shown that Spectre-style side-channel attacks can be mounted from BPF programs. The verifier rejects some concerning patterns, but a perfect defense is hard.


16. Debugging — bpftool

bpftool is the Swiss army knife of BPF infrastructure.

16.1 Listing Loaded Programs

bpftool prog list
1: kprobe  name hello  tag a1b2c3d4e5f60718
        loaded_at 2026-04-15T10:30:00+0900  uid 0
        xlated 200B  jited 256B  memlock 4096B
        btf_id 5

16.2 Dumping a Program's BPF Code

bpftool prog dump xlated id 1

Shows the BPF instructions that passed verification.

bpftool prog dump jited id 1

Shows the JIT-compiled native code.

16.3 Viewing Maps

bpftool map list
bpftool map dump id 5

16.4 Dumping BTF

bpftool btf dump file /sys/kernel/btf/vmlinux | less

16.5 Verifier Logs

To see a detailed verifier log while loading a program, pass bpftool's -d (--debug) flag:

bpftool -d prog load my.bpf.o /sys/fs/bpf/my_prog

17. The Future — eBPF's Next Steps

17.1 sched_ext

Covered in the Linux scheduler article. Allows userspace to write scheduling policies in BPF. Mainline-merged in 6.12.

17.2 struct_ops

Enables BPF programs to become implementations of kernel interfaces. For example, you can implement a TCP congestion-control algorithm in BPF via the bpf_struct_ops mechanism.

17.3 BPF for Filesystem Operations

Work is ongoing to allow attaching BPF to filesystem hooks. User-defined filesystem policies (e.g., cache policies, placement policies) become possible.

17.4 BPF in eBPF

Can BPF call BPF? In limited forms, it already does: tail calls jump from one program into another, and BPF-to-BPF calls allow subprograms within a program. Ongoing work generalizes these mechanisms further.

17.5 Expansion to Other OSes

  • ebpf-for-windows: Backed by Microsoft. An attempt to bring the eBPF infrastructure into the Windows kernel.
  • uBPF: A library for running a BPF VM in userspace; ebpf-for-windows uses it as its interpreter and JIT.

eBPF may become a cross-OS standard.


18. Conclusion — eBPF Is Not Ending

If you have read this far, you should be able to answer:

  • What is eBPF, and how is it different from cBPF?
  • How does the verifier guarantee safety?
  • How do BPF maps communicate with userspace?
  • What attach points exist?
  • What does CO-RE solve?
  • What is the difference between libbpf, BCC, and bpftrace?
  • Why is XDP fast?
  • What did Cilium rewrite?

But this article is only the beginning. eBPF gets new features every year and expands into new areas every year. eBPF a year from now will be very different from today.

The best way to learn eBPF is to do it:

  1. Start diagnosing your system with bpftrace one-liners
  2. Read and modify the tools in BCC's /usr/share/bcc/tools/
  3. Write your own tool with libbpf examples
  4. When you hit verifier errors, read them line by line

Going through these steps makes eBPF no longer feel like magic, but like a powerful yet comprehensible tool.

This article also wraps up its sister-piece relationship with the Linux Internals Series. While the series covered "what the kernel does," this article covered "how users can safely extend the kernel." Put together, they paint the basic spirit of modern Linux systems.

Next articles will cover a [Cilium internals deep dive] or [new categories of tools made with BPF].


Appendix A — References

Appendix B — FAQ

Q: How should I start learning eBPF? A: Start with bpftrace. Following along with one-liners naturally teaches you the BPF model. After that, read BCC tool code, and finally write your own tool with libbpf-bootstrap.

Q: Can an eBPF program crash the kernel? A: Possible if the verifier mistakenly lets through a flawed program. But that is very rare and patched quickly. In normal use, BPF does not cause kernel crashes.

Q: Should I use kprobe or tracepoint? A: Tracepoint, if one is available. It is stable and fast. For places without a tracepoint, use kprobe (or fentry).

Q: BCC or libbpf? A: New code should be libbpf, no question. BCC is for maintaining old tools or for learning.

Q: Is XDP or tc faster? A: XDP. It processes packets before they become sk_buffs. tc is after sk_buff, so it is slightly slower but has richer information.

Q: Does Cilium really replace iptables entirely? A: Almost. In Cilium mode, kube-proxy's iptables rules are not created. Some basic host rules may still exist.

Q: What is the relationship between eBPF and DTrace? A: DTrace is Solaris's dynamic tracing infrastructure. A similar model, but it was built earlier and went a different direction. eBPF started from cBPF and is more general. Today's eBPF can do nearly everything DTrace did.

Q: Can eBPF track Java/Go GC pauses? A: Yes. Attaching uprobes to the GC entry/return functions lets you measure GC time. A usable alternative to JVM Flight Recorder.
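For example, a bpftrace one-liner can histogram the time spent inside a GC entry point. The library path and symbol pattern below are hypothetical — they vary by JVM build, so locate the real symbol first (e.g. objdump -T libjvm.so | grep -i collect):

```shell
# Sketch — path and symbol pattern are placeholders, not real names.
sudo bpftrace -e '
uprobe:/usr/lib/jvm/default/lib/server/libjvm.so:*CollectedHeap*collect* {
    @start[tid] = nsecs;                     // GC entry timestamp
}
uretprobe:/usr/lib/jvm/default/lib/server/libjvm.so:*CollectedHeap*collect*
/@start[tid]/ {
    @gc_pause_ns = hist(nsecs - @start[tid]); // per-call GC duration
    delete(@start[tid]);
}'
```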


Appendix C — Mini Glossary

  • BPF: Berkeley Packet Filter. Started in BSD in 1992.
  • cBPF: classic BPF. The old packet-filter ISA.
  • eBPF: extended BPF. Modern BPF from 2014+.
  • Verifier: Module that statically verifies the safety of BPF programs.
  • JIT: Just-In-Time compiler. Converts BPF bytecode into native code.
  • BTF: BPF Type Format. Metadata for kernel data structures.
  • CO-RE: Compile Once, Run Everywhere. Portable BPF based on BTF.
  • libbpf: Library for loading/attaching BPF programs.
  • BCC: BPF Compiler Collection. The old BPF toolkit.
  • bpftrace: An awk-like DSL for BPF.
  • bpftool: BPF infrastructure debugging tool.
  • kprobe: Hook at the entry of a kernel function.
  • kretprobe: Hook at the return of a kernel function.
  • fentry/fexit: Faster kprobe alternatives based on the ftrace trampoline.
  • tracepoint: Stable tracing points embedded in the kernel.
  • uprobe: Hook on a userspace function.
  • XDP: eXpress Data Path. BPF hook right after the NIC driver.
  • tc: Traffic Control. BPF hook after the sk_buff is built.
  • LSM: Linux Security Module. Security hooks.
  • KRSI: Kernel Runtime Security Instrumentation. Another name for BPF LSM.
  • sched_ext: An extensible scheduler class whose policies are written in BPF (6.12+).
  • struct_ops: Mechanism where BPF becomes the implementation of a kernel interface.
  • PERCPU map: A map split per CPU. No locks.
  • RINGBUF: The newer BPF event ring buffer (5.8+).
  • BPF tail call: A BPF program jumps to another BPF program.
  • Cilium: Kubernetes networking/security rewritten in BPF.
  • Tetragon: BPF-based runtime security (Cilium team).
  • Falco: BPF-based runtime security (Sysdig).
  • bpftune: Automatic system tuning with BPF (Oracle).

This article is a sister piece to the Linux Internals Series. The series covered what the kernel does for users. This article covered how users can safely step into the kernel. Together, the two scenes paint the spirit of modern Linux.
