eBPF Observability in Practice: Opening the Black Box with bpftrace

Introduction

It is 2 a.m. and an alert fires: "API responses intermittently take 3 seconds." You open the APM dashboard, but nothing inside the application code looks slow. CPU and memory are normal. The problem lives somewhere between the application and the hardware — inside the black box called the kernel.

The most powerful way to open that black box is eBPF-based observability. The previous post covered eBPF fundamentals (programs, maps, the verifier); this one is relentlessly practical. We will validate hypotheses in five minutes with bpftrace one-liners, dig into disk, network, and scheduler behavior with the BCC toolbox, dissect CPU usage with flame graphs, and then walk through three real-world incident scenarios step by step.

Where Traditional Tools Fall Short

Every traditional observability tool has its blind spots.

| Tool family | Good at | Limitation |
| --- | --- | --- |
| Metrics (Prometheus etc.) | Trends, alerting | Individual events hide behind aggregated averages |
| Logs | Whatever the application chose to record | What was not logged appears not to exist |
| APM tracing | Per-request decomposition | Requires instrumentation; the kernel portion is a gap |
| strace / ltrace | Exhaustive syscall (strace) and library-call (ltrace) observation | ptrace-based; can slow a syscall-heavy target by an order of magnitude or more |
| perf | CPU profiling | Dump-then-postprocess workflow, limited custom logic |

eBPF observability differs in three ways:

1. No code changes, no redeploys. You observe already-running processes and the kernel in place.

2. Low overhead. Events are filtered and aggregated inside the kernel, and only the distilled results cross into user space — in stark contrast to the context-switch storm strace creates.

3. Kernel and application through one lens. Syscalls, disk IO, the scheduler, TCP retransmits, and user-function latency connect in a single line of investigation.

The big picture in ASCII:

Observed layers and their eBPF hooks

+----------------------------------------------------------------+
| Application        <── uprobe / USDT (function latency, GC)    |
+----------------------------------------------------------------+
| Libraries (libc)   <── uprobe (malloc, pthread, ...)           |
+----------------------------------------------------------------+
| Syscall boundary   <── tracepoint syscalls:* (open, read...)   |
+----------------------------------------------------------------+
| Kernel subsystems                                              |
|   VFS/filesystem   <── kprobe/kfunc (vfs_read, ...)            |
|   Block layer      <── tracepoint block:* (IO latency)         |
|   Network stack    <── kprobe tcp_* / tracepoint sock:*        |
|   Scheduler        <── tracepoint sched:* (run queue wait)     |
+----------------------------------------------------------------+
| Hardware events    <── perf_event (CPU cycles, cache misses)   |
+----------------------------------------------------------------+

bpftrace Syntax in Five Minutes

bpftrace is a DSL that resembles awk. The basic structure is "probe / filter / action."

probe /filter/ { action }

probe : where to attach (tracepoint, kprobe, uprobe, profile...)

filter : which events to keep (e.g. pid == 1234, optional)

action : what to do (print, aggregate into a map)

Knowing a handful of built-in variables and functions is enough to start.

| Element | Meaning |
| --- | --- |
| pid, tid, comm | Process ID, thread ID, command name |
| args | Tracepoint argument struct |
| retval | Return value in kretprobe/uretprobe |
| nsecs | Nanosecond timestamp |
| kstack, ustack | Kernel/user stack trace |
| count(), sum(), avg() | Aggregation functions |
| hist(), lhist() | log2 / linear histograms |

Ten one-liners for the real world

1. Trace file opens — who opens which files:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat

{ printf("%-16s %s\n", comm, str(args->filename)); }'

2. Trace process execution — every command run on the system:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve

{ printf("%-16s -> %s\n", comm, str(args->filename)); }'

3. Syscall counts per process, showing which processes issue the most syscalls:

sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter

{ @[comm] = count(); }'

4. read syscall latency histogram, showing the latency distribution of read() calls system-wide:

sudo bpftrace -e '

tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }

tracepoint:syscalls:sys_exit_read /@start[tid]/

{ @usecs = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }'

5. New TCP connections — which process connects where:

sudo bpftrace -e 'kprobe:tcp_connect

{ printf("%-16s pid=%d\n", comm, pid); }'

6. Block IO size distribution — histogram of disk request sizes:

sudo bpftrace -e 'tracepoint:block:block_rq_issue

{ @bytes = hist(args->bytes); }'

7. Total disk IO bytes per process:

sudo bpftrace -e 'tracepoint:block:block_rq_issue

{ @[comm] = sum(args->bytes); }'

8. CPU sampling at 99Hz — where does kernel time go:

sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

9. Processes with the most page faults:

sudo bpftrace -e 'software:page-faults:1 { @[comm] = count(); }'

10. Signal tracing — who sends kill to whom:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_kill

{ printf("%s (pid %d) -> pid %d, sig %d\n",

comm, pid, args->pid, args->sig); }'

Internalize just these ten and you can start every conversation with measurement instead of guesswork. Explore the available probes with:

sudo bpftrace -l 'tracepoint:syscalls:*' | head -20

sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_openat' # inspect arguments

A Tour of the BCC Toolbox

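If bpftrace is for ad hoc questions, the BCC tool collection (and its libbpf-based rewrite, libbpf-tools) is a set of polished, ready-made instruments. Most distributions ship them as the bpfcc-tools or bcc-tools package. The commands below use the Debian/Ubuntu names, which carry a -bpfcc suffix; bcc-tools packages on other distributions typically install the same tools without the suffix under /usr/share/bcc/tools.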

Tool selection by symptom

| Symptom / question | Tool | What it shows |
| --- | --- | --- |
| What keeps spawning processes | execsnoop | Every execve with arguments |
| Which files are being opened | opensnoop | open calls and result codes |
| Is the disk slow | biolatency | Block IO latency histogram |
| Which IOs are slow | biosnoop | Per-IO latency, size, process |
| TCP connection lifetimes | tcplife | Per-connection lifetime, bytes in/out |
| Who initiates connections | tcpconnect | Active connect tracing |
| Who accepts connections | tcpaccept | Passive accept tracing |
| Are there retransmits | tcpretrans | TCP retransmission events |
| Are syscalls slow | syscount | Per-syscall counts/latency |
| Is run queue wait long | runqlat | Scheduler run queue latency histogram |
| Is the page cache missing often | cachestat | Page cache hits/misses |
| Is the filesystem slow | ext4slower etc. | FS operations slower than a threshold |

Typical invocations:

Print only ext4 operations slower than 10ms

sudo ext4slower-bpfcc 10

Block IO latency histogram at 1-second intervals

sudo biolatency-bpfcc -m 1

Observe lifetimes of all TCP sessions, containers included

sudo tcplife-bpfcc

The 60-second triage routine

The sequence I use for incident first response:

sudo execsnoop-bpfcc # unexpected process storms? (cron, health checks)

sudo runqlat-bpfcc 5 1 # CPU contention: run queue wait distribution

sudo biolatency-bpfcc 5 1 # disk: IO latency distribution

sudo tcpretrans-bpfcc # network: any retransmissions?

sudo syscount-bpfcc -L -d 5 # top syscalls by latency

Five commands and one minute are enough to narrow the problem down to a layer: CPU, disk, network, or syscalls.
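Two of the five commands (execsnoop and tcpretrans) run until interrupted, so the one-minute budget only holds if every step is bounded. A minimal sketch of such a wrapper follows. It assumes GNU coreutils timeout is available and that the script runs as root; run_bounded is a hypothetical helper name, and the 10-second windows are arbitrary.

```shell
#!/bin/sh
# run_bounded SECONDS COMMAND [ARGS...]
# Runs COMMAND for at most SECONDS, then sends SIGINT so the tool prints
# its summary the same way it would on Ctrl-C. Missing tools are skipped
# instead of aborting the whole routine.
run_bounded() {
  secs=$1
  shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "skip: $1 not installed"
    return 0
  fi
  echo "== $* (max ${secs}s) =="
  rc=0
  timeout -s INT "$secs" "$@" || rc=$?
  # timeout exits with 124 when it had to stop the command: expected here.
  [ "$rc" -eq 0 ] || [ "$rc" -eq 124 ] || echo "warn: $1 exited with status $rc"
  return 0
}

run_bounded 10 execsnoop-bpfcc
run_bounded 10 runqlat-bpfcc 5 1
run_bounded 10 biolatency-bpfcc 5 1
run_bounded 10 tcpretrans-bpfcc
run_bounded 10 syscount-bpfcc -L -d 5
```

Running the steps one after another keeps the output readable in a terminal; if wall-clock time matters more, the same helper can be started in the background for each tool with its output redirected to a separate file.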

How to Read Latency Histograms

Among all eBPF tool outputs, the log2 histogram carries the most information. Here is a sample biolatency output:

     usecs               : count     distribution
       128 -> 255        : 1402     |**************************              |
       256 -> 511        : 2012     |****************************************|
       512 -> 1023       : 803      |***************                         |
      1024 -> 2047       : 95       |*                                       |
       ...
     65536 -> 131071     : 41       |                                        |
    131072 -> 262143     : 17       |                                        |

Three things to read:

1. Where the mode sits: most IOs cluster between 256 and 511 microseconds. That is the "normal" characteristic of the device.

2. Whether a tail exists: 58 events took roughly 65 ms or longer, 17 of them more than 131 ms. They make up only about 1 percent of the events shown, so they are easy to miss on a dashboard, yet they are exactly what ruins the p99 and p99.9 user experience.

3. Whether the distribution is bimodal: two peaks are a strong signal that two distinct paths are mixed together — "cache hit vs miss," "local vs remote," "happy path vs retry." The mean then points to a value between the peaks that no real request ever experiences.

In short: discard the average and look at the distribution. That is the central lesson of eBPF observability.
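To make that concrete, here is a small sketch that summarizes a log2 histogram with awk. It approximates the mean from bucket midpoints and reports how many events fall at or above a threshold. The helper name hist_summary is hypothetical, the input is the sample output above with the elided rows left out, and the midpoint approximation is coarse, so treat the numbers as illustrative only.

```shell
#!/bin/sh
# hist_summary THRESHOLD
# Reads biolatency-style rows ("lo -> hi : count |bar|") on stdin and
# prints the event total, an approximate mean (bucket midpoints), and
# the number and share of events in buckets starting at THRESHOLD or above.
hist_summary() {
  awk -v thr="$1" '
    $2 == "->" && $4 == ":" {
      lo = $1; hi = $3; n = $5
      total += n
      weighted += n * (lo + hi) / 2      # midpoint approximation
      if (lo >= thr) tail += n
    }
    END {
      if (total == 0) { print "no buckets parsed"; exit 1 }
      printf "events=%d approx_mean=%.0f tail_events=%d tail_pct=%.2f\n",
             total, weighted / total, tail, 100 * tail / total
    }'
}

hist_summary 65536 <<'EOF'
     usecs               : count     distribution
       128 -> 255        : 1402     |**************************              |
       256 -> 511        : 2012     |****************************************|
       512 -> 1023       : 803      |***************                         |
      1024 -> 2047       : 95       |*                                       |
     65536 -> 131071     : 41       |                                        |
    131072 -> 262143     : 17       |                                        |
EOF
```

On these rows the approximate mean lands around 2 ms, several times the mode, while only about 1.3 percent of events sit in the tail buckets. The mean describes neither the typical IO nor the slow one.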

CPU Profiling and Flame Graphs

perf-based vs eBPF-based

| Aspect | perf record | eBPF (profile probe / profile-bpfcc) |
| --- | --- | --- |
| Collection | Dump samples to file, postprocess | Aggregate per-stack inside kernel maps |
| Data volume | perf.data can grow to gigabytes | Only aggregates cross over, small |
| Overhead | Includes disk-write cost | Generally lower |
| Flexibility | Standardized workflow, broad PMU support | Custom logic such as conditional collection |

Two routes to a flame graph from 99Hz whole-system sampling:

Route A: perf-based

sudo perf record -F 99 -a -g -- sleep 30

sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

Route B: eBPF-based (in-kernel aggregation, folded output directly)

sudo profile-bpfcc -F 99 -af 30 > out.folded

flamegraph.pl out.folded > cpu.svg

Reading a flame graph is simple: horizontal width is share of CPU time, and the call stack deepens as you go up. Find the widest plateau, then walk downward to understand why that function is being called. To see user stacks, the target must either be built to preserve frame pointers (-fno-omit-frame-pointer) or be profiled with a tool that unwinds from DWARF call-frame information, as some recent profilers do.
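The folded file is also useful without rendering an SVG. Each line is a stack with frames joined by semicolons, root first, followed by a space and the sample count, so the "widest plateau" corresponds to the leaf frame with the most samples. A minimal sketch that ranks leaf frames with awk; top_leaf_frames is a hypothetical helper, and the stacks in the sample file are invented for illustration.

```shell
#!/bin/sh
# top_leaf_frames FOLDED_FILE [N]
# Folded format: frames joined by ";" (root first), a space, then the
# sample count. The last frame is the function that was on-CPU.
top_leaf_frames() {
  awk '
    {
      count = $NF
      stack = $0
      sub(/ [0-9]+$/, "", stack)        # strip the trailing sample count
      depth = split(stack, frames, ";")
      leaf[frames[depth]] += count
      total += count
    }
    END {
      for (f in leaf)
        printf "%d %.1f%% %s\n", leaf[f], 100 * leaf[f] / total, f
    }' "$1" | sort -rn | head -n "${2:-10}"
}

# Hypothetical sample data, standing in for a real out.folded.
sample=$(mktemp)
cat > "$sample" <<'EOF'
myapp;main;handle_request;parse_json 120
myapp;main;handle_request;compress 310
myapp;main;gc_worker;mark 70
EOF
top_leaf_frames "$sample" 3
rm -f "$sample"
```

This counts self time only. A function that is expensive because of its callees shows up as a wide frame lower in the flame graph, not in this ranking, so the two views complement each other.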

Off-CPU analysis

A CPU flame graph only shows time spent running. Yet most latency comes from waiting — on disk, locks, or the network. That is where off-CPU analysis comes in.

Aggregate, per stack, the time threads spent blocked off-CPU

sudo offcputime-bpfcc -df 30 > offcpu.folded

flamegraph.pl --colors=io offcpu.folded > offcpu.svg

Truths invisible in the CPU flame graph — such as "80 percent of time waiting on a lock" — frequently surface in the off-CPU graph. Make a habit of reading on-CPU and off-CPU graphs as a pair.

In Containers and Kubernetes

The principle: eBPF is a node-level technology

Because eBPF programs attach to the kernel, one program observes every container on the node at once. Containers are ultimately processes isolated by cgroups and namespaces, so filtering by PID or cgroup ID lets you single out a specific Pod. A typical approach when working directly on a node:

Find the PID of a specific container and filter to that process

PID=$(crictl inspect --output go-template \
  --template '{{.info.pid}}' CONTAINER_ID)

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == '"$PID"'/
{ printf("%s\n", str(args->filename)); }'

Common patterns include deploying a debug-tools image as a DaemonSet, or using kubectl debug node to spin up a temporary pod on the node.
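The PID filter above matches a single process, so children and sibling processes in the same container are missed. Filtering by cgroup covers all of them. The sketch below builds a bpftrace program around the cgroup builtin and the cgroupid() function, which compare the current task's cgroup v2 ID with that of a directory. The cgroup path is a placeholder: on a cgroup v2 node you can usually derive the real one by prefixing /sys/fs/cgroup to the path shown in /proc/PID/cgroup. The helper name cgroup_openat_prog is hypothetical, and the match is exact, so point it at the container's own cgroup directory rather than a parent.

```shell
#!/bin/sh
# cgroup_openat_prog CGROUP_DIR [SECONDS]
# Prints a bpftrace program that counts openat() calls per process and
# file name, restricted to tasks in one cgroup v2 directory, and exits
# on its own after SECONDS (default 10).
cgroup_openat_prog() {
  cat <<EOF
tracepoint:syscalls:sys_enter_openat /cgroup == cgroupid("$1")/
{ @[comm, str(args->filename)] = count(); }
interval:s:${2:-10} { exit(); }
EOF
}

# Hypothetical cgroup path: look up the real one for your container.
CGROUP_DIR=${CGROUP_DIR:-/sys/fs/cgroup/kubepods.slice/example.scope}

if command -v bpftrace >/dev/null 2>&1 && [ -d "$CGROUP_DIR" ] && [ "$(id -u)" -eq 0 ]; then
  bpftrace -e "$(cgroup_openat_prog "$CGROUP_DIR" 10)"
else
  # Without bpftrace, root, or the cgroup directory, just show the program.
  cgroup_openat_prog "$CGROUP_DIR" 10
fi
```

Aggregating into a map and exiting after a fixed interval keeps the run bounded and avoids per-event printing, which matters on busy nodes.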

Ecosystem tools (as far as I know)

| Tool | Approach | Characteristics |
| --- | --- | --- |
| Pixie | In-cluster eBPF agents | Automatic protocol capture (HTTP, DBs), script-based queries |
| Parca | eBPF continuous profiling | Always-on low-overhead profiler, time-series flame graphs |
| Grafana Beyla | eBPF auto-instrumentation | HTTP/gRPC RED metrics and traces with zero code changes |
| Cilium Hubble | CNI-integrated observability | Network flow visibility, service maps |
| Inspektor Gadget | BCC-style tools packaged for k8s | kubectl plugin runs node tools per Pod |

The common pattern: a node agent, not a sidecar. Thanks to eBPF, one DaemonSet per node observes everything — no per-pod proxy or agent injection required.

Tracing Applications with USDT and uprobes

User space can be traced the same way as the kernel.

A uprobe attaches to an arbitrary function in a binary — for example, to measure the latency of a function in a C/C++/Go/Rust binary.

Find traceable symbols in a binary

sudo bpftrace -l 'uprobe:/usr/local/bin/myapp:*' | grep -i order

Latency histogram for a specific function

sudo bpftrace -e '

uprobe:/usr/local/bin/myapp:process_order { @start[tid] = nsecs; }

uretprobe:/usr/local/bin/myapp:process_order /@start[tid]/

{ @latency_us = hist((nsecs - @start[tid]) / 1000);

delete(@start[tid]); }'

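USDT (User Statically-Defined Tracing) probes are stable trace points the application embeds in advance. PostgreSQL, the JVM, Python, and Node.js can expose USDT probes, but availability depends on the version and build options (for example, PostgreSQL needs --enable-dtrace and CPython needs --with-dtrace), so list the probes on your actual binary before relying on them.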

List USDT probes of a process

sudo bpftrace -lp $(pgrep -o postgres) 'usdt:*' | head

Trace PostgreSQL query starts (USDT example)

sudo bpftrace -e 'usdt:/usr/lib/postgresql/16/bin/postgres:query__start

{ printf("query: %s\n", str(arg0)); }'

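Three caveats: uprobes cannot catch functions that were inlined; stripped binaries lack the symbol table that attaching by function name needs; and uretprobes are risky on Go binaries, because the Go runtime can move goroutine stacks while the probe's return trampoline is in place, which can crash the process. Also, attaching a uprobe to a very hot user function accumulates trap costs, so estimate the call frequency first.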

Three Production Troubleshooting Scenarios

Scenario 1: Disk IO surge

Symptom: disk utilization spikes at certain times of day, accompanied by service latency.

Step 1: latency distribution — is the disk actually slow, and how much?

sudo biolatency-bpfcc -m 5 3

Step 2: the culprit process — who is generating the IO?

sudo biotop-bpfcc 5

Step 3: individual IOs (which process, which disk, what size, how slow?)

sudo biosnoop-bpfcc | head -50

Step 4: down to file level — which files concentrate reads/writes?

sudo filetop-bpfcc -C 5

Step 5: who triggers writes — page cache flush or direct writes?

sudo bpftrace -e 'tracepoint:block:block_rq_issue

{ @[comm, args->rwbs] = count(); }'

A typical conclusion: a backup job or log rotation produces sequential writes on the same disk, fattening the tail latency of the database random reads. Fixes include separating IO scheduling classes (ionice), cgroup io.max limits, or moving to a separate volume.

Scenario 2: Intermittent latency (p99 spikes)

Symptom: average response time is fine, but p99 spikes periodically. The APM offers no clues.

Step 1: identify the layer — is it CPU wait?

sudo runqlat-bpfcc 5 12 # 12 runs over a minute: do they align with spikes?

Step 2: find the CPU thief — who suddenly burns CPU?

sudo profile-bpfcc -F 99 -af 30 > spike.folded

Step 3: check blocking — where does off-CPU time go?

sudo offcputime-bpfcc -df 30 -p $(pgrep -o myapp) > offcpu.folded

Step 4: test the GC/mutex hypothesis (e.g., futex wait distribution)

sudo bpftrace -e '

tracepoint:syscalls:sys_enter_futex { @start[tid] = nsecs; }

tracepoint:syscalls:sys_exit_futex /@start[tid]/

{ @futex_ms = hist((nsecs - @start[tid]) / 1000000);

delete(@start[tid]); }'

A typical conclusion: a cron-driven metrics collector grabs CPU every minute and inflates run queue wait, or a lock convoy forms around a specific mutex. If the runqlat histogram shifts right only at spike times, the CPU contention hypothesis strengthens.

Scenario 3: Connection leak

Symptom: file descriptors grow over time until the service fails with too many open files.

Step 1: observe connection lifetimes — are some never closing?

sudo tcplife-bpfcc -p $(pgrep -o myapp)

Step 2: open vs close balance — which side leaks?

sudo bpftrace -e '

kprobe:tcp_connect { @open[comm] = count(); }

kprobe:tcp_close { @close[comm] = count(); }

interval:s:10 { print(@open); print(@close);

clear(@open); clear(@close); }'

Step 3: the leak site stack — connections created where never get closed?

sudo bpftrace -e 'kprobe:tcp_connect /comm == "myapp"/

{ @[ustack] = count(); }'

Step 4: general FD growth check (it might not be sockets)

sudo opensnoop-bpfcc -p $(pgrep -o myapp) -d 30

A typical conclusion: an error path fails to close an HTTP client response body, so connections accumulate outside the pool. The user stacks from step 3 point to the exact code path causing the leak.
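The output of step 2 is two separate maps, which is awkward to compare by eye. A small sketch that joins them per process name and prints the net difference; open_close_balance is a hypothetical helper, and the counts in the sample are invented. One caveat on interpretation: tcp_close also fires for sockets the process accepted, so for a service that both accepts and connects, a leak shows up as a net value that keeps growing across intervals rather than as any single positive number.

```shell
#!/bin/sh
# open_close_balance
# Reads bpftrace map output on stdin, for example
#   @open[myapp]: 120
#   @close[myapp]: 95
# and prints opened, closed, and net (opened - closed) per process name.
open_close_balance() {
  awk '
    /^@open\[/ || /^@close\[/ {
      name = $0
      sub(/^@[a-z]+\[/, "", name)       # drop the "@open[" or "@close[" prefix
      sub(/\]: *[0-9]+$/, "", name)     # drop the "]: <count>" suffix
      if ($0 ~ /^@open/) opened[name] += $NF
      else               closed[name] += $NF
      seen[name] = 1
    }
    END {
      for (n in seen)
        printf "%s opened=%d closed=%d net=%+d\n", n, opened[n], closed[n], opened[n] - closed[n]
    }' | sort
}

# Hypothetical capture from one 10-second interval.
open_close_balance <<'EOF'
@open[myapp]: 120
@open[curl]: 4
@close[myapp]: 95
@close[curl]: 4
EOF
```

Feeding the helper a whole capture sums the counts over all intervals, which is usually what you want when checking for slow drift.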

Overhead Management and Safety Rules

Principles for running eBPF tools on production machines:

1. Estimate the frequency first. Before attaching a probe, count how many times per second the event fires.

sudo bpftrace -e 'kprobe:vfs_read { @ = count(); }

interval:s:1 { print(@); clear(@); }'

2. Aggregate inside the kernel; receive only summaries. Printing per event with printf becomes its own load under high event rates. Default to count/hist aggregation.

3. Bound the duration. Most BCC tools accept count/duration arguments. Avoid open-ended runs; stop as soon as the measurement is done.

4. Push filters as early as possible. Filter by PID, cgroup, or port on the kernel side first.

5. Start with a canary. Do not roll out fleet-wide at once; measure overhead on one machine, then expand.

6. Mind map memory. For aggregations whose key cardinality can explode (such as stack trace keys), check the map size limits.

7. Split privileges. On kernel 5.8 and later, many tools work with CAP_BPF plus CAP_PERFMON instead of full root. See the fundamentals post for details.
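Rule 1 becomes actionable once the measured rate is turned into a CPU budget. A back-of-the-envelope sketch: multiply events per second by the cost of one probe hit and express the result as a share of one core. The helper name probe_overhead_pct is hypothetical, and the 500 ns per-event cost is a placeholder assumption, not a measured figure; the real cost depends on the probe type and on what the program does.

```shell
#!/bin/sh
# probe_overhead_pct EVENTS_PER_SEC COST_NS
# Rough share of one CPU core consumed by a probe:
#   events/s * ns per event / 1,000,000,000 ns per second * 100
probe_overhead_pct() {
  awk -v rate="$1" -v cost_ns="$2" \
    'BEGIN { printf "%.2f\n", rate * cost_ns / 1000000000 * 100 }'
}

# Assumed figures, for illustration only. Measure the rate with the
# per-second counter from rule 1; 500 ns is a placeholder cost.
probe_overhead_pct 200000 500     # 200k events/s
probe_overhead_pct 2000 500       # 2k events/s
```

At the assumed cost, the high-rate probe would burn about a tenth of a core while the low-rate one is negligible, which is the difference between "attach freely" and "sample, filter, or pick another probe".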

Operational Checklist

- [ ] bpftrace and the bcc tools are installed on nodes and version-compatible with the kernel

- [ ] BTF is enabled in the kernel (/sys/kernel/btf/vmlinux exists)

- [ ] The 60-second triage routine is written into the incident runbook

- [ ] Key service binaries are built to preserve frame pointers

- [ ] There is an agreed location to store and share profiling artifacts (flame graphs)

- [ ] In container environments, node access or an Inspektor Gadget-style path is prepared

- [ ] An execution policy for eBPF tools (who, with which capabilities) is defined

- [ ] The team agrees on guardrails for high-frequency probes (estimate frequency first)

Closing Thoughts

The essence of eBPF observability is not a list of tools but a shift in how questions are asked. Instead of guessing "probably the disk," you confirm with a biolatency histogram in 30 seconds. Instead of saying "must be GC," you prove it with an off-CPU flame graph. Once the kernel stops being a black box, the entire incident-response conversation happens on top of data.

If you understand the program/map/verifier principles from the fundamentals post, every tool in this article starts to read as "a program attached to a tracepoint, aggregating into a per-CPU map, streaming through a ring buffer." In the next post, the same technology turns toward security: with Tetragon, Falco, and BPF LSM we will go beyond observing — to detecting and blocking.

References

- [bpftrace official documentation](https://bpftrace.org/)

- [bpftrace one-liner tutorial (GitHub)](https://github.com/bpftrace/bpftrace/blob/master/docs/tutorial_one_liners.md)

- [BCC (BPF Compiler Collection) tools](https://github.com/iovisor/bcc)

- [Brendan Gregg — eBPF tracing resources](https://www.brendangregg.com/ebpf.html)

- [Brendan Gregg — Flame graphs](https://www.brendangregg.com/flamegraphs.html)

- [Brendan Gregg — Off-CPU analysis](https://www.brendangregg.com/offcpuanalysis.html)

- [Linux kernel BPF documentation](https://docs.kernel.org/bpf/)

- [perf wiki (kernel.org)](https://perf.wiki.kernel.org/)

- [Pixie official documentation](https://docs.px.dev/)

- [Parca — continuous profiling](https://www.parca.dev/)

- [Grafana Beyla documentation](https://grafana.com/docs/beyla/latest/)

- [Inspektor Gadget](https://www.inspektor-gadget.io/)

- [Cilium Hubble documentation](https://docs.cilium.io/en/stable/observability/hubble/)
