eBPF Observability in Practice — Opening the Black Box with bpftrace

Introduction
Where Traditional Tools Fall Short
bpftrace Syntax in Five Minutes
- Ten one-liners for the real world
A Tour of the BCC Toolbox
- Tool selection by symptom
- The 60-second triage routine
How to Read Latency Histograms
CPU Profiling and Flame Graphs
- perf-based vs eBPF-based
- Off-CPU analysis
In Containers and Kubernetes
- The principle: eBPF is a node-level technology
- Ecosystem tools (within what I know)
Tracing Applications with USDT and uprobes
Three Production Troubleshooting Scenarios
Overhead Management and Safety Rules
Operational Checklist
Closing Thoughts
References

Introduction

It is 2 a.m. and an alert fires: "API responses intermittently take 3 seconds." You open the APM dashboard, but nothing inside the application code looks slow. CPU and memory are normal. The problem lives somewhere between the application and the hardware — inside the black box called the kernel.

The most powerful way to open that black box is eBPF-based observability. The previous post covered eBPF fundamentals (programs, maps, the verifier); this one is relentlessly practical. We will validate hypotheses in five minutes with bpftrace one-liners, dig into disk, network, and scheduler behavior with the BCC toolbox, dissect CPU usage with flame graphs, and then walk through three real-world incident scenarios step by step.

Where Traditional Tools Fall Short

Every traditional observability tool has its blind spots.

Tool family	Good at	Limitation
Metrics (Prometheus etc.)	Trends, alerting	Individual events hide behind aggregated averages
Logs	Whatever the application chose to record	What was not logged appears not to exist
APM tracing	Per-request decomposition	Requires instrumentation; the kernel portion is a gap
strace / ltrace	Exhaustive syscall observation	ptrace-based, slows the target by an order of magnitude
perf	CPU profiling	Dump-then-postprocess workflow, limited custom logic

eBPF observability differs in three ways:

No code changes, no redeploys. You observe already-running processes and the kernel in place.
Low overhead. Events are filtered and aggregated inside the kernel, and only the distilled results cross into user space — in stark contrast to the context-switch storm strace creates.
Kernel and application through one lens. Syscalls, disk IO, the scheduler, TCP retransmits, and user-function latency connect in a single line of investigation.

The big picture in ASCII:

            Observed layers and their eBPF hooks
+------------------------------------------------------------+
| Application       <── uprobe / USDT (function latency, GC) |
+------------------------------------------------------------+
| Libraries (libc)  <── uprobe (malloc, pthread, ...)         |
+------------------------------------------------------------+
| Syscall boundary  <── tracepoint syscalls:* (open, read...) |
+------------------------------------------------------------+
| Kernel subsystems                                           |
|   VFS/filesystem  <── kprobe/kfunc (vfs_read, ...)          |
|   Block layer     <── tracepoint block:* (IO latency)       |
|   Network stack   <── kprobe tcp_* / tracepoint sock:*      |
|   Scheduler       <── tracepoint sched:* (run queue wait)   |
+------------------------------------------------------------+
| Hardware events   <── perf_event (CPU cycles, cache misses) |
+------------------------------------------------------------+

bpftrace Syntax in Five Minutes

bpftrace is a DSL that resembles awk. The basic structure is "probe / filter / action."

probe /filter/ { action }

probe  : where to attach (tracepoint, kprobe, uprobe, profile...)
filter : which events to keep (e.g. pid == 1234, optional)
action : what to do (print, aggregate into a map)

Knowing a handful of built-in variables and functions is enough to start.

Element	Meaning
pid, tid, comm	Process ID, thread ID, command name
args	Tracepoint argument struct
retval	Return value in kretprobe/uretprobe
nsecs	Nanosecond timestamp
kstack, ustack	Kernel/user stack trace
count(), sum(), avg()	Aggregation functions
hist(), lhist()	log2 / linear histograms
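To make the probe / filter / action structure concrete, here is a minimal sketch that saves a three-part program to a script file. The process name nginx and the file name opens_by_comm.bt are placeholders chosen for illustration; substitute whatever you are investigating.

```shell
# Write a small bpftrace script to disk (no privileges needed for this step).
cat > opens_by_comm.bt <<'EOF'
// probe : every openat() syscall entry
// filter: only processes whose name is "nginx" (hypothetical target)
// action: count opens per file name in the @opens map
tracepoint:syscalls:sys_enter_openat
/comm == "nginx"/
{
  @opens[str(args->filename)] = count();
}
EOF

# Run it with root privileges; Ctrl-C stops tracing and prints the map:
#   sudo bpftrace opens_by_comm.bt
```

Because the action aggregates into a map instead of printing per event, only the final per-file counts leave the kernel.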

Ten one-liners for the real world

Trace file opens — who opens which files:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat
  { printf("%-16s %s\n", comm, str(args->filename)); }'

Trace process execution — every command run on the system:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve
  { printf("%-16s -> %s\n", comm, str(args->filename)); }'

Syscall counts by process (which processes issue the most syscalls):

sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter
  { @[comm] = count(); }'

read syscall latency histogram (distribution of time spent in read() calls):

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/
  { @usecs = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }'

New outbound TCP connections (which process initiates them):

sudo bpftrace -e 'kprobe:tcp_connect
  { printf("%-16s pid=%d\n", comm, pid); }'

Block IO size distribution — histogram of disk request sizes:

sudo bpftrace -e 'tracepoint:block:block_rq_issue
  { @bytes = hist(args->bytes); }'

Total disk IO bytes per process:

sudo bpftrace -e 'tracepoint:block:block_rq_issue
  { @[comm] = sum(args->bytes); }'

CPU sampling at 99Hz — where does kernel time go:

sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

Processes with the most page faults:

sudo bpftrace -e 'software:page-faults:1 { @[comm] = count(); }'

Signal tracing — who sends kill to whom:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_kill
  { printf("%s (pid %d) -> pid %d, sig %d\n",
           comm, pid, args->pid, args->sig); }'

Internalize just these ten and you can start every conversation with measurement instead of guesswork. Explore the available probes with:

sudo bpftrace -l 'tracepoint:syscalls:*' | head -20
sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_openat'   # inspect arguments

A Tour of the BCC Toolbox

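If bpftrace is for ad hoc questions, the BCC tool collection (and its libbpf-based rewrite, libbpf-tools) is a set of polished, ready-made instruments. Most distributions package them: Debian and Ubuntu ship bpfcc-tools, where every command carries a -bpfcc suffix (the form used below), while Fedora and RHEL ship bcc-tools, which installs the same tools without the suffix under /usr/share/bcc/tools.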

Tool selection by symptom

Symptom / question	Tool	What it shows
What keeps spawning processes	execsnoop	Every execve with arguments
Which files are being opened	opensnoop	open calls and result codes
Is the disk slow	biolatency	Block IO latency histogram
Which IOs are slow	biosnoop	Per-IO latency, size, process
TCP connection lifetimes	tcplife	Per-connection lifetime, bytes in/out
Who initiates connections	tcpconnect	Active connect tracing
Who accepts connections	tcpaccept	Passive accept tracing
Are there retransmits	tcpretrans	TCP retransmission events
Are syscalls slow	syscount	Per-syscall counts/latency
Is run queue wait long	runqlat	Scheduler run queue latency histogram
Is the page cache missing often	cachestat	Page cache hits/misses
Is the filesystem slow	ext4slower etc.	FS operations slower than a threshold

Typical invocations:

# Print only ext4 operations slower than 10ms
sudo ext4slower-bpfcc 10

# Block IO latency histogram in milliseconds (-m), printed every second
sudo biolatency-bpfcc -m 1

# Observe lifetimes of all TCP sessions, containers included
sudo tcplife-bpfcc

The 60-second triage routine

The sequence I use for incident first response:

sudo execsnoop-bpfcc         # unexpected process storms? (cron, health checks); runs until Ctrl-C
sudo runqlat-bpfcc 5 1       # CPU contention: run queue wait distribution
sudo biolatency-bpfcc 5 1    # disk: IO latency distribution
sudo tcpretrans-bpfcc        # network: any retransmissions? runs until Ctrl-C
sudo syscount-bpfcc -L -d 5  # top syscalls by latency

Five commands and one minute are enough to narrow the problem down to a layer: CPU, disk, network, or syscalls.
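Two of the five tools (execsnoop and tcpretrans) run until interrupted, so the routine only fits into a minute if each step is bounded. One way to do that, assuming GNU coreutils timeout is installed, is a small wrapper. The name run_for is a hypothetical helper, not part of BCC.

```shell
# run_for SECONDS CMD [ARGS...]
# Runs CMD and sends it SIGINT after SECONDS, the same signal Ctrl-C delivers,
# so the tool gets a chance to print its final summary.
run_for() {
  secs=$1
  shift
  timeout --signal=INT "$secs" "$@"
  rc=$?
  # timeout exits with 124 when it had to stop the command; that is the
  # expected outcome here, so report success.
  if [ "$rc" -eq 124 ]; then
    return 0
  fi
  return "$rc"
}

# Example, from a root shell:
#   run_for 10 execsnoop-bpfcc
#   run_for 10 tcpretrans-bpfcc
```

Running the wrapper from a root shell, rather than putting sudo inside it, keeps the signal path simple: timeout signals the tool directly.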

How to Read Latency Histograms

Among all eBPF tool outputs, the log2 histogram carries the most information. Here is a sample biolatency output:

usecs               : count     distribution
   128 -> 255       : 1402     |**************************              |
   256 -> 511       : 2012     |****************************************|
   512 -> 1023      : 803      |***************                         |
  1024 -> 2047      : 95       |*                                       |
  ...
 65536 -> 131071    : 41       |                                        |
131072 -> 262143    : 17       |                                        |

Three things to read:

Where the mode sits: most IOs cluster between 256 and 511 microseconds. That is the "normal" characteristic of the device.
Whether a tail exists: 17 events took 131 ms or longer, and 41 more landed in the 65 to 131 ms bucket. They barely register in the average, yet they are exactly what ruins the p99.9 user experience.
Whether the distribution is bimodal: two peaks are a strong signal that two distinct paths are mixed together — "cache hit vs miss," "local vs remote," "happy path vs retry." The mean then points to a value between the peaks that no real request ever experiences.

In short: discard the average and look at the distribution. That is the central lesson of eBPF observability.
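To put a number on the tail, the histogram text itself can be post-processed. The following is a rough sketch, assuming the row layout shown in the sample output ("low -> high : count"); hist_tail is a hypothetical helper name.

```shell
# hist_tail THRESHOLD
# Reads a log2 histogram on stdin and reports how many events fall into
# buckets whose lower bound is at or above THRESHOLD (same unit as the rows).
hist_tail() {
  awk -v t="$1" '
    $2 == "->" && $4 == ":" {   # histogram rows only; header and "..." are skipped
      total += $5
      if ($1 >= t) tail += $5
    }
    END {
      if (total > 0)
        printf "%d of %d events (%.2f%%) >= %d usecs\n", tail, total, 100 * tail / total, t
    }'
}

# Example (root shell): sudo biolatency-bpfcc 30 1 | hist_tail 65536
```

Fed only the rows visible in the sample above (the elided middle rows are not counted), it reports 58 of 4370 events at or above 65536 usecs, roughly 1.3 percent: invisible in a mean, decisive for the tail percentiles.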

CPU Profiling and Flame Graphs

perf-based vs eBPF-based

Aspect	perf record	eBPF (profile probe / profile-bpfcc)
Collection	Dump samples to file, postprocess	Aggregate per-stack inside kernel maps
Data volume	perf.data can grow to gigabytes	Only aggregates cross over, small
Overhead	Includes disk-write cost	Generally lower
Flexibility	Standardized workflow, broad PMU support	Custom logic such as conditional collection

Two routes to a flame graph from 99Hz whole-system sampling:

# Route A: perf-based
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Route B: eBPF-based (in-kernel aggregation, folded output directly)
sudo profile-bpfcc -F 99 -af 30 > out.folded
flamegraph.pl out.folded > cpu.svg

Reading a flame graph is simple: horizontal width is the share of samples (and therefore of CPU time), and the call stack deepens as you go up. Find the widest plateau along the top edge, then walk downward to understand why that function is being called. To see user stacks, the target must either be built to preserve frame pointers (-fno-omit-frame-pointer) or be profiled with a tool that unwinds using DWARF call-frame information, as some recent eBPF profilers do. BTF does not substitute for this: it carries type information, not user-space unwind data.
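The widest plateaus can also be listed without rendering an SVG, because the folded format is plain text: one line per unique stack, frames separated by semicolons, followed by a sample count. A minimal sketch under that assumption; top_leaf_frames is a hypothetical helper name.

```shell
# top_leaf_frames [FILE...]
# Sums sample counts by the last frame of each folded stack (the function
# that was actually on CPU) and prints the ten largest shares.
top_leaf_frames() {
  awk '{
      count = $NF                       # last field is the sample count
      stack = $0
      sub(/ [0-9]+$/, "", stack)        # strip the count; frames may contain spaces
      n = split(stack, frames, ";")
      leaf[frames[n]] += count
      total += count
    }
    END {
      for (f in leaf) printf "%6.2f%% %s\n", 100 * leaf[f] / total, f
    }' "$@" | sort -rn | head -n 10
}

# Example: top_leaf_frames out.folded
```

This is only a first cut: it answers "which functions burn the CPU themselves" but drops the caller context that the flame graph preserves.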

Off-CPU analysis

A CPU flame graph only shows time spent running. Yet most latency comes from waiting — on disk, locks, or the network. That is where off-CPU analysis comes in.

# Aggregate, per stack, the time threads spent blocked off-CPU
sudo offcputime-bpfcc -df 30 > offcpu.folded
flamegraph.pl --colors=io offcpu.folded > offcpu.svg

Truths invisible in the CPU flame graph — such as "80 percent of time waiting on a lock" — frequently surface in the off-CPU graph. Make a habit of reading on-CPU and off-CPU graphs as a pair.

In Containers and Kubernetes

The principle: eBPF is a node-level technology

Because eBPF programs attach to the kernel, one program observes every container on the node at once. Containers are ultimately processes isolated by cgroups and namespaces, so filtering by PID or cgroup ID lets you single out a specific Pod. A typical approach when working directly on a node:

# Find the PID of a specific container and filter to that process
PID=$(crictl inspect --output go-template \
      --template '{{.info.pid}}' CONTAINER_ID)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == '$PID'/
  { printf("%s\n", str(args->filename)); }'

Common patterns include deploying a debug-tools image as a DaemonSet, or using kubectl debug node to spin up a temporary pod on the node.
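A PID filter only matches one process, so children forked inside the container slip through. Filtering on the cgroup instead covers everything in it. The sketch below assumes the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and uses bpftrace's cgroup builtin and cgroupid() function; cgroup_v2_path is a hypothetical helper name.

```shell
# cgroup_v2_path
# Reads the content of /proc/<pid>/cgroup on stdin and prints the matching
# directory under /sys/fs/cgroup (assumption: cgroup v2 is mounted there).
cgroup_v2_path() {
  # The cgroup v2 entry has the form "0::/path/inside/hierarchy"
  sed -n 's|^0::|/sys/fs/cgroup|p'
}

# Example (needs root; PID obtained as in the crictl step above):
#   CG=$(cgroup_v2_path < "/proc/$PID/cgroup")
#   sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat
#     /cgroup == cgroupid("'"$CG"'")/
#     { printf("%-16s %s\n", comm, str(args->filename)); }'
```

On hosts still using cgroup v1 the sed expression prints nothing, which is a useful signal that this approach does not apply there.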

Ecosystem tools (within what I know)

Tool	Approach	Characteristics
Pixie	In-cluster eBPF agents	Automatic protocol capture (HTTP, DBs), script-based queries
Parca	eBPF continuous profiling	Always-on low-overhead profiler, time-series flame graphs
Grafana Beyla	eBPF auto-instrumentation	HTTP/gRPC RED metrics and traces with zero code changes
Cilium Hubble	CNI-integrated observability	Network flow visibility, service maps
Inspektor Gadget	BCC-style tools packaged for k8s	kubectl plugin runs node tools per Pod

The common pattern: a node agent, not a sidecar. Thanks to eBPF, one DaemonSet per node observes everything — no per-pod proxy or agent injection required.

Tracing Applications with USDT and uprobes

User space can be traced the same way as the kernel.

A uprobe attaches to an arbitrary function in a binary — for example, to measure the latency of a function in a C/C++/Go/Rust binary.

# Find traceable symbols in a binary
sudo bpftrace -l 'uprobe:/usr/local/bin/myapp:*' | grep -i order

# Latency histogram for a specific function
sudo bpftrace -e '
uprobe:/usr/local/bin/myapp:process_order { @start[tid] = nsecs; }
uretprobe:/usr/local/bin/myapp:process_order /@start[tid]/
  { @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]); }'

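USDT (User Statically-Defined Tracing) probes are stable trace points the application embeds in advance. PostgreSQL, the JVM, Python, and Node.js have all offered USDT probes, though often only in builds configured with DTrace/SystemTap probe support (for example PostgreSQL's --enable-dtrace or CPython's --with-dtrace), so list the probes on your actual binary before relying on them.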

# List USDT probes of a process
sudo bpftrace -lp $(pgrep -o postgres) 'usdt:*' | head

# Trace PostgreSQL query starts (USDT example)
sudo bpftrace -e 'usdt:/usr/lib/postgresql/16/bin/postgres:query__start
  { printf("query: %s\n", str(arg0)); }'

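A few caveats. Uprobes cannot catch functions that the compiler inlined. Stripped binaries lack the symbol table needed to resolve function names, so keep symbols in the binaries you intend to trace. Attaching a uprobe to a very hot user function accumulates trap costs, so estimate the call frequency first. For Go binaries in particular, avoid uretprobe: the runtime can grow and move goroutine stacks, which conflicts with the return-address patching uretprobes rely on and can crash the process.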

Three Production Troubleshooting Scenarios

Scenario 1: Disk IO surge

Symptom: disk utilization spikes at certain times of day, accompanied by service latency.

# Step 1: latency distribution — is the disk actually slow, and how much?
sudo biolatency-bpfcc -m 5 3

# Step 2: the culprit process — who is generating the IO?
sudo biotop-bpfcc 5

# Step 3: individual IOs (which process, which device, what sizes and latencies?)
sudo biosnoop-bpfcc | head -50

# Step 4: down to file level — which files concentrate reads/writes?
sudo filetop-bpfcc -C 5

# Step 5: who triggers writes — page cache flush or direct writes?
sudo bpftrace -e 'tracepoint:block:block_rq_issue
  { @[comm, args->rwbs] = count(); }'

A typical conclusion: a backup job or log rotation produces sequential writes on the same disk, fattening the tail latency of the database's random reads. Fixes include separating IO scheduling classes (ionice, where the active IO scheduler honors priorities), cgroup v2 io.max limits, or moving to a separate volume.

Scenario 2: Intermittent latency (p99 spikes)

Symptom: average response time is fine, but p99 spikes periodically. The APM offers no clues.

# Step 1: identify the layer — is it CPU wait?
sudo runqlat-bpfcc 5 12     # 12 runs over a minute: do they align with spikes?

# Step 2: find the CPU thief — who suddenly burns CPU?
sudo profile-bpfcc -F 99 -af 30 > spike.folded

# Step 3: check blocking — where does off-CPU time go?
sudo offcputime-bpfcc -df 30 -p $(pgrep -o myapp) > offcpu.folded

# Step 4: test the GC/mutex hypothesis (e.g., futex wait distribution)
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_futex /@start[tid]/
  { @futex_ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]); }'

A typical conclusion: a cron-driven metrics collector grabs CPU every minute and inflates run queue wait, or a lock convoy forms around a specific mutex. If the runqlat histogram shifts right only at spike times, the CPU contention hypothesis strengthens.

Scenario 3: Connection leak

Symptom: file descriptors grow over time until the service fails with too many open files.

# Step 1: observe connection lifetimes (tcplife prints a row only when a connection closes, so leaked ones never appear)
sudo tcplife-bpfcc -p $(pgrep -o myapp)

# Step 2: open vs close balance (tcp_connect counts outbound only; tcp_close counts every TCP socket closed)
sudo bpftrace -e '
kprobe:tcp_connect { @open[comm] = count(); }
kprobe:tcp_close   { @close[comm] = count(); }
interval:s:10 { print(@open); print(@close);
                clear(@open); clear(@close); }'

# Step 3: the leak site stack — connections created where never get closed?
sudo bpftrace -e 'kprobe:tcp_connect /comm == "myapp"/
  { @[ustack] = count(); }'

# Step 4: general FD growth check (it might not be sockets)
sudo opensnoop-bpfcc -p $(pgrep -o myapp) -d 30

A typical conclusion: an error path fails to close an HTTP client response body, so connections accumulate outside the pool. The user stacks from step 3 point to the exact code path causing the leak.
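Before and after the fix, it helps to confirm the growth itself and whether sockets account for it. A minimal sketch using only /proc, with no eBPF involved; fd_summary is a hypothetical helper name, and reading another user's process requires root.

```shell
# fd_summary PID
# Prints the number of open descriptors of PID and how many are sockets.
fd_summary() {
  pid=$1
  total=$(ls "/proc/$pid/fd" | wc -l)
  # Socket descriptors are symlinks whose target reads "socket:[inode]"
  sockets=$(ls -l "/proc/$pid/fd" | grep -c 'socket:')
  echo "total=$total sockets=$sockets"
}

# Example: sample once a minute and watch the trend
#   while sleep 60; do date +%T; fd_summary "$(pgrep -o myapp)"; done
```

If total climbs while sockets stays flat, the leak is in files or pipes, and step 4 is the more relevant trace.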

Overhead Management and Safety Rules

Principles for running eBPF tools on production machines:

Estimate the frequency first. Before attaching a probe, count how many times per second the event fires.

sudo bpftrace -e 'kprobe:vfs_read { @ = count(); }
                  interval:s:1 { print(@); clear(@); }'

Aggregate inside the kernel; receive only summaries. Printing per event with printf becomes its own load under high event rates. Default to count/hist aggregation.
Bound the duration. Most BCC tools accept count/duration arguments. Avoid open-ended runs; stop as soon as the measurement is done.
Push filters as early as possible. Filter by PID, cgroup, or port on the kernel side first.
Start with a canary. Do not roll out fleet-wide at once; measure overhead on one machine, then expand.
Mind map memory. For aggregations whose key cardinality can explode (such as stack trace keys), check the map size limits.
Split privileges. On Linux 5.8 and later, many tools work with CAP_BPF plus CAP_PERFMON instead of full root. See the fundamentals post for details.
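The first rule becomes actionable once the measured rate is turned into a rough cost. The sketch below multiplies events per second by a per-event cost in nanoseconds. That cost is an assumption you must supply or measure for your own probe type and kernel; the figure in the example is purely illustrative. probe_cpu_pct is a hypothetical helper name.

```shell
# probe_cpu_pct EVENTS_PER_SEC COST_NS
# Estimates the probe overhead as a percentage of one CPU core:
#   events/s * ns/event = ns of probe work per second; 1e9 ns = 100% of a core.
probe_cpu_pct() {
  awk -v rate="$1" -v cost_ns="$2" \
    'BEGIN { printf "%.2f\n", rate * cost_ns / 1e9 * 100 }'
}

# Example: a million events per second at an assumed 100 ns each
#   probe_cpu_pct 1000000 100
```

With those example inputs it prints 10.00, meaning about a tenth of one core. If the estimate lands in that territory, narrow the filter or pick a lower-frequency probe before attaching.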

Operational Checklist

bpftrace and the bcc tools are installed on nodes and version-compatible with the kernel
BTF is enabled in the kernel (/sys/kernel/btf/vmlinux exists)
The 60-second triage routine is written into the incident runbook
Key service binaries are built to preserve frame pointers
There is an agreed location to store and share profiling artifacts (flame graphs)
In container environments, node access or an Inspektor Gadget-style path is prepared
An execution policy for eBPF tools (who, with which capabilities) is defined
The team agrees on guardrails for high-frequency probes (estimate frequency first)

Closing Thoughts

The essence of eBPF observability is not a list of tools but a shift in how questions are asked. Instead of guessing "probably the disk," you confirm with a biolatency histogram in 30 seconds. Instead of saying "must be GC," you prove it with an off-CPU flame graph. Once the kernel stops being a black box, the entire incident-response conversation happens on top of data.

If you understand the program/map/verifier principles from the fundamentals post, every tool in this article starts to read as "a program attached to a tracepoint, aggregating into a per-CPU map, streaming through a ring buffer." In the next post, the same technology turns toward security: with Tetragon, Falco, and BPF LSM we will go beyond observing — to detecting and blocking.