eBPF Observability in Practice — Opening the Black Box with bpftrace

Introduction

It is 2 a.m. and an alert fires: "API responses intermittently take 3 seconds." You open the APM dashboard, but nothing inside the application code looks slow. CPU and memory are normal. The problem lives somewhere between the application and the hardware — inside the black box called the kernel.

The most powerful way to open that black box is eBPF-based observability. The previous post covered eBPF fundamentals (programs, maps, the verifier); this one is relentlessly practical. We will validate hypotheses in five minutes with bpftrace one-liners, dig into disk, network, and scheduler behavior with the BCC toolbox, dissect CPU usage with flame graphs, and then walk through three real-world incident scenarios step by step.

Where Traditional Tools Fall Short

Every traditional observability tool has its blind spots.

| Tool family | Good at | Limitation |
|---|---|---|
| Metrics (Prometheus etc.) | Trends, alerting | Individual events hide behind aggregated averages |
| Logs | Whatever the application chose to record | What was not logged appears not to exist |
| APM tracing | Per-request decomposition | Requires instrumentation; the kernel portion is a gap |
| strace / ltrace | Exhaustive syscall observation | ptrace-based, slows the target by an order of magnitude |
| perf | CPU profiling | Dump-then-postprocess workflow, limited custom logic |

eBPF observability differs in three ways:

  1. No code changes, no redeploys. You observe already-running processes and the kernel in place.
  2. Low overhead. Events are filtered and aggregated inside the kernel, and only the distilled results cross into user space — in stark contrast to the context-switch storm strace creates.
  3. Kernel and application through one lens. Syscalls, disk IO, the scheduler, TCP retransmits, and user-function latency connect in a single line of investigation.

The big picture in ASCII:

            Observed layers and their eBPF hooks
+------------------------------------------------------------+
| Application       <── uprobe / USDT (function latency, GC) |
+------------------------------------------------------------+
| Libraries (libc)  <── uprobe (malloc, pthread, ...)         |
+------------------------------------------------------------+
| Syscall boundary  <── tracepoint syscalls:* (open, read...) |
+------------------------------------------------------------+
| Kernel subsystems                                           |
|   VFS/filesystem  <── kprobe/kfunc (vfs_read, ...)          |
|   Block layer     <── tracepoint block:* (IO latency)       |
|   Network stack   <── kprobe tcp_* / tracepoint sock:*      |
|   Scheduler       <── tracepoint sched:* (run queue wait)   |
+------------------------------------------------------------+
| Hardware events   <── perf_event (CPU cycles, cache misses) |
+------------------------------------------------------------+

bpftrace Syntax in Five Minutes

bpftrace is a DSL that resembles awk. The basic structure is "probe / filter / action."

probe /filter/ { action }

probe  : where to attach (tracepoint, kprobe, uprobe, profile...)
filter : which events to keep (e.g. pid == 1234, optional)
action : what to do (print, aggregate into a map)

Knowing a handful of built-in variables and functions is enough to start.

| Element | Meaning |
|---|---|
| pid, tid, comm | Process ID, thread ID, command name |
| args | Tracepoint argument struct |
| retval | Return value in kretprobe/uretprobe |
| nsecs | Nanosecond timestamp |
| kstack, ustack | Kernel/user stack trace |
| count(), sum(), avg() | Aggregation functions |
| hist(), lhist() | log2 / linear histograms |

Ten one-liners for the real world

  1. Trace file opens: who opens which files.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat
  { printf("%-16s %s\n", comm, str(args->filename)); }'
  2. Trace process execution: every command run on the system.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve
  { printf("%-16s -> %s\n", comm, str(args->filename)); }'
  3. Syscall counts by process: which process issues the most syscalls.
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter
  { @[comm] = count(); }'
  4. read syscall latency histogram: the latency distribution of read() calls.
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/
  { @usecs = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }'
  5. New TCP connections: which process initiates outbound connections.
sudo bpftrace -e 'kprobe:tcp_connect
  { printf("%-16s pid=%d\n", comm, pid); }'
  6. Block IO size distribution: histogram of disk request sizes.
sudo bpftrace -e 'tracepoint:block:block_rq_issue
  { @bytes = hist(args->bytes); }'
  7. Total disk IO bytes per process.
sudo bpftrace -e 'tracepoint:block:block_rq_issue
  { @[comm] = sum(args->bytes); }'
  8. CPU sampling at 99Hz: where does kernel time go.
sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
  9. Processes with the most page faults.
sudo bpftrace -e 'software:page-faults:1 { @[comm] = count(); }'
  10. Signal tracing: who sends kill to whom.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_kill
  { printf("%s (pid %d) -> pid %d, sig %d\n",
           comm, pid, args->pid, args->sig); }'

Internalize just these ten and you can start every conversation with measurement instead of guesswork. Explore the available probes with:

sudo bpftrace -l 'tracepoint:syscalls:*' | head -20
sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_openat'   # inspect arguments

A Tour of the BCC Toolbox

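If bpftrace is for ad hoc questions, the BCC tool collection (and its libbpf-based rewrite, libbpf-tools) is a set of polished, ready-made instruments. Most distributions ship them as the bpfcc-tools or bcc-tools package. Debian and Ubuntu install each tool with a -bpfcc suffix, which is the naming used in this post; Fedora and RHEL-family packages install them without the suffix under /usr/share/bcc/tools.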

Tool selection by symptom

| Symptom / question | Tool | What it shows |
|---|---|---|
| What keeps spawning processes | execsnoop | Every execve with arguments |
| Which files are being opened | opensnoop | open calls and result codes |
| Is the disk slow | biolatency | Block IO latency histogram |
| Which IOs are slow | biosnoop | Per-IO latency, size, process |
| TCP connection lifetimes | tcplife | Per-connection lifetime, bytes in/out |
| Who initiates connections | tcpconnect | Active connect tracing |
| Who accepts connections | tcpaccept | Passive accept tracing |
| Are there retransmits | tcpretrans | TCP retransmission events |
| Are syscalls slow | syscount | Per-syscall counts/latency |
| Is run queue wait long | runqlat | Scheduler run queue latency histogram |
| Are cache misses frequent | cachestat | Page cache hits/misses |
| Is the filesystem slow | ext4slower etc. | FS operations slower than a threshold |

Typical invocations:

# Print only ext4 operations slower than 10ms
sudo ext4slower-bpfcc 10

# Block IO latency histogram in milliseconds (-m), printed every second
sudo biolatency-bpfcc -m 1

# Observe lifetimes of all TCP sessions, containers included
sudo tcplife-bpfcc

The 60-second triage routine

The sequence I use for incident first response:

sudo execsnoop-bpfcc      # unexpected process storms? (cron, health checks) Ctrl-C to stop
sudo runqlat-bpfcc 5 1    # CPU contention: run queue wait distribution
sudo biolatency-bpfcc 5 1 # disk: IO latency distribution
sudo tcpretrans-bpfcc     # network: any retransmissions? Ctrl-C to stop
sudo syscount-bpfcc -L -d 5  # top syscalls by latency

Five commands and one minute are enough to narrow the problem down to a layer: CPU, disk, network, or syscalls.
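Two of the five tools (execsnoop and tcpretrans) run until interrupted, so a runbook version of the routine benefits from an explicit time bound on each step. The following is a minimal Python sketch of that idea, not a finished script: the function name `triage_plan` and the 10-second default window are my own assumptions, and it assumes the Debian/Ubuntu `-bpfcc` tool names plus the coreutils `timeout` command.

```python
import shlex


def triage_plan(window_s=10):
    """Return the five triage commands as argv lists, each bounded to window_s seconds."""
    w = str(window_s)
    # coreutils `timeout` sends SIGINT after the window, like pressing Ctrl-C.
    stop = ["timeout", "--signal=INT", w]
    plan = [
        stop + ["execsnoop-bpfcc"],            # runs until interrupted
        ["runqlat-bpfcc", w, "1"],             # interval, count
        ["biolatency-bpfcc", w, "1"],          # interval, count
        stop + ["tcpretrans-bpfcc"],           # runs until interrupted
        ["syscount-bpfcc", "-L", "-d", w],     # -d bounds the duration
    ]
    return [["sudo"] + argv for argv in plan]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in triage_plan():
        print(shlex.join(argv))
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; in a runbook you would paste the printed lines or feed each argv list to `subprocess.run`.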

How to Read Latency Histograms

Among all eBPF tool outputs, the log2 histogram carries the most information. Here is a sample biolatency output:

usecs               : count     distribution
   128 -> 255       : 1402     |**************************              |
   256 -> 511       : 2012     |****************************************|
   512 -> 1023      : 803      |***************                         |
  1024 -> 2047      : 95       |*                                       |
  ...
 65536 -> 131071    : 41       |                                        |
131072 -> 262143    : 17       |                                        |

Three things to read:

  1. Where the mode sits: most IOs cluster between 256 and 511 microseconds. That is the "normal" characteristic of the device.
  2. Whether a tail exists: 17 events land in the 131 to 262 ms bucket and another 41 in the 65 to 131 ms bucket. They barely register in the average, yet they are exactly what ruins the tail percentiles (p99, p99.9) that users feel.
  3. Whether the distribution is bimodal: two peaks are a strong signal that two distinct paths are mixed together — "cache hit vs miss," "local vs remote," "happy path vs retry." The mean then points to a value between the peaks that no real request ever experiences.

In short: discard the average and look at the distribution. That is the central lesson of eBPF observability.
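The bimodal point is easy to check with arithmetic. Here is a small Python sketch with made-up numbers (90 percent cache hits at 200 us, 10 percent misses at 8000 us; both values are assumptions for illustration) showing that the mean lands in a power-of-two bucket that contains no samples at all. The helper names are hypothetical.

```python
def log2_bucket(value_us):
    """Bounds (low, high) of the power-of-two bucket that holds value_us."""
    if value_us < 2:
        return (0, 1)
    low = 1 << (value_us.bit_length() - 1)
    return (low, 2 * low - 1)


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100) without floats
    return ordered[rank - 1]


# Hypothetical workload: two distinct paths mixed into one latency stream.
samples = [200] * 900 + [8000] * 100
mean_us = sum(samples) / len(samples)
low, high = log2_bucket(int(mean_us))
in_mean_bucket = sum(low <= s <= high for s in samples)

print(f"mean={mean_us:.0f}us sits in bucket {low}->{high}, "
      f"which holds {in_mean_bucket} samples")
print(f"p50={percentile(samples, 50)}us  p99={percentile(samples, 99)}us")
```

With these inputs the mean is 980 us, inside the 512 to 1023 bucket, where not a single request actually landed; the median is 200 us and the p99 is 8000 us.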

CPU Profiling and Flame Graphs

perf-based vs eBPF-based

| Aspect | perf record | eBPF (profile probe / profile-bpfcc) |
|---|---|---|
| Collection | Dump samples to file, postprocess | Aggregate per-stack inside kernel maps |
| Data volume | perf.data can grow to gigabytes | Only aggregates cross over, small |
| Overhead | Includes disk-write cost | Generally lower |
| Flexibility | Standardized workflow, broad PMU support | Custom logic such as conditional collection |

Two routes to a flame graph from 99Hz whole-system sampling:

# Route A: perf-based
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Route B: eBPF-based (in-kernel aggregation, folded output directly)
sudo profile-bpfcc -F 99 -af 30 > out.folded
flamegraph.pl out.folded > cpu.svg

Reading a flame graph is simple: horizontal width is share of CPU time, and the call stack deepens as you go up. Find the widest plateau, then walk downward to understand why that function is being called. To see user stacks, the target must either be built with frame pointers preserved (-fno-omit-frame-pointer) or be profiled with a tool that unwinds from DWARF call-frame information, as some recent profilers do.
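The same "find the widest frame" reading can be done on the folded file itself. The sketch below assumes the usual folded-stack layout (semicolon-separated frames, root first, then a space and a sample count) and computes the share of samples in which each frame appears. The function name and the sample stacks are hypothetical, and frames are grouped by name only, so identically named functions in different binaries would be merged.

```python
from collections import Counter


def frame_shares(folded_lines):
    """Map each frame name to the fraction of samples whose stack contains it."""
    total = 0
    inclusive = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        # set() avoids double-counting a recursive frame within one stack.
        for frame in set(stack.split(";")):
            inclusive[frame] += n
    if total == 0:
        return {}
    return {frame: n / total for frame, n in inclusive.items()}


# Hypothetical folded output, not taken from a real profile.
sample = [
    "myapp;main;handle_request;parse_json 60",
    "myapp;main;handle_request;query_db 30",
    "myapp;main;log_flush 10",
]
for frame, share in sorted(frame_shares(sample).items(), key=lambda kv: -kv[1]):
    print(f"{share:6.1%}  {frame}")
```

A frame's share here corresponds to its total width in the flame graph, which makes this a quick way to diff two profiles without opening the SVGs.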

Off-CPU analysis

A CPU flame graph only shows time spent running. Yet most latency comes from waiting — on disk, locks, or the network. That is where off-CPU analysis comes in.

# Aggregate, per stack, the time threads spent blocked off-CPU
sudo offcputime-bpfcc -df 30 > offcpu.folded
flamegraph.pl --colors=io offcpu.folded > offcpu.svg

Truths invisible in the CPU flame graph — such as "80 percent of time waiting on a lock" — frequently surface in the off-CPU graph. Make a habit of reading on-CPU and off-CPU graphs as a pair.
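To put a number on a claim like "80 percent of time waiting on a lock," the folded off-CPU file can be summed per blocking call. This sketch rests on two assumptions about offcputime's folded output with -d: that user frames come first, followed by a "-" delimiter frame and then the kernel frames, and that the trailing value is microseconds of blocked time. The function name and sample lines are hypothetical.

```python
from collections import Counter


def blocked_time_by_user_frame(folded_lines, delimiter="-"):
    """Sum off-CPU microseconds per deepest user frame (the one before the delimiter)."""
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, value = line.rpartition(" ")
        frames = stack.split(";")
        # Keep only the user-space part of the stack when a delimiter is present.
        if delimiter in frames:
            frames = frames[:frames.index(delimiter)]
        if frames:
            totals[frames[-1]] += int(value)
    return totals


# Hypothetical lines, shaped like delimited folded off-CPU output.
sample = [
    "myapp;main;handle_request;pthread_mutex_lock;-;do_futex;schedule 800000",
    "myapp;main;handle_request;read;-;vfs_read;schedule 200000",
]
totals = blocked_time_by_user_frame(sample)
grand_total = sum(totals.values())
for frame, usec in totals.most_common():
    print(f"{usec / grand_total:6.1%}  {frame}")
```

Attributing the time to the last user frame answers "which call in my code was blocked," while the kernel frames after the delimiter explain what it was blocked on.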

In Containers and Kubernetes

The principle: eBPF is a node-level technology

Because eBPF programs attach to the kernel, one program observes every container on the node at once. Containers are ultimately processes isolated by cgroups and namespaces, so filtering by PID or cgroup ID lets you single out a specific Pod. A typical approach when working directly on a node:

# Find the host PID of the container's main process and filter to it
# (other processes in the same container are not matched by this filter)
PID=$(crictl inspect --output go-template \
      --template '{{.info.pid}}' CONTAINER_ID)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == '$PID'/
  { printf("%s\n", str(args->filename)); }'

Common patterns include deploying a debug-tools image as a DaemonSet, or using kubectl debug node to spin up a temporary pod on the node.
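Filtering by cgroup rather than PID catches every process in the container. bpftrace exposes the current task's cgroup ID as the `cgroup` builtin and resolves a path with `cgroupid()`. The Python sketch below builds such a filter from the contents of /proc/PID/cgroup. It assumes cgroup v2 mounted at /sys/fs/cgroup and that the file is read from the host, not from inside the container's cgroup namespace; the function names and the sample path are hypothetical.

```python
def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 directory for the text of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        # The cgroup v2 entry has hierarchy ID 0 and an empty controller list.
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return "/sys/fs/cgroup" + parts[2]
    raise ValueError("no cgroup v2 entry found")


def openat_program_for_cgroup(cgroup_dir):
    """Build a bpftrace program tracing openat for one cgroup."""
    return (
        "tracepoint:syscalls:sys_enter_openat "
        f'/cgroup == cgroupid("{cgroup_dir}")/ '
        '{ printf("%-16s %s\\n", comm, str(args->filename)); }'
    )


if __name__ == "__main__":
    # Hypothetical /proc/<pid>/cgroup content for a containerd-managed pod.
    text = "0::/kubepods.slice/kubepods-pod1234.slice/cri-containerd-abc.scope\n"
    print(openat_program_for_cgroup(cgroup_v2_path(text)))
```

The printed string can be passed to `sudo bpftrace -e`; on a real node you would read /proc/$PID/cgroup using the PID obtained from crictl.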

Ecosystem tools (as far as I know)

| Tool | Approach | Characteristics |
|---|---|---|
| Pixie | In-cluster eBPF agents | Automatic protocol capture (HTTP, DBs), script-based queries |
| Parca | eBPF continuous profiling | Always-on low-overhead profiler, time-series flame graphs |
| Grafana Beyla | eBPF auto-instrumentation | HTTP/gRPC RED metrics and traces with zero code changes |
| Cilium Hubble | CNI-integrated observability | Network flow visibility, service maps |
| Inspektor Gadget | BCC-style tools packaged for k8s | kubectl plugin runs node tools per Pod |

The common pattern: a node agent, not a sidecar. Thanks to eBPF, one DaemonSet per node observes everything — no per-pod proxy or agent injection required.

Tracing Applications with USDT and uprobes

User space can be traced the same way as the kernel.

A uprobe attaches to an arbitrary function in a binary — for example, to measure the latency of a function in a C/C++/Go/Rust binary.

# Find traceable symbols in a binary
sudo bpftrace -l 'uprobe:/usr/local/bin/myapp:*' | grep -i order

# Latency histogram for a specific function
sudo bpftrace -e '
uprobe:/usr/local/bin/myapp:process_order { @start[tid] = nsecs; }
uretprobe:/usr/local/bin/myapp:process_order /@start[tid]/
  { @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]); }'

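USDT (User Statically-Defined Tracing) probes are stable trace points that the application embeds in advance. PostgreSQL, the JVM, Python, and Node.js have all defined USDT probes, but whether a particular build includes them depends on its version and compile-time options (PostgreSQL, for example, must be configured with --enable-dtrace).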

# List USDT probes of a process
sudo bpftrace -lp $(pgrep -o postgres) 'usdt:*' | head

# Trace PostgreSQL query starts (USDT example)
sudo bpftrace -e 'usdt:/usr/lib/postgresql/16/bin/postgres:query__start
  { printf("query: %s\n", str(arg0)); }'

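Three caveats. First, uprobes cannot catch functions that the compiler inlined. Second, a stripped binary has no symbol table, so function names cannot be resolved for attachment. Third, attaching a uprobe to a very hot user function accumulates trap costs, so estimate the call frequency first. For Go binaries in particular, avoid uretprobe: the Go runtime can move goroutine stacks, and the return-address patching that uretprobes rely on can crash the process.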

Three Production Troubleshooting Scenarios

Scenario 1: Disk IO surge

Symptom: disk utilization spikes at certain times of day, accompanied by service latency.

# Step 1: latency distribution — is the disk actually slow, and how much?
sudo biolatency-bpfcc -m 5 3

# Step 2: the culprit process — who is generating the IO?
sudo biotop-bpfcc 5

# Step 3: individual IOs: which process, which disk, what size and latency pattern?
sudo biosnoop-bpfcc | head -50

# Step 4: down to file level — which files concentrate reads/writes?
sudo filetop-bpfcc -C 5

# Step 5: who triggers writes — page cache flush or direct writes?
sudo bpftrace -e 'tracepoint:block:block_rq_issue
  { @[comm, args->rwbs] = count(); }'

A typical conclusion: a backup job or log rotation produces sequential writes on the same disk, fattening the tail latency of the database's random reads. Fixes include separating IO scheduling classes (ionice), cgroup io.max limits, or moving the noisy workload to a separate volume.

Scenario 2: Intermittent latency (p99 spikes)

Symptom: average response time is fine, but p99 spikes periodically. The APM offers no clues.

# Step 1: identify the layer — is it CPU wait?
sudo runqlat-bpfcc 5 12     # 12 runs over a minute: do they align with spikes?

# Step 2: find the CPU thief — who suddenly burns CPU?
sudo profile-bpfcc -F 99 -af 30 > spike.folded

# Step 3: check blocking — where does off-CPU time go?
sudo offcputime-bpfcc -df 30 -p $(pgrep -o myapp) > offcpu.folded

# Step 4: test the GC/mutex hypothesis (e.g., futex wait distribution)
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_futex /@start[tid]/
  { @futex_ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]); }'

A typical conclusion: a cron-driven metrics collector grabs CPU every minute and inflates run queue wait, or a lock convoy forms around a specific mutex. If the runqlat histogram shifts right only at spike times, the CPU contention hypothesis strengthens.
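"Shifts right only at spike times" can be checked numerically by comparing a tail percentile between a quiet interval and a spike interval. The sketch below takes histogram rows as (low, high, count) tuples, the same columns runqlat prints, and returns the upper bound of the bucket holding a given percentile. The function name and both histograms are made up for illustration; because log2 buckets are coarse, the result is an upper bound, not an exact percentile.

```python
def bucket_percentile(buckets, pct):
    """Upper bound of the bucket containing the pct-th percentile.

    buckets: (low, high, count) rows in ascending order.
    pct: integer percentile in 1..100.
    """
    total = sum(count for _, _, count in buckets)
    if total == 0:
        raise ValueError("empty histogram")
    rank = -(-pct * total // 100)  # ceil(pct * total / 100) without floats
    seen = 0
    for _, high, count in buckets:
        seen += count
        if seen >= rank:
            return high
    return buckets[-1][1]


# Hypothetical run queue latency histograms in microseconds.
quiet = [(0, 1, 400), (2, 3, 400), (4, 7, 150), (8, 15, 50)]
spike = quiet + [(1024, 2047, 30)]

for name, hist in (("quiet", quiet), ("spike", spike)):
    print(f"{name}: p50<={bucket_percentile(hist, 50)}us "
          f"p99<={bucket_percentile(hist, 99)}us")
```

In this example the median bound stays at 3 us in both intervals while the p99 bound jumps from 15 us to 2047 us, which is the signature of a small population of long waits rather than a uniformly slower scheduler.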

Scenario 3: Connection leak

Symptom: file descriptors grow over time until the service fails with too many open files.

# Step 1: observe connection lifetimes (tcplife reports a session only when it
# closes, so leaked connections show up as sessions that never appear)
sudo tcplife-bpfcc -p $(pgrep -o myapp)

# Step 2: open vs close balance — which side leaks?
sudo bpftrace -e '
kprobe:tcp_connect { @open[comm] = count(); }
kprobe:tcp_close   { @close[comm] = count(); }
interval:s:10 { print(@open); print(@close);
                clear(@open); clear(@close); }'

# Step 3: candidate leak sites: which user stacks create the connections?
sudo bpftrace -e 'kprobe:tcp_connect /comm == "myapp"/
  { @[ustack] = count(); }'

# Step 4: general FD growth check (it might not be sockets)
sudo opensnoop-bpfcc -p $(pgrep -o myapp) -d 30

A typical conclusion: an error path fails to close an HTTP client response body, so connections accumulate outside the pool. The user stacks from step 3 show which code paths create connections, which narrows the search for the one that never closes them.
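The open versus close balance from step 2 is easiest to judge as a running total across intervals. Below is a minimal Python sketch, assuming you have copied the per-interval counts for one process into (opens, closes) pairs. The function names, the six-interval minimum, and the sample numbers are all my own placeholders. Note also that tcp_close fires for accepted connections too, so for a service that also accepts traffic the raw balance needs that adjustment.

```python
def running_balance(intervals):
    """Cumulative (opens - closes) after each interval."""
    balance = 0
    series = []
    for opens, closes in intervals:
        balance += opens - closes
        series.append(balance)
    return series


def looks_like_leak(intervals, min_intervals=6):
    """True if the balance never drops and ends above where it started."""
    series = running_balance(intervals)
    if len(series) < min_intervals:
        return False  # too little data to call it a trend
    never_drops = all(later >= earlier
                      for earlier, later in zip(series, series[1:]))
    return never_drops and series[-1] > series[0]


# Hypothetical 10-second interval counts for one process.
leaking = [(120, 118), (130, 127), (110, 108), (125, 121), (118, 117), (122, 119)]
healthy = [(120, 118), (130, 133), (110, 108), (125, 127), (118, 117), (122, 123)]

print(running_balance(leaking), looks_like_leak(leaking))
print(running_balance(healthy), looks_like_leak(healthy))
```

A balance that only ever climbs is a hint, not proof: long-lived pooled connections produce the same shape during warm-up, so let it run long enough to pass that phase.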

Overhead Management and Safety Rules

Principles for running eBPF tools on production machines:

  1. Estimate the frequency first. Before attaching a probe, count how many times per second the event fires.
sudo bpftrace -e 'kprobe:vfs_read { @ = count(); }
                  interval:s:1 { print(@); clear(@); }'
  2. Aggregate inside the kernel; receive only summaries. Printing per event with printf becomes its own load under high event rates. Default to count/hist aggregation.
  3. Bound the duration. Most BCC tools accept count/duration arguments. Avoid open-ended runs; stop as soon as the measurement is done.
  4. Push filters as early as possible. Filter by PID, cgroup, or port on the kernel side first.
  5. Start with a canary. Do not roll out fleet-wide at once; measure overhead on one machine, then expand.
  6. Mind map memory. For aggregations whose key cardinality can explode (such as stack trace keys), check the map size limits.
  7. Split privileges. Many tools work with CAP_BPF plus CAP_PERFMON (available since Linux 5.8) instead of full root. See the fundamentals post for details.
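The "estimate the frequency first" rule turns into a budget check once you multiply the event rate by a per-event cost. Here is a back-of-the-envelope Python sketch. The per-event cost is a placeholder you must measure on your own hardware and kernel, and the 1 percent budget and function names are likewise assumptions of mine.

```python
def probe_cpu_fraction(events_per_sec, cost_ns_per_event):
    """Fraction of one CPU core consumed by a probe firing at the given rate."""
    return events_per_sec * cost_ns_per_event / 1e9


def within_budget(events_per_sec, cost_ns_per_event, budget=0.01):
    """True if the estimated cost stays at or under the budget (default 1% of a core)."""
    return probe_cpu_fraction(events_per_sec, cost_ns_per_event) <= budget


# Placeholder numbers: rates as read from the counting one-liner,
# and an assumed 200 ns of added work per event.
for rate in (10_000, 1_000_000):
    frac = probe_cpu_fraction(rate, 200)
    verdict = "ok" if within_budget(rate, 200) else "too hot"
    print(f"{rate:>9} events/s -> {frac:.1%} of one core ({verdict})")
```

The arithmetic is deliberately crude: it ignores map contention and cache effects, but it is enough to tell a probe that is safe to attach from one that needs a tighter filter or sampling first.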

Operational Checklist

  • bpftrace and the bcc tools are installed on nodes and version-compatible with the kernel
  • BTF is enabled in the kernel (/sys/kernel/btf/vmlinux exists)
  • The 60-second triage routine is written into the incident runbook
  • Key service binaries are built to preserve frame pointers
  • There is an agreed location to store and share profiling artifacts (flame graphs)
  • In container environments, node access or an Inspektor Gadget-style path is prepared
  • An execution policy for eBPF tools (who, with which capabilities) is defined
  • The team agrees on guardrails for high-frequency probes (estimate frequency first)

Closing Thoughts

The essence of eBPF observability is not a list of tools but a shift in how questions are asked. Instead of guessing "probably the disk," you confirm with a biolatency histogram in 30 seconds. Instead of saying "must be GC," you prove it with an off-CPU flame graph. Once the kernel stops being a black box, the entire incident-response conversation happens on top of data.

If you understand the program/map/verifier principles from the fundamentals post, every tool in this article starts to read as "a program attached to a tracepoint, aggregating into a per-CPU map, streaming through a ring buffer." In the next post, the same technology turns toward security: with Tetragon, Falco, and BPF LSM we will go beyond observing — to detecting and blocking.

References