
- The Inflection Point of the 2026 Observability Stack
- What eBPF Auto-Instrumentation Changes
- eBPF-Based SLI Auto-Collection
- Unified Operating Model: OTel + eBPF + SLO
- Automated SLO Governance
- Operations Cycle: Weekly/Monthly Routines
- Troubleshooting
- 2026 Roadmap-Based Preparation
- Quiz
- References
The Inflection Point of the 2026 Observability Stack
In 2025-2026, three technologies are converging in the observability space, fundamentally changing the operating model itself.
OpenTelemetry (OTel) has now become the de facto observability standard as a CNCF graduated project. It unifies Metrics, Logs, and Traces through a single SDK and protocol (OTLP), enabling backend replacement without vendor lock-in.
eBPF-based auto-instrumentation (OBI: OpenTelemetry eBPF Instrumentation) gained momentum in May 2025 when Grafana Labs donated the Beyla project to OTel. It automatically captures HTTP, gRPC, and SQL calls at the kernel level without code changes. In 2026, it targets a stable 1.0 release (opentelemetry.io/blog/2026/obi-goals).
SLO (Service Level Objective) has evolved beyond simple dashboard numbers into an operational framework that controls release decisions and on-call priorities through error budget policies.
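The arithmetic behind an error budget policy is simple enough to sketch. The helper names below are illustrative (not part of any SLO tooling); they only show how an availability target translates into a concrete budget:

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Total error budget for the window, expressed as minutes of full downtime."""
    return (1.0 - availability_target) * window_days * 24 * 60

def budget_remaining(availability_target: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (clamped at 0)."""
    budget = allowed_downtime_minutes(availability_target, window_days)
    return max(0.0, 1.0 - bad_minutes / budget)
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime; a single 20-minute outage already consumes roughly 46% of that budget, which is what makes the budget a meaningful input to release decisions.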
This article designs a 2026 operating model that combines these three technologies.
What eBPF Auto-Instrumentation Changes
eBPF Instrumentation vs Traditional SDK Instrumentation Comparison
| Item | Traditional OTel SDK Instrumentation | eBPF Auto-Instrumentation (OBI) |
|---|---|---|
| Code changes | Required (SDK addition, instrumentation code) | Not required (kernel-level auto-capture) |
| Supported languages | Java, Python, Go, .NET, JS, etc. | Language-agnostic (binary level) |
| Overhead | 1-3% CPU | Under 1% CPU (kernel space execution) |
| Capture depth | Detailed down to business logic | L7 protocol level (HTTP, gRPC, SQL) |
| Context propagation | Fully supported | Supported for HTTP/gRPC, limited for some protocols |
| Custom attributes | Freely addable | Limited (only what can be extracted from protocols) |
| Deployment method | Together with application | DaemonSet or sidecar |
| Operational burden | Per-service individual application | Cluster-wide batch application |
eBPF Auto-Instrumentation Deployment (Kubernetes)
# otel-ebpf-instrumentation.yaml
# Deploy OBI (OpenTelemetry eBPF Instrumentation) as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-ebpf-instrumentation
  namespace: observability
spec:
  selector:
    matchLabels:
      app: obi
  template:
    metadata:
      labels:
        app: obi
    spec:
      hostPID: true           # Required for eBPF probe attachment
      hostNetwork: false
      serviceAccountName: obi
      containers:
        - name: obi
          image: ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation:v0.9.0
          securityContext:
            privileged: true  # Required for loading eBPF programs
            runAsUser: 0
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: 'http://otel-collector:4318'
            - name: OTEL_SERVICE_NAME
              value: 'auto-detected'  # Auto-extracted from the process name
            - name: OTEL_EBPF_TRACK_REQUEST_HEADERS
              value: 'true'
            # Namespace filter for monitoring targets
            - name: OTEL_EBPF_KUBE_NAMESPACE
              value: 'production,staging'
            # Protocols to capture
            - name: OTEL_EBPF_PROTOCOLS
              value: 'HTTP,GRPC,SQL,REDIS'
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
              readOnly: true
            - name: bpf-maps
              mountPath: /sys/fs/bpf
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
        - name: bpf-maps
          hostPath:
            path: /sys/fs/bpf
Hybrid Strategy: eBPF and SDK
Not all services can be instrumented with eBPF alone. eBPF automatically provides L7 protocol-level metrics and traces, but cannot provide business logic-level custom spans or metrics.
[Service Instrumentation Strategy Matrix]

                        Business Logic Observability Need
                           Low            High
                    +--------------+--------------+
    High            |  eBPF only   |  eBPF + SDK  |
    importance      |  (infra      |  (core       |
                    |   services)  |   business)  |
                    +--------------+--------------+
    Low             |  eBPF only   |  SDK only    |
    importance      |  (legacy,    |  (data       |
                    |   3rd party) |   pipelines) |
                    +--------------+--------------+
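The matrix reduces to a two-input decision rule. A small sketch (hypothetical helper, for illustration only) encoding the four quadrants above:

```python
def instrumentation_strategy(high_importance: bool,
                             needs_business_observability: bool) -> str:
    """Decision rule encoded from the strategy matrix above."""
    if needs_business_observability:
        # Core business services get both layers; lower-importance
        # services (e.g. data pipelines) rely on the SDK alone
        return "eBPF + SDK" if high_importance else "SDK only"
    # Infra services, legacy, and 3rd-party code: eBPF alone suffices
    return "eBPF only"
```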
A concrete application example:
# Hybrid strategy: eBPF auto-captures L7 traffic,
# the SDK focuses only on business logic
from opentelemetry import trace

tracer = trace.get_tracer("recommendation-engine", "2.1.0")

async def get_recommendations(user_id: str, context: dict):
    # What eBPF auto-captures:
    #   - HTTP request/response metrics (latency, status code)
    #   - gRPC call metrics
    #   - SQL query execution time
    #   - Redis command execution time
    # What the SDK adds: detailed business-logic-level information
    with tracer.start_as_current_span(
        "generate_recommendations",
        attributes={
            "user.segment": context.get("segment", "unknown"),
            "model.version": "v3.2",
            "candidate.count": 1000,
        },
    ) as span:
        # Model inference span
        with tracer.start_as_current_span("ml_inference") as ml_span:
            scores = await ml_model.predict(user_id, context)
            ml_span.set_attribute("inference.latency_ms", scores.latency_ms)
            ml_span.set_attribute("inference.model_name", "rec-v3.2-prod")

        # Filtering span
        with tracer.start_as_current_span("business_filter") as filter_span:
            filtered = apply_business_rules(scores.items, context)
            filter_span.set_attribute("filter.input_count", len(scores.items))
            filter_span.set_attribute("filter.output_count", len(filtered))
            # OTel attribute values must be primitives (or homogeneous arrays),
            # so removal reasons are recorded as individual attributes
            filter_span.set_attribute("filter.removed.out_of_stock", 12)
            filter_span.set_attribute("filter.removed.age_restricted", 3)
            filter_span.set_attribute("filter.removed.region_blocked", 1)

        span.set_attribute("result.count", len(filtered))
        return filtered
eBPF-Based SLI Auto-Collection
Because eBPF captures this traffic automatically, SLIs can be derived from it for every service in the cluster at once, without any code changes.
Metrics Auto-Generated by OBI
# Key metrics generated by OBI (Prometheus format)

# HTTP server request duration
http_server_request_duration_seconds_bucket{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service",
  le="0.005"
} 1234

# HTTP server request count
http_server_request_duration_seconds_count{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service"
} 5678

# gRPC server request duration
rpc_server_duration_seconds_bucket{
  rpc_method="GetUser",
  rpc_service="user.UserService",
  rpc_grpc_status_code="OK",
  service_name="user-service",
  le="0.1"
} 9012

# SQL query duration
db_client_operation_duration_seconds_bucket{
  db_system="postgresql",
  db_operation="SELECT",
  service_name="order-service",
  le="0.05"
} 3456
Recording Rules for Extracting SLIs from eBPF Metrics
# prometheus_rules/ebpf_sli_rules.yaml
groups:
  - name: ebpf_sli_from_obi
    interval: 30s
    rules:
      # Availability SLI: ratio of HTTP responses excluding 5xx
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)

      # Latency SLI: ratio of responses within 300ms
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{
            le="0.3"
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)

      # gRPC availability SLI
      - record: sli:grpc_availability:ratio_rate5m
        expr: |
          sum(rate(rpc_server_duration_seconds_count{
            rpc_grpc_status_code="OK"
          }[5m])) by (service_name)
          /
          sum(rate(rpc_server_duration_seconds_count[5m])) by (service_name)

      # DB query latency SLI: ratio within 50ms
      - record: sli:db_latency:ratio_rate5m
        expr: |
          sum(rate(db_client_operation_duration_seconds_bucket{
            le="0.05"
          }[5m])) by (service_name, db_system)
          /
          sum(rate(db_client_operation_duration_seconds_count[5m])) by (service_name, db_system)
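These recorded SLIs feed the multi-window burn-rate alerts generated later in this article. The underlying arithmetic is worth making explicit; the helpers below are an illustrative sketch, not part of OBI or Prometheus:

```python
def burn_rate(sli: float, availability_target: float) -> float:
    """How many times faster than 'exactly on budget' the service burns budget.
    A burn rate of 1 exhausts the budget precisely at the end of the SLO window."""
    error_budget = 1.0 - availability_target
    return (1.0 - sli) / error_budget

def multiwindow_breach(sli_long: float, sli_short: float,
                       availability_target: float, factor: float) -> bool:
    """Alert only when BOTH the long and short window exceed the burn-rate factor;
    the short window filters out spikes that have already recovered."""
    return (burn_rate(sli_long, availability_target) > factor
            and burn_rate(sli_short, availability_target) > factor)
```

For a 99.9% target, a 1.4% error rate corresponds to a burn rate of 14, the classic "critical" threshold for the 1h/5m window pair.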
Unified Operating Model: OTel + eBPF + SLO
The operating model that combines these three technologies works through the following flow.
Data Flow
[Application Pods]
|
+-- eBPF (OBI DaemonSet)
| +-- Auto-capture: HTTP/gRPC/SQL metrics + traces
|
+-- OTel SDK (optional)
| +-- Manual instrumentation: business spans + custom metrics
|
+-- Both data sent to OTel Collector
|
v
[OTel Collector Gateway]
+-- Attribute standardization (semantic conventions)
+-- Merge eBPF data and SDK data
+-- Tail-based sampling
+-- Per-backend routing
|
+----+----+
v v
[Metrics DB] [Traces DB]
(Mimir) (Tempo)
| |
v v
[Prometheus Recording Rules]
+-- SLI calculation (from eBPF metrics)
+-- Error budget calculation
+-- Burn rate alerts
|
v
[Error Budget Policy Engine]
+-- Budget >= 50%: Normal releases
+-- Budget 20-50%: Canary required
+-- Budget < 20%: Release freeze
+-- CI/CD pipeline integration
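The policy engine at the bottom of this flow can start as little more than a threshold function wired into CI/CD. A minimal sketch (hypothetical function name) mirroring the thresholds above:

```python
def release_decision(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to a release policy."""
    if budget_remaining >= 0.5:
        return "normal"   # Budget >= 50%: normal releases
    if budget_remaining >= 0.2:
        return "canary"   # Budget 20-50%: canary required
    return "freeze"       # Budget < 20%: release freeze
```

In practice this would run as a pipeline gate that queries the budget-remaining recording rule before promoting a deploy, so the policy is enforced mechanically rather than by convention.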
Unified Collector Configuration
# otel-collector-unified.yaml
# Unified gateway receiving both eBPF and SDK data
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Standardize service names generated by eBPF.
  # OBI extracts service names from process names,
  # which may differ from the service names set in the SDK.
  transform/service_name:
    trace_statements:
      - context: resource
        statements:
          # Map auto-detected OBI names to standard names
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
          - replace_pattern(attributes["service.name"], "^java$", "unknown-java-service")
    metric_statements:
      - context: resource
        statements:
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")

  # Note: eBPF traces and SDK traces are connected by a shared trace_id.
  # Since OBI reads traceparent from HTTP headers,
  # eBPF spans automatically join traces created by the SDK.
  batch:
    send_batch_size: 2048
    timeout: 10s

  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/service_name, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [transform/service_name, batch]
      exporters: [prometheusremotewrite]
Automated SLO Governance
Service Catalog-Based Auto SLO Provisioning
When a new service is deployed, eBPF automatically collects metrics, and alerts are auto-generated according to SLO policies registered in the service catalog.
# slo_provisioner.py
# Read SLO definitions from the service catalog and generate Prometheus alert rules
import yaml
from pathlib import Path

# Per-tier defaults applied when a service omits part (or all) of its SLO definition
TIER_DEFAULTS = {
    "tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
    "tier2": {"availability": 0.999, "latency_ms": 500, "latency_target": 0.95},
    "tier3": {"availability": 0.995, "latency_ms": 2000, "latency_target": 0.90},
}

def generate_slo_alerts(service_catalog_path: str, output_dir: str):
    """Read SLO definitions from the service catalog and auto-generate
    Prometheus alert rules."""
    with open(service_catalog_path) as f:
        catalog = yaml.safe_load(f)

    for service in catalog["services"]:
        name = service["name"]
        tier = service["tier"]
        # Some catalog entries (e.g. pure tier-default services) omit 'slo'
        slo = service.get("slo", {})

        # Apply tier-based defaults for any missing fields
        availability_target = slo.get("availability", TIER_DEFAULTS[tier]["availability"])
        latency_threshold_ms = slo.get("latency_threshold_ms", TIER_DEFAULTS[tier]["latency_ms"])
        latency_target = slo.get("latency_target", TIER_DEFAULTS[tier]["latency_target"])

        # Generate Prometheus alert rules
        alert_rules = generate_burn_rate_alerts(
            service_name=name,
            availability_target=availability_target,
            latency_threshold_ms=latency_threshold_ms,
            latency_target=latency_target,
        )
        output_file = Path(output_dir) / f"slo-{name}.yaml"
        with open(output_file, "w") as f:
            yaml.dump(alert_rules, f, default_flow_style=False)
        print(f"Generated SLO alerts for {name}: {output_file}")

def generate_burn_rate_alerts(
    service_name: str,
    availability_target: float,
    latency_threshold_ms: int,
    latency_target: float,
) -> dict:
    """Generate multi-window burn-rate alert rules."""
    error_budget = 1.0 - availability_target
    latency_threshold_sec = latency_threshold_ms / 1000.0  # used by latency-SLO rules
    return {
        "groups": [{
            "name": f"slo-{service_name}",
            "rules": [
                # Critical: burn rate 14x, 1h/5m windows
                {
                    "alert": f"SLO_{service_name}_BurnRate_Critical",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate1h{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate5m{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}'
                    ),
                    "for": "1m",
                    "labels": {
                        "severity": "critical",
                        "service": service_name,
                        "burn_rate": "14",
                    },
                    "annotations": {
                        "summary": f"{service_name}: SLO critical burn rate (14x)",
                        "runbook": f"https://wiki.internal/runbook/slo/{service_name}",
                    },
                },
                # Warning: burn rate 6x, 6h/30m windows
                {
                    "alert": f"SLO_{service_name}_BurnRate_Warning",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate6h{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate30m{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}'
                    ),
                    "for": "5m",
                    "labels": {
                        "severity": "warning",
                        "service": service_name,
                        "burn_rate": "6",
                    },
                },
            ],
        }],
    }
Service Catalog Example
# service_catalog.yaml
services:
  - name: payment-api
    tier: tier1
    team: payment
    slo:
      availability: 0.9995
      latency_threshold_ms: 200
      latency_target: 0.99

  - name: recommendation-engine
    tier: tier2
    team: ml-platform
    slo:
      availability: 0.999
      latency_threshold_ms: 500

  - name: notification-service
    tier: tier3
    team: platform
    # Uses tier3 defaults

  - name: internal-admin
    tier: tier3
    team: platform
    slo:
      availability: 0.99
      latency_threshold_ms: 3000
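The interesting part of this catalog is how missing fields fall back to tier defaults. That merge logic can be sketched on its own (same TIER_DEFAULTS values as in the provisioner above):

```python
TIER_DEFAULTS = {
    "tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
    "tier2": {"availability": 0.999, "latency_ms": 500, "latency_target": 0.95},
    "tier3": {"availability": 0.995, "latency_ms": 2000, "latency_target": 0.90},
}

def resolve_slo(service: dict) -> dict:
    """Fill in any SLO fields a catalog entry omits from its tier defaults."""
    defaults = TIER_DEFAULTS[service["tier"]]
    # Entries like notification-service omit the 'slo' key entirely
    slo = service.get("slo", {})
    return {
        "availability": slo.get("availability", defaults["availability"]),
        "latency_threshold_ms": slo.get("latency_threshold_ms", defaults["latency_ms"]),
        "latency_target": slo.get("latency_target", defaults["latency_target"]),
    }
```

So notification-service resolves to the full tier3 defaults, while recommendation-engine keeps its explicit availability and latency threshold and inherits only the tier2 latency target.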
Operations Cycle: Weekly/Monthly Routines
Weekly SLO Review (30 minutes)
Attendees: SRE Lead, Service Owners, Product Manager
1. Error Budget Status Check (5 min)
- Review overall service budget remaining dashboard
- Identify Yellow/Red status services
2. Last Week's Incident SLO Impact (10 min)
- Budget percentage consumed by each incident
- Identify recurring patterns
3. Release Plan Review (10 min)
- Risk assessment for this week's planned releases
- Determine release strategy based on budget status
(canary ratio, rollback criteria, etc.)
4. Action Items (5 min)
- Previous week's action items completion status
- Assign new action items
Monthly SLO Tuning Review (1 hour)
Attendees: Engineering VP, SRE Team, Service Owners
1. SLO Target Appropriateness Review
- Are SLO targets appropriate compared to actual SLI trends over the last 3 months?
- Too generous SLO: potential for unnecessary resource waste
- Too tight SLO: innovation speed decline
2. eBPF Instrumentation Coverage Check
- Are new services being auto-instrumented?
- Need for OBI version updates
- Need for new protocol support (MQTT, AMQP, etc.)
3. Cost Review
- Observability data storage cost trends
- Need for sampling ratio adjustments
- Retention period policy review
4. Alert Quality Review
- Last month's alert firing count
- False positive rate
- False negative cases
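The alert-quality numbers in item 4 can be computed from a simple log of fired alerts. The data shape below is an illustrative assumption (not a real Alertmanager API); "actionable" means an operator actually had to intervene:

```python
def alert_quality(fired_alerts: list[dict]) -> dict:
    """Summarize precision and false-positive rate from fired alerts.
    Each entry: {"name": str, "actionable": bool}."""
    total = len(fired_alerts)
    actionable = sum(1 for a in fired_alerts if a["actionable"])
    return {
        "total": total,
        "precision": actionable / total if total else 0.0,
        "false_positive_rate": (total - actionable) / total if total else 0.0,
    }
```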
Troubleshooting
1. eBPF Program Load Failure
Error: failed to load BPF program: operation not permitted
Cause: Container lacks CAP_BPF or CAP_SYS_ADMIN capabilities
Solution:
# Option A: run the container privileged
securityContext:
  privileged: true

# Option B: grant only the minimum required capabilities
securityContext:
  capabilities:
    add:
      - BPF
      - SYS_ADMIN
      - NET_ADMIN
      - PERFMON
2. Service Name Shows as "python3" in eBPF Metrics
Cause: OBI extracts service names from process names, but Python services expose the interpreter name
Solution:
# Method 1: Configure a mapping via OBI environment variables
env:
  - name: OTEL_EBPF_SERVICE_NAME_MAP
    value: 'python3.11:/usr/local/bin/gunicorn=order-service'

# Method 2: Map in the Collector's transform processor
processors:
  transform/service_name:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.name"], "order-service") where attributes["k8s.deployment.name"] == "order-service"
3. eBPF Traces and SDK Traces Appear Separately
Symptom: Same request but eBPF-generated traces and SDK-generated traces have different trace_ids
Cause: OBI fails to read existing traceparent from HTTP headers, or SDK starts a trace before OBI causing duplicate contexts
Solution:
# Configure OBI to respect existing context
env:
  - name: OTEL_EBPF_CONTEXT_PROPAGATION
    value: 'true'
  - name: OTEL_EBPF_CONTEXT_PROPAGATION_MODE
    value: 'reuse'  # Reuse an existing context if present, create a new one if not
4. SLI Recording Rule Returns NaN
Cause: The denominator (total requests) is 0 during that time window, typically during low-traffic hours (e.g. late at night) or for newly deployed services.
Solution:
# Prevent NaN: return 1 when the denominator is 0 (treated as "no errors")
# (expr for the sli:http_availability:ratio_rate5m recording rule)
(
  sum(rate(http_server_request_duration_seconds_count{
    http_response_status_code!~"5.."
  }[5m])) by (service_name)
  /
  (sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name) > 0)
) or vector(1)
5. eBPF Overhead Higher Than Expected
Symptom: OBI DaemonSet CPU usage exceeds 500m
Diagnosis and Solution:
# 1. Check which probes are consuming the most CPU
kubectl exec -n observability obi-xxx -- /obi debug perf-stats

# 2. Disable unnecessary protocol detection:
#    remove unused protocols from OTEL_EBPF_PROTOCOLS
#    (e.g. drop REDIS if Redis is not in use)

# 3. Exclude noisy namespaces and high-traffic pods
env:
  - name: OTEL_EBPF_EXCLUDE_NAMESPACES
    value: "kube-system,monitoring"
  - name: OTEL_EBPF_EXCLUDE_PODS
    value: "load-generator-*"  # Exclude load-test pods
2026 Roadmap-Based Preparation
According to OBI's 2026 roadmap (opentelemetry.io/blog/2026/obi-goals), the following features are planned.
| Planned Feature | Current Status | Preparation Items |
|---|---|---|
| Stable 1.0 release | Alpha/Beta | Establish staging test plan before production |
| .NET instrumentation support | Early testing | Identify .NET service inventory, evaluate SDK replacement potential |
| Messaging systems (MQTT, AMQP, NATS) | In development | Document current instrumentation methods for message queue-based services |
| gRPC full context propagation | Improving | Verify trace connectivity status between gRPC services |
| Cloud SDK instrumentation (AWS, GCP, Azure) | Planned | Evaluate cloud API call observability requirements |
Quiz
Q1. Why can't eBPF auto-instrumentation completely replace SDK instrumentation?
Answer: eBPF captures L7 protocols (HTTP, gRPC, SQL) at the kernel level, but cannot generate application business logic-level custom spans or business metrics (e.g., order amounts, recommendation scores). A hybrid strategy using both eBPF and SDK is necessary for core business services.
Q2. Why does the OBI DaemonSet require privileged permissions?
Answer: Loading and executing eBPF programs in the kernel requires high-level capabilities like CAP_BPF and CAP_SYS_ADMIN. This is because eBPF probes need to attach to kernel functions, inspect network packets, and trace system calls of other processes.
Q3. What are the advantages of automatically extracting SLIs from eBPF metrics?
Answer: Availability and latency SLIs for all services in the cluster can be batch-collected without code changes. When a new service is deployed, eBPF automatically generates metrics, so when integrated with a service catalog, SLO alerts can be auto-provisioned as well.
Q4. When do eBPF traces and SDK traces end up with different trace_ids?
Answer: When OBI fails to read the traceparent in the HTTP header, or when SDK has already started a trace and OBI creates a separate trace. Setting OTEL_EBPF_CONTEXT_PROPAGATION_MODE to "reuse" solves this problem by reusing existing context when available.
Q5. What are the prerequisites for service catalog-based auto SLO provisioning?
Answer: eBPF auto-instrumentation must be deployed cluster-wide, service names must be standardized (e.g., k8s deployment name-based mapping), and each service's tier and SLO definition must be registered in the service catalog. When these three conditions are met, Prometheus recording rules and alert rules can be auto-generated.
Q6. Why should the SLI recording rule return 1 instead of NaN when the denominator is 0?
Answer: When NaN occurs during periods with no traffic, burn rate alerts cannot be calculated properly. Since there are no errors when there are no requests, treating availability as 1 (100%) is reasonable. However, if traffic is 0 for an extended period, a separate "service unresponsive" alert should be configured.
Q7. What is the difference in focus between weekly and monthly reviews in the observability operating model?
Answer: The weekly review is a tactical meeting focused on current error budget status and this week's release plans. The monthly review is a strategic meeting that examines the appropriateness of SLO targets themselves, eBPF coverage, cost trends, and alert quality.