Observability: OTel eBPF SLO Operating Model 2026


The Inflection Point of the 2026 Observability Stack

In 2025-2026, three technologies are converging in the observability space, fundamentally changing the operating model itself.

OpenTelemetry (OTel) has now become the de facto observability standard as a CNCF graduated project. It unifies Metrics, Logs, and Traces through a single SDK and protocol (OTLP), enabling backend replacement without vendor lock-in.

eBPF-based auto-instrumentation (OBI: OpenTelemetry eBPF Instrumentation) gained momentum in May 2025 when Grafana Labs donated the Beyla project to OTel. It automatically captures HTTP, gRPC, and SQL calls at the kernel level without code changes. In 2026, it targets a stable 1.0 release (opentelemetry.io/blog/2026/obi-goals).

SLO (Service Level Objective) has evolved beyond simple dashboard numbers into an operational framework that controls release decisions and on-call priorities through error budget policies.
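
The error budget behind these policies is simply the complement of the SLO target over a compliance window. A minimal sketch (the 30-day window is an illustrative choice):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes

# A 99.9% monthly SLO allows roughly 43.2 minutes of full downtime;
# tightening to 99.95% halves that to about 21.6 minutes.
print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9995), 1))  # 21.6
```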

This article designs a 2026 operating model that combines these three technologies.

What eBPF Auto-Instrumentation Changes

eBPF Instrumentation vs Traditional SDK Instrumentation Comparison

| Item | Traditional OTel SDK instrumentation | eBPF auto-instrumentation (OBI) |
|---|---|---|
| Code changes | Required (SDK addition, instrumentation code) | Not required (kernel-level auto-capture) |
| Supported languages | Java, Python, Go, .NET, JS, etc. | Language-agnostic (binary level) |
| Overhead | 1-3% CPU | Under 1% CPU (kernel-space execution) |
| Capture depth | Detailed, down to business logic | L7 protocol level (HTTP, gRPC, SQL) |
| Context propagation | Fully supported | Supported for HTTP/gRPC, limited for some protocols |
| Custom attributes | Freely addable | Limited (only what can be extracted from protocols) |
| Deployment method | Together with the application | DaemonSet or sidecar |
| Operational burden | Applied per service | Applied cluster-wide in one rollout |

eBPF Auto-Instrumentation Deployment (Kubernetes)

# otel-ebpf-instrumentation.yaml
# Deploy OBI (OpenTelemetry eBPF Instrumentation) as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-ebpf-instrumentation
  namespace: observability
spec:
  selector:
    matchLabels:
      app: obi
  template:
    metadata:
      labels:
        app: obi
    spec:
      hostPID: true # Required for eBPF probe attachment
      hostNetwork: false
      serviceAccountName: obi
      containers:
        - name: obi
          image: ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation:v0.9.0
          securityContext:
            privileged: true # Required for loading eBPF programs
            runAsUser: 0
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: 'http://otel-collector:4318'
            - name: OTEL_SERVICE_NAME
              value: 'auto-detected' # Auto-extracted from process name
            - name: OTEL_EBPF_TRACK_REQUEST_HEADERS
              value: 'true'
            # Namespace filter for monitoring targets
            - name: OTEL_EBPF_KUBE_NAMESPACE
              value: 'production,staging'
            # Protocols to capture
            - name: OTEL_EBPF_PROTOCOLS
              value: 'HTTP,GRPC,SQL,REDIS'
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
              readOnly: true
            - name: bpf-maps
              mountPath: /sys/fs/bpf
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
        - name: bpf-maps
          hostPath:
            path: /sys/fs/bpf

Hybrid Strategy: eBPF and SDK

Not all services can be instrumented with eBPF alone. eBPF automatically provides L7 protocol-level metrics and traces, but cannot provide business logic-level custom spans or metrics.

[Service Instrumentation Strategy Matrix]

                   Business Logic Observability Need
                   Low               High
                +--------------+--------------+
    High        |  eBPF only   |  eBPF + SDK  |
    importance  |  (infra      |  (core       |
                |   services)  |   business)  |
                +--------------+--------------+
    Low         |  eBPF only   |  SDK only    |
    importance  |  (legacy,    |  (data       |
                |   3rd party) |   pipelines) |
                +--------------+--------------+
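
The matrix reduces to a simple decision rule; a hypothetical helper encoding the four quadrants:

```python
def instrumentation_strategy(service_importance: str, business_logic_need: str) -> str:
    """Return the instrumentation strategy suggested by the matrix above.

    Both arguments are 'high' or 'low'. Only high-importance services with
    high business-logic observability needs justify the cost of combining
    eBPF auto-instrumentation with manual SDK spans.
    """
    if business_logic_need == "low":
        return "eBPF only"
    return "eBPF + SDK" if service_importance == "high" else "SDK only"

print(instrumentation_strategy("high", "high"))  # eBPF + SDK
print(instrumentation_strategy("low", "high"))   # SDK only
print(instrumentation_strategy("high", "low"))   # eBPF only
```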

Specific Application Examples:

# Hybrid strategy: eBPF auto-captures L7 traffic,
# SDK focuses only on business logic

from opentelemetry import trace

tracer = trace.get_tracer("recommendation-engine", "2.1.0")

async def get_recommendations(user_id: str, context: dict):
    # What eBPF auto-captures:
    # - HTTP request/response metrics (latency, status code)
    # - gRPC call metrics
    # - SQL query execution time
    # - Redis command execution time

    # What SDK adds: detailed business logic-level information
    with tracer.start_as_current_span(
        "generate_recommendations",
        attributes={
            "user.segment": context.get("segment", "unknown"),
            "model.version": "v3.2",
            "candidate.count": 1000,
        }
    ) as span:
        # Model inference span
        with tracer.start_as_current_span("ml_inference") as ml_span:
            scores = await ml_model.predict(user_id, context)
            ml_span.set_attribute("inference.latency_ms", scores.latency_ms)
            ml_span.set_attribute("inference.model_name", "rec-v3.2-prod")

        # Filtering span
        with tracer.start_as_current_span("business_filter") as filter_span:
            filtered = apply_business_rules(scores.items, context)
            filter_span.set_attribute("filter.input_count", len(scores.items))
            filter_span.set_attribute("filter.output_count", len(filtered))
            # OTel span attributes must be primitives (or flat sequences),
            # not dicts, so record each removal reason as its own attribute
            for reason, count in {"out_of_stock": 12,
                                  "age_restricted": 3,
                                  "region_blocked": 1}.items():
                filter_span.set_attribute(f"filter.removed.{reason}", count)

        span.set_attribute("result.count", len(filtered))
        return filtered

eBPF-Based SLI Auto-Collection

SLIs can be extracted directly from the data that eBPF captures automatically, making it possible to collect SLIs for every service in the cluster without any code changes.

Metrics Auto-Generated by OBI

# Key metrics generated by OBI (Prometheus format)

# HTTP server request duration
http_server_request_duration_seconds_bucket{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service",
  le="0.005"
} 1234

# HTTP server request count
http_server_request_duration_seconds_count{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service"
} 5678

# gRPC server request duration
rpc_server_duration_seconds_bucket{
  rpc_method="GetUser",
  rpc_service="user.UserService",
  rpc_grpc_status_code="OK",
  service_name="user-service",
  le="0.1"
} 9012

# SQL query duration
db_client_operation_duration_seconds_bucket{
  db_system="postgresql",
  db_operation="SELECT",
  service_name="order-service",
  le="0.05"
} 3456

Recording Rules for Extracting SLIs from eBPF Metrics

# prometheus_rules/ebpf_sli_rules.yaml
groups:
  - name: ebpf_sli_from_obi
    interval: 30s
    rules:
      # Availability SLI: ratio of HTTP responses excluding 5xx
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)

      # Latency SLI: ratio of responses within 300ms
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{
            le="0.3"
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)

      # gRPC Availability SLI
      - record: sli:grpc_availability:ratio_rate5m
        expr: |
          sum(rate(rpc_server_duration_seconds_count{
            rpc_grpc_status_code="OK"
          }[5m])) by (service_name)
          /
          sum(rate(rpc_server_duration_seconds_count[5m])) by (service_name)

      # DB Query Latency SLI: ratio within 50ms
      - record: sli:db_latency:ratio_rate5m
        expr: |
          sum(rate(db_client_operation_duration_seconds_bucket{
            le="0.05"
          }[5m])) by (service_name, db_system)
          /
          sum(rate(db_client_operation_duration_seconds_count[5m])) by (service_name, db_system)
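
These recorded SLIs feed the burn-rate alerts shown later; the burn-rate arithmetic itself is trivial. A sketch with illustrative numbers:

```python
def burn_rate(sli: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to the sustainable rate.

    A burn rate of 1 consumes the budget exactly over the full SLO window;
    14x over a short window is the classic page-worthy critical threshold.
    """
    error_budget = 1.0 - slo_target
    error_rate = 1.0 - sli
    return error_rate / error_budget

# Against a 99.9% target, a measured availability of 98.6% burns budget at 14x.
print(round(burn_rate(0.986, 0.999), 1))  # 14.0
print(round(burn_rate(0.999, 0.999), 1))  # 1.0
```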

Unified Operating Model: OTel + eBPF + SLO

The operating model that combines these three technologies works through the following flow.

Data Flow

[Application Pods]
     |
     +-- eBPF (OBI DaemonSet)
     |   +-- Auto-capture: HTTP/gRPC/SQL metrics + traces
     |
     +-- OTel SDK (optional)
     |   +-- Manual instrumentation: business spans + custom metrics
     |
     +-- Both data sent to OTel Collector
              |
              v
     [OTel Collector Gateway]
     +-- Attribute standardization (semantic conventions)
     +-- Merge eBPF data and SDK data
     +-- Tail-based sampling
     +-- Per-backend routing
              |
         +----+----+
         v         v
   [Metrics DB]  [Traces DB]
   (Mimir)       (Tempo)
         |         |
         v         v
   [Prometheus Recording Rules]
   +-- SLI calculation (from eBPF metrics)
   +-- Error budget calculation
   +-- Burn rate alerts
              |
              v
   [Error Budget Policy Engine]
   +-- Budget >= 50%: Normal releases
   +-- Budget 20-50%: Canary required
   +-- Budget < 20%: Release freeze
   +-- CI/CD pipeline integration
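
The error budget policy engine at the bottom of this flow can be as small as a single function called from the CI/CD pipeline; a minimal sketch using the thresholds above (the gate names are assumptions):

```python
def release_gate(budget_remaining: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to a release decision."""
    if budget_remaining >= 0.5:
        return "normal"   # normal releases proceed
    if budget_remaining >= 0.2:
        return "canary"   # canary rollout required
    return "freeze"       # release freeze until the budget recovers

print(release_gate(0.8))   # normal
print(release_gate(0.35))  # canary
print(release_gate(0.1))   # freeze
```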

Unified Collector Configuration

# otel-collector-unified.yaml
# Unified Gateway receiving both eBPF and SDK data
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Standardize service names generated by eBPF
  # OBI extracts service names from process names,
  # which may differ from service names set in SDK
  transform/service_name:
    trace_statements:
      - context: resource
        statements:
          # Map auto-detected OBI names to standard names
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
          - replace_pattern(attributes["service.name"], "^java$", "unknown-java-service")
    metric_statements:
      - context: resource
        statements:
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")

  # Note: eBPF traces and SDK traces join automatically on the same trace_id,
  # since OBI reads the traceparent from HTTP headers; eBPF spans therefore
  # attach to traces created by the SDK without any extra processor.
  batch:
    send_batch_size: 2048
    timeout: 10s

  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/service_name, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [transform/service_name, batch]
      exporters: [prometheusremotewrite]

Automated SLO Governance

Service Catalog-Based Auto SLO Provisioning

When a new service is deployed, eBPF automatically collects metrics, and alerts are auto-generated according to SLO policies registered in the service catalog.

# slo_provisioner.py
# Read SLO definitions from service catalog and generate Prometheus alert rules

import yaml
from pathlib import Path

def generate_slo_alerts(service_catalog_path: str, output_dir: str):
    """Read SLO definitions from service catalog and auto-generate Prometheus alert rules"""
    with open(service_catalog_path) as f:
        catalog = yaml.safe_load(f)

    for service in catalog["services"]:
        name = service["name"]
        tier = service["tier"]
        slo = service["slo"]

        # Apply defaults based on tier
        availability_target = slo.get("availability", TIER_DEFAULTS[tier]["availability"])
        latency_threshold_ms = slo.get("latency_threshold_ms", TIER_DEFAULTS[tier]["latency_ms"])
        latency_target = slo.get("latency_target", TIER_DEFAULTS[tier]["latency_target"])

        # Generate Prometheus alert rules
        alert_rules = generate_burn_rate_alerts(
            service_name=name,
            availability_target=availability_target,
            latency_threshold_ms=latency_threshold_ms,
            latency_target=latency_target,
        )

        output_file = Path(output_dir) / f"slo-{name}.yaml"
        with open(output_file, "w") as f:
            yaml.dump(alert_rules, f, default_flow_style=False)
        print(f"Generated SLO alerts for {name}: {output_file}")

TIER_DEFAULTS = {
    "tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
    "tier2": {"availability": 0.999,  "latency_ms": 500, "latency_target": 0.95},
    "tier3": {"availability": 0.995,  "latency_ms": 2000, "latency_target": 0.90},
}

def generate_burn_rate_alerts(
    service_name: str,
    availability_target: float,
    latency_threshold_ms: int,
    latency_target: float,
) -> dict:
    """Generate multi-window burn rate alert rules"""
    error_budget = 1.0 - availability_target
    latency_threshold_sec = latency_threshold_ms / 1000.0

    return {
        "groups": [{
            "name": f"slo-{service_name}",
            "rules": [
                # Critical: burn rate 14, 1h/5m window
                {
                    "alert": f"SLO_{service_name}_BurnRate_Critical",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate1h{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate5m{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}'
                    ),
                    "for": "1m",
                    "labels": {
                        "severity": "critical",
                        "service": service_name,
                        "burn_rate": "14",
                    },
                    "annotations": {
                        "summary": f"{service_name}: SLO critical burn rate (14x)",
                        "runbook": f"https://wiki.internal/runbook/slo/{service_name}",
                    },
                },
                # Warning: burn rate 6, 6h/30m window
                {
                    "alert": f"SLO_{service_name}_BurnRate_Warning",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate6h{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate30m{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}'
                    ),
                    "for": "5m",
                    "labels": {
                        "severity": "warning",
                        "service": service_name,
                        "burn_rate": "6",
                    },
                },
            ],
        }],
    }

Service Catalog Example

# service_catalog.yaml
services:
  - name: payment-api
    tier: tier1
    team: payment
    slo:
      availability: 0.9995
      latency_threshold_ms: 200
      latency_target: 0.99

  - name: recommendation-engine
    tier: tier2
    team: ml-platform
    slo:
      availability: 0.999
      latency_threshold_ms: 500

  - name: notification-service
    tier: tier3
    team: platform
    # Uses tier3 defaults

  - name: internal-admin
    tier: tier3
    team: platform
    slo:
      availability: 0.99
      latency_threshold_ms: 3000
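
A catalog like this fails silently when a tier is typoed or an availability is written as a percentage; a small validation sketch, assuming the schema above:

```python
VALID_TIERS = {"tier1", "tier2", "tier3"}

def validate_catalog(catalog: dict) -> list:
    """Return human-readable problems found in a parsed service catalog."""
    problems = []
    for svc in catalog.get("services", []):
        name = svc.get("name", "<unnamed>")
        if svc.get("tier") not in VALID_TIERS:
            problems.append(f"{name}: unknown tier {svc.get('tier')!r}")
        avail = (svc.get("slo") or {}).get("availability")
        if avail is not None and not 0.0 < avail < 1.0:
            problems.append(f"{name}: availability must be in (0, 1), got {avail}")
    return problems

catalog = {
    "services": [
        {"name": "payment-api", "tier": "tier1", "slo": {"availability": 0.9995}},
        {"name": "broken-svc", "tier": "tier9", "slo": {"availability": 99.95}},
    ]
}
for problem in validate_catalog(catalog):
    print(problem)
# broken-svc: unknown tier 'tier9'
# broken-svc: availability must be in (0, 1), got 99.95
```

Running this as a CI check on the catalog repository catches bad definitions before the provisioner turns them into alert rules.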

Operations Cycle: Weekly/Monthly Routines

Weekly SLO Review (30 minutes)

Attendees: SRE Lead, Service Owners, Product Manager

1. Error Budget Status Check (5 min)
   - Review overall service budget remaining dashboard
   - Identify Yellow/Red status services

2. Last Week's Incident SLO Impact (10 min)
   - Budget percentage consumed by each incident
   - Identify recurring patterns

3. Release Plan Review (10 min)
   - Risk assessment for this week's planned releases
   - Determine release strategy based on budget status
     (canary ratio, rollback criteria, etc.)

4. Action Items (5 min)
   - Previous week's action items completion status
   - Assign new action items
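
The "budget percentage consumed by each incident" figure in step 2 is a one-line computation once incident duration and impact are estimated; a sketch with illustrative numbers:

```python
def budget_consumed_pct(incident_minutes: float, impact_ratio: float,
                        slo_target: float, window_days: int = 30) -> float:
    """Percent of the window's error budget consumed by a single incident.

    impact_ratio: share of requests failing during the incident
    (1.0 means a full outage, 0.5 means half of requests failed).
    """
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    return 100.0 * (incident_minutes * impact_ratio) / budget_minutes

# A 20-minute incident failing half of requests, against a 99.9% monthly SLO,
# consumes about 23% of the month's error budget.
print(round(budget_consumed_pct(20, 0.5, 0.999), 1))  # 23.1
```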

Monthly SLO Tuning Review (1 hour)

Attendees: Engineering VP, SRE Team, Service Owners

1. SLO Target Appropriateness Review
   - Are SLO targets appropriate compared to actual SLI trends over the last 3 months?
   - Too generous SLO: potential for unnecessary resource waste
   - Too tight SLO: innovation speed decline

2. eBPF Instrumentation Coverage Check
   - Are new services being auto-instrumented?
   - Need for OBI version updates
   - Need for new protocol support (MQTT, AMQP, etc.)

3. Cost Review
   - Observability data storage cost trends
   - Need for sampling ratio adjustments
   - Retention period policy review

4. Alert Quality Review
   - Last month's alert firing count
   - False positive rate
   - False negative cases

Troubleshooting

1. eBPF Program Load Failure

Error: failed to load BPF program: operation not permitted

Cause: Container lacks CAP_BPF or CAP_SYS_ADMIN capabilities

Solution:

# Add required capabilities to securityContext
securityContext:
  privileged: true
  # Or with minimum privileges:
  capabilities:
    add:
      - BPF
      - SYS_ADMIN
      - NET_ADMIN
      - PERFMON

2. Service Name Shows as "python3" in eBPF Metrics

Cause: OBI extracts service names from process names, and Python services expose only the interpreter name.

Solution:

# Method 1: Configure mapping via OBI environment variables
env:
  - name: OTEL_EBPF_SERVICE_NAME_MAP
    value: 'python3.11:/usr/local/bin/gunicorn=order-service'

# Method 2: Map in Collector's transform processor
processors:
  transform/service_name:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.name"], "order-service")
            where attributes["k8s.deployment.name"] == "order-service"

3. eBPF Traces and SDK Traces Appear Separately

Symptom: Same request but eBPF-generated traces and SDK-generated traces have different trace_ids

Cause: OBI fails to read existing traceparent from HTTP headers, or SDK starts a trace before OBI causing duplicate contexts

Solution:

# Configure OBI to respect existing context
env:
  - name: OTEL_EBPF_CONTEXT_PROPAGATION
    value: 'true'
  - name: OTEL_EBPF_CONTEXT_PROPAGATION_MODE
    value: 'reuse' # Reuse existing context if present, create new if not

4. SLI Recording Rule Returns NaN

Cause: The denominator (total requests) is 0 during the time window, typically during low-traffic overnight hours or for newly deployed services.

Solution:

# Prevent NaN: fall back to 1 when the denominator is 0 (no requests means no errors)
- record: sli:http_availability:ratio_rate5m
  expr: |
    (
      sum(rate(http_server_request_duration_seconds_count{
        http_response_status_code!~"5.."
      }[5m])) by (service_name)
      /
      (sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name) > 0)
    ) or vector(1)
# Caveat: vector(1) carries no service_name label, so the fallback series will
# not match per-service alert selectors; pair it with a separate "no traffic"
# alert for services that should never go quiet.

5. eBPF Overhead Higher Than Expected

Symptom: OBI DaemonSet CPU usage exceeds 500m

Diagnosis and Solution:

# 1. Check which probes are consuming the most CPU
kubectl exec -n observability obi-xxx -- /obi debug perf-stats

# 2. Disable unnecessary protocol detection
# Remove unused protocols from OTEL_EBPF_PROTOCOLS
# Example: remove REDIS if not using Redis

# 3. Exclude high-traffic pods
env:
  - name: OTEL_EBPF_EXCLUDE_NAMESPACES
    value: "kube-system,monitoring"
  - name: OTEL_EBPF_EXCLUDE_PODS
    value: "load-generator-*"  # Exclude load test pods

2026 Roadmap-Based Preparation

According to OBI's 2026 roadmap (opentelemetry.io/blog/2026/obi-goals), the following features are planned.

| Planned feature | Current status | Preparation items |
|---|---|---|
| Stable 1.0 release | Alpha/Beta | Establish a staging test plan before production rollout |
| .NET instrumentation support | Early testing | Identify the .NET service inventory; evaluate SDK replacement potential |
| Messaging systems (MQTT, AMQP, NATS) | In development | Document current instrumentation for message-queue-based services |
| gRPC full context propagation | Improving | Verify trace connectivity between gRPC services |
| Cloud SDK instrumentation (AWS, GCP, Azure) | Planned | Evaluate cloud API call observability requirements |

Quiz

Q1. Why can't eBPF auto-instrumentation completely replace SDK instrumentation?

Answer: eBPF captures L7 protocols (HTTP, gRPC, SQL) at the kernel level, but it cannot generate custom spans or business metrics at the application's business-logic level (e.g., order amounts, recommendation scores). Core business services therefore need a hybrid strategy that uses both eBPF and the SDK.

Q2. Why does the OBI DaemonSet require privileged permissions?

Answer: Loading and executing eBPF programs in the kernel requires high-level capabilities such as CAP_BPF and CAP_SYS_ADMIN, because eBPF probes need to attach to kernel functions, inspect network packets, and trace the system calls of other processes.

Q3. What are the advantages of automatically extracting SLIs from eBPF metrics?

Answer: Availability and latency SLIs for every service in the cluster can be collected in one pass, without code changes. When a new service is deployed, eBPF generates its metrics automatically, so with a service catalog in place, SLO alerts can be auto-provisioned as well.

Q4. When do eBPF traces and SDK traces end up with different trace_ids?

Answer: When OBI fails to read the traceparent from the HTTP headers, or when the SDK has already started a trace and OBI creates a separate one. Setting OTEL_EBPF_CONTEXT_PROPAGATION_MODE to "reuse" solves this by reusing the existing context when one is present.

Q5. What are the prerequisites for service catalog-based auto SLO provisioning?

Answer: eBPF auto-instrumentation must be deployed cluster-wide, service names must be standardized (e.g., mapped from the Kubernetes Deployment name), and each service's tier and SLO definition must be registered in the service catalog. Once these three conditions are met, Prometheus recording rules and alert rules can be generated automatically.

Q6. Why should the SLI recording rule return 1 instead of NaN when the denominator is 0?

Answer: When NaN occurs during periods with no traffic, burn rate alerts cannot be calculated properly. Since there are no errors when there are no requests, treating availability as 1 (100%) is reasonable. However, if traffic is 0 for an extended period, a separate "service unresponsive" alert should be configured.

Q7. What is the difference in focus between weekly and monthly reviews in the observability operating model?

Answer: The weekly review is a tactical meeting focused on current error budget status and this week's release plans. The monthly review is a strategic meeting that examines the appropriateness of SLO targets themselves, eBPF coverage, cost trends, and alert quality.
