Observability: OTel eBPF SLO Operating Model 2026

- The Inflection Point of the 2026 Observability Stack
- What eBPF Auto-Instrumentation Changes
- eBPF-Based SLI Auto-Collection
- Unified Operating Model: OTel + eBPF + SLO
- Automated SLO Governance
- Operations Cycle: Weekly/Monthly Routines
- Troubleshooting
- Preparing for the 2026 Roadmap
- Quiz
- References
The Inflection Point of the 2026 Observability Stack
In 2025-2026, three technologies are converging in the observability space, changing the operating model itself.
**OpenTelemetry (OTel)** has become the de facto observability standard as a CNCF graduated project. It unifies metrics, logs, and traces under a single SDK and protocol (OTLP), so backends can be swapped without vendor lock-in.
**eBPF-based auto-instrumentation (OBI: OpenTelemetry eBPF Instrumentation)** gained momentum in May 2025, when Grafana Labs donated the Beyla project to OpenTelemetry. It automatically captures HTTP, gRPC, and SQL calls at the kernel level without code changes, and targets a stable 1.0 release in 2026 (opentelemetry.io/blog/2026/obi-goals).
**SLO (Service Level Objective)** has evolved beyond a dashboard number into an operational framework that drives release decisions and on-call priorities through error budget policies.
This article designs a 2026 operating model that combines all three.
What eBPF Auto-Instrumentation Changes
eBPF Instrumentation vs. Traditional SDK Instrumentation
| Item | Traditional OTel SDK instrumentation | eBPF auto-instrumentation (OBI) |
|---|---|---|
| Code changes | Required (add SDK, instrumentation code) | Not required (kernel-level auto-capture) |
| Supported languages | Java, Python, Go, .NET, JS, etc. | Language-agnostic (binary level) |
| Overhead | 1-3% CPU | Under 1% CPU (runs in kernel space) |
| Capture depth | Down to business logic | L7 protocol level (HTTP, gRPC, SQL) |
| Context propagation | Fully supported | Supported for HTTP/gRPC; limited for some protocols |
| Custom attributes | Freely addable | Limited (only what the protocol exposes) |
| Deployment | Bundled with the application | DaemonSet or sidecar |
| Operational burden | Applied per service | Applied cluster-wide at once |
eBPF Auto-Instrumentation Deployment (Kubernetes)
```yaml
# otel-ebpf-instrumentation.yaml
# Deploy OBI (OpenTelemetry eBPF Instrumentation) as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-ebpf-instrumentation
  namespace: observability
spec:
  selector:
    matchLabels:
      app: obi
  template:
    metadata:
      labels:
        app: obi
    spec:
      hostPID: true        # Required to attach eBPF probes to host processes
      hostNetwork: false
      serviceAccountName: obi
      containers:
        - name: obi
          image: ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation:v0.9.0
          securityContext:
            privileged: true   # Required to load eBPF programs
            runAsUser: 0
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: 'http://otel-collector:4318'
            - name: OTEL_SERVICE_NAME
              value: 'auto-detected'  # Auto-extracted from the process name
            - name: OTEL_EBPF_TRACK_REQUEST_HEADERS
              value: 'true'
            # Namespace filter for monitoring targets
            - name: OTEL_EBPF_KUBE_NAMESPACE
              value: 'production,staging'
            # Protocols to capture
            - name: OTEL_EBPF_PROTOCOLS
              value: 'HTTP,GRPC,SQL,REDIS'
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
              readOnly: true
            - name: bpf-maps
              mountPath: /sys/fs/bpf
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
        - name: bpf-maps
          hostPath:
            path: /sys/fs/bpf
```
Hybrid Strategy: eBPF and SDK
Not every service can be instrumented with eBPF alone. eBPF automatically provides L7 protocol-level metrics and traces, but it cannot produce business-logic-level custom spans or metrics.
```
[Service Instrumentation Strategy Matrix]

                    Need for business-logic observability
                         Low            High
                   ┌─────────────┬─────────────┐
  High importance  │ eBPF only   │ eBPF + SDK  │
                   │ (infra      │ (core       │
                   │  services)  │  business)  │
                   ├─────────────┼─────────────┤
  Low importance   │ eBPF only   │ SDK only    │
                   │ (legacy,    │ (data       │
                   │  3rd party) │  pipelines) │
                   └─────────────┴─────────────┘
```
A concrete example:
```python
# Hybrid strategy: eBPF auto-captures L7 traffic,
# while the SDK focuses on business logic only.
from opentelemetry import trace

tracer = trace.get_tracer("recommendation-engine", "2.1.0")

async def get_recommendations(user_id: str, context: dict):
    # What eBPF captures automatically:
    # - HTTP request/response metrics (latency, status code)
    # - gRPC call metrics
    # - SQL query execution time
    # - Redis command execution time
    # What the SDK adds: business-logic-level detail.
    with tracer.start_as_current_span(
        "generate_recommendations",
        attributes={
            "user.segment": context.get("segment", "unknown"),
            "model.version": "v3.2",
            "candidate.count": 1000,
        },
    ) as span:
        # Model inference span
        with tracer.start_as_current_span("ml_inference") as ml_span:
            scores = await ml_model.predict(user_id, context)
            ml_span.set_attribute("inference.latency_ms", scores.latency_ms)
            ml_span.set_attribute("inference.model_name", "rec-v3.2-prod")
        # Filtering span
        with tracer.start_as_current_span("business_filter") as filter_span:
            filtered = apply_business_rules(scores.items, context)
            filter_span.set_attribute("filter.input_count", len(scores.items))
            filter_span.set_attribute("filter.output_count", len(filtered))
            # Span attribute values must be primitives (or lists of them),
            # so record removal reasons as individual counters, not a dict.
            filter_span.set_attribute("filter.removed.out_of_stock", 12)
            filter_span.set_attribute("filter.removed.age_restricted", 3)
            filter_span.set_attribute("filter.removed.region_blocked", 1)
        span.set_attribute("result.count", len(filtered))
        return filtered
```
eBPF-Based SLI Auto-Collection
SLIs can be extracted directly from the data eBPF captures, making it possible to collect SLIs for every service in the cluster without code changes.
Metrics Auto-Generated by OBI
```
# Key metrics generated by OBI (Prometheus format)
# HTTP server request duration
http_server_request_duration_seconds_bucket{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service",
  le="0.005"
} 1234
# HTTP server request count
http_server_request_duration_seconds_count{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service",
} 5678
# gRPC server request duration
rpc_server_duration_seconds_bucket{
  rpc_method="GetUser",
  rpc_service="user.UserService",
  rpc_grpc_status_code="OK",
  service_name="user-service",
  le="0.1"
} 9012
# SQL query duration
db_client_operation_duration_seconds_bucket{
  db_system="postgresql",
  db_operation="SELECT",
  service_name="order-service",
  le="0.05"
} 3456
```
Recording Rules That Extract SLIs from eBPF Metrics
```yaml
# prometheus_rules/ebpf_sli_rules.yaml
groups:
  - name: ebpf_sli_from_obi
    interval: 30s
    rules:
      # Availability SLI: share of HTTP responses that are not 5xx
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
      # Latency SLI: share of responses within 300ms
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{
            le="0.3"
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
      # gRPC availability SLI
      - record: sli:grpc_availability:ratio_rate5m
        expr: |
          sum(rate(rpc_server_duration_seconds_count{
            rpc_grpc_status_code="OK"
          }[5m])) by (service_name)
          /
          sum(rate(rpc_server_duration_seconds_count[5m])) by (service_name)
      # DB query latency SLI: share within 50ms
      - record: sli:db_latency:ratio_rate5m
        expr: |
          sum(rate(db_client_operation_duration_seconds_bucket{
            le="0.05"
          }[5m])) by (service_name, db_system)
          /
          sum(rate(db_client_operation_duration_seconds_count[5m])) by (service_name, db_system)
```
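The multi-window burn-rate alerts generated later in this article also query 30m, 1h, and 6h variants of the availability SLI, which the rules above do not define. A sketch of the missing windows (same expression, longer ranges; the group name and interval are assumptions):

```yaml
# prometheus_rules/ebpf_sli_windows.yaml (sketch -- windows chosen to
# match the multi-window burn-rate alerts; adjust to your alert policy)
groups:
  - name: ebpf_sli_windows
    interval: 60s
    rules:
      - record: sli:http_availability:ratio_rate30m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[30m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[30m])) by (service_name)
      - record: sli:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[1h])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[1h])) by (service_name)
      - record: sli:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[6h])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[6h])) by (service_name)
```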
Unified Operating Model: OTel + eBPF + SLO
Combining the three technologies, the operating model works as follows.
Data Flow
```
[Application Pods]
 │
 ├── eBPF (OBI DaemonSet)
 │    └── Auto-capture: HTTP/gRPC/SQL metrics + traces
 │
 ├── OTel SDK (optional)
 │    └── Manual instrumentation: business spans + custom metrics
 │
 └── Both streams sent to the OTel Collector
      │
      ▼
[OTel Collector Gateway]
 ├── Attribute standardization (semantic conventions)
 ├── Merge of eBPF and SDK data
 ├── Tail-based sampling
 └── Per-backend routing
      │
 ┌────┴────┐
 ▼         ▼
[Metrics DB]   [Traces DB]
 (Mimir)        (Tempo)
 │              │
 ▼              ▼
[Prometheus Recording Rules]
 ├── SLI calculation (from eBPF metrics)
 ├── Error budget calculation
 └── Burn rate alerts
      │
      ▼
[Error Budget Policy Engine]
 ├── Budget >= 50%: normal releases
 ├── Budget 20-50%: canary required
 ├── Budget < 20%: release freeze
 └── CI/CD pipeline integration
```
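The policy engine at the bottom of the flow reduces to a small decision function. A minimal sketch (the thresholds mirror the diagram; how the budget figure is fetched, e.g. from a Prometheus query, is left out as an assumption):

```python
# Error-budget release gate, as sketched in the flow above.
# Input: fraction of the error budget still remaining (0.0-1.0).

def release_decision(budget_remaining: float) -> str:
    """Map remaining error budget to a release policy decision."""
    if budget_remaining >= 0.5:
        return "normal"           # Budget >= 50%: normal releases
    if budget_remaining >= 0.2:
        return "canary_required"  # Budget 20-50%: canary required
    return "freeze"               # Budget < 20%: release freeze

if __name__ == "__main__":
    for budget in (0.9, 0.35, 0.1):
        print(f"budget={budget:.0%} -> {release_decision(budget)}")
```

In a CI/CD pipeline this would typically run as a pre-deploy step that fails the job (or forces a canary strategy) based on the returned decision.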
Unified Collector Configuration
```yaml
# otel-collector-unified.yaml
# Unified gateway that receives both eBPF and SDK data
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  # Standardize service names produced by eBPF.
  # OBI extracts service names from process names, which may differ
  # from the service names configured in the SDK.
  transform/service_name:
    trace_statements:
      - context: resource
        statements:
          # Map OBI's auto-detected names to standard names
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
          - replace_pattern(attributes["service.name"], "^java$", "unknown-java-service")
    metric_statements:
      - context: resource
        statements:
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
  # eBPF traces and SDK traces join under the same trace_id:
  # OBI reads traceparent from HTTP headers, so eBPF spans
  # automatically attach to traces started by the SDK.
  batch:
    send_batch_size: 2048
    timeout: 10s
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/service_name, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [transform/service_name, batch]
      exporters: [prometheusremotewrite]
```
Automated SLO Governance
Service Catalog-Based SLO Auto-Provisioning
When a new service is deployed, eBPF starts collecting its metrics automatically, and alerts are generated from the SLO policy registered in the service catalog.
```python
# slo_provisioner.py
# Read SLO definitions from the service catalog and generate
# Prometheus alert rules.
import yaml
from pathlib import Path

TIER_DEFAULTS = {
    "tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
    "tier2": {"availability": 0.999, "latency_ms": 500, "latency_target": 0.95},
    "tier3": {"availability": 0.995, "latency_ms": 2000, "latency_target": 0.90},
}

def generate_slo_alerts(service_catalog_path: str, output_dir: str):
    """Read SLO definitions from the catalog and auto-generate alert rules."""
    catalog = yaml.safe_load(Path(service_catalog_path).read_text())
    for service in catalog["services"]:
        name = service["name"]
        tier = service["tier"]
        slo = service.get("slo", {})  # A service may rely entirely on tier defaults
        # Apply tier defaults for any missing fields
        availability_target = slo.get("availability", TIER_DEFAULTS[tier]["availability"])
        latency_threshold_ms = slo.get("latency_threshold_ms", TIER_DEFAULTS[tier]["latency_ms"])
        latency_target = slo.get("latency_target", TIER_DEFAULTS[tier]["latency_target"])
        # Generate Prometheus alert rules
        alert_rules = generate_burn_rate_alerts(
            service_name=name,
            availability_target=availability_target,
            latency_threshold_ms=latency_threshold_ms,
            latency_target=latency_target,
        )
        output_file = Path(output_dir) / f"slo-{name}.yaml"
        output_file.write_text(yaml.dump(alert_rules, default_flow_style=False))
        print(f"Generated SLO alerts for {name}: {output_file}")

def generate_burn_rate_alerts(
    service_name: str,
    availability_target: float,
    latency_threshold_ms: int,
    latency_target: float,
) -> dict:
    """Generate multi-window burn-rate alert rules."""
    error_budget = 1.0 - availability_target
    # latency_threshold_ms / latency_target would drive analogous
    # latency burn-rate rules; this excerpt covers availability only.
    return {
        "groups": [{
            "name": f"slo-{service_name}",
            "rules": [
                # Critical: burn rate 14, 1h/5m windows
                {
                    "alert": f"SLO_{service_name}_BurnRate_Critical",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate1h{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate5m{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}'
                    ),
                    "for": "1m",
                    "labels": {
                        "severity": "critical",
                        "service": service_name,
                        "burn_rate": "14",
                    },
                    "annotations": {
                        "summary": f"{service_name}: SLO critical burn rate (14x)",
                        "runbook": f"https://wiki.internal/runbook/slo/{service_name}",
                    },
                },
                # Warning: burn rate 6, 6h/30m windows
                {
                    "alert": f"SLO_{service_name}_BurnRate_Warning",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate6h{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate30m{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}'
                    ),
                    "for": "5m",
                    "labels": {
                        "severity": "warning",
                        "service": service_name,
                        "burn_rate": "6",
                    },
                },
            ],
        }],
    }
```
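To make the generated thresholds concrete, here is the arithmetic the generator above embeds into the alert expressions (a sketch; the numbers follow directly from the tier defaults, nothing here is OBI-specific):

```python
# Worked example: error-rate thresholds implied by a burn-rate alert.
# A burn rate of N means the service is consuming its error budget
# N times faster than the SLO allows.

def burn_rate_threshold(availability_target: float, burn_rate: float) -> float:
    """Error rate above which the given burn rate is exceeded."""
    return burn_rate * (1.0 - availability_target)

# tier1 (99.95% availability, i.e. a 0.05% error budget):
#   critical (14x) fires above a 0.7% error rate,
#   warning   (6x) fires above a 0.3% error rate.
print(f"{burn_rate_threshold(0.9995, 14):.4f}")
print(f"{burn_rate_threshold(0.9995, 6):.4f}")
```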
Service Catalog Example
```yaml
# service_catalog.yaml
services:
  - name: payment-api
    tier: tier1
    team: payment
    slo:
      availability: 0.9995
      latency_threshold_ms: 200
      latency_target: 0.99
  - name: recommendation-engine
    tier: tier2
    team: ml-platform
    slo:
      availability: 0.999
      latency_threshold_ms: 500
  - name: notification-service
    tier: tier3
    team: platform
    # Uses tier3 defaults
  - name: internal-admin
    tier: tier3
    team: platform
    slo:
      availability: 0.99
      latency_threshold_ms: 3000
```
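Since the provisioner trusts catalog entries, a CI-time sanity check on the parsed catalog is cheap insurance. A minimal sketch (field names follow the catalog example above; the specific checks are local-policy assumptions, not part of any OTel/OBI tooling):

```python
# Minimal catalog validator sketch. Field names ("name", "tier", "slo",
# "availability") match the catalog example; the checks themselves are
# assumptions about local policy.

VALID_TIERS = {"tier1", "tier2", "tier3"}

def validate_catalog(catalog: dict) -> list[str]:
    """Return human-readable problems for a parsed catalog; empty means valid."""
    problems = []
    for svc in catalog.get("services", []):
        name = svc.get("name")
        if not name:
            problems.append("service entry without a name")
            continue
        if svc.get("tier") not in VALID_TIERS:
            problems.append(f"{name}: unknown tier {svc.get('tier')!r}")
        availability = svc.get("slo", {}).get("availability")
        if availability is not None and not 0.0 < availability < 1.0:
            problems.append(f"{name}: availability must be in (0, 1)")
    return problems

if __name__ == "__main__":
    catalog = {"services": [
        {"name": "payment-api", "tier": "tier1", "slo": {"availability": 0.9995}},
        {"name": "broken-svc", "tier": "gold", "slo": {"availability": 1.5}},
    ]}
    for problem in validate_catalog(catalog):
        print(problem)
```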
Operations Cycle: Weekly/Monthly Routines
Weekly SLO Review (30 minutes)
Attendees: SRE lead, service owners, product manager
1. Error budget status (5 min)
   - Review the dashboard of remaining budget across services
   - Identify services in yellow/red status
2. SLO impact of last week's incidents (10 min)
   - Budget consumed by each incident
   - Check for recurring patterns
3. Release plan review (10 min)
   - Risk assessment for this week's planned releases
   - Decide release strategy based on budget status
     (canary ratio, rollback criteria, etc.)
4. Action items (5 min)
   - Completion status of last week's action items
   - Assign new action items
Monthly SLO Tuning Review (1 hour)
Attendees: engineering VP, SRE team, service owners
1. SLO target appropriateness
   - Are targets appropriate given the last 3 months of actual SLI trends?
   - Too loose an SLO wastes resources; too tight an SLO slows innovation.
2. eBPF instrumentation coverage
   - Are new services being auto-instrumented?
   - Is an OBI version update needed?
   - Is support for new protocols needed (MQTT, AMQP, etc.)?
3. Cost review
   - Observability storage cost trends
   - Whether sampling ratios need adjustment
   - Retention policy check
4. Alert quality review
   - Alert firing counts for the past month
   - False positive rate
   - False negative cases
Troubleshooting
1. eBPF Program Load Failure
```
Error: failed to load BPF program: operation not permitted
```
Cause: the container lacks the CAP_BPF or CAP_SYS_ADMIN capability.
Solution:
```yaml
# Grant the required privileges in securityContext
securityContext:
  privileged: true
```
```yaml
# Or, with least privilege instead of privileged mode:
securityContext:
  capabilities:
    add:
      - BPF
      - SYS_ADMIN
      - NET_ADMIN
      - PERFMON
```
2. Service Name Appears as "python3" in eBPF Metrics
Cause: OBI derives the service name from the process name, and Python services expose the interpreter name.
Solution:
```yaml
# Method 1: configure the mapping via an OBI environment variable
env:
  - name: OTEL_EBPF_SERVICE_NAME_MAP
    value: 'python3.11:/usr/local/bin/gunicorn=order-service'
```
```yaml
# Method 2: map it in the Collector's transform processor
processors:
  transform/service_name:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.name"], "order-service") where attributes["k8s.deployment.name"] == "order-service"
```
3. eBPF and SDK Traces Appear Separately
Symptom: the same request produces an eBPF trace and an SDK trace with different trace_ids.
Cause: OBI fails to read the existing traceparent from the HTTP headers, or the SDK starts its trace before OBI and the contexts are duplicated.
Solution:
```yaml
# Configure OBI to respect existing context
env:
  - name: OTEL_EBPF_CONTEXT_PROPAGATION
    value: 'true'
  - name: OTEL_EBPF_CONTEXT_PROPAGATION_MODE
    value: 'reuse'  # Reuse an existing context if present; otherwise create one
```
4. SLI Recording Rule Returns NaN
Cause: the denominator (total requests) is zero for the window, e.g. during overnight traffic lulls or for newly deployed services.
Solution:
```
# Prevent NaN: if the denominator is 0, return 1 (no traffic, no errors)
sli:http_availability:ratio_rate5m = (
  sum(rate(http_server_request_duration_seconds_count{
    http_response_status_code!~"5.."
  }[5m])) by (service_name)
  /
  (sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name) > 0)
) or vector(1)
```
5. eBPF Overhead Higher Than Expected
Symptom: OBI DaemonSet CPU usage exceeds 500m.
Diagnosis and solutions:
```shell
# 1. Check which probes consume the most CPU
kubectl exec -n observability obi-xxx -- /obi debug perf-stats
```
```yaml
# 2. Disable detection of unused protocols:
#    remove them from OTEL_EBPF_PROTOCOLS
#    (e.g. drop REDIS if Redis is not in use)
# 3. Exclude noisy namespaces and pods
env:
  - name: OTEL_EBPF_EXCLUDE_NAMESPACES
    value: "kube-system,monitoring"
  - name: OTEL_EBPF_EXCLUDE_PODS
    value: "load-generator-*"  # Exclude load-test pods
```
Preparing for the 2026 Roadmap
According to OBI's 2026 roadmap (opentelemetry.io/blog/2026/obi-goals), the following features are planned.
| Planned feature | Current status | Preparation |
|---|---|---|
| Stable 1.0 release | Alpha/Beta | Plan staging tests before any production rollout |
| .NET instrumentation | Early testing | Inventory .NET services; assess SDK replacement potential |
| Messaging systems (MQTT, AMQP, NATS) | In development | Document how queue-based services are instrumented today |
| gRPC full context propagation | Being improved | Verify trace linkage across gRPC services |
| Cloud SDK instrumentation (AWS, GCP, Azure) | Planned | Assess the need to observe cloud API calls |
Quiz
Q1. Why can't eBPF auto-instrumentation fully replace SDK instrumentation?
Answer: ||eBPF captures L7 protocols (HTTP, gRPC, SQL) at the kernel level, but it cannot generate custom spans or business metrics at the application's business-logic level (e.g. order amounts, recommendation scores). Core business services need a hybrid strategy that combines eBPF and the SDK.||
Q2. Why does the OBI DaemonSet need privileged permissions?
Answer: ||Loading and running eBPF programs in the kernel requires elevated capabilities such as CAP_BPF and CAP_SYS_ADMIN, because OBI attaches eBPF probes to kernel functions, inspects network packets, and traces the system calls of other processes.||
Q3. What is the advantage of extracting SLIs automatically from eBPF metrics?
Answer: ||Availability and latency SLIs can be collected for every service in the cluster without code changes. Because eBPF starts generating metrics as soon as a new service is deployed, combining this with a service catalog enables automatic provisioning of SLO alerts as well.||
Q4. When do eBPF traces and SDK traces end up with different trace_ids?
Answer: ||When OBI fails to read the traceparent HTTP header, or when the SDK has already started a trace and OBI creates a separate one. Setting OTEL_EBPF_CONTEXT_PROPAGATION_MODE to "reuse" makes OBI reuse an existing context when one is present, resolving the issue.||
Q5. What are the prerequisites for service catalog-based automatic SLO provisioning?
Answer: ||eBPF auto-instrumentation must be deployed cluster-wide, service names must be standardized (e.g. mapped from the k8s deployment name), and each service's tier and SLO definition must be registered in the service catalog. With these three in place, Prometheus recording rules and alert rules can be generated automatically.||
Q6. Why should an SLI recording rule return 1 instead of NaN when the denominator is 0?
Answer: ||NaN during zero-traffic periods breaks burn-rate alert calculations. With no requests there are no errors, so treating availability as 1 (100%) is reasonable. However, a separate "service unresponsive" alert should cover prolonged zero-traffic periods.||
Q7. How do the weekly and monthly reviews differ in focus?
Answer: ||The weekly review is a tactical meeting focused on the current error budget and this week's release plan. The monthly review is a strategic meeting that examines the appropriateness of the SLO targets themselves, eBPF coverage, cost trends, and alert quality.||
References
Observability: OTel eBPF SLO Operating Model 2026

- The Inflection Point of the 2026 Observability Stack
- What eBPF Auto-Instrumentation Changes
- eBPF-Based SLI Auto-Collection
- Unified Operating Model: OTel + eBPF + SLO
- Automated SLO Governance
- Operations Cycle: Weekly/Monthly Routines
- Troubleshooting
- 2026 Roadmap-Based Preparation
- Quiz
- References
The Inflection Point of the 2026 Observability Stack
In 2025-2026, three technologies are converging in the observability space, fundamentally changing the operating model itself.
OpenTelemetry (OTel) has now become the de facto observability standard as a CNCF graduated project. It unifies Metrics, Logs, and Traces through a single SDK and protocol (OTLP), enabling backend replacement without vendor lock-in.
eBPF-based auto-instrumentation (OBI: OpenTelemetry eBPF Instrumentation) gained momentum in May 2025 when Grafana Labs donated the Beyla project to OTel. It automatically captures HTTP, gRPC, and SQL calls at the kernel level without code changes. In 2026, it targets a stable 1.0 release (opentelemetry.io/blog/2026/obi-goals).
SLO (Service Level Objective) has evolved beyond simple dashboard numbers into an operational framework that controls release decisions and on-call priorities through error budget policies.
This article designs a 2026 operating model that combines these three technologies.
What eBPF Auto-Instrumentation Changes
eBPF Instrumentation vs Traditional SDK Instrumentation Comparison
| Item | Traditional OTel SDK Instrumentation | eBPF Auto-Instrumentation (OBI) |
|---|---|---|
| Code changes | Required (SDK addition, instrumentation code) | Not required (kernel-level auto-capture) |
| Supported languages | Java, Python, Go, .NET, JS, etc. | Language-agnostic (binary level) |
| Overhead | 1-3% CPU | Under 1% CPU (kernel space execution) |
| Capture depth | Detailed down to business logic | L7 protocol level (HTTP, gRPC, SQL) |
| Context propagation | Fully supported | Supported for HTTP/gRPC, limited for some protocols |
| Custom attributes | Freely addable | Limited (only what can be extracted from protocols) |
| Deployment method | Together with application | DaemonSet or sidecar |
| Operational burden | Per-service individual application | Cluster-wide batch application |
eBPF Auto-Instrumentation Deployment (Kubernetes)
# otel-ebpf-instrumentation.yaml
# Deploy OBI (OpenTelemetry eBPF Instrumentation) as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-ebpf-instrumentation
namespace: observability
spec:
selector:
matchLabels:
app: obi
template:
metadata:
labels:
app: obi
spec:
hostPID: true # Required for eBPF probe attachment
hostNetwork: false
serviceAccountName: obi
containers:
- name: obi
image: ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation:v0.9.0
securityContext:
privileged: true # Required for loading eBPF programs
runAsUser: 0
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: 'http://otel-collector:4318'
- name: OTEL_SERVICE_NAME
value: 'auto-detected' # Auto-extracted from process name
- name: OTEL_EBPF_TRACK_REQUEST_HEADERS
value: 'true'
# Namespace filter for monitoring targets
- name: OTEL_EBPF_KUBE_NAMESPACE
value: 'production,staging'
# Protocols to capture
- name: OTEL_EBPF_PROTOCOLS
value: 'HTTP,GRPC,SQL,REDIS'
volumeMounts:
- name: sys-kernel
mountPath: /sys/kernel
readOnly: true
- name: bpf-maps
mountPath: /sys/fs/bpf
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 128Mi
volumes:
- name: sys-kernel
hostPath:
path: /sys/kernel
- name: bpf-maps
hostPath:
path: /sys/fs/bpf
Hybrid Strategy: eBPF and SDK
Not all services can be instrumented with eBPF alone. eBPF automatically provides L7 protocol-level metrics and traces, but cannot provide business logic-level custom spans or metrics.
[Service Instrumentation Strategy Matrix]
Business Logic Observability Need
Low High
+--------------+--------------+
High | eBPF only | eBPF + SDK |
importance | (infra | (core |
| services) | business) |
+--------------+--------------+
Low | eBPF only | SDK only |
importance | (legacy, | (data |
| 3rd party) | pipelines) |
+--------------+--------------+
Specific Application Examples:
# Hybrid strategy: eBPF auto-captures L7 traffic,
# SDK focuses only on business logic
from opentelemetry import trace
tracer = trace.get_tracer("recommendation-engine", "2.1.0")
async def get_recommendations(user_id: str, context: dict):
# What eBPF auto-captures:
# - HTTP request/response metrics (latency, status code)
# - gRPC call metrics
# - SQL query execution time
# - Redis command execution time
# What SDK adds: detailed business logic-level information
with tracer.start_as_current_span(
"generate_recommendations",
attributes={
"user.segment": context.get("segment", "unknown"),
"model.version": "v3.2",
"candidate.count": 1000,
}
) as span:
# Model inference span
with tracer.start_as_current_span("ml_inference") as ml_span:
scores = await ml_model.predict(user_id, context)
ml_span.set_attribute("inference.latency_ms", scores.latency_ms)
ml_span.set_attribute("inference.model_name", "rec-v3.2-prod")
# Filtering span
with tracer.start_as_current_span("business_filter") as filter_span:
filtered = apply_business_rules(scores.items, context)
filter_span.set_attribute("filter.input_count", len(scores.items))
filter_span.set_attribute("filter.output_count", len(filtered))
filter_span.set_attribute("filter.removed_reasons", {
"out_of_stock": 12,
"age_restricted": 3,
"region_blocked": 1,
})
span.set_attribute("result.count", len(filtered))
return filtered
eBPF-Based SLI Auto-Collection
SLIs can be extracted from data that eBPF automatically captures. It becomes possible to batch-collect SLIs for all services without code changes.
Metrics Auto-Generated by OBI
# Key metrics generated by OBI (Prometheus format)
# HTTP server request duration
http_server_request_duration_seconds_bucket{
http_request_method="GET",
http_response_status_code="200",
url_path="/api/v1/orders",
service_name="order-service",
le="0.005"
} 1234
# HTTP server request count
http_server_request_duration_seconds_count{
http_request_method="GET",
http_response_status_code="200",
url_path="/api/v1/orders",
service_name="order-service",
} 5678
# gRPC server request duration
rpc_server_duration_seconds_bucket{
rpc_method="GetUser",
rpc_service="user.UserService",
rpc_grpc_status_code="OK",
service_name="user-service",
le="0.1"
} 9012
# SQL query duration
db_client_operation_duration_seconds_bucket{
db_system="postgresql",
db_operation="SELECT",
service_name="order-service",
le="0.05"
} 3456
Recording Rules for Extracting SLIs from eBPF Metrics
# prometheus_rules/ebpf_sli_rules.yaml
groups:
- name: ebpf_sli_from_obi
interval: 30s
rules:
# Availability SLI: ratio of HTTP responses excluding 5xx
- record: sli:http_availability:ratio_rate5m
expr: |
sum(rate(http_server_request_duration_seconds_count{
http_response_status_code!~"5.."
}[5m])) by (service_name)
/
sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
# Latency SLI: ratio of responses within 300ms
- record: sli:http_latency:ratio_rate5m
expr: |
sum(rate(http_server_request_duration_seconds_bucket{
le="0.3"
}[5m])) by (service_name)
/
sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
# gRPC Availability SLI
- record: sli:grpc_availability:ratio_rate5m
expr: |
sum(rate(rpc_server_duration_seconds_count{
rpc_grpc_status_code="OK"
}[5m])) by (service_name)
/
sum(rate(rpc_server_duration_seconds_count[5m])) by (service_name)
# DB Query Latency SLI: ratio within 50ms
- record: sli:db_latency:ratio_rate5m
expr: |
sum(rate(db_client_operation_duration_seconds_bucket{
le="0.05"
}[5m])) by (service_name, db_system)
/
sum(rate(db_client_operation_duration_seconds_count[5m])) by (service_name, db_system)
Unified Operating Model: OTel + eBPF + SLO
The operating model that combines these three technologies works through the following flow.
Data Flow
[Application Pods]
|
+-- eBPF (OBI DaemonSet)
| +-- Auto-capture: HTTP/gRPC/SQL metrics + traces
|
+-- OTel SDK (optional)
| +-- Manual instrumentation: business spans + custom metrics
|
+-- Both data sent to OTel Collector
|
v
[OTel Collector Gateway]
+-- Attribute standardization (semantic conventions)
+-- Merge eBPF data and SDK data
+-- Tail-based sampling
+-- Per-backend routing
|
+----+----+
v v
[Metrics DB] [Traces DB]
(Mimir) (Tempo)
| |
v v
[Prometheus Recording Rules]
+-- SLI calculation (from eBPF metrics)
+-- Error budget calculation
+-- Burn rate alerts
|
v
[Error Budget Policy Engine]
+-- Budget >= 50%: Normal releases
+-- Budget 20-50%: Canary required
+-- Budget < 20%: Release freeze
+-- CI/CD pipeline integration
Unified Collector Configuration
# otel-collector-unified.yaml
# Unified Gateway receiving both eBPF and SDK data
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
# Standardize service names generated by eBPF
# OBI extracts service names from process names,
# which may differ from service names set in SDK
transform/service_name:
trace_statements:
- context: resource
statements:
# Map auto-detected OBI names to standard names
- replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
- replace_pattern(attributes["service.name"], "^java$", "unknown-java-service")
metric_statements:
- context: resource
statements:
- replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
# Connect eBPF traces and SDK traces with the same trace_id
# Since OBI reads traceparent from HTTP headers,
# eBPF spans automatically join traces created by SDK
batch:
send_batch_size: 2048
timeout: 10s
tail_sampling:
decision_wait: 15s
num_traces: 200000
policies:
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-requests
type: latency
latency:
threshold_ms: 500
- name: sample-normal
type: probabilistic
probabilistic:
sampling_percentage: 5
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheusremotewrite:
endpoint: http://mimir:9009/api/v1/push
resource_to_telemetry_conversion:
enabled: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [transform/service_name, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [transform/service_name, batch]
exporters: [prometheusremotewrite]
Automated SLO Governance
Service Catalog-Based Auto SLO Provisioning
When a new service is deployed, eBPF automatically collects metrics, and alerts are auto-generated according to SLO policies registered in the service catalog.
# slo_provisioner.py
# Read SLO definitions from service catalog and generate Prometheus alert rules
import yaml
from pathlib import Path
def generate_slo_alerts(service_catalog_path: str, output_dir: str):
"""Read SLO definitions from service catalog and auto-generate Prometheus alert rules"""
catalog = yaml.safe_load(open(service_catalog_path))
for service in catalog["services"]:
name = service["name"]
tier = service["tier"]
slo = service["slo"]
# Apply defaults based on tier
availability_target = slo.get("availability", TIER_DEFAULTS[tier]["availability"])
latency_threshold_ms = slo.get("latency_threshold_ms", TIER_DEFAULTS[tier]["latency_ms"])
latency_target = slo.get("latency_target", TIER_DEFAULTS[tier]["latency_target"])
# Generate Prometheus alert rules
alert_rules = generate_burn_rate_alerts(
service_name=name,
availability_target=availability_target,
latency_threshold_ms=latency_threshold_ms,
latency_target=latency_target,
)
output_file = Path(output_dir) / f"slo-{name}.yaml"
yaml.dump(alert_rules, open(output_file, "w"), default_flow_style=False)
print(f"Generated SLO alerts for {name}: {output_file}")
TIER_DEFAULTS = {
"tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
"tier2": {"availability": 0.999, "latency_ms": 500, "latency_target": 0.95},
"tier3": {"availability": 0.995, "latency_ms": 2000, "latency_target": 0.90},
}
def generate_burn_rate_alerts(
service_name: str,
availability_target: float,
latency_threshold_ms: int,
latency_target: float,
) -> dict:
"""Generate multi-window burn rate alert rules"""
error_budget = 1.0 - availability_target
latency_threshold_sec = latency_threshold_ms / 1000.0
return {
"groups": [{
"name": f"slo-{service_name}",
"rules": [
# Critical: burn rate 14, 1h/5m window
{
"alert": f"SLO_{service_name}_BurnRate_Critical",
"expr": (
f'(\n'
f' 1 - sli:http_availability:ratio_rate1h{{service_name="{service_name}"}}\n'
f') > {14 * error_budget}\n'
f'and\n'
f'(\n'
f' 1 - sli:http_availability:ratio_rate5m{{service_name="{service_name}"}}\n'
f') > {14 * error_budget}'
),
"for": "1m",
"labels": {
"severity": "critical",
"service": service_name,
"burn_rate": "14",
},
"annotations": {
"summary": f"{service_name}: SLO critical burn rate (14x)",
"runbook": f"https://wiki.internal/runbook/slo/{service_name}",
},
},
# Warning: burn rate 6, 6h/30m window
{
"alert": f"SLO_{service_name}_BurnRate_Warning",
"expr": (
f'(\n'
f' 1 - sli:http_availability:ratio_rate6h{{service_name="{service_name}"}}\n'
f') > {6 * error_budget}\n'
f'and\n'
f'(\n'
f' 1 - sli:http_availability:ratio_rate30m{{service_name="{service_name}"}}\n'
f') > {6 * error_budget}'
),
"for": "5m",
"labels": {
"severity": "warning",
"service": service_name,
"burn_rate": "6",
},
},
],
}],
}
Service Catalog Example
# service_catalog.yaml
services:
- name: payment-api
tier: tier1
team: payment
slo:
availability: 0.9995
latency_threshold_ms: 200
latency_target: 0.99
- name: recommendation-engine
tier: tier2
team: ml-platform
slo:
availability: 0.999
latency_threshold_ms: 500
- name: notification-service
tier: tier3
team: platform
# Uses tier3 defaults
- name: internal-admin
tier: tier3
team: platform
slo:
availability: 0.99
latency_threshold_ms: 3000
Operations Cycle: Weekly/Monthly Routines
Weekly SLO Review (30 minutes)
Attendees: SRE Lead, Service Owners, Product Manager
1. Error Budget Status Check (5 min)
- Review overall service budget remaining dashboard
- Identify Yellow/Red status services
2. Last Week's Incident SLO Impact (10 min)
- Budget percentage consumed by each incident
- Identify recurring patterns
3. Release Plan Review (10 min)
- Risk assessment for this week's planned releases
- Determine release strategy based on budget status
(canary ratio, rollback criteria, etc.)
4. Action Items (5 min)
- Previous week's action items completion status
- Assign new action items
Monthly SLO Tuning Review (1 hour)
Attendees: Engineering VP, SRE Team, Service Owners
1. SLO Target Appropriateness Review
- Are SLO targets appropriate compared to actual SLI trends over the last 3 months?
- Too generous SLO: potential for unnecessary resource waste
- Too tight SLO: innovation speed decline
2. eBPF Instrumentation Coverage Check
- Are new services being auto-instrumented?
- Need for OBI version updates
- Need for new protocol support (MQTT, AMQP, etc.)
3. Cost Review
- Observability data storage cost trends
- Need for sampling ratio adjustments
- Retention period policy review
4. Alert Quality Review
- Last month's alert firing count
- False positive rate
- False negative cases
Troubleshooting
1. eBPF Program Load Failure
Error: failed to load BPF program: operation not permitted
Cause: Container lacks CAP_BPF or CAP_SYS_ADMIN capabilities
Solution:
# Add required capabilities to securityContext
securityContext:
privileged: true
# Or with minimum privileges:
capabilities:
add:
- BPF
- SYS_ADMIN
- NET_ADMIN
- PERFMON
2. Service Name Shows as "python3" in eBPF Metrics
Cause: OBI extracts service names from process names, but Python services expose the interpreter name
Solution:
# Method 1: Configure the mapping via OBI environment variables
env:
  - name: OTEL_EBPF_SERVICE_NAME_MAP
    value: 'python3.11:/usr/local/bin/gunicorn=order-service'

# Method 2: Map in the Collector's transform processor
processors:
  transform/service_name:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.name"], "order-service") where attributes["k8s.deployment.name"] == "order-service"
3. eBPF Traces and SDK Traces Appear Separately
Symptom: Same request but eBPF-generated traces and SDK-generated traces have different trace_ids
Cause: OBI fails to read the existing traceparent from the HTTP headers, or the SDK starts a trace before OBI does, so two separate contexts are created
Solution:
# Configure OBI to respect existing context
env:
  - name: OTEL_EBPF_CONTEXT_PROPAGATION
    value: 'true'
  - name: OTEL_EBPF_CONTEXT_PROPAGATION_MODE
    value: 'reuse'  # Reuse existing context if present; create a new one if not
4. SLI Recording Rule Returns NaN
Cause: The denominator (total requests) is 0 in that time window, typically during low-traffic overnight hours or right after a new service is deployed.
Solution:
# Prevent NaN: fall back to 1 when the denominator is 0 (no requests -> no errors)
# Note: vector(1) carries no service_name label, so the fallback series is label-less;
# downstream alert rules must tolerate (or re-label) it.
sli:http_availability:ratio_rate5m = (
  sum(rate(http_server_request_duration_seconds_count{
    http_response_status_code!~"5.."
  }[5m])) by (service_name)
  /
  (sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name) > 0)
) or vector(1)
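The same guard can be reasoned about outside PromQL: with zero traffic the naive ratio good/total is undefined, and substituting 1 (no requests implies no errors) also keeps burn-rate math well defined. A small sketch of that arithmetic, with function names of my own choosing:

```python
def availability_sli(good: float, total: float) -> float:
    """Mirror of the recording rule: good/total, falling back to 1.0 when total is 0."""
    if total == 0:
        return 1.0   # no requests -> treated as 100% available
    return good / total

def burn_rate(sli: float, slo_target: float) -> float:
    """How fast the error budget is burning: 1.0 means spending exactly on budget."""
    allowed_error = 1.0 - slo_target   # e.g. 0.001 for a 99.9% SLO
    return (1.0 - sli) / allowed_error

# With zero traffic the fallback keeps burn rate computable (and zero),
# instead of propagating NaN into every multi-window burn-rate alert.
print(burn_rate(availability_sli(0, 0), 0.999))
```

As the answer to Q6 below notes, this fallback should be paired with a separate "service unresponsive" alert, since a long stretch of zero traffic is itself a symptom.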
5. eBPF Overhead Higher Than Expected
Symptom: OBI DaemonSet CPU usage exceeds 500m
Diagnosis and Solution:
# 1. Check which probes are consuming the most CPU
kubectl exec -n observability obi-xxx -- /obi debug perf-stats
# 2. Disable unnecessary protocol detection
# Remove unused protocols from OTEL_EBPF_PROTOCOLS
# Example: remove REDIS if not using Redis
# 3. Exclude high-traffic pods
env:
  - name: OTEL_EBPF_EXCLUDE_NAMESPACES
    value: "kube-system,monitoring"
  - name: OTEL_EBPF_EXCLUDE_PODS
    value: "load-generator-*"  # Exclude load-test pods
2026 Roadmap-Based Preparation
According to OBI's 2026 roadmap (opentelemetry.io/blog/2026/obi-goals), the following features are planned.
| Planned Feature | Current Status | Preparation Items |
|---|---|---|
| Stable 1.0 release | Alpha/Beta | Establish staging test plan before production |
| .NET instrumentation support | Early testing | Identify .NET service inventory, evaluate SDK replacement potential |
| Messaging systems (MQTT, AMQP, NATS) | In development | Document current instrumentation methods for message queue-based services |
| gRPC full context propagation | Improving | Verify trace connectivity status between gRPC services |
| Cloud SDK instrumentation (AWS, GCP, Azure) | Planned | Evaluate cloud API call observability requirements |
Quiz
Q1. Why can't eBPF auto-instrumentation completely replace SDK instrumentation?
Answer: eBPF captures L7 protocols (HTTP, gRPC, SQL) at the kernel level, but cannot generate
application business logic-level custom spans or business metrics (e.g., order amounts,
recommendation scores). A hybrid strategy using both eBPF and SDK is necessary for core business
services.
Q2. Why does the OBI DaemonSet require privileged permissions?
Answer: Loading and executing eBPF programs in the kernel requires high-level capabilities
like CAP_BPF and CAP_SYS_ADMIN. This is because eBPF probes need to attach to kernel functions,
inspect network packets, and trace system calls of other processes.
Q3. What are the advantages of automatically extracting SLIs from eBPF metrics?
Answer: Availability and latency SLIs for all services in the cluster can be batch-collected
without code changes. When a new service is deployed, eBPF automatically generates metrics, so
when integrated with a service catalog, SLO alerts can be auto-provisioned as well.
Q4. When do eBPF traces and SDK traces end up with different trace_ids?
Answer: When OBI fails to read the traceparent in the HTTP header, or when SDK has already
started a trace and OBI creates a separate trace. Setting OTEL_EBPF_CONTEXT_PROPAGATION_MODE to
"reuse" solves this problem by reusing existing context when available.
Q5. What are the prerequisites for service catalog-based auto SLO provisioning?
Answer: eBPF auto-instrumentation must be deployed cluster-wide, service names must be
standardized (e.g., k8s deployment name-based mapping), and each service's tier and SLO definition
must be registered in the service catalog. When these three conditions are met, Prometheus
recording rules and alert rules can be auto-generated.
Q6. Why should the SLI recording rule return 1 instead of NaN when the denominator is 0?
Answer: When NaN occurs during periods with no traffic, burn rate alerts cannot be calculated properly. Since there are no errors when there are no requests, treating availability as 1 (100%) is reasonable. However, if traffic is 0 for an extended period, a separate "service unresponsive" alert should be configured.
Q7. What is the difference in focus between weekly and monthly reviews in the observability
operating model?
Answer: The weekly review is a tactical meeting focused on current error budget status and this week's release plans. The monthly review is a strategic meeting that examines the appropriateness of SLO targets themselves, eBPF coverage, cost trends, and alert quality.