Observability: OTel eBPF SLO Operating Model 2026

- The Inflection Point of the 2026 Observability Stack
- What eBPF Auto-Instrumentation Changes
- eBPF-Based SLI Auto-Collection
- Unified Operating Model: OTel + eBPF + SLO
- Automated SLO Governance
- Operations Cycle: Weekly/Monthly Routines
- Troubleshooting
- Preparing for the 2026 Roadmap
- Quiz
- References
The Inflection Point of the 2026 Observability Stack
In 2025-2026, three technologies are converging in the observability space, changing the operating model itself.
**OpenTelemetry (OTel)** has become the de facto observability standard as a CNCF graduated project. It unifies metrics, logs, and traces under a single SDK and protocol (OTLP), so backends can be swapped without vendor lock-in.
**eBPF-based auto-instrumentation (OBI: OpenTelemetry eBPF Instrumentation)** gained momentum in May 2025, when Grafana Labs donated the Beyla project to OpenTelemetry. It automatically captures HTTP, gRPC, and SQL calls at the kernel level without code changes, and targets a stable 1.0 release in 2026 (opentelemetry.io/blog/2026/obi-goals).
**SLO (Service Level Objective)** has evolved beyond a dashboard number into an operational framework that drives release decisions and on-call priorities through error budget policies.
This article designs a 2026 operating model that combines all three.
What eBPF Auto-Instrumentation Changes
eBPF Instrumentation vs. Traditional SDK Instrumentation
| Item | Traditional OTel SDK instrumentation | eBPF auto-instrumentation (OBI) |
|---|---|---|
| Code changes | Required (add SDK, instrumentation code) | Not required (kernel-level auto-capture) |
| Supported languages | Java, Python, Go, .NET, JS, etc. | Language-agnostic (binary level) |
| Overhead | 1-3% CPU | Under 1% CPU (runs in kernel space) |
| Capture depth | Down to business logic | L7 protocol level (HTTP, gRPC, SQL) |
| Context propagation | Fully supported | Supported for HTTP/gRPC; limited for some protocols |
| Custom attributes | Freely addable | Limited (only what the protocol exposes) |
| Deployment | Bundled with the application | DaemonSet or sidecar |
| Operational burden | Applied per service | Applied cluster-wide at once |
eBPF Auto-Instrumentation Deployment (Kubernetes)
```yaml
# otel-ebpf-instrumentation.yaml
# Deploy OBI (OpenTelemetry eBPF Instrumentation) as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-ebpf-instrumentation
  namespace: observability
spec:
  selector:
    matchLabels:
      app: obi
  template:
    metadata:
      labels:
        app: obi
    spec:
      hostPID: true        # Required to attach eBPF probes to host processes
      hostNetwork: false
      serviceAccountName: obi
      containers:
        - name: obi
          image: ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation:v0.9.0
          securityContext:
            privileged: true   # Required to load eBPF programs
            runAsUser: 0
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: 'http://otel-collector:4318'
            - name: OTEL_SERVICE_NAME
              value: 'auto-detected'  # Auto-extracted from the process name
            - name: OTEL_EBPF_TRACK_REQUEST_HEADERS
              value: 'true'
            # Namespace filter for monitoring targets
            - name: OTEL_EBPF_KUBE_NAMESPACE
              value: 'production,staging'
            # Protocols to capture
            - name: OTEL_EBPF_PROTOCOLS
              value: 'HTTP,GRPC,SQL,REDIS'
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
              readOnly: true
            - name: bpf-maps
              mountPath: /sys/fs/bpf
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
        - name: bpf-maps
          hostPath:
            path: /sys/fs/bpf
```
Hybrid Strategy: eBPF and SDK
Not every service can be instrumented with eBPF alone. eBPF automatically provides L7 protocol-level metrics and traces, but it cannot produce business-logic-level custom spans or metrics.
```
[Service Instrumentation Strategy Matrix]

                    Need for business-logic observability
                         Low            High
                   ┌─────────────┬─────────────┐
  High importance  │ eBPF only   │ eBPF + SDK  │
                   │ (infra      │ (core       │
                   │  services)  │  business)  │
                   ├─────────────┼─────────────┤
  Low importance   │ eBPF only   │ SDK only    │
                   │ (legacy,    │ (data       │
                   │  3rd party) │  pipelines) │
                   └─────────────┴─────────────┘
```
A concrete example:
```python
# Hybrid strategy: eBPF auto-captures L7 traffic,
# while the SDK focuses on business logic only.
from opentelemetry import trace

tracer = trace.get_tracer("recommendation-engine", "2.1.0")

async def get_recommendations(user_id: str, context: dict):
    # What eBPF captures automatically:
    # - HTTP request/response metrics (latency, status code)
    # - gRPC call metrics
    # - SQL query execution time
    # - Redis command execution time
    # What the SDK adds: business-logic-level detail.
    with tracer.start_as_current_span(
        "generate_recommendations",
        attributes={
            "user.segment": context.get("segment", "unknown"),
            "model.version": "v3.2",
            "candidate.count": 1000,
        },
    ) as span:
        # Model inference span
        with tracer.start_as_current_span("ml_inference") as ml_span:
            scores = await ml_model.predict(user_id, context)
            ml_span.set_attribute("inference.latency_ms", scores.latency_ms)
            ml_span.set_attribute("inference.model_name", "rec-v3.2-prod")
        # Filtering span
        with tracer.start_as_current_span("business_filter") as filter_span:
            filtered = apply_business_rules(scores.items, context)
            filter_span.set_attribute("filter.input_count", len(scores.items))
            filter_span.set_attribute("filter.output_count", len(filtered))
            # Span attribute values must be primitives (or lists of them),
            # so record removal reasons as individual counters, not a dict.
            filter_span.set_attribute("filter.removed.out_of_stock", 12)
            filter_span.set_attribute("filter.removed.age_restricted", 3)
            filter_span.set_attribute("filter.removed.region_blocked", 1)
        span.set_attribute("result.count", len(filtered))
        return filtered
```
eBPF-Based SLI Auto-Collection
SLIs can be extracted directly from the data eBPF captures, making it possible to collect SLIs for every service in the cluster without code changes.
Metrics Auto-Generated by OBI
```
# Key metrics generated by OBI (Prometheus format)
# HTTP server request duration
http_server_request_duration_seconds_bucket{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service",
  le="0.005"
} 1234
# HTTP server request count
http_server_request_duration_seconds_count{
  http_request_method="GET",
  http_response_status_code="200",
  url_path="/api/v1/orders",
  service_name="order-service",
} 5678
# gRPC server request duration
rpc_server_duration_seconds_bucket{
  rpc_method="GetUser",
  rpc_service="user.UserService",
  rpc_grpc_status_code="OK",
  service_name="user-service",
  le="0.1"
} 9012
# SQL query duration
db_client_operation_duration_seconds_bucket{
  db_system="postgresql",
  db_operation="SELECT",
  service_name="order-service",
  le="0.05"
} 3456
```
Recording Rules That Extract SLIs from eBPF Metrics
```yaml
# prometheus_rules/ebpf_sli_rules.yaml
groups:
  - name: ebpf_sli_from_obi
    interval: 30s
    rules:
      # Availability SLI: share of HTTP responses that are not 5xx
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
      # Latency SLI: share of responses within 300ms
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{
            le="0.3"
          }[5m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
      # gRPC availability SLI
      - record: sli:grpc_availability:ratio_rate5m
        expr: |
          sum(rate(rpc_server_duration_seconds_count{
            rpc_grpc_status_code="OK"
          }[5m])) by (service_name)
          /
          sum(rate(rpc_server_duration_seconds_count[5m])) by (service_name)
      # DB query latency SLI: share within 50ms
      - record: sli:db_latency:ratio_rate5m
        expr: |
          sum(rate(db_client_operation_duration_seconds_bucket{
            le="0.05"
          }[5m])) by (service_name, db_system)
          /
          sum(rate(db_client_operation_duration_seconds_count[5m])) by (service_name, db_system)
```
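The multi-window burn-rate alerts generated later in this article also query 30m, 1h, and 6h variants of the availability SLI, which the rules above do not define. A sketch of the missing windows (same expression, longer ranges; the group name and interval are assumptions):

```yaml
# prometheus_rules/ebpf_sli_windows.yaml (sketch -- windows chosen to
# match the multi-window burn-rate alerts; adjust to your alert policy)
groups:
  - name: ebpf_sli_windows
    interval: 60s
    rules:
      - record: sli:http_availability:ratio_rate30m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[30m])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[30m])) by (service_name)
      - record: sli:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[1h])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[1h])) by (service_name)
      - record: sli:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            http_response_status_code!~"5.."
          }[6h])) by (service_name)
          /
          sum(rate(http_server_request_duration_seconds_count[6h])) by (service_name)
```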
Unified Operating Model: OTel + eBPF + SLO
Combining the three technologies, the operating model works as follows.
Data Flow
```
[Application Pods]
 │
 ├── eBPF (OBI DaemonSet)
 │    └── Auto-capture: HTTP/gRPC/SQL metrics + traces
 │
 ├── OTel SDK (optional)
 │    └── Manual instrumentation: business spans + custom metrics
 │
 └── Both streams sent to the OTel Collector
      │
      ▼
[OTel Collector Gateway]
 ├── Attribute standardization (semantic conventions)
 ├── Merge of eBPF and SDK data
 ├── Tail-based sampling
 └── Per-backend routing
      │
 ┌────┴────┐
 ▼         ▼
[Metrics DB]   [Traces DB]
 (Mimir)        (Tempo)
 │              │
 ▼              ▼
[Prometheus Recording Rules]
 ├── SLI calculation (from eBPF metrics)
 ├── Error budget calculation
 └── Burn rate alerts
      │
      ▼
[Error Budget Policy Engine]
 ├── Budget >= 50%: normal releases
 ├── Budget 20-50%: canary required
 ├── Budget < 20%: release freeze
 └── CI/CD pipeline integration
```
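The policy engine at the bottom of the flow reduces to a small decision function. A minimal sketch (the thresholds mirror the diagram; how the budget figure is fetched, e.g. from a Prometheus query, is left out as an assumption):

```python
# Error-budget release gate, as sketched in the flow above.
# Input: fraction of the error budget still remaining (0.0-1.0).

def release_decision(budget_remaining: float) -> str:
    """Map remaining error budget to a release policy decision."""
    if budget_remaining >= 0.5:
        return "normal"           # Budget >= 50%: normal releases
    if budget_remaining >= 0.2:
        return "canary_required"  # Budget 20-50%: canary required
    return "freeze"               # Budget < 20%: release freeze

if __name__ == "__main__":
    for budget in (0.9, 0.35, 0.1):
        print(f"budget={budget:.0%} -> {release_decision(budget)}")
```

In a CI/CD pipeline this would typically run as a pre-deploy step that fails the job (or forces a canary strategy) based on the returned decision.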
Unified Collector Configuration
```yaml
# otel-collector-unified.yaml
# Unified gateway that receives both eBPF and SDK data
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  # Standardize service names produced by eBPF.
  # OBI extracts service names from process names, which may differ
  # from the service names configured in the SDK.
  transform/service_name:
    trace_statements:
      - context: resource
        statements:
          # Map OBI's auto-detected names to standard names
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
          - replace_pattern(attributes["service.name"], "^java$", "unknown-java-service")
    metric_statements:
      - context: resource
        statements:
          - replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
  # eBPF traces and SDK traces join under the same trace_id:
  # OBI reads traceparent from HTTP headers, so eBPF spans
  # automatically attach to traces started by the SDK.
  batch:
    send_batch_size: 2048
    timeout: 10s
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/service_name, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [transform/service_name, batch]
      exporters: [prometheusremotewrite]
```
Automated SLO Governance
Service Catalog-Based SLO Auto-Provisioning
When a new service is deployed, eBPF starts collecting its metrics automatically, and alerts are generated from the SLO policy registered in the service catalog.
```python
# slo_provisioner.py
# Read SLO definitions from the service catalog and generate
# Prometheus alert rules.
import yaml
from pathlib import Path

TIER_DEFAULTS = {
    "tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
    "tier2": {"availability": 0.999, "latency_ms": 500, "latency_target": 0.95},
    "tier3": {"availability": 0.995, "latency_ms": 2000, "latency_target": 0.90},
}

def generate_slo_alerts(service_catalog_path: str, output_dir: str):
    """Read SLO definitions from the catalog and auto-generate alert rules."""
    catalog = yaml.safe_load(Path(service_catalog_path).read_text())
    for service in catalog["services"]:
        name = service["name"]
        tier = service["tier"]
        slo = service.get("slo", {})  # A service may rely entirely on tier defaults
        # Apply tier defaults for any missing fields
        availability_target = slo.get("availability", TIER_DEFAULTS[tier]["availability"])
        latency_threshold_ms = slo.get("latency_threshold_ms", TIER_DEFAULTS[tier]["latency_ms"])
        latency_target = slo.get("latency_target", TIER_DEFAULTS[tier]["latency_target"])
        # Generate Prometheus alert rules
        alert_rules = generate_burn_rate_alerts(
            service_name=name,
            availability_target=availability_target,
            latency_threshold_ms=latency_threshold_ms,
            latency_target=latency_target,
        )
        output_file = Path(output_dir) / f"slo-{name}.yaml"
        output_file.write_text(yaml.dump(alert_rules, default_flow_style=False))
        print(f"Generated SLO alerts for {name}: {output_file}")

def generate_burn_rate_alerts(
    service_name: str,
    availability_target: float,
    latency_threshold_ms: int,
    latency_target: float,
) -> dict:
    """Generate multi-window burn-rate alert rules."""
    error_budget = 1.0 - availability_target
    # latency_threshold_ms / latency_target would drive analogous
    # latency burn-rate rules; this excerpt covers availability only.
    return {
        "groups": [{
            "name": f"slo-{service_name}",
            "rules": [
                # Critical: burn rate 14, 1h/5m windows
                {
                    "alert": f"SLO_{service_name}_BurnRate_Critical",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate1h{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate5m{{service_name="{service_name}"}}\n'
                        f') > {14 * error_budget}'
                    ),
                    "for": "1m",
                    "labels": {
                        "severity": "critical",
                        "service": service_name,
                        "burn_rate": "14",
                    },
                    "annotations": {
                        "summary": f"{service_name}: SLO critical burn rate (14x)",
                        "runbook": f"https://wiki.internal/runbook/slo/{service_name}",
                    },
                },
                # Warning: burn rate 6, 6h/30m windows
                {
                    "alert": f"SLO_{service_name}_BurnRate_Warning",
                    "expr": (
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate6h{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}\n'
                        f'and\n'
                        f'(\n'
                        f'  1 - sli:http_availability:ratio_rate30m{{service_name="{service_name}"}}\n'
                        f') > {6 * error_budget}'
                    ),
                    "for": "5m",
                    "labels": {
                        "severity": "warning",
                        "service": service_name,
                        "burn_rate": "6",
                    },
                },
            ],
        }],
    }
```
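To make the generated thresholds concrete, here is the arithmetic the generator above embeds into the alert expressions (a sketch; the numbers follow directly from the tier defaults, nothing here is OBI-specific):

```python
# Worked example: error-rate thresholds implied by a burn-rate alert.
# A burn rate of N means the service is consuming its error budget
# N times faster than the SLO allows.

def burn_rate_threshold(availability_target: float, burn_rate: float) -> float:
    """Error rate above which the given burn rate is exceeded."""
    return burn_rate * (1.0 - availability_target)

# tier1 (99.95% availability, i.e. a 0.05% error budget):
#   critical (14x) fires above a 0.7% error rate,
#   warning   (6x) fires above a 0.3% error rate.
print(f"{burn_rate_threshold(0.9995, 14):.4f}")
print(f"{burn_rate_threshold(0.9995, 6):.4f}")
```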
Service Catalog Example
```yaml
# service_catalog.yaml
services:
  - name: payment-api
    tier: tier1
    team: payment
    slo:
      availability: 0.9995
      latency_threshold_ms: 200
      latency_target: 0.99
  - name: recommendation-engine
    tier: tier2
    team: ml-platform
    slo:
      availability: 0.999
      latency_threshold_ms: 500
  - name: notification-service
    tier: tier3
    team: platform
    # Uses tier3 defaults
  - name: internal-admin
    tier: tier3
    team: platform
    slo:
      availability: 0.99
      latency_threshold_ms: 3000
```
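Since the provisioner trusts catalog entries, a CI-time sanity check on the parsed catalog is cheap insurance. A minimal sketch (field names follow the catalog example above; the specific checks are local-policy assumptions, not part of any OTel/OBI tooling):

```python
# Minimal catalog validator sketch. Field names ("name", "tier", "slo",
# "availability") match the catalog example; the checks themselves are
# assumptions about local policy.

VALID_TIERS = {"tier1", "tier2", "tier3"}

def validate_catalog(catalog: dict) -> list[str]:
    """Return human-readable problems for a parsed catalog; empty means valid."""
    problems = []
    for svc in catalog.get("services", []):
        name = svc.get("name")
        if not name:
            problems.append("service entry without a name")
            continue
        if svc.get("tier") not in VALID_TIERS:
            problems.append(f"{name}: unknown tier {svc.get('tier')!r}")
        availability = svc.get("slo", {}).get("availability")
        if availability is not None and not 0.0 < availability < 1.0:
            problems.append(f"{name}: availability must be in (0, 1)")
    return problems

if __name__ == "__main__":
    catalog = {"services": [
        {"name": "payment-api", "tier": "tier1", "slo": {"availability": 0.9995}},
        {"name": "broken-svc", "tier": "gold", "slo": {"availability": 1.5}},
    ]}
    for problem in validate_catalog(catalog):
        print(problem)
```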
Operations Cycle: Weekly/Monthly Routines
Weekly SLO Review (30 minutes)
Attendees: SRE lead, service owners, product manager
1. Error budget status (5 min)
   - Review the dashboard of remaining budget across services
   - Identify services in yellow/red status
2. SLO impact of last week's incidents (10 min)
   - Budget consumed by each incident
   - Check for recurring patterns
3. Release plan review (10 min)
   - Risk assessment for this week's planned releases
   - Decide release strategy based on budget status
     (canary ratio, rollback criteria, etc.)
4. Action items (5 min)
   - Completion status of last week's action items
   - Assign new action items
Monthly SLO Tuning Review (1 hour)
Attendees: engineering VP, SRE team, service owners
1. SLO target appropriateness
   - Are targets appropriate given the last 3 months of actual SLI trends?
   - Too loose an SLO wastes resources; too tight an SLO slows innovation.
2. eBPF instrumentation coverage
   - Are new services being auto-instrumented?
   - Is an OBI version update needed?
   - Is support for new protocols needed (MQTT, AMQP, etc.)?
3. Cost review
   - Observability storage cost trends
   - Whether sampling ratios need adjustment
   - Retention policy check
4. Alert quality review
   - Alert firing counts for the past month
   - False positive rate
   - False negative cases
Troubleshooting
1. eBPF Program Load Failure
```
Error: failed to load BPF program: operation not permitted
```
Cause: the container lacks the CAP_BPF or CAP_SYS_ADMIN capability.
Solution:
```yaml
# Grant the required privileges in securityContext
securityContext:
  privileged: true
```
```yaml
# Or, with least privilege instead of privileged mode:
securityContext:
  capabilities:
    add:
      - BPF
      - SYS_ADMIN
      - NET_ADMIN
      - PERFMON
```
2. Service Name Appears as "python3" in eBPF Metrics
Cause: OBI derives the service name from the process name, and Python services expose the interpreter name.
Solution:
```yaml
# Method 1: configure the mapping via an OBI environment variable
env:
  - name: OTEL_EBPF_SERVICE_NAME_MAP
    value: 'python3.11:/usr/local/bin/gunicorn=order-service'
```
```yaml
# Method 2: map it in the Collector's transform processor
processors:
  transform/service_name:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.name"], "order-service") where attributes["k8s.deployment.name"] == "order-service"
```
3. eBPF and SDK Traces Appear Separately
Symptom: the same request produces an eBPF trace and an SDK trace with different trace_ids.
Cause: OBI fails to read the existing traceparent from the HTTP headers, or the SDK starts its trace before OBI and the contexts are duplicated.
Solution:
```yaml
# Configure OBI to respect existing context
env:
  - name: OTEL_EBPF_CONTEXT_PROPAGATION
    value: 'true'
  - name: OTEL_EBPF_CONTEXT_PROPAGATION_MODE
    value: 'reuse'  # Reuse an existing context if present; otherwise create one
```
4. SLI Recording Rule Returns NaN
Cause: the denominator (total requests) is zero for the window, e.g. during overnight traffic lulls or for newly deployed services.
Solution:
```
# Prevent NaN: if the denominator is 0, return 1 (no traffic, no errors)
sli:http_availability:ratio_rate5m = (
  sum(rate(http_server_request_duration_seconds_count{
    http_response_status_code!~"5.."
  }[5m])) by (service_name)
  /
  (sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name) > 0)
) or vector(1)
```
5. eBPF Overhead Higher Than Expected
Symptom: OBI DaemonSet CPU usage exceeds 500m.
Diagnosis and solutions:
```shell
# 1. Check which probes consume the most CPU
kubectl exec -n observability obi-xxx -- /obi debug perf-stats
```
```yaml
# 2. Disable detection of unused protocols:
#    remove them from OTEL_EBPF_PROTOCOLS
#    (e.g. drop REDIS if Redis is not in use)
# 3. Exclude noisy namespaces and pods
env:
  - name: OTEL_EBPF_EXCLUDE_NAMESPACES
    value: "kube-system,monitoring"
  - name: OTEL_EBPF_EXCLUDE_PODS
    value: "load-generator-*"  # Exclude load-test pods
```
Preparing for the 2026 Roadmap
According to OBI's 2026 roadmap (opentelemetry.io/blog/2026/obi-goals), the following features are planned.
| Planned feature | Current status | Preparation |
|---|---|---|
| Stable 1.0 release | Alpha/Beta | Plan staging tests before any production rollout |
| .NET instrumentation | Early testing | Inventory .NET services; assess SDK replacement potential |
| Messaging systems (MQTT, AMQP, NATS) | In development | Document how queue-based services are instrumented today |
| gRPC full context propagation | Being improved | Verify trace linkage across gRPC services |
| Cloud SDK instrumentation (AWS, GCP, Azure) | Planned | Assess the need to observe cloud API calls |
Quiz
Q1. Why can't eBPF auto-instrumentation fully replace SDK instrumentation?
Answer: ||eBPF captures L7 protocols (HTTP, gRPC, SQL) at the kernel level, but it cannot generate custom spans or business metrics at the application's business-logic level (e.g. order amounts, recommendation scores). Core business services need a hybrid strategy that combines eBPF and the SDK.||
Q2. Why does the OBI DaemonSet need privileged permissions?
Answer: ||Loading and running eBPF programs in the kernel requires elevated capabilities such as CAP_BPF and CAP_SYS_ADMIN, because OBI attaches eBPF probes to kernel functions, inspects network packets, and traces the system calls of other processes.||
Q3. What is the advantage of extracting SLIs automatically from eBPF metrics?
Answer: ||Availability and latency SLIs can be collected for every service in the cluster without code changes. Because eBPF starts generating metrics as soon as a new service is deployed, combining this with a service catalog enables automatic provisioning of SLO alerts as well.||
Q4. When do eBPF traces and SDK traces end up with different trace_ids?
Answer: ||When OBI fails to read the traceparent HTTP header, or when the SDK has already started a trace and OBI creates a separate one. Setting OTEL_EBPF_CONTEXT_PROPAGATION_MODE to "reuse" makes OBI reuse an existing context when one is present, resolving the issue.||
Q5. What are the prerequisites for service catalog-based automatic SLO provisioning?
Answer: ||eBPF auto-instrumentation must be deployed cluster-wide, service names must be standardized (e.g. mapped from the k8s deployment name), and each service's tier and SLO definition must be registered in the service catalog. With these three in place, Prometheus recording rules and alert rules can be generated automatically.||
Q6. Why should an SLI recording rule return 1 instead of NaN when the denominator is 0?
Answer: ||NaN during zero-traffic periods breaks burn-rate alert calculations. With no requests there are no errors, so treating availability as 1 (100%) is reasonable. However, a separate "service unresponsive" alert should cover prolonged zero-traffic periods.||
Q7. How do the weekly and monthly reviews differ in focus?
Answer: ||The weekly review is a tactical meeting focused on the current error budget and this week's release plan. The monthly review is a strategic meeting that examines the appropriateness of the SLO targets themselves, eBPF coverage, cost trends, and alert quality.||
References
Observability: OTel eBPF SLO Operating Model 2026

- The Inflection Point of the 2026 Observability Stack
- What eBPF Auto-Instrumentation Changes
- eBPF-Based SLI Auto-Collection
- Unified Operating Model: OTel + eBPF + SLO
- Automated SLO Governance
- Operations Cycle: Weekly/Monthly Routines
- Troubleshooting
- 2026 Roadmap-Based Preparation
- Quiz
- References
The Inflection Point of the 2026 Observability Stack
In 2025-2026, three technologies are converging in the observability space, fundamentally changing the operating model itself.
OpenTelemetry (OTel) has now become the de facto observability standard as a CNCF graduated project. It unifies Metrics, Logs, and Traces through a single SDK and protocol (OTLP), enabling backend replacement without vendor lock-in.
eBPF-based auto-instrumentation (OBI: OpenTelemetry eBPF Instrumentation) gained momentum in May 2025 when Grafana Labs donated the Beyla project to OTel. It automatically captures HTTP, gRPC, and SQL calls at the kernel level without code changes. In 2026, it targets a stable 1.0 release (opentelemetry.io/blog/2026/obi-goals).
SLO (Service Level Objective) has evolved beyond simple dashboard numbers into an operational framework that controls release decisions and on-call priorities through error budget policies.
This article designs a 2026 operating model that combines these three technologies.
What eBPF Auto-Instrumentation Changes
eBPF Instrumentation vs Traditional SDK Instrumentation Comparison
| Item | Traditional OTel SDK Instrumentation | eBPF Auto-Instrumentation (OBI) |
|---|---|---|
| Code changes | Required (SDK addition, instrumentation code) | Not required (kernel-level auto-capture) |
| Supported languages | Java, Python, Go, .NET, JS, etc. | Language-agnostic (binary level) |
| Overhead | 1-3% CPU | Under 1% CPU (kernel space execution) |
| Capture depth | Detailed down to business logic | L7 protocol level (HTTP, gRPC, SQL) |
| Context propagation | Fully supported | Supported for HTTP/gRPC, limited for some protocols |
| Custom attributes | Freely addable | Limited (only what can be extracted from protocols) |
| Deployment method | Together with application | DaemonSet or sidecar |
| Operational burden | Per-service individual application | Cluster-wide batch application |
eBPF Auto-Instrumentation Deployment (Kubernetes)
# otel-ebpf-instrumentation.yaml
# Deploy OBI (OpenTelemetry eBPF Instrumentation) as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-ebpf-instrumentation
namespace: observability
spec:
selector:
matchLabels:
app: obi
template:
metadata:
labels:
app: obi
spec:
hostPID: true # Required for eBPF probe attachment
hostNetwork: false
serviceAccountName: obi
containers:
- name: obi
image: ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation:v0.9.0
securityContext:
privileged: true # Required for loading eBPF programs
runAsUser: 0
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: 'http://otel-collector:4318'
- name: OTEL_SERVICE_NAME
value: 'auto-detected' # Auto-extracted from process name
- name: OTEL_EBPF_TRACK_REQUEST_HEADERS
value: 'true'
# Namespace filter for monitoring targets
- name: OTEL_EBPF_KUBE_NAMESPACE
value: 'production,staging'
# Protocols to capture
- name: OTEL_EBPF_PROTOCOLS
value: 'HTTP,GRPC,SQL,REDIS'
volumeMounts:
- name: sys-kernel
mountPath: /sys/kernel
readOnly: true
- name: bpf-maps
mountPath: /sys/fs/bpf
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 128Mi
volumes:
- name: sys-kernel
hostPath:
path: /sys/kernel
- name: bpf-maps
hostPath:
path: /sys/fs/bpf
Hybrid Strategy: eBPF and SDK
Not all services can be instrumented with eBPF alone. eBPF automatically provides L7 protocol-level metrics and traces, but cannot provide business logic-level custom spans or metrics.
[Service Instrumentation Strategy Matrix]
Business Logic Observability Need
Low High
+--------------+--------------+
High | eBPF only | eBPF + SDK |
importance | (infra | (core |
| services) | business) |
+--------------+--------------+
Low | eBPF only | SDK only |
importance | (legacy, | (data |
| 3rd party) | pipelines) |
+--------------+--------------+
Specific Application Examples:
# Hybrid strategy: eBPF auto-captures L7 traffic,
# SDK focuses only on business logic
from opentelemetry import trace
tracer = trace.get_tracer("recommendation-engine", "2.1.0")
async def get_recommendations(user_id: str, context: dict):
# What eBPF auto-captures:
# - HTTP request/response metrics (latency, status code)
# - gRPC call metrics
# - SQL query execution time
# - Redis command execution time
# What SDK adds: detailed business logic-level information
with tracer.start_as_current_span(
"generate_recommendations",
attributes={
"user.segment": context.get("segment", "unknown"),
"model.version": "v3.2",
"candidate.count": 1000,
}
) as span:
# Model inference span
with tracer.start_as_current_span("ml_inference") as ml_span:
scores = await ml_model.predict(user_id, context)
ml_span.set_attribute("inference.latency_ms", scores.latency_ms)
ml_span.set_attribute("inference.model_name", "rec-v3.2-prod")
# Filtering span
with tracer.start_as_current_span("business_filter") as filter_span:
filtered = apply_business_rules(scores.items, context)
filter_span.set_attribute("filter.input_count", len(scores.items))
filter_span.set_attribute("filter.output_count", len(filtered))
filter_span.set_attribute("filter.removed_reasons", {
"out_of_stock": 12,
"age_restricted": 3,
"region_blocked": 1,
})
span.set_attribute("result.count", len(filtered))
return filtered
eBPF-Based SLI Auto-Collection
SLIs can be extracted from data that eBPF automatically captures. It becomes possible to batch-collect SLIs for all services without code changes.
Metrics Auto-Generated by OBI
# Key metrics generated by OBI (Prometheus format)
# HTTP server request duration
http_server_request_duration_seconds_bucket{
http_request_method="GET",
http_response_status_code="200",
url_path="/api/v1/orders",
service_name="order-service",
le="0.005"
} 1234
# HTTP server request count
http_server_request_duration_seconds_count{
http_request_method="GET",
http_response_status_code="200",
url_path="/api/v1/orders",
service_name="order-service",
} 5678
# gRPC server request duration
rpc_server_duration_seconds_bucket{
rpc_method="GetUser",
rpc_service="user.UserService",
rpc_grpc_status_code="OK",
service_name="user-service",
le="0.1"
} 9012
# SQL query duration
db_client_operation_duration_seconds_bucket{
db_system="postgresql",
db_operation="SELECT",
service_name="order-service",
le="0.05"
} 3456
Recording Rules for Extracting SLIs from eBPF Metrics
# prometheus_rules/ebpf_sli_rules.yaml
groups:
- name: ebpf_sli_from_obi
interval: 30s
rules:
# Availability SLI: ratio of HTTP responses excluding 5xx
- record: sli:http_availability:ratio_rate5m
expr: |
sum(rate(http_server_request_duration_seconds_count{
http_response_status_code!~"5.."
}[5m])) by (service_name)
/
sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
# Latency SLI: ratio of responses within 300ms
- record: sli:http_latency:ratio_rate5m
expr: |
sum(rate(http_server_request_duration_seconds_bucket{
le="0.3"
}[5m])) by (service_name)
/
sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
# gRPC Availability SLI
- record: sli:grpc_availability:ratio_rate5m
expr: |
sum(rate(rpc_server_duration_seconds_count{
rpc_grpc_status_code="OK"
}[5m])) by (service_name)
/
sum(rate(rpc_server_duration_seconds_count[5m])) by (service_name)
# DB Query Latency SLI: ratio within 50ms
- record: sli:db_latency:ratio_rate5m
expr: |
sum(rate(db_client_operation_duration_seconds_bucket{
le="0.05"
}[5m])) by (service_name, db_system)
/
sum(rate(db_client_operation_duration_seconds_count[5m])) by (service_name, db_system)
Unified Operating Model: OTel + eBPF + SLO
The operating model that combines these three technologies works through the following flow.
Data Flow
[Application Pods]
|
+-- eBPF (OBI DaemonSet)
| +-- Auto-capture: HTTP/gRPC/SQL metrics + traces
|
+-- OTel SDK (optional)
| +-- Manual instrumentation: business spans + custom metrics
|
+-- Both data sent to OTel Collector
|
v
[OTel Collector Gateway]
+-- Attribute standardization (semantic conventions)
+-- Merge eBPF data and SDK data
+-- Tail-based sampling
+-- Per-backend routing
|
+----+----+
v v
[Metrics DB] [Traces DB]
(Mimir) (Tempo)
| |
v v
[Prometheus Recording Rules]
+-- SLI calculation (from eBPF metrics)
+-- Error budget calculation
+-- Burn rate alerts
|
v
[Error Budget Policy Engine]
+-- Budget >= 50%: Normal releases
+-- Budget 20-50%: Canary required
+-- Budget < 20%: Release freeze
+-- CI/CD pipeline integration
Unified Collector Configuration
# otel-collector-unified.yaml
# Unified Gateway receiving both eBPF and SDK data
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
# Standardize service names generated by eBPF
# OBI extracts service names from process names,
# which may differ from service names set in SDK
transform/service_name:
trace_statements:
- context: resource
statements:
# Map auto-detected OBI names to standard names
- replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
- replace_pattern(attributes["service.name"], "^java$", "unknown-java-service")
metric_statements:
- context: resource
statements:
- replace_pattern(attributes["service.name"], "^python3?$", "unknown-python-service")
# Connect eBPF traces and SDK traces with the same trace_id
# Since OBI reads traceparent from HTTP headers,
# eBPF spans automatically join traces created by SDK
batch:
send_batch_size: 2048
timeout: 10s
tail_sampling:
decision_wait: 15s
num_traces: 200000
policies:
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-requests
type: latency
latency:
threshold_ms: 500
- name: sample-normal
type: probabilistic
probabilistic:
sampling_percentage: 5
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheusremotewrite:
endpoint: http://mimir:9009/api/v1/push
resource_to_telemetry_conversion:
enabled: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [transform/service_name, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [transform/service_name, batch]
exporters: [prometheusremotewrite]
Automated SLO Governance
Service Catalog-Based Auto SLO Provisioning
When a new service is deployed, eBPF automatically collects metrics, and alerts are auto-generated according to SLO policies registered in the service catalog.
# slo_provisioner.py
# Read SLO definitions from service catalog and generate Prometheus alert rules
import yaml
from pathlib import Path
def generate_slo_alerts(service_catalog_path: str, output_dir: str):
"""Read SLO definitions from service catalog and auto-generate Prometheus alert rules"""
catalog = yaml.safe_load(open(service_catalog_path))
for service in catalog["services"]:
name = service["name"]
tier = service["tier"]
slo = service["slo"]
# Apply defaults based on tier
availability_target = slo.get("availability", TIER_DEFAULTS[tier]["availability"])
latency_threshold_ms = slo.get("latency_threshold_ms", TIER_DEFAULTS[tier]["latency_ms"])
latency_target = slo.get("latency_target", TIER_DEFAULTS[tier]["latency_target"])
# Generate Prometheus alert rules
alert_rules = generate_burn_rate_alerts(
service_name=name,
availability_target=availability_target,
latency_threshold_ms=latency_threshold_ms,
latency_target=latency_target,
)
output_file = Path(output_dir) / f"slo-{name}.yaml"
yaml.dump(alert_rules, open(output_file, "w"), default_flow_style=False)
print(f"Generated SLO alerts for {name}: {output_file}")
TIER_DEFAULTS = {
"tier1": {"availability": 0.9995, "latency_ms": 200, "latency_target": 0.99},
"tier2": {"availability": 0.999, "latency_ms": 500, "latency_target": 0.95},
"tier3": {"availability": 0.995, "latency_ms": 2000, "latency_target": 0.90},
}
def generate_burn_rate_alerts(
service_name: str,
availability_target: float,
latency_threshold_ms: int,
latency_target: float,
) -> dict:
"""Generate multi-window burn rate alert rules"""
error_budget = 1.0 - availability_target
latency_threshold_sec = latency_threshold_ms / 1000.0
return {
"groups": [{
"name": f"slo-{service_name}",
"rules": [
# Critical: burn rate 14, 1h/5m window
{
"alert": f"SLO_{service_name}_BurnRate_Critical",
"expr": (
f'(\n'
f' 1 - sli:http_availability:ratio_rate1h{{service_name="{service_name}"}}\n'
f') > {14 * error_budget}\n'
f'and\n'
f'(\n'
f' 1 - sli:http_availability:ratio_rate5m{{service_name="{service_name}"}}\n'
f') > {14 * error_budget}'
),
"for": "1m",
"labels": {
"severity": "critical",
"service": service_name,
"burn_rate": "14",
},
"annotations": {
"summary": f"{service_name}: SLO critical burn rate (14x)",
"runbook": f"https://wiki.internal/runbook/slo/{service_name}",
},
},
# Warning: burn rate 6, 6h/30m window
{
"alert": f"SLO_{service_name}_BurnRate_Warning",
"expr": (
f'(\n'
f' 1 - sli:http_availability:ratio_rate6h{{service_name="{service_name}"}}\n'
f') > {6 * error_budget}\n'
f'and\n'
f'(\n'
f' 1 - sli:http_availability:ratio_rate30m{{service_name="{service_name}"}}\n'
f') > {6 * error_budget}'
),
"for": "5m",
"labels": {
"severity": "warning",
"service": service_name,
"burn_rate": "6",
},
},
],
}],
}
Service Catalog Example
# service_catalog.yaml
services:
- name: payment-api
tier: tier1
team: payment
slo:
availability: 0.9995
latency_threshold_ms: 200
latency_target: 0.99
- name: recommendation-engine
tier: tier2
team: ml-platform
slo:
availability: 0.999
latency_threshold_ms: 500
- name: notification-service
tier: tier3
team: platform
# Uses tier3 defaults
- name: internal-admin
tier: tier3
team: platform
slo:
availability: 0.99
latency_threshold_ms: 3000
Operations Cycle: Weekly/Monthly Routines
Weekly SLO Review (30 minutes)
Attendees: SRE Lead, Service Owners, Product Manager
1. Error Budget Status Check (5 min)
- Review overall service budget remaining dashboard
- Identify Yellow/Red status services
2. Last Week's Incident SLO Impact (10 min)
- Budget percentage consumed by each incident
- Identify recurring patterns
3. Release Plan Review (10 min)
- Risk assessment for this week's planned releases
- Determine release strategy based on budget status
(canary ratio, rollback criteria, etc.)
4. Action Items (5 min)
- Previous week's action items completion status
- Assign new action items
Monthly SLO Tuning Review (1 hour)
Attendees: Engineering VP, SRE Team, Service Owners
1. SLO Target Appropriateness Review
- Are SLO targets appropriate compared to actual SLI trends over the last 3 months?
- Too generous SLO: potential for unnecessary resource waste
- Too tight SLO: innovation speed decline
2. eBPF Instrumentation Coverage Check
- Are new services being auto-instrumented?
- Need for OBI version updates
- Need for new protocol support (MQTT, AMQP, etc.)
3. Cost Review
- Observability data storage cost trends
- Need for sampling ratio adjustments
- Retention period policy review
4. Alert Quality Review
- Last month's alert firing count
- False positive rate
- False negative cases
Troubleshooting
1. eBPF Program Load Failure
Error: failed to load BPF program: operation not permitted
Cause: Container lacks CAP_BPF or CAP_SYS_ADMIN capabilities
Solution:
# Add required capabilities to securityContext
securityContext:
privileged: true
# Or with minimum privileges:
capabilities:
add:
- BPF
- SYS_ADMIN
- NET_ADMIN
- PERFMON
2. Service Name Shows as "python3" in eBPF Metrics
Cause: OBI extracts service names from process names, but Python services expose the interpreter name
Solution:
# Method 1: Configure the mapping via OBI environment variables
env:
  - name: OTEL_EBPF_SERVICE_NAME_MAP
    value: 'python3.11:/usr/local/bin/gunicorn=order-service'

# Method 2: Map in the Collector's transform processor
processors:
  transform/service_name:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.name"], "order-service") where attributes["k8s.deployment.name"] == "order-service"
3. eBPF Traces and SDK Traces Appear Separately
Symptom: Same request but eBPF-generated traces and SDK-generated traces have different trace_ids
Cause: OBI fails to read the existing traceparent from the HTTP headers, or the SDK starts a trace before OBI does, so two separate contexts are created
Solution:
# Configure OBI to respect existing context
env:
  - name: OTEL_EBPF_CONTEXT_PROPAGATION
    value: 'true'
  - name: OTEL_EBPF_CONTEXT_PROPAGATION_MODE
    value: 'reuse'  # Reuse existing context if present; create a new one if not
4. SLI Recording Rule Returns NaN
Cause: The denominator (total requests) is 0 in that time window, typically during low-traffic overnight hours or right after a new service is deployed.
Solution:
# Prevent NaN: fall back to 1 when the denominator is 0 (no requests -> no errors)
# Note: vector(1) carries no service_name label, so the fallback series is label-less;
# downstream alert rules must tolerate (or re-label) it.
sli:http_availability:ratio_rate5m = (
  sum(rate(http_server_request_duration_seconds_count{
    http_response_status_code!~"5.."
  }[5m])) by (service_name)
  /
  (sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name) > 0)
) or vector(1)
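The same guard can be reasoned about outside PromQL: with zero traffic the naive ratio good/total is undefined, and substituting 1 (no requests implies no errors) also keeps burn-rate math well defined. A small sketch of that arithmetic, with function names of my own choosing:

```python
def availability_sli(good: float, total: float) -> float:
    """Mirror of the recording rule: good/total, falling back to 1.0 when total is 0."""
    if total == 0:
        return 1.0   # no requests -> treated as 100% available
    return good / total

def burn_rate(sli: float, slo_target: float) -> float:
    """How fast the error budget is burning: 1.0 means spending exactly on budget."""
    allowed_error = 1.0 - slo_target   # e.g. 0.001 for a 99.9% SLO
    return (1.0 - sli) / allowed_error

# With zero traffic the fallback keeps burn rate computable (and zero),
# instead of propagating NaN into every multi-window burn-rate alert.
print(burn_rate(availability_sli(0, 0), 0.999))
```

As the answer to Q6 below notes, this fallback should be paired with a separate "service unresponsive" alert, since a long stretch of zero traffic is itself a symptom.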
5. eBPF Overhead Higher Than Expected
Symptom: OBI DaemonSet CPU usage exceeds 500m
Diagnosis and Solution:
# 1. Check which probes are consuming the most CPU
kubectl exec -n observability obi-xxx -- /obi debug perf-stats
# 2. Disable unnecessary protocol detection
# Remove unused protocols from OTEL_EBPF_PROTOCOLS
# Example: remove REDIS if not using Redis
# 3. Exclude high-traffic pods
env:
  - name: OTEL_EBPF_EXCLUDE_NAMESPACES
    value: "kube-system,monitoring"
  - name: OTEL_EBPF_EXCLUDE_PODS
    value: "load-generator-*"  # Exclude load-test pods
2026 Roadmap-Based Preparation
According to OBI's 2026 roadmap (opentelemetry.io/blog/2026/obi-goals), the following features are planned.
| Planned Feature | Current Status | Preparation Items |
|---|---|---|
| Stable 1.0 release | Alpha/Beta | Establish staging test plan before production |
| .NET instrumentation support | Early testing | Identify .NET service inventory, evaluate SDK replacement potential |
| Messaging systems (MQTT, AMQP, NATS) | In development | Document current instrumentation methods for message queue-based services |
| gRPC full context propagation | Improving | Verify trace connectivity status between gRPC services |
| Cloud SDK instrumentation (AWS, GCP, Azure) | Planned | Evaluate cloud API call observability requirements |
Quiz
Q1. Why can't eBPF auto-instrumentation completely replace SDK instrumentation?
Answer: eBPF captures L7 protocols (HTTP, gRPC, SQL) at the kernel level, but cannot generate
application business logic-level custom spans or business metrics (e.g., order amounts,
recommendation scores). A hybrid strategy using both eBPF and SDK is necessary for core business
services.
Q2. Why does the OBI DaemonSet require privileged permissions?
Answer: Loading and executing eBPF programs in the kernel requires high-level capabilities
like CAP_BPF and CAP_SYS_ADMIN. This is because eBPF probes need to attach to kernel functions,
inspect network packets, and trace system calls of other processes.
Q3. What are the advantages of automatically extracting SLIs from eBPF metrics?
Answer: Availability and latency SLIs for all services in the cluster can be batch-collected
without code changes. When a new service is deployed, eBPF automatically generates metrics, so
when integrated with a service catalog, SLO alerts can be auto-provisioned as well.
Q4. When do eBPF traces and SDK traces end up with different trace_ids?
Answer: When OBI fails to read the traceparent in the HTTP header, or when SDK has already
started a trace and OBI creates a separate trace. Setting OTEL_EBPF_CONTEXT_PROPAGATION_MODE to
"reuse" solves this problem by reusing existing context when available.
Q5. What are the prerequisites for service catalog-based auto SLO provisioning?
Answer: eBPF auto-instrumentation must be deployed cluster-wide, service names must be
standardized (e.g., k8s deployment name-based mapping), and each service's tier and SLO definition
must be registered in the service catalog. When these three conditions are met, Prometheus
recording rules and alert rules can be auto-generated.
Q6. Why should the SLI recording rule return 1 instead of NaN when the denominator is 0?
Answer: When NaN occurs during periods with no traffic, burn rate alerts cannot be calculated properly. Since there are no errors when there are no requests, treating availability as 1 (100%) is reasonable. However, if traffic is 0 for an extended period, a separate "service unresponsive" alert should be configured.
Q7. What is the difference in focus between weekly and monthly reviews in the observability
operating model?
Answer: The weekly review is a tactical meeting focused on current error budget status and this week's release plans. The monthly review is a strategic meeting that examines the appropriateness of SLO targets themselves, eBPF coverage, cost trends, and alert quality.