Observability Complete Guide — Metric, Log, Trace, OpenTelemetry, eBPF, SLO (Season 2 Ep 9, 2025)

Intro — Observability vs Monitoring

Monitoring: detecting known problems. "Alert if CPU > 80%." Observability: investigating unknown problems. "Why did P99 suddenly triple?"

The nine black boxes

In 2025, every operator needs to:

Trace request flow across distributed systems
Find CPU and memory bottlenecks via profiles
Diagnose at the kernel level with eBPF
Pace releases with SLO and Error Budget
Control Log and Metric cost without losing signal

This post covers four layers: fundamentals, practice, cost, organization.

Part 1 — Three Pillars + Profile = Four Pillars

1.1 Metric

Numeric time series — easy to aggregate, retain long term, and alert on.

Example: http_requests_total{status="200",method="GET"}
Tools: Prometheus, VictoriaMetrics, Mimir, Cortex
Pros: storage efficient, fast dashboards
Cons: cardinality explosion risk (many labels = huge cost)

1.2 Log

Event stream — detailed but large.

Structured logging (JSON) is mandatory
Tools: Loki, Elasticsearch, OpenSearch, Quickwit, ClickHouse
Pros: detailed, flexible
Cons: expensive, slow search

1.3 Trace

Causal chain of a request — core for distributed debugging.

Concepts: Span (unit of work) + Context Propagation
Tools: Jaeger, Tempo, Zipkin, Honeycomb
Pros: see which service is the bottleneck
Cons: sampling strategy is critical (100% storage is unrealistic)

1.4 Profile (fourth pillar, 2023+)

Continuous in-process CPU and memory profiles.

Tools: Pyroscope (Grafana), Parca, Polar Signals
Pros: answer "which function eats CPU?" in real time
Cons: overhead and storage management

1.5 Four-pillar correlation scenario

Alert: P99 latency spike
  ↓
[Metric] which service? → checkout-service
  ↓
[Trace] which span? → payment-gateway call, 3s
  ↓
[Log] logs for that trace_id → timeout error
  ↓
[Profile] CPU during that window → 90% in TLS handshake
  ↓
Root cause: expiring cert causes handshake surge

Correlation is the point. A shared trace_id (carried in log fields and in metric exemplars) is what stitches the four pillars together.

Part 2 — OpenTelemetry: the observation standard

2.1 What OTEL solved

Before: every vendor shipped its own SDK (Prometheus, Datadog, New Relic...). Switching vendor meant rewriting instrumentation.

OTEL: instrument once, export anywhere.

2.2 OTEL components

Application
  ↓ (OTEL SDK)
OTLP Protocol
  ↓
[OTEL Collector]
  ├─ Receivers (OTLP, Prometheus, Jaeger...)
  ├─ Processors (batch, filter, sample, enrich)
  └─ Exporters (Tempo, Loki, Prometheus, Datadog, ...)
  ↓
Backend (where you want)

2.3 Signals — three, plus profile

Traces: Stable
Metrics: Stable
Logs: Stable (2024)
Profiles: pre-stable, still under active development (2024–2025)

2.4 Context Propagation

When a request flows A → B → C, the same trace_id must propagate:

HTTP Headers:
  traceparent: 00-TRACEID-SPANID-01
  tracestate: ...

W3C Trace Context standard. OTEL handles it automatically.
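
To make the header layout concrete, here is a minimal sketch of parsing a version-00 traceparent value and building the header for an outgoing call. The helper names (parse_traceparent, child_traceparent) are hypothetical; in a real service the OTEL SDK propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    m = _TRACEPARENT_RE.match(header.strip())
    if m is None:
        return None
    # Version ff and all-zero ids are invalid in the W3C specification.
    if m["version"] == "ff" or set(m["trace_id"]) == {"0"} or set(m["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(m["flags"], 16) & 0x01)
    return TraceParent(m["version"], m["trace_id"], m["parent_id"], sampled)


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace_id, fresh span id, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Service B would call parse_traceparent on the incoming header and send child_traceparent(...) to service C, so all three services log the same trace_id.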

2.5 Auto-Instrumentation

Java: -javaagent:opentelemetry-javaagent.jar (bytecode manipulation)
Python: opentelemetry-instrument python app.py
Node.js: NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register"
Go: explicit SDK instrumentation (a compiled binary has no runtime agent hook)
Rust: tracing crate

2.6 OTEL evolution (2024–2025)

Profile signal: added to the specification, not yet stable
Gen AI Semantic Conventions: LLM calls standardized
Exponential Histograms: high-resolution metrics
OTEL Collector Contrib: hundreds of receivers and exporters

Part 3 — Prometheus and Grafana Stack

3.1 Grafana Stack (2025 OSS default)

Component	Role
Prometheus / Mimir	Metric
Loki	Log
Tempo	Trace
Pyroscope	Profile
Grafana	Dashboards
Alertmanager	Alerts
Beyla	eBPF auto-instrumentation

3.2 Prometheus metric types

Counter: monotonic (request count)
Gauge: current value (memory use)
Histogram: bucketed (latency distribution)
Summary: pre-computed quantiles

3.3 PromQL basics

# per-second request rate, averaged over the last minute
rate(http_requests_total[1m])

# P99 per service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
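
It helps to see why a histogram-based P99 is an estimate rather than an exact value. The sketch below reimplements the core idea of histogram_quantile in plain Python: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration under the assumption of classic cumulative buckets, not Prometheus's actual code, and it ignores edge cases such as negative bucket bounds.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The largest bucket must be the +Inf bucket, which holds the total count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Beyond the last finite bucket we only know a lower limit.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: (le, cumulative count)
example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

The practical takeaway: accuracy depends on bucket boundaries. If your SLO threshold is 500ms, make sure 0.5 is one of the bucket bounds.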

3.4 Cardinality management

Problem: {user_id, path, status, method} with 1M users × 10 paths × 5 statuses × 4 methods = 200M series. Prometheus collapses.

Fix:

Ban high-cardinality labels (user_id, trace_id)
Aggregate first, store later
Scale with VictoriaMetrics or Mimir (more efficient than vanilla Prometheus)
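
The arithmetic above is easy to check before you add a label. A minimal sketch, assuming every label combination actually occurs (the worst case) and using the same hypothetical label counts as the example:

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


labels = {"user_id": 1_000_000, "path": 10, "status": 5, "method": 4}
with_user_id = worst_case_series(labels)
without_user_id = worst_case_series(
    {name: n for name, n in labels.items() if name != "user_id"}
)
print(f"with user_id: {with_user_id:,} series, without: {without_user_id:,} series")
```

Dropping the single user_id label takes the worst case from 200 million series to 200, which is why per-user detail belongs in logs or traces instead.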

Part 4 — Log management: cost vs quality

4.1 Why log cost explodes

Debug logs left on in production
Plain text instead of JSON
Full-text engines (Elastic) are expensive
Over-long retention

4.2 Log level strategy

Level	Use	Retention
ERROR	alert-worthy	30–90 days
WARN	caution	30 days
INFO	key events	7–30 days
DEBUG	dev only	drop or 1 day
TRACE	deep debug	drop

4.3 Structured logging (mandatory)

# bad
logger.info(f"User {user_id} logged in from {ip}")

# good
logger.info("user_login", extra={
    "user_id": user_id,
    "ip": ip,
    "trace_id": trace_id,
})

Search, filter, and aggregate all become possible.
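
One caveat with the "good" example: Python's standard logging module drops extra fields from the output unless a formatter writes them. Below is a minimal sketch of a JSON formatter using only the standard library. The names JsonFormatter and build_logger are hypothetical, and the output field names (ts, level, event) are an assumption, not a standard.

```python
import io
import json
import logging

# Attributes every LogRecord has; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Copy the caller-supplied fields (user_id, trace_id, ...) into the JSON.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)


def build_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write one event into an in-memory buffer and read it back as JSON.
buf = io.StringIO()
log = build_logger("structured_demo", buf)
log.info("user_login", extra={"user_id": 42, "ip": "203.0.113.7", "trace_id": "4bf92f35"})
record = json.loads(buf.getvalue())
```

Each line is now one JSON object, so the log backend can filter on user_id or join on trace_id without regex parsing.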

4.4 Log solutions (2025)

Tool	Model	Notes
Loki	low-cost, minimal index	Grafana Stack default
Elasticsearch	full-text	expensive
OpenSearch	Elastic fork	AWS-friendly
Quickwit	Rust, log-specialized	new contender
ClickHouse	columnar DB	strong on analytics
Vector.dev (Datadog)	ingest pipeline	routing and filtering

Part 5 — Distributed Tracing in practice

5.1 Four sampling strategies

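Head-based (probabilistic): decide at request start whether to keep the trace, at a fixed N%. Simple, and every span from root to leaf shares the same decision.
Rate-limited: keep at most X traces per second
Tail-based: buffer the whole trace, then keep errors and slow requests plus a small baseline. Best cost/signal balance
Dynamic: raise the sampling rate automatically when an anomaly is detected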
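
Head-based sampling stays consistent from root to leaf when the decision is a pure function of the trace_id, because every service then reaches the same answer without coordination. A minimal sketch of that idea follows. The function name head_sample is hypothetical, and using the low 64 bits of the id is an assumption made for illustration; real SDK samplers may differ in detail.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace if its id falls in the lowest `ratio` share of the id space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace id as a uniform random value.
    value = int(trace_id_hex[-16:], 16)
    return value < int(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Every service that sees this trace_id computes the same decision.
decision_in_service_a = head_sample(trace_id, 0.10)
decision_in_service_b = head_sample(trace_id, 0.10)
```

The weakness is visible in the signature: the function never sees latency or status, so it cannot favor the traces you will actually want later.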

5.2 Tail-based Sampling (the 2024–2025 default)

# OTEL Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: random
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

This keeps 100% of error traces and slow traces, plus 1% of everything else.
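
The decision the three policies encode can be sketched in a few lines. This is a toy model of the logic, not the Collector's implementation: the Span class and keep_trace function are hypothetical, and the longest span stands in for the trace duration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans: List[Span], threshold_ms: float = 1000.0,
               baseline_pct: float = 1.0,
               rng: Optional[random.Random] = None) -> str:
    """Return the name of the first matching policy, or "" to drop the trace.

    Assumes `spans` holds every span of one finished trace (non-empty).
    """
    if any(span.is_error for span in spans):
        return "errors"
    # Approximation: treat the longest span as the trace duration.
    if max(span.duration_ms for span in spans) >= threshold_ms:
        return "slow"
    roll = (rng or random).random() * 100.0
    if roll < baseline_pct:
        return "random"
    return ""
```

The cost of this approach is memory: the collector has to hold all spans of a trace until the trace is complete before it can decide.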

5.3 Span naming and attributes

Span name: service.operation (e.g. db.query, http.get)
Attributes: follow the OpenTelemetry Semantic Conventions
Events: key stages (cache.hit, retry, backoff)
Errors: span.setStatus(ERROR) + exception event

5.4 Debugging example

Gateway (20ms)
├─ auth-service (5ms)
├─ user-service (50ms)
│  └─ postgres.query (45ms)  ← slow here
└─ product-service (10ms)

The 45ms query span carries the attribute db.statement = "SELECT * FROM users WHERE ...", which tells you exactly which query to check for a missing index.

Part 6 — eBPF: observation without code changes

6.1 What is eBPF

Safe bytecode injected into the Linux kernel to trace network, syscalls, and function calls.

Pros:

Observe without touching app code
Minimal overhead (kernel-resident)
Network and system level

6.2 eBPF observability tools (2025)

Pixie (New Relic): Kubernetes-native, auto-telemetry
Cilium Hubble: network flow
Parca: always-on CPU profiling
Grafana Beyla: auto-instrumentation
Inspektor Gadget: Kubernetes toolkit
Odigos: eBPF-based auto-instrumentation SaaS

6.3 eBPF's real value

"Language-agnostic auto-instrument" — Java, Go, Python, Rust, or Node, any HTTP/gRPC/DB call gets traced automatically.

Cons: requires a recent Linux kernel (5.x or later), Windows support is still in progress, and eBPF programs remain hard to debug.

Part 7 — SLO, SLI, Error Budget

7.1 Definitions (Google SRE)

SLI (Indicator): a measurement, e.g. "the share of requests answered within 200ms"
SLO (Objective): a target for that measurement, e.g. "99.9% of requests within 200ms over 30 days"
SLA (Agreement): a customer promise with consequences, e.g. "refund below 99.5%"

7.2 Error Budget

100% - SLO = Error Budget.

SLO 99.9% → budget 0.1% = ~43 min downtime per month
Budget left → ship
Budget burned → freeze, stabilize
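
The 43-minute figure comes straight from the window length. A minimal sketch, assuming a 30-day window and treating the budget as pure downtime:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a given SLO allows within the window."""
    budget_fraction = 1.0 - slo_percent / 100.0
    return window_days * 24 * 60 * budget_fraction


for slo in (99.0, 99.9, 99.99):
    print(f"SLO {slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% is an organizational decision and not just a number on a dashboard.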

7.3 Good SLI criteria

User-centric: what users feel, not internal metrics
Simple: easy to compute and understand
Predictable: 100% in normal operation
Tamper-proof: cannot be raised by redefinition

Examples:

Bad: "server CPU <= 70%"
Good: "99% of user requests return 200 in 500ms"
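
The "good" SLI above is just a ratio of good events to all events. A minimal sketch, assuming each request is available as a (status, latency_ms) pair; the function name and the sample data are hypothetical:

```python
from typing import Iterable, Tuple


def latency_sli(requests: Iterable[Tuple[int, float]],
                threshold_ms: float = 500.0) -> float:
    """Fraction of requests that returned HTTP 200 within threshold_ms."""
    total = 0
    good = 0
    for status, latency_ms in requests:
        total += 1
        if status == 200 and latency_ms <= threshold_ms:
            good += 1
    # With no traffic there is nothing to violate, so report 1.0.
    return good / total if total else 1.0


sample = [(200, 120.0), (200, 480.0), (200, 900.0), (500, 50.0)]
sli = latency_sli(sample)  # two of the four requests are good
```

Note that the fast 500 response counts as bad: a user-centric SLI measures successful answers, not just quick ones.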

7.4 Burn Rate

How fast the Error Budget is burning.

Burn Rate = current error rate / allowed error rate

14.4x burn for 1 hour = 2% of a 30-day budget
6x burn for 6 hours = 5% of a 30-day budget

Multi-window alerts: short-window high burn plus long-window moderate burn — cuts false positives.
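
The budget share consumed follows from burn rate × window / SLO period. A minimal sketch, assuming a 30-day (720-hour) SLO period; with that period, a 14.4x burn sustained for 1 hour consumes 2% of the budget and a 6x burn for 6 hours consumes 5%. The function names are hypothetical.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's budget spent at `rate` for `window_hours`."""
    return rate * window_hours / period_hours


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float) -> bool:
    """Multi-window check: both windows must exceed the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, which cuts alerts on already-resolved spikes.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


fast = budget_consumed(14.4, window_hours=1)  # share of budget gone in 1 hour
slow = budget_consumed(6.0, window_hours=6)   # share of budget gone in 6 hours
```

For example, with a 99.9% SLO an error rate of 1.44% gives burn_rate(0.0144, 0.999) of about 14.4, enough to page if the short window agrees.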

7.5 Error Budget Policy

100–50% left: normal release cadence
50–10%: more release testing, prepare feature freeze
< 10%: feature freeze, stability sprint
Burned: postmortem, process review

Part 8 — Choosing an observability platform (2025)

8.1 Commercial vs OSS

Commercial	OSS Stack
Datadog, New Relic, Dynatrace, Splunk, Honeycomb	Grafana Stack, Signoz, Kibana
Fast ROI, support, advanced AI	Flexible, cheap, no lock-in
Cost can explode	Needs operators

8.2 Situational recommendations

Situation	Recommendation
Startup under 20 people	Datadog or Signoz SaaS
Growth-stage	Grafana Cloud or Honeycomb
Mid-size	Grafana Stack self-host + Prometheus
Enterprise	Hybrid (Datadog + OSS + in-house)
Cost-obsessed	ClickHouse + Grafana

8.3 Datadog cost traps

Custom Metrics: billed per series, roughly $5 per 100 custom metrics per month at list price. Cardinality blow-up = bomb
Log Ingestion: expensive per GB. Filter aggressively
APM Host: $31/host/month. Many microservices = explosion
Fix: pre-filter, sample, and aggregate at the OTEL Collector

8.4 Honeycomb's differentiation

High-cardinality first search: free-form by trace_id, user_id
BubbleUp: automatic anomaly grouping
Philosophy: "raw events first", not "metrics first"

Part 9 — Organizational observability

9.1 Observability-driven development

Think instrumentation from day one
"Observability impact" checkbox on every PR
New features ship with a draft SLO
Post-incident reviews ask "why didn't we observe this?"

9.2 Incident management link

MTTR down = observability maturity proxy
Runbooks: "this alert → this dashboard → this trace" checklist
Blameless postmortems: why didn't the system detect it?

9.3 Observability cost model

total cost = sum(signal size × retention × cost/GB)

levers:
1. Sample (trace 1%, log 10%)
2. Aggregate (rollups, not raw metrics)
3. Tiered retention (ERROR 90d, INFO 7d)
4. Pre-filter (OTEL Collector)
5. Cold/hot tier (archive to S3)

Target: observability cost stays at 5–15% of infra cost.
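
The cost formula can be turned into a quick what-if calculator. This is a storage-only sketch: the Signal class, the volumes, and the per-GB prices below are all hypothetical placeholders, so plug in your own numbers.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Signal:
    name: str
    gb_per_day: float
    retention_days: int
    usd_per_gb_month: float
    keep_ratio: float = 1.0  # share kept after sampling and pre-filtering


def monthly_cost(signal: Signal) -> float:
    """Steady-state stored volume times the monthly storage price."""
    stored_gb = signal.gb_per_day * signal.keep_ratio * signal.retention_days
    return stored_gb * signal.usd_per_gb_month


def total_monthly_cost(signals: Iterable[Signal]) -> float:
    return sum(monthly_cost(s) for s in signals)


# Hypothetical numbers: same log volume before and after applying the levers.
before = Signal("logs", gb_per_day=100, retention_days=30, usd_per_gb_month=0.02)
after = Signal("logs", gb_per_day=100, retention_days=7, usd_per_gb_month=0.02,
               keep_ratio=0.1)
savings = monthly_cost(before) - monthly_cost(after)
```

Sampling and retention multiply, so applying both levers together saves far more than either one alone.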

Part 10 — LLM observability (new in 2024–2025)

10.1 What to measure

Token Usage: prompt and completion tokens
Latency: TTFT, total, per-token
Cost: USD (model × tokens)
Quality: LLM-as-Judge score, user feedback
Tool Calls: success, failure, chain depth
Safety: moderation result, refuse rate
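
The latency and cost metrics above can be derived from five numbers per call. A minimal sketch follows; the LlmCall class is hypothetical and the per-1k-token prices are caller-supplied placeholders, not any provider's real rates.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LlmCall:
    prompt_tokens: int
    completion_tokens: int
    started_at: float      # seconds, from any monotonic clock
    first_token_at: float
    finished_at: float


def call_metrics(call: LlmCall, usd_per_1k_prompt: float,
                 usd_per_1k_completion: float) -> Dict[str, float]:
    """Derive TTFT, total latency, per-token latency, and cost for one call."""
    decode_time = call.finished_at - call.first_token_at
    return {
        "ttft_s": call.first_token_at - call.started_at,
        "total_s": call.finished_at - call.started_at,
        # Time between tokens after the first one has arrived.
        "per_token_s": decode_time / max(call.completion_tokens - 1, 1),
        "cost_usd": (call.prompt_tokens / 1000 * usd_per_1k_prompt
                     + call.completion_tokens / 1000 * usd_per_1k_completion),
    }


call = LlmCall(prompt_tokens=1000, completion_tokens=501,
               started_at=0.0, first_token_at=0.5, finished_at=5.5)
metrics = call_metrics(call, usd_per_1k_prompt=0.001, usd_per_1k_completion=0.002)
```

Attaching these values as span attributes keeps LLM calls in the same trace as the retrieval and tool-call spans around them.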

10.2 Tools

Langfuse: OSS, self-hostable
LangSmith: LangChain team
Helicone: proxy-based
Phoenix (Arize): strong OSS
OTEL Gen AI Semantic Conventions: standard integration

10.3 LLM trace structure

user_query (root span)
├─ retrieval (vector search)
│   ├─ embedding (token count)
│   └─ qdrant.search (query vector)
├─ llm.openai.chat (tokens, cost)
│   └─ tool_call: get_weather
│       └─ api.weather
└─ response_generation

Part 11 — Six-month observability roadmap

Month 1: foundations

Install Prometheus and Grafana
Understand the four metric types
PromQL basics

Month 2: logging

Structured logging standard
Run Loki or Elastic
Log-level policy

Month 3: tracing

Integrate OTEL SDK
Deploy Tempo or Jaeger
Tail-based sampling

Month 4: SLO

Define SLIs for key services
Compute Error Budgets
Burn Rate alerts

Month 5: eBPF and profile

Adopt Pyroscope
Cilium Hubble for K8s network
Beyla auto-instrumentation

Month 6: cost and org

Audit observability cost
LLM observability (Langfuse)
Post-incident review process

Part 12 — Observability checklist (12)

Know the strength of each of the 3 pillars + Profile
Can explain OpenTelemetry Collector architecture
Know the four Prometheus metric types
Know what cardinality explosion is and how to prevent it
Know the difference between Head and Tail-based sampling
Know how W3C Trace Context connects services
Know why eBPF is great for auto-instrumentation
Clearly distinguish SLI, SLO, SLA
Can compute Error Budget and Burn Rate
Know 5 ways to cut log cost
Know 3 Datadog cost traps
Know 5 LLM observability metrics

Part 13 — Ten anti-patterns

Unstructured logs: printf style. No search, no aggregation
Ignoring cardinality: user_id as a label, explosion
100% trace retention: blows cost. Tail sampling is mandatory
Log-to-metric abuse: deriving metrics by parsing logs is expensive. Emit Counters and Gauges directly
Alert fatigue: hundreds a day. Keep only what matters
No SLOs: no shared definition of "problem"
Dashboard jungle: 50 dashboards, 10 used
No trace_id plumbing: logs without trace_id = debugging hell
Uniform retention: ERROR and DEBUG kept the same time = waste
Observability as afterthought: "we'll add it post-launch" = never

Closing — Observability is the system's sense of self

Just as a human who feels no pain cannot care for their body, a system that cannot perceive its own state cannot be operated.

In 2025, observability comes down to:

Standards (OpenTelemetry for vendor independence)
Integration (stitch four pillars with trace_id)
Economics (control cost while keeping signal)
Organization (SLOs are team agreements)

Tools churn every year. The Grafana Stack moves from Prometheus to Mimir to Grafana Cloud, Elasticsearch gets forked into OpenSearch, Datadog embraces eBPF. But the principles hold.

Remember: "If you cannot observe it, you cannot operate it."

Next up — "Security Complete Guide: Zero Trust, Secrets, OAuth, OIDC, Supply Chain, AI Security"

Season 2 Episode 10 is the other operational must: security engineering. Next time:

Zero Trust architecture in practice
OAuth 2.1 / OIDC / PKCE
Secret management (Vault, AWS KMS, SOPS)
SBOM and supply chain defense
Container and K8s security (Pod Security, Admission)
OWASP Top 10 for LLM
Security-oriented observability

Cryptography is easy; security is hard. More in the next post.