- Published on
Observability Complete Guide — Metric, Log, Trace, OpenTelemetry, eBPF, SLO (Season 2 Ep 9, 2025)
- Authors
- Youngju Kim (@fjvbn20031)
Intro — Observability vs Monitoring
Monitoring: detecting known problems. "Alert if CPU > 80%." Observability: investigating unknown problems. "Why did P99 suddenly triple?"
The nine black boxes
In 2025, every operator needs to:
- Trace request flow across distributed systems
- Find CPU and memory bottlenecks via profiles
- Diagnose at the kernel level with eBPF
- Pace releases with SLO and Error Budget
- Control Log and Metric cost without losing signal
This post covers four layers: fundamentals, practice, cost, organization.
Part 1 — Three Pillars + Profile = Four Pillars
1.1 Metric
Numeric time series — easy to aggregate, retain long term, and alert on.
- Example: http_requests_total{status="200",method="GET"}
- Tools: Prometheus, VictoriaMetrics, Mimir, Cortex
- Pros: storage efficient, fast dashboards
- Cons: cardinality explosion risk (many labels = huge cost)
1.2 Log
Event stream — detailed but large.
- Structured logging (JSON) is mandatory
- Tools: Loki, Elasticsearch, OpenSearch, Quickwit, ClickHouse
- Pros: detailed, flexible
- Cons: expensive, slow search
1.3 Trace
Causal chain of a request — core for distributed debugging.
- Concepts: Span (unit of work) + Context Propagation
- Tools: Jaeger, Tempo, Zipkin, Honeycomb
- Pros: see which service is the bottleneck
- Cons: sampling strategy is critical (100% storage is unrealistic)
1.4 Profile (fourth pillar, 2023+)
Continuous in-process CPU and memory profiles.
- Tools: Pyroscope (Grafana), Parca, Polar Signals
- Pros: answer "which function eats CPU?" in real time
- Cons: overhead and storage management
1.5 Four-pillar correlation scenario
Alert: P99 latency spike
↓
[Metric] which service? → checkout-service
↓
[Trace] which span? → payment-gateway call, 3s
↓
[Log] logs for that trace_id → timeout error
↓
[Profile] CPU during that window → 90% in TLS handshake
↓
Root cause: expiring cert causes handshake surge
Correlation is the point. All four pillars must be stitched by trace_id.
Part 2 — OpenTelemetry: the observation standard
2.1 What OTEL solved
Before: every vendor shipped its own SDK (Prometheus, Datadog, New Relic...). Switching vendor meant rewriting instrumentation.
OTEL: instrument once, export anywhere.
2.2 OTEL components
Application
↓ (OTEL SDK)
OTLP Protocol
↓
[OTEL Collector]
├─ Receivers (OTLP, Prometheus, Jaeger...)
├─ Processors (batch, filter, sample, enrich)
└─ Exporters (Tempo, Loki, Prometheus, Datadog, ...)
↓
Backend (where you want)
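A minimal sketch of what this wiring looks like with the Python SDK, assuming a Collector listening on OTLP/gRPC at localhost:4317; the service name and endpoint are illustrative, not prescriptive.
# Minimal OTEL SDK setup: spans go over OTLP to a local Collector (assumed address).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service; the Collector's processors/exporters decide where data lands.
resource = Resource.create({"service.name": "checkout-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("http.get /cart") as span:
    span.set_attribute("http.method", "GET")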
2.3 Signals — three, plus profile
- Traces: Stable
- Metrics: Stable
- Logs: Stable (2024)
- Profiles: Beta (2024–2025)
2.4 Context Propagation
When a request flows A → B → C, the same trace_id must propagate:
HTTP Headers:
traceparent: 00-TRACEID-SPANID-01
tracestate: ...
W3C Trace Context standard. OTEL handles it automatically.
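Roughly what a propagator does under the hood, as a Python sketch. The IDs here are random placeholders; in real code you let the OTEL propagator build and parse this header for you.
# W3C traceparent: version-traceid-spanid-flags (field widths from the spec).
import secrets

trace_id = secrets.token_hex(16)   # 32 hex chars, shared by every span in the trace
span_id = secrets.token_hex(8)     # 16 hex chars, unique per span
traceparent = f"00-{trace_id}-{span_id}-01"   # 01 = sampled flag

# Service A attaches it; service B reuses trace_id and mints a new span_id.
headers = {"traceparent": traceparent}
version, incoming_trace_id, parent_span_id, flags = headers["traceparent"].split("-")
assert incoming_trace_id == trace_id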
2.5 Auto-Instrumentation
- Java: -javaagent:opentelemetry-javaagent.jar (bytecode manipulation)
- Python: opentelemetry-instrument python app.py
- Node.js: NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register"
- Go: explicit SDK instrumentation (no reflection)
- Rust: tracing crate
2.6 OTEL evolution (2024–2025)
- Profile Signal promoted
- Gen AI Semantic Conventions: LLM calls standardized
- Exponential Histograms: high-resolution metrics
- OTEL Collector Contrib: hundreds of receivers and exporters
Part 3 — Prometheus and Grafana Stack
3.1 Grafana Stack (2025 OSS default)
| Component | Role |
|---|---|
| Prometheus / Mimir | Metric |
| Loki | Log |
| Tempo | Trace |
| Pyroscope | Profile |
| Grafana | Dashboards |
| Alertmanager | Alerts |
| Beyla | eBPF auto-instrumentation |
3.2 Prometheus metric types
- Counter: monotonic (request count)
- Gauge: current value (memory use)
- Histogram: bucketed (latency distribution)
- Summary: pre-computed quantiles
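A quick sketch of the four types with the official prometheus_client library; the metric names, labels, and buckets are illustrative.
# The four Prometheus metric types from Python.
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["method", "status"])
IN_FLIGHT = Gauge("inprogress_requests", "Requests currently being handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5])
PAYLOAD = Summary("payload_bytes", "Payload size")  # count + sum; quantile support varies by client

start_http_server(8000)              # exposes /metrics for Prometheus to scrape
REQUESTS.labels(method="GET", status="200").inc()
with LATENCY.time():                 # observes elapsed time into the histogram buckets
    pass  # handle the request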
3.3 PromQL basics
# requests per second
rate(http_requests_total[1m])
# P99 per service
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
3.4 Cardinality management
Problem: {user_id, path, status, method} with 1M users × 10 paths × 5 statuses × 4 methods = 200M series. Prometheus collapses.
Fix:
- Ban high-cardinality labels (user_id, trace_id)
- Aggregate first, store later
- Scale with VictoriaMetrics or Mimir (more efficient than vanilla Prometheus)
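The arithmetic is worth spelling out, because every unique label combination becomes its own time series:
# Back-of-envelope series count for the label set above.
users, paths, statuses, methods = 1_000_000, 10, 5, 4
series = users * paths * statuses * methods
print(f"{series:,}")   # 200,000,000 — far beyond what a single Prometheus can hold
# Dropping user_id alone leaves paths * statuses * methods = 200 series.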
Part 4 — Log management: cost vs quality
4.1 Why log cost explodes
- Debug logs left on in production
- Plain text instead of JSON
- Full-text engines (Elastic) are expensive
- Over-long retention
4.2 Log level strategy
| Level | Use | Retention |
|---|---|---|
| ERROR | alert-worthy | 30–90 days |
| WARN | caution | 30 days |
| INFO | key events | 7–30 days |
| DEBUG | dev only | drop or 1 day |
| TRACE | deep debug | drop |
4.3 Structured logging (mandatory)
# bad
logger.info(f"User {user_id} logged in from {ip}")
# good
logger.info("user_login", extra={
"user_id": user_id,
"ip": ip,
"trace_id": trace_id,
})
Search, filter, and aggregate all become possible.
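One stdlib-only way to make the "good" example above actually emit JSON lines; the field names are illustrative, and most teams reach for a library such as python-json-logger instead.
# Minimal JSON formatter: structured fields passed via extra= end up as JSON keys.
import json, logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        for key in ("user_id", "ip", "trace_id"):   # fields we expect from extra=
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user_login", extra={"user_id": 42, "ip": "10.0.0.1", "trace_id": "abc123"})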
4.4 Log solutions (2025)
| Tool | Model | Notes |
|---|---|---|
| Loki | low-cost, minimal index | Grafana Stack default |
| Elasticsearch | full-text | expensive |
| OpenSearch | Elastic fork | AWS-friendly |
| Quickwit | Rust, log-specialized | new contender |
| ClickHouse | columnar DB | strong on analytics |
| Vector.dev (Datadog) | ingest pipeline | routing and filtering |
Part 5 — Distributed Tracing in practice
5.1 Four sampling strategies
- Head-based (probabilistic): decide at the request's entry point with a fixed N% probability. Simple, and the decision propagates so a trace is kept or dropped consistently from root to leaf.
- Rate-limited: X per second
- Tail-based: accept all, store only errors and slow. Best cost/signal balance
- Dynamic: auto-increase on anomaly
5.2 Tail-based Sampling (the 2024–2025 default)
# OTEL Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: random
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
100% of errors and slow, 1% of the rest.
5.3 Span naming and attributes
- Span name:
service.operation(e.g.db.query,http.get) - Attributes: follow W3C Semantic Conventions
- Events: key stages (cache.hit, retry, backoff)
- Errors:
span.setStatus(ERROR)+exceptionevent
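A minimal sketch of these rules with the OTEL Python API; the query text, attribute values, and service name are made up.
# Span name, semantic-convention attributes, an event, and error status.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("user-service")

with tracer.start_as_current_span("db.query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM users WHERE email = $1")
    span.add_event("cache.miss")                    # key stage recorded as a span event
    try:
        raise TimeoutError("query exceeded 1s")
    except TimeoutError as exc:
        span.record_exception(exc)                  # attaches an exception event
        span.set_status(Status(StatusCode.ERROR, str(exc)))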
5.4 Debugging example
Gateway (20ms)
├─ auth-service (5ms)
├─ user-service (50ms)
│ └─ postgres.query (45ms) ← slow here
└─ product-service (10ms)
The 45ms query span attribute db.statement = "SELECT * FROM users WHERE ..." points to a missing index.
Part 6 — eBPF: observation without code changes
6.1 What is eBPF
Sandboxed bytecode loaded into the Linux kernel (and checked by the verifier) to trace network traffic, syscalls, and function calls.
Pros:
- Observe without touching app code
- Minimal overhead (kernel-resident)
- Network and system level
6.2 eBPF observability tools (2025)
- Pixie (New Relic): Kubernetes-native, auto-telemetry
- Cilium Hubble: network flow
- Parca: always-on CPU profiling
- Grafana Beyla: auto-instrumentation
- Inspektor Gadget: Kubernetes toolkit
- Odigos: eBPF-based auto-instrumentation SaaS
6.3 eBPF's real value
"Language-agnostic auto-instrument" — Java, Go, Python, Rust, or Node, any HTTP/gRPC/DB call gets traced automatically.
Cons: kernel >= 5.x required, Windows still in progress, debugging is still hard.
Part 7 — SLO, SLI, Error Budget
7.1 Definitions (Google SRE)
- SLI (Indicator): the measurement — e.g. the proportion of requests served within 200ms
- SLO (Objective): the target for that SLI — "keep it at 99.9%"
- SLA (Agreement): the promise to customers — "service credits below 99.5%"
7.2 Error Budget
100% - SLO = Error Budget.
- SLO 99.9% → budget 0.1% = ~43 min downtime per month
- Budget left → ship
- Budget burned → freeze, stabilize
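The arithmetic behind "99.9% → ~43 minutes", as a tiny sketch:
# Error budget as allowed full downtime over a 30-day month.
slo = 0.999
minutes_per_month = 30 * 24 * 60            # 43,200 minutes
budget_minutes = (1 - slo) * minutes_per_month
print(round(budget_minutes, 1))             # 43.2 minutes per month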
7.3 Good SLI criteria
- User-centric: what users feel, not internal metrics
- Simple: easy to compute and understand
- Predictable: sits near 100% in normal operation, so degradation stands out
- Tamper-proof: cannot be raised by redefinition
Examples:
- Bad: "server CPU
<= 70%" - Good: "99% of user requests return 200 in 500ms"
7.4 Burn Rate
How fast the Error Budget is burning.
Burn Rate = current error rate / allowed error rate
14.4x burn for 1 hour ≈ 2% of the monthly budget
6x burn for 6 hours ≈ 5% of the monthly budget
Multi-window alerts: short-window high burn plus long-window moderate burn — cuts false positives.
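The same math as a short sketch: burn rate times window length, as a share of a 30-day budget.
# Budget consumed = burn_rate × window / total period (assumes a 30-day SLO window).
def budget_consumed(burn_rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    return burn_rate * window_hours / period_hours

print(f"{budget_consumed(14.4, 1):.1%}")   # ~2% of the monthly budget in 1 hour
print(f"{budget_consumed(6, 6):.1%}")      # ~5% of the monthly budget in 6 hours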
7.5 Error Budget Policy
- 100–50% left: normal release cadence
- 50–10%: more release testing, prepare feature freeze
- < 10%: feature freeze, stability sprint
- Burned: postmortem, process review
Part 8 — Choosing an observability platform (2025)
8.1 Commercial vs OSS
| Commercial | OSS Stack |
|---|---|
| Datadog, New Relic, Dynatrace, Splunk, Honeycomb | Grafana Stack, Signoz, Kibana |
| Fast ROI, support, advanced AI | Flexible, cheap, no lock-in |
| Cost can explode | Needs operators |
8.2 Situational recommendations
| Situation | Recommendation |
|---|---|
| Startup under 20 people | Datadog or Signoz SaaS |
| Growth-stage | Grafana Cloud or Honeycomb |
| Mid-size | Grafana Stack self-host + Prometheus |
| Enterprise | Hybrid (Datadog + OSS + in-house) |
| Cost-obsessed | ClickHouse + Grafana |
8.3 Datadog cost traps
- Custom Metrics: billed per metric series (roughly $5 per 100 series/month). Cardinality blow-up = bomb
- Log Ingestion: expensive per GB. Filter aggressively
- APM Host: $31/host/month. Many microservices = explosion
- Fix: pre-filter, sample, and aggregate at the OTEL Collector
8.4 Honeycomb's differentiation
- High-cardinality-first search: free-form queries by trace_id, user_id
- BubbleUp: automatic anomaly grouping
- Philosophy: "raw events first", not "metrics first"
Part 9 — Organizational observability
9.1 Observability-driven development
- Think instrumentation from day one
- "Observability impact" checkbox on every PR
- New features ship with a draft SLO
- Post-incident reviews ask "why didn't we observe this?"
9.2 Incident management link
- MTTR down = observability maturity proxy
- Runbooks: "this alert → this dashboard → this trace" checklist
- Blameless postmortems: why didn't the system detect it?
9.3 Observability cost model
total cost = sum(signal size × retention × cost/GB)
levers:
1. Sample (trace 1%, log 10%)
2. Aggregate (rollups, not raw metrics)
3. Tiered retention (ERROR 90d, INFO 7d)
4. Pre-filter (OTEL Collector)
5. Cold/hot tier (archive to S3)
Target: observability cost stays at 5–15% of infra cost.
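A toy version of this model; every size, retention period, and $/GB figure below is a made-up input, not a benchmark.
# total = sum over signals of (ingest GB/day × retention days × $/GB-month of storage)
signals = {
    # signal: (GB per day, retention days, $ per GB-month)
    "metrics": (5, 365, 0.03),
    "logs": (200, 30, 0.10),
    "traces": (50, 7, 0.10),
    "profiles": (10, 7, 0.05),
}

total = sum(gb_day * retention_days * cost_per_gb_month
            for gb_day, retention_days, cost_per_gb_month in signals.values())
print(f"${total:,.0f} per month")   # each lever above shrinks one of the three factors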
Part 10 — LLM observability (new in 2024–2025)
10.1 What to measure
- Token Usage: prompt and completion tokens
- Latency: TTFT, total, per-token
- Cost: USD (model × tokens)
- Quality: LLM-as-Judge score, user feedback
- Tool Calls: success, failure, chain depth
- Safety: moderation result, refuse rate
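A hedged sketch of recording a few of these as span attributes. The gen_ai.* names follow the OTEL Gen AI semantic conventions but may differ by convention version; the app.* attributes and all values are custom and illustrative.
# LLM call telemetry as span attributes on the llm span.
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm.openai.chat") as span:
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 812)
    span.set_attribute("gen_ai.usage.output_tokens", 156)
    # Cost and quality scores are not standardized; record them as custom attributes.
    span.set_attribute("app.llm.cost_usd", 0.0021)
    span.set_attribute("app.llm.judge_score", 0.87)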
10.2 Tools
- Langfuse: OSS, self-hostable
- LangSmith: LangChain team
- Helicone: proxy-based
- Phoenix (Arize): strong OSS
- OTEL Gen AI Semantic Conventions: standard integration
10.3 LLM trace structure
user_query (root span)
├─ retrieval (vector search)
│ ├─ embedding (token count)
│ └─ qdrant.search (query vector)
├─ llm.openai.chat (tokens, cost)
│ └─ tool_call: get_weather
│ └─ api.weather
└─ response_generation
Part 11 — Six-month observability roadmap
Month 1: foundations
- Install Prometheus and Grafana
- Understand the four metric types
- PromQL basics
Month 2: logging
- Structured logging standard
- Run Loki or Elastic
- Log-level policy
Month 3: tracing
- Integrate OTEL SDK
- Deploy Tempo or Jaeger
- Tail-based sampling
Month 4: SLO
- Define SLIs for key services
- Compute Error Budgets
- Burn Rate alerts
Month 5: eBPF and profile
- Adopt Pyroscope
- Cilium Hubble for K8s network
- Beyla auto-instrumentation
Month 6: cost and org
- Audit observability cost
- LLM observability (Langfuse)
- Post-incident review process
Part 12 — Observability checklist (12)
- Know the strength of each of the 3 pillars + Profile
- Can explain OpenTelemetry Collector architecture
- Know the four Prometheus metric types
- Know what cardinality explosion is and how to prevent it
- Know the difference between Head and Tail-based sampling
- Know how W3C Trace Context connects services
- Know why eBPF is great for auto-instrumentation
- Clearly distinguish SLI, SLO, SLA
- Can compute Error Budget and Burn Rate
- Know 5 ways to cut log cost
- Know 3 Datadog cost traps
- Know 5 LLM observability metrics
Part 13 — Ten anti-patterns
- Unstructured logs: printf style. No search, no aggregation
- Ignoring cardinality: user_id as a label = explosion
- 100% trace retention: blows up cost. Tail sampling is mandatory
- Log-to-metric abuse: expensive. Use Counter and Gauge
- Alert fatigue: hundreds a day. Keep only what matters
- No SLOs: no shared definition of "problem"
- Dashboard jungle: 50 dashboards, 10 used
- No trace_id plumbing: logs without trace_id = debugging hell
- Uniform retention: ERROR and DEBUG kept for the same time = waste
- Observability as afterthought: "we'll add it post-launch" = never
Closing — Observability is the system's sense of self
Just as a human who feels no pain cannot care for their body, a system that cannot perceive its own state cannot be operated.
In 2025, observability comes down to:
- Standards (OpenTelemetry for vendor independence)
- Integration (stitch four pillars with trace_id)
- Economics (control cost while keeping signal)
- Organization (SLOs are team agreements)
Tools churn every year. The Grafana Stack moves from Prometheus to Mimir to Grafana Cloud, Elasticsearch gets forked into OpenSearch, Datadog embraces eBPF. But the principles hold.
Remember: "If you cannot observe it, you cannot operate it."
Next up — "Security Complete Guide: Zero Trust, Secrets, OAuth, OIDC, Supply Chain, AI Security"
Season 2 Episode 10 is the other operational must: security engineering. Next time:
- Zero Trust architecture in practice
- OAuth 2.1 / OIDC / PKCE
- Secret management (Vault, AWS KMS, SOPS)
- SBOM and supply chain defense
- Container and K8s security (Pod Security, Admission)
- OWASP Top 10 for LLM
- Security-oriented observability
Crypto is easy, security is hard. To be continued.