
Observability Complete Guide — Metric, Log, Trace, OpenTelemetry, eBPF, SLO (Season 2 Ep 9, 2025)


Intro — Observability vs Monitoring

Monitoring: detecting known problems. "Alert if CPU > 80%." Observability: investigating unknown problems. "Why did P99 suddenly triple?"

Opening the black boxes

In 2025, every operator needs to:

  • Trace request flow across distributed systems
  • Find CPU and memory bottlenecks via profiles
  • Diagnose at the kernel level with eBPF
  • Pace releases with SLO and Error Budget
  • Control Log and Metric cost without losing signal

This post covers four layers: fundamentals, practice, cost, organization.


Part 1 — Three Pillars + Profile = Four Pillars

1.1 Metric

Numeric time series — easy to aggregate, retain long term, and alert on.

  • Example: http_requests_total{status="200",method="GET"}
  • Tools: Prometheus, VictoriaMetrics, Mimir, Cortex
  • Pros: storage efficient, fast dashboards
  • Cons: cardinality explosion risk (many labels = huge cost)

1.2 Log

Event stream — detailed but large.

  • Structured logging (JSON) is mandatory
  • Tools: Loki, Elasticsearch, OpenSearch, Quickwit, ClickHouse
  • Pros: detailed, flexible
  • Cons: expensive, slow search

1.3 Trace

Causal chain of a request — core for distributed debugging.

  • Concepts: Span (unit of work) + Context Propagation
  • Tools: Jaeger, Tempo, Zipkin, Honeycomb
  • Pros: see which service is the bottleneck
  • Cons: sampling strategy is critical (100% storage is unrealistic)

1.4 Profile (fourth pillar, 2023+)

Continuous in-process CPU and memory profiles.

  • Tools: Pyroscope (Grafana), Parca, Polar Signals
  • Pros: answer "which function eats CPU?" in real time
  • Cons: overhead and storage management

1.5 Four-pillar correlation scenario

Alert: P99 latency spike
[Metric] which service? → checkout-service
[Trace] which span? → payment-gateway call, 3s
[Log] logs for that trace_id → timeout error
[Profile] CPU during that window → 90% in TLS handshake
Root cause: expiring cert causes handshake surge

Correlation is the point. All four pillars must be stitched by trace_id.
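
A minimal Python sketch of that stitching, assuming the OpenTelemetry SDK is already set up: read the active trace_id from the span context and attach it to every structured log record. The logger name and field names here are illustrative.

import logging

from opentelemetry import trace

logger = logging.getLogger("checkout-service")  # illustrative name

def log_with_trace_id(message: str, **fields):
    # Attach the active trace_id so logs can be joined with traces and profiles
    span_ctx = trace.get_current_span().get_span_context()
    if span_ctx.is_valid:
        fields["trace_id"] = format(span_ctx.trace_id, "032x")  # 32-char hex, as backends display it
    logger.info(message, extra=fields)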


Part 2 — OpenTelemetry: the observation standard

2.1 What OTEL solved

Before: every vendor shipped its own SDK (Prometheus, Datadog, New Relic...). Switching vendor meant rewriting instrumentation.

OTEL: instrument once, export anywhere.

2.2 OTEL components

Application
  ↓ (OTEL SDK)
OTLP Protocol
[OTEL Collector]
  ├─ Receivers (OTLP, Prometheus, Jaeger...)
  ├─ Processors (batch, filter, sample, enrich)
  └─ Exporters (Tempo, Loki, Prometheus, Datadog, ...)
Backend (where you want)
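
A minimal Python sketch of the application side of this picture, assuming the opentelemetry-sdk and OTLP exporter packages are installed; the service name and Collector endpoint are placeholders.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service; the Collector can enrich or relabel this later
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))

# Ship spans over OTLP/gRPC to a local OTEL Collector (placeholder endpoint)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-operation"):
    pass  # real work goes here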

2.3 Signals — three, plus profile

  • Traces: Stable
  • Metrics: Stable
  • Logs: Stable (2024)
  • Profiles: Beta (2024–2025)

2.4 Context Propagation

When a request flows A → B → C, the same trace_id must propagate:

HTTP Headers:
  traceparent: 00-TRACEID-SPANID-01
  tracestate: ...

W3C Trace Context standard. OTEL handles it automatically.
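
When auto-instrumentation does not cover a hop, the same propagation can be done by hand. A small sketch with the OTEL Python API; the requests call and URL are illustrative.

import requests

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def call_service_b():
    # Service A, outgoing: write traceparent/tracestate into the HTTP headers
    headers = {}
    inject(headers)
    return requests.get("http://service-b/api", headers=headers)  # illustrative URL

def handle_request(incoming_headers: dict):
    # Service B, incoming: restore the context so the same trace_id continues
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle_request", context=ctx):
        pass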

2.5 Auto-Instrumentation

  • Java: -javaagent:opentelemetry-javaagent.jar (bytecode manipulation)
  • Python: opentelemetry-instrument python app.py
  • Node.js: NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register"
  • Go: explicit SDK instrumentation (no reflection)
  • Rust: tracing crate

2.6 OTEL evolution (2024–2025)

  • Profile Signal promoted
  • Gen AI Semantic Conventions: LLM calls standardized
  • Exponential Histograms: high-resolution metrics
  • OTEL Collector Contrib: hundreds of receivers and exporters

Part 3 — Prometheus and Grafana Stack

3.1 Grafana Stack (2025 OSS default)

  • Prometheus / Mimir: Metric
  • Loki: Log
  • Tempo: Trace
  • Pyroscope: Profile
  • Grafana: dashboards
  • Alertmanager: alerts
  • Beyla: eBPF auto-instrumentation

3.2 Prometheus metric types

  1. Counter: monotonic (request count)
  2. Gauge: current value (memory use)
  3. Histogram: bucketed (latency distribution)
  4. Summary: pre-computed quantiles
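
A minimal sketch of the four types with the Python prometheus_client library; metric names, labels, and values are illustrative, and the labels are deliberately low-cardinality (no user_id).

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["method", "status"])
IN_FLIGHT = Gauge("http_in_flight_requests", "Requests currently being served")
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)
PAYLOAD = Summary("http_request_size_bytes", "Request payload size")

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

REQUESTS.labels(method="GET", status="200").inc()
IN_FLIGHT.set(3)
LATENCY.observe(0.42)
PAYLOAD.observe(512)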

3.3 PromQL basics

# requests per second
rate(http_requests_total[1m])

# P99 per service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

3.4 Cardinality management

Problem: {user_id, path, status, method} with 1M users × 10 paths × 5 statuses × 4 methods = 200M series. Prometheus collapses.

Fix:

  • Ban high-cardinality labels (user_id, trace_id)
  • Aggregate first, store later
  • Scale with VictoriaMetrics or Mimir (more efficient than vanilla Prometheus)

Part 4 — Log management: cost vs quality

4.1 Why log cost explodes

  • Debug logs left on in production
  • Plain text instead of JSON
  • Full-text engines (Elastic) are expensive
  • Over-long retention

4.2 Log level strategy

  • ERROR: alert-worthy; retain 30–90 days
  • WARN: caution; retain 30 days
  • INFO: key events; retain 7–30 days
  • DEBUG: dev only; drop, or retain 1 day
  • TRACE: deep debugging; drop

4.3 Structured logging (mandatory)

import logging

logger = logging.getLogger(__name__)  # a JSON formatter (e.g. python-json-logger) turns extra fields into log fields

# bad: a free-form string that is hard to filter or aggregate on
logger.info(f"User {user_id} logged in from {ip}")

# good: a structured event with machine-readable fields
logger.info("user_login", extra={
    "user_id": user_id,
    "ip": ip,
    "trace_id": trace_id,
})

Search, filter, and aggregate all become possible.

4.4 Log solutions (2025)

  • Loki: low-cost, minimal indexing; the Grafana Stack default
  • Elasticsearch: full-text search; expensive
  • OpenSearch: Elasticsearch fork; AWS-friendly
  • Quickwit: Rust, log-specialized; new contender
  • ClickHouse: columnar DB; strong on analytics
  • Vector.dev (Datadog): ingest pipeline; routing and filtering

Part 5 — Distributed Tracing in practice

5.1 Four sampling strategies

  1. Head-based (probabilistic): decide at the start of the request, keep N%. Simple, and the decision propagates so parent and child spans stay consistent.
  2. Rate-limited: X per second
  3. Tail-based: accept all, store only errors and slow. Best cost/signal balance
  4. Dynamic: auto-increase on anomaly

5.2 Tail-based Sampling (the 2024–2025 default)

# OTEL Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: random
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

Keep 100% of errors and slow requests, plus 1% of everything else.

5.3 Span naming and attributes

  • Span name: service.operation (e.g. db.query, http.get)
  • Attributes: follow the OpenTelemetry Semantic Conventions
  • Events: key stages (cache.hit, retry, backoff)
  • Errors: span.setStatus(ERROR) + exception event
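
A small sketch of these conventions with the OTEL Python SDK; the span name, attribute keys, and failure are illustrative.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("db.query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM users WHERE email = $1")
    span.add_event("cache.miss")  # key stage recorded as an event
    try:
        raise TimeoutError("query timed out")  # illustrative failure
    except TimeoutError as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR, "db timeout"))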

5.4 Debugging example

Gateway (20ms)
├─ auth-service (5ms)
├─ user-service (50ms)
│  └─ postgres.query (45ms)  ← slow here
└─ product-service (10ms)

The 45ms query span attribute db.statement = "SELECT * FROM users WHERE ..." points to a missing index.


Part 6 — eBPF: observation without code changes

6.1 What is eBPF

Safe bytecode injected into the Linux kernel to trace network, syscalls, and function calls.

Pros:

  • Observe without touching app code
  • Minimal overhead (kernel-resident)
  • Network and system level

6.2 eBPF observability tools (2025)

  • Pixie (New Relic): Kubernetes-native, auto-telemetry
  • Cilium Hubble: network flow
  • Parca: always-on CPU profiling
  • Grafana Beyla: auto-instrumentation
  • Inspektor Gadget: Kubernetes toolkit
  • Odigos: eBPF-based auto-instrumentation SaaS

6.3 eBPF's real value

"Language-agnostic auto-instrument" — Java, Go, Python, Rust, or Node, any HTTP/gRPC/DB call gets traced automatically.

Cons: kernel >= 5.x required, Windows still in progress, debugging is still hard.


Part 7 — SLO, SLI, Error Budget

7.1 Definitions (Google SRE)

  • SLI (Indicator): the measurement itself, e.g. "the proportion of requests served within 200ms"
  • SLO (Objective): the target for that SLI, e.g. "99.9% over a rolling 30 days"
  • SLA (Agreement): the customer-facing promise with consequences, e.g. "refund if availability drops below 99.5%"

7.2 Error Budget

100% - SLO = Error Budget.

  • SLO 99.9% → budget 0.1% = ~43 min downtime per month
  • Budget left → ship
  • Budget burned → freeze, stabilize
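
The arithmetic in a few lines, assuming a 30-day window:

# error budget for a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60              # 43,200 minutes

budget_minutes = window_minutes * (1 - slo)
print(f"{budget_minutes:.1f} minutes of downtime allowed")   # ~43.2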

7.3 Good SLI criteria

  1. User-centric: what users feel, not internal metrics
  2. Simple: easy to compute and understand
  3. Predictable: sits near 100% in normal operation
  4. Tamper-proof: cannot be raised by redefinition

Examples:

  • Bad: "server CPU <= 70%"
  • Good: "99% of user requests return 200 in 500ms"

7.4 Burn Rate

How fast the Error Budget is burning.

Burn Rate = current error rate / allowed error rate

14.4x burn for 1 hour = 2% of a 30-day budget
6x burn for 6 hours = 5% of a 30-day budget

Multi-window alerts: short-window high burn plus long-window moderate burn — cuts false positives.
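
A sketch of that multi-window check in Python; in practice the two error rates would come from PromQL over a short and a long window, and the 14.4x threshold follows the common 30-day, 1-hour pairing.

SLO = 0.999
ALLOWED_ERROR_RATE = 1 - SLO   # 0.1%

def burn_rate(observed_error_rate: float) -> float:
    return observed_error_rate / ALLOWED_ERROR_RATE

def should_page(short_window_rate: float, long_window_rate: float) -> bool:
    # Page only when both the short and the long window burn fast (cuts false positives)
    return burn_rate(short_window_rate) > 14.4 and burn_rate(long_window_rate) > 14.4

print(should_page(short_window_rate=0.02, long_window_rate=0.016))   # True: burning 16-20x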

7.5 Error Budget Policy

100–50% left: normal release cadence
50–10%: more release testing, prepare feature freeze
< 10%: feature freeze, stability sprint
Burned: postmortem, process review

Part 8 — Choosing an observability platform (2025)

8.1 Commercial vs OSS

  • Commercial (Datadog, New Relic, Dynatrace, Splunk, Honeycomb): fast ROI, vendor support, advanced AI features; cost can explode
  • OSS stack (Grafana Stack, SigNoz, Kibana): flexible, cheap, no lock-in; needs operators to run it

8.2 Situational recommendations

  • Startup under 20 people: Datadog or SigNoz SaaS
  • Growth stage: Grafana Cloud or Honeycomb
  • Mid-size: self-hosted Grafana Stack + Prometheus
  • Enterprise: hybrid (Datadog + OSS + in-house)
  • Cost-obsessed: ClickHouse + Grafana

8.3 Datadog cost traps

  • Custom Metrics: billed per custom metric series per month. Cardinality blow-up = bomb
  • Log Ingestion: expensive per GB. Filter aggressively
  • APM Host: $31/host/month. Many microservices = explosion
  • Fix: pre-filter, sample, and aggregate at the OTEL Collector

8.4 Honeycomb's differentiation

  • High-cardinality first search: free-form by trace_id, user_id
  • BubbleUp: automatic anomaly grouping
  • Philosophy: "raw events first", not "metrics first"

Part 9 — Organizational observability

9.1 Observability-driven development

  • Think about instrumentation from day one
  • Add an "observability impact" checkbox to every PR
  • Ship new features with a draft SLO

9.2 Incident review culture

  • Post-incident reviews ask "why didn't we observe this?"
  • Falling MTTR is a proxy for observability maturity
  • Runbooks: a "this alert → this dashboard → this trace" checklist
  • Blameless postmortems: why didn't the system detect it?

9.3 Observability cost model

total cost = sum(signal size × retention × cost/GB)

levers:
1. Sample (trace 1%, log 10%)
2. Aggregate (rollups, not raw metrics)
3. Tiered retention (ERROR 90d, INFO 7d)
4. Pre-filter (OTEL Collector)
5. Cold/hot tier (archive to S3)

Target: observability cost stays at 5–15% of infra cost.
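
A back-of-the-envelope sketch of the cost formula above; every volume, retention period, and unit price below is a made-up placeholder.

# total cost = sum(signal size x retention x cost/GB), placeholder numbers
signals = {
    #   name      (GB/day, retention_days, $/GB-month)
    "metrics":  (    5,      395,            0.03),
    "logs":     (  200,       30,            0.10),
    "traces":   (   50,       14,            0.05),
    "profiles": (   10,        7,            0.05),
}

monthly_cost = 0.0
for name, (gb_per_day, retention_days, dollars_per_gb_month) in signals.items():
    stored_gb = gb_per_day * retention_days   # steady-state volume kept
    monthly_cost += stored_gb * dollars_per_gb_month

print(f"~${monthly_cost:,.0f}/month")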


Part 10 — LLM observability (new in 2024–2025)

10.1 What to measure

  • Token Usage: prompt and completion tokens
  • Latency: TTFT, total, per-token
  • Cost: USD (model × tokens)
  • Quality: LLM-as-Judge score, user feedback
  • Tool Calls: success, failure, chain depth
  • Safety: moderation result, refuse rate

10.2 Tools

  • Langfuse: OSS, self-hostable
  • LangSmith: LangChain team
  • Helicone: proxy-based
  • Phoenix (Arize): strong OSS
  • OTEL Gen AI Semantic Conventions: standard integration

10.3 LLM trace structure

user_query (root span)
├─ retrieval (vector search)
│   ├─ embedding (token count)
│   └─ qdrant.search (query vector)
├─ llm.openai.chat (tokens, cost)
│   └─ tool_call: get_weather
│       └─ api.weather
└─ response_generation
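
A sketch of how this tree might be emitted with plain OTEL spans in Python; the attribute keys loosely follow the Gen AI semantic conventions but are illustrative rather than the exact standardized names.

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("user_query"):
    with tracer.start_as_current_span("retrieval"):
        with tracer.start_as_current_span("embedding") as span:
            span.set_attribute("tokens", 23)                    # illustrative key
        with tracer.start_as_current_span("qdrant.search"):
            pass
    with tracer.start_as_current_span("llm.openai.chat") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4o")    # illustrative values
        span.set_attribute("gen_ai.usage.input_tokens", 512)
        span.set_attribute("gen_ai.usage.output_tokens", 128)
        span.set_attribute("llm.cost_usd", 0.004)               # custom attribute
        with tracer.start_as_current_span("tool_call:get_weather"):
            with tracer.start_as_current_span("api.weather"):
                pass
    with tracer.start_as_current_span("response_generation"):
        pass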

Part 11 — Six-month observability roadmap

Month 1: foundations

  • Install Prometheus and Grafana
  • Understand the four metric types
  • PromQL basics

Month 2: logging

  • Structured logging standard
  • Run Loki or Elastic
  • Log-level policy

Month 3: tracing

  • Integrate OTEL SDK
  • Deploy Tempo or Jaeger
  • Tail-based sampling

Month 4: SLO

  • Define SLIs for key services
  • Compute Error Budgets
  • Burn Rate alerts

Month 5: eBPF and profile

  • Adopt Pyroscope
  • Cilium Hubble for K8s network
  • Beyla auto-instrumentation

Month 6: cost and org

  • Audit observability cost
  • LLM observability (Langfuse)
  • Post-incident review process

Part 12 — Observability checklist (12 items)

  1. Know the strength of each of the 3 pillars + Profile
  2. Can explain OpenTelemetry Collector architecture
  3. Know the four Prometheus metric types
  4. Know what cardinality explosion is and how to prevent it
  5. Know the difference between Head and Tail-based sampling
  6. Know how W3C Trace Context connects services
  7. Know why eBPF is great for auto-instrumentation
  8. Clearly distinguish SLI, SLO, SLA
  9. Can compute Error Budget and Burn Rate
  10. Know 5 ways to cut log cost
  11. Know 3 Datadog cost traps
  12. Know 5 LLM observability metrics

Part 13 — Ten anti-patterns

  1. Unstructured logs: printf style. No search, no aggregation
  2. Ignoring cardinality: user_id as a label, explosion
  3. 100% trace retention: blows cost. Tail sampling is mandatory
  4. Log-to-metric abuse: expensive. Use Counter and Gauge
  5. Alert fatigue: hundreds a day. Keep only what matters
  6. No SLOs: no shared definition of "problem"
  7. Dashboard jungle: 50 dashboards, 10 used
  8. No trace_id plumbing: logs without trace_id = debugging hell
  9. Uniform retention: ERROR and DEBUG kept the same time = waste
  10. Observability as afterthought: "we'll add it post-launch" = never

Closing — Observability is the system's sense of self

Just as a human who feels no pain cannot care for their body, a system that cannot perceive its own state cannot be operated.

In 2025, observability comes down to:

  • Standards (OpenTelemetry for vendor independence)
  • Integration (stitch four pillars with trace_id)
  • Economics (control cost while keeping signal)
  • Organization (SLOs are team agreements)

Tools churn every year. The Grafana Stack goes Prometheus → Mimir → Grafana Cloud, Elasticsearch gets forked into OpenSearch, Datadog embraces eBPF. But the principles hold.

Remember: "If you cannot observe it, you cannot operate it."


Next up — "Security Complete Guide: Zero Trust, Secrets, OAuth, OIDC, Supply Chain, AI Security"

Season 2 Episode 10 is the other operational must: security engineering. Next time:

  • Zero Trust architecture in practice
  • OAuth 2.1 / OIDC / PKCE
  • Secret management (Vault, AWS KMS, SOPS)
  • SBOM and supply chain defense
  • Container and K8s security (Pod Security, Admission)
  • OWASP Top 10 for LLM
  • Security-oriented observability

Crypto is easy, security is hard. See you in the next episode.