
Observability Complete Guide — Metric, Log, Trace, OpenTelemetry, eBPF, SLO (Season 2 Ep 9, 2025)


Intro — Observability vs Monitoring

Monitoring: detecting known problems. "Alert if CPU > 80%." Observability: investigating unknown problems. "Why did P99 suddenly triple?"

Opening the black boxes

In 2025, every operator needs to:

  • Trace request flow across distributed systems
  • Find CPU and memory bottlenecks via profiles
  • Diagnose at the kernel level with eBPF
  • Pace releases with SLO and Error Budget
  • Control Log and Metric cost without losing signal

This post covers four layers: fundamentals, practice, cost, organization.


Part 1 — Three Pillars + Profile = Four Pillars

1.1 Metric

Numeric time series — easy to aggregate, retain long term, and alert on.

  • Example: http_requests_total{status="200",method="GET"}
  • Tools: Prometheus, VictoriaMetrics, Mimir, Cortex
  • Pros: storage efficient, fast dashboards
  • Cons: cardinality explosion risk (many labels = huge cost)

1.2 Log

Event stream — detailed but large.

  • Structured logging (JSON) is mandatory
  • Tools: Loki, Elasticsearch, OpenSearch, Quickwit, ClickHouse
  • Pros: detailed, flexible
  • Cons: expensive, slow search

1.3 Trace

Causal chain of a request — core for distributed debugging.

  • Concepts: Span (unit of work) + Context Propagation
  • Tools: Jaeger, Tempo, Zipkin, Honeycomb
  • Pros: see which service is the bottleneck
  • Cons: sampling strategy is critical (100% storage is unrealistic)

1.4 Profile (fourth pillar, 2023+)

Continuous in-process CPU and memory profiles.

  • Tools: Pyroscope (Grafana), Parca, Polar Signals
  • Pros: answer "which function eats CPU?" in real time
  • Cons: overhead and storage management

1.5 Four-pillar correlation scenario

Alert: P99 latency spike
[Metric] which service? → checkout-service
[Trace] which span? → payment-gateway call, 3s
[Log] logs for that trace_id → timeout error
[Profile] CPU during that window → 90% in TLS handshake
Root cause: expiring cert causes handshake surge

Correlation is the point. All four pillars must be stitched by trace_id.
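
A minimal Python sketch of that stitching, assuming the OpenTelemetry SDK is already set up: read the active trace_id from the span context and attach it to every structured log record. The logger name and field names here are illustrative.

import logging

from opentelemetry import trace

logger = logging.getLogger("checkout-service")  # illustrative name

def log_with_trace_id(message: str, **fields):
    # Attach the active trace_id so logs can be joined with traces and profiles
    span_ctx = trace.get_current_span().get_span_context()
    if span_ctx.is_valid:
        fields["trace_id"] = format(span_ctx.trace_id, "032x")  # 32-char hex, as backends display it
    logger.info(message, extra=fields)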


Part 2 — OpenTelemetry: the observation standard

2.1 What OTEL solved

Before: every vendor shipped its own SDK (Prometheus, Datadog, New Relic...). Switching vendor meant rewriting instrumentation.

OTEL: instrument once, export anywhere.

2.2 OTEL components

Application
  ↓ (OTEL SDK)
OTLP Protocol
[OTEL Collector]
  ├─ Receivers (OTLP, Prometheus, Jaeger...)
  ├─ Processors (batch, filter, sample, enrich)
  └─ Exporters (Tempo, Loki, Prometheus, Datadog, ...)
Backend (where you want)
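
A minimal Python sketch of the application side of this picture, assuming the opentelemetry-sdk and OTLP exporter packages are installed; the service name and Collector endpoint are placeholders.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service; the Collector can enrich or relabel this later
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))

# Ship spans over OTLP/gRPC to a local OTEL Collector (placeholder endpoint)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-operation"):
    pass  # real work goes here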

2.3 Signals — three, plus profile

  • Traces: Stable
  • Metrics: Stable
  • Logs: Stable (2024)
  • Profiles: Beta (2024–2025)

2.4 Context Propagation

When a request flows A → B → C, the same trace_id must propagate:

HTTP Headers:
  traceparent: 00-TRACEID-SPANID-01
  tracestate: ...

W3C Trace Context standard. OTEL handles it automatically.
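
When auto-instrumentation does not cover a hop, the same propagation can be done by hand. A small sketch with the OTEL Python API; the requests call and URL are illustrative.

import requests

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def call_service_b():
    # Service A, outgoing: write traceparent/tracestate into the HTTP headers
    headers = {}
    inject(headers)
    return requests.get("http://service-b/api", headers=headers)  # illustrative URL

def handle_request(incoming_headers: dict):
    # Service B, incoming: restore the context so the same trace_id continues
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle_request", context=ctx):
        pass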

2.5 Auto-Instrumentation

  • Java: -javaagent:opentelemetry-javaagent.jar (bytecode manipulation)
  • Python: opentelemetry-instrument python app.py
  • Node.js: NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register"
  • Go: explicit SDK instrumentation (no reflection)
  • Rust: tracing crate

2.6 OTEL evolution (2024–2025)

  • Profile Signal promoted
  • Gen AI Semantic Conventions: LLM calls standardized
  • Exponential Histograms: high-resolution metrics
  • OTEL Collector Contrib: hundreds of receivers and exporters

Part 3 — Prometheus and Grafana Stack

3.1 Grafana Stack (2025 OSS default)

  • Prometheus / Mimir: Metric
  • Loki: Log
  • Tempo: Trace
  • Pyroscope: Profile
  • Grafana: dashboards
  • Alertmanager: alerts
  • Beyla: eBPF auto-instrumentation

3.2 Prometheus metric types

  1. Counter: monotonic (request count)
  2. Gauge: current value (memory use)
  3. Histogram: bucketed (latency distribution)
  4. Summary: pre-computed quantiles
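
A minimal sketch of the four types with the Python prometheus_client library; metric names, labels, and values are illustrative, and the labels are deliberately low-cardinality (no user_id).

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["method", "status"])
IN_FLIGHT = Gauge("http_in_flight_requests", "Requests currently being served")
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)
PAYLOAD = Summary("http_request_size_bytes", "Request payload size")

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

REQUESTS.labels(method="GET", status="200").inc()
IN_FLIGHT.set(3)
LATENCY.observe(0.42)
PAYLOAD.observe(512)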

3.3 PromQL basics

# requests per second
rate(http_requests_total[1m])

# P99 per service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

3.4 Cardinality management

Problem: {user_id, path, status, method} with 1M users × 10 paths × 5 statuses × 4 methods = 200M series. Prometheus collapses.

Fix:

  • Ban high-cardinality labels (user_id, trace_id)
  • Aggregate first, store later
  • Scale with VictoriaMetrics or Mimir (more efficient than vanilla Prometheus)

Part 4 — Log management: cost vs quality

4.1 Why log cost explodes

  • Debug logs left on in production
  • Plain text instead of JSON
  • Full-text engines (Elastic) are expensive
  • Over-long retention

4.2 Log level strategy

  • ERROR: alert-worthy; retain 30–90 days
  • WARN: caution; retain 30 days
  • INFO: key events; retain 7–30 days
  • DEBUG: dev only; drop, or retain 1 day
  • TRACE: deep debugging; drop

4.3 Structured logging (mandatory)

import logging

logger = logging.getLogger(__name__)  # a JSON formatter (e.g. python-json-logger) turns extra fields into log fields

# bad: a free-form string that is hard to filter or aggregate on
logger.info(f"User {user_id} logged in from {ip}")

# good: a structured event with machine-readable fields
logger.info("user_login", extra={
    "user_id": user_id,
    "ip": ip,
    "trace_id": trace_id,
})

Search, filter, and aggregate all become possible.

4.4 Log solutions (2025)

  • Loki: low-cost, minimal indexing; the Grafana Stack default
  • Elasticsearch: full-text search; expensive
  • OpenSearch: Elasticsearch fork; AWS-friendly
  • Quickwit: Rust, log-specialized; new contender
  • ClickHouse: columnar DB; strong on analytics
  • Vector.dev (Datadog): ingest pipeline; routing and filtering

Part 5 — Distributed Tracing in practice

5.1 Four sampling strategies

  1. Head-based (probabilistic): decide at the start of the request, keep N%. Simple, and the decision propagates so parent and child spans stay consistent.
  2. Rate-limited: X per second
  3. Tail-based: accept all, store only errors and slow. Best cost/signal balance
  4. Dynamic: auto-increase on anomaly

5.2 Tail-based Sampling (the 2024–2025 default)

# OTEL Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: random
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

Keep 100% of errors and slow requests, plus 1% of everything else.

5.3 Span naming and attributes

  • Span name: service.operation (e.g. db.query, http.get)
  • Attributes: follow the OpenTelemetry Semantic Conventions
  • Events: key stages (cache.hit, retry, backoff)
  • Errors: span.setStatus(ERROR) + exception event
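
A small sketch of these conventions with the OTEL Python SDK; the span name, attribute keys, and failure are illustrative.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("db.query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM users WHERE email = $1")
    span.add_event("cache.miss")  # key stage recorded as an event
    try:
        raise TimeoutError("query timed out")  # illustrative failure
    except TimeoutError as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR, "db timeout"))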

5.4 Debugging example

Gateway (20ms)
├─ auth-service (5ms)
├─ user-service (50ms)
│  └─ postgres.query (45ms)  ← slow here
└─ product-service (10ms)

The 45ms query span attribute db.statement = "SELECT * FROM users WHERE ..." points to a missing index.


Part 6 — eBPF: observation without code changes

6.1 What is eBPF

Safe bytecode injected into the Linux kernel to trace network, syscalls, and function calls.

Pros:

  • Observe without touching app code
  • Minimal overhead (kernel-resident)
  • Network and system level

6.2 eBPF observability tools (2025)

  • Pixie (New Relic): Kubernetes-native, auto-telemetry
  • Cilium Hubble: network flow
  • Parca: always-on CPU profiling
  • Grafana Beyla: auto-instrumentation
  • Inspektor Gadget: Kubernetes toolkit
  • Odigos: eBPF-based auto-instrumentation SaaS

6.3 eBPF's real value

"Language-agnostic auto-instrument" — Java, Go, Python, Rust, or Node, any HTTP/gRPC/DB call gets traced automatically.

Cons: kernel >= 5.x required, Windows still in progress, debugging is still hard.


Part 7 — SLO, SLI, Error Budget

7.1 Definitions (Google SRE)

  • SLI (Indicator): the measurement itself, e.g. "the proportion of requests served within 200ms"
  • SLO (Objective): the target for that SLI, e.g. "99.9% over a rolling 30 days"
  • SLA (Agreement): the customer-facing promise with consequences, e.g. "refund if availability drops below 99.5%"

7.2 Error Budget

100% - SLO = Error Budget.

  • SLO 99.9% → budget 0.1% = ~43 min downtime per month
  • Budget left → ship
  • Budget burned → freeze, stabilize
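
The arithmetic in a few lines, assuming a 30-day window:

# error budget for a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60              # 43,200 minutes

budget_minutes = window_minutes * (1 - slo)
print(f"{budget_minutes:.1f} minutes of downtime allowed")   # ~43.2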

7.3 Good SLI criteria

  1. User-centric: what users feel, not internal metrics
  2. Simple: easy to compute and understand
  3. Predictable: sits near 100% in normal operation
  4. Tamper-proof: cannot be raised by redefinition

Examples:

  • Bad: "server CPU <= 70%"
  • Good: "99% of user requests return 200 in 500ms"

7.4 Burn Rate

How fast the Error Budget is burning.

Burn Rate = current error rate / allowed error rate

14.4x burn for 1 hour = 2% of a 30-day budget
6x burn for 6 hours = 5% of a 30-day budget

Multi-window alerts: short-window high burn plus long-window moderate burn — cuts false positives.
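
A sketch of that multi-window check in Python; in practice the two error rates would come from PromQL over a short and a long window, and the 14.4x threshold follows the common 30-day, 1-hour pairing.

SLO = 0.999
ALLOWED_ERROR_RATE = 1 - SLO   # 0.1%

def burn_rate(observed_error_rate: float) -> float:
    return observed_error_rate / ALLOWED_ERROR_RATE

def should_page(short_window_rate: float, long_window_rate: float) -> bool:
    # Page only when both the short and the long window burn fast (cuts false positives)
    return burn_rate(short_window_rate) > 14.4 and burn_rate(long_window_rate) > 14.4

print(should_page(short_window_rate=0.02, long_window_rate=0.016))   # True: burning 16-20x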

7.5 Error Budget Policy

100–50% left: normal release cadence
50–10%: more release testing, prepare feature freeze
< 10%: feature freeze, stability sprint
Burned: postmortem, process review

Part 8 — Choosing an observability platform (2025)

8.1 Commercial vs OSS

  • Commercial (Datadog, New Relic, Dynatrace, Splunk, Honeycomb): fast ROI, vendor support, advanced AI features; cost can explode
  • OSS stack (Grafana Stack, SigNoz, Kibana): flexible, cheap, no lock-in; needs operators to run it

8.2 Situational recommendations

  • Startup under 20 people: Datadog or SigNoz SaaS
  • Growth stage: Grafana Cloud or Honeycomb
  • Mid-size: self-hosted Grafana Stack + Prometheus
  • Enterprise: hybrid (Datadog + OSS + in-house)
  • Cost-obsessed: ClickHouse + Grafana

8.3 Datadog cost traps

  • Custom Metrics: billed per custom metric series per month. Cardinality blow-up = bomb
  • Log Ingestion: expensive per GB. Filter aggressively
  • APM Host: $31/host/month. Many microservices = explosion
  • Fix: pre-filter, sample, and aggregate at the OTEL Collector

8.4 Honeycomb's differentiation

  • High-cardinality first search: free-form by trace_id, user_id
  • BubbleUp: automatic anomaly grouping
  • Philosophy: "raw events first", not "metrics first"

Part 9 — Organizational observability

9.1 Observability-driven development

  • Think about instrumentation from day one
  • Add an "observability impact" checkbox to every PR
  • Ship new features with a draft SLO

9.2 Incident review culture

  • Post-incident reviews ask "why didn't we observe this?"
  • Falling MTTR is a proxy for observability maturity
  • Runbooks: a "this alert → this dashboard → this trace" checklist
  • Blameless postmortems: why didn't the system detect it?

9.3 Observability cost model

total cost = sum(signal size × retention × cost/GB)

levers:
1. Sample (trace 1%, log 10%)
2. Aggregate (rollups, not raw metrics)
3. Tiered retention (ERROR 90d, INFO 7d)
4. Pre-filter (OTEL Collector)
5. Cold/hot tier (archive to S3)

Target: observability cost stays at 5–15% of infra cost.
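
A back-of-the-envelope sketch of the cost formula above; every volume, retention period, and unit price below is a made-up placeholder.

# total cost = sum(signal size x retention x cost/GB), placeholder numbers
signals = {
    #   name      (GB/day, retention_days, $/GB-month)
    "metrics":  (    5,      395,            0.03),
    "logs":     (  200,       30,            0.10),
    "traces":   (   50,       14,            0.05),
    "profiles": (   10,        7,            0.05),
}

monthly_cost = 0.0
for name, (gb_per_day, retention_days, dollars_per_gb_month) in signals.items():
    stored_gb = gb_per_day * retention_days   # steady-state volume kept
    monthly_cost += stored_gb * dollars_per_gb_month

print(f"~${monthly_cost:,.0f}/month")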


Part 10 — LLM observability (new in 2024–2025)

10.1 What to measure

  • Token Usage: prompt and completion tokens
  • Latency: TTFT, total, per-token
  • Cost: USD (model × tokens)
  • Quality: LLM-as-Judge score, user feedback
  • Tool Calls: success, failure, chain depth
  • Safety: moderation result, refuse rate

10.2 Tools

  • Langfuse: OSS, self-hostable
  • LangSmith: LangChain team
  • Helicone: proxy-based
  • Phoenix (Arize): strong OSS
  • OTEL Gen AI Semantic Conventions: standard integration

10.3 LLM trace structure

user_query (root span)
├─ retrieval (vector search)
│   ├─ embedding (token count)
│   └─ qdrant.search (query vector)
├─ llm.openai.chat (tokens, cost)
│   └─ tool_call: get_weather
│       └─ api.weather
└─ response_generation
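
A sketch of how this tree might be emitted with plain OTEL spans in Python; the attribute keys loosely follow the Gen AI semantic conventions but are illustrative rather than the exact standardized names.

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("user_query"):
    with tracer.start_as_current_span("retrieval"):
        with tracer.start_as_current_span("embedding") as span:
            span.set_attribute("tokens", 23)                    # illustrative key
        with tracer.start_as_current_span("qdrant.search"):
            pass
    with tracer.start_as_current_span("llm.openai.chat") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4o")    # illustrative values
        span.set_attribute("gen_ai.usage.input_tokens", 512)
        span.set_attribute("gen_ai.usage.output_tokens", 128)
        span.set_attribute("llm.cost_usd", 0.004)               # custom attribute
        with tracer.start_as_current_span("tool_call:get_weather"):
            with tracer.start_as_current_span("api.weather"):
                pass
    with tracer.start_as_current_span("response_generation"):
        pass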

Part 11 — Six-month observability roadmap

Month 1: foundations

  • Install Prometheus and Grafana
  • Understand the four metric types
  • PromQL basics

Month 2: logging

  • Structured logging standard
  • Run Loki or Elastic
  • Log-level policy

Month 3: tracing

  • Integrate OTEL SDK
  • Deploy Tempo or Jaeger
  • Tail-based sampling

Month 4: SLO

  • Define SLIs for key services
  • Compute Error Budgets
  • Burn Rate alerts

Month 5: eBPF and profile

  • Adopt Pyroscope
  • Cilium Hubble for K8s network
  • Beyla auto-instrumentation

Month 6: cost and org

  • Audit observability cost
  • LLM observability (Langfuse)
  • Post-incident review process

Part 12 — Observability checklist (12 items)

  1. Know the strength of each of the 3 pillars + Profile
  2. Can explain OpenTelemetry Collector architecture
  3. Know the four Prometheus metric types
  4. Know what cardinality explosion is and how to prevent it
  5. Know the difference between Head and Tail-based sampling
  6. Know how W3C Trace Context connects services
  7. Know why eBPF is great for auto-instrumentation
  8. Clearly distinguish SLI, SLO, SLA
  9. Can compute Error Budget and Burn Rate
  10. Know 5 ways to cut log cost
  11. Know 3 Datadog cost traps
  12. Know 5 LLM observability metrics

Part 13 — Ten anti-patterns

  1. Unstructured logs: printf style. No search, no aggregation
  2. Ignoring cardinality: user_id as a label, explosion
  3. 100% trace retention: blows cost. Tail sampling is mandatory
  4. Log-to-metric abuse: expensive. Use Counter and Gauge
  5. Alert fatigue: hundreds a day. Keep only what matters
  6. No SLOs: no shared definition of "problem"
  7. Dashboard jungle: 50 dashboards, 10 used
  8. No trace_id plumbing: logs without trace_id = debugging hell
  9. Uniform retention: ERROR and DEBUG kept the same time = waste
  10. Observability as afterthought: "we'll add it post-launch" = never

Closing — Observability is the system's sense of self

Just as a human who feels no pain cannot care for their body, a system that cannot perceive its own state cannot be operated.

In 2025, observability comes down to:

  • Standards (OpenTelemetry for vendor independence)
  • Integration (stitch four pillars with trace_id)
  • Economics (control cost while keeping signal)
  • Organization (SLOs are team agreements)

Tools churn every year. The Grafana Stack goes Prometheus → Mimir → Grafana Cloud, Elasticsearch gets forked into OpenSearch, Datadog embraces eBPF. But the principles hold.

Remember: "If you cannot observe it, you cannot operate it."


Next up — "Security Complete Guide: Zero Trust, Secrets, OAuth, OIDC, Supply Chain, AI Security"

Season 2 Episode 10 is the other operational must: security engineering. Next time:

  • Zero Trust architecture in practice
  • OAuth 2.1 / OIDC / PKCE
  • Secret management (Vault, AWS KMS, SOPS)
  • SBOM and supply chain defense
  • Container and K8s security (Pod Security, Admission)
  • OWASP Top 10 for LLM
  • Security-oriented observability

Crypto is easy, security is hard. See you in the next episode.