Distributed Tracing Complete Guide 2025: OpenTelemetry, Jaeger, Tempo, Span Analysis, Sampling Strategies
TL;DR
- Distributed tracing = essential for microservice debugging: visualize the entire request flow
- OpenTelemetry is the standard: CNCF graduated, all languages, vendor-neutral
- Three major backends: Jaeger (Uber), Tempo (Grafana), Zipkin (Twitter)
- Span structure: trace_id + span_id + parent_id + attributes + events
- Sampling is key: 100% storage explodes cost, use Head/Tail/Adaptive
- W3C Trace Context: standard headers for cross-service trace propagation
1. Why Distributed Tracing
1.1 Monolith Debugging
request -> [monolith app]
|- Auth check
|- DB query
|- Cache lookup
|- Response
A single stack trace suffices. All code in one process.
1.2 Microservice Debugging Nightmare
request -> [API Gateway]
-> [Auth Service]
-> [User Service]
-> [DB] [Cache]
-> [Email Service]
-> [SMS Service]
Problems:
- Which service is slow?
- Where did the error start?
- Network latency or code issue?
- How do you collect logs for one request?
Distributed tracing is the answer.
1.3 The Promise
Total: 1245ms
|- API Gateway (5ms)
|- Auth Service (50ms)
| |- JWT verify (45ms)
|- User Service (1180ms) WARN
| |- DB query (1100ms) <- bottleneck
| |- Cache lookup (5ms)
|- Response (5ms)
Immediately visible: DB query is 1100ms. Missing index.
2. Core Concepts
2.1 Trace
A Trace is the full flow of a single request, identified by trace_id.
trace_id: abc123...
|- Span A (root)
|- Span B (child of A)
|- Span C (child of B)
|- Span D (child of A)
2.2 Span
A Span is one unit of work in a trace.
Required fields:
- span_id: unique ID
- trace_id: the trace this span belongs to
- parent_span_id: parent span (none means root)
- name: operation name (e.g., HTTP GET /users)
- start_time, end_time
- status: OK / ERROR
- attributes: key-value metadata
Optional: events, links, kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL).
2.3 Span Example
{
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "parent_span_id": "456abc...",
  "name": "GET /api/users/123",
  "start_time": "2025-04-15T10:00:00.000Z",
  "end_time": "2025-04-15T10:00:00.150Z",
  "duration_ms": 150,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "/api/users/123",
    "http.status_code": 200,
    "user.id": "123",
    "db.query.count": 3
  },
  "events": [
    { "name": "cache_miss", "timestamp": "2025-04-15T10:00:00.020Z" }
  ]
}
2.4 Context Propagation
How does trace context travel across services?
W3C Trace Context standard headers:
GET /api/users/123 HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE
traceparent format: version-trace_id-parent_span_id-trace_flags. The next service reads this header, creates a new span, and links it to the same trace_id.
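Purely as an illustration of the format (the OTel SDK's W3C propagator normally parses and injects this header for you), here is what the four fields contain:
// Illustrative only; do not hand-roll this in production code.
function parseTraceparent(header) {
  const parts = header.split('-')
  if (parts.length !== 4) return null
  const [version, traceId, parentSpanId, flags] = parts
  return {
    version,                                      // "00"
    traceId,                                      // 32 hex chars (16 bytes)
    parentSpanId,                                 // 16 hex chars (8 bytes)
    sampled: (parseInt(flags, 16) & 0x01) === 1   // trace_flags bit 0 = sampled
  }
}

console.log(parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'))
// { version: '00', traceId: '0af7651916cd43dd8448eb211c80319c', parentSpanId: 'b7ad6b7169203331', sampled: true }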
3. OpenTelemetry — Standardization Wins
3.1 What is OTel?
OTel is CNCF's observability project unifying traces, metrics, and logs.
History:
- 2016: OpenTracing (Lightstep/Uber) becomes a CNCF project
- 2018: Google open-sources OpenCensus
- 2019: The two projects merge into OpenTelemetry
- 2021: CNCF Incubating
- 2024: Trace/Metric GA (Stable)
3.2 Why OTel Won
- Vendor-neutral: code uses only OTel API; backend can be Jaeger, Tempo, Datadog, New Relic, etc.
- All languages: Go, Java, Python, JS, C#, Ruby, PHP, Rust, Swift...
- Auto-instrumentation: the Java agent, Python's opentelemetry-instrument wrapper, etc. trace without code changes.
- Single standard: previously each vendor had its own SDK; now OTel is universal.
3.3 Architecture
[Application]
| (OTel SDK)
[Spans/Metrics/Logs]
| (OTLP)
[OpenTelemetry Collector]
| (export)
[Jaeger / Tempo / Datadog / ...]
OTel Collector receives, transforms, and exports data to multiple backends.
3.4 Node.js Example
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
const sdk = new NodeSDK({
  serviceName: 'my-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
})
sdk.start()
// HTTP, Express, MongoDB now auto-traced
3.5 Manual Span
const { trace, SpanStatusCode } = require('@opentelemetry/api')
const tracer = trace.getTracer('my-service')

async function processOrder(orderId) {
  const span = tracer.startSpan('process_order')
  span.setAttribute('order.id', orderId)
  try {
    await chargePayment(orderId)
    span.addEvent('payment_charged')
    await updateInventory(orderId)
    span.addEvent('inventory_updated')
    span.setStatus({ code: SpanStatusCode.OK })
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
    throw error
  } finally {
    span.end()
  }
}
4. Backends — Jaeger vs Tempo vs Zipkin
4.1 Jaeger (Uber)
- Started at Uber (2017), CNCF graduated, Go
- Cassandra/Elasticsearch backend
- Powerful UI (search, compare, dependency graph)
- Mature, wide adoption, OTel compatible
- Downsides: storage cost (Elasticsearch), complex setup
4.2 Tempo (Grafana)
- Grafana Labs (2020)
- Object storage (S3, GCS) — very cheap
- No index; fetch by trace_id only
- Tight Grafana integration
- Downsides: richer search is harder (TraceQL improving), newer
TraceQL:
{ resource.service.name = "checkout" && duration > 1s }
4.3 Zipkin
- Twitter (2012), Java, MySQL/Cassandra/Elasticsearch
- Oldest and most stable, simple
- Feature development slower; ecosystem migrating to Jaeger/Tempo
4.4 Comparison
| | Jaeger | Tempo | Zipkin |
|---|---|---|---|
| Origin | Uber | Grafana | Twitter |
| Language | Go | Go | Java |
| Storage | Cassandra/ES | Object storage | MySQL/ES |
| Cost | High | Low | Medium |
| Ops | Complex | Simple | Simple |
| UI | Own + Grafana | Grafana | Own |
| Search | Strong | TraceQL (improving) | Medium |
| OTel | Yes | Yes | Yes |
4.5 Managed Cloud
| Service | Price | Notes |
|---|---|---|
| Datadog APM | High | Powerful, commercial standard |
| New Relic | High | Full-stack |
| Honeycomb | High | High-cardinality analytics |
| Lightstep | High | Change analysis |
| Grafana Cloud | Reasonable | Managed Tempo |
| AWS X-Ray | Medium | AWS-integrated |
5. Sampling Strategies
5.1 Why Sample?
Cost of 100% storage:
- 10k req/sec x 86,400 sec/day x 30 days ≈ 26B traces
- 5KB/trace avg = 130 TB/month
- Storage + processing blows up
Solution: sample only a portion.
5.2 Head Sampling
Decide at request start.
# 10% sampling
if random.random() < 0.1:
    span = tracer.start_span(...)
Pros: simple, fast, cheap. Cons: drops 90% of error traces.
5.3 Probabilistic Sampling
Consistent ratio via trace_id hash — same trace_id yields the same decision across services.
processors:
  probabilistic_sampler:
    sampling_percentage: 10
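The same ratio can also be applied in the SDK itself. A minimal sketch for Node.js, assuming a recent @opentelemetry/sdk-node that accepts a sampler option (TraceIdRatioBasedSampler and ParentBasedSampler come from @opentelemetry/sdk-trace-base):
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base')

const sdk = new NodeSDK({
  serviceName: 'my-service',
  // Root spans: keep ~10%, decided from a hash of trace_id.
  // Child spans: follow the parent's decision, so a trace is kept or dropped as a whole.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1)
  })
})
sdk.start()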
5.4 Tail Sampling
Decide after the request completes.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
Pros: 100% of errors/slow traces kept, only 1% of normal. Cons: buffer all spans in memory, complex.
5.5 Adaptive Sampling
Auto-adjust by traffic: low traffic -> high rate; high traffic -> low rate. Datadog and Honeycomb support this automatically.
5.6 Best Practices
normal traffic: 1%
slow traces (>1s): 100%
errors: 100%
new endpoint: 100% (1 week)
critical endpoint (/checkout): 10%
Implement via tail sampling.
6. Auto-instrumentation
6.1 Java
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=my-app \
-Dotel.exporter.otlp.endpoint=http://collector:4318 \
-jar my-app.jar
Auto-traces HTTP (Servlet, Spring MVC), DB (JDBC, Hibernate), messaging (Kafka, RabbitMQ), gRPC, 100+ libraries.
6.2 Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
opentelemetry-instrument python my_app.py
Auto-traces Flask, Django, FastAPI, requests, SQLAlchemy, etc.
6.3 Node.js
node --require @opentelemetry/auto-instrumentations-node/register my-app.js
6.4 Go
Go's compiled nature makes auto-instrumentation hard; explicit code required:
import (
    "context"

    "go.opentelemetry.io/otel"
)

// Package-level tracer named after the instrumentation scope.
var tracer = otel.Tracer("my-service")

func handleRequest(ctx context.Context) {
    ctx, span := tracer.Start(ctx, "handle_request")
    defer span.End()

    // ... do the work, passing ctx down so child spans attach to this one
    _ = ctx
}
Newer: eBPF-based auto-instrumentation (Grafana Beyla).
7. Trace Analysis
7.1 What to look at
- Critical path: longest path — which span dominates? Parallelizable?
- Error spans: red spans — where did it start? Error message?
- External calls: HTTP/DB/gRPC — slowest call, retry patterns.
- Timing deltas: average vs current, pre/post deploy.
7.2 Common Patterns
N+1 Query
parent_span (200ms)
|- db_query (5ms) <- user info
|- db_query (5ms) <- user 1 posts
|- db_query (5ms) <- user 2 posts
... (47 more)
Fix with JOIN or batch query.
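A minimal sketch of the batched fix, with db.query standing in for a Postgres-style client (query text plus parameter array):
// Hypothetical fix: one batched query instead of one query per user,
// so the trace shows a single db_query span.
async function getPostsForUsers(db, userIds) {
  const { rows } = await db.query(
    'SELECT * FROM posts WHERE user_id = ANY($1)',
    [userIds]
  )
  return rows
}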
Serial vs Parallel
parent_span (300ms)
|- call_service_a (100ms)
|- call_service_b (100ms) <- serial
|- call_service_c (100ms)
Use Promise.all — 100ms.
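A sketch of the parallel version; callServiceA/B/C are placeholders for the calls shown in the trace above:
// The three calls are independent, so run them concurrently.
async function handle() {
  const [a, b, c] = await Promise.all([
    callServiceA(),   // ~100ms
    callServiceB(),   // ~100ms, now overlapping
    callServiceC()    // ~100ms, now overlapping
  ])
  return { a, b, c }  // parent span: ~100ms instead of 300ms
}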
Cache Miss
parent_span (200ms)
|- cache_check (5ms) <- miss
|- db_query (190ms)
Analyze hit ratio.
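To make the hit ratio measurable from traces, record the cache outcome on the span. A sketch with placeholder cache and db clients:
// Record the cache outcome as a span attribute so hit ratio can be
// computed from trace data later.
async function getUser(span, cache, db, userId) {
  const cached = await cache.get(userId)
  span.setAttribute('cache.hit', cached !== null)
  if (cached !== null) return cached
  span.addEvent('cache_miss')
  return db.query('SELECT * FROM users WHERE id = $1', [userId])
}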
7.3 RED + Tracing
Rate, Errors, Duration metrics derived from traces:
processors:
  spanmetrics:
    metrics_exporter: prometheus
Auto-generates Prometheus metrics.
8. Cost Optimization
8.1 Cost Drivers
- Data volume (traces/sec x spans/trace)
- Retention (7/30/90 days)
- Index (searchable fields)
- Network egress (cross-region)
8.2 Reduction Strategies
- Aggressive sampling — normal 0.1%, error/slow 100%
- Reduce cardinality — avoid unique-per-request attributes like user_id as metric dimensions
- Object storage (Tempo) — 90% cheaper than Elasticsearch
- Local processing — extract metrics at the Collector
- Provider comparison — at 10B spans/month, a managed APM bill (e.g., Datadog) and self-hosted Tempo on object storage can differ by orders of magnitude; compare before committing
9. Real-world — Debugging Microservices
9.1 Scenario
User complaint: "Orders occasionally take 30 seconds."
9.2 Search
{ resource.service.name = "checkout" && duration > 5s }
Found 100 traces.
9.3 Pattern
All show slowness in payment service; payment.gateway.call span runs 15-25s.
9.4 Drill Down
checkout (28s)
|- validate (10ms)
|- inventory (50ms)
|- payment (27s)
|- db_save (10ms)
|- stripe_api_call (26.9s) <- !!
|- http_retry (3 attempts, 9s each)
Stripe calls timing out; three retries each.
9.5 Investigate
"http.url": "https://api.stripe.com/v1/charges",
"http.status_code": 0,
"http.error": "EAI_AGAIN"
DNS issue — checkout service's /etc/resolv.conf is broken.
9.6 Fix
Fix DNS, redeploy:
checkout (1.2s) OK
|- payment (800ms)
|- stripe_api_call (750ms)
Time savings: days of debugging shrink to minutes with tracing.
10. Best Practices
10.1 Good Span Names
Bad: db_query / http_request
Good: SELECT users by id / GET /api/users/{id}
Control cardinality — variables as attributes, names as patterns.
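A small sketch of the pattern; tracer and userId are assumed from the surrounding handler, and http.route / http.target follow the OTel semantic conventions:
const span = tracer.startSpan('GET /api/users/{id}')        // low-cardinality pattern as the name
span.setAttribute('http.route', '/api/users/{id}')          // the route template
span.setAttribute('http.target', `/api/users/${userId}`)    // the concrete, high-cardinality value
span.end()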
10.2 Meaningful Attributes
span.set_attribute("user.id", user_id)
span.set_attribute("user.tier", "premium")
span.set_attribute("db.statement", query)
Follow OTel Semantic Conventions: http.method, db.system, messaging.system, etc.
10.3 Error Recording
from opentelemetry.trace import Status, StatusCode

try:
    do_something()
except Exception as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, str(e)))
    raise
10.4 Avoid Excess Spans
Span per tiny function becomes noise — use meaningful work units.
10.5 Security
Never put password, credit_card, or api_key into attributes — mask or use IDs.
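One way to enforce this is a small helper that redacts known sensitive keys before they reach the span; setSafeAttributes and the SENSITIVE list below are hypothetical, not part of any SDK:
// Hypothetical helper: redact sensitive keys before they become span attributes.
const SENSITIVE = ['password', 'credit_card', 'api_key', 'authorization', 'token']

function setSafeAttributes(span, attrs) {
  for (const [key, value] of Object.entries(attrs)) {
    const redact = SENSITIVE.some(s => key.toLowerCase().includes(s))
    span.setAttribute(key, redact ? '[REDACTED]' : value)
  }
}

// setSafeAttributes(span, { 'user.id': '123', 'payment.api_key': 'sk_live_xxx' })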
10.6 Correlate Traces + Logs + Metrics
Join with shared trace_id:
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

current_span = trace.get_current_span()
ctx = current_span.get_span_context()
logger.info("Order processed", extra={
    "trace_id": format(ctx.trace_id, "032x"),
    "span_id": format(ctx.span_id, "016x"),
    "order_id": order_id
})
Search logs by trace_id for full flow visualization.
Quiz
1. Why did OpenTelemetry become the standard?
A: (1) Vendor-neutral (code uses OTel API only, any backend works), (2) all major languages, (3) auto-instrumentation (Java agent, Python's opentelemetry-instrument), (4) CNCF graduated, (5) merger of OpenCensus and OpenTracing. Result: every observability tool (Jaeger, Tempo, Datadog, New Relic) is OTel-compatible.
2. Head vs Tail Sampling?
A: Head: decide at request start (e.g., 10% probability). Simple but drops 90% of error traces. Tail: decide after completion, allowing policies like "100% errors, 100% slow, 1% normal". Far more useful for debugging but requires buffering all spans. Tail sampling is standard at large scale; head is fine for simple environments.
3. Why is Tempo cheaper than Jaeger?
A: Object storage. Jaeger uses Cassandra or Elasticsearch (search index required, expensive). Tempo stores traces on S3/GCS with no index — fetch by trace_id only. Weaker search but 90% cheaper storage. TraceQL is closing the gap. At large volumes (e.g., 10B spans/month) the gap between a managed APM bill and object storage is dramatic.
4. What is W3C Trace Context?
A: Standard HTTP header for cross-service trace propagation. traceparent carries version-trace_id-parent_span_id-trace_flags, e.g. traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01. Previously each vendor used its own header (X-B3-TraceId, X-Datadog-Trace-Id); now unified. Supported by OTel, Jaeger, Tempo, Datadog.
5. Core value of distributed tracing?
A: Visualize the full flow of one request across microservices. A monolith needs only a stack trace, but in microservices it's hard to know which service is slow or where an error began. Tracing gives: (1) immediate bottleneck discovery (DB 1100ms), (2) error origin, (3) service dependency mapping, (4) optimization priorities. Days-long debugging becomes minutes.
References
- OpenTelemetry
- W3C Trace Context
- Jaeger
- Grafana Tempo
- Zipkin
- OTel Semantic Conventions
- Distributed Systems Observability — Cindy Sridharan
- Mastering Distributed Tracing — Yuri Shkuro
- Honeycomb Observability — Charity Majors
- SigNoz — open-source OTel backend
- Beyla — eBPF auto-instrumentation