Distributed Tracing Complete Guide 2025: OpenTelemetry, Jaeger, Tempo, Span Analysis, Sampling Strategies
TL;DR
- Distributed tracing = essential for microservice debugging: visualize the entire request flow
- OpenTelemetry is the standard: CNCF graduated, all languages, vendor-neutral
- Three major backends: Jaeger (Uber), Tempo (Grafana), Zipkin (Twitter)
- Span structure: trace_id + span_id + parent_id + attributes + events
- Sampling is key: 100% storage explodes cost, use Head/Tail/Adaptive
- W3C Trace Context: standard headers for cross-service trace propagation
1. Why Distributed Tracing
1.1 Monolith Debugging
request -> [monolith app]
|- Auth check
|- DB query
|- Cache lookup
|- Response
A single stack trace suffices. All code in one process.
1.2 Microservice Debugging Nightmare
request -> [API Gateway]
-> [Auth Service]
-> [User Service]
-> [DB] [Cache]
-> [Email Service]
-> [SMS Service]
Problems:
- Which service is slow?
- Where did the error start?
- Network latency or code issue?
- How do you collect logs for one request?
Distributed tracing is the answer.
1.3 The Promise
Total: 1245ms
|- API Gateway (5ms)
|- Auth Service (50ms)
| |- JWT verify (45ms)
|- User Service (1180ms) WARN
| |- DB query (1100ms) <- bottleneck
| |- Cache lookup (5ms)
|- Response (5ms)
Immediately visible: DB query is 1100ms. Missing index.
2. Core Concepts
2.1 Trace
A Trace is the full flow of a single request, identified by trace_id.
trace_id: abc123...
|- Span A (root)
|- Span B (child of A)
|- Span C (child of B)
|- Span D (child of A)
2.2 Span
A Span is one unit of work in a trace.
Required fields:
- span_id: unique ID
- trace_id: the trace this span belongs to
- parent_span_id: parent span (none means root)
- name: operation name (e.g., HTTP GET /users)
- start_time, end_time
- status: OK / ERROR
- attributes: key-value metadata
Optional: events, links, kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL).
2.3 Span Example
{
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "parent_span_id": "456abc...",
  "name": "GET /api/users/123",
  "start_time": "2025-04-15T10:00:00.000Z",
  "end_time": "2025-04-15T10:00:00.150Z",
  "duration_ms": 150,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "/api/users/123",
    "http.status_code": 200,
    "user.id": "123",
    "db.query.count": 3
  },
  "events": [
    { "name": "cache_miss", "timestamp": "2025-04-15T10:00:00.020Z" }
  ]
}
2.4 Context Propagation
How does trace context travel across services?
W3C Trace Context standard headers:
GET /api/users/123 HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE
traceparent format: version-trace_id-parent_span_id-trace_flags. The next service reads this header, creates a new span, and links it to the same trace_id.
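Purely as an illustration of the format (the OTel SDK's W3C propagator normally parses and injects this header for you), here is what the four fields contain:
// Illustrative only; do not hand-roll this in production code.
function parseTraceparent(header) {
  const parts = header.split('-')
  if (parts.length !== 4) return null
  const [version, traceId, parentSpanId, flags] = parts
  return {
    version,                                      // "00"
    traceId,                                      // 32 hex chars (16 bytes)
    parentSpanId,                                 // 16 hex chars (8 bytes)
    sampled: (parseInt(flags, 16) & 0x01) === 1   // trace_flags bit 0 = sampled
  }
}

console.log(parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'))
// { version: '00', traceId: '0af7651916cd43dd8448eb211c80319c', parentSpanId: 'b7ad6b7169203331', sampled: true }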
3. OpenTelemetry — Standardization Wins
3.1 What is OTel?
OTel is CNCF's observability project unifying traces, metrics, and logs.
History:
- 2016: OpenTracing (Lightstep/Uber) becomes a CNCF project
- 2018: Google open-sources OpenCensus
- 2019: The two projects merge into OpenTelemetry
- 2021: CNCF Incubating
- 2024: Trace/Metric GA (Stable)
3.2 Why OTel Won
- Vendor-neutral: code uses only OTel API; backend can be Jaeger, Tempo, Datadog, New Relic, etc.
- All languages: Go, Java, Python, JS, C#, Ruby, PHP, Rust, Swift...
- Auto-instrumentation: the Java agent, Python's opentelemetry-instrument wrapper, etc. trace without code changes.
- Single standard: previously each vendor had its own SDK; now OTel is universal.
3.3 Architecture
[Application]
| (OTel SDK)
[Spans/Metrics/Logs]
| (OTLP)
[OpenTelemetry Collector]
| (export)
[Jaeger / Tempo / Datadog / ...]
OTel Collector receives, transforms, and exports data to multiple backends.
3.4 Node.js Example
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
const sdk = new NodeSDK({
  serviceName: 'my-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
})
sdk.start()
// HTTP, Express, MongoDB now auto-traced
3.5 Manual Span
const { trace, SpanStatusCode } = require('@opentelemetry/api')
const tracer = trace.getTracer('my-service')

async function processOrder(orderId) {
  const span = tracer.startSpan('process_order')
  span.setAttribute('order.id', orderId)
  try {
    await chargePayment(orderId)
    span.addEvent('payment_charged')
    await updateInventory(orderId)
    span.addEvent('inventory_updated')
    span.setStatus({ code: SpanStatusCode.OK })
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
    throw error
  } finally {
    span.end()
  }
}
4. Backends — Jaeger vs Tempo vs Zipkin
4.1 Jaeger (Uber)
- Started at Uber (2017), CNCF graduated, Go
- Cassandra/Elasticsearch backend
- Powerful UI (search, compare, dependency graph)
- Mature, wide adoption, OTel compatible
- Downsides: storage cost (Elasticsearch), complex setup
4.2 Tempo (Grafana)
- Grafana Labs (2020)
- Object storage (S3, GCS) — very cheap
- No index; fetch by trace_id only
- Tight Grafana integration
- Downsides: richer search is harder (TraceQL improving), newer
TraceQL:
{ resource.service.name = "checkout" && duration > 1s }
4.3 Zipkin
- Twitter (2012), Java, MySQL/Cassandra/Elasticsearch
- Oldest and most stable, simple
- Feature development slower; ecosystem migrating to Jaeger/Tempo
4.4 Comparison
| | Jaeger | Tempo | Zipkin |
|---|---|---|---|
| Origin | Uber | Grafana | Twitter |
| Language | Go | Go | Java |
| Storage | Cassandra/ES | Object storage | MySQL/ES |
| Cost | High | Low | Medium |
| Ops | Complex | Simple | Simple |
| UI | Own + Grafana | Grafana | Own |
| Search | Strong | TraceQL (improving) | Medium |
| OTel | Yes | Yes | Yes |
4.5 Managed Cloud
| Service | Price | Notes |
|---|---|---|
| Datadog APM | High | Powerful, commercial standard |
| New Relic | High | Full-stack |
| Honeycomb | High | High-cardinality analytics |
| Lightstep | High | Change analysis |
| Grafana Cloud | Reasonable | Managed Tempo |
| AWS X-Ray | Medium | AWS-integrated |
5. Sampling Strategies
5.1 Why Sample?
Cost of 100% storage:
- 10k req/sec x 86,400 sec/day x 30 days ≈ 26B traces
- 5KB/trace avg = 130 TB/month
- Storage + processing blows up
Solution: sample only a portion.
5.2 Head Sampling
Decide at request start.
# 10% sampling
if random.random() < 0.1:
    span = tracer.start_span(...)
Pros: simple, fast, cheap. Cons: drops 90% of error traces.
5.3 Probabilistic Sampling
Consistent ratio via trace_id hash — same trace_id yields the same decision across services.
processors:
  probabilistic_sampler:
    sampling_percentage: 10
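The same ratio can also be applied in the SDK itself. A minimal sketch for Node.js, assuming a recent @opentelemetry/sdk-node that accepts a sampler option (TraceIdRatioBasedSampler and ParentBasedSampler come from @opentelemetry/sdk-trace-base):
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base')

const sdk = new NodeSDK({
  serviceName: 'my-service',
  // Root spans: keep ~10%, decided from a hash of trace_id.
  // Child spans: follow the parent's decision, so a trace is kept or dropped as a whole.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1)
  })
})
sdk.start()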
5.4 Tail Sampling
Decide after the request completes.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
Pros: 100% of errors/slow traces kept, only 1% of normal. Cons: buffer all spans in memory, complex.
5.5 Adaptive Sampling
Auto-adjust by traffic: low traffic -> high rate; high traffic -> low rate. Datadog and Honeycomb support this automatically.
5.6 Best Practices
normal traffic: 1%
slow traces (>1s): 100%
errors: 100%
new endpoint: 100% (1 week)
critical endpoint (/checkout): 10%
Implement via tail sampling.
6. Auto-instrumentation
6.1 Java
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=my-app \
-Dotel.exporter.otlp.endpoint=http://collector:4318 \
-jar my-app.jar
Auto-traces HTTP (Servlet, Spring MVC), DB (JDBC, Hibernate), messaging (Kafka, RabbitMQ), gRPC, 100+ libraries.
6.2 Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
opentelemetry-instrument python my_app.py
Auto-traces Flask, Django, FastAPI, requests, SQLAlchemy, etc.
6.3 Node.js
node --require @opentelemetry/auto-instrumentations-node/register my-app.js
6.4 Go
Go's compiled nature makes auto-instrumentation hard; explicit code required:
import (
    "context"

    "go.opentelemetry.io/otel"
)

// Package-level tracer named after the instrumentation scope.
var tracer = otel.Tracer("my-service")

func handleRequest(ctx context.Context) {
    ctx, span := tracer.Start(ctx, "handle_request")
    defer span.End()

    // ... do the work, passing ctx down so child spans attach to this one
    _ = ctx
}
Newer: eBPF-based auto-instrumentation (Grafana Beyla).
7. Trace Analysis
7.1 What to look at
- Critical path: longest path — which span dominates? Parallelizable?
- Error spans: red spans — where did it start? Error message?
- External calls: HTTP/DB/gRPC — slowest call, retry patterns.
- Timing deltas: average vs current, pre/post deploy.
7.2 Common Patterns
N+1 Query
parent_span (200ms)
|- db_query (5ms) <- user info
|- db_query (5ms) <- user 1 posts
|- db_query (5ms) <- user 2 posts
... (47 more)
Fix with JOIN or batch query.
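A minimal sketch of the batched fix, with db.query standing in for a Postgres-style client (query text plus parameter array):
// Hypothetical fix: one batched query instead of one query per user,
// so the trace shows a single db_query span.
async function getPostsForUsers(db, userIds) {
  const { rows } = await db.query(
    'SELECT * FROM posts WHERE user_id = ANY($1)',
    [userIds]
  )
  return rows
}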
Serial vs Parallel
parent_span (300ms)
|- call_service_a (100ms)
|- call_service_b (100ms) <- serial
|- call_service_c (100ms)
Use Promise.all — 100ms.
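A sketch of the parallel version; callServiceA/B/C are placeholders for the calls shown in the trace above:
// The three calls are independent, so run them concurrently.
async function handle() {
  const [a, b, c] = await Promise.all([
    callServiceA(),   // ~100ms
    callServiceB(),   // ~100ms, now overlapping
    callServiceC()    // ~100ms, now overlapping
  ])
  return { a, b, c }  // parent span: ~100ms instead of 300ms
}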
Cache Miss
parent_span (200ms)
|- cache_check (5ms) <- miss
|- db_query (190ms)
Analyze hit ratio.
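To make the hit ratio measurable from traces, record the cache outcome on the span. A sketch with placeholder cache and db clients:
// Record the cache outcome as a span attribute so hit ratio can be
// computed from trace data later.
async function getUser(span, cache, db, userId) {
  const cached = await cache.get(userId)
  span.setAttribute('cache.hit', cached !== null)
  if (cached !== null) return cached
  span.addEvent('cache_miss')
  return db.query('SELECT * FROM users WHERE id = $1', [userId])
}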
7.3 RED + Tracing
Rate, Errors, Duration metrics derived from traces:
processors:
  spanmetrics:
    metrics_exporter: prometheus
Auto-generates Prometheus metrics.
8. Cost Optimization
8.1 Cost Drivers
- Data volume (traces/sec x spans/trace)
- Retention (7/30/90 days)
- Index (searchable fields)
- Network egress (cross-region)
8.2 Reduction Strategies
- Aggressive sampling — normal 0.1%, error/slow 100%
- Reduce cardinality — avoid unique-per-request attributes like user_id as metric dimensions
- Object storage (Tempo) — 90% cheaper than Elasticsearch
- Local processing — extract metrics at the Collector
- Provider comparison — at 10B spans/month, a managed APM bill (e.g., Datadog) and self-hosted Tempo on object storage can differ by orders of magnitude; compare before committing
9. Real-world — Debugging Microservices
9.1 Scenario
User complaint: "Orders occasionally take 30 seconds."
9.2 Search
{ resource.service.name = "checkout" && duration > 5s }
Found 100 traces.
9.3 Pattern
All show slowness in payment service; payment.gateway.call span runs 15-25s.
9.4 Drill Down
checkout (28s)
|- validate (10ms)
|- inventory (50ms)
|- payment (27s)
|- db_save (10ms)
|- stripe_api_call (26.9s) <- !!
|- http_retry (3 attempts, 9s each)
Stripe calls timing out; three retries each.
9.5 Investigate
"http.url": "https://api.stripe.com/v1/charges",
"http.status_code": 0,
"http.error": "EAI_AGAIN"
DNS issue — checkout service's /etc/resolv.conf is broken.
9.6 Fix
Fix DNS, redeploy:
checkout (1.2s) OK
|- payment (800ms)
|- stripe_api_call (750ms)
Time savings: days of debugging shrink to minutes with tracing.
10. Best Practices
10.1 Good Span Names
Bad: db_query / http_request
Good: SELECT users by id / GET /api/users/{id}
Control cardinality — variables as attributes, names as patterns.
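A small sketch of the pattern; tracer and userId are assumed from the surrounding handler, and http.route / http.target follow the OTel semantic conventions:
const span = tracer.startSpan('GET /api/users/{id}')        // low-cardinality pattern as the name
span.setAttribute('http.route', '/api/users/{id}')          // the route template
span.setAttribute('http.target', `/api/users/${userId}`)    // the concrete, high-cardinality value
span.end()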
10.2 Meaningful Attributes
span.set_attribute("user.id", user_id)
span.set_attribute("user.tier", "premium")
span.set_attribute("db.statement", query)
Follow OTel Semantic Conventions: http.method, db.system, messaging.system, etc.
10.3 Error Recording
from opentelemetry.trace import Status, StatusCode

try:
    do_something()
except Exception as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, str(e)))
    raise
10.4 Avoid Excess Spans
Span per tiny function becomes noise — use meaningful work units.
10.5 Security
Never put password, credit_card, or api_key into attributes — mask or use IDs.
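One way to enforce this is a small helper that redacts known sensitive keys before they reach the span; setSafeAttributes and the SENSITIVE list below are hypothetical, not part of any SDK:
// Hypothetical helper: redact sensitive keys before they become span attributes.
const SENSITIVE = ['password', 'credit_card', 'api_key', 'authorization', 'token']

function setSafeAttributes(span, attrs) {
  for (const [key, value] of Object.entries(attrs)) {
    const redact = SENSITIVE.some(s => key.toLowerCase().includes(s))
    span.setAttribute(key, redact ? '[REDACTED]' : value)
  }
}

// setSafeAttributes(span, { 'user.id': '123', 'payment.api_key': 'sk_live_xxx' })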
10.6 Correlate Traces + Logs + Metrics
Join with shared trace_id:
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

current_span = trace.get_current_span()
ctx = current_span.get_span_context()
logger.info("Order processed", extra={
    "trace_id": format(ctx.trace_id, "032x"),
    "span_id": format(ctx.span_id, "016x"),
    "order_id": order_id
})
Search logs by trace_id for full flow visualization.
Quiz
1. Why did OpenTelemetry become the standard?
A: (1) Vendor-neutral (code uses OTel API only, any backend works), (2) all major languages, (3) auto-instrumentation (Java agent, Python's opentelemetry-instrument), (4) CNCF graduated, (5) merger of OpenCensus and OpenTracing. Result: every observability tool (Jaeger, Tempo, Datadog, New Relic) is OTel-compatible.
2. Head vs Tail Sampling?
A: Head: decide at request start (e.g., 10% probability). Simple but drops 90% of error traces. Tail: decide after completion, allowing policies like "100% errors, 100% slow, 1% normal". Far more useful for debugging but requires buffering all spans. Tail sampling is standard at large scale; head is fine for simple environments.
3. Why is Tempo cheaper than Jaeger?
A: Object storage. Jaeger uses Cassandra or Elasticsearch (search index required, expensive). Tempo stores traces on S3/GCS with no index — fetch by trace_id only. Weaker search but 90% cheaper storage. TraceQL is closing the gap. At large volumes (e.g., 10B spans/month) the gap between a managed APM bill and object storage is dramatic.
4. What is W3C Trace Context?
A: Standard HTTP header for cross-service trace propagation. traceparent carries version-trace_id-parent_span_id-trace_flags, e.g. traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01. Previously each vendor used its own header (X-B3-TraceId, X-Datadog-Trace-Id); now unified. Supported by OTel, Jaeger, Tempo, Datadog.
5. Core value of distributed tracing?
A: Visualize the full flow of one request across microservices. A monolith needs only a stack trace, but in microservices it's hard to know which service is slow or where an error began. Tracing gives: (1) immediate bottleneck discovery (DB 1100ms), (2) error origin, (3) service dependency mapping, (4) optimization priorities. Days-long debugging becomes minutes.
References
- OpenTelemetry
- W3C Trace Context
- Jaeger
- Grafana Tempo
- Zipkin
- OTel Semantic Conventions
- Distributed Systems Observability — Cindy Sridharan
- Mastering Distributed Tracing — Yuri Shkuro
- Honeycomb Observability — Charity Majors
- SigNoz — open-source OTel backend
- Beyla — eBPF auto-instrumentation