Distributed Tracing Complete Guide 2025: OpenTelemetry, Jaeger, Tempo, Span Analysis, Sampling Strategies

TL;DR

  • Distributed tracing = essential for microservice debugging: visualize the entire request flow
  • OpenTelemetry is the standard: CNCF incubating project, all languages, vendor-neutral
  • Three major backends: Jaeger (Uber), Tempo (Grafana), Zipkin (Twitter)
  • Span structure: trace_id + span_id + parent_id + attributes + events
  • Sampling is key: storing 100% of traces explodes cost; use Head/Tail/Adaptive
  • W3C Trace Context: standard headers for cross-service trace propagation

1. Why Distributed Tracing

1.1 Monolith Debugging

request -> [monolith app]
        |- Auth check
        |- DB query
        |- Cache lookup
        |- Response

A single stack trace suffices. All code in one process.

1.2 Microservice Debugging Nightmare

request -> [API Gateway]
        -> [Auth Service]
        -> [User Service]
           -> [DB]  [Cache]
                     -> [Email Service]
                        -> [SMS Service]

Problems:

  • Which service is slow?
  • Where did the error start?
  • Network latency or code issue?
  • How do you collect logs for one request?

Distributed tracing is the answer.

1.3 The Promise

Total: 1245ms
|- API Gateway (5ms)
|- Auth Service (50ms)
|  |- JWT verify (45ms)
|- User Service (1180ms) WARN
|  |- DB query (1100ms) <- bottleneck
|  |- Cache lookup (5ms)
|- Response (5ms)

Immediately visible: the DB query takes 1100ms, most likely a missing index.


2. Core Concepts

2.1 Trace

A Trace is the full flow of a single request, identified by trace_id.

trace_id: abc123...
|- Span A (root)
|- Span B (child of A)
|- Span C (child of B)
|- Span D (child of A)

2.2 Span

A Span is one unit of work in a trace.

Required fields:

  • span_id: unique ID
  • trace_id: parent trace
  • parent_span_id: parent span (none means root)
  • name: operation name (e.g., HTTP GET /users)
  • start_time, end_time
  • status: OK / ERROR
  • attributes: key-value metadata

Optional: events, links, kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL).
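
For illustration, here is a minimal Python span that sets the optional kind and records an event (assuming an OTel SDK is already configured):

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("user-service")

# A CLIENT span for an outgoing call, with an attribute and one event
with tracer.start_as_current_span("GET /api/users/{id}", kind=SpanKind.CLIENT) as span:
    span.set_attribute("http.method", "GET")
    span.add_event("cache_miss")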

2.3 Span Example

{
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "parent_span_id": "456abc...",
  "name": "GET /api/users/123",
  "start_time": "2025-04-15T10:00:00.000Z",
  "end_time": "2025-04-15T10:00:00.150Z",
  "duration_ms": 150,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "/api/users/123",
    "http.status_code": 200,
    "user.id": "123",
    "db.query.count": 3
  },
  "events": [
    { "name": "cache_miss", "timestamp": "2025-04-15T10:00:00.020Z" }
  ]
}

2.4 Context Propagation

How does trace context travel across services?

W3C Trace Context standard headers:

GET /api/users/123 HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE

traceparent format: version-trace_id-parent_span_id-trace_flags. The next service reads this header, creates a new span, and links it to the same trace_id.
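
A minimal Python sketch of manual propagation, assuming the default W3C propagator and the requests library (the service URL and the handle function are made up; HTTP auto-instrumentation normally does this for you):

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("api-gateway")

# Client side: inject traceparent/tracestate into the outgoing headers
with tracer.start_as_current_span("call_user_service"):
    headers = {}
    inject(headers)  # adds the W3C traceparent header for the current span
    requests.get("http://user-service/api/users/123", headers=headers)

# Server side: extract the incoming context and continue the same trace
def handle(request_headers):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("GET /api/users/{id}", context=ctx):
        pass  # this span shares the caller's trace_id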


3. OpenTelemetry — Standardization Wins

3.1 What is OTel?

OTel is CNCF's observability project unifying traces, metrics, and logs.

History:

  • 2016: Uber/Lightstep release OpenTracing
  • 2018: Google releases OpenCensus
  • 2019: The two projects merge into OpenTelemetry
  • 2021: CNCF Incubating; tracing spec reaches 1.0 (stable)
  • 2022+: Metrics and Logs follow to stable

3.2 Why OTel Won

  1. Vendor-neutral: code uses only OTel API; backend can be Jaeger, Tempo, Datadog, New Relic, etc.
  2. All languages: Go, Java, Python, JS, C#, Ruby, PHP, Rust, Swift...
  3. Auto-instrumentation: the Java agent, Python's opentelemetry-instrument wrapper, etc. add traces without code changes.
  4. Single standard: previously each vendor had its own SDK; now OTel is universal.

3.3 Architecture

[Application]
   | (OTel SDK)
[Spans/Metrics/Logs]
   | (OTLP)
[OpenTelemetry Collector]
   | (export)
[Jaeger / Tempo / Datadog / ...]

OTel Collector receives, transforms, and exports data to multiple backends.

3.4 Node.js Example

const { NodeSDK } = require('@opentelemetry/sdk-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')

const sdk = new NodeSDK({
  serviceName: 'my-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
})

sdk.start()
// HTTP, Express, MongoDB now auto-traced

3.5 Manual Span

const { trace, SpanStatusCode } = require('@opentelemetry/api')
const tracer = trace.getTracer('my-service')

async function processOrder(orderId) {
  const span = tracer.startSpan('process_order')
  span.setAttribute('order.id', orderId)
  try {
    await chargePayment(orderId)
    span.addEvent('payment_charged')
    await updateInventory(orderId)
    span.addEvent('inventory_updated')
    span.setStatus({ code: SpanStatusCode.OK })
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
    throw error
  } finally {
    span.end()
  }
}

4. Backends — Jaeger vs Tempo vs Zipkin

4.1 Jaeger (Uber)

  • Started at Uber (2017), CNCF graduated, Go
  • Cassandra/Elasticsearch backend
  • Powerful UI (search, compare, dependency graph)
  • Mature, wide adoption, OTel compatible
  • Downsides: storage cost (Elasticsearch), complex setup

4.2 Tempo (Grafana)

  • Grafana Labs (2020)
  • Object storage (S3, GCS) — very cheap
  • No index; fetch by trace_id only
  • Tight Grafana integration
  • Downsides: richer search is harder (TraceQL improving), newer

TraceQL:

{ resource.service.name = "checkout" && duration > 1s }

4.3 Zipkin

  • Twitter (2012), Java, MySQL/Cassandra/Elasticsearch
  • Oldest and most stable, simple
  • Feature development slower; ecosystem migrating to Jaeger/Tempo

4.4 Comparison

            Jaeger            Tempo                  Zipkin
Origin      Uber              Grafana                Twitter
Language    Go                Go                     Java
Storage     Cassandra/ES      Object storage         MySQL/ES
Cost        High              Low                    Medium
Ops         Complex           Simple                 Simple
UI          Own + Grafana     Grafana                Own
Search      Strong            TraceQL (improving)    Medium
OTel        Yes               Yes                    Yes

4.5 Managed Cloud

                 Price        Notes
Datadog APM      High         Powerful, commercial standard
New Relic        High         Full-stack
Honeycomb        High         High-cardinality analytics
Lightstep        High         Change analysis
Grafana Cloud    Reasonable   Managed Tempo
AWS X-Ray        Medium       AWS-integrated

5. Sampling Strategies

5.1 Why Sample?

Cost of 100% storage:

  • 10k req/sec x 30 days ≈ 26B traces
  • 5KB/trace avg ≈ 130 TB/month
  • Storage + processing blows up

Solution: sample only a portion.

5.2 Head Sampling

Decide at request start.

# 10% head sampling: decide before any work happens
import random

if random.random() < 0.1:
    span = tracer.start_span(...)

Pros: simple, fast, cheap. Cons: the decision knows nothing about the outcome, so at 10% it also drops 90% of error traces.

5.3 Probabilistic Sampling

Consistent ratio via trace_id hash — same trace_id yields the same decision across services.

processors:
  probabilistic_sampler:
    sampling_percentage: 10
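
The same ratio can also be set in the SDK instead of the Collector; a minimal Python sketch using the built-in ParentBased and TraceIdRatioBased samplers:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Hash of trace_id decides, so every service that sees the same trace_id
# makes the same keep/drop decision; child spans follow their parent.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)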

5.4 Tail Sampling

Decide after the request completes.

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

Pros: keeps 100% of error and slow traces while storing only 1% of normal traffic. Cons: all spans of a trace must be buffered in memory until the decision, and the setup is more complex.

5.5 Adaptive Sampling

Auto-adjust by traffic: low traffic -> high rate; high traffic -> low rate. Datadog and Honeycomb support this automatically.

5.6 Best Practices

normal traffic: 1%
slow traces (>1s): 100%
errors: 100%
new endpoint: 100% (1 week)
critical endpoint (/checkout): 10%

Implement via tail sampling.
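
In production this policy lives in the Collector's tail_sampling processor (as in 5.4), but the decision logic amounts to something like this toy Python sketch (the Span dataclass and thresholds are illustrative, not an OTel API):

import random
from dataclasses import dataclass

@dataclass
class Span:                     # illustrative stand-in, not the SDK class
    name: str
    duration_ms: float
    is_error: bool

def keep_trace(spans: list[Span]) -> bool:
    if any(s.is_error for s in spans):
        return True                            # errors: 100%
    if max(s.duration_ms for s in spans) > 1000:
        return True                            # slow traces (>1s): 100%
    if any(s.name.startswith("POST /checkout") for s in spans):
        return random.random() < 0.10          # critical endpoint: 10%
    return random.random() < 0.01              # normal traffic: 1%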


6. Auto-instrumentation

6.1 Java

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-app \
  -Dotel.exporter.otlp.endpoint=http://collector:4318 \
  -jar my-app.jar

Auto-traces HTTP (Servlet, Spring MVC), DB (JDBC, Hibernate), messaging (Kafka, RabbitMQ), gRPC, 100+ libraries.

6.2 Python

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
opentelemetry-instrument python my_app.py

Auto-traces Flask, Django, FastAPI, requests, SQLAlchemy, etc.
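
When you want explicit control instead of the opentelemetry-instrument wrapper, the programmatic setup is short; a minimal sketch (endpoint and service name are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Equivalent to what opentelemetry-instrument wires up automatically
provider = TracerProvider(resource=Resource.create({"service.name": "my-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-app")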

6.3 Node.js

node --require @opentelemetry/auto-instrumentations-node/register my-app.js

6.4 Go

Go's compiled nature makes auto-instrumentation hard; explicit code required:

import (
    "context"

    "go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("my-service")

func handleRequest(ctx context.Context) {
    // Start a child span from the incoming context; end it when the handler returns
    ctx, span := tracer.Start(ctx, "handle_request")
    defer span.End()

    _ = ctx // pass ctx to downstream calls so their spans join this trace
}

Newer: eBPF-based auto-instrumentation (Grafana Beyla).


7. Trace Analysis

7.1 What to look at

  1. Critical path: longest path — which span dominates? Parallelizable?
  2. Error spans: red spans — where did it start? Error message?
  3. External calls: HTTP/DB/gRPC — slowest call, retry patterns.
  4. Timing deltas: average vs current, pre/post deploy.

7.2 Common Patterns

N+1 Query

parent_span (200ms)
|- db_query (5ms)  <- user info
|- db_query (5ms)  <- user 1 posts
|- db_query (5ms)  <- user 2 posts
... (47 more)

Fix with JOIN or batch query.
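
A tiny self-contained sketch of the fix using sqlite3 (table and data are made up); the batched version produces one db_query span instead of N:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, title TEXT)")
user_ids = [1, 2, 3]

# N+1 anti-pattern: one query per user -> N db_query spans in the trace
for uid in user_ids:
    conn.execute("SELECT * FROM posts WHERE user_id = ?", (uid,)).fetchall()

# Fix: a single batched query (or a JOIN in the original query)
placeholders = ",".join("?" * len(user_ids))
conn.execute(
    f"SELECT * FROM posts WHERE user_id IN ({placeholders})", user_ids
).fetchall()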

Serial vs Parallel

parent_span (300ms)
|- call_service_a (100ms)
|- call_service_b (100ms)  <- serial
|- call_service_c (100ms)

Use Promise.all — 100ms.
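
Promise.all is the Node.js fix; the same idea in Python is asyncio.gather. A small sketch with stand-in 100ms calls:

import asyncio

async def call_service(name: str) -> str:
    await asyncio.sleep(0.1)          # stand-in for a 100ms downstream call
    return name

async def main():
    # Serial: ~300ms, three spans laid end to end in the trace
    for name in ("a", "b", "c"):
        await call_service(name)

    # Parallel: ~100ms, three overlapping spans
    await asyncio.gather(call_service("a"), call_service("b"), call_service("c"))

asyncio.run(main())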

Cache Miss

parent_span (200ms)
|- cache_check (5ms)  <- miss
|- db_query (190ms)

Analyze hit ratio.

7.3 RED + Tracing

Rate, Errors, Duration metrics derived from traces:

processors:
  spanmetrics:
    metrics_exporter: prometheus

Auto-generates Prometheus metrics.


8. Cost Optimization

8.1 Cost Drivers

  1. Data volume (traces/sec x spans/trace)
  2. Retention (7/30/90 days)
  3. Index (searchable fields)
  4. Network egress (cross-region)

8.2 Reduction Strategies

  1. Aggressive sampling — normal 0.1%, error/slow 100%
  2. Reduce cardinality — avoid unique-per-request attrs like user_id as dimension
  3. Object storage (Tempo) — 90% cheaper than Elasticsearch
  4. Local processing — extract metrics at the Collector
  5. Provider comparison — 10B spans/month: Datadog $30,000+ vs self-hosted Tempo on S3 around $500

9. Real-world — Debugging Microservices

9.1 Scenario

User complaint: "Orders occasionally take 30 seconds."

9.2 Search

Query the tracing backend with TraceQL for slow checkout traces:

{ resource.service.name = "checkout" && duration > 5s }

Found 100 traces.

9.3 Pattern

All show slowness in payment service; payment.gateway.call span runs 15-25s.

9.4 Drill Down

checkout (28s)
|- validate (10ms)
|- inventory (50ms)
|- payment (27s)
    |- db_save (10ms)
    |- stripe_api_call (26.9s) <- !!
        |- http_retry (3 attempts, 9s each)

Stripe calls timing out; three retries each.

9.5 Investigate

"http.url": "https://api.stripe.com/v1/charges",
"http.status_code": 0,
"http.error": "EAI_AGAIN"

DNS issue — checkout service's /etc/resolv.conf is broken.

9.6 Fix

Fix DNS, redeploy:

checkout (1.2s) OK
|- payment (800ms)
    |- stripe_api_call (750ms)

Time savings: days of debugging shrink to minutes with tracing.


10. Best Practices

10.1 Good Span Names

Bad: db_query, http_request
Good: SELECT users by id, GET /api/users/{id}

Control cardinality — variables as attributes, names as patterns.
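
In practice that means keeping the variable out of the span name and attaching it as an attribute; a small Python sketch:

from opentelemetry import trace

tracer = trace.get_tracer("user-service")

def get_user(user_id: str):
    # Bad:  f"GET /api/users/{user_id}" -> unbounded span-name cardinality
    # Good: pattern as the name, the variable as an attribute
    with tracer.start_as_current_span("GET /api/users/{id}") as span:
        span.set_attribute("user.id", user_id)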

10.2 Meaningful Attributes

span.set_attribute("user.id", user_id)
span.set_attribute("user.tier", "premium")
span.set_attribute("db.statement", query)

Follow OTel Semantic Conventions: http.method, db.system, messaging.system, etc.

10.3 Error Recording

from opentelemetry.trace import Status, StatusCode

try:
    do_something()
except Exception as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, str(e)))
    raise

10.4 Avoid Excess Spans

Span per tiny function becomes noise — use meaningful work units.

10.5 Security

Never put password, credit_card, or api_key into attributes — mask or use IDs.
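
A small sketch of scrubbing attributes before they reach a span (the key list and mask format are choices, not a standard):

from opentelemetry import trace

SENSITIVE_KEYS = {"password", "credit_card", "api_key", "authorization"}

def safe_attributes(attrs: dict) -> dict:
    # Mask sensitive values; pass everything else through unchanged
    return {k: "***" if k.lower() in SENSITIVE_KEYS else v
            for k, v in attrs.items()}

span = trace.get_current_span()
span.set_attributes(safe_attributes({
    "user.id": "123",
    "api_key": "secret-value",   # will be stored as "***"
}))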

10.6 Correlate Traces + Logs + Metrics

Join with shared trace_id:

import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

current_span = trace.get_current_span()
ctx = current_span.get_span_context()

logger.info("Order processed", extra={
    "trace_id": format(ctx.trace_id, "032x"),
    "span_id": format(ctx.span_id, "016x"),
    "order_id": order_id
})

Search logs by trace_id for full flow visualization.


Quiz

1. Why did OpenTelemetry become the standard?

A: (1) Vendor-neutral (code uses OTel API only, any backend works), (2) all major languages, (3) auto-instrumentation (Java agent, Python opentelemetry-instrument), (4) CNCF backing, (5) merger of OpenCensus and OpenTracing. Result: every observability tool (Jaeger, Tempo, Datadog, New Relic) is OTel-compatible.

2. Head vs Tail Sampling?

A: Head: decide at request start (e.g., 10% probability). Simple but drops 90% of error traces. Tail: decide after completion, allowing policies like "100% errors, 100% slow, 1% normal". Far more useful for debugging but requires buffering all spans. Tail sampling is standard at large scale; head is fine for simple environments.

3. Why is Tempo cheaper than Jaeger?

A: Object storage. Jaeger uses Cassandra or Elasticsearch (search index required, expensive). Tempo stores traces on S3/GCS with no index — fetch by trace_id only. Weaker search but 90% cheaper storage. TraceQL is closing the gap. At 10B spans/month: Datadog $30k+ vs self-hosted Tempo on S3 around $500.

4. What is W3C Trace Context?

A: Standard HTTP header for cross-service trace propagation. traceparent carries version-trace_id-parent_span_id-trace_flags, e.g. traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01. Previously each vendor used its own header (X-B3-TraceId, X-Datadog-Trace-Id); now unified. Supported by OTel, Jaeger, Tempo, Datadog.

5. Core value of distributed tracing?

A: Visualize the full flow of one request across microservices. A monolith needs only a stack trace, but in microservices it's hard to know which service is slow or where an error began. Tracing gives: (1) immediate bottleneck discovery (DB 1100ms), (2) error origin, (3) service dependency mapping, (4) optimization priorities. Days-long debugging becomes minutes.


References