Practical Guide to Distributed Tracing: OpenTelemetry, Jaeger, Grafana Tempo


Introduction

In a microservices architecture, a single user request is processed across multiple services. When a request originating from an API Gateway sequentially or concurrently calls authentication, order, payment, and notification services, you need to trace the entire call chain to determine why a specific request is slow. Logs and metrics from a single service alone are insufficient to understand causal relationships between services.

Distributed tracing solves this problem. Every time a request crosses a service boundary, a unique Trace ID is propagated. Each service records the work it performs as a Span, allowing the entire request flow to be reconstructed as a single trace. This concept, which originated from Google's Dapper paper (2010), has now been standardized as OpenTelemetry, a CNCF project, and can be stored and analyzed through backends like Jaeger and Grafana Tempo.

This post covers the full pipeline needed for production environments with practical code examples, starting from core distributed tracing concepts, through Python/Go instrumentation with the OpenTelemetry SDK, OpenTelemetry Collector configuration, Jaeger and Grafana Tempo backend setup, sampling strategies, cost optimization, and operational considerations.


Core Concepts of Distributed Tracing

Trace, Span, Context Propagation

Let's review the three core concepts of distributed tracing.

A Trace represents the entire path of a single user request as it traverses the system. It is identified by a unique Trace ID and consists of a collection of multiple Spans.

A Span represents a single unit of work within a trace. Each Span has a unique Span ID, parent Span ID, start/end times, Attributes, Events, and Status. There are five types of SpanKind: SERVER, CLIENT, PRODUCER, CONSUMER, and INTERNAL.

Context Propagation is the mechanism for passing Trace ID and Span ID across service boundaries. Context is propagated through HTTP headers (W3C Trace Context's traceparent and tracestate), gRPC metadata, message queue headers, and more.

W3C Trace Context Standard

W3C Trace Context is the standard for context propagation in distributed tracing. The format of the traceparent header is as follows:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
          Version   Trace ID (32 hex chars)       Parent Span ID (16 hex chars)   Flags (01 = sampled)

OpenTelemetry uses W3C Trace Context by default and also supports B3 (Zipkin) and Jaeger format propagation.
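To make the field layout concrete, here is a minimal parser sketch using only the standard library. (In practice you never parse this header yourself; the SDK's propagators handle it. The helper below is purely illustrative.)

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_span_id) == 16  # hex-encoded 16/8 bytes
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        # Bit 0 of the flags byte is the "sampled" flag
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# ctx["sampled"] → True (flags 01 means the trace was sampled upstream)
```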

OpenTelemetry Architecture Overview

OpenTelemetry is a vendor-neutral framework for generating, collecting, and transmitting telemetry data (traces, metrics, logs). The main components are:

  • API: Interface definitions for instrumentation (vendor-independent)
  • SDK: Implementation of the API (sampling, batching, exporting, etc.)
  • Collector: A standalone process that receives, processes, and exports telemetry data
  • Instrumentation Libraries: Automatic instrumentation for libraries and frameworks

OpenTelemetry SDK Instrumentation (Python/Go)

Python Instrumentation

Let's look at how to apply OpenTelemetry to a Python application. First, install the required packages:

pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  opentelemetry-instrumentation-sqlalchemy

Here is the complete code for applying OpenTelemetry to a Flask application:

# tracing.py - OpenTelemetry initialization module
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator

def init_tracer(service_name: str, otlp_endpoint: str = "localhost:4317"):
    """Initialize the OpenTelemetry TracerProvider and apply global settings."""
    resource = Resource.create({
        SERVICE_NAME: service_name,
        SERVICE_VERSION: "1.0.0",
        "deployment.environment": "production",
    })

    provider = TracerProvider(resource=resource)

    # OTLP gRPC Exporter configuration
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    span_processor = BatchSpanProcessor(
        otlp_exporter,
        max_queue_size=2048,
        max_export_batch_size=512,
        schedule_delay_millis=5000,
    )
    provider.add_span_processor(span_processor)

    # Register global TracerProvider
    trace.set_tracer_provider(provider)

    # W3C Trace Context + Baggage propagation setup
    set_global_textmap(CompositePropagator([
        TraceContextTextMapPropagator(),
        W3CBaggagePropagator(),
    ]))

    return provider

# app.py - Applying tracing to a Flask application
from flask import Flask, request, jsonify
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from tracing import init_tracer
import requests

# Initialize tracer
provider = init_tracer("order-service", "otel-collector:4317")

app = Flask(__name__)

# Automatic instrumentation: Flask and requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Obtain tracer for manual instrumentation
tracer = trace.get_tracer("order-service", "1.0.0")

@app.route("/orders", methods=["POST"])
def create_order():
    """Order creation API - adds manual Spans for fine-grained tracing"""
    with tracer.start_as_current_span(
        "validate_order",
        attributes={"order.type": "standard"}
    ) as span:
        order_data = request.get_json()
        span.set_attribute("order.item_count", len(order_data.get("items", [])))

        # Inventory check - external service call (auto-instrumented)
        inventory_resp = requests.post(
            "http://inventory-service:8080/check",
            json=order_data["items"]
        )

        if inventory_resp.status_code != 200:
            span.set_status(trace.StatusCode.ERROR, "Inventory check failed")
            span.record_exception(Exception("Insufficient inventory"))
            return jsonify({"error": "insufficient_inventory"}), 400

    with tracer.start_as_current_span("process_payment") as span:
        payment_resp = requests.post(
            "http://payment-service:8080/charge",
            json={"amount": order_data["total"]}
        )
        span.set_attribute("payment.method", order_data.get("payment_method", "card"))

    return jsonify({"order_id": "ORD-12345", "status": "confirmed"}), 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Go Instrumentation

In Go applications, the go.opentelemetry.io/otel package is used. Let's look at an example of initializing the SDK and applying tracing to an HTTP handler.

// tracing.go - OpenTelemetry initialization
package main

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
    "go.opentelemetry.io/otel/trace"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func initTracer(ctx context.Context, serviceName, endpoint string) (*sdktrace.TracerProvider, error) {
    // Create OTLP Exporter with gRPC connection
    conn, err := grpc.NewClient(endpoint,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        return nil, err
    }

    exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithGRPCConn(conn))
    if err != nil {
        return nil, err
    }

    // Define Resource - service metadata
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion("1.0.0"),
            semconv.DeploymentEnvironment("production"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Create TracerProvider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter,
            sdktrace.WithMaxQueueSize(2048),
            sdktrace.WithMaxExportBatchSize(512),
        ),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1), // 10% sampling
        )),
    )

    // Global settings
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    return tp, nil
}

// handleOrder - Create manual Span in HTTP handler
func handleOrder(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    tracer := otel.Tracer("order-handler")

    ctx, span := tracer.Start(ctx, "process-order",
        trace.WithSpanKind(trace.SpanKindServer),
    )
    defer span.End()

    // Pass context to business logic
    orderID, err := processOrder(ctx)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        http.Error(w, "order failed", http.StatusInternalServerError)
        return
    }

    span.SetAttributes(attribute.String("order.id", orderID))
    w.WriteHeader(http.StatusCreated)
}

Automatic vs Manual Instrumentation

| Aspect | Automatic Instrumentation | Manual Instrumentation |
| --- | --- | --- |
| How to apply | Use libraries/agents | Create Spans directly in code |
| Code changes | Minimal (initialization only) | Inserted throughout business logic |
| Coverage | Framework-level: HTTP, DB, gRPC, etc. | Can cover custom business logic |
| Fine control | Limited | Freely add attributes, events, and links |
| Recommended use | Use as the base skeleton | Add supplementally for key business logic |

In practice, a hybrid strategy is recommended: use automatic instrumentation to generate baseline Spans for HTTP/gRPC/DB calls, and add manual instrumentation for critical sections of business logic.
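A common pattern for the manual half of this hybrid is a small decorator, so business functions stay free of tracing boilerplate. The `traced` helper below is not part of the SDK; it is a sketch that assumes only the Tracer's standard `start_as_current_span` context manager:

```python
import functools

def traced(tracer, span_name=None):
    """Decorator: wrap each call to the decorated function in a Span.

    `tracer` is any object exposing start_as_current_span() as a context
    manager, e.g. an OpenTelemetry Tracer from trace.get_tracer().
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Span name defaults to the function name
            with tracer.start_as_current_span(span_name or fn.__name__):
                return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Usage with the tracer from the Flask example would look like `@traced(tracer, "calculate_discount")` above the function definition; the hypothetical `calculate_discount` then gets one Span per call without any inline tracing code.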


OpenTelemetry Collector Configuration

The OpenTelemetry Collector serves as a pipeline for receiving (Receivers), processing (Processors), and exporting (Exporters) telemetry data. By placing a Collector between applications and backends instead of sending data directly, you can centrally manage retries, batching, sampling, and routing.

Collector Configuration File

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256

  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: k8s-prod-01
        action: upsert

  # Tail-based sampling: collect 100% of errors and slow requests
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  debug:
    verbosity: basic

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/jaeger, otlp/tempo, debug]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

Collector Docker Compose Deployment

# docker-compose.yaml (Collector section)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.116.0
    command: ['--config=/etc/otel-collector-config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
      - '8888:8888' # Collector metrics
      - '13133:13133' # Health check
      - '55679:55679' # zPages
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2g
          cpus: '1.0'
    healthcheck:
      test: ['CMD', 'wget', '--spider', '-q', 'http://localhost:13133']
      interval: 10s
      timeout: 5s
      retries: 3

An important consideration in Collector configuration is the order of processors. memory_limiter must always be placed first to prevent OOM, and batch should be positioned last to improve network efficiency.


Jaeger Backend Setup and Operations

Jaeger is an open-source distributed tracing system developed by Uber and is a CNCF Graduated project. Jaeger v2 has been rebuilt on top of the OpenTelemetry Collector, providing native OTLP support.

Jaeger Production Deployment

For production environments, deploying separate components is recommended over the Jaeger all-in-one setup. Let's look at a configuration using Elasticsearch as the storage backend.

# docker-compose-jaeger.yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.17.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - 'ES_JAVA_OPTS=-Xms1g -Xmx1g'
    ports:
      - '9200:9200'
    volumes:
      - es_data:/usr/share/elasticsearch/data
    healthcheck:
      test: ['CMD-SHELL', 'curl -f http://localhost:9200/_cluster/health || exit 1']
      interval: 10s
      timeout: 5s
      retries: 10

  jaeger:
    image: jaegertracing/jaeger:2.4.0
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - ES_NUM_SHARDS=3
      - ES_NUM_REPLICAS=1
      - ES_INDEX_PREFIX=jaeger
      - ES_TAGS_AS_FIELDS_ALL=true
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - '16686:16686' # Jaeger UI
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
      - '14250:14250' # gRPC Collector
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped

volumes:
  es_data:
    driver: local

Jaeger Index Management

Index management is essential when operating Jaeger with Elasticsearch. Set up a cron job to automatically delete old traces.

#!/bin/bash
# jaeger-index-cleaner.sh - Delete Jaeger indices older than 14 days
ES_URL="http://elasticsearch:9200"
RETENTION_DAYS=14

# Calculate cutoff date
CUTOFF_DATE=$(date -d "-${RETENTION_DAYS} days" +%Y-%m-%d)

echo "Deleting Jaeger indices older than ${CUTOFF_DATE}..."

# Query and delete jaeger-span-* and jaeger-service-* indices
for INDEX_TYPE in "jaeger-span" "jaeger-service"; do
  INDICES=$(curl -s "${ES_URL}/_cat/indices/${INDEX_TYPE}-*" \
    | awk '{print $3}' \
    | sort)

  for INDEX in ${INDICES}; do
    # Extract date from index name (format: jaeger-span-2026-03-01)
    INDEX_DATE=$(echo "${INDEX}" | grep -oP '\d{4}-\d{2}-\d{2}')
    if [[ "${INDEX_DATE}" < "${CUTOFF_DATE}" ]]; then
      echo "Deleting index: ${INDEX}"
      curl -s -X DELETE "${ES_URL}/${INDEX}"
    fi
  done
done

echo "Cleanup completed."

Key Jaeger Features

  • Trace search: Search traces by service name, operation name, tags, and time range
  • Trace comparison: Compare two traces side by side to analyze performance differences
  • Dependency graph: Visualize inter-service call relationships as a DAG
  • SPM (Service Performance Monitoring): Automatic generation of RED metrics (Rate, Error, Duration)

Grafana Tempo: Large-Scale Trace Storage

Grafana Tempo is a backend designed for large-scale distributed tracing. Unlike Jaeger, it does not create indices and stores traces directly in object storage (S3, GCS, Azure Blob), dramatically reducing costs.

Tempo Configuration

# tempo-config.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  max_block_duration: 5m
  max_block_bytes: 1073741824 # 1GB
  flush_check_period: 10s

compactor:
  compaction:
    block_retention: 336h # 14 days
  ring:
    kvstore:
      store: memberlist

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces-prod
      endpoint: s3.ap-northeast-2.amazonaws.com
      region: ap-northeast-2
      # IAM Role-based authentication recommended (avoid hardcoding access_key)
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

querier:
  search:
    query_timeout: 30s
  frontend_worker:
    frontend_address: tempo-query-frontend:9095

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: k8s-prod
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
        - deployment.environment
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - http.route

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

Tempo + Grafana Integration

The true power of Tempo comes from its integration with Grafana. You can query traces using TraceQL and correlate logs (Loki) and metrics (Prometheus/Mimir) with traces.

By adding Tempo as a data source in Grafana settings and enabling the Trace to Logs and Trace to Metrics features, you can jump directly from a specific Span in a trace to related logs and metrics. This "correlation analysis" is the key differentiator of the Grafana stack.
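As a sketch, provisioning such a Tempo data source could look like the following. Field names such as tracesToLogsV2 reflect recent Grafana releases and should be checked against your Grafana version; the URLs and datasource UIDs are placeholders for your environment:

```yaml
# grafana/provisioning/datasources/tempo.yaml (illustrative sketch)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki # UID of your Loki data source
        spanStartTimeShift: '-5m'
        spanEndTimeShift: '5m'
        filterByTraceID: true
      serviceMap:
        datasourceUid: prometheus # where Tempo's metrics-generator writes
```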

TraceQL Query Examples

TraceQL, introduced in Tempo 2.0, is a query language dedicated to traces.

// Search for traces with Spans in error status
{ status = error }

// Search for Spans taking more than 1 second in a specific service
{ resource.service.name = "order-service" && duration > 1s }

// Search for Spans that returned HTTP 500 errors
{ span.http.status_code = 500 }

// Search for call relationships between two services
{ resource.service.name = "api-gateway" } >> { resource.service.name = "payment-service" }

Tracing Backend Comparison (Jaeger vs Tempo vs Zipkin)

When choosing a tracing backend, it is important to select the right tool for your environment and requirements. Here is a comparison of the major tracing backends.

| Item | Jaeger | Grafana Tempo | Zipkin | AWS X-Ray |
| --- | --- | --- | --- | --- |
| Developer | Uber (CNCF Graduated) | Grafana Labs | Twitter (now X) | AWS |
| Storage | Elasticsearch, Cassandra, Kafka | S3, GCS, Azure Blob | Elasticsearch, MySQL, Cassandra | AWS managed storage |
| Indexing | Full indexing | No indexing (Trace ID-based) | Full indexing | Proprietary indexing |
| Querying | Service, tag, time-based search | TraceQL, Trace ID search | Service, tag, time-based search | Filter Expressions |
| Storage cost/GB | High (ES indexing) | Very low (object storage) | High (ES indexing) | Medium (managed) |
| OTLP support | Native (v2) | Native | Requires separate Collector | Separate SDK/Collector |
| Scaling | Horizontal scaling possible | Excellent horizontal scaling | Limited | Automatic (managed) |
| UI | Built-in UI (good) | Grafana (excellent) | Built-in UI (basic) | AWS Console |
| Metrics generation | SPM (v2) | Metrics Generator | None | Proprietary metrics |
| Log integration | Limited | Native Loki integration | None | CloudWatch integration |
| Best suited for | Standalone tracing system | Grafana stack environments | Small-scale/learning | AWS-native |

Selection Guide

  • Jaeger: Teams that need a standalone tracing system, require rich search capabilities, and have Elasticsearch/Cassandra operational expertise
  • Grafana Tempo: Environments already using Grafana, Loki, and Prometheus/Mimir where cost efficiency matters and large-scale trace processing is needed
  • Zipkin: For quick prototyping or learning purposes, or when compatibility with legacy Zipkin format is required
  • AWS X-Ray: Environments primarily using AWS services (Lambda, ECS, EKS) that prefer managed services

Sampling Strategies and Cost Optimization

In production environments, collecting all traces leads to surging storage costs and network load. An effective sampling strategy is key to balancing cost and visibility.

Head-based Sampling

Sampling decisions are made when the request starts. Implementation is simple, but error traces can be missed.

# Head-based probabilistic sampling setup in Python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased

# ParentBased: if the parent Span is sampled, the child is also sampled
# Root Spans without a parent are sampled at 10% probability
sampler = ParentBased(root=TraceIdRatioBased(0.1))

provider = TracerProvider(
    resource=resource,
    sampler=sampler,
)

Tail-based Sampling

Sampling decisions are made after the trace is complete, based on all the Spans. This uses the tail_sampling processor in the OpenTelemetry Collector and can collect 100% of error traces and high-latency traces. However, since it puts memory pressure on the Collector, the decision_wait time and num_traces settings need to be tuned for your environment.
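The memory pressure can be estimated directly: the Collector must hold every trace that arrives during the decision window. A quick sizing helper (illustrative only; per-trace size varies widely with span count and attributes):

```python
def tail_sampling_buffer(traces_per_sec: int, decision_wait_s: int,
                         avg_trace_kib: int = 50) -> dict:
    """Estimate how many traces sit in memory awaiting a sampling decision."""
    in_flight = traces_per_sec * decision_wait_s
    return {
        # num_traces must be at least this, or traces get evicted undecided
        "num_traces_min": in_flight,
        "approx_memory_mib": in_flight * avg_trace_kib // 1024,
    }

# e.g. 10,000 new traces/s with decision_wait: 30s
est = tail_sampling_buffer(10_000, 30)
# est["num_traces_min"] → 300000
```

This is exactly the arithmetic behind the Collector OOM failure case described later: a num_traces setting far below the in-flight count guarantees eviction or memory growth at peak.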

Cost Optimization Strategies

  1. Tiered sampling: Collect 5-10% of normal traffic, 100% of errors and slow requests
  2. Per-service differentiated sampling: Higher rates for critical services, 0% for internal health checks
  3. Storage tiering: Fast storage for recent data, low-cost storage for older data
  4. TTL settings: Limit trace retention to 7-14 days (most debugging happens within 48 hours)
  5. Attribute filtering: Remove unnecessary Span Attributes at the Collector to reduce data size
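Attribute filtering (item 5) maps to the Collector's attributes processor. A minimal sketch, where the key names are examples rather than standard conventions:

```yaml
processors:
  attributes/strip:
    actions:
      # Drop attributes that are never queried downstream (example keys)
      - key: http.request.body
        action: delete
      - key: session.id
        action: delete
```

To take effect, attributes/strip must also be added to the traces pipeline's processors list (after memory_limiter, before batch).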

Operational Considerations and Troubleshooting

Missing Context Propagation

The most common issue in distributed tracing is missing context propagation. If the traceparent header is not passed during inter-service calls, the trace gets broken.

Symptoms: When viewing a trace in Jaeger/Tempo, there is only one Span, or Spans are separated into different Trace IDs.

Causes and solutions:

  • Missing automatic instrumentation for HTTP client libraries: Apply RequestsInstrumentor().instrument() or otelhttp.NewTransport()
  • Context not propagated through message queues (Kafka, RabbitMQ): Manually inject/extract traceparent in message headers
  • Missing gRPC interceptor: Add otelgrpc.UnaryClientInterceptor()
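For the message-queue case, the pattern is to serialize the context into a carrier dict with opentelemetry.propagate.inject(carrier) before producing, and rebuild it with propagate.extract(carrier) on the consumer side. Kafka clients expect headers as (str, bytes) tuples, so a small conversion is needed; the helper names below are illustrative, not a library API:

```python
def carrier_to_kafka_headers(carrier: dict) -> list:
    """Kafka headers are (str, bytes) tuples; a W3C carrier is a str->str dict."""
    return [(key, value.encode("utf-8")) for key, value in carrier.items()]

def kafka_headers_to_carrier(headers: list) -> dict:
    """Rebuild the carrier dict on the consumer before calling propagate.extract()."""
    return {key: value.decode("utf-8") for key, value in headers}

# Producer side: propagate.inject(carrier) fills the dict, then convert:
carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
headers = carrier_to_kafka_headers(carrier)
```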

Collector Bottlenecks and Backpressure

When traffic spikes, the Collector may not handle the throughput, causing Span loss.

Metrics to monitor:

  • otelcol_exporter_send_failed_spans: Number of Spans that failed to export
  • otelcol_processor_batch_timeout_trigger_send: Forced sends due to timeout
  • otelcol_receiver_refused_spans: Number of refused Spans

Solutions:

  • Always configure the memory_limiter processor to prevent OOM
  • Horizontally scale Collectors behind a load balancer
  • Tune send_batch_size and timeout in the batch processor

High Cardinality Attributes

Placing high-cardinality values such as user IDs, session IDs, or request bodies in Span Attributes causes indexing costs to skyrocket (especially in Jaeger + Elasticsearch environments).

Guidelines:

  • Keep Attribute value cardinality to thousands or fewer
  • Record unique identifiers in Span Events or logs instead of Span Attributes
  • Set Jaeger's ES_TAGS_AS_FIELDS_ALL=false and index only the necessary tags
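There is no built-in cardinality check in the SDK, so a quick offline audit over a sample of exported span attributes can flag offending keys before they hit the backend. A sketch (the function and threshold are illustrative):

```python
from collections import defaultdict

def high_cardinality_keys(span_attributes, limit=1000):
    """Return attribute keys whose distinct-value count exceeds `limit`.

    `span_attributes` is a list of per-span attribute dicts, e.g. parsed
    from an OTLP export dump or a backend query.
    """
    distinct = defaultdict(set)
    for attrs in span_attributes:
        for key, value in attrs.items():
            distinct[key].add(value)
    return {key for key, values in distinct.items() if len(values) > limit}

spans = [{"user.id": f"u{i}", "http.method": "GET"} for i in range(5000)]
# high_cardinality_keys(spans) → {"user.id"}
```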

Failure Cases and Recovery Procedures

Case 1: Trace Loss Due to Collector OOM

Situation: A Collector using tail-based sampling exceeded its memory limit due to a traffic spike, causing the process to terminate. All traces were lost for approximately 5 minutes.

Root cause: The tail_sampling processor's num_traces was set to 50,000, but during peak hours, 10,000 new traces per second were flowing in. With decision_wait set to 30 seconds, up to 300,000 traces needed to be held in memory.

Recovery procedure:

  1. Verify automatic Collector Pod restart (Kubernetes liveness probe)
  2. Increase num_traces to 200,000 and reduce decision_wait to 10 seconds
  3. Set memory_limiter's limit_mib to 80% of the actual Pod memory limit
  4. Horizontally scale to 3 Collector instances

Preventive measures: Set up alerting rules for Collector memory usage and apply autoscaling (HPA).

Case 2: Jaeger Outage Due to Elasticsearch Index Explosion

Situation: After deploying a new service, instrumentation was added that included the entire request body as a Span Attribute. Within 24 hours, Elasticsearch disk usage exceeded 95%, and index writes were blocked.

Root cause: A developer added span.set_attribute("request.body", json.dumps(request_body)) for debugging purposes. This Attribute averaged 10KB in size, and with 1,000 requests per second, approximately 800GB of additional data was generated daily.

Recovery procedure:

  1. Deploy a hotfix to remove the problematic Span Attribute from the service
  2. Temporarily adjust Elasticsearch disk watermarks to resume writes
  3. Manually delete indices from the problem period to free disk space
  4. Add an Attribute size limit processor to the Collector

Preventive measures: Use the attributes processor at the Collector level to limit Attribute value sizes, and mandate instrumentation code reviews in the CI/CD pipeline.
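The size limit from step 4 can live in the Collector (attributes/transform processors) or in a custom SpanProcessor in the SDK; either way the core logic is simple. A sketch, where MAX_ATTR_BYTES is an illustrative threshold rather than a standard setting:

```python
MAX_ATTR_BYTES = 1024  # illustrative threshold

def truncate_attribute(value, limit=MAX_ATTR_BYTES):
    """Truncate oversized string attribute values, marking them explicitly."""
    if isinstance(value, str) and len(value.encode("utf-8")) > limit:
        # Clip on the byte boundary and drop any partial trailing character
        clipped = value.encode("utf-8")[:limit].decode("utf-8", errors="ignore")
        return clipped + "...[truncated]"
    return value

# A 10KB request body (the size from the failure case above) shrinks to ~1KB
assert len(truncate_attribute("x" * 10_000)) < 2_000
```

Applied just before export, this caps the worst case at roughly limit bytes per attribute instead of the unbounded payloads that caused the index explosion.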

Case 3: Abnormal Span Ordering Due to Clock Skew

Situation: When viewing traces in the Jaeger UI, child Spans appeared to start before parent Spans, or Span durations were displayed as negative.

Root cause: Time synchronization (NTP) between container host nodes was not functioning properly, causing clock skew of up to 500ms.

Recovery procedure:

  1. Check NTP synchronization status on all nodes: chronyc tracking
  2. Correct NTP server configuration and force synchronization
  3. Disregard trace data from the period with significant clock skew

Preventive measures: Configure chrony or systemd-timesyncd on Kubernetes nodes, and set up monitoring alerts for NTP offset.


Conclusion

Distributed tracing is an essential observability tool in microservices architectures. By instrumenting in a vendor-neutral manner through OpenTelemetry and using Jaeger or Grafana Tempo as backends, you can transparently trace request flows in production environments.

Key takeaways:

  1. Standardize instrumentation with OpenTelemetry. Build the base skeleton with automatic instrumentation, and supplement with manual instrumentation for business logic.
  2. Place an OpenTelemetry Collector in the middle. Centrally manage retries, sampling, and routing.
  3. Choose a backend that fits your environment. If you use the Grafana stack, Tempo is suitable; for a standalone system, Jaeger is a good fit.
  4. Optimize costs with tail-based sampling. Always collect error and high-latency requests, and sample normal requests at a low rate.
  5. Monitor the operational pipeline. Collect metrics from the Collector and backends themselves, and set up alerts.

The most important thing when adopting distributed tracing is "not trying to instrument everything." Define your Critical User Journeys first, prioritize instrumenting services along those paths, and then gradually expand. This is the pragmatic approach.

