Grafana Tempo Distributed Tracing and TraceQL Operations Guide 2026


Overview

As microservices architecture has become mainstream, environments where a single request passes through dozens of services are now commonplace. In such environments, distributed tracing is essential for tracking the root cause of failures. Grafana Tempo is an open-source distributed tracing backend released by Grafana Labs in 2020 that can operate with only object storage, dramatically reducing infrastructure complexity and cost.

Tempo's core philosophy is simple. It does not create separate indexes for trace data, instead searching spans through Trace ID-based lookups and the TraceQL query engine. Thanks to this approach, storage costs are significantly lower compared to Jaeger or Zipkin, and petabyte-scale traces can be reliably stored.

This article covers Tempo's internal architecture, three deployment modes, TraceQL query syntax, span metrics generation and service graphs, OpenTelemetry Collector integration, storage optimization, Grafana dashboard configuration, troubleshooting, and real-world failure cases and recovery experiences from production operations.

Tempo Architecture

Internally, Tempo uses multiple components that work together to collect, store, and query trace data. Understanding each component's role helps quickly identify bottlenecks when failures occur.

Core Components

Distributor is the entry point that receives span data from clients. It supports multiple protocols, including Jaeger, Zipkin, and OpenTelemetry (OTLP), and routes incoming spans to the appropriate Ingester via consistent hashing of the Trace ID.
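The ring behavior can be sketched in a few lines of Python. This is an illustrative model (the ingester names and virtual-node count are made up), not Tempo's actual ring implementation:

```python
import bisect
import hashlib

def build_ring(ingesters, vnodes=64):
    """Build a simplified hash ring: each ingester owns `vnodes` tokens."""
    ring = []
    for name in ingesters:
        for i in range(vnodes):
            token = int(hashlib.sha256(f"{name}-{i}".encode()).hexdigest(), 16)
            ring.append((token, name))
    ring.sort()
    return ring

def route(ring, trace_id):
    """Route a trace ID to the ingester owning the next token clockwise."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    tokens = [t for t, _ in ring]
    idx = bisect.bisect(tokens, h) % len(ring)
    return ring[idx][1]

ring = build_ring(["ingester-0", "ingester-1", "ingester-2"])
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
# The same trace ID always lands on the same ingester,
# so all spans of one trace accumulate in the same block.
assert route(ring, tid) == route(ring, tid)
```

Because routing is deterministic per Trace ID, spans of a single trace never scatter across ingesters, which is what makes Trace ID lookups cheap later.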

Ingester buffers received span data in memory and flushes it as blocks to object storage after a configured period. It maintains a WAL (Write-Ahead Log) to minimize data loss even during abnormal process termination.

Query Frontend is the component called when clients like Grafana request Trace ID lookups or TraceQL searches. It distributes requests across multiple Queriers to search block data in parallel, reducing response time.

Querier is the worker that actually processes requests received from the Query Frontend. It searches both the Ingester's in-memory data and object storage block data to combine results.

Compactor periodically merges small blocks stored in object storage into larger blocks. This improves query performance and optimizes storage usage.

Metrics Generator is an optional component that automatically generates RED (Rate, Error, Duration) metrics and service graphs from received span data. Generated metrics are sent to Mimir or Prometheus via Prometheus-compatible remote write.

Data Flow

[Application] --> [OTel Collector] --> [Distributor]
                                           |
                                    [Hash Ring]
                                           |
                                      [Ingester]
                                       /      \
                              [WAL]         [Object Storage]
                                                  |
                              [Compactor] <-------+
                                                  |
                              [Query Frontend] ---+---> [Querier]

Spans arrive at the Distributor from the application via OTel Collector, then are distributed to Ingesters through the hash ring. The Ingester first writes to the WAL, then flushes blocks to object storage at configured intervals (default 30 minutes). The Compactor merges small blocks, and the Querier searches both Ingester in-memory and object storage data.
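The cut-and-flush behavior governed by max_block_duration and max_block_bytes can be modeled roughly as below. MiniIngester and its fields are illustrative stand-ins, not Tempo's real data structures:

```python
import time

class MiniIngester:
    """Toy model of the ingester's block-cut conditions: flush when the
    in-memory block exceeds max_block_bytes, or has been open longer
    than max_block_duration (seconds)."""

    def __init__(self, max_block_bytes=1_073_741_824, max_block_duration=300):
        self.max_block_bytes = max_block_bytes
        self.max_block_duration = max_block_duration
        self.block = []          # stands in for the WAL-backed head block
        self.block_bytes = 0
        self.block_opened = time.monotonic()
        self.flushed = []        # stands in for blocks written to object storage

    def push(self, span_bytes):
        self.block.append(span_bytes)
        self.block_bytes += span_bytes
        if (self.block_bytes >= self.max_block_bytes or
                time.monotonic() - self.block_opened >= self.max_block_duration):
            self.flush()

    def flush(self):
        if self.block:
            self.flushed.append((len(self.block), self.block_bytes))
        self.block, self.block_bytes = [], 0
        self.block_opened = time.monotonic()

ing = MiniIngester(max_block_bytes=1000, max_block_duration=300)
for _ in range(10):
    ing.push(150)  # crossing the 1000-byte limit triggers one flush
assert len(ing.flushed) == 1
```

Shrinking either threshold trades smaller, more frequent blocks (more Compactor work) for lower memory pressure, which is exactly the tuning lever used in the OOM troubleshooting section later.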

Deployment Modes

Tempo provides three deployment modes that can be selected based on the organization's scale and requirements.

Deployment Mode Comparison

| Item | Monolithic | Scalable Single Binary | Microservices |
| --- | --- | --- | --- |
| Structure | Single binary, single process | Single binary, multiple instances | Independent process per component |
| Scalability | Vertical scaling only | Horizontal scaling | Independent horizontal per component |
| Recommended traffic | Under 100GB/day | 100GB to 1TB/day | Over 1TB/day |
| Operational complexity | Low | Medium | High |
| High availability | Limited | Basic support | Full support |
| Suitable environment | Dev/test, small-scale | Medium-scale production | Large-scale production, multi-tenant |
| Kubernetes required | No | Recommended | Required |

Monolithic Mode

All components run in a single process. Suitable for local environments or small workloads with the simplest configuration.

# tempo-config.yaml (Monolithic)
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: '0.0.0.0:4317'
        http:
          endpoint: '0.0.0.0:4318'
    jaeger:
      protocols:
        thrift_http:
          endpoint: '0.0.0.0:14268'
    zipkin:
      endpoint: '0.0.0.0:9411'

ingester:
  max_block_duration: 5m
  max_block_bytes: 1073741824 # 1GB

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

compactor:
  compaction:
    block_retention: 72h

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: local
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
        - deployment.environment
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - http.route

overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Scalable Single Binary Mode

Achieves horizontal scaling by running the same binary as multiple instances. As a middle ground between Monolithic and Microservices, it provides scalability without significantly increasing configuration complexity. Each instance is started with the flag -target=scalable-single-binary.

Microservices Mode

Each component is deployed as an independent process, enabling individual scaling. In large-scale environments, specific components (e.g., Ingester) can be scaled out, or Queriers can be adjusted to match traffic patterns. In Kubernetes environments, using the Helm chart (tempo-distributed) makes deployment convenient.

Quick Start with Docker Compose

To quickly try Tempo in a local environment, use Docker Compose. The configuration below brings up Tempo (Monolithic), OTel Collector, Grafana, and Prometheus all at once.

# docker-compose.yaml
version: '3.9'

services:
  tempo:
    image: grafana/tempo:2.7.1
    command: ['-config.file=/etc/tempo/tempo.yaml']
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - '3200:3200' # Tempo HTTP API
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
      - '9411:9411' # Zipkin
      - '14268:14268' # Jaeger HTTP
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.118.0
    command: ['--config=/etc/otel-collector/config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector/config.yaml
    ports:
      - '4327:4317' # OTLP gRPC (for app access)
      - '4328:4318' # OTLP HTTP
    depends_on:
      - tempo
    networks:
      - observability

  prometheus:
    image: prom/prometheus:v3.2.1
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
    networks:
      - observability

  grafana:
    image: grafana/grafana:11.5.2
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    ports:
      - '3000:3000'
    depends_on:
      - tempo
      - prometheus
    networks:
      - observability

volumes:
  tempo-data:

networks:
  observability:
    driver: bridge

After running docker compose up -d, access Grafana at http://localhost:3000 where the Tempo datasource is automatically provisioned, allowing you to search traces immediately.

TraceQL Query Syntax

TraceQL is Tempo's dedicated query language, following a syntax system similar to PromQL and LogQL. It selects spansets with curly braces {} and chains filters and aggregations with pipeline operators.

Basic Structure

A TraceQL query consists of three main elements:

  • Intrinsics: Span's built-in properties (name, status, duration, kind, rootName, rootServiceName, traceDuration)
  • Attributes: Custom key-value pairs using scope prefixes (span., resource., link., event.)
  • Operators: Comparison (=, !=, >, <, >=, <=), regex (=~, !~), logical (&&, ||), structural (>, >>, <, <<, ~)

TraceQL Query Examples

// 1. Find error spans for a specific service
{ resource.service.name = "payment-service" && status = error }

// 2. HTTP GET request spans taking over 500ms
{ span.http.method = "GET" && duration > 500ms }

// 3. Spans returning 5xx responses on a specific route
{ span.http.route = "/api/v1/orders" && span.http.status_code >= 500 }

// 4. Trace call relationship between two services (structural operator)
{ resource.service.name = "api-gateway" } >> { resource.service.name = "order-service" }

// 5. Filter spans with direct parent-child relationship
{ resource.service.name = "frontend" } > { span.http.status_code = 503 }

// 6. Explore sibling span relationships
{ span.db.system = "postgresql" } ~ { span.db.system = "redis" }

// 7. Span name matching using regex
{ name =~ "HTTP.*POST" && resource.deployment.environment = "production" }

// 8. Filter by total trace duration
{ traceDuration > 3s }

// 9. Filter by root service
{ rootServiceName = "ingress-nginx" && duration > 1s }

// 10. Analysis using aggregation functions
{ resource.service.name = "checkout-service" } | rate()

// 11. Check latency distribution with histogram
{ resource.service.name = "search-service" } | histogram_over_time(duration)

// 12. Anomaly detection based on count
{ status = error } | count() > 100
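To make the selection semantics concrete, here is a rough Python equivalent of example 2 evaluated over plain dictionaries. The dict layout is an assumption for illustration only, not Tempo's internal span representation:

```python
def matches(span, method="GET", min_duration_ms=500):
    """Rough Python equivalent of
    { span.http.method = "GET" && duration > 500ms }."""
    return (span.get("attributes", {}).get("http.method") == method
            and span.get("duration_ms", 0) > min_duration_ms)

spans = [
    {"name": "GET /api/v1/orders", "duration_ms": 820,
     "attributes": {"http.method": "GET"}},
    {"name": "GET /healthz", "duration_ms": 3,
     "attributes": {"http.method": "GET"}},
    {"name": "POST /api/v1/orders", "duration_ms": 900,
     "attributes": {"http.method": "POST"}},
]
slow_gets = [s for s in spans if matches(s)]
assert [s["name"] for s in slow_gets] == ["GET /api/v1/orders"]
```

Both conditions must hold for a span to enter the result spanset, which is why the healthy-but-fast and slow-but-POST spans are excluded.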

Key Aggregation Functions

| Function | Description | Example |
| --- | --- | --- |
| rate() | Spans per second rate | { } \| rate() |
| count() | Matching span count | { status = error } \| count() |
| avg(field) | Field average value | { } \| avg(duration) |
| max(field) | Field maximum value | { } \| max(duration) |
| min(field) | Field minimum value | { } \| min(duration) |
| p50/p90/p95/p99(field) | Percentiles | { } \| p99(duration) |
| histogram_over_time(field) | Histogram over time | { } \| histogram_over_time(duration) |
| quantile_over_time(field, q) | Quantile over time | { } \| quantile_over_time(duration, 0.95) |
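As a sanity check on what the percentile functions report, a nearest-rank quantile can be computed by hand. Note that Tempo derives quantiles from histogram buckets, so this exact-value sketch is a simplification:

```python
def quantile(values, q):
    """Nearest-rank quantile, a simplified stand-in for
    TraceQL's quantile_over_time(duration, q)."""
    if not values:
        raise ValueError("empty input")
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(q * len(ordered) + 0.5) - 1))
    return ordered[idx]

durations_ms = [12, 15, 18, 22, 30, 45, 80, 120, 400, 1500]
assert quantile(durations_ms, 0.5) == 30
assert quantile(durations_ms, 0.95) == 1500
```

The single 1500ms outlier dominates p95 and p99 while leaving the median untouched, which is why percentile-based alerts catch tail latency that averages hide.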

Span Metrics and Service Graphs

Tempo's Metrics Generator is a powerful feature that automatically generates metrics from received spans. Without separate metric collection, you can obtain RED metrics and service dependency graphs from trace data alone.

Span Metrics Generator

The span metrics processor derives request rate, error rate, and duration (RED) distributions from all incoming spans and exposes them as Prometheus metrics. The main metrics generated are:

  • traces_spanmetrics_calls_total: Total span call count
  • traces_spanmetrics_latency_bucket: Latency histogram buckets
  • traces_spanmetrics_size_total: Total span size

By configuring dimensions, you can add span attributes like http.method, http.status_code, and http.route as metric labels, allowing fine-grained RED metrics observation per endpoint.
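How configured dimensions turn into metric label sets can be sketched as follows. The span dict shape and the span_to_labels helper are assumptions for illustration, not the processor's actual code:

```python
from collections import Counter

DIMENSIONS = ("http.method", "http.status_code", "http.route")

def span_to_labels(span):
    """Project a span's attributes onto the configured dimensions,
    mimicking how the span metrics processor builds a label set."""
    attrs = span.get("attributes", {})
    return tuple((d, str(attrs.get(d, ""))) for d in DIMENSIONS)

calls_total = Counter()
spans = [
    {"attributes": {"http.method": "GET", "http.status_code": 200,
                    "http.route": "/api/v1/orders"}},
    {"attributes": {"http.method": "GET", "http.status_code": 200,
                    "http.route": "/api/v1/orders"}},
    {"attributes": {"http.method": "POST", "http.status_code": 500,
                    "http.route": "/api/v1/orders"}},
]
for s in spans:
    calls_total[span_to_labels(s)] += 1  # like traces_spanmetrics_calls_total

assert max(calls_total.values()) == 2  # identical GET label sets merge into one series
```

Every distinct label tuple becomes its own time series, which is the mechanism behind the cardinality explosion described in the failure cases at the end of this article.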

Service Graph Generator

The service graph processor analyzes client-server span pairs to automatically map call relationships between services. The service topology can be visually confirmed in Grafana's service graph view, with request rate, error rate, and latency displayed on each edge.

Key configuration parameters include:

  • max_items: Maximum number of service pairs to track (default 10000)
  • wait: Wait time for incomplete edges (default 10s)
  • dimensions: Custom labels to add to the service graph
  • histogram_buckets: Latency histogram bucket boundaries (default 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8)
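The client-server pairing at the heart of the processor can be approximated in a few lines. The span fields and the build_edges helper are illustrative; the real processor also handles messaging spans and the wait timeout for incomplete edges:

```python
from collections import Counter

def build_edges(spans):
    """Pair client spans with the server spans they triggered: a server
    span whose parent_id is a client span's span_id forms one
    client->server edge, as the service graph processor does."""
    clients = {s["span_id"]: s for s in spans if s["kind"] == "client"}
    edges = Counter()
    for s in spans:
        if s["kind"] == "server" and s.get("parent_id") in clients:
            edges[(clients[s["parent_id"]]["service"], s["service"])] += 1
    return edges

spans = [
    {"span_id": "a1", "parent_id": None, "kind": "client", "service": "api-gateway"},
    {"span_id": "b1", "parent_id": "a1", "kind": "server", "service": "order-service"},
    {"span_id": "c1", "parent_id": "b2", "kind": "server", "service": "payment-service"},
]
edges = build_edges(spans)
assert edges[("api-gateway", "order-service")] == 1
```

The third span has no matching client within the window, which is the kind of incomplete edge the wait parameter exists to expire.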

Tempo vs Jaeger vs Zipkin Comparison

When selecting a distributed tracing backend, comparing the characteristics of each tool is important.

| Item | Grafana Tempo | Jaeger | Zipkin |
| --- | --- | --- | --- |
| Initial release | 2020 (Grafana Labs) | 2015 (Uber) | 2012 (Twitter) |
| CNCF status | - | Graduated | - |
| Storage method | Object storage (no index) | Elasticsearch, Cassandra, etc. | Elasticsearch, Cassandra, MySQL |
| Indexing | None (Trace ID + TraceQL) | Tag-based index creation | Tag-based index creation |
| Storage cost | Low (S3/GCS pricing) | High (includes index storage) | High |
| Ingestion protocols | OTLP, Jaeger, Zipkin | OTLP, Jaeger | Zipkin, OTLP (limited) |
| Query language | TraceQL | Tag-based search | Tag-based search |
| Built-in UI | Grafana integration | Jaeger UI | Zipkin UI |
| Metrics generation | Built-in (Metrics Generator) | External tools needed | External tools needed |
| Scalability | Excellent (PB scale) | Moderate | Limited |
| Grafana integration | Native | Plugin | Plugin |
| Maintained by | Grafana Labs (commercial support) | CNCF community | Volunteer community |

Selection Criteria Summary: If you already use the Grafana ecosystem and want to store large-scale traces at low cost, Tempo is optimal. If you need an independent tracing system and rich tag-based search is essential, consider Jaeger. For small teams looking to quickly adopt tracing, Zipkin remains a viable option.

OpenTelemetry Collector Integration

The most recommended way to send traces to Tempo is using OpenTelemetry Collector as an intermediate pipeline. The Collector collects traces from various sources, performs batch processing and retries, then reliably sends them to Tempo.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: '0.0.0.0:4317'
      http:
        endpoint: '0.0.0.0:4318'

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 11000

  memory_limiter:
    check_interval: 1s
    limit_mib: 4096
    spike_limit_mib: 512

  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/tempo:
    endpoint: 'tempo:4317'
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  debug:
    verbosity: basic

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: '0.0.0.0:8888'

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo, debug]

The key aspects of this configuration are:

  • tail_sampling: Error spans are collected at 100%, slow traces over 1 second are also fully collected, and the rest are sampled at 10% probability. This ensures important traces are not missed while reducing storage costs.
  • memory_limiter: Limits Collector memory usage to 4GB to prevent OOM.
  • sending_queue: Buffers data in the queue and retries even during temporary Tempo outages.
  • batch: Groups spans into batches of 10,000 for transmission, improving network efficiency.
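The combined effect of the three sampling policies can be simulated; keep_trace below is a simplified stand-in for the Collector's actual decision engine, operating on illustrative span dicts:

```python
import random

def keep_trace(trace, sample_rate=0.10, latency_threshold_ms=1000, rng=random):
    """Decision logic mirroring the three policies above: keep all
    errors, keep all slow traces, sample the rest probabilistically."""
    if any(s.get("status") == "error" for s in trace):
        return True                     # errors-policy
    if max(s.get("duration_ms", 0) for s in trace) > latency_threshold_ms:
        return True                     # slow-traces-policy
    return rng.random() < sample_rate   # probabilistic-policy

rng = random.Random(42)  # seeded for reproducibility
assert keep_trace([{"status": "error", "duration_ms": 20}], rng=rng)
assert keep_trace([{"status": "ok", "duration_ms": 2500}], rng=rng)

kept = sum(keep_trace([{"status": "ok", "duration_ms": 30}], rng=rng)
           for _ in range(10_000))
assert 800 < kept < 1200  # roughly 10% of ordinary traces survive
```

Errors and slow traces are kept at 100% regardless of the probabilistic policy, so storage savings come entirely from the large pool of fast, healthy traces.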

Storage Optimization

Tempo's storage design is centered on object storage. In production environments, choose S3, GCS, or Azure Blob Storage as the backend.

Storage Backend Comparison

| Item | Amazon S3 | Google Cloud Storage | Azure Blob Storage |
| --- | --- | --- | --- |
| Config key | s3 | gcs | azure |
| Authentication | IAM Role, Access Key | Service Account, Workload Identity | Managed Identity, SAS Token |
| Cost (GB/month) | $0.023 (Standard) | $0.020 (Standard) | $0.018 (Hot) |
| Region availability | 33+ regions | 40+ regions | 60+ regions |
| Tempo compatibility | Full support | Full support | Full support |
| Lifecycle policy | S3 Lifecycle | Object Lifecycle | Lifecycle Management |

S3 Backend Configuration Example

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces-prod
      endpoint: s3.ap-northeast-2.amazonaws.com
      region: ap-northeast-2
      access_key: ${S3_ACCESS_KEY}
      secret_key: ${S3_SECRET_KEY}
      # Or omit access_key/secret_key when using IAM Role
    wal:
      path: /var/tempo/wal
    block:
      bloom_filter_false_positive: 0.01
      v2_index_downsample_bytes: 1048576
      v2_encoding: zstd
    blocklist_poll: 5m
    pool:
      max_workers: 200
      queue_depth: 20000

compactor:
  compaction:
    block_retention: 336h # 14-day retention
    compacted_block_retention: 1h
    compaction_window: 4h
    max_block_bytes: 107374182400 # 100GB
    max_compaction_objects: 6000000
    retention_concurrency: 10
  ring:
    kvstore:
      store: memberlist

Storage Optimization Tips

Block Encoding: Setting v2_encoding to zstd achieves approximately 30-40% higher compression ratio compared to snappy, but with slightly increased CPU usage. Choose snappy for write-heavy workloads, or zstd when storage cost is the priority.

Bloom Filter Tuning: Lowering bloom_filter_false_positive (e.g., 0.01 to 0.005) improves query accuracy but increases bloom filter size. In environments with frequent queries, reducing the false positive rate is beneficial for overall performance.
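The size cost of a tighter false positive rate follows from the standard bloom filter formula, independent of Tempo's exact implementation. A quick calculation:

```python
import math

def bloom_bits_per_item(p):
    """Bits per stored item for target false positive rate p
    (standard formula: m/n = -ln(p) / (ln 2)^2)."""
    return -math.log(p) / (math.log(2) ** 2)

for p in (0.05, 0.01, 0.005):
    print(f"fp={p}: {bloom_bits_per_item(p):.1f} bits/item")

# Halving the false positive rate from 0.01 to 0.005
# costs only about 1.4 extra bits per item.
assert bloom_bits_per_item(0.005) > bloom_bits_per_item(0.01)
```

The growth is logarithmic in 1/p, so modest accuracy improvements are cheap, but chasing very small false positive rates inflates the filters that every query must download.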

Block Retention Period: Set block_retention according to business requirements. 14 days (336h) is typical, but compliance requirements may necessitate 90 days or more. In such cases, using object storage lifecycle policies to automatically transition to Infrequent Access (S3) or Nearline (GCS) tiers can reduce costs.
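A back-of-the-envelope cost model helps when choosing block_retention; all figures below (ingest volume, compression ratio, unit price) are assumptions for illustration, not measured values:

```python
def monthly_storage_cost(daily_ingest_gb, retention_days, price_per_gb_month,
                         compression_ratio=1.0):
    """Steady-state object storage cost: retained volume x unit price."""
    stored_gb = daily_ingest_gb / compression_ratio * retention_days
    return stored_gb * price_per_gb_month

# 500 GB/day ingest, 14-day retention, S3 Standard at $0.023/GB-month,
# assuming roughly 2x compression from zstd block encoding
cost = monthly_storage_cost(500, 14, 0.023, compression_ratio=2.0)
assert round(cost, 2) == 80.50
```

Running the same numbers at 90-day retention multiplies the bill by about 6.4x, which is when lifecycle transitions to cheaper tiers start to matter.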

Compactor Tuning: Setting max_block_bytes too high causes Compactor memory usage to spike, while setting it too low increases the number of blocks and degrades query performance. Around 100GB is a balanced value.

Grafana Dashboard Configuration

Tempo integrates natively with Grafana, providing rich tracing visualization without a separate UI. Below are the Grafana datasource provisioning configuration and dashboard configuration examples.

Datasource Provisioning

# grafana-datasources.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      httpMethod: GET
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: service
          - key: http.method
            value: method
      tracesToProfiles:
        datasourceUid: pyroscope
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
        tags:
          - key: service.name
            value: service_name
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      traceQuery:
        timeShiftEnabled: true
        spanStartTimeShift: '-30m'
        spanEndTimeShift: '30m'

Dashboard JSON Snippet

The following is a Grafana dashboard panel configuration showing request rate and error rate by service.

{
  "panels": [
    {
      "title": "Service Request Rate",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "sum(rate(traces_spanmetrics_calls_total{status_code!=\"STATUS_CODE_ERROR\"}[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": { "drawStyle": "line", "lineWidth": 2 }
        }
      }
    },
    {
      "title": "Service Error Rate",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "sum(rate(traces_spanmetrics_calls_total{status_code=\"STATUS_CODE_ERROR\"}[5m])) by (service) / sum(rate(traces_spanmetrics_calls_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      }
    },
    {
      "title": "P99 Latency by Service",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "s" }
      }
    }
  ]
}

Key Integration Features

When using Tempo in Grafana, the most powerful features are the three cross-datasource integrations: Traces to Logs, Traces to Metrics, and Traces to Profiles.

  • Traces to Logs: Clicking a specific span in the trace view navigates directly to Loki logs for that time window. It automatically filters by Trace ID and Span ID, showing only related logs.
  • Traces to Metrics: You can jump to Prometheus metric queries based on span attributes. When slow spans are found, you can immediately check CPU and memory metrics for that service.
  • Traces to Profiles: When integrated with Pyroscope, you can trace the cause of slow spans down to the code level (function call profiles).

Troubleshooting

This section covers common issues and solutions encountered when operating Tempo.

Ingester Out of Memory (OOM)

Symptom: Ingester Pods repeatedly restart with OOMKilled status.

Cause: In-memory blocks become excessively large due to traffic spikes, or max_block_duration is set too long.

Solution: Reduce ingester.max_block_duration to 5 minutes to shorten the flush cycle, and limit ingester.max_block_bytes to a range of 500MB to 1GB. Kubernetes resource requests and limits should also be set sufficiently. Increasing the number of Ingester instances to distribute load is also effective.

TraceQL Query Timeout

Symptom: "context deadline exceeded" errors occur repeatedly during TraceQL searches.

Cause: Occurs when there are too many blocks (Compactor not functioning) or the search scope is too broad.

Solution: Verify that the Compactor is operating normally and adjust compaction_window appropriately. Set query_frontend.max_retries to 3 and limit results with query_frontend.search.default_result_limit. Narrowing the query time range is also an immediate mitigation.

Missing Spans

Symptom: Some spans are missing from traces, resulting in incomplete trace queries.

Cause: Often caused by hash ring inconsistency between Distributor and Ingester, network partitions, or sampling policy mismatches.

Solution: Check for "ring not healthy" messages in distributor logs. Verify that the Memberlist communication port (default 7946) is open in the firewall. Validate that the OTel Collector's tail_sampling policy is working as intended, and temporarily enable the debug exporter to trace span flow.

Compactor Block Merge Failure

Symptom: The number of blocks in object storage keeps increasing and query performance gradually degrades.

Cause: Compactor memory shortage, object storage permission issues, or max_compaction_objects limit exceeded.

Solution: Increase the Compactor's memory allocation and reconfirm storage IAM permissions (ListBucket, GetObject, PutObject, DeleteObject). Gradually increase compaction.max_compaction_objects to handle large blocks.

Operations Checklist

This is a checklist for reliably operating Tempo in production environments.

Pre-deployment Checks

  • Determine deployment mode (based on daily traffic: under 100GB Monolithic, 100GB-1TB Scalable, over 1TB Microservices)
  • Create object storage bucket and configure IAM permissions
  • Verify disk IOPS for WAL storage path (SSD recommended, minimum 3000 IOPS)
  • Configure network policies (Memberlist 7946/TCP, OTLP 4317-4318/TCP)
  • Provision TLS certificates (mTLS recommended)
  • Set resource requests/limits (Ingester: minimum 4GB RAM, Compactor: minimum 8GB RAM)

Essential Monitoring Metrics

  • tempo_ingester_live_traces: Active trace count (memory pressure indicator)
  • tempo_ingester_bytes_received_total: Bytes received per second
  • tempo_compactor_blocks_total: Object storage block count (alert on sustained increase)
  • tempo_distributor_spans_received_total: Received span count (check for drops)
  • tempo_query_frontend_queries_total: Query throughput and error rate
  • tempo_discarded_spans_total: Discarded span count (investigate immediately if non-zero)
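For ad-hoc checks, a minimal stdlib parser can pull one of these counters out of a /metrics scrape. The label sets shown are illustrative, and a production setup would alert through Prometheus rules rather than a script like this:

```python
def parse_counter(metrics_text, name):
    """Extract samples of one counter from Prometheus exposition format.
    Returns {label_string: value}; a minimal parser for illustration only."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(name) or line.startswith("#"):
            continue
        metric, value = line.rsplit(" ", 1)
        samples[metric[len(name):]] = float(value)
    return samples

scrape = """\
# TYPE tempo_discarded_spans_total counter
tempo_discarded_spans_total{reason="rate_limited",tenant="single-tenant"} 42
tempo_discarded_spans_total{reason="trace_too_large",tenant="single-tenant"} 0
"""
discarded = parse_counter(scrape, "tempo_discarded_spans_total")
assert sum(discarded.values()) == 42  # non-zero: investigate immediately
```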

Regular Inspection Items

  • Weekly: Check Compactor block merge status, monitor block count trends
  • Weekly: Check WAL disk usage and verify flush operation
  • Monthly: Review storage costs and reassess retention periods
  • Monthly: Benchmark TraceQL query performance (track response times for key query patterns)
  • Quarterly: Plan Tempo version upgrades and conduct compatibility tests

Failure Cases and Recovery

Case 1: Data Loss Due to Ingester WAL Corruption

Situation: An unexpected Kubernetes node shutdown corrupted the WAL on 2 out of 3 Ingesters. The Ingesters failed to recover WAL on restart, resulting in approximately 15 minutes of trace data loss.

Recovery Process: First, manually cleared the corrupted WAL directories and restarted the Ingesters. For the lost time window, partial recovery was achieved by resending some data buffered in the OTel Collector's sending_queue.

Lessons Learned: Set the Ingester's replication_factor to 3 so that identical spans are replicated to at least 2 Ingesters. Fixed the WAL path to local NVMe SSD and changed the PV (PersistentVolume) reclaimPolicy to Retain to preserve WAL even during Pod rescheduling. Increased Ingester Pod's terminationGracePeriodSeconds to 300 seconds to allow flush time during shutdown.

Case 2: Query Performance Collapse Due to Compactor Failure

Situation: After an S3 IAM policy change, the Compactor lost DeleteObject permissions, and block merging was interrupted for 2 weeks. Over 500,000 small blocks accumulated, causing TraceQL search response time to surge from the usual 2 seconds to 45 seconds.

Recovery Process: The S3 IAM policy was immediately corrected and the Compactor was restarted. However, attempting to merge 500,000 blocks at once caused Compactor OOM. By lowering compaction.max_compaction_objects from 1 million to 100,000 and reducing compaction_window to 1 hour, blocks were gradually merged. Full normalization took 3 days.

Lessons Learned: Set up an alarm on the tempo_compactor_blocks_total metric to receive immediate notification when the block count increases abnormally. Added a check item to the change management process to verify whether Tempo-related permissions are affected when IAM policies change.

Case 3: Cardinality Explosion from Indiscriminate Custom Attributes

Situation: The development team indiscriminately added user IDs (user.id) as span attributes, and this attribute was included in the Metrics Generator's dimensions, causing cardinality to explode to millions. Prometheus remote write became a bottleneck, delaying the entire metrics collection.

Recovery Process: Immediately removed user.id from dimensions and restarted the Metrics Generator. Deleted the affected time series in Prometheus to reclaim storage.

Lessons Learned: Always verify the cardinality of attributes added to dimensions in advance. Established a policy where attributes that could exceed 1000 cardinality are used only for TraceQL search instead of as metric labels. Also added a safety measure by setting overrides.defaults.metrics_generator.max_active_series to limit the number of time series.
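The pre-check itself is simple arithmetic: the worst-case series count is the product of each label's distinct-value count. estimated_series below is a hypothetical helper illustrating that upper bound:

```python
from math import prod

def estimated_series(label_cardinalities):
    """Worst-case active series count: the product of per-label distinct
    value counts. Real cardinality is usually lower, but this upper
    bound shows why unbounded labels like user.id are dangerous."""
    return prod(label_cardinalities.values())

safe = {"service": 50, "http.method": 5, "http.status_code": 10}
risky = dict(safe, **{"user.id": 100_000})

assert estimated_series(safe) == 2_500
assert estimated_series(risky) == 250_000_000  # one unbounded label explodes it
```

A bounded label set stays in the thousands of series, while a single unbounded dimension pushes the bound into the hundreds of millions, which is the failure mode described above.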
