Grafana Tempo Distributed Tracing and TraceQL Operations Guide 2026

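Overview
Tempo Architecture
- Core Components
- Data Flow
Deployment Modes
Quick Start with Docker Compose
TraceQL Query Syntax
Span Metrics and Service Graphs
- Span Metrics Generator
- Service Graph Generator
Tempo vs Jaeger vs Zipkin Comparison
OpenTelemetry Collector Integration
Storage Optimization
Grafana Dashboard Configuration
Troubleshooting
Operations Checklist
Failure Cases and Recovery
References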

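Overview

As microservice architectures have become mainstream, it is now routine for a single request to pass through dozens of services. In such environments, distributed tracing is essential for finding the root cause of a failure. Grafana Tempo is an open-source distributed tracing backend released by Grafana Labs in 2020. It needs only object storage to run, which reduces both infrastructure complexity and cost.

Tempo's core philosophy is simple. It does not build a separate search index over trace data. Instead, it finds spans through Trace ID lookups and the TraceQL query engine. This keeps storage costs well below those of an index-backed Jaeger or Zipkin deployment and allows petabyte-scale trace retention.

This article covers Tempo's internal architecture, its three deployment modes, TraceQL query syntax, span metrics and service graphs, OpenTelemetry Collector integration, storage optimization, Grafana dashboard configuration, troubleshooting, and failure cases and recoveries from production operations.

Tempo Architecture

Internally, Tempo consists of several components that cooperate to collect, store, and query trace data. Understanding each component's role makes it faster to locate the bottleneck when something fails.

Core Components

The Distributor is the entry point that receives span data from clients. It accepts several protocols, including Jaeger, Zipkin, and OpenTelemetry (OTLP), and routes each span to the appropriate Ingester by hashing its Trace ID onto a consistent hash ring.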
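The routing step can be sketched as follows. This is a minimal illustration of consistent hashing under assumed names (`HashRing`, `owners`, the SHA-256 token function, and 64 tokens per ingester are all hypothetical); it is not Tempo's actual ring implementation.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Map a string to a 32-bit ring token (illustrative hash, not Tempo's)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


class HashRing:
    """Toy consistent hash ring: each ingester owns several tokens."""

    def __init__(self, ingesters, tokens_per_ingester=64):
        self._ring = sorted(
            (_token(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]

    def owners(self, trace_id: str, replication_factor: int = 1):
        """Walk clockwise from the trace's token and collect distinct ingesters."""
        start = bisect.bisect_right(self._tokens, _token(trace_id))
        found = []
        for offset in range(len(self._ring)):
            name = self._ring[(start + offset) % len(self._ring)][1]
            if name not in found:
                found.append(name)
            if len(found) == replication_factor:
                break
        return found


ring = HashRing(["ingester-0", "ingester-1", "ingester-2"])
# Every span of the same trace lands on the same ingesters.
print(ring.owners("4bf92f3577b34da6a3ce929d0e0e4736", replication_factor=2))
```

Because the token depends only on the Trace ID, all spans of one trace reach the same set of Ingesters regardless of which Distributor instance received them.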

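The Ingester batches received spans into blocks, building the bloom filters and the lightweight Trace ID index that each block carries, and flushes the blocks to object storage after a configured period. It maintains a WAL (Write-Ahead Log) so that data loss is minimized even if the process terminates abnormally.

The Query Frontend is the component that clients such as Grafana call for Trace ID lookups and TraceQL searches. It splits each request across multiple Queriers, which search blocks in parallel to shorten response time.

The Querier is the worker that actually processes the requests handed over by the Query Frontend. It searches both the Ingesters' in-memory data and the blocks in object storage, then combines the results.

The Compactor periodically merges the small blocks in object storage into larger ones. This improves query performance and reduces storage usage.

The Metrics Generator is an optional component that derives RED (Rate, Error, Duration) metrics and service graphs from incoming spans. The generated metrics are sent to Mimir or Prometheus through Prometheus-compatible remote write.

Data Flow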

[Application] --> [OTel Collector] --> [Distributor]
                                           |
                                    [Hash Ring]
                                           |
                                      [Ingester]
                                       /      \
                              [WAL]         [Object Storage]
                                                  |
                              [Compactor] <-------+
                                                  |
                              [Query Frontend] ---+---> [Querier]

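Spans travel from the application through the OTel Collector to the Distributor, which spreads them across Ingesters through the hash ring. Each Ingester writes to its WAL first and then flushes a block to object storage at the configured interval (default 30 minutes). The Compactor merges small blocks, and the Querier searches both Ingester memory and object storage.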
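The write path described above can be sketched as a toy model. The class and field names below are hypothetical, and the flush check runs only when a span arrives, whereas a real Ingester also flushes on a timer. It only illustrates the "WAL first, then cut a block by age or size" rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IngesterSketch:
    """Toy ingester: append to a WAL, cut a block when it is too old or too big."""

    max_block_duration_s: float = 30 * 60  # mirrors max_block_duration
    max_block_bytes: int = 1 << 30         # mirrors max_block_bytes
    wal: list = field(default_factory=list)
    flushed_blocks: list = field(default_factory=list)
    _block_started_at: Optional[float] = None
    _block_bytes: int = 0

    def push(self, span: bytes, now: float) -> None:
        self.wal.append(span)  # durability first: record the span in the WAL
        if self._block_started_at is None:
            self._block_started_at = now
        self._block_bytes += len(span)
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        too_old = now - self._block_started_at >= self.max_block_duration_s
        too_big = self._block_bytes >= self.max_block_bytes
        if too_old or too_big:
            # Stand-in for "write the block to object storage".
            self.flushed_blocks.append(list(self.wal))
            self.wal.clear()
            self._block_started_at = None
            self._block_bytes = 0


ingester = IngesterSketch(max_block_duration_s=300, max_block_bytes=1024)
ingester.push(b"span-a", now=0.0)
ingester.push(b"span-b", now=301.0)  # block is older than 300s, so it is flushed
print(len(ingester.flushed_blocks), len(ingester.wal))
```

Shortening the duration or lowering the byte limit trades more, smaller blocks (more Compactor work) for less data held in Ingester memory.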

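Deployment Modes

Tempo offers three deployment modes, chosen according to the organization's scale and requirements.

Deployment Mode Comparison

Item	Monolithic	Scalable Single Binary	Microservices
Structure	Single binary, single process	Single binary, multiple instances	Independent process per component
Scalability	Vertical scaling only	Horizontal scaling	Independent horizontal scaling per component
Recommended traffic	Under 100GB/day	100GB to 1TB/day	Over 1TB/day
Operational complexity	Low	Medium	High
High availability	Limited	Basic support	Full support
Suitable environment	Dev/test, small scale	Medium-scale production	Large-scale production, multi-tenant
Kubernetes required	No	Recommended	Required

Monolithic Mode

All components run in a single process. This mode suits local environments and small workloads, and it is the simplest to configure.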

# tempo.yaml (Monolithic); the Docker Compose example mounts this file as ./tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: '0.0.0.0:4317'
        http:
          endpoint: '0.0.0.0:4318'
    jaeger:
      protocols:
        thrift_http:
          endpoint: '0.0.0.0:14268'
    zipkin:
      endpoint: '0.0.0.0:9411'

ingester:
  max_block_duration: 5m
  max_block_bytes: 1073741824 # 1GB

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

compactor:
  compaction:
    block_retention: 72h

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: local
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
        - deployment.environment
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - http.route

overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Scalable Single Binary Mode

This mode scales horizontally by running the same binary as multiple instances. It sits between Monolithic and Microservices, adding scalability without much extra configuration. Each instance is started with the target flag set to scalable-single-binary.

Microservices Mode

Each component is deployed as an independent process and can be scaled on its own. In large environments you can scale out a single component (for example, the Ingester) or size the Queriers to match query traffic. On Kubernetes, the tempo-distributed Helm chart simplifies deployment.

Quick Start with Docker Compose

Docker Compose is the quickest way to try Tempo locally. The configuration below starts Tempo (Monolithic), the OTel Collector, Grafana, and Prometheus together.

# docker-compose.yaml
version: '3.9'

services:
  tempo:
    image: grafana/tempo:2.7.1
    command: ['-config.file=/etc/tempo/tempo.yaml']
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - '3200:3200' # Tempo HTTP API
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
      - '9411:9411' # Zipkin
      - '14268:14268' # Jaeger HTTP
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.118.0
    command: ['--config=/etc/otel-collector/config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector/config.yaml
    ports:
      - '4327:4317' # OTLP gRPC (host port for applications)
      - '4328:4318' # OTLP HTTP
    depends_on:
      - tempo
    networks:
      - observability

  prometheus:
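    image: prom/prometheus:v3.2.1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver' # accept remote_write from Tempo's metrics_generator
      - '--enable-feature=exemplar-storage' # store the exemplars sent with send_exemplars: true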
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
    networks:
      - observability

  grafana:
    image: grafana/grafana:11.5.2
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    ports:
      - '3000:3000'
    depends_on:
      - tempo
      - prometheus
    networks:
      - observability

volumes:
  tempo-data:

networks:
  observability:
    driver: bridge

After running docker compose up -d, open Grafana at http://localhost:3000. The Tempo datasource is provisioned automatically, so you can search traces right away.

TraceQL Query Syntax

TraceQL is Tempo's query language, with a syntax modeled on PromQL and LogQL. Curly braces {} select a spanset, and the pipe operator chains filters and aggregations.

Basic Structure

A TraceQL query is built from three kinds of elements.

Intrinsics: built-in span properties (name, status, duration, kind, rootName, rootServiceName, traceDuration)
Attributes: custom key-value pairs, addressed with a scope prefix (span., resource., link., event.)
Operators: comparison (=, !=, >, <, >=, <=), regex (=~, !~), logical (&&, ||), structural (>, >>, <, <<, ~)

TraceQL Query Examples

// 1. Find error spans for a specific service
{ resource.service.name = "payment-service" && status = error }

// 2. HTTP GET request spans that took longer than 500ms
{ span.http.method = "GET" && duration > 500ms }

// 3. Spans that returned a 5xx response on a specific route
{ span.http.route = "/api/v1/orders" && span.http.status_code >= 500 }

// 4. Call relationship between two services (descendant operator)
{ resource.service.name = "api-gateway" } >> { resource.service.name = "order-service" }

// 5. Spans in a direct parent-child relationship
{ resource.service.name = "frontend" } > { span.http.status_code = 503 }

// 6. Sibling spans
{ span.db.system = "postgresql" } ~ { span.db.system = "redis" }

// 7. Span name matching with a regular expression
{ name =~ "HTTP.*POST" && resource.deployment.environment = "production" }

// 8. Filter by total trace duration
{ traceDuration > 3s }

// 9. Filter by root service
{ rootServiceName = "ingress-nginx" && duration > 1s }

// 10. Span rate over time for one service
{ resource.service.name = "checkout-service" } | rate()

// 11. Latency distribution as a histogram over time
{ resource.service.name = "search-service" } | histogram_over_time(duration)

// 12. Traces that contain more than 100 error spans (count-based anomaly detection)
{ status = error } | count() > 100
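Queries like these can also be sent to Tempo's HTTP search endpoint (`/api/search`) instead of being typed into Grafana. The sketch below only builds the request URL with the standard library; the base URL `http://localhost:3200` matches the Docker Compose example, and the helper name `build_search_url` is hypothetical.

```python
import urllib.parse


def build_search_url(base_url: str, traceql: str, start: int, end: int, limit: int = 20) -> str:
    """Build a Tempo search URL for a TraceQL query.

    start and end are Unix timestamps in seconds.
    """
    params = urllib.parse.urlencode(
        {"q": traceql, "start": int(start), "end": int(end), "limit": limit}
    )
    return f"{base_url.rstrip('/')}/api/search?{params}"


url = build_search_url(
    "http://localhost:3200",
    '{ resource.service.name = "payment-service" && status = error }',
    start=1735689600,
    end=1735693200,
)
print(url)
# Fetch it with any HTTP client, for example urllib.request.urlopen(url);
# the JSON response lists the matching traces.
```

Keeping the time range narrow in `start` and `end` is the cheapest way to keep such searches fast, because it limits how many blocks the Queriers must scan.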

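Key Aggregation Functions

Function	Description	Example
`rate()`	Matching spans per second over time	`{ } | rate()`
`count()`	Number of matching spans per trace (compared against a value)	`{ status = error } | count() > 3`
`avg(field)`	Average of a field across the spanset	`{ } | avg(duration) > 200ms`
`max(field)`	Maximum of a field across the spanset	`{ } | max(duration) > 1s`
`min(field)`	Minimum of a field across the spanset	`{ } | min(duration) > 10ms`
`histogram_over_time(field)`	Histogram per time interval	`{ } | histogram_over_time(duration)`
`quantile_over_time(field, q, ...)`	Quantiles per time interval (for example p50/p90/p95/p99)	`{ } | quantile_over_time(duration, .5, .9, .95, .99)`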

Span Metrics and Service Graphs

Tempo's Metrics Generator derives metrics from incoming spans automatically. Without any separate metrics instrumentation, trace data alone yields RED metrics and a service dependency graph.

Span Metrics Generator

The span metrics processor turns every incoming span into Prometheus metrics for request rate, error rate, and latency distribution. The main metrics it produces are:

traces_spanmetrics_calls_total: total number of span calls
traces_spanmetrics_latency_bucket: latency histogram buckets
traces_spanmetrics_size_total: total span size

The dimensions setting adds span attributes such as http.method, http.status_code, and http.route as metric labels, so RED metrics can be observed per endpoint.
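How spans turn into labeled counters and cumulative histogram buckets can be sketched as below. The span dictionaries, the bucket boundaries in `BUCKETS_S`, and the function name are assumptions made for illustration; the real processor runs inside Tempo and its label handling is more involved.

```python
import bisect
from collections import Counter, defaultdict

# Illustrative bucket upper bounds in seconds; the final implicit bucket is +Inf.
BUCKETS_S = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]


def aggregate_span_metrics(spans, dimensions):
    """Return (calls, latency_buckets), both keyed by a tuple of (label, value) pairs."""
    calls = Counter()
    latency_buckets = defaultdict(lambda: [0] * (len(BUCKETS_S) + 1))
    for span in spans:
        labels = (
            ("service", span["service"]),
            ("status_code", span["status_code"]),
        ) + tuple(
            # Dots are not valid in Prometheus label names, so replace them.
            (dim.replace(".", "_"), span["attributes"].get(dim, ""))
            for dim in dimensions
        )
        calls[labels] += 1
        # Prometheus histograms are cumulative: bump every bucket with le >= duration.
        first = bisect.bisect_left(BUCKETS_S, span["duration_s"])
        for i in range(first, len(BUCKETS_S) + 1):
            latency_buckets[labels][i] += 1
    return calls, latency_buckets


spans = [
    {"service": "orders", "status_code": "STATUS_CODE_OK", "duration_s": 0.05,
     "attributes": {"http.method": "GET"}},
    {"service": "orders", "status_code": "STATUS_CODE_ERROR", "duration_s": 2.0,
     "attributes": {"http.method": "GET"}},
]
calls, buckets = aggregate_span_metrics(spans, ["http.method"])
for labels, total in calls.items():
    print(dict(labels), total)
```

Each distinct combination of label values becomes its own time series, which is why every added dimension multiplies the series count.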

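Service Graph Generator

The service graph processor pairs client and server spans to map the call relationships between services. Grafana's service graph view shows the resulting topology, with request rate, error rate, and latency on each edge.

The main configuration parameters are:

max_items: maximum number of service pairs to track (default 10000)
wait: how long to wait for an incomplete edge (default 10s)
dimensions: custom labels to add to the service graph
histogram_buckets: latency histogram bucket boundaries (default 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8)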
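The pairing idea can be sketched in a few lines. The span fields (`span_id`, `parent_span_id`, `kind`, `service`, `status`) and the function name are hypothetical, and the sketch simply skips unpaired spans rather than waiting for them as the wait setting does.

```python
from collections import defaultdict


def build_edges(spans):
    """Pair each server span with its parent client span and count calls per edge."""
    clients = {s["span_id"]: s for s in spans if s["kind"] == "client"}
    edges = defaultdict(lambda: {"total": 0, "failed": 0})
    for span in spans:
        if span["kind"] != "server":
            continue
        client = clients.get(span.get("parent_span_id"))
        if client is None:
            continue  # incomplete edge: the matching client span has not been seen
        edge = edges[(client["service"], span["service"])]
        edge["total"] += 1
        if "error" in (client["status"], span["status"]):
            edge["failed"] += 1
    return dict(edges)


spans = [
    {"span_id": "c1", "parent_span_id": None, "kind": "client",
     "service": "api-gateway", "status": "ok"},
    {"span_id": "s1", "parent_span_id": "c1", "kind": "server",
     "service": "order-service", "status": "error"},
]
print(build_edges(spans))
```

An edge exists only when both halves of a call were traced, so a service that drops or does not propagate trace context will be missing from the graph.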

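Tempo vs Jaeger vs Zipkin Comparison

When choosing a distributed tracing backend, it helps to compare the characteristics of each tool.

Item	Grafana Tempo	Jaeger	Zipkin
Initial release	2020 (Grafana Labs)	2015 (Uber)	2012 (Twitter)
CNCF status	-	Graduated	-
Storage method	Object storage (no index)	Elasticsearch, Cassandra, etc.	Elasticsearch, Cassandra, MySQL
Indexing	None (Trace ID + TraceQL)	Tag-based index	Tag-based index
Storage cost	Low (S3/GCS pricing)	High (includes index storage)	High
Ingestion protocols	OTLP, Jaeger, Zipkin	OTLP, Jaeger	Zipkin, OTLP (limited)
Query language	TraceQL	Tag-based search	Tag-based search
Built-in UI	Grafana integration	Jaeger UI	Zipkin UI
Metrics generation	Built in (Metrics Generator)	Requires external tools	Requires external tools
Scalability	Excellent (PB scale)	Moderate	Limited
Grafana integration	Native	Plugin	Plugin
Maintainer	Grafana Labs (commercial support)	CNCF community	Volunteer community

Selection summary: if you already use the Grafana ecosystem and want to retain large volumes of traces at low cost, Tempo is the best fit. If you need a standalone tracing system and rich tag-based search is central, consider Jaeger. For a small team that wants to adopt tracing quickly, Zipkin remains a valid choice.

OpenTelemetry Collector Integration

The recommended way to send traces to Tempo is to place the OpenTelemetry Collector in between as a pipeline. The Collector gathers traces from many sources, batches and retries them, and delivers them to Tempo reliably.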

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: '0.0.0.0:4317'
      http:
        endpoint: '0.0.0.0:4318'

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 11000

  memory_limiter:
    check_interval: 1s
    limit_mib: 4096
    spike_limit_mib: 512

  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/tempo:
    endpoint: 'tempo:4317'
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  debug:
    verbosity: basic

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: '0.0.0.0:8888'

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo, debug]

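The key parts of this configuration are:

tail_sampling: keeps every trace that contains an error span and every trace slower than 1 second, and samples the remainder at 10%. Important traces are never dropped, while storage cost falls.
memory_limiter: caps Collector memory usage at 4GB to prevent OOM.
sending_queue: buffers data in a queue and retries during a temporary Tempo outage.
batch: groups spans into batches of 10,000 to improve network efficiency.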
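The combined effect of the three tail sampling policies can be sketched as one decision function. The span fields and the hash-based probabilistic step are assumptions for illustration; the Collector's processor evaluates its policies internally and differently in detail.

```python
import hashlib


def keep_trace(trace_id, spans, latency_threshold_ms=1000, sampling_percentage=10):
    """Keep a trace if it has an error, is slow, or falls in the sampled share."""
    # errors-policy: any span with an ERROR status keeps the whole trace.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # slow-traces-policy: measure from the earliest start to the latest end.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > latency_threshold_ms:
        return True
    # probabilistic-policy: a stable hash of the trace ID selects roughly N percent.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < sampling_percentage


fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]
slow_ok = [{"status": "OK", "start_ms": 0, "end_ms": 2500}]
failed = [{"status": "ERROR", "start_ms": 0, "end_ms": 40}]
print(keep_trace("trace-a", failed), keep_trace("trace-b", slow_ok))
print(keep_trace("trace-c", fast_ok))
```

The decision needs every span of a trace, which is why the processor holds traces in memory for decision_wait before deciding and why all spans of a trace must reach the same Collector instance.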

Storage Optimization

Tempo's storage design is centered on object storage. In production, choose S3, GCS, or Azure Blob Storage as the backend.

Storage Backend Comparison

Item	Amazon S3	Google Cloud Storage	Azure Blob Storage
Config key	`s3`	`gcs`	`azure`
Authentication	IAM Role, Access Key	Service Account, Workload Identity	Managed Identity, SAS Token
Cost (GB/month, approximate, varies by region)	$0.023 (Standard)	$0.020 (Standard)	$0.018 (Hot)
Region availability	33+ regions	40+ regions	60+ regions
Tempo compatibility	Fully supported	Fully supported	Fully supported
Lifecycle policy	S3 Lifecycle	Object Lifecycle	Lifecycle Management

S3 Backend Configuration Example

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces-prod
      endpoint: s3.ap-northeast-2.amazonaws.com
      region: ap-northeast-2
      access_key: ${S3_ACCESS_KEY}
      secret_key: ${S3_SECRET_KEY}
      # Omit access_key/secret_key when using an IAM role.
      # ${VAR} references are expanded only when Tempo runs with -config.expand-env=true.
    wal:
      path: /var/tempo/wal
    block:
      bloom_filter_false_positive: 0.01
      v2_index_downsample_bytes: 1048576
      v2_encoding: zstd
    blocklist_poll: 5m
    pool:
      max_workers: 200
      queue_depth: 20000

compactor:
  compaction:
    block_retention: 336h # 14-day retention
    compacted_block_retention: 1h
    compaction_window: 4h
    max_block_bytes: 107374182400 # 100GB
    max_compaction_objects: 6000000
    retention_concurrency: 10
  ring:
    kvstore:
      store: memberlist

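Storage Optimization Tips

Block encoding: setting v2_encoding to zstd yields roughly 30-40% better compression than snappy, depending on the workload, at the cost of somewhat higher CPU usage. Choose snappy for write-heavy workloads and zstd when storage cost comes first. The v2_ prefix marks options for the older v2 block format, so check which block version your deployment writes before tuning them.

Bloom filter tuning: lowering bloom_filter_false_positive (for example, 0.01 -> 0.005) reduces the number of blocks read unnecessarily during Trace ID lookups, but makes each bloom filter larger. In environments with frequent queries, a lower false-positive rate usually improves overall performance.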
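The size side of that trade-off follows from the standard bloom filter formula, bits = -n * ln(p) / (ln 2)^2 for n items and false-positive rate p. The sketch below applies it to a hypothetical block of one million traces; how Tempo actually shards and sizes its filters may differ.

```python
import math


def bloom_bits_per_item(false_positive_rate: float) -> float:
    """Optimal bits per stored item for a target false-positive rate."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def bloom_size_bytes(items: int, false_positive_rate: float) -> int:
    """Approximate filter size in bytes for the given number of items."""
    return math.ceil(items * bloom_bits_per_item(false_positive_rate) / 8)


traces_per_block = 1_000_000  # hypothetical block size
for rate in (0.01, 0.005):
    size_mib = bloom_size_bytes(traces_per_block, rate) / (1024 * 1024)
    print(f"p={rate}: {bloom_bits_per_item(rate):.2f} bits/trace, {size_mib:.2f} MiB")
```

Halving the false-positive rate costs only about 1.4 extra bits per trace, so the larger filter is usually cheap relative to the block reads it avoids.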

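Block retention: set block_retention to match business requirements. 14 days (336h) is common, but environments with compliance requirements may need 90 days or more. In that case, use the object store's lifecycle policy to move blocks automatically to the Infrequent Access (S3) or Nearline (GCS) tier and reduce cost.

Compactor tuning: setting max_block_bytes too high makes Compactor memory usage spike, while setting it too low increases the number of blocks and slows queries. Around 100GB is a balanced value.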
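The cost effect of a retention choice is simple arithmetic: in steady state the bucket holds roughly the daily stored volume times the retention in days. The sketch below uses the S3 Standard price from the table above and a hypothetical 100GB of stored (already compressed) blocks per day; request and transfer charges are ignored.

```python
import math


def monthly_storage_cost(stored_gb_per_day: float, retention_days: int, price_per_gb_month: float) -> float:
    """Steady-state monthly storage cost in dollars for a given retention."""
    steady_state_gb = stored_gb_per_day * retention_days
    return steady_state_gb * price_per_gb_month


for days in (14, 90):
    cost = monthly_storage_cost(100, days, 0.023)
    print(f"{days} days retention: about ${cost:,.2f} per month")
```

Cost grows linearly with retention, which is why moving older blocks to a cheaper tier matters most for the long-retention case.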

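Grafana Dashboard Configuration

Tempo integrates natively with Grafana, so rich trace visualization is available without a separate UI. The following shows datasource provisioning and an example dashboard.

Datasource Provisioning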

# grafana-datasources.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      httpMethod: GET
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: service
          - key: http.method
            value: method
      tracesToProfiles:
        datasourceUid: pyroscope
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
        tags:
          - key: service.name
            value: service_name
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      traceQuery:
        timeShiftEnabled: true
        spanStartTimeShift: '-30m'
        spanEndTimeShift: '30m'

Dashboard JSON Snippet

The following Grafana panel definitions show request rate, error rate, and P99 latency per service.

{
  "panels": [
    {
      "title": "Service Request Rate",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "sum(rate(traces_spanmetrics_calls_total{status_code!=\"STATUS_CODE_ERROR\"}[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": { "drawStyle": "line", "lineWidth": 2 }
        }
      }
    },
    {
      "title": "Service Error Rate",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "sum(rate(traces_spanmetrics_calls_total{status_code=\"STATUS_CODE_ERROR\"}[5m])) by (service) / sum(rate(traces_spanmetrics_calls_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      }
    },
    {
      "title": "P99 Latency by Service",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "s" }
      }
    }
  ]
}

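Key Integration Features

The most powerful features when using Tempo in Grafana are the three cross-datasource links: Traces to Logs, Traces to Metrics, and Traces to Profiles.

Traces to Logs: clicking a span in the trace view jumps to the Loki logs for that time range. The logs are filtered automatically by Trace ID and Span ID, so only related lines are shown.
Traces to Metrics: jumps to a Prometheus query built from span attributes. When you find a slow span, you can check that service's CPU and memory metrics immediately.
Traces to Profiles: with Pyroscope connected, the cause of a slow span can be traced down to code level (function call profiles).

Troubleshooting

This section summarizes problems that commonly occur when operating Tempo and how to resolve them.

Ingester Out of Memory (OOM)

Symptom: Ingester Pods restart repeatedly with an OOMKilled status.

Cause: a traffic spike makes in-memory blocks grow too large, or max_block_duration is set too long.

Resolution: reduce ingester.max_block_duration to 5 minutes to shorten the flush interval, and limit ingester.max_block_bytes to the 500MB to 1GB range. Set Kubernetes resource requests and limits generously. Adding Ingester instances to spread the load is also effective.

TraceQL Query Timeouts

Symptom: TraceQL searches repeatedly fail with "context deadline exceeded".

Cause: there are too many blocks (the Compactor is not running), or the search range is too wide.

Resolution: confirm that the Compactor is working and adjust compaction_window as needed. Set query_frontend.max_retries to 3 and limit the number of results with query_frontend.search.default_result_limit. Narrowing the query time range is an immediate mitigation.

Missing Spans

Symptom: some spans are missing, so traces appear incomplete.

Cause: usually a hash ring mismatch between Distributor and Ingester, a network partition, or inconsistent sampling policies.

Resolution: look for "ring not healthy" messages in the Distributor logs. Check that the Memberlist port (default 7946) is open in the firewall. Verify that the OTel Collector's tail_sampling policies behave as intended, and temporarily enable the debug exporter to follow the span flow.

Compactor Block Merge Failures

Symptom: the number of blocks in object storage keeps growing and query performance degrades gradually.

Cause: insufficient Compactor memory, missing object storage permissions, or exceeding the max_compaction_objects limit.

Resolution: increase the Compactor's memory allocation and recheck the storage IAM permissions (ListBucket, GetObject, PutObject, DeleteObject). Raise compaction.max_compaction_objects step by step so that large blocks can also be processed.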

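Operations Checklist

A checklist for running Tempo reliably in production.

Pre-deployment Checks

Choose a deployment mode (by daily traffic: Monolithic under 100GB, Scalable 100GB to 1TB, Microservices over 1TB)
Create the object storage bucket and configure IAM permissions
Verify disk IOPS on the WAL path (SSD recommended, at least 3000 IOPS)
Configure network policies (Memberlist 7946/TCP, OTLP 4317-4318/TCP)
Provision TLS certificates (mTLS recommended)
Set resource requests and limits (Ingester: at least 4GB RAM, Compactor: at least 8GB RAM)

Essential Monitoring Metrics

tempo_ingester_live_traces: number of live traces (memory pressure indicator)
tempo_ingester_bytes_received_total: bytes received (apply rate() for bytes per second)
tempodb_blocklist_length: number of blocks in object storage (alert on sustained growth)
tempo_distributor_spans_received_total: spans received (compare against discarded spans to detect drops)
tempo_query_frontend_queries_total: query throughput and error rate
tempo_discarded_spans_total: discarded spans (investigate immediately if non-zero)

Regular Review Items

Weekly: check Compactor merge status and the block count trend
Weekly: check WAL disk usage and confirm that flushes are succeeding
Monthly: review storage cost and reassess the retention period
Monthly: benchmark TraceQL query performance (track response times for the main query patterns)
Quarterly: plan Tempo version upgrades and run compatibility tests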

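Failure Cases and Recovery

Case 1: Data Loss from Ingester WAL Corruption

Situation: a Kubernetes node shut down abruptly and corrupted the WAL on 2 of 3 Ingesters. The Ingesters could not replay the WAL on restart, and about 15 minutes of trace data was lost.

Recovery: we first emptied the corrupted WAL directories by hand and restarted the Ingesters. Part of the lost window was recovered by resending data still buffered in the OTel Collector's sending_queue.

Lessons: we set the Ingester replication_factor to 3 so that each span is replicated to at least two Ingesters. We pinned the WAL path to local NVMe SSD and changed the PV (PersistentVolume) reclaimPolicy to Retain so that the WAL survives Pod rescheduling. We also raised the Ingester Pod's terminationGracePeriodSeconds to 300 seconds to leave time for a flush on shutdown.

Case 2: Query Performance Collapse from a Compactor Outage

Situation: after an S3 IAM policy change, the Compactor lost the DeleteObject permission and block compaction stopped for two weeks. More than 500,000 small blocks accumulated, and TraceQL search response time rose from the usual 2 seconds to 45 seconds.

Recovery: we fixed the S3 IAM policy immediately and restarted the Compactor. When it tried to merge all 500,000 blocks at once, however, the Compactor ran out of memory. We lowered compaction.max_compaction_objects from 1,000,000 to 100,000 and shortened compaction_window to 1 hour so that blocks were merged gradually. Full recovery took 3 days.

Lessons: we added an alert on the block count metric (tempodb_blocklist_length) so that abnormal growth triggers a notification immediately. We also added a step to the change management process that checks whether an IAM policy change affects Tempo's permissions.

Case 3: Cardinality Explosion from Careless Custom Attributes

Situation: the development team added the user ID (user.id) as a span attribute without restraint, and the attribute was included in the Metrics Generator's dimensions. Cardinality exploded into the millions, Prometheus remote write became the bottleneck, and all metric collection was delayed.

Recovery: we removed user.id from dimensions immediately and restarted the Metrics Generator. We then deleted the affected series from Prometheus to reclaim storage.

Lessons: always validate the cardinality of an attribute before adding it to dimensions. We adopted a policy that attributes whose cardinality can exceed 1000 are used only through TraceQL search, never as metric labels. We also set overrides.defaults.metrics_generator.max_active_series as a safety net that limits the number of series.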
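The pre-validation step from this lesson can be sketched as a small check over a sample of spans. The function name, the flat span dictionaries, and the per-label limit of 1000 (taken from the policy above) are illustrative assumptions; the product of label cardinalities is a worst-case bound, not an exact series count.

```python
from math import prod


def check_dimensions(spans, dimensions, per_label_limit=1000):
    """Count distinct values per candidate dimension and flag the risky ones."""
    cardinality = {
        dim: len({span[dim] for span in spans if dim in span})
        for dim in dimensions
    }
    rejected = sorted(dim for dim, n in cardinality.items() if n > per_label_limit)
    # Worst case: every combination of label values appears as its own series.
    worst_case_series = prod(max(n, 1) for n in cardinality.values())
    return cardinality, rejected, worst_case_series


sample = [
    {"http.method": "GET" if i % 2 else "POST", "user.id": f"user-{i}"}
    for i in range(1500)
]
cardinality, rejected, worst_case = check_dimensions(sample, ["http.method", "user.id"])
print(cardinality)
print("do not use as dimensions:", rejected)
print("worst-case series:", worst_case)
```

Running such a check against a recent sample of production spans before changing dimensions catches attributes like user.id while the change is still a config review rather than an incident.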

References

Grafana Tempo Distributed Tracing and TraceQL Operations Guide 2026

Overview
Tempo Architecture
- Core Components
- Data Flow
Deployment Modes
Quick Start with Docker Compose
TraceQL Query Syntax
Span Metrics and Service Graphs
- Span Metrics Generator
- Service Graph Generator
Tempo vs Jaeger vs Zipkin Comparison
OpenTelemetry Collector Integration
Storage Optimization
Grafana Dashboard Configuration
Troubleshooting
Operations Checklist
Failure Cases and Recovery
References
Quiz

Overview

As microservices architecture has become mainstream, environments where a single request passes through dozens of services are now commonplace. In such environments, distributed tracing is essential for tracking the root cause of failures. Grafana Tempo is an open-source distributed tracing backend released by Grafana Labs in 2020 that can operate with only object storage, dramatically reducing infrastructure complexity and cost.

Tempo's core philosophy is simple. It does not create separate indexes for trace data, instead searching spans through Trace ID-based lookups and the TraceQL query engine. Thanks to this approach, storage costs are significantly lower compared to Jaeger or Zipkin, and petabyte-scale traces can be reliably stored.

This article covers Tempo's internal architecture, three deployment modes, TraceQL query syntax, span metrics generation and service graphs, OpenTelemetry Collector integration, storage optimization, Grafana dashboard configuration, troubleshooting, and real-world failure cases and recovery experiences from production operations.

Tempo Architecture

Internally, Tempo uses multiple components that work together to collect, store, and query trace data. Understanding each component's role helps quickly identify bottlenecks when failures occur.

Core Components

Distributor is the entry point that receives span data from clients. It supports various protocols including Jaeger, Zipkin, and OpenTelemetry (OTLP), and routes received spans to the appropriate Ingester using consistent hashing based on Trace ID hash.

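The Ingester batches received spans into blocks, building the bloom filters and the lightweight Trace ID index that each block carries, and flushes the blocks to object storage after a configured period. It maintains a WAL (Write-Ahead Log) so that data loss is minimized even if the process terminates abnormally.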

Query Frontend is the component called when clients like Grafana request Trace ID lookups or TraceQL searches. It distributes requests across multiple Queriers to search block data in parallel, reducing response time.

Querier is the worker that actually processes requests received from the Query Frontend. It searches both the Ingester's in-memory data and object storage block data to combine results.

Compactor periodically merges small blocks stored in object storage into larger blocks. This improves query performance and optimizes storage usage.

Metrics Generator is an optional component that automatically generates RED (Rate, Error, Duration) metrics and service graphs from received span data. Generated metrics are sent to Mimir or Prometheus via Prometheus-compatible remote write.

Data Flow

[Application] --> [OTel Collector] --> [Distributor]
                                           |
                                    [Hash Ring]
                                           |
                                      [Ingester]
                                       /      \
                              [WAL]         [Object Storage]
                                                  |
                              [Compactor] <-------+
                                                  |
                              [Query Frontend] ---+---> [Querier]

Spans arrive at the Distributor from the application via OTel Collector, then are distributed to Ingesters through the hash ring. The Ingester first writes to the WAL, then flushes blocks to object storage at configured intervals (default 30 minutes). The Compactor merges small blocks, and the Querier searches both Ingester in-memory and object storage data.

Deployment Modes

Tempo provides three deployment modes that can be selected based on the organization's scale and requirements.

Deployment Mode Comparison

Item	Monolithic	Scalable Single Binary	Microservices
Structure	Single binary, single process	Single binary, multiple instances	Independent process per component
Scalability	Vertical scaling only	Horizontal scaling	Independent horizontal per component
Recommended traffic	Under 100GB/day	100GB to 1TB/day	Over 1TB/day
Operational complexity	Low	Medium	High
High availability	Limited	Basic support	Full support
Suitable environment	Dev/test, small-scale	Medium-scale production	Large-scale production, multi-tenant
Kubernetes required	No	Recommended	Required

Monolithic Mode

All components run in a single process. This mode suits local environments and small workloads, and it is the simplest to configure.

# tempo.yaml (Monolithic); the Docker Compose example mounts this file as ./tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: '0.0.0.0:4317'
        http:
          endpoint: '0.0.0.0:4318'
    jaeger:
      protocols:
        thrift_http:
          endpoint: '0.0.0.0:14268'
    zipkin:
      endpoint: '0.0.0.0:9411'

ingester:
  max_block_duration: 5m
  max_block_bytes: 1073741824 # 1GB

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

compactor:
  compaction:
    block_retention: 72h

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: local
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
        - deployment.environment
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - http.route

overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Scalable Single Binary Mode

This mode scales horizontally by running the same binary as multiple instances. It sits between Monolithic and Microservices, adding scalability without much extra configuration. Each instance is started with the target flag set to scalable-single-binary.

Microservices Mode

Each component is deployed as an independent process, enabling individual scaling. In large-scale environments, specific components (e.g., Ingester) can be scaled out, or Queriers can be adjusted to match traffic patterns. In Kubernetes environments, using the Helm chart (tempo-distributed) makes deployment convenient.

Quick Start with Docker Compose

To quickly try Tempo in a local environment, use Docker Compose. The configuration below brings up Tempo (Monolithic), OTel Collector, Grafana, and Prometheus all at once.

# docker-compose.yaml
version: '3.9'

services:
  tempo:
    image: grafana/tempo:2.7.1
    command: ['-config.file=/etc/tempo/tempo.yaml']
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - '3200:3200' # Tempo HTTP API
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
      - '9411:9411' # Zipkin
      - '14268:14268' # Jaeger HTTP
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.118.0
    command: ['--config=/etc/otel-collector/config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector/config.yaml
    ports:
      - '4327:4317' # OTLP gRPC (for app access)
      - '4328:4318' # OTLP HTTP
    depends_on:
      - tempo
    networks:
      - observability

  prometheus:
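    image: prom/prometheus:v3.2.1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver' # accept remote_write from Tempo's metrics_generator
      - '--enable-feature=exemplar-storage' # store the exemplars sent with send_exemplars: true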
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
    networks:
      - observability

  grafana:
    image: grafana/grafana:11.5.2
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    ports:
      - '3000:3000'
    depends_on:
      - tempo
      - prometheus
    networks:
      - observability

volumes:
  tempo-data:

networks:
  observability:
    driver: bridge

After running docker compose up -d, access Grafana at http://localhost:3000 where the Tempo datasource is automatically provisioned, allowing you to search traces immediately.

TraceQL Query Syntax

TraceQL is Tempo's dedicated query language, following a syntax system similar to PromQL and LogQL. It selects spansets with curly braces {} and chains filters and aggregations with pipeline operators.

Basic Structure

A TraceQL query consists of three main elements:

Intrinsics: Span's built-in properties (name, status, duration, kind, rootName, rootServiceName, traceDuration)
Attributes: Custom key-value pairs using scope prefixes (span., resource., link., event.)
Operators: Comparison (=, !=, >, <, >=, <=), regex (=~, !~), logical (&&, ||), structural (>, >>, <, <<, ~)

TraceQL Query Examples

// 1. Find error spans for a specific service
{ resource.service.name = "payment-service" && status = error }

// 2. HTTP GET request spans taking over 500ms
{ span.http.method = "GET" && duration > 500ms }

// 3. Spans returning 5xx responses on a specific route
{ span.http.route = "/api/v1/orders" && span.http.status_code >= 500 }

// 4. Trace call relationship between two services (structural operator)
{ resource.service.name = "api-gateway" } >> { resource.service.name = "order-service" }

// 5. Filter spans with direct parent-child relationship
{ resource.service.name = "frontend" } > { span.http.status_code = 503 }

// 6. Explore sibling span relationships
{ span.db.system = "postgresql" } ~ { span.db.system = "redis" }

// 7. Span name matching using regex
{ name =~ "HTTP.*POST" && resource.deployment.environment = "production" }

// 8. Filter by total trace duration
{ traceDuration > 3s }

// 9. Filter by root service
{ rootServiceName = "ingress-nginx" && duration > 1s }

// 10. Analysis using aggregation functions
{ resource.service.name = "checkout-service" } | rate()

// 11. Check latency distribution with histogram
{ resource.service.name = "search-service" } | histogram_over_time(duration)

// 12. Traces that contain more than 100 error spans (count-based anomaly detection)
{ status = error } | count() > 100

Key Aggregation Functions

Function	Description	Example
`rate()`	Matching spans per second over time	`{ } | rate()`
`count()`	Number of matching spans per trace (compared against a value)	`{ status = error } | count() > 3`
`avg(field)`	Average of a field across the spanset	`{ } | avg(duration) > 200ms`
`max(field)`	Maximum of a field across the spanset	`{ } | max(duration) > 1s`
`min(field)`	Minimum of a field across the spanset	`{ } | min(duration) > 10ms`
`histogram_over_time(field)`	Histogram per time interval	`{ } | histogram_over_time(duration)`
`quantile_over_time(field, q, ...)`	Quantiles per time interval (for example p50/p90/p95/p99)	`{ } | quantile_over_time(duration, .5, .9, .95, .99)`

Span Metrics and Service Graphs

Tempo's Metrics Generator is a powerful feature that automatically generates metrics from received spans. Without separate metric collection, you can obtain RED metrics and service dependency graphs from trace data alone.

Span Metrics Generator

The span metrics processor converts Request Rate, Error Rate, and Duration distribution from all incoming spans into Prometheus metrics. The main metrics generated are:

traces_spanmetrics_calls_total: Total span call count
traces_spanmetrics_latency_bucket: Latency histogram buckets
traces_spanmetrics_size_total: Total span size

By configuring dimensions, you can add span attributes like http.method, http.status_code, and http.route as metric labels, allowing fine-grained RED metrics observation per endpoint.
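
As a rough illustration of the dimensions setting described above, the fragment below enables the span metrics processor with three extra labels. The WAL path and the remote write URL are assumptions made for this sketch, not values from this guide, and should be replaced with your own.

```yaml
# tempo.yaml (fragment): span metrics with per-endpoint labels
metrics_generator:
  storage:
    path: /var/tempo/generator/wal                   # assumed local path for the generator WAL
    remote_write:
      - url: http://prometheus:9090/api/v1/write     # assumed Prometheus remote write endpoint
        send_exemplars: true
  processor:
    span_metrics:
      dimensions:            # each entry becomes a label on traces_spanmetrics_* series
        - http.method
        - http.status_code
        - http.route

overrides:
  defaults:
    metrics_generator:
      processors: [span-metrics]   # processors are enabled per tenant through overrides
```

Every added dimension multiplies the number of time series, so it is safer to start with a short list and extend it only after checking the resulting series count.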

Service Graph Generator

The service graph processor analyzes client-server span pairs to automatically map call relationships between services. The service topology can be visually confirmed in Grafana's service graph view, with request rate, error rate, and latency displayed on each edge.

Key configuration parameters include:

max_items: Maximum number of edges held in memory while waiting to be paired (default 10000)
wait: Wait time for incomplete edges (default 10s)
dimensions: Custom labels to add to the service graph
histogram_buckets: Latency histogram bucket boundaries (default 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8)
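
The parameters above might be written as follows. This is a sketch that simply spells out the defaults listed in this section; the extra dimension is a hypothetical example, and the service-graphs processor still has to be enabled for the tenant in overrides.

```yaml
# tempo.yaml (fragment): service graph processor with the defaults made explicit
metrics_generator:
  processor:
    service_graphs:
      max_items: 10000        # edges kept in memory while waiting for the matching span
      wait: 10s               # how long to wait for the other half of an edge
      dimensions:
        - deployment.environment   # hypothetical extra label for the graph metrics
      histogram_buckets: [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]
```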

Tempo vs Jaeger vs Zipkin Comparison

When selecting a distributed tracing backend, comparing the characteristics of each tool is important.

Item	Grafana Tempo	Jaeger	Zipkin
Initial release	2020 (Grafana Labs)	2015 (Uber)	2012 (Twitter)
CNCF status	-	Graduated	-
Storage method	Object storage (no index)	Elasticsearch, Cassandra, etc.	Elasticsearch, Cassandra, MySQL
Indexing	None (Trace ID + TraceQL)	Tag-based index creation	Tag-based index creation
Storage cost	Low (S3/GCS pricing)	High (includes index storage)	High
Ingestion protocols	OTLP, Jaeger, Zipkin	OTLP, Jaeger	Zipkin, OTLP (limited)
Query language	TraceQL	Tag-based search	Tag-based search
Built-in UI	Grafana integration	Jaeger UI	Zipkin UI
Metrics generation	Built-in (Metrics Generator)	External tools needed	External tools needed
Scalability	Excellent (PB scale)	Moderate	Limited
Grafana integration	Native	Built-in data source	Built-in data source
Maintained by	Grafana Labs (commercial support)	CNCF community	Volunteer community

Selection Criteria Summary: If you already use the Grafana ecosystem and want to store large-scale traces at low cost, Tempo is optimal. If you need an independent tracing system and rich tag-based search is essential, consider Jaeger. For small teams looking to quickly adopt tracing, Zipkin remains a viable option.

OpenTelemetry Collector Integration

The recommended way to send traces to Tempo is to use the OpenTelemetry Collector as an intermediate pipeline. The Collector receives traces from various sources, batches and retries them, and then forwards them reliably to Tempo.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: '0.0.0.0:4317'
      http:
        endpoint: '0.0.0.0:4318'

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
    send_batch_max_size: 11000

  memory_limiter:
    check_interval: 1s
    limit_mib: 4096
    spike_limit_mib: 512

  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/tempo:
    endpoint: 'tempo:4317'
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  debug:
    verbosity: basic

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: '0.0.0.0:8888'

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo, debug]

The key aspects of this configuration are:

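tail_sampling: Traces containing an error status are kept at 100%, traces slower than 1 second are also kept in full, and the remainder is sampled at 10%. A trace is kept if any one policy matches, so important traces are retained while storage costs drop.
memory_limiter: Caps Collector memory at 4096 MiB (with a 512 MiB spike allowance) to prevent OOM kills. It is listed first in the pipeline so that data is refused before later processors allocate memory.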
sending_queue: Buffers data in the queue and retries even during temporary Tempo outages.
batch: Groups spans into batches of 10,000 for transmission, improving network efficiency.

Storage Optimization

Tempo's storage design is centered on object storage. In production environments, choose S3, GCS, or Azure Blob Storage as the backend.

Storage Backend Comparison

Item	Amazon S3	Google Cloud Storage	Azure Blob Storage
Config key	`s3`	`gcs`	`azure`
Authentication	IAM Role, Access Key	Service Account, Workload Identity	Managed Identity, SAS Token
Cost (GB/month)	$0.023 (Standard)	$0.020 (Standard)	$0.018 (Hot)
Region availability	33+ regions	40+ regions	60+ regions
Tempo compatibility	Full support	Full support	Full support
Lifecycle policy	S3 Lifecycle	Object Lifecycle	Lifecycle Management

S3 Backend Configuration Example

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces-prod
      endpoint: s3.ap-northeast-2.amazonaws.com
      region: ap-northeast-2
      # ${VAR} references are expanded only when Tempo runs with -config.expand-env=true
      access_key: ${S3_ACCESS_KEY}
      secret_key: ${S3_SECRET_KEY}
      # Omit access_key/secret_key when authenticating with an IAM role
    wal:
      path: /var/tempo/wal
    block:
      bloom_filter_false_positive: 0.01
      v2_index_downsample_bytes: 1048576
      v2_encoding: zstd
    blocklist_poll: 5m
    pool:
      max_workers: 200
      queue_depth: 20000

compactor:
  compaction:
    block_retention: 336h # 14-day retention
    compacted_block_retention: 1h
    compaction_window: 4h
    max_block_bytes: 107374182400 # 100GB
    max_compaction_objects: 6000000
    retention_concurrency: 10
  ring:
    kvstore:
      store: memberlist

Storage Optimization Tips

Block Encoding: The v2_encoding setting under block governs blocks written in the v2 format. For those blocks, zstd achieves roughly 30-40% better compression than snappy at the cost of slightly higher CPU usage. Choose snappy for write-heavy workloads, or zstd when storage cost is the priority.

Bloom Filter Tuning: Lowering bloom_filter_false_positive (e.g., from 0.01 to 0.005) reduces the number of blocks that are opened unnecessarily during Trace ID lookups, but increases bloom filter size. In environments with frequent Trace ID lookups, the lower false positive rate usually pays off in overall performance.

Block Retention Period: Set block_retention according to business requirements. 14 days (336h) is typical, but compliance requirements may necessitate 90 days or more. In such cases, using object storage lifecycle policies to automatically transition to Infrequent Access (S3) or Nearline (GCS) tiers can reduce costs.

Compactor Tuning: Setting max_block_bytes too high causes Compactor memory usage to spike, while setting it too low increases the number of blocks and degrades query performance. Around 100GB is a balanced value.

Grafana Dashboard Configuration

Tempo integrates natively with Grafana, providing rich tracing visualization without a separate UI. Below are the Grafana datasource provisioning configuration and dashboard configuration examples.

Datasource Provisioning

# grafana-datasources.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      httpMethod: GET
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: service
          - key: http.method
            value: method
      tracesToProfiles:
        datasourceUid: pyroscope
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
        tags:
          - key: service.name
            value: service_name
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      traceQuery:
        timeShiftEnabled: true
        spanStartTimeShift: '-30m'
        spanEndTimeShift: '30m'

Dashboard JSON Snippet

The following is a Grafana dashboard panel configuration showing request rate and error rate by service.

{
  "panels": [
    {
      "title": "Service Request Rate",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "sum(rate(traces_spanmetrics_calls_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": { "drawStyle": "line", "lineWidth": 2 }
        }
      }
    },
    {
      "title": "Service Error Rate",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "sum(rate(traces_spanmetrics_calls_total{status_code=\"STATUS_CODE_ERROR\"}[5m])) by (service) / sum(rate(traces_spanmetrics_calls_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      }
    },
    {
      "title": "P99 Latency by Service",
      "type": "timeseries",
      "datasource": { "uid": "prometheus", "type": "prometheus" },
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "s" }
      }
    }
  ]
}

Key Integration Features

When using Tempo in Grafana, the most powerful features are the three cross-datasource integrations: Traces to Logs, Traces to Metrics, and Traces to Profiles.

Traces to Logs: Clicking a specific span in the trace view navigates directly to Loki logs for that time window. It automatically filters by Trace ID and Span ID, showing only related logs.
Traces to Metrics: You can jump to Prometheus metric queries based on span attributes. When slow spans are found, you can immediately check CPU and memory metrics for that service.
Traces to Profiles: When integrated with Pyroscope, you can trace the cause of slow spans down to the code level (function call profiles).

Troubleshooting

This section covers common issues and solutions encountered when operating Tempo.

Ingester Out of Memory (OOM)

Symptom: Ingester Pods repeatedly restart with OOMKilled status.

Cause: In-memory blocks become excessively large due to traffic spikes, or max_block_duration is set too long.

Solution: Reduce ingester.max_block_duration to 5 minutes to shorten the flush cycle, and limit ingester.max_block_bytes to a range of 500MB to 1GB. Kubernetes resource requests and limits should also be set sufficiently. Increasing the number of Ingester instances to distribute load is also effective.
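
The settings named in this solution could look like the fragment below. It is a sketch of the values mentioned above, not a tuned recommendation; the right numbers depend on your traffic.

```yaml
# tempo.yaml (fragment): flush ingester blocks earlier to bound memory
ingester:
  max_block_duration: 5m        # cut a block after 5 minutes even if it is small
  max_block_bytes: 524288000    # 500 MiB; cut a block once it reaches this size
```

Shorter flush cycles produce more small blocks, so keep an eye on the Compactor after applying a change like this.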

TraceQL Query Timeout

Symptom: "context deadline exceeded" errors occur repeatedly during TraceQL searches.

Cause: Occurs when there are too many blocks (Compactor not functioning) or the search scope is too broad.

Solution: Verify that the Compactor is operating normally and adjust compaction_window appropriately. Set query_frontend.max_retries to 3 and limit results with query_frontend.search.default_result_limit. Narrowing the query time range is also an immediate mitigation.
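
A minimal sketch of the query frontend settings mentioned above follows. The max_duration value is an illustrative assumption added here to cap the searchable time range; it is not a value from this guide.

```yaml
# tempo.yaml (fragment): bound search cost at the query frontend
query_frontend:
  max_retries: 3
  search:
    default_result_limit: 20    # results returned when the client sets no limit
    max_duration: 24h           # illustrative cap on the time range of a single search
```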

Missing Spans

Symptom: Some spans are missing from traces, resulting in incomplete trace queries.

Cause: Often caused by hash ring inconsistency between Distributor and Ingester, network partitions, or sampling policy mismatches.

Solution: Check for "ring not healthy" messages in distributor logs. Verify that the Memberlist communication port (default 7946) is open in the firewall. Validate that the OTel Collector's tail_sampling policy is working as intended, and temporarily enable the debug exporter to trace span flow.
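
For the Memberlist check above, the relevant Tempo settings might look like this. The service hostname is hypothetical; use whatever DNS name resolves to all of your Tempo pods.

```yaml
# tempo.yaml (fragment): gossip ring membership
memberlist:
  bind_port: 7946
  join_members:
    - tempo-gossip-ring.tempo.svc.cluster.local:7946   # hypothetical headless service name
```

If the components cannot reach each other on this port, each one sees a different ring, which matches the missing-span symptom described above.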

Compactor Block Merge Failure

Symptom: The number of blocks in object storage keeps increasing and query performance gradually degrades.

Cause: Compactor memory shortage, object storage permission issues, or max_compaction_objects limit exceeded.

Solution: Increase the Compactor's memory allocation and reconfirm storage IAM permissions (ListBucket, GetObject, PutObject, DeleteObject). Gradually increase compaction.max_compaction_objects to handle large blocks.

Operations Checklist

This is a checklist for reliably operating Tempo in production environments.

Pre-deployment Checks

Determine deployment mode (based on daily traffic: under 100GB Monolithic, 100GB-1TB Scalable, over 1TB Microservices)
Create object storage bucket and configure IAM permissions
Verify disk IOPS for WAL storage path (SSD recommended, minimum 3000 IOPS)
Configure network policies (Memberlist 7946/TCP, OTLP 4317-4318/TCP)
Provision TLS certificates (mTLS recommended)
Set resource requests/limits (Ingester: minimum 4GB RAM, Compactor: minimum 8GB RAM)
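
The network policy item in this checklist could be sketched as a Kubernetes NetworkPolicy like the one below. The namespace, policy name, and pod labels are assumptions for illustration; the ports are the ones listed above.

```yaml
# networkpolicy.yaml: allow gossip between Tempo pods and OTLP ingest from the cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tempo-allow-gossip-and-otlp    # hypothetical name
  namespace: tempo                     # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: tempo    # assumed label on Tempo pods
  policyTypes: [Ingress]
  ingress:
    - from:                            # Memberlist gossip, Tempo pods only
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tempo
      ports:
        - protocol: TCP
          port: 7946
    - from:                            # OTLP gRPC and HTTP from any namespace
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 4317
          endPort: 4318
```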

Essential Monitoring Metrics

tempo_ingester_live_traces: Active trace count (memory pressure indicator)
tempo_ingester_bytes_received_total: Cumulative bytes received (apply rate() to get bytes per second)
tempodb_blocklist_length: Object storage block count (alert on sustained increase)
tempo_distributor_spans_received_total: Received span count (check for drops)
tempo_query_frontend_queries_total: Query throughput and error rate
tempo_discarded_spans_total: Discarded span count (investigate immediately if non-zero)
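
Two of the metrics above translate naturally into Prometheus alerting rules. The sketch below assumes the metrics carry the labels shown and uses illustrative durations and severities; adjust both to your environment.

```yaml
# tempo-alerts.yaml: example Prometheus alerting rules
groups:
  - name: tempo-ingest
    rules:
      - alert: TempoSpansDiscarded
        # Any sustained discard rate means spans are being dropped
        expr: sum by (reason) (rate(tempo_discarded_spans_total[5m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Tempo is discarding spans ({{ $labels.reason }})'
      - alert: TempoNoSpansReceived
        # Distributors receiving nothing usually points at the Collector or the network
        expr: sum(rate(tempo_distributor_spans_received_total[5m])) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Tempo distributors have received no spans for 10 minutes'
```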

Regular Inspection Items

Weekly: Check Compactor block merge status, monitor block count trends
Weekly: Check WAL disk usage and verify flush operation
Monthly: Review storage costs and reassess retention periods
Monthly: Benchmark TraceQL query performance (track response times for key query patterns)
Quarterly: Plan Tempo version upgrades and conduct compatibility tests

Failure Cases and Recovery

Case 1: Data Loss Due to Ingester WAL Corruption

Situation: An unexpected Kubernetes node shutdown corrupted the WAL on 2 out of 3 Ingesters. The Ingesters failed to recover WAL on restart, resulting in approximately 15 minutes of trace data loss.

Recovery Process: We first cleared the corrupted WAL directories manually and restarted the Ingesters. For the lost time window, partial recovery was possible by resending data that was still buffered in the OTel Collector's sending_queue.

Lessons Learned: We set the Ingester's replication_factor to 3 so that each span is sent to three Ingesters and a write succeeds once at least two of them acknowledge it. We pinned the WAL path to local NVMe SSD and changed the PV (PersistentVolume) reclaimPolicy to Retain so that the WAL is preserved across Pod rescheduling. We also raised the Ingester Pod's terminationGracePeriodSeconds to 300 seconds to leave time for a flush during shutdown.
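
The three changes from this case could be sketched as follows. These are fragments rather than complete manifests, and the resource names, storage class, and volume size are hypothetical.

```yaml
# tempo.yaml (fragment): replicate each span across ingesters
ingester:
  lifecycler:
    ring:
      replication_factor: 3
---
# storageclass.yaml: keep the volume (and the WAL on it) when the claim is released
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                          # hypothetical class name
provisioner: kubernetes.io/no-provisioner   # statically provisioned local volumes
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
---
# statefulset.yaml (fragment): give the ingester time to flush on shutdown
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tempo-ingester                      # hypothetical name
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 300
      containers:
        - name: ingester
          volumeMounts:
            - name: wal
              mountPath: /var/tempo/wal
  volumeClaimTemplates:
    - metadata:
        name: wal
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: local-nvme
        resources:
          requests:
            storage: 50Gi                   # illustrative size
```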

Case 2: Query Performance Collapse Due to Compactor Failure

Situation: After an S3 IAM policy change, the Compactor lost DeleteObject permissions, and block merging was interrupted for 2 weeks. Over 500,000 small blocks accumulated, causing TraceQL search response time to surge from the usual 2 seconds to 45 seconds.

Recovery Process: The S3 IAM policy was immediately corrected and the Compactor was restarted. However, attempting to merge 500,000 blocks at once caused Compactor OOM. By lowering compaction.max_compaction_objects from 1 million to 100,000 and reducing compaction_window to 1 hour, blocks were gradually merged. Full normalization took 3 days.

Lessons Learned: We set up an alert on the block count metric (tempodb_blocklist_length) so that we are notified immediately when the block count grows abnormally. We also added a check to the change management process to verify whether Tempo-related permissions are affected whenever IAM policies change.
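
An alert of the kind described here might look like the sketch below. The metric names tempodb_blocklist_length and tempodb_compaction_errors_total are assumptions about what your Tempo version exposes; check the /metrics endpoint and substitute the series you actually scrape. The thresholds are illustrative.

```yaml
# tempo-compaction-alerts.yaml: example Prometheus alerting rules
groups:
  - name: tempo-compaction
    rules:
      - alert: TempoBlockCountGrowing
        # Fires when the block count is 50% higher than at the same time yesterday
        expr: avg(tempodb_blocklist_length) > 1.5 * avg(tempodb_blocklist_length offset 1d)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: 'Tempo block count is growing; check that the Compactor is merging blocks'
      - alert: TempoCompactionErrors
        # Permission problems such as a missing DeleteObject surface here first
        expr: sum(increase(tempodb_compaction_errors_total[1h])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Tempo Compactor reported errors in the last hour'
```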

Case 3: Cardinality Explosion from Indiscriminate Custom Attributes

Situation: The development team indiscriminately added user IDs (user.id) as span attributes, and this attribute was included in the Metrics Generator's dimensions, causing cardinality to explode to millions. Prometheus remote write became a bottleneck, delaying the entire metrics collection.

Recovery Process: We immediately removed user.id from dimensions and restarted the Metrics Generator, then deleted the affected time series in Prometheus to reclaim storage.

Lessons Learned: Always verify the cardinality of an attribute before adding it to dimensions. We established a policy that attributes whose cardinality could exceed 1,000 are used only for TraceQL search and never as metric labels. As an additional safeguard, we set overrides.defaults.metrics_generator.max_active_series to cap the number of time series.
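
The safeguard from this case could be configured roughly as follows. The series limit is an illustrative number, not the value used in the incident, and the dimension list is only an example of a low-cardinality set.

```yaml
# tempo.yaml (fragment): cap generated series and keep dimensions low-cardinality
overrides:
  defaults:
    metrics_generator:
      processors: [span-metrics, service-graphs]
      max_active_series: 100000     # illustrative cap on series per tenant

metrics_generator:
  processor:
    span_metrics:
      dimensions:
        - http.method
        - http.route
        # user.id is deliberately left out; query it with TraceQL instead
```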

References
