OpenTelemetry Distributed Tracing Practical Guide — From Instrumentation to Jaeger Visualization

Introduction

In microservice environments, a single user request can pass through dozens of services. Distributed tracing is what lets you see where the bottleneck is and which service is slow.

OpenTelemetry (OTel) is a CNCF project and the de facto open standard for observability: a vendor-neutral framework that unifies Traces, Metrics, and Logs.

Core Concepts

Trace, Span, Context

Trace (the journey of an entire request):
├── Span A: API Gateway (50ms)
│   ├── Span B: Auth Service (10ms)
│   ├── Span C: Order Service (35ms)
│   │   ├── Span D: Database Query (15ms)
│   │   └── Span E: Payment Service (18ms)
│   │       └── Span F: Bank API Call (12ms)
│   └── Span G: Notification Service (5ms, async)

# Structure of a Span
{
    "traceId": "abc123def456...",        # Unique Trace ID (128-bit)
    "spanId": "span789...",              # Unique Span ID (64-bit)
    "parentSpanId": "parent456...",      # Parent Span ID
    "name": "POST /api/orders",          # Span name
    "kind": "SERVER",                    # CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL
    "startTime": "2026-03-03T12:00:00Z",
    "endTime": "2026-03-03T12:00:00.050Z",
    "status": {"code": "OK"},
    "attributes": {                       # Metadata
        "http.method": "POST",
        "http.target": "/api/orders",     # http.url would hold the full URL, not just the path
        "http.status_code": 201,
        "service.name": "order-service"
    },
    "events": [                           # Events within the Span
        {
            "name": "order.validated",
            "timestamp": "2026-03-03T12:00:00.010Z",
            "attributes": {"order_id": "ORD-123"}
        }
    ]
}
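
The parent/child links in the span structure above are what let a backend rebuild the tree view of a trace. The following is a minimal sketch of that reconstruction, assuming a simplified, hypothetical `SpanRecord` type (it is not the OpenTelemetry SDK's span class) and made-up timings that mirror the example tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanRecord:
    """Hypothetical, simplified span record (not the OTel SDK type)."""
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int
    children: List["SpanRecord"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_trace_tree(spans: List[SpanRecord]) -> SpanRecord:
    """Link spans by parent_span_id and return the root (the span with no parent)."""
    by_id: Dict[str, SpanRecord] = {s.span_id: s for s in spans}
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            by_id[span.parent_span_id].children.append(span)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: SpanRecord, depth: int = 0) -> List[str]:
    """Return one indented line per span, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms}ms)"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative data only; IDs and timings are assumptions.
spans = [
    SpanRecord("a", None, "API Gateway", 0, 50),
    SpanRecord("b", "a", "Auth Service", 1, 11),
    SpanRecord("c", "a", "Order Service", 12, 47),
    SpanRecord("d", "c", "Database Query", 13, 28),
]
tree_lines = render(build_trace_tree(spans))
print("\n".join(tree_lines))
```

Spans usually arrive out of order and from different services, which is why the sketch indexes them by ID first and only then attaches children.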

Context Propagation

Context propagation between services:

[Service A]                    [Service B]
    │                              │
    │ traceparent: 00-abc123-span1-01
    │ ─────────────────────────> (Creates child Span with same traceId)
    │                              │
    │                              │  traceparent: 00-abc123-span2-01
    │                              │  ──────────> [Service C]

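HTTP Header:
traceparent: 00-{traceId}-{spanId}-{flags}
  version: 2 hex chars, traceId: 32 hex chars, spanId: 16 hex chars, flags: 2 hex chars (01 = sampled)
Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01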
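
To make the header layout concrete, here is a small sketch of a `traceparent` parser for version `00` of the W3C Trace Context format. The names `TraceParent` and `parse_traceparent` are hypothetical helpers for illustration; in a real service the OpenTelemetry propagators do this for you.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex), lowercase only
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str   # span ID of the caller
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff":
        return None  # "ff" is not a valid version
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit means "sampled"
    return TraceParent(version, trace_id, parent_id, sampled)


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed)
```

The abbreviated IDs in the diagram above (such as `abc123`) are shorthand; a parser like this one rejects them because real IDs are fixed-length hex strings.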

Python Instrumentation

Auto-Instrumentation (Zero-Code)

# Install dependencies
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install  # Install auto-instrumentation packages

# Run app with auto-instrumentation
opentelemetry-instrument \
  --service_name order-service \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py
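
The flags passed to `opentelemetry-instrument` above correspond to `OTEL_*` environment variables, which is often the more convenient form in container manifests. The sketch below shows that naming rule; `flags_to_env` is a hypothetical helper written for this example, not part of any OpenTelemetry package.

```python
from typing import Dict


def flags_to_env(flags: Dict[str, str]) -> Dict[str, str]:
    """Map flag names (without the leading dashes) to OTEL_* variable names."""
    return {f"OTEL_{name.upper()}": value for name, value in flags.items()}


env = flags_to_env({
    "service_name": "order-service",
    "traces_exporter": "otlp",
    "metrics_exporter": "otlp",
    "exporter_otlp_endpoint": "http://otel-collector:4317",
})
for key, value in env.items():
    print(f"{key}={value}")
```

With these variables set in the environment, the command can be reduced to `opentelemetry-instrument python app.py`.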

Manual Instrumentation (Fine-Grained Control)

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes

# 1. Configure TracerProvider
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "order-service",
    ResourceAttributes.SERVICE_VERSION: "1.2.0",
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# 2. Create Tracer
tracer = trace.get_tracer("order-service", "1.2.0")

# 3. Create and use Spans
@tracer.start_as_current_span("create_order")
def create_order(order_data: dict) -> dict:
    span = trace.get_current_span()

    # Add attributes
    span.set_attribute("order.customer_id", order_data["customer_id"])
    span.set_attribute("order.total_amount", order_data["total"])
    span.set_attribute("order.item_count", len(order_data["items"]))

    # Add events
    span.add_event("order.validation_started")

    # Validation
    validate_order(order_data)
    span.add_event("order.validation_completed")

    # Process payment (child Span created automatically)
    payment_result = process_payment(order_data)

    # Record result
    span.set_attribute("order.id", payment_result["order_id"])
    span.set_status(trace.StatusCode.OK)

    return payment_result


@tracer.start_as_current_span("process_payment")
def process_payment(order_data: dict) -> dict:
    span = trace.get_current_span()
    span.set_attribute("payment.method", order_data.get("payment_method", "card"))

    try:
        result = payment_client.charge(order_data)
        span.set_attribute("payment.transaction_id", result["txn_id"])
        return result
    except Exception as e:
        span.set_status(trace.StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise

FastAPI Integration

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()

# Apply auto-instrumentation
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()  # External HTTP calls
SQLAlchemyInstrumentor().instrument(engine=db_engine)  # DB queries

@app.post("/api/orders")
async def create_order(request: Request, order: OrderRequest):
    # Add business context to current Span
    span = trace.get_current_span()
    span.set_attribute("order.customer_id", order.customer_id)
    span.set_attribute("order.region", order.shipping_region)

    result = await order_service.create(order)
    return result

Java/Spring Boot Instrumentation

Spring Boot Auto-Configuration

<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.11.0</version>
</dependency>
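
# application.yml
otel:
  service:
    name: order-service
  exporter:
    otlp:
      # The starter defaults to the http/protobuf protocol (port 4318).
      # For gRPC, set otel.exporter.otlp.protocol=grpc and use port 4317.
      endpoint: http://otel-collector:4318
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: '0.1' # 10% sampling (production)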
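
// OpenTelemetry imports used by the classes below (Spring and Lombok imports omitted)
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

@RestController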
@RequiredArgsConstructor
public class OrderController {

    private final Tracer tracer;
    private final OrderService orderService;

    @PostMapping("/api/orders")
    public ResponseEntity<OrderResponse> createOrder(@RequestBody OrderRequest request) {
        Span span = Span.current();
        span.setAttribute("order.customer_id", request.getCustomerId());
        span.setAttribute("order.total", request.getTotal().doubleValue());

        OrderResponse response = orderService.create(request);

        span.setAttribute("order.id", response.getOrderId());
        return ResponseEntity.status(HttpStatus.CREATED).body(response);
    }
}

// Manual Span creation
@Service
public class PaymentService {

    @Autowired
    private Tracer tracer;

    public PaymentResult processPayment(Order order) {
        Span span = tracer.spanBuilder("process_payment")
            .setAttribute("payment.amount", order.getTotal().doubleValue())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            PaymentResult result = paymentGateway.charge(order);
            span.setAttribute("payment.txn_id", result.getTransactionId());
            return result;
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter goes first; batch goes after sampling so only kept traces are batched
      processors: [memory_limiter, attributes, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
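
The three `tail_sampling` policies in the config above combine as "keep the trace if any policy says keep". The sketch below illustrates that decision logic in plain Python. It is not the Collector's implementation: `FinishedSpan` and `keep_trace` are hypothetical names, the probabilistic policy is modelled with a random draw (the Collector derives it from the trace ID), and the exact boundary behaviour at the latency threshold is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FinishedSpan:
    """Hypothetical minimal summary of a completed span."""
    status: str    # "OK", "ERROR", or "UNSET"
    start_ms: int
    end_ms: int


def keep_trace(
    spans: List[FinishedSpan],
    threshold_ms: int = 1000,
    sampling_percentage: float = 10.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide for a complete trace, after all of its spans have arrived."""
    # Policy "errors": any span with ERROR status keeps the whole trace.
    if any(s.status == "ERROR" for s in spans):
        return True
    # Policy "slow-traces": duration from earliest start to latest end.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms > threshold_ms:
        return True
    # Policy "probabilistic": keep a fixed share of everything else.
    return rng() * 100 < sampling_percentage


fast_ok = [FinishedSpan("OK", 0, 40), FinishedSpan("OK", 5, 30)]
slow_ok = [FinishedSpan("OK", 0, 1500)]
failed = [FinishedSpan("OK", 0, 40), FinishedSpan("ERROR", 5, 30)]

never = lambda: 0.99  # a draw that always falls outside the 10% share
print(keep_trace(fast_ok, rng=never), keep_trace(slow_ok, rng=never), keep_trace(failed, rng=never))
```

Because the decision needs the whole trace, the Collector has to buffer spans for `decision_wait` before choosing, which is the memory cost of tail sampling.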

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
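  # tail_sampling needs every span of a trace on the same replica; with more than
  # one replica, route by trace ID in front of this tier (e.g. the loadbalancing exporter)
  replicas: 2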
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector # must match spec.selector.matchLabels
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ['--config=/conf/config.yaml']
          ports:
            - containerPort: 4317 # gRPC
            - containerPort: 4318 # HTTP
            - containerPort: 8889 # Prometheus metrics
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: grpc
      port: 4317
    - name: http
      port: 4318

Sampling Strategies

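from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    Decision,
    ParentBasedTraceIdRatio,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

# Development: collect everything
dev_sampler = ALWAYS_ON

# Production: 10% sampling (follows parent Span decision)
prod_sampler = ParentBasedTraceIdRatio(0.1)

# Custom head sampler: 100% for spans flagged as errors at creation, 5% for the rest.
# A head sampler only sees attributes passed when the span starts, so errors raised
# later are invisible here; use tail sampling in the Collector to catch those.
class SmartSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        # Always collect spans created with error=True
        if attributes and attributes.get("error") is True:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # 5% for normal traffic (trace_id is a 128-bit integer)
        if trace_id % 100 < 5:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "SmartSampler(error=100%, default=5%)"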
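
Ratio-based samplers decide from the trace ID alone, which is why every service reaches the same verdict for the same trace without coordination. The sketch below shows the idea in plain Python: compare the low 64 bits of the trace ID against a bound derived from the rate. It is written in the spirit of `TraceIdRatioBased`, but the helper names are hypothetical and the exact arithmetic in the SDK may differ.

```python
import random

TRACE_ID_LOW_64_MASK = (1 << 64) - 1


def ratio_bound(rate: float) -> int:
    """Upper bound (exclusive) for the low 64 bits of a sampled trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    return round(rate * (1 << 64))


def is_sampled(trace_id: int, rate: float) -> bool:
    """Deterministic: the same trace ID and rate always give the same answer."""
    return (trace_id & TRACE_ID_LOW_64_MASK) < ratio_bound(rate)


# Random 128-bit trace IDs should be kept at roughly the configured rate.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
observed_rate = sum(is_sampled(t, 0.1) for t in trace_ids) / len(trace_ids)
print(f"observed sampling rate: {observed_rate:.3f}")
```

The parent-based wrapper adds one rule on top: if the incoming context already carries a sampling decision, that decision is followed and the ratio check applies only to root spans.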

Using Jaeger UI

# Run Jaeger with Docker
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:1.62

# Access http://localhost:16686 in your browser

Key features:

1. Trace Search
   - Select Service → Filter by Operation → Duration range
   - Tag search: http.status_code=500

2. Trace Timeline (Waterfall)
   - Visualize start/end times of each Span
   - Distinguish parallel from sequential processing
   - View detailed attributes/events/logs for each Span

3. Service Dependency Graph
   - Visualize call relationships between services
   - Display call frequency and error rate

4. Compare
   - Compare two Traces to analyze performance differences
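
What the waterfall view shows visually, parallel versus sequential work, comes down to whether sibling spans overlap in time. A minimal sketch of that check, assuming spans reduced to hypothetical `(name, start_ms, end_ms)` tuples with made-up timings:

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (span name, start_ms, end_ms)


def overlapping_pairs(siblings: List[Interval]) -> List[Tuple[str, str]]:
    """Return pairs of sibling spans whose time ranges overlap (ran in parallel)."""
    ordered = sorted(siblings, key=lambda s: s[1])
    pairs = []
    for i, (name_a, _, end_a) in enumerate(ordered):
        for name_b, start_b, _ in ordered[i + 1:]:
            # b starts no earlier than a, so they overlap if b starts before a ends
            if start_b < end_a:
                pairs.append((name_a, name_b))
    return pairs


siblings = [
    ("Auth Service", 1, 11),
    ("Order Service", 12, 47),
    ("Notification Service", 40, 45),
]
parallel = overlapping_pairs(siblings)
print(parallel)
```

Siblings that never overlap ran one after another, so their durations add up on the critical path; overlapping siblings are candidates for "already parallel, look elsewhere".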

Best Practices

1. Use Semantic Conventions

from opentelemetry.semconv.trace import SpanAttributes

# Use standard attributes (cross-vendor compatibility)
span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
span.set_attribute(SpanAttributes.HTTP_TARGET, "/api/orders")  # HTTP_URL is for the full URL
span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 201)
span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
span.set_attribute(SpanAttributes.DB_STATEMENT, "SELECT * FROM orders")

2. Remove Sensitive Information

# Remove PII in the Collector
processors:
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash  # Replace SQL with hash
      - key: user.email
        action: delete
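
The effect of the `delete` and `hash` actions above can be pictured as a transformation on a span's attribute map. The sketch below mimics it in plain Python for illustration. `scrub` is a hypothetical helper, and SHA-256 is an assumption chosen for the example; check the Collector documentation for the hash function your version actually uses.

```python
import hashlib
from typing import Any, Dict

DELETE_KEYS = {"http.request.header.authorization", "user.email"}
HASH_KEYS = {"db.statement"}


def scrub(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with sensitive keys removed and hashed keys replaced by a digest."""
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        if key in DELETE_KEYS:
            continue  # drop the attribute entirely
        if key in HASH_KEYS:
            # Assumption: SHA-256 hex digest stands in for the Collector's hash action
            value = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        cleaned[key] = value
    return cleaned


scrubbed = scrub({
    "http.method": "POST",
    "http.request.header.authorization": "Bearer secret-token",
    "db.statement": "SELECT * FROM orders",
    "user.email": "alice@example.com",
})
print(sorted(scrubbed))
```

Hashing keeps identical statements groupable in the backend without storing their text, whereas deleting removes the attribute altogether.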

3. Error Recording

try:
    result = external_api.call()
except Exception as e:
    span.set_status(trace.StatusCode.ERROR, str(e))
    span.record_exception(e)
    # record_exception automatically includes stack trace
    raise

Quiz

Q1. What is the relationship between Trace, Span, and Context?

A Trace represents the entire path of a request, a Span is a single unit of work within it, and Context is the mechanism that propagates Trace/Span information between services.

Q2. What is the format of W3C Trace Context's traceparent?

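The format is version-traceId-spanId-flags, written as lowercase hex with 2, 32, 16, and 2 characters respectively. Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01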

Q3. What is the role of the OpenTelemetry Collector?

It handles the pipeline of Receive (ingestion) then Process (processing/filtering/sampling) then Export (transmission). It acts as an intermediate layer between applications and backends (Jaeger, Tempo, etc.).

Q4. What is the difference between Tail Sampling and Head Sampling?

Head Sampling decides whether to collect at the start of a Trace, while Tail Sampling decides after the Trace is complete with full information. Tail Sampling enables more precise control, such as collecting 100% of error Traces.

Q5. Why should you not set the sampling rate to 100% in production?

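Trace data volume grows in proportion to traffic, which drives up storage costs, network load, and Collector resource usage. A rate of 5-10% is a common starting point; tune it to your traffic and budget, and rely on tail sampling to keep error and slow Traces.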
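
A quick back-of-the-envelope calculation shows how the volume scales with the sampling rate. The function below is a rough sketch; the request rate, spans per trace, and bytes per span are illustrative assumptions, not measurements.

```python
def daily_trace_gib(
    requests_per_second: float,
    spans_per_trace: int,
    bytes_per_span: int,
    sampling_rate: float,
) -> float:
    """Estimate raw trace data per day in GiB (before compression or indexing)."""
    seconds_per_day = 86_400
    total_bytes = (
        requests_per_second
        * seconds_per_day
        * spans_per_trace
        * bytes_per_span
        * sampling_rate
    )
    return total_bytes / (1024 ** 3)


# Assumed workload: 1,000 requests/s, 20 spans per trace, about 500 bytes per span.
full = daily_trace_gib(1_000, 20, 500, sampling_rate=1.0)
ten_percent = daily_trace_gib(1_000, 20, 500, sampling_rate=0.1)
print(f"100% sampling: {full:.0f} GiB/day, 10% sampling: {ten_percent:.0f} GiB/day")
```

Under these assumptions, full sampling lands in the high hundreds of GiB per day, and the figure drops linearly with the rate.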

Q6. What is the difference between record_exception() and set_status(ERROR)?

record_exception() records an exception event (including stack trace) on the Span, while set_status(ERROR) marks the Span's status as error. Typically both are used together.

Q7. Why use Semantic Conventions?

Using standardized attribute names allows consistent Trace analysis across different vendor backends (Jaeger, Datadog, Grafana, etc.).

Conclusion

OpenTelemetry has established itself as the Observability standard for distributed systems. Start quickly with auto-instrumentation, then add business context with manual instrumentation as needed. Using the OTel Collector as an intermediate layer enables flexible operations including sampling, filtering, and multi-backend transmission.
