OpenTelemetry Distributed Tracing Practical Guide: Building and Operating Instrumentation, Collection, and Analysis Pipelines
- Introduction
- OpenTelemetry Architecture Overview
- Manual Instrumentation
- Auto-Instrumentation
- OpenTelemetry Collector Pipeline
- Sampling Strategies
- Backend Comparison
- Context Propagation
- eBPF-Based Zero-Code Instrumentation
- Failure Cases and Recovery Procedures
- Production Checklist
- References

Introduction
In a microservices architecture, a single user request can traverse dozens of services before being fulfilled. Distributed tracing is essential to identify which service introduced latency and along which call path errors propagated. OpenTelemetry (OTel) is a CNCF graduated project that provides a unified API and SDK for collecting traces, metrics, and logs as an observability standard.
This article covers everything needed for production operations: OpenTelemetry's architecture and core concepts, language-specific instrumentation for Python/Node.js/Go, Collector pipeline configuration, sampling strategies, a backend comparison, and the failure cases and checklists encountered in production environments.
OpenTelemetry Architecture Overview
Core Components
OpenTelemetry consists of the following components:
- API: A vendor-neutral instrumentation interface, used by library authors.
- SDK: The concrete implementation of the API. Handles sampling, batching, and exporting.
- Collector: A standalone process that receives, processes, and exports telemetry data.
- Exporters: Modules that send collected data to backends such as Jaeger, Tempo, or Datadog.
- Instrumentation Libraries: Framework-specific libraries that provide automatic instrumentation.
Trace Model
The core concepts of distributed tracing are as follows:
| Concept | Description |
|---|---|
| Trace | The complete path of a single request, composed of multiple Spans |
| Span | An individual unit of work within a trace |
| SpanContext | Context carrying the SpanID, TraceID, TraceFlags, and TraceState |
| TraceID | A 128-bit ID that uniquely identifies a trace |
| SpanID | A 64-bit ID that uniquely identifies a span |
| Parent Span | The span that created the current span |
| Baggage | Key-value pairs propagated across the entire trace |
| Attributes | Metadata (key-value pairs) attached to a span |
| Events | Point-in-time events within a span (similar to logs) |
| Links | Causal connections to other traces/spans |
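The ID sizes in the table can be pinned down concretely. A stdlib-only sketch (the helper names are illustrative, not part of the OTel SDK) that generates W3C-formatted identifiers:

```python
import secrets

def new_trace_id() -> str:
    # A TraceID is 128 bits, rendered as 32 lowercase hex characters.
    return secrets.token_bytes(16).hex()

def new_span_id() -> str:
    # A SpanID is 64 bits, rendered as 16 lowercase hex characters.
    return secrets.token_bytes(8).hex()

# A SpanContext bundles these IDs with flags such as "sampled" (bit 0).
span_context = {
    "trace_id": new_trace_id(),
    "span_id": new_span_id(),
    "trace_flags": 0x01,
}
print(span_context)
```

These are exactly the values that later appear on the wire in the `traceparent` header.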
Manual Instrumentation
Python Instrumentation
```python
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes

# Define the resource
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "order-service",
    ResourceAttributes.SERVICE_VERSION: "1.2.0",
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})

# Configure the TracerProvider
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Create a Tracer
tracer = trace.get_tracer("order-service", "1.2.0")

# Usage example: order processing
def create_order(customer_id: str, items: list) -> dict:
    with tracer.start_as_current_span(
        "create_order",
        attributes={
            "customer.id": customer_id,
            "order.item_count": len(items),
        },
    ) as span:
        try:
            # Check inventory
            with tracer.start_as_current_span("check_inventory") as inventory_span:
                available = check_inventory(items)
                inventory_span.set_attribute("inventory.all_available", available)
                if not available:
                    span.set_status(trace.StatusCode.ERROR, "Inventory not available")
                    raise ValueError("Some items are out of stock")
            # Process the payment
            with tracer.start_as_current_span("process_payment") as payment_span:
                payment_result = process_payment(customer_id, items)
                payment_span.set_attribute("payment.transaction_id", payment_result["tx_id"])
                payment_span.add_event("payment_completed", {
                    "amount": payment_result["amount"],
                    "currency": "KRW",
                })
            # Save the order
            with tracer.start_as_current_span("save_order"):
                order = save_to_database(customer_id, items, payment_result)
            span.set_attribute("order.id", order["id"])
            return order
        except Exception as e:
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
```
Node.js Instrumentation
```javascript
// npm install @opentelemetry/api @opentelemetry/sdk-node
// npm install @opentelemetry/exporter-trace-otlp-grpc
// npm install @opentelemetry/semantic-conventions
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc')
const { Resource } = require('@opentelemetry/resources')
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions')
const { trace, SpanStatusCode } = require('@opentelemetry/api')

// Initialize the SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'user-service',
    [ATTR_SERVICE_VERSION]: '2.1.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
})
sdk.start()

const tracer = trace.getTracer('user-service', '2.1.0')

// Usage example: user lookup
async function getUser(userId) {
  return tracer.startActiveSpan('getUser', async (span) => {
    try {
      span.setAttribute('user.id', userId)
      // Database query
      const user = await tracer.startActiveSpan('db.query', async (dbSpan) => {
        dbSpan.setAttribute('db.system', 'postgresql')
        dbSpan.setAttribute('db.statement', 'SELECT * FROM users WHERE id = ?')
        const result = await db.query('SELECT * FROM users WHERE id = $1', [userId])
        dbSpan.setAttribute('db.row_count', result.rows.length)
        dbSpan.end()
        return result.rows[0]
      })
      if (!user) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'User not found' })
        return null
      }
      // Update the cache
      await tracer.startActiveSpan('cache.set', async (cacheSpan) => {
        cacheSpan.setAttribute('cache.system', 'redis')
        cacheSpan.setAttribute('cache.key', `user:${userId}`)
        await redis.set(`user:${userId}`, JSON.stringify(user), 'EX', 3600)
        cacheSpan.end()
      })
      span.setStatus({ code: SpanStatusCode.OK })
      return user
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
      span.recordException(error)
      throw error
    } finally {
      span.end()
    }
  })
}
```
Go Instrumentation
```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
	"go.opentelemetry.io/otel/trace"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(
		context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	res, err := resource.New(
		context.Background(),
		resource.WithAttributes(
			semconv.ServiceName("payment-service"),
			semconv.ServiceVersion("3.0.1"),
			semconv.DeploymentEnvironmentKey.String("production"),
		),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

var tracer = otel.Tracer("payment-service")

func ProcessPayment(ctx context.Context, orderID string, amount float64) error {
	ctx, span := tracer.Start(ctx, "ProcessPayment",
		trace.WithAttributes(
			attribute.String("order.id", orderID),
			attribute.Float64("payment.amount", amount),
		),
	)
	defer span.End()
	// Fraud-detection check
	ctx, fraudSpan := tracer.Start(ctx, "fraud_detection")
	isFraud, err := checkFraud(ctx, orderID, amount)
	if err != nil {
		fraudSpan.SetStatus(codes.Error, err.Error())
		fraudSpan.RecordError(err)
		fraudSpan.End()
		return err
	}
	fraudSpan.SetAttributes(attribute.Bool("fraud.detected", isFraud))
	fraudSpan.End()
	if isFraud {
		span.SetStatus(codes.Error, "Fraud detected")
		return fmt.Errorf("fraud detected for order %s", orderID)
	}
	// Call the payment gateway
	ctx, gwSpan := tracer.Start(ctx, "payment_gateway_call")
	txID, err := callPaymentGateway(ctx, amount)
	if err != nil {
		gwSpan.SetStatus(codes.Error, err.Error())
		gwSpan.RecordError(err)
		gwSpan.End()
		return err
	}
	gwSpan.SetAttributes(attribute.String("payment.transaction_id", txID))
	gwSpan.End()
	span.SetStatus(codes.Ok, "Payment processed successfully")
	return nil
}
```
Auto-Instrumentation
Python Auto-Instrumentation
```shell
# Install the auto-instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run with environment-variable configuration
OTEL_SERVICE_NAME=order-service \
OTEL_TRACES_EXPORTER=otlp \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_PYTHON_LOG_CORRELATION=true \
opentelemetry-instrument python app.py
```
Node.js Auto-Instrumentation
```javascript
// tracing.js - load before the app starts
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc')

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // skip health-check endpoints
        ignoreIncomingRequestHook: (req) => ['/health', '/ready'].includes(req.url),
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true,
        enhancedDatabaseReporting: true,
      },
    }),
  ],
})
sdk.start()

process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0))
})
```
```shell
# Run the app with auto-instrumentation
node --require ./tracing.js app.js
```
OpenTelemetry Collector Pipeline
Collector Architecture
The Collector consists of three kinds of components:
- Receivers: Entry points that receive telemetry data. Support protocols such as OTLP, Jaeger, and Zipkin.
- Processors: Transform, filter, and batch data; add or remove attributes, apply sampling, and so on.
- Exporters: Send processed data to backends such as Jaeger, Tempo, or Datadog.
Collector Configuration Example
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Jaeger-format traces can also be received
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  # Batch spans to improve network efficiency
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 5s
  # Cap the Collector's memory usage
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: ap-northeast-2-prod
        action: upsert
  # Remove sensitive attributes (also reduces cost)
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash  # hash SQL statements (security)
  # Tail-based sampling
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  # Send to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Send to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Debug output
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, resource, attributes, tail_sampling, batch]
      exporters: [otlp/tempo, debug]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
```
Collector Deployment Modes
```yaml
# Deploy the Collector with Docker Compose
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ['--config=/etc/otel/config.yaml']
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - '4317:4317'   # OTLP gRPC
      - '4318:4318'   # OTLP HTTP
      - '8888:8888'   # Prometheus metrics
      - '8889:8889'   # Prometheus exporter
      - '13133:13133' # Health check
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '1.0'
```
Sampling Strategies
Head-based vs Tail-based Sampling
| Characteristic | Head-based | Tail-based |
|---|---|---|
| Decision point | At trace start | After trace completion |
| Decision input | TraceID hash | Complete trace data |
| Advantages | Low overhead, simple to implement | Reliably captures error/latency traces |
| Disadvantages | May miss important traces | High memory usage, more complex |
| Best for | High traffic, cost-sensitive | Debugging-focused, quality-first |
| Implemented in | SDK (client-side) | Collector (server-side) |
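To make the cost side of this trade-off concrete, a quick back-of-the-envelope calculation (the traffic and size numbers below are made up for illustration):

```python
traces_per_day = 10_000_000   # assumed traffic
avg_spans_per_trace = 25      # assumed trace size
bytes_per_span = 500          # rough per-span wire-size estimate

def daily_volume_gb(sampling_rate: float) -> float:
    """Approximate daily trace volume in GB at a given sampling rate."""
    spans = traces_per_day * sampling_rate * avg_spans_per_trace
    return spans * bytes_per_span / 1e9

print(round(daily_volume_gb(1.0), 1))  # no sampling
print(round(daily_volume_gb(0.1), 1))  # 10% head-based sampling
```

Under these assumptions, 10% head-based sampling cuts daily volume by an order of magnitude, which is why high-traffic systems rarely ship unsampled traces.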
Sampling Configuration Examples
```python
# Head-based sampling in the Python SDK
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    TraceIdRatioBased,
    ParentBased,
    ALWAYS_ON,
    ALWAYS_OFF,
)

# Sample 10% of root traces; child spans follow the parent's decision
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(
    resource=resource,
    sampler=sampler,
)
```
```yaml
# Tail-based sampling in the Collector
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      # Collect 100% of traces containing errors
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Collect 100% of traces slower than 1 second
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      # Collect 50% of traces from critical services
      - name: critical-service
        type: and
        and:
          and_sub_policy:
            - name: critical-service-filter
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service, auth-service]
                enabled_regex_matching: false
            - name: sample-critical
              type: probabilistic
              probabilistic:
                sampling_percentage: 50
      # Collect 5% of everything else
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```
Backend Comparison
| Item | Jaeger | Grafana Tempo | Zipkin | Datadog | New Relic |
|---|---|---|---|---|---|
| License | Apache 2.0 | AGPLv3 | Apache 2.0 | Commercial | Commercial |
| Storage | Cassandra, ES, Memory | Object storage (S3, etc.) | Cassandra, ES, MySQL | Proprietary | Proprietary |
| Query language | Built-in UI/API | TraceQL | Built-in UI/API | Proprietary query | NRQL |
| Cost | Free (infra costs) | Free (infra costs) | Free (infra costs) | Per-trace billing | Per-trace billing |
| Scaling | Horizontally scalable | Excellent | Limited | Automatic | Automatic |
| OTel support | Native | Native | Native | Native | Native |
| Logs/metrics integration | Limited | Integrated with the Grafana stack | Limited | Full | Full |
| Operational complexity | Moderate | Low (object storage) | Low | None (SaaS) | None (SaaS) |
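As a taste of the query-side differences, Tempo's TraceQL (mentioned in the table) filters traces by span attributes and duration. Two illustrative queries, assuming spans carry standard HTTP semantic-convention attributes:

```traceql
{ resource.service.name = "order-service" && status = error }
{ span.http.status_code >= 500 && duration > 1s }
```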
Context Propagation
W3C TraceContext
W3C TraceContext is a standard for propagating trace information through standardized HTTP headers.
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
          version        trace-id (32 hex)       parent-id (16 hex) flags
```
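The field layout above can be validated mechanically. A minimal stdlib parser (illustrative; real services would rely on the SDK's propagator rather than this helper):

```python
import re

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    m = TRACEPARENT_RE.match(header)
    if m is None:
        raise ValueError("malformed traceparent header")
    fields = m.groupdict()
    # All-zero trace-id or parent-id values are invalid per the W3C spec.
    if fields["trace_id"] == "0" * 32 or fields["parent_id"] == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id")
    # Bit 0 of the flags byte is the "sampled" flag.
    fields["sampled"] = int(fields["flags"], 16) & 0x01 == 1
    return fields

parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```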
```python
# Configure W3C TraceContext propagation in Python
from opentelemetry import propagate
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# W3C TraceContext (the default)
propagate.set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),  # W3C TraceContext
    ])
)

# Inject the context into an outgoing HTTP request
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds the traceparent and tracestate headers
response = requests.get("http://downstream-service/api/data", headers=headers)
```
B3 Propagation (Zipkin Compatible)
```python
# Configure B3 propagation (when Zipkin compatibility is required)
# pip install opentelemetry-propagator-b3
from opentelemetry import propagate
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagate.set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),  # W3C
        B3MultiFormat(),                  # B3 (Zipkin compatible)
    ])
)
```
Log and Metric Correlation
```python
# Include the trace ID in log records to correlate logs with traces
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        if span.is_recording():
            ctx = span.get_span_context()
            record.trace_id = format(ctx.trace_id, '032x')
            record.span_id = format(ctx.span_id, '016x')
        else:
            record.trace_id = '0' * 32
            record.span_id = '0' * 16
        return True

# Logging setup
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s'
))
handler.addFilter(TraceIdFilter())
logger = logging.getLogger(__name__)
logger.addHandler(handler)
```
eBPF-Based Zero-Code Instrumentation
With eBPF (extended Berkeley Packet Filter), tracing data can be collected at the kernel level without modifying application code. Grafana Beyla is a representative tool.
```yaml
# Deploy Grafana Beyla on Kubernetes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: beyla
spec:
  selector:
    matchLabels:
      app: beyla
  template:
    metadata:
      labels:
        app: beyla
    spec:
      hostPID: true
      containers:
        - name: beyla
          image: grafana/beyla:latest
          securityContext:
            privileged: true
          env:
            - name: BEYLA_OPEN_PORT
              value: '80,443,8080,3000'
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: 'http://otel-collector:4317'
            - name: BEYLA_SERVICE_NAMESPACE
              value: 'production'
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
```
The trade-offs of eBPF-based instrumentation:
- Advantages: no code changes required, language-independent, low overhead
- Disadvantages: business context (user IDs, etc.) cannot be attached, requires Linux kernel 4.18+, supports only certain protocols
Failure Cases and Recovery Procedures
Case 1: Context Loss at Async Boundaries
`asyncio.create_task` copies contextvars, so the active span follows plain asyncio tasks; the context is lost when work crosses a boundary that does not copy contextvars, such as `loop.run_in_executor` or a hand-rolled worker queue.
```python
# Problem: the trace context does not follow work sent to a thread pool
import asyncio
from opentelemetry import trace, context

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        loop = asyncio.get_running_loop()
        # Wrong: contextvars are not copied into the executor thread
        loop.run_in_executor(None, send_notification, order_id)  # context lost!

# Fix: capture the context and re-attach it inside the worker
async def process_order_fixed(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        ctx = context.get_current()
        loop = asyncio.get_running_loop()
        loop.run_in_executor(None, send_notification_with_context, order_id, ctx)

def send_notification_with_context(order_id: str, ctx):
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("send_notification"):
            # notification-sending logic
            pass
    finally:
        context.detach(token)
```
Case 2: Traces Lost Due to a Sampling Misconfiguration
The problem: head-based sampling was set to 0.1%, so most error traces were dropped as well. The fix is to combine full collection in the SDK with tail-based sampling in the Collector.
```python
# Problematic SDK setting: 0.1% is far too low
sampler = TraceIdRatioBased(0.001)

# Fix: record everything in the SDK and let the Collector decide
sampler = ParentBased(root=ALWAYS_ON)
```
```yaml
# Collector-side tail sampling guarantees error and latency traces
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency
        type: latency
        latency:
          threshold_ms: 500
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```
Case 3: Collector Out of Memory (OOM)
The problem: the Collector was OOM-killed during traffic spikes. The fix is to always include the memory_limiter processor, placed first in the processor chain.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024       # hard cap (alternatively, use limit_percentage)
    spike_limit_mib: 256  # headroom for spikes
  batch:
    send_batch_size: 512  # smaller batches
    timeout: 2s

service:
  pipelines:
    traces:
      # memory_limiter first; batch last
      processors: [memory_limiter, tail_sampling, batch]
```
Case 4: Mismatched Propagation Headers Between Services
If service A uses W3C TraceContext while service B uses the B3 format, the trace context breaks between them.
The fix is to use the same propagation format in every service, or to support multiple formats simultaneously with a CompositePropagator.
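Until every service is aligned, the receiving side can also bridge the two formats. A stdlib sketch of the idea (the header names are the real W3C/B3 ones; the helper itself is illustrative, not an OTel API):

```python
def extract_trace_id(headers: dict) -> "str | None":
    """Return the trace ID from either W3C or B3 headers, or None."""
    # W3C: traceparent = version-traceid-parentid-flags
    tp = headers.get("traceparent")
    if tp:
        parts = tp.split("-")
        if len(parts) == 4 and len(parts[1]) == 32:
            return parts[1]
    # B3 multi-header (Zipkin): the trace ID may be 64 or 128 bits
    b3 = headers.get("x-b3-traceid")
    if b3 and len(b3) in (16, 32):
        return b3.zfill(32)  # left-pad 64-bit IDs to a uniform 128 bits
    return None
```

In production this is exactly what registering both propagators in a CompositePropagator does, on both the extract and inject sides.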
Production Checklist
Instrumentation
- Verify that the OpenTelemetry SDK is installed in every service
- Verify that the service name, version, and environment are included as resource attributes
- Verify that custom spans cover the key business transactions
- Verify that sensitive data (passwords, tokens, etc.) does not appear in span attributes
- Verify that context propagates correctly across async boundaries
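For the sensitive-data item, a simple scrubbing helper can be applied before attributes reach a span. An illustrative stdlib sketch (in practice this logic would live in a SpanProcessor or in the Collector's attributes processor, as shown earlier):

```python
SENSITIVE_SUBSTRINGS = ("password", "token", "secret", "authorization", "api_key")

def scrub_attributes(attributes: dict) -> dict:
    """Redact values whose key contains a sensitive substring."""
    return {
        key: "[REDACTED]" if any(s in key.lower() for s in SENSITIVE_SUBSTRINGS) else value
        for key, value in attributes.items()
    }
```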
Collector
- Verify that the memory_limiter processor is placed first in the pipeline
- Verify that the batch processor's size and timeout are appropriate
- Verify that the health-check endpoint is configured
- Verify that the Collector's own metrics are monitored
- Verify that security-sensitive attributes are removed or hashed by the attributes processor
Sampling
- Verify that error traces are collected at 100%
- Verify that slow traces (SLO violations) are collected
- Verify that the sampling rate fits within the cost budget
- Verify that the combination of head-based and tail-based sampling is appropriate
Operations
- Verify that trace-log-metric correlation is configured
- Verify that the service map and dependency graph render in dashboards
- Verify that alert rules are tied to trace-based SLOs
- Verify that a retention period is set for trace data
- Verify that the context propagation format is consistent across all services
References
OpenTelemetry Distributed Tracing Practical Guide: Building and Operating Instrumentation, Collection, and Analysis Pipelines
- Introduction
- OpenTelemetry Architecture Overview
- Manual Instrumentation
- Auto-Instrumentation
- OpenTelemetry Collector Pipeline
- Sampling Strategies
- Backend Comparison
- Context Propagation
- eBPF-Based Zero-Code Instrumentation
- Failure Cases and Recovery Procedures
- Production Checklist
- References

Introduction
In a microservices architecture, a single user request can traverse dozens of services before being fulfilled. Distributed tracing is essential to identify which service caused latency or where errors propagated along the call chain. OpenTelemetry (OTel) is a CNCF graduated project that provides a unified API and SDK for collecting traces, metrics, and logs as an observability standard.
This article covers everything needed for production operations, from OpenTelemetry's architecture and core concepts, to language-specific instrumentation in Python/Node.js/Go, Collector pipeline configuration, sampling strategies, backend comparison, and failure cases with checklists encountered in production environments.
OpenTelemetry Architecture Overview
Core Components
OpenTelemetry consists of the following components:
- API: A vendor-neutral instrumentation interface. Used by library developers.
- SDK: Concrete implementation of the API. Handles sampling, batching, and exporting.
- Collector: A standalone process that receives, processes, and exports telemetry data.
- Exporters: Modules that send collected data to backends such as Jaeger, Tempo, or Datadog.
- Instrumentation Libraries: Framework-specific libraries supporting automatic instrumentation.
Trace Model
The core concepts of distributed tracing are as follows:
| Concept | Description |
|---|---|
| Trace | The complete path of a single request. Composed of multiple Spans |
| Span | An individual unit of work within a trace |
| SpanContext | Context containing SpanID, TraceID, TraceFlags, TraceState |
| TraceID | 128-bit ID that uniquely identifies a trace |
| SpanID | 64-bit ID that uniquely identifies a span |
| Parent Span | The parent span that created the current span |
| Baggage | Key-value pairs propagated across the entire trace |
| Attributes | Metadata (key-value pairs) attached to a span |
| Events | Point-in-time events within a span (similar to logs) |
| Links | Causal relationship connections to other traces/spans |
Manual Instrumentation
Python Instrumentation
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes
# Define resource
resource = Resource.create({
ResourceAttributes.SERVICE_NAME: "order-service",
ResourceAttributes.SERVICE_VERSION: "1.2.0",
ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})
# Configure TracerProvider
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Create Tracer
tracer = trace.get_tracer("order-service", "1.2.0")
# Usage example: Order processing
def create_order(customer_id: str, items: list) -> dict:
with tracer.start_as_current_span(
"create_order",
attributes={
"customer.id": customer_id,
"order.item_count": len(items),
},
) as span:
try:
# Check inventory
with tracer.start_as_current_span("check_inventory") as inventory_span:
available = check_inventory(items)
inventory_span.set_attribute("inventory.all_available", available)
if not available:
span.set_status(trace.StatusCode.ERROR, "Inventory not available")
raise ValueError("Some items are out of stock")
# Process payment
with tracer.start_as_current_span("process_payment") as payment_span:
payment_result = process_payment(customer_id, items)
payment_span.set_attribute("payment.transaction_id", payment_result["tx_id"])
payment_span.add_event("payment_completed", {
"amount": payment_result["amount"],
"currency": "KRW",
})
# Save order
with tracer.start_as_current_span("save_order"):
order = save_to_database(customer_id, items, payment_result)
span.set_attribute("order.id", order["id"])
return order
except Exception as e:
span.set_status(trace.StatusCode.ERROR, str(e))
span.record_exception(e)
raise
Node.js Instrumentation
// npm install @opentelemetry/api @opentelemetry/sdk-node
// npm install @opentelemetry/exporter-trace-otlp-grpc
// npm install @opentelemetry/semantic-conventions
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc')
const { Resource } = require('@opentelemetry/resources')
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions')
const { trace, SpanStatusCode } = require('@opentelemetry/api')
// Initialize SDK
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'user-service',
[ATTR_SERVICE_VERSION]: '2.1.0',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
})
sdk.start()
const tracer = trace.getTracer('user-service', '2.1.0')
// Usage example: User lookup
async function getUser(userId) {
return tracer.startActiveSpan('getUser', async (span) => {
try {
span.setAttribute('user.id', userId)
// Database query
const user = await tracer.startActiveSpan('db.query', async (dbSpan) => {
dbSpan.setAttribute('db.system', 'postgresql')
dbSpan.setAttribute('db.statement', 'SELECT * FROM users WHERE id = ?')
const result = await db.query('SELECT * FROM users WHERE id = $1', [userId])
dbSpan.setAttribute('db.row_count', result.rows.length)
dbSpan.end()
return result.rows[0]
})
if (!user) {
span.setStatus({ code: SpanStatusCode.ERROR, message: 'User not found' })
return null
}
// Cache update
await tracer.startActiveSpan('cache.set', async (cacheSpan) => {
cacheSpan.setAttribute('cache.system', 'redis')
cacheSpan.setAttribute('cache.key', `user:${userId}`)
await redis.set(`user:${userId}`, JSON.stringify(user), 'EX', 3600)
cacheSpan.end()
})
span.setStatus({ code: SpanStatusCode.OK })
return user
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
span.recordException(error)
throw error
} finally {
span.end()
}
})
}
Go Instrumentation
package main
import (
"context"
"log"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
"go.opentelemetry.io/otel/trace"
)
func initTracer() (*sdktrace.TracerProvider, error) {
exporter, err := otlptracegrpc.New(
context.Background(),
otlptracegrpc.WithEndpoint("otel-collector:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
res, err := resource.New(
context.Background(),
resource.WithAttributes(
semconv.ServiceName("payment-service"),
semconv.ServiceVersion("3.0.1"),
semconv.DeploymentEnvironmentKey.String("production"),
),
)
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
)
otel.SetTracerProvider(tp)
return tp, nil
}
var tracer = otel.Tracer("payment-service")
func ProcessPayment(ctx context.Context, orderID string, amount float64) error {
ctx, span := tracer.Start(ctx, "ProcessPayment",
trace.WithAttributes(
attribute.String("order.id", orderID),
attribute.Float64("payment.amount", amount),
),
)
defer span.End()
// Fraud detection check
ctx, fraudSpan := tracer.Start(ctx, "fraud_detection")
isFraud, err := checkFraud(ctx, orderID, amount)
if err != nil {
fraudSpan.SetStatus(codes.Error, err.Error())
fraudSpan.RecordError(err)
fraudSpan.End()
return err
}
fraudSpan.SetAttributes(attribute.Bool("fraud.detected", isFraud))
fraudSpan.End()
if isFraud {
span.SetStatus(codes.Error, "Fraud detected")
return fmt.Errorf("fraud detected for order %s", orderID)
}
// Payment gateway call
ctx, gwSpan := tracer.Start(ctx, "payment_gateway_call")
txID, err := callPaymentGateway(ctx, amount)
if err != nil {
gwSpan.SetStatus(codes.Error, err.Error())
gwSpan.RecordError(err)
gwSpan.End()
return err
}
gwSpan.SetAttributes(attribute.String("payment.transaction_id", txID))
gwSpan.End()
span.SetStatus(codes.Ok, "Payment processed successfully")
return nil
}
Auto-Instrumentation
Python Auto-Instrumentation
# Install auto-instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run with environment variable configuration
OTEL_SERVICE_NAME=order-service \
OTEL_TRACES_EXPORTER=otlp \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_PYTHON_LOG_CORRELATION=true \
opentelemetry-instrument python app.py
Node.js Auto-Instrumentation
// tracing.js - Load before app startup
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc')
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': {
ignoreIncomingPaths: ['/health', '/ready'],
},
'@opentelemetry/instrumentation-express': {
enabled: true,
},
'@opentelemetry/instrumentation-pg': {
enabled: true,
enhancedDatabaseReporting: true,
},
}),
],
})
sdk.start()
process.on('SIGTERM', () => {
sdk.shutdown().then(() => process.exit(0))
})
# Run app with auto-instrumentation
node --require ./tracing.js app.js
OpenTelemetry Collector Pipeline
Collector Architecture
The Collector consists of three components:
- Receivers: Entry points that receive telemetry data. Supports protocols like OTLP, Jaeger, Zipkin.
- Processors: Transform, filter, and batch data. Add/remove attributes, apply sampling, etc.
- Exporters: Send processed data to backends. Jaeger, Tempo, Datadog, etc.
Collector Configuration Example
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
# Can also receive Jaeger format
jaeger:
protocols:
thrift_http:
endpoint: 0.0.0.0:14268
processors:
# Batch processing for network efficiency
batch:
send_batch_size: 1024
send_batch_max_size: 2048
timeout: 5s
# Limit memory usage
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
# Add resource attributes
resource:
attributes:
- key: environment
value: production
action: upsert
- key: cluster
value: ap-northeast-2-prod
action: upsert
# Remove sensitive attributes (cost reduction)
attributes:
actions:
- key: http.request.header.authorization
action: delete
- key: db.statement
action: hash # Hash SQL queries (security)
# Tail-based sampling
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
- name: errors-policy
type: status_code
status_code:
status_codes:
- ERROR
- name: slow-traces-policy
type: latency
latency:
threshold_ms: 1000
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 10
exporters:
# Send to Grafana Tempo
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
# Send to Jaeger
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
# Debug log output
debug:
verbosity: detailed
service:
pipelines:
traces:
receivers: [otlp, jaeger]
processors: [memory_limiter, resource, attributes, tail_sampling, batch]
exporters: [otlp/tempo, debug]
telemetry:
logs:
level: info
metrics:
address: 0.0.0.0:8888
Collector Deployment Modes
# Deploy Collector with Docker Compose
version: '3.8'
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:0.96.0
command: ['--config=/etc/otel/config.yaml']
volumes:
- ./otel-collector-config.yaml:/etc/otel/config.yaml
ports:
- '4317:4317' # OTLP gRPC
- '4318:4318' # OTLP HTTP
- '8888:8888' # Prometheus metrics
- '8889:8889' # Prometheus exporter
- '13133:13133' # Health check
deploy:
resources:
limits:
memory: 1G
cpus: '1.0'
Sampling Strategies
Head-based vs Tail-based Sampling
| Characteristic | Head-based | Tail-based |
|---|---|---|
| Decision Point | At trace start | After trace completion |
| Based On | TraceID hash | Complete trace data |
| Advantages | Low overhead, simple to implement | Guaranteed capture of error/latency traces |
| Disadvantages | May miss important traces | High memory usage, complex |
| Best For | High traffic, cost-sensitive | Debugging-focused, quality-first |
| Implementation | SDK (client-side) | Collector (server-side) |
Sampling Configuration Examples
# Head-based sampling in Python SDK
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
TraceIdRatioBased,
ParentBased,
ALWAYS_ON,
ALWAYS_OFF,
)
# 10% probability sampling (follows parent span decision)
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(
resource=resource,
sampler=sampler,
)
# Tail-based sampling in Collector
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      # Collect 100% of error traces
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Collect 100% of traces exceeding 1 second
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      # Collect 50% of traces from critical services
      # (an `and` policy: both sub-policies must match)
      - name: critical-service
        type: and
        and:
          and_sub_policy:
            - name: match-critical
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service, auth-service]
                enabled_regex_matching: false
            - name: sample-critical
              type: probabilistic
              probabilistic:
                sampling_percentage: 50
      # Collect 5% of remaining traces
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
Backend Comparison
| Item | Jaeger | Grafana Tempo | Zipkin | Datadog | New Relic |
|---|---|---|---|---|---|
| License | Apache 2.0 | AGPLv3 | Apache 2.0 | Commercial | Commercial |
| Storage | Cassandra, ES, Memory | Object Storage (S3 etc.) | Cassandra, ES, MySQL | Proprietary | Proprietary |
| Query Language | Built-in UI/API | TraceQL | Built-in UI/API | Built-in query | NRQL |
| Cost | Free (infra costs) | Free (infra costs) | Free (infra costs) | Per-trace billing | Per-trace billing |
| Scaling | Horizontal scaling | Excellent | Limited | Automatic | Automatic |
| OTel Support | Native | Native | Native | Native | Native |
| Logs/Metrics Integration | Limited | Grafana stack integration | Limited | Full integration | Full integration |
| Operational Complexity | Moderate | Low (object storage) | Low | None (SaaS) | None (SaaS) |
Context Propagation
W3C TraceContext
W3C TraceContext is the standard for propagating trace information via HTTP headers.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
         version        trace-id (32 hex)        parent-id (16 hex) flags
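The four dash-separated fields can be pulled apart with plain string splitting; a minimal parser sketch (`parse_traceparent` is a hypothetical helper, not an OTel API):

```python
# Split a W3C traceparent header into its four dash-separated fields.
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,      # currently always "00"
        "trace_id": trace_id,    # 32 hex chars (128-bit)
        "parent_id": parent_id,  # 16 hex chars (64-bit span ID)
        "sampled": bool(int(flags, 16) & 0x01),  # low flag bit = sampled
    }

tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
assert tp["trace_id"] == "4bf92f3577b34da6a3ce929d0e0e4736"
assert tp["sampled"] is True
```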
# W3C TraceContext propagation setup in Python
from opentelemetry import propagate
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# W3C TraceContext (default)
propagate.set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),  # W3C TraceContext
    ])
)

# Inject context into HTTP request
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Automatically adds traceparent, tracestate headers
response = requests.get("http://downstream-service/api/data", headers=headers)
B3 Propagation (Zipkin Compatible)
# B3 propagation setup (when Zipkin compatibility is needed)
# pip install opentelemetry-propagator-b3
from opentelemetry.propagators.b3 import B3MultiFormat

propagate.set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),  # W3C
        B3MultiFormat(),                  # B3 (Zipkin compatible)
    ])
)
Log and Metric Correlation
# Include trace ID in logs for correlation
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        if span.is_recording():
            ctx = span.get_span_context()
            record.trace_id = format(ctx.trace_id, '032x')
            record.span_id = format(ctx.span_id, '016x')
        else:
            record.trace_id = '0' * 32
            record.span_id = '0' * 16
        return True

# Log configuration
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s'
))
handler.addFilter(TraceIdFilter())
logger = logging.getLogger(__name__)
logger.addHandler(handler)
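The '032x'/'016x' format specs zero-pad the IDs to the exact widths used in the traceparent header, so log lines and trace UI links match byte for byte. A quick check:

```python
# 128-bit trace IDs and 64-bit span IDs rendered as fixed-width
# lowercase hex, matching the traceparent header fields.
trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
span_id = 0x00F067AA0BA902B7

assert format(trace_id, '032x') == '4bf92f3577b34da6a3ce929d0e0e4736'
assert format(span_id, '016x') == '00f067aa0ba902b7'  # leading zeros kept
```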
eBPF-Based Zero-Code Instrumentation
Using eBPF (extended Berkeley Packet Filter), tracing data can be collected at the kernel level without modifying application code. Grafana Beyla is a representative tool for this approach.
# Deploy Grafana Beyla on Kubernetes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: beyla
spec:
  selector:
    matchLabels:
      app: beyla
  template:
    metadata:
      labels:
        app: beyla
    spec:
      hostPID: true
      containers:
        - name: beyla
          image: grafana/beyla:latest
          securityContext:
            privileged: true
          env:
            - name: BEYLA_OPEN_PORT
              value: '80,443,8080,3000'
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: 'http://otel-collector:4317'
            - name: BEYLA_SERVICE_NAMESPACE
              value: 'production'
          volumeMounts:
            - name: sys-kernel
              mountPath: /sys/kernel
      volumes:
        - name: sys-kernel
          hostPath:
            path: /sys/kernel
Pros and cons of eBPF-based instrumentation:
- Advantages: No code changes required, language-independent, low overhead
- Disadvantages: Cannot add business context (user IDs, etc.), requires Linux kernel 4.18+, supports limited protocols only
Failure Cases and Recovery Procedures
Case 1: Context Loss Across Thread and Async Boundaries
# Problem: trace context is lost when work crosses into a thread pool.
# (asyncio.create_task copies the contextvars-based OTel context
#  automatically, but plain threads and executors do not.)
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import trace, context

executor = ThreadPoolExecutor(max_workers=4)

def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        # Wrong: the worker thread starts with an empty context
        executor.submit(send_notification, order_id)  # Context lost!

# Fix: Explicitly capture the context and attach it in the worker
def process_order_fixed(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        ctx = context.get_current()
        executor.submit(send_notification_with_context, order_id, ctx)

def send_notification_with_context(order_id: str, ctx):
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("send_notification"):
            # Notification sending logic
            pass
    finally:
        context.detach(token)
Case 2: Trace Loss Due to Sampling Misconfiguration
# Problem: Head-based sampling at 0.1% drops most error traces too
# SDK configuration
sampler: TraceIdRatioBased(0.001)  # 0.1% - too low

# Fix: Use ParentBased + tail_sampling combination
# Collect everything in SDK
sampler: ParentBased(root=ALWAYS_ON)

# Use tail-based sampling in Collector to guarantee error/latency capture
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency
        type: latency
        latency:
          threshold_ms: 500
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
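The arithmetic behind this failure mode is worth spelling out. Assuming 1,000 req/s and a 0.5% error rate (illustrative numbers, not from any specific incident):

```python
# Expected error traces captured per second under 0.1% head sampling.
req_per_sec = 1_000
error_rate = 0.005   # 0.5% of requests fail
head_rate = 0.001    # 0.1% head-based sampling

errors_kept_per_sec = req_per_sec * error_rate * head_rate
assert abs(errors_kept_per_sec - 0.005) < 1e-9
# ~0.005/s -> roughly one error trace every 200 seconds, and which
# trace is kept is unrelated to whether it errored.
```

Tail-based sampling inverts this: the error policy sees the completed trace's status, so every error trace qualifies regardless of how rare errors are.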
Case 3: Collector Out of Memory (OOM)
# Problem: Collector terminates with OOM during traffic spikes
# Fix: Always add the memory_limiter processor
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024        # Maximum memory usage
    spike_limit_mib: 256   # Spike allowance
    # Alternative for containers: limit_percentage: 80
    # (limit_mib takes precedence when both are set)
  batch:
    send_batch_size: 512   # Reduce batch size
    timeout: 2s

service:
  pipelines:
    traces:
      # Place memory_limiter at the front of the processor chain
      processors: [memory_limiter, batch, tail_sampling]
Case 4: Propagation Header Mismatch Between Services
When Service A uses W3C TraceContext and Service B uses B3 format, the context breaks.
The solution is to either use the same propagation format across all services, or support multiple formats simultaneously using CompositePropagator.
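A toy illustration of the mismatch (hypothetical header dicts; the point is that a B3-only extractor simply never looks at `traceparent`):

```python
# Service A emits W3C headers; a B3-only service B reads X-B3-* only.
outgoing = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}

def b3_only_extract(headers: dict):
    # B3 multi-format looks for X-B3-TraceId / X-B3-SpanId;
    # the W3C traceparent header is ignored entirely.
    return headers.get("X-B3-TraceId")

assert b3_only_extract(outgoing) is None  # service B starts a brand-new trace
```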
Production Checklist
Instrumentation
- Verify OpenTelemetry SDK is installed in all services
- Verify service name, version, and environment are included in resource attributes
- Verify custom spans are added for key business transactions
- Verify sensitive information (passwords, tokens, etc.) is not included in span attributes
- Verify context is correctly propagated across async boundaries
Collector
- Verify memory_limiter processor is placed first in the pipeline
- Verify batch processor size and timeout are appropriate
- Verify health check endpoint is configured
- Verify the Collector's own metrics are being monitored
- Verify security-sensitive attributes are removed/hashed by the attributes processor
Sampling
- Verify error traces are collected at 100%
- Verify latency traces (SLO violations) are collected
- Verify sampling rate is within cost budget
- Verify the combination of head-based and tail-based sampling is appropriate
Operations
- Verify trace-log-metric correlation is configured
- Verify service maps and dependency graphs display in dashboards
- Verify alert rules are linked to trace-based SLOs
- Verify trace data retention period is configured
- Verify context propagation format is consistent across all services