OpenTelemetry Distributed Tracing in Practice: From Instrumentation to Jaeger Visualization

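Introduction
Core Concepts
- Trace, Span, Context
- Context Propagation
Python Instrumentation
Java/Spring Boot Instrumentation
- Spring Boot Auto-Configuration
OTel Collector Configuration
- Kubernetes Deployment
Sampling Strategies
Using the Jaeger UI
Best Practices
Quiz
Summary
References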

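Introduction

In a microservices environment, a single user request can pass through dozens of services. Distributed tracing is essential for finding where a bottleneck occurs and which service is slow.

**OpenTelemetry (OTel)** is the CNCF observability standard: a vendor-neutral framework that unifies traces, metrics, and logs.

Core Concepts

Trace, Span, Context

Trace (the full journey of one request):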
├── Span A: API Gateway (50ms)
│   ├── Span B: Auth Service (10ms)
│   ├── Span C: Order Service (35ms)
│   │   ├── Span D: Database Query (15ms)
│   │   └── Span E: Payment Service (18ms)
│   │       └── Span F: Bank API Call (12ms)
│   └── Span G: Notification Service (5ms, async)
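
To read a tree like this one, it helps to separate the time a span spends in its own code from the time it spends waiting on children. The sketch below transcribes the diagram and computes that "self time" per span. The `SPANS` table, the `is_async` flag, and the assumption that synchronous children run one after another inside their parent are illustrative choices, not something a tracing backend reports in this form.

```python
# Spans transcribed from the diagram: name -> (parent, duration_ms, is_async).
# Assumption: synchronous children run sequentially inside their parent.
SPANS = {
    "A": (None, 50, False),  # API Gateway
    "B": ("A", 10, False),   # Auth Service
    "C": ("A", 35, False),   # Order Service
    "D": ("C", 15, False),   # Database Query
    "E": ("C", 18, False),   # Payment Service
    "F": ("E", 12, False),   # Bank API Call
    "G": ("A", 5, True),     # Notification Service (async)
}


def self_time_ms(name: str) -> int:
    """Duration of a span minus the time spent in its synchronous children."""
    _, duration, _ = SPANS[name]
    child_total = sum(
        child_duration
        for parent, child_duration, is_async in SPANS.values()
        if parent == name and not is_async
    )
    return duration - child_total


for span_name in SPANS:
    print(span_name, self_time_ms(span_name))
```

Under these assumptions the self times of the synchronous spans add up to the 50 ms of the root span, and the database query (D) and the bank API call (F) are the largest contributors.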

# Structure of a span
{
    "traceId": "abc123def456...",        # Unique ID for the whole trace (128 bits)
    "spanId": "span789...",              # Unique ID for this span (64 bits)
    "parentSpanId": "parent456...",      # ID of the parent span
    "name": "POST /api/orders",          # Span name
    "kind": "SERVER",                    # CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL
    "startTime": "2026-03-03T12:00:00Z",
    "endTime": "2026-03-03T12:00:00.050Z",
    "status": {"code": "OK"},
    "attributes": {                       # Metadata
        "http.method": "POST",
        "http.url": "/api/orders",
        "http.status_code": 201,
        "service.name": "order-service"
    },
    "events": [                           # Events within the span
        {
            "name": "order.validated",
            "timestamp": "2026-03-03T12:00:00.010Z",
            "attributes": {"order_id": "ORD-123"}
        }
    ]
}
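
The timestamps in a span record are what a UI turns into bar lengths and event markers. As a small sketch, the code below derives the span duration and the offset of each event from the `startTime`, `endTime`, and event `timestamp` fields shown above. The helper names `parse_ts` and `elapsed_ms` are hypothetical, and the span dictionary is a trimmed copy of the example.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2026-03-03T12:00:00.050Z."""
    # Swap the trailing "Z" for an explicit offset so older Python versions accept it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed_ms(start: str, end: str) -> int:
    """Whole milliseconds between two timestamps."""
    return (parse_ts(end) - parse_ts(start)) // timedelta(milliseconds=1)


span = {
    "name": "POST /api/orders",
    "startTime": "2026-03-03T12:00:00Z",
    "endTime": "2026-03-03T12:00:00.050Z",
    "events": [
        {"name": "order.validated", "timestamp": "2026-03-03T12:00:00.010Z"},
    ],
}

duration = elapsed_ms(span["startTime"], span["endTime"])
event_offsets = {
    event["name"]: elapsed_ms(span["startTime"], event["timestamp"])
    for event in span["events"]
}
print(duration, event_offsets)
```

For the example record this yields a 50 ms span with the `order.validated` event 10 ms after the start.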

Context Propagation

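Propagating context between services:

[Service A]                    [Service B]
    │                              │
    │ traceparent: 00-abc123-span1-01
    │ ─────────────────────────>   │
    │                              │ (creates a child span with the same traceId)
    │                              │
    │                              │  traceparent: 00-abc123-span2-01
    │                              │  ──────────> [Service C]

(The IDs in the diagram are shortened for readability.)

HTTP header (W3C Trace Context):
traceparent: 00-{traceId}-{spanId}-{flags}
Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The traceId is 32 lowercase hex characters (128 bits), the spanId is 16 hex characters (64 bits), and the flags field is 2 hex characters; flags 01 means the trace is sampled.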
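
To make the header format concrete, here is a minimal sketch of what a service does with `traceparent`: parse the incoming value, then build the outgoing value with the same trace ID and a new span ID. It handles only version `00`; the names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical, and in a real service the OpenTelemetry propagators do this work.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is not a valid version-00 value."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are reserved as invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Build the header sent downstream: same trace ID, new span ID."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming, "b7ad6b7169203331")
print(outgoing)
```

The sampled flag is copied from the parent, which is how a sampling decision made at the edge carries through every downstream service.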

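Python Instrumentation

Automatic Instrumentation (Zero-Code)

# Install dependencies
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install  # Install instrumentation packages for the libraries found in the environment

# Run the app with automatic instrumentation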
opentelemetry-instrument \
  --service_name order-service \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py

Manual Instrumentation (Fine-Grained Control)

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes

# 1. Configure the TracerProvider
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "order-service",
    ResourceAttributes.SERVICE_VERSION: "1.2.0",
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# 2. Create a tracer
tracer = trace.get_tracer("order-service", "1.2.0")

# 3. Create and use spans
@tracer.start_as_current_span("create_order")
def create_order(order_data: dict) -> dict:
    span = trace.get_current_span()

    # Add attributes
    span.set_attribute("order.customer_id", order_data["customer_id"])
    span.set_attribute("order.total_amount", order_data["total"])
    span.set_attribute("order.item_count", len(order_data["items"]))

    # Add an event
    span.add_event("order.validation_started")

    # Validation
    validate_order(order_data)
    span.add_event("order.validation_completed")

    # Payment processing (process_payment is decorated, so it opens a child span)
    payment_result = process_payment(order_data)

    # Record the result
    span.set_attribute("order.id", payment_result["order_id"])
    span.set_status(trace.StatusCode.OK)

    return payment_result


@tracer.start_as_current_span("process_payment")
def process_payment(order_data: dict) -> dict:
    span = trace.get_current_span()
    span.set_attribute("payment.method", order_data.get("payment_method", "card"))

    try:
        result = payment_client.charge(order_data)
        span.set_attribute("payment.transaction_id", result["txn_id"])
        return result
    except Exception as e:
        span.set_status(trace.StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise

FastAPI Integration

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()

# Apply automatic instrumentation
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()  # Outbound HTTP calls
SQLAlchemyInstrumentor().instrument(engine=db_engine)  # DB queries

@app.post("/api/orders")
async def create_order(request: Request, order: OrderRequest):
    # Add business context to the current span
    span = trace.get_current_span()
    span.set_attribute("order.customer_id", order.customer_id)
    span.set_attribute("order.region", order.shipping_region)

    result = await order_service.create(order)
    return result

Java/Spring Boot Instrumentation

Spring Boot Auto-Configuration

<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.11.0</version>
</dependency>

# application.yml
otel:
  service:
    name: order-service
  exporter:
    otlp:
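      endpoint: http://otel-collector:4317
      protocol: grpc # 4317 is the gRPC port; with the default http/protobuf protocol, use port 4318 instead
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: '0.1' # 10% sampling (production)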

@RestController
@RequiredArgsConstructor
public class OrderController {

    private final Tracer tracer;
    private final OrderService orderService;

    @PostMapping("/api/orders")
    public ResponseEntity<OrderResponse> createOrder(@RequestBody OrderRequest request) {
        Span span = Span.current();
        span.setAttribute("order.customer_id", request.getCustomerId());
        span.setAttribute("order.total", request.getTotal().doubleValue());

        OrderResponse response = orderService.create(request);

        span.setAttribute("order.id", response.getOrderId());
        return ResponseEntity.status(HttpStatus.CREATED).body(response);
    }
}

// Creating a span manually
@Service
public class PaymentService {

    @Autowired
    private Tracer tracer;

    public PaymentResult processPayment(Order order) {
        Span span = tracer.spanBuilder("process_payment")
            .setAttribute("payment.amount", order.getTotal().doubleValue())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            PaymentResult result = paymentGateway.charge(order);
            span.setAttribute("payment.txn_id", result.getTransactionId());
            return result;
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

  memory_limiter:
    check_interval: 1s
    limit_mib: 400 # keep below the container memory limit (512Mi in the Kubernetes manifest)
    spike_limit_mib: 128

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, tail_sampling, batch] # batch goes after sampling
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
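
The three `tail_sampling` policies in this configuration combine as "keep the trace if any policy matches". The sketch below models that decision in plain Python. It is a simplified illustration, not the Collector's code: `SpanSummary` and `keep_trace` are hypothetical names, and the random source stands in for the Collector's probabilistic policy so the example can be driven deterministically.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SpanSummary:
    status: str    # "OK", "ERROR" or "UNSET"
    start_ms: int  # start offset in milliseconds
    end_ms: int    # end offset in milliseconds


def keep_trace(
    spans: Sequence[SpanSummary],
    threshold_ms: int = 1000,
    sampling_percentage: float = 10.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep the trace if any one of the three policies matches."""
    has_error = any(s.status == "ERROR" for s in spans)
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    is_slow = duration_ms > threshold_ms
    picked_at_random = rng() * 100 < sampling_percentage
    return has_error or is_slow or picked_at_random


def never() -> float:
    """Random source that never triggers the 10% policy."""
    return 0.999


print(keep_trace([SpanSummary("ERROR", 0, 40)], rng=never))  # error policy
print(keep_trace([SpanSummary("OK", 0, 1500)], rng=never))   # latency policy
print(keep_trace([SpanSummary("OK", 0, 40)], rng=never))     # no policy matches
```

Because the decision needs every span of the trace, the Collector has to buffer spans for `decision_wait` before it can evaluate the policies.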

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2 # note: tail_sampling only sees a whole trace if all of its spans reach the same replica
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector # must match spec.selector.matchLabels
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ['--config=/conf/config.yaml']
          ports:
            - containerPort: 4317 # gRPC
            - containerPort: 4318 # HTTP
            - containerPort: 8889 # Prometheus metrics
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: grpc
      port: 4317
    - name: http
      port: 4318

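Sampling Strategies

from opentelemetry.sdk.trace.sampling import (
    TraceIdRatioBased,
    ParentBasedTraceIdRatio,
    ALWAYS_ON,
    ALWAYS_OFF,
    Decision,
    Sampler,
    SamplingResult,
)

# Development: collect everything
dev_sampler = ALWAYS_ON

# Production: 10% sampling (follows the parent span's decision when there is one)
prod_sampler = ParentBasedTraceIdRatio(0.1)

# Custom sampler: 100% of spans flagged as errors at creation, 5% of the rest.
# A head sampler runs when the span starts, so it only sees attributes passed at
# creation time; errors that occur later need tail sampling in the Collector.
class SmartSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        # Always keep spans created with error=True
        if attributes and attributes.get("error") is True:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Keep roughly 5% of everything else
        if trace_id % 100 < 5:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "SmartSampler"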
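
Ratio-based head sampling works across services because the decision is a pure function of the trace ID: every service that sees the same ID reaches the same answer without coordination. The sketch below shows one common way to do this, comparing the low 64 bits of the ID against a bound derived from the rate. The constant and function names are assumptions for this illustration and are not quoted from the SDK.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask selecting the low 64 bits of a trace ID


def ratio_bound(rate: float) -> int:
    """Upper bound (exclusive) on the masked trace ID for a sampled trace."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    return round(rate * (TRACE_ID_LIMIT + 1))


def is_sampled(trace_id: int, rate: float) -> bool:
    """Deterministic decision: the same trace ID always gives the same result."""
    return (trace_id & TRACE_ID_LIMIT) < ratio_bound(rate)


# Simulate 100,000 random 128-bit trace IDs and measure how many are kept at 10%.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(is_sampled(trace_id, 0.1) for trace_id in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

With uniformly random trace IDs the kept fraction lands close to the configured rate, and raising the rate only ever adds traces to the kept set.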

Using the Jaeger UI

# Start Jaeger with Docker
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:1.62

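# Open http://localhost:16686 in a browser

Main features:

1. Trace search
   - Select a Service → filter by Operation → set a Duration range
   - Search by tags: http.status_code=500

2. Trace timeline (waterfall)
   - Visualizes the start and end time of each span
   - Distinguishes parallel from sequential work
   - Shows the attributes, events, and logs of each span in detail

3. Service Dependency Graph
   - Visualizes the call relationships between services
   - Shows call frequency and error rate

4. Compare
   - Compares two traces to analyze performance differences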
Best Practices

1. Use Semantic Conventions

from opentelemetry.semconv.trace import SpanAttributes

# Use standard attributes (compatibility across vendors)
span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
span.set_attribute(SpanAttributes.HTTP_URL, "/api/orders")
span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 201)
span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
span.set_attribute(SpanAttributes.DB_STATEMENT, "SELECT * FROM orders")
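
The `SpanAttributes` constants above resolve to the older HTTP attribute names (`http.method`, `http.url`, `http.status_code`), while the stable HTTP semantic conventions use `http.request.method`, `url.full`, and `http.response.status_code`. When both generations of data end up in the same backend, a small normalization step keeps queries consistent. The sketch below covers only those three HTTP attributes; `LEGACY_TO_STABLE` and `to_stable_names` are hypothetical names.

```python
# Legacy HTTP attribute names and their stable replacements.
LEGACY_TO_STABLE = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
}


def to_stable_names(attributes: dict) -> dict:
    """Return a copy of the attributes with legacy HTTP keys renamed."""
    return {LEGACY_TO_STABLE.get(key, key): value for key, value in attributes.items()}


legacy = {
    "http.method": "POST",
    "http.url": "https://shop.example.com/api/orders",  # hypothetical URL
    "http.status_code": 201,
    "order.region": "jp-east",  # custom attribute, left untouched
}
print(to_stable_names(legacy))
```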

2. Remove Sensitive Information

# Remove PII in the Collector
processors:
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash  # Replace the SQL text with its hash
      - key: user.email
        action: delete
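
To show what the `delete` and `hash` actions do to a span's attributes, here is a plain-Python sketch of the same transformation. It is only an illustration of the effect: the function name `scrub` is hypothetical, and the choice of SHA-256 is an assumption of this sketch rather than a statement about which digest the Collector applies.

```python
import hashlib
from typing import Any, Dict, Iterable


def scrub(attributes: Dict[str, Any], delete_keys: Iterable[str],
          hash_keys: Iterable[str]) -> Dict[str, Any]:
    """Drop the delete_keys and replace the values of hash_keys with a digest."""
    to_delete, to_hash = set(delete_keys), set(hash_keys)
    cleaned = {}
    for key, value in attributes.items():
        if key in to_delete:
            continue  # attribute is removed entirely
        if key in to_hash:
            # Assumption: SHA-256 is used here purely for illustration.
            value = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        cleaned[key] = value
    return cleaned


raw = {
    "http.request.header.authorization": "Bearer secret-token",
    "db.statement": "SELECT * FROM orders WHERE email = 'a@example.com'",
    "user.email": "a@example.com",
    "order.id": "ORD-123",
}
safe = scrub(
    raw,
    delete_keys=["http.request.header.authorization", "user.email"],
    hash_keys=["db.statement"],
)
print(safe)
```

Hashing keeps identical statements groupable in the backend without storing their text, while deleted attributes never leave the Collector.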

3. Record Errors

try:
    result = external_api.call()
except Exception as e:
    span.set_status(trace.StatusCode.ERROR, str(e))
    span.record_exception(e)
    # record_exception automatically includes the stack trace
    raise

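Quiz

Q1. How do Trace, Span, and Context relate to each other?

A trace is the full path of one request, a span is one unit of work within it, and context is the mechanism that propagates trace and span information between services.

Q2. What is the format of the W3C Trace Context traceparent header?

{version}-{traceId}-{spanId}-{flags}. Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Q3. What is the role of the OpenTelemetry Collector?

It runs the Receive → Process (processing, filtering, sampling) → Export pipeline. It is the intermediate layer between applications and backends such as Jaeger and Tempo.

Q4. What is the difference between tail sampling and head sampling?

Head sampling decides whether to keep a trace when it starts; tail sampling decides after the trace has completed, with all of its information available. Tail sampling allows finer control, such as keeping 100% of traces that contain errors.

Q5. Why should the sampling rate not be set to 100% in production?

The volume of trace data becomes very large, which raises storage cost and network load and can overload the Collector. A rate of 5 to 10% is a common starting point, adjusted to the traffic volume.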
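
A quick back-of-the-envelope calculation makes the volume argument concrete. The workload figures below (1,000 requests per second, 20 spans per trace, 500 bytes per span) are hypothetical values chosen for illustration, not measurements, and `daily_trace_bytes` is a helper name invented for this sketch.

```python
SECONDS_PER_DAY = 86_400


def daily_trace_bytes(requests_per_second: int, spans_per_trace: int,
                      bytes_per_span: int, sampling_percent: int) -> int:
    """Bytes of span data exported per day at the given sampling percentage."""
    total = requests_per_second * spans_per_trace * bytes_per_span * SECONDS_PER_DAY
    return total * sampling_percent // 100


# Hypothetical workload: 1,000 req/s, 20 spans per trace, 500 bytes per span.
for percent in (100, 10, 5):
    gigabytes = daily_trace_bytes(1_000, 20, 500, percent) / 1e9
    print(f"{percent:>3}% sampling: {gigabytes:.1f} GB/day")
```

For this workload, 100% sampling produces 864 GB of span data per day, against 86.4 GB at 10% and 43.2 GB at 5%.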

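Q6. What is the difference between record_exception() and set_status(ERROR)?

record_exception() adds an exception event (including the stack trace) to the span, while set_status(ERROR) marks the span's status as an error. The two are usually used together.

Q7. Why use Semantic Conventions?

Standardized attribute names let you analyze traces consistently across backends from different vendors, such as Jaeger, Datadog, and Grafana.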

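Summary

OpenTelemetry has become established as the observability standard for distributed systems. Start quickly with automatic instrumentation, then add business context with manual instrumentation where it is needed. Placing the OTel Collector as an intermediate layer gives you flexible operations: sampling, filtering, and export to multiple backends.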