The Complete Guide to the Grafana LGTM Stack: Building Unified Observability with Loki + Grafana + Tempo + Mimir


1. What is the LGTM Stack?

LGTM is Grafana Labs' open-source observability stack:

| Component | Role | Alternatives |
| --- | --- | --- |
| Loki | Log aggregation / search | Elasticsearch, Splunk |
| Grafana | Visualization / dashboards | Kibana, Datadog |
| Tempo | Distributed tracing | Jaeger, Zipkin |
| Mimir | Metrics storage / querying | Thanos, Cortex |
graph TB
    subgraph "Applications"
        App1[Service A]
        App2[Service B]
        App3[Service C]
    end

    subgraph "Collection"
        OTel[OpenTelemetry Collector]
        Alloy[Grafana Alloy]
    end

    subgraph "LGTM Stack"
        Mimir[Mimir<br/>Metrics]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Grafana[Grafana<br/>Dashboard]
    end

    App1 & App2 & App3 -->|OTLP| OTel
    App1 & App2 & App3 -->|logs| Alloy

    OTel -->|metrics| Mimir
    OTel -->|traces| Tempo
    Alloy -->|logs| Loki
    OTel -->|logs| Loki

    Grafana --> Mimir & Loki & Tempo

    style Grafana fill:#ff9,stroke:#333
    style Mimir fill:#f96,stroke:#333
    style Loki fill:#6f9,stroke:#333
    style Tempo fill:#69f,stroke:#333

2. Building the LGTM Stack with Docker Compose

2.1 Directory Layout

lgtm-stack/
├── docker-compose.yaml
├── config/
│   ├── mimir.yaml
│   ├── loki.yaml
│   ├── tempo.yaml
│   ├── grafana/
│   │   └── datasources.yaml
│   └── otel-collector.yaml
└── data/
    ├── mimir/
    ├── loki/
    └── tempo/

2.2 docker-compose.yaml

version: '3.8'

services:
  # === Mimir (Metrics) ===
  mimir:
    image: grafana/mimir:2.14.0
    command: ['-config.file=/etc/mimir.yaml']
    volumes:
      - ./config/mimir.yaml:/etc/mimir.yaml
      - ./data/mimir:/data
    ports:
      - '9009:9009'

  # === Loki (Logs) ===
  loki:
    image: grafana/loki:3.3.0
    command: ['-config.file=/etc/loki.yaml']
    volumes:
      - ./config/loki.yaml:/etc/loki.yaml
      - ./data/loki:/loki
    ports:
      - '3100:3100'

  # === Tempo (Traces) ===
  tempo:
    image: grafana/tempo:2.6.0
    command: ['-config.file=/etc/tempo.yaml']
    volumes:
      - ./config/tempo.yaml:/etc/tempo.yaml
      - ./data/tempo:/var/tempo
    ports:
      - '3200:3200' # Tempo API
      - '4317:4317' # OTLP gRPC (via Tempo)

  # === OpenTelemetry Collector ===
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.112.0
    command: ['--config=/etc/otel-collector.yaml']
    volumes:
      - ./config/otel-collector.yaml:/etc/otel-collector.yaml
    ports:
      - '4318:4318' # OTLP HTTP
      - '8889:8889' # Prometheus exporter

  # === Grafana ===
  grafana:
    image: grafana/grafana:11.4.0
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./config/grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    ports:
      - '3000:3000'
    depends_on:
      - mimir
      - loki
      - tempo
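
Once the configs below are in place and `docker compose up -d` succeeds, it is worth a quick smoke test that every container answers. A minimal sketch using only the Python standard library; the ports are the ones published in the compose file above, and /ready and /api/health are the components' readiness endpoints:

# healthcheck.py
import urllib.request

ENDPOINTS = {
    "Mimir":   "http://localhost:9009/ready",
    "Loki":    "http://localhost:3100/ready",
    "Tempo":   "http://localhost:3200/ready",
    "Grafana": "http://localhost:3000/api/health",
}

for name, url in ENDPOINTS.items():
    try:
        print(f"{name}: HTTP {urllib.request.urlopen(url, timeout=3).status}")
    except Exception as exc:  # connection refused, 503 while starting up, etc.
        print(f"{name}: {exc}")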

2.3 Mimir Configuration

# config/mimir.yaml
multitenancy_enabled: false

blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: /data/tsdb-sync
  filesystem:
    dir: /data/tsdb

compactor:
  data_dir: /data/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 1

server:
  http_listen_port: 9009

store_gateway:
  sharding_ring:
    replication_factor: 1
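
With Mimir running you can query its Prometheus-compatible API directly. A small sketch, assuming the 9009 port mapping above (`up` is just a placeholder PromQL expression):

# query_mimir.py
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"query": "up"})
url = f"http://localhost:9009/prometheus/api/v1/query?{params}"
resp = json.loads(urllib.request.urlopen(url).read())
print(resp["status"], resp["data"]["result"])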

2.4 Loki Configuration

# config/loki.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  allow_structured_metadata: true
  volume_enabled: true
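
To verify Loki accepts writes without going through the collector, you can hit its native push API. A minimal sketch; timestamps are Unix nanoseconds as strings, and the service_name/level labels are arbitrary examples:

# push_to_loki.py
import json
import time
import urllib.request

payload = {"streams": [{
    "stream": {"service_name": "smoke-test", "level": "info"},
    "values": [[str(time.time_ns()), "hello loki"]],
}]}
req = urllib.request.Request(
    "http://localhost:3100/loki/api/v1/push",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).status)  # 204 on success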

2.5 Tempo Configuration

# config/tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: '0.0.0.0:4317'

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true

# The generator only runs the processors enabled per tenant; without this
# override it stays idle (nested overrides format, Tempo 2.3+).
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

2.6 OTel Collector Configuration

# config/otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  resource:
    attributes:
      - key: service.instance.id
        from_attribute: host.name
        action: insert

exporters:
  otlphttp/mimir:
    endpoint: http://mimir:9009/otlp

  otlphttp/loki:
    endpoint: http://loki:3100/otlp

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  debug:
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/tempo]
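
To exercise the pipeline end to end, you can post a single OTLP/JSON log record to the collector's HTTP receiver and then look for it in Loki. A sketch assuming the 4318 port mapping above; the payload follows the OTLP/JSON encoding, and smoke-test is an arbitrary service name:

# send_test_log.py
import json
import time
import urllib.request

payload = {"resourceLogs": [{
    "resource": {"attributes": [
        {"key": "service.name", "value": {"stringValue": "smoke-test"}},
    ]},
    "scopeLogs": [{"logRecords": [{
        "timeUnixNano": str(time.time_ns()),
        "severityText": "INFO",
        "body": {"stringValue": "hello from the LGTM stack"},
    }]}],
}]}
req = urllib.request.Request(
    "http://localhost:4318/v1/logs",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).status)  # expect 200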

2.7 Auto-provisioning Grafana Data Sources

# config/grafana/datasources.yaml
apiVersion: 1

datasources:
  - name: Mimir
    uid: mimir # explicit uids so the datasourceUid cross-references below resolve
    type: prometheus
    access: proxy
    url: http://mimir:9009/prometheus
    isDefault: true

  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceId":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}' # $$ escapes Grafana's env-var interpolation

  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: mimir
      serviceMap:
        datasourceUid: mimir
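
A quick way to confirm provisioning worked is to list the data sources through Grafana's HTTP API. A sketch assuming the admin/admin credentials set in docker-compose.yaml:

# check_datasources.py
import base64
import json
import urllib.request

auth = base64.b64encode(b"admin:admin").decode()
req = urllib.request.Request(
    "http://localhost:3000/api/datasources",
    headers={"Authorization": f"Basic {auth}"},
)
for ds in json.loads(urllib.request.urlopen(req).read()):
    print(ds["name"], ds["type"], ds["url"])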

3. Application Instrumentation

3.1 Python (FastAPI + OpenTelemetry)

# app.py
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.trace import format_trace_id
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
import logging

# Resource definition
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

# Traces setup
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Metrics setup
metrics.set_meter_provider(MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317")
    )]
))

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Custom metrics
order_counter = meter.create_counter("orders.created", description="Orders created")
order_duration = meter.create_histogram("orders.duration_ms", description="Order processing time")

@app.post("/orders")
async def create_order(order: dict):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", order["customer_id"])

        # Business logic (process_order stands in for your own implementation)
        result = process_order(order)

        order_counter.add(1, {"status": "success"})
        logging.info("Order created", extra={
            "order_id": result["id"],
            # hex-encode so the value matches Grafana's derivedFields regex
            "traceId": format_trace_id(span.get_span_context().trace_id),
        })
        return result
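
To exercise the instrumented endpoint, run the app (for example with `uvicorn app:app --port 8000`) and post an order. A sketch; the port and customer_id are arbitrary, and the request will flow through the code above once process_order is filled in:

# call_order_service.py
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/orders",
    data=json.dumps({"customer_id": "c-42"}).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())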

4. Correlating Signals in Grafana

4.1 From Logs to Traces

sequenceDiagram
    participant Dev as Developer
    participant G as Grafana
    participant L as Loki
    participant T as Tempo
    participant M as Mimir

    Dev->>G: Search error logs
    G->>L: LogQL query
    L-->>G: Log lines + traceId
    Dev->>G: Click the traceId
    G->>T: Fetch trace
    T-->>G: Full trace visualization
    Dev->>G: "Show related metrics"
    G->>M: PromQL query
    M-->>G: Metrics at that point in time

4.2 LogQL Examples

# Search for error logs
{service_name="order-service"} |= "error" | json | line_format "{{.message}}"

# Filter by a specific traceId
{service_name=~"order-service|payment-service"} | json | traceId="abc123"

# Log volume as a metric
sum by (level) (count_over_time({service_name="order-service"} | json [5m]))
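
The same queries can also be issued programmatically against Loki's HTTP API. A sketch running query_range over the last hour, with the endpoint and labels as configured above:

# logql_query.py
import json
import time
import urllib.parse
import urllib.request

now_ns = time.time_ns()
params = urllib.parse.urlencode({
    "query": '{service_name="order-service"} |= "error"',
    "start": str(now_ns - 3600 * 10**9),
    "end": str(now_ns),
})
url = f"http://localhost:3100/loki/api/v1/query_range?{params}"
resp = json.loads(urllib.request.urlopen(url).read())
for stream in resp["data"]["result"]:
    print(stream["stream"], f'{len(stream["values"])} lines')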

4.3 TraceQL Examples

# Traces slower than 500ms that returned a 5xx status
{ duration > 500ms && span.http.status_code >= 500 }

# Slow PostgreSQL queries from a specific service
{ resource.service.name = "order-service" && span.db.system = "postgresql" && duration > 100ms }
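
TraceQL searches can likewise be run against Tempo's search API. A sketch assuming Tempo 2.x's GET /api/search endpoint with a q parameter; response field names may vary slightly between versions:

# traceql_search.py
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"q": "{ duration > 500ms }"})
url = f"http://localhost:3200/api/search?{params}"
resp = json.loads(urllib.request.urlopen(url).read())
for t in resp.get("traces", []):
    print(t["traceID"], t.get("rootServiceName"), t.get("durationMs"))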

5. Deploying on Kubernetes

# Deploy the LGTM stack with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Mimir
helm install mimir grafana/mimir-distributed -n monitoring --create-namespace \
  --set mimir.structuredConfig.common.storage.backend=s3 \
  --set mimir.structuredConfig.common.storage.s3.endpoint=minio:9000

# Loki
helm install loki grafana/loki -n monitoring \
  --set loki.storage.type=s3

# Tempo
helm install tempo grafana/tempo-distributed -n monitoring

# Grafana
helm install grafana grafana/grafana -n monitoring \
  --set adminPassword=admin

6. Alerting Configuration

# Grafana Alert Rule (via provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: Service Health
    folder: Alerts
    interval: 1m
    rules:
      - uid: high_error_rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: mimir
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A # evaluate against query A above
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05] # error rate above 5%
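
This file belongs under Grafana's alerting provisioning directory (/etc/grafana/provisioning/alerting/), mounted the same way as the data source file. To confirm the rule loaded, you can list provisioned rules through the Alerting provisioning API; a sketch assuming the admin credentials above:

# list_alert_rules.py
import base64
import json
import urllib.request

auth = base64.b64encode(b"admin:admin").decode()
req = urllib.request.Request(
    "http://localhost:3000/api/v1/provisioning/alert-rules",
    headers={"Authorization": f"Basic {auth}"},
)
for rule in json.loads(urllib.request.urlopen(req).read()):
    print(rule["uid"], rule["title"])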

7. Quiz

Q1. What does each component of the LGTM stack do?

Loki: log aggregation/search. Grafana: visualization/dashboards. Tempo: distributed tracing. Mimir: long-term metric storage and querying (Prometheus-compatible).

Q2. How does Mimir relate to Prometheus?

Mimir is **long-term storage** for Prometheus. It is PromQL-compatible and adds multi-tenancy, horizontal scaling, and global querying. Prometheus forwards the metrics it collects to Mimir via remote_write.

Q3. What is the role of the OpenTelemetry Collector?

It is an intermediate agent that receives, processes, and exports the metrics, logs, and traces collected from applications. It is vendor-neutral and can route telemetry to a wide range of backends.

Q4. Why is Loki lighter-weight than Elasticsearch?

Loki does not index log contents; it indexes only labels. Instead of full-text search it uses label-based filtering plus a grep-style scan, which makes the index dramatically smaller.

Q5. What is the benefit of pivoting from Traces to Logs to Metrics?

From an error trace you can jump straight to the related logs and then inspect the metrics (CPU, memory, error rate) at the same point in time, which lets you identify the **root cause** quickly.

Q6. What is Tempo's metrics_generator?

It automatically derives **RED metrics (Rate, Errors, Duration)** from trace data and writes them to Mimir. That gives you a service performance dashboard from traces alone, without separate metrics instrumentation.