Grafana LGTM Stack Complete Guide: Building Unified Observability with Loki + Grafana + Tempo + Mimir

1. What Is the LGTM Stack

LGTM is an open-source observability stack from Grafana Labs:

| Component | Role | Alternatives |
| --- | --- | --- |
| Loki | Log collection/search | Elasticsearch, Splunk |
| Grafana | Visualization/dashboards | Kibana, Datadog |
| Tempo | Distributed tracing | Jaeger, Zipkin |
| Mimir | Metric storage/query | Thanos, Cortex |

graph TB
    subgraph "Applications"
        App1[Service A]
        App2[Service B]
        App3[Service C]
    end

    subgraph "Collection"
        OTel[OpenTelemetry Collector]
        Alloy[Grafana Alloy]
    end

    subgraph "LGTM Stack"
        Mimir[Mimir<br/>Metrics]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Grafana[Grafana<br/>Dashboard]
    end

    App1 & App2 & App3 -->|OTLP| OTel
    App1 & App2 & App3 -->|logs| Alloy

    OTel -->|metrics| Mimir
    OTel -->|traces| Tempo
    Alloy -->|logs| Loki
    OTel -->|logs| Loki

    Grafana --> Mimir & Loki & Tempo

    style Grafana fill:#ff9,stroke:#333
    style Mimir fill:#f96,stroke:#333
    style Loki fill:#6f9,stroke:#333
    style Tempo fill:#69f,stroke:#333

2. Building LGTM with Docker Compose

2.1 Directory Structure

lgtm-stack/
├── docker-compose.yaml
├── config/
│   ├── mimir.yaml
│   ├── loki.yaml
│   ├── tempo.yaml
│   ├── grafana/
│   │   └── datasources.yaml
│   └── otel-collector.yaml
└── data/
    ├── mimir/
    ├── loki/
    └── tempo/

2.2 docker-compose.yaml

version: '3.8'

services:
  # === Mimir (Metrics) ===
  mimir:
    image: grafana/mimir:2.14.0
    command: ['-config.file=/etc/mimir.yaml']
    volumes:
      - ./config/mimir.yaml:/etc/mimir.yaml
      - ./data/mimir:/data
    ports:
      - '9009:9009'

  # === Loki (Logs) ===
  loki:
    image: grafana/loki:3.3.0
    command: ['-config.file=/etc/loki.yaml']
    volumes:
      - ./config/loki.yaml:/etc/loki.yaml
      - ./data/loki:/loki
    ports:
      - '3100:3100'

  # === Tempo (Traces) ===
  tempo:
    image: grafana/tempo:2.6.0
    command: ['-config.file=/etc/tempo.yaml']
    volumes:
      - ./config/tempo.yaml:/etc/tempo.yaml
      - ./data/tempo:/var/tempo
    ports:
      - '3200:3200' # Tempo API
      - '4317:4317' # OTLP gRPC (via Tempo)

  # === OpenTelemetry Collector ===
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.112.0
    command: ['--config=/etc/otel-collector.yaml']
    volumes:
      - ./config/otel-collector.yaml:/etc/otel-collector.yaml
    ports:
      - '4318:4318' # OTLP HTTP
      - '8889:8889' # Prometheus exporter

  # === Grafana ===
  grafana:
    image: grafana/grafana:11.4.0
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./config/grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    ports:
      - '3000:3000'
    depends_on:
      - mimir
      - loki
      - tempo
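
Once `docker compose up -d` has run, it helps to confirm that every backend is actually answering before wiring up applications. The following is a minimal sketch, assuming the host ports from the compose file above and each service's usual readiness path (`/ready` for Mimir, Loki, and Tempo, `/api/health` for Grafana). The helper names `is_ready` and `check_stack` are illustrative, not part of any tool.

```python
import urllib.error
import urllib.request

# Host ports come from docker-compose.yaml above; the paths are assumed to be
# each service's readiness/health endpoint.
READY_URLS = {
    "mimir": "http://localhost:9009/ready",
    "loki": "http://localhost:3100/ready",
    "tempo": "http://localhost:3200/ready",
    "grafana": "http://localhost:3000/api/health",
}


def is_ready(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response raised as an error
        return False


def check_stack(urls=READY_URLS, probe=is_ready):
    """Map each service name to a readiness boolean."""
    return {name: probe(url) for name, url in urls.items()}
```

Calling `check_stack()` from a Python shell on the Docker host returns a dict such as `{"mimir": True, ...}`. Mimir and Tempo can take some seconds after startup before `/ready` returns 200, so a `False` right after `up` is not necessarily a failure.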

2.3 Mimir Configuration

# config/mimir.yaml
multitenancy_enabled: false

blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: /data/tsdb-sync
  filesystem:
    dir: /data/tsdb

compactor:
  data_dir: /data/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 1

server:
  http_listen_port: 9009

store_gateway:
  sharding_ring:
    replication_factor: 1

2.4 Loki Configuration

# config/loki.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  allow_structured_metadata: true
  volume_enabled: true

2.5 Tempo Configuration

# config/tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: '0.0.0.0:4317'

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

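metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true

# The generator produces nothing until its processors are enabled
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]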

2.6 OTel Collector Configuration

# config/otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  resource:
    attributes:
      - key: service.instance.id
        from_attribute: host.name
        action: insert

exporters:
  otlphttp/mimir:
    endpoint: http://mimir:9009/otlp

  otlphttp/loki:
    endpoint: http://loki:3100/otlp

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  debug:
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/tempo]
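
To check the traces pipeline end to end without instrumenting an application, one option is to post a single hand-built span to the Collector's OTLP/HTTP receiver. The sketch below assumes the Collector is reachable on `localhost:4318` (the port published in the compose file) and uses the OTLP JSON encoding. The service and span names are placeholders.

```python
import json
import secrets
import time
import urllib.request


def build_test_span(service_name="smoke-test", span_name="smoke-span"):
    """Build a minimal OTLP/JSON trace payload containing one 50 ms span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 50_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }]
    }


def send(payload, url="http://localhost:4318/v1/traces"):
    """POST the payload to the Collector and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Running `send(build_test_span())` and then searching Tempo for the `smoke-test` service should show the span if the `traces` pipeline is wired correctly.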

2.7 Grafana Data Source Auto-Configuration

# config/grafana/datasources.yaml
apiVersion: 1

datasources:
  # Explicit uids are required so that the datasourceUid references below
  # (and in alert rules) resolve; otherwise Grafana generates random uids.
  - name: Mimir
    type: prometheus
    uid: mimir
    access: proxy
    url: http://mimir:9009/prometheus
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceId":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: mimir
      serviceMap:
        datasourceUid: mimir
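
The `matcherRegex` on the Loki data source only produces a trace link when the log line contains the trace ID in exactly the expected shape. The sketch below replays that pattern with Python's `re` module (Grafana evaluates it as a JavaScript regex, but this pattern behaves the same in both) against three hypothetical log lines.

```python
import re

# Same pattern as matcherRegex in datasources.yaml
TRACE_ID_PATTERN = re.compile(r'"traceId":"(\w+)"')


def extract_trace_id(log_line):
    """Return the captured trace ID, or None if the line does not match."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Compact JSON with a string trace ID: matches
json_line = '{"level":"info","message":"Order created","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}'
# A space after the colon (the default of many JSON encoders): no match
spaced_line = '{"traceId": "4bf92f3577b34da6a3ce929d0e0e4736"}'
# Trace ID logged as a bare integer: no match
numeric_line = '{"traceId":123456789}'

for line in (json_line, spaced_line, numeric_line):
    print(extract_trace_id(line))
```

If the application's logger emits `"traceId": "..."` with a space, either switch it to compact separators or loosen the pattern, for example to `"traceId":\s*"(\w+)"`.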

3. Application Instrumentation

3.1 Python (FastAPI + OpenTelemetry)

# app.py
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
import logging

# Resource definition
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

# Traces setup
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Metrics setup
metrics.set_meter_provider(MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317")
    )]
))

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Custom metrics
order_counter = meter.create_counter("orders.created", description="Orders created")
order_duration = meter.create_histogram("orders.duration_ms", description="Order processing time")

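def process_order(order: dict) -> dict:
    # Hypothetical placeholder for the real business logic
    return {"id": "order-1", **order}

@app.post("/orders")
async def create_order(order: dict):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", order["customer_id"])

        # Business logic
        result = process_order(order)

        order_counter.add(1, {"status": "success"})
        logging.info("Order created", extra={
            "order_id": result["id"],
            # trace_id is an int; Tempo and Grafana expect 32-char lowercase hex
            "traceId": format(span.get_span_context().trace_id, "032x"),
        })
        return result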
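
Passing `traceId` through `extra=` only attaches it to the log record; it reaches Loki only if the formatter writes it out. The derived field in section 2.7 looks for `"traceId":"<hex>"`, so one possible approach is a small JSON formatter that emits compact JSON and renders integer trace IDs as 32-character hex. This is a sketch using only the standard library. `JsonTraceFormatter` and `configure_logging` are illustrative names, and shipping the output to Loki (via Alloy or the Collector) is assumed to be handled elsewhere.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as compact JSON, adding traceId when present."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        trace_id = getattr(record, "traceId", None)
        if isinstance(trace_id, int):
            # OpenTelemetry exposes trace IDs as ints; convert to 32-char hex
            trace_id = format(trace_id, "032x")
        if trace_id:
            payload["traceId"] = trace_id
        # Compact separators so the output matches '"traceId":"..."' exactly
        return json.dumps(payload, separators=(",", ":"))


def configure_logging(stream=None):
    """Return a logger that writes JSON lines to the given stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger("order-service")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, `configure_logging().info("Order created", extra={"traceId": span.get_span_context().trace_id})` prints a line that the Loki derived field can turn into a Tempo link.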

4. Integration in Grafana

4.1 Logs to Traces Integration

sequenceDiagram
    participant Dev as Developer
    participant G as Grafana
    participant L as Loki
    participant T as Tempo
    participant M as Mimir

    Dev->>G: Search error logs
    G->>L: LogQL query
    L-->>G: Logs + traceId
    Dev->>G: Click traceId
    G->>T: Query trace
    T-->>G: Visualize full trace
    Dev->>G: "View related metrics"
    G->>M: PromQL query
    M-->>G: Metrics at that point in time

4.2 LogQL Examples

# Search error logs
{service_name="order-service"} |= "error" | json | line_format "{{.message}}"

# Filter by specific traceId
{service_name=~"order-service|payment-service"} | json | traceId="abc123"

# Log volume (metric-like)
sum by (level) (count_over_time({service_name="order-service"} | json [5m]))
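
The same queries can be run outside Grafana through Loki's HTTP API, which is handy for scripts and smoke tests. The sketch below builds a `query_range` URL, assuming Loki is reachable on `localhost:3100` as in the compose file. The function names are illustrative.

```python
import json
import urllib.request
from urllib.parse import urlencode


def loki_query_range_url(logql, start_ns, end_ns,
                         base="http://localhost:3100", limit=100):
    """Build a /loki/api/v1/query_range URL; timestamps are Unix nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


def run_query(url):
    """Fetch the URL and return the decoded JSON response."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

For example, `loki_query_range_url('{service_name="order-service"} |= "error"', start, end)` yields a URL whose `query` parameter is the URL-encoded LogQL expression.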

4.3 TraceQL Examples

# Spans slower than 500ms that also returned an HTTP 5xx status
{ duration > 500ms && span.http.status_code >= 500 }

# Slow DB queries for a specific service
{ resource.service.name = "order-service" && span.db.system = "postgresql" && duration > 100ms }
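
TraceQL can also be sent to Tempo's search endpoint directly. The following sketch only builds the request URL, assuming Tempo's API is on `localhost:3200` as published in the compose file. `tempo_search_url` is an illustrative helper name.

```python
from urllib.parse import urlencode


def tempo_search_url(traceql, base="http://localhost:3200", limit=20):
    """Build a Tempo /api/search URL for a TraceQL query."""
    return f"{base}/api/search?{urlencode({'q': traceql, 'limit': limit})}"


print(tempo_search_url('{ resource.service.name = "order-service" && duration > 100ms }'))
```

The resulting URL can be fetched with `curl` or `urllib.request`; the response is JSON describing the matching traces.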

5. Kubernetes Deployment

# Deploy LGTM stack with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Mimir (--create-namespace creates "monitoring" if it does not exist yet)
helm install mimir grafana/mimir-distributed -n monitoring --create-namespace \
  --set mimir.structuredConfig.common.storage.backend=s3 \
  --set mimir.structuredConfig.common.storage.s3.endpoint=minio:9000

# Loki
helm install loki grafana/loki -n monitoring \
  --set loki.storage.type=s3

# Tempo
helm install tempo grafana/tempo-distributed -n monitoring

# Grafana
helm install grafana grafana/grafana -n monitoring \
  --set adminPassword=admin

6. Alert Configuration

# Grafana Alert Rule (via provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: Service Health
    folder: Alerts
    interval: 1m
    rules:
      - uid: high_error_rate
        title: High Error Rate
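        condition: C
        for: 5m # How long the condition must hold before firing (example value)
        data:
          - refId: A
            datasourceUid: mimir
            relativeTimeRange:
              from: 600
              to: 0
            model:
              refId: A
              instant: true
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A # Apply the threshold to query A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05] # More than 5% errors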
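
The rule's logic is a ratio check: the per-second rate of 5xx responses divided by the rate of all responses, compared against 0.05. A plain-Python sketch of the same arithmetic follows. The zero-traffic handling here is a simplification (PromQL would yield NaN, which also does not exceed the threshold).

```python
def error_ratio(error_rate, total_rate):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if total_rate == 0:
        return 0.0
    return error_rate / total_rate


def should_fire(error_rate, total_rate, threshold=0.05):
    """Mirror the 'gt 0.05' evaluator: strictly greater than the threshold."""
    return error_ratio(error_rate, total_rate) > threshold


# 6 errors/s out of 100 requests/s is 6%, above the 5% threshold
print(should_fire(6, 100))
# Exactly 5% does not fire because the comparison is strict
print(should_fire(5, 100))
```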

7. Quiz

Q1. What are the roles of each component in the LGTM stack?

Loki: Log collection/search, Grafana: Visualization/dashboard, Tempo: Distributed tracing, Mimir: Long-term metric storage/query (Prometheus-compatible).

Q2. What is the relationship between Mimir and Prometheus?

Mimir serves as long-term storage for Prometheus. It is PromQL-compatible and supports multi-tenancy, horizontal scaling, and global queries. Prometheus sends collected metrics to Mimir via remote_write.

Q3. What is the role of the OpenTelemetry Collector?

It is an intermediary agent that receives, processes, and exports Metrics, Logs, and Traces collected from applications. It is vendor-neutral and can route data to various backends.

Q4. Why is Loki lighter than Elasticsearch?

Loki does not index log content; it indexes only labels. Instead of full-text search, it first narrows the candidate streams by label and then scans the matching chunks grep-style. This keeps the index dramatically smaller.
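
The idea can be illustrated with a toy model, which is not how Loki is implemented internally: the index holds one entry per unique label set (a stream), and line content is only scanned after streams have been selected by label.

```python
from collections import defaultdict


class TinyLabelIndex:
    """Toy log store: indexes label sets only, never the log text."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self._streams = defaultdict(list)

    def push(self, labels, line):
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Select streams by label, then filter their lines by substring."""
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # label match uses the index
                hits.extend(l for l in lines if contains in l)  # grep-style scan
        return hits

    def index_size(self):
        """Number of index entries, which is the number of streams, not lines."""
        return len(self._streams)


store = TinyLabelIndex()
store.push({"service_name": "order-service", "level": "error"}, "payment error: timeout")
store.push({"service_name": "order-service", "level": "error"}, "db error: deadlock")
store.push({"service_name": "order-service", "level": "info"}, "order created")
print(store.index_size())
print(store.query({"service_name": "order-service"}, contains="timeout"))
```

Three log lines produce only two index entries here, because the index grows with the number of distinct label sets rather than with log volume.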

Q5. What are the benefits of Traces to Logs to Metrics integration?

From an error trace, you can instantly view related logs and analyze metrics (CPU, memory, error rate) at that point in time, enabling rapid root cause identification.

Q6. What is Tempo's metrics_generator?

It automatically generates RED metrics (Rate, Error, Duration) from trace data and sends them to Mimir. This enables building service performance dashboards from traces alone without separate metric instrumentation.
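
As a rough illustration of what "RED metrics from traces" means, the sketch below aggregates a list of spans into rate, error ratio, and average duration per service. It is a simplified model: the real metrics generator emits Prometheus counters and histograms rather than pre-computed averages, and the `Span` type here is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    is_error: bool


def red_metrics(spans, window_seconds):
    """Aggregate spans into Rate, Error ratio, and average Duration per service."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.service].append(span)

    result = {}
    for service, items in grouped.items():
        count = len(items)
        result[service] = {
            "rate_per_s": count / window_seconds,
            "error_ratio": sum(s.is_error for s in items) / count,
            "avg_duration_ms": sum(s.duration_ms for s in items) / count,
        }
    return result


spans = [
    Span("order-service", 100.0, False),
    Span("order-service", 300.0, True),
    Span("payment-service", 50.0, False),
]
print(red_metrics(spans, window_seconds=60))
```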