OpenTelemetry Collector Pipeline Design Practical Guide — Receiver, Processor, Exporter

Introduction

In microservice environments, unified collection and processing of Traces, Metrics, and Logs is at the core of Observability. The OpenTelemetry Collector is a vendor-neutral telemetry pipeline that collects data from various sources and sends it to the desired backends.

In this article, we explore the OTel Collector architecture and cover pipeline design for production environments.

OTel Collector Architecture

Pipeline Structure

# Data Flow
# Receiver → Processor → Exporter
#
# Receiver: Data collection (OTLP, Jaeger, Prometheus, Fluentd, etc.)
# Processor: Data processing (filtering, transformation, batching, sampling)
# Exporter: Data transmission (OTLP, Jaeger, Prometheus, Loki, etc.)
#
# Multiple pipelines can be configured in a single Collector:
# - traces pipeline
# - metrics pipeline
# - logs pipeline

Collector Deployment Patterns

# Pattern 1: Agent (Sidecar/DaemonSet)
# Deployed on each node/Pod, collects locally

# Pattern 2: Gateway (Centralized)
# Deployed as an independent service in the cluster, centralizes traffic

# Pattern 3: Agent + Gateway (Recommended)
# Agent collects locally → Gateway handles central processing/routing

Installation

Kubernetes (Helm)

# Add the OpenTelemetry Helm chart repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Install the Collector (DaemonSet mode is selected by "mode: daemonset" in collector-values.yaml)
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability \
  --create-namespace \
  --values collector-values.yaml
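
The command above refers to a collector-values.yaml file that is not shown. A minimal sketch of what it might contain follows, assuming a recent version of the opentelemetry-collector chart; the memory figure is an illustrative assumption, and key names should be checked against the chart version in use.

```yaml
# collector-values.yaml (illustrative sketch, not a complete values file)

# Run one Collector Pod per node
mode: daemonset

image:
  # The contrib distribution bundles the receivers/processors used in this article
  repository: otel/opentelemetry-collector-contrib

resources:
  limits:
    memory: 2Gi   # assumed value; size it together with memory_limiter

# Collector configuration placed under the top-level "config:" key is merged
# with the chart's default configuration.
```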

Docker

docker run -d --name otel-collector \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 8888:8888 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:0.96.0

Pipeline Configuration

Basic Configuration Structure

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
    send_batch_max_size: 1500

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

Production Pipeline

# production-config.yaml
receivers:
  # OTLP (sent from application SDKs)
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ['*'] # Wildcard for illustration; restrict to known origins in production

  # Prometheus scraping
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

  # Host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
      load: {}

  # Kubernetes events
  k8s_events:
    namespaces: [default, production]

processors:
  # Batching
  batch:
    timeout: 5s
    send_batch_size: 1000

  # Memory limiting
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512

  # Resource information enrichment
  resourcedetection:
    # AWS has no single "aws" detector; use ec2, ecs, eks, etc. as appropriate
    detectors: [env, system, docker, gcp, ec2, azure]
    timeout: 5s

  # K8s metadata enrichment
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name

  # Remove unnecessary attributes
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash # Hash sensitive queries

  # Tail sampling (traces only)
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Filtering
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - 'go_.*'
          - 'process_.*'

exporters:
  # Traces → Tempo
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: true

  # Metrics → Prometheus/Mimir
  prometheusremotewrite:
    endpoint: http://mimir.observability.svc:9009/api/v1/push
    tls:
      insecure: true
    resource_to_telemetry_conversion:
      enabled: true

  # Logs → Loki
  loki:
    endpoint: http://loki.observability.svc:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: true
      job: true

  # Debug (for troubleshooting)
  debug:
    verbosity: basic

extensions:
  # Health check
  health_check:
    endpoint: 0.0.0.0:13133

  # zPages (live diagnostic pages)
  zpages:
    endpoint: 0.0.0.0:55679

  # pprof (profiling)
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, zpages, pprof]

  pipelines:
    traces:
      receivers: [otlp]
      processors:
        [memory_limiter, resourcedetection, k8sattributes, attributes, tail_sampling, batch]
      exporters: [otlp/tempo]

    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, resourcedetection, k8sattributes, filter, batch]
      exporters: [prometheusremotewrite]

    logs:
      receivers: [otlp, k8s_events]
      processors: [memory_limiter, resourcedetection, k8sattributes, attributes, batch]
      exporters: [loki]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
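
With the memory_limiter values in the configuration above, the soft limit works out to 1500 - 512 = 988 MiB. Above the soft limit the processor starts refusing data, and above the 1500 MiB hard limit it also forces garbage collection. When the Collector runs in a container whose memory limit may change, percentage-based limits can be easier to keep in sync. The sketch below assumes that situation; the percentages are illustrative, not recommendations.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Percentages are relative to the total memory available to the process
    limit_percentage: 80        # hard limit
    spike_limit_percentage: 25  # soft limit = 80% - 25% = 55%
```

Set either the MiB pair or the percentage pair, not both.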

Receiver Details

OTLP Receiver

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4
        keepalive:
          server_parameters:
            max_connection_idle: 11s
            max_connection_age: 30s
      http:
        endpoint: 0.0.0.0:4318

Filelog Receiver (Log File Collection)

receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/otel-collector*/*.log
    start_at: beginning
    include_file_path: true
    operators:
      - type: router
        routes:
          - output: parse_json
            expr: 'body matches "^\\{"'
          - output: parse_plain
            expr: 'body matches "^[^{]"'
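      - id: parse_json
        type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%fZ'
        # Operators chain to the next entry by default, so skip parse_plain explicitly
        output: done
      - id: parse_plain
        type: regex_parser
        regex: '^(?P<timestamp>\S+) (?P<level>\S+) (?P<message>.*)'
      # Terminal operator where both branches rejoin
      - id: done
        type: noop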
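
With start_at: beginning, a restarted Collector re-reads files from the start unless read offsets are persisted. One way to handle this is the file_storage extension, sketched below. The directory is a hypothetical path that must exist and be writable (on Kubernetes, typically a hostPath volume), and the pipeline wiring is omitted.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # hypothetical path

receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    # Persist read offsets in the extension declared above
    storage: file_storage

service:
  extensions: [file_storage]
```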

Processor Details

Tail Sampling (Critical!)

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Collect 100% of errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Collect 100% of slow requests over 1 second
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000

      # Collect 100% for specific services
      - name: critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service, auth-service]

      # Collect only 5% of the rest
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

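      # Composite policy: sub-policies are defined inline under composite_sub_policy
      # and share one span-rate budget. A trace is kept if any top-level policy
      # samples it, so use this instead of (not alongside) the standalone policies above.
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [composite-errors, composite-slow, composite-critical, composite-rest]
          composite_sub_policy:
            - name: composite-errors
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: composite-slow
              type: latency
              latency:
                threshold_ms: 1000
            - name: composite-critical
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service, auth-service]
            - name: composite-rest
              type: probabilistic
              probabilistic:
                sampling_percentage: 5
          rate_allocation:
            - policy: composite-errors
              percent: 30
            - policy: composite-slow
              percent: 30
            - policy: composite-critical
              percent: 20
            - policy: composite-rest
              percent: 20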
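
Top-level policies are evaluated independently, so they cannot express "slow AND from a specific service". The and policy type covers that case. A minimal sketch, reusing the payment-service name from above; the 500 ms threshold and the policy names are illustrative assumptions.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep a trace only when both sub-policies match
      - name: slow-payment-traces
        type: and
        and:
          and_sub_policy:
            - name: payment-only
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service]
            - name: slower-than-500ms
              type: latency
              latency:
                threshold_ms: 500
```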

Transform Processor

processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Add attribute
          - set(attributes["deployment.environment"], "production")
          # Transform attribute
          - replace_pattern(attributes["http.url"], "password=\\w+", "password=***")
          # Conditional processing
          - set(attributes["error.category"], "timeout") where attributes["error.type"] == "DeadlineExceeded"

    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["env"], "prod")

    log_statements:
      - context: log
        statements:
          # Extract user_id from the log body; ExtractPatterns returns a map, so merge it into attributes
          - merge_maps(attributes, ExtractPatterns(body, "user_id=(?P<user_id>\\w+)"), "upsert")
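
Declaring a processor under processors: does not activate it; it only runs once it is listed in a pipeline. The sketch below shows one way to wire the transform processor into a logs pipeline and sets error_mode so that a failing statement is logged and skipped instead of failing the batch. It assumes the otlp receiver, memory_limiter and batch processors, and debug exporter are defined as in the earlier examples.

```yaml
processors:
  transform:
    # Log statement errors and continue with the next statement
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(attributes["env"], "prod")

service:
  pipelines:
    logs:
      receivers: [otlp]
      # transform runs after memory_limiter and before batch
      processors: [memory_limiter, transform, batch]
      exporters: [debug]
```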

Kubernetes Deployment

Agent (DaemonSet) + Gateway Pattern

# agent-config.yaml (DaemonSet)
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
  namespace: observability
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          memory: {}

    processors:
      memory_limiter:
        check_interval: 1s # Required; must be greater than zero
        limit_mib: 512
      batch:
        timeout: 5s

    exporters:
      # Send to Gateway
      otlp:
        # The Operator names the Service <name>-collector
        endpoint: otel-gateway-collector.observability.svc:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp, hostmetrics]
          processors: [memory_limiter, batch]
          exporters: [otlp]
---
# gateway-config.yaml (Deployment)
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: observability
spec:
  mode: deployment
  replicas: 3
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      memory_limiter:
        check_interval: 1s # Required; must be greater than zero
        limit_mib: 2048
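      # With replicas > 1, all spans of a trace must reach the same gateway
      # instance (for example via the loadbalancing exporter on the agents)
      tail_sampling:
        decision_wait: 10s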
        policies:
          - name: errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: probabilistic
            type: probabilistic
            probabilistic:
              sampling_percentage: 10
      batch:
        timeout: 10s
        send_batch_size: 5000

    exporters:
      otlp/tempo:
        endpoint: tempo.observability.svc:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://mimir.observability.svc:9009/api/v1/push
      loki:
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
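
Tail sampling on a gateway with several replicas only works if every span of a trace reaches the same replica. A common approach is to replace the agent's plain otlp exporter for traces with the loadbalancing exporter, which routes by trace ID. The sketch below is an agent-side fragment under two assumptions: the Operator exposes the gateway through a headless Service named otel-gateway-collector-headless, and the cluster domain is cluster.local.

```yaml
exporters:
  loadbalancing:
    # Send all spans with the same trace ID to the same gateway Pod
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Assumed headless Service name; resolves to the individual gateway Pod IPs
        hostname: otel-gateway-collector-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```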

Troubleshooting

Checking Self Metrics

# Collector self metrics (port 8888)
curl http://localhost:8888/metrics | grep otelcol

# Key metrics:
# otelcol_receiver_accepted_spans - Number of accepted spans
# otelcol_receiver_refused_spans - Number of refused spans
# otelcol_processor_dropped_spans - Number of dropped spans
# otelcol_exporter_sent_spans - Number of sent spans
# otelcol_exporter_send_failed_spans - Number of failed spans
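
If otelcol_exporter_send_failed_spans keeps growing, the exporter's retry and queue settings are a reasonable place to look. A minimal sketch for the Tempo exporter used earlier; the retry values shown match the exporter's usual defaults, and queue_size is an illustrative assumption that should be sized against available memory.

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait before the first retry
      max_interval: 30s       # upper bound for the backoff
      max_elapsed_time: 300s  # give up on a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000        # batches buffered while the backend is slow
```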

Debugging with zPages

# http://localhost:55679/debug/tracez — View recent traces
# http://localhost:55679/debug/pipelinez — View pipeline status

Conclusion

Key points for OpenTelemetry Collector pipeline design:

  1. Agent + Gateway pattern: Efficient operations with local collection + centralized processing
  2. Tail Sampling: Cost reduction with 100% collection for errors/slow requests, probabilistic sampling for the rest
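  3. Memory Limiter is essential: Put memory_limiter first in every pipeline so the Collector refuses data before it runs out of memory
  4. Processor order matters: Recommended order is memory_limiter first, then enrichment and sampling, with batch last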
  5. Vendor neutral: Only swap the Exporter when changing backends
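
The last point can be sketched as a migration step: a second exporter is added next to the existing one, the pipeline fans out to both, and the old exporter is removed once the new backend is verified. The otlphttp/newbackend name and its endpoint are hypothetical placeholders.

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: true
  # Hypothetical second backend reachable over OTLP/HTTP
  otlphttp/newbackend:
    endpoint: https://otlp.example.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Every exporter listed here receives a copy of the data
      exporters: [otlp/tempo, otlphttp/newbackend]
```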

Quiz (6 Questions)

Q1. What are the three components of an OTel Collector pipeline? Receiver, Processor, Exporter

Q2. What are the roles of Agent and Gateway in the Agent + Gateway pattern? Agent: Local collection on each node, Gateway: Central processing/routing/transmission

Q3. Why is Tail Sampling better than Head Sampling? It makes the sampling decision after a trace's spans have arrived (within decision_wait), so error and slow traces can be kept deliberately rather than by chance

Q4. Why should the memory_limiter Processor be placed first in the pipeline? To check memory first and prevent OOM when receiving large volumes of data

Q5. How do you check the Collector's self metrics? Via the /metrics endpoint on port 8888; zPages (port 55679) complements this with live diagnostic pages

Q6. What is the relationship between the batch Processor's timeout and send_batch_size? The batch is sent when either the timeout expires or the send_batch_size is reached (whichever comes first)