Observability Telemetry Pipeline Cost Optimization: Sampling, Filtering, and Tiering Strategies


Introduction

Observability costs have risen to one of the top line items in cloud infrastructure spending. The global observability market surpassed $28.5 billion in 2025 and is projected to reach $34.1 billion by the end of 2026. As microservice architectures proliferate, traces account for 60–70% of total observability costs, while logs consume another 20–30%. When hundreds of services generate millions of spans and log entries per second, storage and indexing costs grow far faster than linearly.

However, simply reducing data is not the answer. The key is to separate signal from noise — cutting costs while preserving the quality of observability. This article covers telemetry pipeline cost optimization strategies centered on the OpenTelemetry Collector, with practical configuration examples and actionable checklists.

Cost Structure Analysis

Cost Breakdown by Signal Type

| Signal | Cost Share | Primary Cost Drivers | Optimization Difficulty |
|---|---|---|---|
| Traces | 60–70% | High cardinality, large payloads, span explosion | High |
| Logs | 20–30% | Unstructured data, full-text indexing, high volume | Medium |
| Metrics | 5–15% | Time-series cardinality explosion, label combinations | Medium |
| Profiles | 1–5% | CPU/memory profile data size | Low |

Where Costs Occur in the Pipeline

[Generation] --> [Collection] --> [Processing] --> [Indexing] --> [Storage] --> [Query]
     |               |               |               |             |            |
    SDK          Network         Collector        Backend       Disk/S3     Compute
  overhead       bandwidth       CPU/Memory       Index I/O    storage      query cost

The core optimization principle: eliminate unnecessary data as early in the pipeline as possible. Data removed at the generation stage saves costs at every subsequent stage, while data removed at the storage stage means you have already paid for collection, processing, and indexing.

Sampling Strategies

Head Sampling vs Tail Sampling

| Aspect | Head Sampling | Tail Sampling |
|---|---|---|
| Decision Point | Trace start (SDK level) | After trace completion (Collector) |
| Cost Reduction | High (saves network bandwidth too) | Medium |
| Data Quality | Low (may miss error traces) | High (100% retention of errors/high latency) |
| Complexity | Low | High (requires memory, routing) |
| Best For | Non-critical high-traffic services | Production core services |
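For head sampling, most OpenTelemetry SDKs honor the standard sampler environment variables, so a baseline ratio can be set per service without code changes. A minimal sketch (the 10% ratio is illustrative):

```shell
# Head sampling via the OTel SDK's standard environment variables.
# parentbased_traceidratio respects the parent span's decision,
# so individual traces are kept or dropped as a whole.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.10   # keep ~10% of root traces
echo "sampler=${OTEL_TRACES_SAMPLER} ratio=${OTEL_TRACES_SAMPLER_ARG}"
```

Because the decision happens at the SDK, the dropped spans never leave the process, which is what saves network bandwidth as well as backend cost.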

Tail Sampling Configuration

# OpenTelemetry Collector - Tail Sampling
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Retain 100% of error traces
      - name: error-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR

      # Retain 100% of high-latency traces (>500ms)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 500

      # Retain 100% of critical service traces
      - name: critical-service-policy
        type: string_attribute
        string_attribute:
          key: service.name
          values:
            - payment-service
            - auth-service

      # Sample only 5% of normal traffic
      - name: normal-traffic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

2-Tier Collector Architecture

For stable Tail Sampling in production, separate Agent Collectors from Gateway Collectors. Routing by trace ID between the two tiers ensures that all spans belonging to the same trace arrive at the same Gateway instance, which is required for correct sampling decisions.

[Services] --OTLP--> [Agent Collector (DaemonSet)]
                           |
                     Trace ID-based routing
                           |
                     [Gateway Collector] --> [Tempo/Jaeger]
                     (Tail Sampling runs here)

Use the loadbalancing exporter on Agent Collectors to route spans by trace ID:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317
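On the Gateway side, the tail_sampling processor then has to appear in the traces pipeline. A minimal sketch of the service section (the receiver and exporter names are assumptions):

```yaml
# Gateway Collector - pipeline wiring (sketch; exporter name is illustrative)
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
```

Keeping memory_limiter first and batch last is the conventional ordering: the limiter protects the Collector before any buffering happens, and batching compresses whatever survives sampling.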

Log Filtering Pipeline

Level-Based Filtering

Dropping DEBUG/TRACE level logs at the Collector typically reduces log volume by 30–50%.

processors:
  # Drop DEBUG/TRACE logs
  filter/drop-debug:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'

  # Drop health check logs
  filter/drop-healthcheck:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(body, ".*GET /health.*")'
        - 'IsMatch(body, ".*GET /readyz.*")'
        - 'IsMatch(body, ".*kube-probe.*")'

  # Remove unnecessary attributes to reduce payload
  transform/reduce-attributes:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "log.file.path")
          - delete_key(attributes, "log.iostream")
          - truncate_all(attributes, 256)
          - limit(attributes, 20)
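These filter and transform processors only take effect once they are placed in a logs pipeline ahead of the exporter. A sketch of the wiring (receiver and exporter names are assumptions):

```yaml
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors:
        - memory_limiter
        - filter/drop-debug
        - filter/drop-healthcheck
        - transform/reduce-attributes
        - batch
      exporters: [otlphttp]
```

Ordering matters for cost: dropping records before the transform step means attributes are never trimmed on records that would be discarded anyway.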

Measuring Filtering Effectiveness

# Measure filtering impact using Collector internal metrics
RECEIVED=$(curl -s http://localhost:8888/metrics | \
  grep '^otelcol_receiver_accepted_log_records' | \
  awk '{sum += $2} END {print sum}')

EXPORTED=$(curl -s http://localhost:8888/metrics | \
  grep '^otelcol_exporter_sent_log_records' | \
  awk '{sum += $2} END {print sum}')

if [ "$RECEIVED" -gt 0 ]; then
  DROP_RATE=$(echo "scale=2; (1 - $EXPORTED / $RECEIVED) * 100" | bc)
  echo "Received logs: $RECEIVED"
  echo "Exported logs: $EXPORTED"
  echo "Drop rate: ${DROP_RATE}%"
fi

Metric Cardinality Management

Cardinality explosion is the primary driver of unpredictable observability costs. Every unique time series requires a separate index entry, and series count is the product of label value counts: a single metric with 50 route labels, 5 status codes, and 100 pod names already yields 25,000 series. When millions of series are created, RAM and disk usage spike while ingest latency and query performance degrade.

Dangerous Patterns and Remedies

| Dangerous Pattern | Example | Remedy |
|---|---|---|
| User ID as label | user_id="u12345" | Remove label; move to logs/traces |
| Raw request path | path="/api/users/12345" | Normalize: path="/api/users/:id" |
| Pod name | pod="web-7f8c9-xk2m" | Use Deployment name only |
| Full error message | error="Connection refused: 10.0.1.42" | Classify by error code |

Collector-Side Cardinality Control

processors:
  transform/normalize-url:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          - replace_pattern(attributes["url.path"], "^/api/users/[^/]+", "/api/users/:id")
          - replace_pattern(attributes["url.path"], "^/api/orders/[^/]+", "/api/orders/:id")
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")

Storage Tiering Architecture

A Hot/Warm/Cold architecture automatically migrates data to cheaper storage tiers as it ages, dramatically reducing long-term retention costs.

| Tier | Retention | Storage | Query Speed | Cost |
|---|---|---|---|---|
| Hot | 0–7 days | SSD/NVMe | Fastest | High |
| Warm | 7–30 days | HDD/S3 Standard | Moderate | Medium |
| Cold | 30 days–1 year | S3 Glacier/IA | Slow | Low |
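When the Warm and Cold tiers sit on S3, the transitions in the table map directly onto a bucket lifecycle rule. A CloudFormation-style property sketch (rule ID and retention values are assumptions matching the table above):

```yaml
# S3 lifecycle rule mirroring the Hot/Warm/Cold tiers (sketch)
LifecycleConfiguration:
  Rules:
    - Id: telemetry-tiering
      Status: Enabled
      Transitions:
        - TransitionInDays: 7        # Hot -> Warm
          StorageClass: STANDARD_IA
        - TransitionInDays: 30       # Warm -> Cold
          StorageClass: GLACIER
      ExpirationInDays: 365          # delete after one year
```

S3 Intelligent-Tiering (mentioned in the checklist below) can replace hand-tuned transition days when access patterns are unpredictable.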

Cost Optimization Checklist

Phase 1: Quick Wins (1–2 Weeks)

  • Are DEBUG/TRACE level logs being dropped at the Collector in production?
  • Are health check and readiness probe logs/traces being filtered?
  • Do metric labels avoid high-cardinality values like user IDs or request IDs?
  • Are URL path labels normalized (e.g., /api/users/:id instead of /api/users/12345)?
  • Is the memory_limiter processor configured on all Collectors?
  • Are you monitoring Collector internal metrics (drop rates, queue sizes, memory usage)?

Phase 2: Mid-Term Optimization (2–4 Weeks)

  • Does Tail Sampling retain 100% of error/high-latency traces while limiting normal traffic to 5–10%?
  • Is a 2-Tier Collector architecture (Agent + Gateway) deployed?
  • Is trace ID-based routing (LoadBalancing Exporter) configured?
  • Are unnecessary log attributes being stripped?
  • Is an Observability Budget defined per service?
  • Is a cardinality monitoring dashboard in place?

Phase 3: Long-Term Infrastructure Optimization (1–3 Months)

  • Is Hot/Warm/Cold storage tiering configured?
  • Is S3 Intelligent-Tiering or equivalent auto-tiering enabled?
  • Are metric downsampling policies applied (e.g., 5-min resolution after 7 days, 1-hour after 30 days)?
  • Do ILM/retention policies meet regulatory requirements?
  • Is a cardinality validation gate included in the CI/CD pipeline?

Troubleshooting

Collector Memory Usage Keeps Growing

# 1. Check Collector memory usage
kubectl top pods -n observability -l app=otel-gateway

# 2. Profile memory with pprof
curl -s http://localhost:1777/debug/pprof/heap > heap.prof
go tool pprof -top heap.prof

# 3. Check Tail Sampling queue state
curl -s http://localhost:8888/metrics | grep tail_sampling
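If memory keeps growing, the most common fix is bounding the Collector directly with the memory_limiter processor, placed first in every pipeline. The limits below are illustrative and should be sized to the pod's memory request:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # soft limit; data is refused and GC forced above this
    spike_limit_mib: 300   # headroom reserved for short bursts
```

Also revisit tail_sampling's num_traces and decision_wait: both directly size the in-memory trace buffer, so lowering either shrinks the Gateway's steady-state footprint.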

Finding the Source of Cardinality Explosion

# Top metrics by series count in Prometheus
curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10]'

# Analyze label cardinality for a specific metric
curl -s 'http://prometheus:9090/api/v1/query?query=count(http_server_request_duration_seconds_bucket) by (service_name)' | \
  jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[0:10]'

Conclusion

Observability cost optimization is not about indiscriminately reducing data — it is an engineering discipline of precisely separating signal from noise, maintaining observability quality while maximizing cost efficiency.

Key strategies summarized:

  1. Sampling: Use Head Sampling to save network costs, and Tail Sampling on core services to retain 100% of error and high-latency traces.
  2. Filtering: Drop DEBUG/health check logs at the Collector and trim attributes to reduce payload sizes.
  3. Cardinality Management: Remove high-cardinality labels, normalize URL paths, and enforce per-service Observability Budgets.
  4. Storage Tiering: Deploy Hot/Warm/Cold architecture to automatically migrate aging data to lower-cost storage tiers.

By systematically applying these strategies, you can reduce overall observability costs by 60–80% while fully preserving the data needed for incident response.
