Chaos and Order

Introduction
Cost Structure Analysis
- Cost Breakdown by Signal Type
- Where Costs Occur in the Pipeline
Sampling Strategies
Log Filtering Pipeline
- Level-Based Filtering
- Measuring Filtering Effectiveness
Metric Cardinality Management
- Dangerous Patterns and Remedies
- Collector-Side Cardinality Control
Storage Tiering Architecture
Cost Optimization Checklist
Troubleshooting
- Collector Memory Usage Keeps Growing
- Finding the Source of Cardinality Explosion
Conclusion
References

Introduction

Observability costs have become one of the top line items in cloud infrastructure spending. The global observability market surpassed $28.5 billion in 2025 and is projected to reach $34.1 billion by the end of 2026. As microservice architectures proliferate, traces account for 60–70% of total observability costs, while logs consume another 20–30%. When hundreds of services generate millions of spans and log entries per second, storage and indexing costs grow faster than request volume, because every request fans out into many spans and log lines, each of which must be stored and indexed.

However, simply reducing data is not the answer. The key is to separate signal from noise — cutting costs while preserving the quality of observability. This article covers telemetry pipeline cost optimization strategies centered on the OpenTelemetry Collector, with practical configuration examples and actionable checklists.

Cost Structure Analysis

Cost Breakdown by Signal Type

Signal	Cost Share	Primary Cost Drivers	Optimization Difficulty
Traces	60–70%	High cardinality, large payloads, span explosion	High
Logs	20–30%	Unstructured data, full-text indexing, high volume	Medium
Metrics	5–15%	Time-series cardinality explosion, label combinations	Medium
Profiles	1–5%	CPU/memory profile data size	Low

Where Costs Occur in the Pipeline

[Generation] --> [Collection] --> [Processing] --> [Indexing] --> [Storage] --> [Query]
     |               |               |               |             |            |
    SDK          Network         Collector        Backend       Disk/S3     Compute
  overhead       bandwidth       CPU/Memory       Index I/O    storage      query cost

The core optimization principle: eliminate unnecessary data as early in the pipeline as possible. Data removed at the generation stage saves costs at every subsequent stage, while data removed at the storage stage means you have already paid for collection, processing, and indexing.
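
To make the principle concrete, the sketch below adds up what a piece of data has already cost by the time it is dropped at a given stage. The stage names follow the diagram above; the per-stage unit costs are made-up placeholders for illustration, not measurements.

```python
# Hypothetical relative cost per GB at each pipeline stage (illustrative only).
STAGE_COST = {
    "generation": 1,
    "collection": 2,
    "processing": 3,
    "indexing": 8,
    "storage": 5,
    "query": 4,
}
STAGES = list(STAGE_COST)  # dicts keep insertion order, so this is pipeline order


def cost_paid_before_drop(drop_stage: str) -> int:
    """Cost already incurred by data that is dropped on entering `drop_stage`."""
    idx = STAGES.index(drop_stage)
    return sum(STAGE_COST[s] for s in STAGES[:idx])


def cost_avoided_by_drop(drop_stage: str) -> int:
    """Cost that is never paid because the data stops at `drop_stage`."""
    idx = STAGES.index(drop_stage)
    return sum(STAGE_COST[s] for s in STAGES[idx:])
```

With these placeholder numbers, dropping at generation avoids the full cost, while dropping at storage only avoids the storage and query share.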

Sampling Strategies

Head Sampling vs Tail Sampling

Aspect	Head Sampling	Tail Sampling
Decision Point	Trace start (SDK level)	After trace completion (Collector)
Cost Reduction	High (saves network bandwidth too)	Medium
Data Quality	Low (may miss error traces)	High (100% retention of errors/high latency)
Complexity	Low	High (requires memory, routing)
Best For	Non-critical high-traffic services	Production core services
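
The "may miss error traces" limitation follows from how a head sampler decides: it sees only the trace ID, before any span has finished. The function below is a simplified sketch of a ratio-based decision, not the exact rule used by any particular SDK; the 56-bit comparison is an assumption made for illustration.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at trace start whether to keep a trace, using only its ID.

    Simplified sketch: keep the trace when the low 56 bits of the trace ID
    fall below ratio * 2**56. Real SDK samplers may use a different rule.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_bits = int(trace_id_hex, 16) & ((1 << 56) - 1)
    return low_bits < ratio * (1 << 56)
```

Because the decision depends only on the ID, every service that sees the same trace reaches the same verdict, but nothing in the input says whether the trace will later fail or run slowly.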

Tail Sampling Configuration

# OpenTelemetry Collector - Tail Sampling
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Retain 100% of error traces
      - name: error-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR

      # Retain 100% of high-latency traces (>500ms)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 500

      # Retain 100% of critical service traces
      - name: critical-service-policy
        type: string_attribute
        string_attribute:
          key: service.name
          values:
            - payment-service
            - auth-service

      # Sample only 5% of normal traffic
      - name: normal-traffic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
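
The four policies combine as a logical OR: a trace is kept as soon as any one of them matches. The sketch below mirrors that behaviour for the configuration above. The `Trace` class and the `rng` parameter are hypothetical helpers for illustration, and the random draw stands in for the processor's own probabilistic decision.

```python
import random
from dataclasses import dataclass
from typing import Callable

CRITICAL_SERVICES = {"payment-service", "auth-service"}
LATENCY_THRESHOLD_MS = 500
NORMAL_SAMPLING_RATIO = 0.05  # 5% of everything else


@dataclass
class Trace:
    service_name: str
    has_error: bool
    duration_ms: float


def keep_trace(trace: Trace, rng: Callable[[], float] = random.random) -> bool:
    """Return True when at least one policy votes to keep the trace."""
    if trace.has_error:                              # error-policy
        return True
    if trace.duration_ms > LATENCY_THRESHOLD_MS:     # latency-policy
        return True
    if trace.service_name in CRITICAL_SERVICES:      # critical-service-policy
        return True
    return rng() < NORMAL_SAMPLING_RATIO             # normal-traffic-policy
```

Passing a fixed `rng` makes the outcome deterministic, which is convenient when checking a policy change against recorded traces.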

2-Tier Collector Architecture

For stable Tail Sampling in production, split the deployment into Agent Collectors and Gateway Collectors. Tail Sampling can only decide correctly when every span of a trace reaches the same Gateway instance, so the Agents route spans by trace ID and the Gateways make the sampling decision.

[Services] --OTLP--> [Agent Collector (DaemonSet)]
                           |
                     Trace ID-based routing
                           |
                     [Gateway Collector] --> [Tempo/Jaeger]
                     (Tail Sampling runs here)

Use the loadbalancing exporter on Agent Collectors to route spans by trace ID:

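exporters:
  loadbalancing:
    # traceID is the default routing key for traces; set explicitly for clarity
    routing_key: traceID
    protocol:
      otlp:
        tls:
          # Plaintext inside the cluster; enable TLS if traffic leaves a trusted network
          insecure: true
    resolver:
      dns:
        # Headless Service, so DNS returns every Gateway pod IP
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317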
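
The idea behind trace ID-based routing can be sketched in a few lines: hash the trace ID and use the hash to pick a Gateway, so every span of a trace lands on the same instance. This is a simplified illustration using a plain modulo, not the exporter's actual algorithm, and the gateway names are hypothetical.

```python
import hashlib


def pick_gateway(trace_id: str, gateways: list[str]) -> str:
    """Map a trace ID to one gateway, independent of the order of `gateways`."""
    if not gateways:
        raise ValueError("at least one gateway is required")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return sorted(gateways)[index]


# Hypothetical pod names behind the headless Service
GATEWAYS = ["otel-gateway-0", "otel-gateway-1", "otel-gateway-2"]
```

A plain modulo reshuffles most traces whenever the number of gateways changes, which is one reason to scale the Gateway tier conservatively while traces are in flight.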

Log Filtering Pipeline

Level-Based Filtering

Dropping DEBUG/TRACE level logs at the Collector typically reduces log volume by 30–50%.

processors:
  # Drop DEBUG/TRACE logs
  filter/drop-debug:
    error_mode: ignore
    logs:
      log_record:
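        # Keep records with no severity set (0); otherwise they would also match "< INFO"
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'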

  # Drop health check logs
  filter/drop-healthcheck:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(body, ".*GET /health.*")'
        - 'IsMatch(body, ".*GET /readyz.*")'
        - 'IsMatch(body, ".*kube-probe.*")'

  # Remove unnecessary attributes to reduce payload
  transform/reduce-attributes:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "log.file.path")
          - delete_key(attributes, "log.iostream")
          - truncate_all(attributes, 256)
          # limit takes a third argument: keys that must never be dropped ([] for none)
          - limit(attributes, 20, [])
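
Before rolling filters out, it can help to dry-run the same rules against a sample of real log lines. The sketch below reimplements the two drop filters in Python under one stated assumption: records with severity 0 (unspecified) are kept, which is stricter than a bare `severity_number < SEVERITY_NUMBER_INFO` comparison. The function name is hypothetical.

```python
import re

SEVERITY_NUMBER_INFO = 9  # OpenTelemetry log data model: TRACE 1-4, DEBUG 5-8, INFO 9-12

HEALTHCHECK_PATTERNS = [
    re.compile(r"GET /health"),
    re.compile(r"GET /readyz"),
    re.compile(r"kube-probe"),
]


def would_drop(severity_number: int, body: str) -> bool:
    """Return True when a log record would be removed by the drop filters."""
    # Assumption: severity 0 means "not set" and is kept rather than dropped.
    if 0 < severity_number < SEVERITY_NUMBER_INFO:
        return True
    return any(p.search(body) for p in HEALTHCHECK_PATTERNS)
```

Running this over a day of sampled logs gives an estimate of the drop rate before any Collector configuration changes.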

Measuring Filtering Effectiveness

# Measure filtering impact using Collector internal metrics.
# Scrape once so both counters come from the same snapshot.
METRICS=$(curl -s http://localhost:8888/metrics)

# Sum a counter across all label sets; "+ 0" prints 0 when nothing matches.
sum_metric() {
  echo "$METRICS" | grep "^$1" | awk '{sum += $NF} END {printf "%d\n", sum + 0}'
}

RECEIVED=$(sum_metric otelcol_receiver_accepted_log_records)
EXPORTED=$(sum_metric otelcol_exporter_sent_log_records)

if [ "$RECEIVED" -gt 0 ]; then
  # Multiply before dividing so bc's scale=2 does not truncate the ratio.
  DROP_RATE=$(echo "scale=2; ($RECEIVED - $EXPORTED) * 100 / $RECEIVED" | bc)
  echo "Received logs: $RECEIVED"
  echo "Exported logs: $EXPORTED"
  echo "Drop rate: ${DROP_RATE}%"
fi

Metric Cardinality Management

Cardinality explosion is the primary driver of unpredictable observability costs. Every unique time series requires a separate index entry; when millions of series are created, RAM and disk usage spike while ingest latency and query performance degrade.
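
The arithmetic behind the explosion is simple multiplication: the worst-case number of series for a metric is the product of the distinct values of each label. The sketch below computes that upper bound; the label names and counts are hypothetical.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series, assuming every label combination occurs."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for one histogram metric
baseline = {"service": 50, "method": 5, "status_code": 10, "le": 12}
with_user_id = {**baseline, "user_id": 10_000}
```

Here the baseline tops out at 30,000 series, while adding a single `user_id` label with 10,000 values raises the bound to 300 million.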

Dangerous Patterns and Remedies

Dangerous Pattern	Example	Remedy
User ID as label	`user_id="u12345"`	Remove label; move to logs/traces
Raw request path	`path="/api/users/12345"`	Normalize: `path="/api/users/:id"`
Pod name	`pod="web-7f8c9-xk2m"`	Use Deployment name only
Full error message	`error="Connection refused: 10.0.1.42"`	Classify by error code

Collector-Side Cardinality Control

processors:
  transform/normalize-url:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          - replace_pattern(attributes["url.path"], "^/api/users/[^/]+", "/api/users/:id")
          - replace_pattern(attributes["url.path"], "^/api/orders/[^/]+", "/api/orders/:id")
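          # Note: dropping attributes can leave several data points with the same
          # attribute set; aggregate them afterwards if your backend rejects duplicates.
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")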
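
The normalization rules can be checked offline before they reach a Collector. The sketch below applies the same two patterns with Python's `re` module and counts how many distinct paths remain; the helper names are hypothetical.

```python
import re

# Same patterns as the replace_pattern statements, expressed for Python's re module.
RULES = [
    (re.compile(r"^/api/users/[^/]+"), "/api/users/:id"),
    (re.compile(r"^/api/orders/[^/]+"), "/api/orders/:id"),
]


def normalize_path(path: str) -> str:
    """Replace ID segments with a placeholder so paths collapse into one series."""
    for pattern, replacement in RULES:
        path = pattern.sub(replacement, path)
    return path


def distinct_paths(paths: list[str]) -> int:
    """Number of unique label values left after normalization."""
    return len({normalize_path(p) for p in paths})
```

Feeding a sample of raw request paths through `distinct_paths` shows whether any unbounded segment is still slipping through.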

Storage Tiering Architecture

A Hot/Warm/Cold architecture automatically migrates data to cheaper storage tiers as it ages, dramatically reducing long-term retention costs.

Tier	Retention	Storage	Query Speed	Cost
Hot	0–7 days	SSD/NVMe	Fastest	High
Warm	7–30 days	HDD/S3 Standard	Moderate	Medium
Cold	30 days–1 year	S3 Glacier/IA	Slow	Low
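
A rough steady-state estimate shows why tiering matters. The sketch below assumes a constant daily ingest and multiplies the data resident in each tier by a price per GB-month. The tier boundaries follow the table above; the prices are hypothetical placeholders, not quotes from any provider.

```python
# (tier name, days spent in tier, hypothetical $ per GB-month)
TIERS = [
    ("hot", 7, 0.10),
    ("warm", 23, 0.025),
    ("cold", 335, 0.005),
]


def monthly_cost(daily_gb: float, tiers=TIERS) -> float:
    """Monthly storage cost once every tier is full (steady state)."""
    return sum(daily_gb * days * price for _, days, price in tiers)


def all_hot_cost(daily_gb: float, tiers=TIERS) -> float:
    """Monthly cost if the same retention were kept entirely on the hot tier."""
    total_days = sum(days for _, days, _ in tiers)
    hot_price = tiers[0][2]
    return daily_gb * total_days * hot_price
```

At 100 GB per day these placeholder prices give roughly $295 per month tiered versus roughly $3,650 all-hot; retrieval and request fees are ignored.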

Cost Optimization Checklist

Phase 1: Quick Wins (1–2 Weeks)

- Are DEBUG/TRACE level logs being dropped at the Collector in production?
- Are health check and readiness probe logs/traces being filtered?
- Do metric labels avoid high-cardinality values like user IDs or request IDs?
- Are URL path labels normalized (e.g., /api/users/:id instead of /api/users/12345)?
- Is the memory_limiter processor configured on all Collectors?
- Are you monitoring Collector internal metrics (drop rates, queue sizes, memory usage)?

Phase 2: Mid-Term Optimization (2–4 Weeks)

- Does Tail Sampling retain 100% of error/high-latency traces while limiting normal traffic to 5–10%?
- Is a 2-Tier Collector architecture (Agent + Gateway) deployed?
- Is trace ID-based routing (LoadBalancing Exporter) configured?
- Are unnecessary log attributes being stripped?
- Is an Observability Budget defined per service?
- Is a cardinality monitoring dashboard in place?

Phase 3: Long-Term Infrastructure Optimization (1–3 Months)

- Is Hot/Warm/Cold storage tiering configured?
- Is S3 Intelligent-Tiering or equivalent auto-tiering enabled?
- Are metric downsampling policies applied (e.g., 5-min resolution after 7 days, 1-hour after 30 days)?
- Do ILM/retention policies meet regulatory requirements?
- Is a cardinality validation gate included in the CI/CD pipeline?
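
A cardinality validation gate can start very small. The sketch below checks a mapping of metric names to label names against a denylist and a label-count limit. The denylist, the limit, and the input format are all assumptions for illustration; a real gate would extract this mapping from the service's instrumentation or a metrics scrape.

```python
FORBIDDEN_LABELS = {"user_id", "user.id", "request_id", "request.id", "session_id"}
MAX_LABELS_PER_METRIC = 10


def find_violations(metrics: dict[str, list[str]]) -> list[str]:
    """Return one message per problem; an empty list means the gate passes."""
    problems = []
    for name, labels in sorted(metrics.items()):
        banned = sorted(set(labels) & FORBIDDEN_LABELS)
        if banned:
            problems.append(f"{name}: forbidden labels {banned}")
        if len(labels) > MAX_LABELS_PER_METRIC:
            problems.append(
                f"{name}: {len(labels)} labels exceeds limit {MAX_LABELS_PER_METRIC}"
            )
    return problems
```

In a pipeline step, a non-empty result would be printed and turned into a non-zero exit code so the change is blocked before it ships.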

Troubleshooting

Collector Memory Usage Keeps Growing

# 1. Check Collector memory usage
kubectl top pods -n observability -l app=otel-gateway

# 2. Profile memory with pprof (requires the pprof extension; 1777 is its default port)
curl -s http://localhost:1777/debug/pprof/heap > heap.prof
go tool pprof -top heap.prof

# 3. Check Tail Sampling queue state
curl -s http://localhost:8888/metrics | grep tail_sampling
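
When the Gateway is the one growing, a quick sanity check is to estimate how much the Tail Sampling buffer has to hold: every trace stays in memory until `decision_wait` elapses. The sketch below does that arithmetic. The average spans per trace and bytes per span are hypothetical inputs, and real memory use will be higher because of per-object overhead.

```python
def traces_in_flight(decision_wait_s: int, new_traces_per_sec: int) -> int:
    """Traces held in memory while waiting for a sampling decision."""
    return decision_wait_s * new_traces_per_sec


def estimated_buffer_bytes(
    decision_wait_s: int,
    new_traces_per_sec: int,
    avg_spans_per_trace: int,
    avg_span_bytes: int,
) -> int:
    """Lower-bound estimate of span payload buffered by Tail Sampling."""
    in_flight = traces_in_flight(decision_wait_s, new_traces_per_sec)
    return in_flight * avg_spans_per_trace * avg_span_bytes
```

With `decision_wait: 30s` and about 1,000 new traces per second, roughly 30,000 traces are buffered at once, which fits within `num_traces: 100000`. If the estimate approaches `num_traces` or the pod's memory limit, shorten `decision_wait` or add Gateway replicas.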

Finding the Source of Cardinality Explosion

# Top metrics by series count in Prometheus
curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10]'

# Count series of one metric per service (change the "by" label to probe other dimensions)
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=count(http_server_request_duration_seconds_bucket) by (service_name)' | \
  jq '.data.result | sort_by(-(.value[1] | tonumber)) | .[0:10]'

Conclusion

Observability cost optimization is not about indiscriminately reducing data — it is an engineering discipline of precisely separating signal from noise, maintaining observability quality while maximizing cost efficiency.

Key strategies summarized:

- Sampling: Use Head Sampling to save network costs, and Tail Sampling on core services to retain 100% of error and high-latency traces.
- Filtering: Drop DEBUG/health check logs at the Collector and trim attributes to reduce payload sizes.
- Cardinality Management: Remove high-cardinality labels, normalize URL paths, and enforce per-service Observability Budgets.
- Storage Tiering: Deploy Hot/Warm/Cold architecture to automatically migrate aging data to lower-cost storage tiers.

By systematically applying these strategies, you can reduce overall observability costs by 60–80% while preserving the error and high-latency data needed for incident response.