Observability Telemetry Pipeline Cost Optimization: Sampling, Filtering, and Tiering Strategies


Introduction

Observability has become one of the top line items in cloud infrastructure spending. The global observability market surpassed \$28.5 billion in 2025 and is projected to reach \$34.1 billion by the end of 2026. As microservice architectures proliferate, traces typically account for 60–70% of total observability costs, while logs consume another 20–30%. When hundreds of services generate millions of spans and log entries per second, storage and indexing costs can grow faster than traffic itself, because every new service, attribute, and label combination multiplies the data that must be indexed.

However, simply reducing data is not the answer. The key is to **separate signal from noise** — cutting costs while preserving the quality of observability. This article covers telemetry pipeline cost optimization strategies centered on the OpenTelemetry Collector, with practical configuration examples and actionable checklists.

Cost Structure Analysis

Cost Breakdown by Signal Type

| Signal       | Cost Share | Primary Cost Drivers                                  | Optimization Difficulty |
| ------------ | ---------- | ----------------------------------------------------- | ----------------------- |
| **Traces**   | 60–70%     | High cardinality, large payloads, span explosion      | High                    |
| **Logs**     | 20–30%     | Unstructured data, full-text indexing, high volume    | Medium                  |
| **Metrics**  | 5–15%      | Time-series cardinality explosion, label combinations | Medium                  |
| **Profiles** | 1–5%       | CPU/memory profile data size                          | Low                     |

Where Costs Occur in the Pipeline

    [Generation] --> [Collection] --> [Processing] --> [Indexing] --> [Storage] --> [Query]
         |                |                |               |              |            |
        SDK            Network         Collector        Backend        Disk/S3      Compute
     overhead         bandwidth       CPU/Memory       Index I/O       storage    query cost

The core optimization principle: **eliminate unnecessary data as early in the pipeline as possible**. Data removed at the generation stage saves costs at every subsequent stage, while data removed at the storage stage means you have already paid for collection, processing, and indexing.
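To make the principle concrete, the sketch below sums what has already been spent per GB by the time data is dropped at a given stage. The per-stage prices are hypothetical placeholders, not figures from any vendor, and the `sunk_cost` helper is an illustrative name; substitute your own numbers.

```shell
# Hypothetical per-GB cost of each pipeline stage in USD (assumed values).
STAGE_COSTS="collection=0.02 processing=0.03 indexing=0.25 storage=0.05 query=0.05"

# sunk_cost STAGE: cost per GB already paid when data is dropped just before STAGE.
sunk_cost() {
  echo "$STAGE_COSTS" | tr ' ' '\n' | awk -F= -v stop="$1" '
    $1 == stop { exit }        # stop summing once the drop point is reached
    { sum += $2 }
    END { printf "%.2f\n", sum }'
}

for stage in collection indexing storage; do
  echo "dropped before $stage: \$$(sunk_cost "$stage") per GB already spent"
done
```

With these assumed prices, data dropped before collection has cost nothing, while data dropped just before storage has already incurred the collection, processing, and indexing charges.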

Sampling Strategies

Head Sampling vs Tail Sampling

| Aspect             | Head Sampling                      | Tail Sampling                                |
| ------------------ | ---------------------------------- | -------------------------------------------- |
| **Decision Point** | Trace start (SDK level)            | After trace completion (Collector)           |
| **Cost Reduction** | High (saves network bandwidth too) | Medium                                       |
| **Data Quality**   | Low (may miss error traces)        | High (100% retention of errors/high latency) |
| **Complexity**     | Low                                | High (requires memory, routing)              |
| **Best For**       | Non-critical high-traffic services | Production core services                     |
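Head sampling is usually switched on in the SDK rather than in the Collector. A minimal sketch using the standard OpenTelemetry SDK environment variables follows; the 10% ratio, the 50,000 spans/sec figure, and the `exported_spans_per_sec` helper are assumptions for illustration only.

```shell
# Standard OpenTelemetry SDK environment variables for head sampling.
# parentbased_traceidratio samples root spans by ratio and lets child spans
# follow their parent's decision, so traces are kept or dropped as a whole.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1   # assumed ratio: keep about 10% of traces

# exported_spans_per_sec TOTAL RATIO: expected spans leaving the SDK per second.
exported_spans_per_sec() {
  awk -v total="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", total * ratio }'
}

exported_spans_per_sec 50000 "$OTEL_TRACES_SAMPLER_ARG"
```

Because the decision is made before any span leaves the process, the saving applies to network bandwidth and Collector CPU as well as to storage, which is the trade-off the table describes.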

Tail Sampling Configuration

    # OpenTelemetry Collector - Tail Sampling
    processors:
      tail_sampling:
        decision_wait: 30s
        num_traces: 100000
        expected_new_traces_per_sec: 1000
        policies:
          # Retain 100% of error traces
          - name: error-policy
            type: status_code
            status_code:
              status_codes:
                - ERROR

          # Retain 100% of high-latency traces (>500ms)
          - name: latency-policy
            type: latency
            latency:
              threshold_ms: 500

          # Retain 100% of critical service traces
          - name: critical-service-policy
            type: string_attribute
            string_attribute:
              key: service.name
              values:
                - payment-service
                - auth-service

          # Sample only 5% of normal traffic
          - name: normal-traffic-policy
            type: probabilistic
            probabilistic:
              sampling_percentage: 5
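Two quick sanity checks help when tuning these values. First, the number of traces held in memory is roughly the arrival rate multiplied by `decision_wait`, and it should stay below `num_traces`. Second, because a trace is kept when any policy matches, the overall retention rate is higher than the 5% probabilistic setting alone. The sketch below works through both; the match fractions (1% errors, 2% slow, 3% critical-service traffic) are assumed values, and the policies are treated as statistically independent for simplicity.

```shell
# in_flight_traces RATE WAIT_SECONDS: traces buffered while awaiting a decision.
in_flight_traces() {
  echo $(( $1 * $2 ))
}

# keep_rate_pct P1 P2 ...: percentage of traces kept when independent
# policies with match probabilities P1, P2, ... are OR'ed together.
keep_rate_pct() {
  awk -v ps="$*" 'BEGIN {
    n = split(ps, p, " ")
    drop = 1
    for (i = 1; i <= n; i++) drop *= (1 - p[i])   # chance that no policy matches
    printf "%.1f\n", (1 - drop) * 100
  }'
}

in_flight_traces 1000 30              # rate and decision_wait from the config
keep_rate_pct 0.01 0.02 0.03 0.05     # assumed fractions plus the 5% sampler
```

With the configured values, about 30,000 traces are in flight, well under the `num_traces` limit of 100,000. Under the assumed fractions, roughly 10.6% of traces would be retained overall.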

2-Tier Collector Architecture

For stable Tail Sampling in production, separate Agent Collectors from Gateway Collectors and route spans between them by trace ID. Tail Sampling can only decide correctly when every span of a trace reaches the same Gateway instance, and trace ID-based routing is what guarantees that.

    [Services] --OTLP--> [Agent Collector (DaemonSet)]
                                       |
                            Trace ID-based routing
                                       |
                              [Gateway Collector] --> [Tempo/Jaeger]
                           (Tail Sampling runs here)

Use the `loadbalancing` exporter on Agent Collectors to route spans by trace ID:

    exporters:
      loadbalancing:
        protocol:
          otlp:
            tls:
              insecure: true
        resolver:
          dns:
            hostname: otel-gateway-headless.observability.svc.cluster.local
            port: 4317
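The idea behind trace ID-based routing can be illustrated in a few lines: hash the trace ID and use the hash to choose a Gateway, so that every span carrying the same ID lands on the same replica. This is a simplified sketch, not the exporter's actual algorithm; `pick_gateway` is a hypothetical helper, and a plain modulo like this would reshuffle most traces whenever the replica count changes.

```shell
# pick_gateway TRACE_ID REPLICAS: map a trace ID to a replica index (0-based).
pick_gateway() {
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % $2 ))
}

trace_id=4bf92f3577b34da6a3ce929d0e0e4736   # example trace ID
echo "span A -> gateway $(pick_gateway "$trace_id" 3)"
echo "span B -> gateway $(pick_gateway "$trace_id" 3)"
```

Both spans print the same gateway index because the choice depends only on the trace ID, never on which Agent received the span.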

Log Filtering Pipeline

Level-Based Filtering

Dropping DEBUG/TRACE level logs at the Collector typically reduces log volume by 30–50%.

    processors:
      # Drop DEBUG/TRACE logs
      filter/drop-debug:
        error_mode: ignore
        logs:
          log_record:
            - 'severity_number < SEVERITY_NUMBER_INFO'

      # Drop health check logs
      filter/drop-healthcheck:
        error_mode: ignore
        logs:
          log_record:
            - 'IsMatch(body, ".*GET /health.*")'
            - 'IsMatch(body, ".*GET /readyz.*")'
            - 'IsMatch(body, ".*kube-probe.*")'

      # Remove unnecessary attributes to reduce payload
      transform/reduce-attributes:
        error_mode: ignore
        log_statements:
          - context: log
            statements:
              - delete_key(attributes, "log.file.path")
              - delete_key(attributes, "log.iostream")
              - truncate_all(attributes, 256)
              # limit takes a third argument: keys to preserve first (none here)
              - limit(attributes, 20, [])
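It helps to see which records the severity condition actually matches. OpenTelemetry assigns TRACE the numbers 1–4, DEBUG 5–8, and INFO 9–12, with 0 meaning the severity was never set. The sketch below mirrors the numeric comparison `severity_number < SEVERITY_NUMBER_INFO`; `would_drop` is a hypothetical helper for illustration, not part of the Collector.

```shell
# would_drop SEVERITY_NUMBER: mirrors "severity_number < SEVERITY_NUMBER_INFO" (INFO = 9).
would_drop() {
  if [ "$1" -lt 9 ]; then echo drop; else echo keep; fi
}

# 0 = unspecified, 1 = TRACE, 5 = DEBUG, 9 = INFO, 13 = WARN, 17 = ERROR
for sev in 0 1 5 9 13 17; do
  printf '%2d -> %s\n' "$sev" "$(would_drop "$sev")"
done
```

Note that severity 0 falls below the threshold too. If some sources emit logs without a parsed severity, it is worth checking whether those records should be excluded from the drop condition before rolling the filter out.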

Measuring Filtering Effectiveness

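    # Measure filtering impact using Collector internal metrics.
    # Fetch once so both counters come from the same snapshot.
    METRICS=$(curl -s http://localhost:8888/metrics)

    # Anchoring on '^otelcol_' skips the "# HELP" and "# TYPE" comment lines.
    RECEIVED=$(printf '%s\n' "$METRICS" | \
      grep '^otelcol_receiver_accepted_log_records' | \
      awk '{sum += $NF} END {printf "%.0f", sum}')

    EXPORTED=$(printf '%s\n' "$METRICS" | \
      grep '^otelcol_exporter_sent_log_records' | \
      awk '{sum += $NF} END {printf "%.0f", sum}')

    if [ "$RECEIVED" -gt 0 ]; then
      # Multiply before dividing: with scale=2, bc would otherwise truncate the ratio.
      DROP_RATE=$(echo "scale=2; ($RECEIVED - $EXPORTED) * 100 / $RECEIVED" | bc)
      echo "Received logs: $RECEIVED"
      echo "Exported logs: $EXPORTED"
      echo "Drop rate: ${DROP_RATE}%"
    fi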

Metric Cardinality Management

Cardinality explosion is the primary driver of unpredictable observability costs. Every unique time series requires a separate index entry; when millions of series are created, RAM and disk usage spike while ingest latency and query performance degrade.

Dangerous Patterns and Remedies

| Dangerous Pattern  | Example                                 | Remedy                             |
| ------------------ | --------------------------------------- | ---------------------------------- |
| User ID as label   | `user_id="u12345"`                      | Remove label; move to logs/traces  |
| Raw request path   | `path="/api/users/12345"`               | Normalize: `path="/api/users/:id"` |
| Pod name           | `pod="web-7f8c9-xk2m"`                  | Use Deployment name only           |
| Full error message | `error="Connection refused: 10.0.1.42"` | Classify by error code             |
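The reason these patterns are dangerous is multiplicative: the worst-case series count for a metric is the product of the distinct values of each label. A small worked example follows, with every count (20 services, 5 methods, 50 routes, 6 status codes, 12 histogram buckets, 1,000 user IDs) chosen purely for illustration.

```shell
# series_count N1 N2 ...: worst-case series for one metric, given the
# number of distinct values per label.
series_count() {
  total=1
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

echo "without user_id: $(series_count 20 5 50 6 12)"
echo "with user_id:    $(series_count 20 5 50 6 12 1000)"
```

Under these assumptions, a single histogram grows from 360,000 potential series to 360,000,000 once a `user_id` label with only 1,000 distinct values is attached.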

Collector-Side Cardinality Control

    processors:
      transform/normalize-url:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              - replace_pattern(attributes["url.path"], "^/api/users/[^/]+", "/api/users/:id")
              - replace_pattern(attributes["url.path"], "^/api/orders/[^/]+", "/api/orders/:id")
              - delete_key(attributes, "user.id")
              - delete_key(attributes, "request.id")
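Before deploying normalization rules, it is worth trying the regular expressions against real paths. The sketch below applies the same two patterns with `sed -E`; it is only a local approximation of what `replace_pattern` does, and the `normalize_path` name and sample paths are hypothetical.

```shell
# normalize_path PATH: apply the two normalization patterns from the config.
normalize_path() {
  printf '%s\n' "$1" | sed -E \
    -e 's#^/api/users/[^/]+#/api/users/:id#' \
    -e 's#^/api/orders/[^/]+#/api/orders/:id#'
}

for p in /api/users/12345 /api/users/12345/orders /api/orders/abc-9/items /healthz; do
  printf '%-26s -> %s\n' "$p" "$(normalize_path "$p")"
done
```

Paths that match neither pattern pass through unchanged, so any route missing from the rule list keeps its raw, high-cardinality form. Running a sample of production paths through a check like this is a quick way to spot such gaps.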

Storage Tiering Architecture

A Hot/Warm/Cold architecture automatically migrates data to cheaper storage tiers as it ages, dramatically reducing long-term retention costs.

| Tier     | Retention      | Storage         | Query Speed | Cost   |
| -------- | -------------- | --------------- | ----------- | ------ |
| **Hot**  | 0–7 days       | SSD/NVMe        | Fastest     | High   |
| **Warm** | 7–30 days      | HDD/S3 Standard | Moderate    | Medium |
| **Cold** | 30 days–1 year | S3 Glacier/IA   | Slow        | Low    |
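A rough steady-state estimate shows why tiering matters for long retention. The sketch below uses the tier durations from the table (7 days hot, 23 days warm, 335 days cold) and compares them with keeping a full year on hot storage. The per-GB monthly prices (0.10, 0.023, and 0.004 USD) and the 100 GB/day ingest rate are hypothetical, and retrieval and query charges are ignored.

```shell
# tiered_cost DAILY_GB: monthly storage cost at steady state with tiering.
tiered_cost() {
  awk -v gb="$1" 'BEGIN {
    hot  = gb * 7   * 0.10     # days 0-7 on SSD/NVMe (assumed price)
    warm = gb * 23  * 0.023    # days 7-30 on warm storage (assumed price)
    cold = gb * 335 * 0.004    # days 30-365 on cold storage (assumed price)
    printf "%.2f\n", hot + warm + cold
  }'
}

# all_hot_cost DAILY_GB: monthly cost if a full year stays on hot storage.
all_hot_cost() {
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 365 * 0.10 }'
}

echo "tiered:  \$$(tiered_cost 100) per month"
echo "all hot: \$$(all_hot_cost 100) per month"
```

With these assumed prices the tiered layout costs about 257 USD per month against 3,650 USD for an all-hot year, a reduction of roughly 93%. The exact ratio depends entirely on the prices you plug in.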

Cost Optimization Checklist

Phase 1: Quick Wins (1–2 Weeks)

- [ ] Are DEBUG/TRACE level logs being dropped at the Collector in production?

- [ ] Are health check and readiness probe logs/traces being filtered?

- [ ] Do metric labels avoid high-cardinality values like user IDs or request IDs?

- [ ] Are URL path labels normalized (e.g., `/api/users/:id` instead of `/api/users/12345`)?

- [ ] Is the `memory_limiter` processor configured on all Collectors?

- [ ] Are you monitoring Collector internal metrics (drop rates, queue sizes, memory usage)?

Phase 2: Mid-Term Optimization (2–4 Weeks)

- [ ] Does Tail Sampling retain 100% of error/high-latency traces while limiting normal traffic to 5–10%?

- [ ] Is a 2-Tier Collector architecture (Agent + Gateway) deployed?

- [ ] Is trace ID-based routing (LoadBalancing Exporter) configured?

- [ ] Are unnecessary log attributes being stripped?

- [ ] Is an Observability Budget defined per service?

- [ ] Is a cardinality monitoring dashboard in place?

Phase 3: Long-Term Infrastructure Optimization (1–3 Months)

- [ ] Is Hot/Warm/Cold storage tiering configured?

- [ ] Is S3 Intelligent-Tiering or equivalent auto-tiering enabled?

- [ ] Are metric downsampling policies applied (e.g., 5-min resolution after 7 days, 1-hour after 30 days)?

- [ ] Do ILM/retention policies meet regulatory requirements?

- [ ] Is a cardinality validation gate included in the CI/CD pipeline?
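A cardinality gate in CI can be as small as a script that compares per-metric series counts against a budget and fails the build when one is exceeded. The sketch below assumes the counts have already been extracted into `name count` lines, for example from the Prometheus TSDB status endpoint. The `check_budget` name, the input format, the sample metric names, and the budget of 1,000 series are all hypothetical choices.

```shell
# check_budget LIMIT: read "metric_name series_count" lines on stdin and
# exit non-zero if any metric exceeds LIMIT series.
check_budget() {
  awk -v limit="$1" '
    $2 > limit { printf "over budget: %s (%d series)\n", $1, $2; bad = 1 }
    END { exit bad }'
}

# Example input; in CI these lines would come from a staging environment.
if printf 'http_requests_total 1200\nqueue_depth 40\n' | check_budget 1000; then
  echo "cardinality gate passed"
else
  echo "cardinality gate failed"
fi
```

Wiring the exit status into the pipeline means a change that introduces a high-cardinality label is caught in review rather than on the next invoice.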

Troubleshooting

Collector Memory Usage Keeps Growing

    # 1. Check Collector memory usage
    kubectl top pods -n observability -l app=otel-gateway

    # 2. Profile memory with pprof
    curl -s http://localhost:1777/debug/pprof/heap > heap.prof
    go tool pprof -top heap.prof

    # 3. Check Tail Sampling queue state
    curl -s http://localhost:8888/metrics | grep tail_sampling

Finding the Source of Cardinality Explosion

    # Top metrics by series count in Prometheus
    curl -s http://prometheus:9090/api/v1/status/tsdb | \
      jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10]'

    # Analyze label cardinality for a specific metric.
    # -G with --data-urlencode encodes the spaces and parentheses in the PromQL query.
    curl -sG http://prometheus:9090/api/v1/query \
      --data-urlencode 'query=count(http_server_request_duration_seconds_bucket) by (service_name)' | \
      jq '.data.result | sort_by(-(.value[1] | tonumber)) | .[0:10]'

Conclusion

Observability cost optimization is not about indiscriminately reducing data. It is an engineering discipline: precisely separating signal from noise so that observability quality is maintained while cost efficiency improves.

Key strategies summarized:

1. **Sampling**: Use Head Sampling to save network costs, and Tail Sampling on core services to retain 100% of error and high-latency traces.

2. **Filtering**: Drop DEBUG/health check logs at the Collector and trim attributes to reduce payload sizes.

3. **Cardinality Management**: Remove high-cardinality labels, normalize URL paths, and enforce per-service Observability Budgets.

4. **Storage Tiering**: Deploy Hot/Warm/Cold architecture to automatically migrate aging data to lower-cost storage tiers.

Applied systematically, these strategies can reduce overall observability costs by 60–80% while preserving the error traces, high-latency traces, and logs at INFO level and above that incident response depends on.

