Observability Telemetry Pipeline Cost Optimization: Sampling, Filtering, and Tiering Strategies
Introduction
Observability costs have become one of the largest line items in cloud infrastructure spending; the global observability market is projected to surpass $34.1 billion by 2026. As microservice architectures proliferate, traces account for 60–70% of total observability costs, while logs consume another 20–30%. When hundreds of services generate millions of spans and log entries per second, storage and indexing costs grow far faster than linearly.
However, simply reducing data is not the answer. The key is to separate signal from noise — cutting costs while preserving the quality of observability. This article covers telemetry pipeline cost optimization strategies centered on the OpenTelemetry Collector, with practical configuration examples and actionable checklists.
Cost Structure Analysis
Cost Breakdown by Signal Type
| Signal | Cost Share | Primary Cost Drivers | Optimization Difficulty |
|---|---|---|---|
| Traces | 60–70% | High cardinality, large payloads, span explosion | High |
| Logs | 20–30% | Unstructured data, full-text indexing, high volume | Medium |
| Metrics | 5–15% | Time-series cardinality explosion, label combinations | Medium |
| Profiles | 1–5% | CPU/memory profile data size | Low |
Where Costs Occur in the Pipeline
```
[Generation] --> [Collection] --> [Processing] --> [Indexing] --> [Storage] --> [Query]
     |               |                 |               |              |            |
    SDK           Network          Collector        Backend        Disk/S3      Compute
  overhead       bandwidth        CPU/Memory       index I/O       storage     query cost
```
The core optimization principle: eliminate unnecessary data as early in the pipeline as possible. Data removed at the generation stage saves costs at every subsequent stage, while data removed at the storage stage means you have already paid for collection, processing, and indexing.
Sampling Strategies
Head Sampling vs Tail Sampling
| Aspect | Head Sampling | Tail Sampling |
|---|---|---|
| Decision Point | Trace start (SDK level) | After trace completion (Collector) |
| Cost Reduction | High (saves network bandwidth too) | Medium |
| Data Quality | Low (may miss error traces) | High (100% retention of errors/high latency) |
| Complexity | Low | High (requires memory, routing) |
| Best For | Non-critical high-traffic services | Production core services |
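Head sampling usually needs no Collector changes at all: it can be switched on through the standard OpenTelemetry SDK environment variables. A minimal sketch (the 10% ratio is illustrative):

```bash
# Head sampling via standard OpenTelemetry SDK environment variables.
# parentbased_traceidratio honors the parent span's sampling decision
# and probabilistically samples 10% of root traces.
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.10"
```

Because the decision is made at trace start, unsampled spans are never serialized or sent, which is why head sampling also saves SDK and network cost.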
Tail Sampling Configuration
```yaml
# OpenTelemetry Collector - Tail Sampling
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Retain 100% of error traces
      - name: error-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR
      # Retain 100% of high-latency traces (>500ms)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 500
      # Retain 100% of critical service traces
      - name: critical-service-policy
        type: string_attribute
        string_attribute:
          key: service.name
          values:
            - payment-service
            - auth-service
      # Sample only 5% of normal traffic
      - name: normal-traffic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```
2-Tier Collector Architecture
For stable Tail Sampling in production, separate Agent Collectors from Gateway Collectors. This ensures all spans from the same trace arrive at the same Gateway instance for correct sampling decisions.
```
[Services] --OTLP--> [Agent Collector (DaemonSet)]
                             |
                  Trace ID-based routing
                             |
                    [Gateway Collector] --> [Tempo/Jaeger]
                    (Tail Sampling runs here)
```
Use the loadbalancing exporter on Agent Collectors to route spans by trace ID:
```yaml
exporters:
  loadbalancing:
    # spans are routed by trace ID by default (routing_key: traceID)
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317
```
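On the Gateway side, the `tail_sampling` processor then sits in the traces pipeline. A minimal `service` section sketch (receiver and exporter names are illustrative):

```yaml
# Gateway Collector pipeline sketch (component names are illustrative).
# memory_limiter runs first; tail_sampling buffers whole traces in memory
# until decision_wait expires, so it needs headroom before batch/export.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
```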
Log Filtering Pipeline
Level-Based Filtering
Dropping DEBUG/TRACE level logs at the Collector typically reduces log volume by 30–50%.
```yaml
processors:
  # Drop DEBUG/TRACE logs
  filter/drop-debug:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'

  # Drop health check logs
  filter/drop-healthcheck:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(body, ".*GET /health.*")'
        - 'IsMatch(body, ".*GET /readyz.*")'
        - 'IsMatch(body, ".*kube-probe.*")'

  # Remove unnecessary attributes to reduce payload
  transform/reduce-attributes:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "log.file.path")
          - delete_key(attributes, "log.iostream")
          - truncate_all(attributes, 256)
          - limit(attributes, 20, [])
```
Measuring Filtering Effectiveness
```bash
# Measure filtering impact using Collector internal metrics
# (anchored grep skips the # HELP / # TYPE comment lines)
RECEIVED=$(curl -s http://localhost:8888/metrics | \
  grep '^otelcol_receiver_accepted_log_records' | \
  awk '{sum += $2} END {print sum}')
EXPORTED=$(curl -s http://localhost:8888/metrics | \
  grep '^otelcol_exporter_sent_log_records' | \
  awk '{sum += $2} END {print sum}')

if [ "$RECEIVED" -gt 0 ]; then
  DROP_RATE=$(echo "scale=2; (1 - $EXPORTED / $RECEIVED) * 100" | bc)
  echo "Received logs: $RECEIVED"
  echo "Exported logs: $EXPORTED"
  echo "Drop rate: ${DROP_RATE}%"
fi
```
Metric Cardinality Management
Cardinality explosion is the primary driver of unpredictable observability costs. Every unique time series requires a separate index entry; when millions of series are created, RAM and disk usage spike while ingest latency and query performance degrade.
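A back-of-the-envelope estimate makes the risk concrete: the worst-case number of time series for a metric is the product of its per-label cardinalities. The figures below are illustrative:

```bash
# Worst-case series count = product of per-label cardinalities.
# 50 services x 200 normalized endpoints x 10 status codes:
echo $(( 50 * 200 * 10 ))        # 100000 series
# Adding a pod-name label with 300 values multiplies everything:
echo $(( 50 * 200 * 10 * 300 ))  # 30000000 series
```

A single extra label can thus turn a manageable metric into one that dominates the backend's index.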
Dangerous Patterns and Remedies
| Dangerous Pattern | Example | Remedy |
|---|---|---|
| User ID as label | user_id="u12345" | Remove label; move to logs/traces |
| Raw request path | path="/api/users/12345" | Normalize: path="/api/users/:id" |
| Pod name | pod="web-7f8c9-xk2m" | Use Deployment name only |
| Full error message | error="Connection refused: 10.0.1.42" | Classify by error code |
Collector-Side Cardinality Control
```yaml
processors:
  transform/normalize-url:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          - replace_pattern(attributes["url.path"], "^/api/users/[^/]+", "/api/users/:id")
          - replace_pattern(attributes["url.path"], "^/api/orders/[^/]+", "/api/orders/:id")
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")
```
Storage Tiering Architecture
A Hot/Warm/Cold architecture automatically migrates data to cheaper storage tiers as it ages, dramatically reducing long-term retention costs.
| Tier | Retention | Storage | Query Speed | Cost |
|---|---|---|---|---|
| Hot | 0–7 days | SSD/NVMe | Fastest | High |
| Warm | 7–30 days | HDD/S3 Standard | Moderate | Medium |
| Cold | 30 days–1 year | S3 Glacier/IA | Slow | Low |
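On object storage, tier transitions of this kind can be automated with a bucket lifecycle policy. A sketch for S3 (the `traces/` prefix and the retention windows are assumptions to match the table above):

```json
{
  "Rules": [
    {
      "ID": "telemetry-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "traces/" },
      "Transitions": [
        { "Days": 7,  "StorageClass": "STANDARD_IA" },
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

This would be applied with `aws s3api put-bucket-lifecycle-configuration`; before moving data to a cold class, confirm that the query backend tolerates its restore latency.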
Cost Optimization Checklist
Phase 1: Quick Wins (1–2 Weeks)
- Are DEBUG/TRACE level logs being dropped at the Collector in production?
- Are health check and readiness probe logs/traces being filtered?
- Do metric labels avoid high-cardinality values like user IDs or request IDs?
- Are URL path labels normalized (e.g., `/api/users/:id` instead of `/api/users/12345`)?
- Is the `memory_limiter` processor configured on all Collectors?
- Are you monitoring Collector internal metrics (drop rates, queue sizes, memory usage)?
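For the `memory_limiter` item, a typical configuration looks like the following sketch (the limits are illustrative and should be sized against the Collector container's memory request):

```yaml
processors:
  memory_limiter:
    check_interval: 1s     # how often memory usage is checked
    limit_mib: 1600        # soft limit; incoming data is refused above this
    spike_limit_mib: 400   # extra headroom reserved for sudden bursts
```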
Phase 2: Mid-Term Optimization (2–4 Weeks)
- Does Tail Sampling retain 100% of error/high-latency traces while limiting normal traffic to 5–10%?
- Is a 2-Tier Collector architecture (Agent + Gateway) deployed?
- Is trace ID-based routing (LoadBalancing Exporter) configured?
- Are unnecessary log attributes being stripped?
- Is an Observability Budget defined per service?
- Is a cardinality monitoring dashboard in place?
Phase 3: Long-Term Infrastructure Optimization (1–3 Months)
- Is Hot/Warm/Cold storage tiering configured?
- Is S3 Intelligent-Tiering or equivalent auto-tiering enabled?
- Are metric downsampling policies applied (e.g., 5-min resolution after 7 days, 1-hour after 30 days)?
- Do ILM/retention policies meet regulatory requirements?
- Is a cardinality validation gate included in the CI/CD pipeline?
Troubleshooting
Collector Memory Usage Keeps Growing
```bash
# 1. Check Collector memory usage
kubectl top pods -n observability -l app=otel-gateway

# 2. Profile memory with pprof (requires the pprof extension, default port 1777)
curl -s http://localhost:1777/debug/pprof/heap > heap.prof
go tool pprof -top heap.prof

# 3. Check Tail Sampling queue state
curl -s http://localhost:8888/metrics | grep tail_sampling
```
Finding the Source of Cardinality Explosion
```bash
# Top 10 metrics by series count in Prometheus
curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10]'

# Per-service series count for a specific metric
# (-G --data-urlencode keeps the PromQL readable and URL-safe)
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=count by (service_name) (http_server_request_duration_seconds_bucket)' | \
  jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[0:10]'
```
Conclusion
Observability cost optimization is not about indiscriminately reducing data — it is an engineering discipline of precisely separating signal from noise, maintaining observability quality while maximizing cost efficiency.
Key strategies summarized:
- Sampling: Use Head Sampling to save network costs, and Tail Sampling on core services to retain 100% of error and high-latency traces.
- Filtering: Drop DEBUG/health check logs at the Collector and trim attributes to reduce payload sizes.
- Cardinality Management: Remove high-cardinality labels, normalize URL paths, and enforce per-service Observability Budgets.
- Storage Tiering: Deploy Hot/Warm/Cold architecture to automatically migrate aging data to lower-cost storage tiers.
By systematically applying these strategies, you can reduce overall observability costs by 60–80% while fully preserving the data needed for incident response.