Distributed Tracing & OpenTelemetry 2026 — OTel / Jaeger / Tempo / Zipkin / Honeycomb / Lightstep / SigNoz / SkyWalking / Datadog APM Deep Dive
Author: Youngju Kim (@fjvbn20031)
Prologue — "Why is checkout slow?"
A Friday evening in 2026. The payments-team Slack channel lights up.
"p99 checkout latency 4.2s. Normally 600ms. What got slow?"
Answering that within five minutes is why humanity spent the last decade building distributed tracing. One request fans through the gateway, into auth, into the payments core, out to an external PG, then back through the receipts queue — every hop captured as a single trace.
In 2020, the field was fragmented: OpenTracing, OpenCensus, Jaeger clients, Zipkin libraries, each APM vendor's agent. Even inside one company Java used Datadog, Go used Jaeger, Node used New Relic. In 2026 the answer is simpler — OpenTelemetry.
This post is a map of distributed tracing in 2026: the OTel spec and Collector, propagation standards, OSS backends (Jaeger/Tempo/Zipkin), observability 2.0 (Honeycomb, SigNoz), acquired giants (Lightstep), the APM super-league (Datadog, New Relic, Dynatrace, Splunk), eBPF auto-instrumentation (Pixie, Beyla), plus sampling and cost. It also covers how Korean and Japanese companies actually migrated.
1. The 2026 Distributed-Tracing Map: Five Camps
The ecosystem clusters into five buckets.
| Bucket | Examples | Character |
|---|---|---|
| Standard / instrumentation | OpenTelemetry | Shared spec, SDKs, Collector. Almost everyone flows through this |
| OSS self-hosted backends | Jaeger, Tempo, Zipkin, SigNoz, SkyWalking | Self-operated. No license fees |
| APM SaaS | Datadog, New Relic, Dynatrace, Splunk, AppDynamics, Sentry, Elastic | Managed. Unified UI, alerting, AIOps |
| observability 2.0 | Honeycomb, Lightstep (ServiceNow) | Wide events, high-cardinality analytics |
| eBPF auto-instrumentation | Pixie, Beyla, Coroot | No code changes — kernel / network level |
Three axes in one picture.
```
               OSS                            SaaS
         +--------------+             +------------------+
Instr.   | OTel SDK     | -- shared - | OTel SDK         |
         | (vendor-     |             | (vendor adapter) |
         |  neutral)    |             |                  |
         +------+-------+             +--------+---------+
                |                              |
         +------v------------------------------v---------+
         |            OpenTelemetry Collector            |
         |    receivers --> processors --> exporters     |
         +------+------------------------------+---------+
                |                              |
        +-------v--------+            +--------v--------+
        | Jaeger / Tempo |            | Datadog APM     |
        | Zipkin / SigNoz|            | New Relic APM   |
        | SkyWalking     |            | Honeycomb etc.  |
        +----------------+            +-----------------+
```
The lesson: unify instrumentation on OTel, keep backends swappable. That is the 2026 default posture.
2. OpenTelemetry — CNCF Graduated (2024)
OpenTelemetry (OTel) was formed in 2019 from the merger of OpenTracing (2016) and OpenCensus (2018). CNCF Incubating in 2021, CNCF Graduated in November 2024. It is now the de facto industry standard — the second most active CNCF project after Kubernetes.
What OTel provides:
- Specification: what is a trace, a span, what attributes exist.
- APIs: instrumentation APIs per language (Java, Go, Python, .NET, Node.js, Ruby, PHP, Rust, C++, Swift...).
- SDKs: reference implementations of the API — sampling, batching, exporters.
- Collector: a separate process running the telemetry pipeline.
- Semantic Conventions: standard attribute names for HTTP, DB, messaging, etc.
- OTLP (OpenTelemetry Protocol): gRPC/HTTP wire format.
Core data model: a trace is a tree of spans. Each span has trace_id, span_id, parent_span_id, start/end, attributes, events, links.
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register a global tracer provider that batches spans and ships them over OTLP/gRPC.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    # insecure=True assumes a plaintext in-cluster Collector; drop it once TLS is enabled.
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)

tracer = trace.get_tracer("checkout")

# user_id, amount, and pg (the payment-gateway client) come from the request handler.
with tracer.start_as_current_span("charge_card") as span:
    span.set_attribute("user.id", user_id)
    span.set_attribute("amount", amount)
    result = pg.charge(...)  # arguments elided
    span.set_attribute("pg.response_code", result.code)
```
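To make the data model concrete, here is a minimal sketch in plain Python with no OTel dependency. It rebuilds the span tree from parent_span_id links and prints it as an indented outline. The `Span` dataclass, `build_tree`, and `render` are hypothetical names for illustration, not OTel SDK types, and the timings are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Group spans by parent_span_id; the key None holds the root span(s)."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return indented 'name (duration ms)' lines, parents before children."""
    lines = []
    for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
        lines.append(f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms} ms)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Illustrative spans for one slow checkout request.
spans = [
    Span("t1", "a", None, "POST /checkout", 0, 4200),
    Span("t1", "b", "a", "auth.verify", 10, 60),
    Span("t1", "c", "a", "charge_card", 70, 4100),
    Span("t1", "d", "c", "pg.charge", 80, 4050),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Read top to bottom, the outline already answers the prologue's question for this made-up trace: almost all of the time sits under the payment-gateway call.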
OTel's real value is that metrics, logs, and traces share one SDK. Traces and metrics are stable, and logs reached stable across most languages by 2024-2025. OTel is no longer just a tracing standard; it is a unified telemetry standard.
3. OTel Collector — Receivers, Processors, Exporters
OTel SDK lives inside your application; the Collector runs as a separate process. Reasons for that separation:
- Apps do not pull in vendor backend SDKs — dependency hygiene.
- Sampling, filtering, routing applied centrally.
- Backend swap does not require app restarts.
- Surge buffering / retry.
Three component types:
| Component | Role | Examples |
|---|---|---|
| Receivers | Ingest telemetry from outside | otlp, jaeger, zipkin, prometheus, kafka, filelog |
| Processors | Transform / filter / sample | batch, memory_limiter, tail_sampling, attributes |
| Exporters | Send to backends | otlp (also how Jaeger, Tempo, and Honeycomb ingest), zipkin, prometheus, datadog, debug |
These three form pipelines.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample_10
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DD_API_KEY}  # read from the environment, never hard-coded

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo, datadog]
```
You can fan the same trace to an OSS backend (Tempo) and a SaaS (Datadog) at once. That is why the Collector is the migration tool of choice — add a new exporter line, run side by side, validate, then cut the old one.
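The fan-out behaviour can be pictured with a small stand-alone model. This is a plain-Python sketch, not Collector code: `run_pipeline`, `drop_health_checks`, and the list-backed "exporters" are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]


def drop_health_checks(batch: List[Span]) -> List[Span]:
    """Processor: filter out spans nobody will ever query."""
    return [s for s in batch if s.get("http.route") != "/healthz"]


def run_pipeline(
    batch: List[Span],
    processors: List[Callable[[List[Span]], List[Span]]],
    exporters: List[Callable[[List[Span]], None]],
) -> List[Span]:
    """Apply processors in order, then hand the same result to every exporter."""
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(list(batch))  # each backend receives its own copy
    return batch


tempo_store: List[Span] = []    # stands in for the OSS backend
datadog_store: List[Span] = []  # stands in for the SaaS backend

incoming = [
    {"name": "POST /checkout", "http.route": "/checkout"},
    {"name": "GET /healthz", "http.route": "/healthz"},
]
run_pipeline(
    incoming,
    processors=[drop_health_checks],
    exporters=[tempo_store.extend, datadog_store.extend],
)
print(len(tempo_store), len(datadog_store))
```

Adding or removing a backend only changes the `exporters` list, which mirrors why a migration through the Collector does not touch application code.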
Two deployment modes:
- Agent: one per host / node / Pod sidecar. Small traffic, close-by collection.
- Gateway: a few per cluster / region. Buffering, tail sampling, central policy.
Large deployments typically run Agent → Gateway → Backend in two stages.
4. Propagation Standards — W3C Trace Context, B3, Baggage
Distributed tracing is "distributed" because the same trace_id crosses service boundaries. Propagation is usually via HTTP headers.
W3C Trace Context (2020 recommendation)
The two headers W3C standardised in 2020 are the de facto default in 2026.
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace-id (16 bytes, hex)         parent-span-id   trace-flags
             version

tracestate: vendor1=value1,vendor2=value2
```
- traceparent: every OTel SDK sends and receives this by default.
- tracestate: vendor-specific add-ons (e.g. sampling decision).
OTel SDKs inject these headers into HTTP, gRPC, messaging (Kafka, RabbitMQ headers), even some SQL comments automatically.
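For readers who want to see the header format in code, the following is a minimal standard-library sketch of parsing a traceparent value and building the header for an outgoing call. It only handles the version 00 layout shown above. `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical helper names, not OTel SDK functions; in practice the SDK propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        """Bit 0 of trace-flags carries the head-sampling decision."""
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace_id and flags, fresh span id."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{parent.trace_id}-{new_span_id}-{parent.flags:02x}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
print(incoming.sampled, outgoing)
```

The key point is that the trace_id and the sampled flag pass through unchanged while the span id is replaced at every hop.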
B3 Propagation (from Zipkin)
The multi-header format Zipkin created.
X-B3-TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
X-B3-SpanId: 00f067aa0ba902b7
X-B3-ParentSpanId: 05e3ac9a4f6e3b90
X-B3-Sampled: 1
OTel supports both W3C and B3. Where legacy systems mix in, configure to accept both.
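At a legacy boundary the translation between the two formats is mechanical. The sketch below converts B3 multi-headers into a traceparent value using only the standard library. `b3_to_traceparent` is a hypothetical helper (OTel SDKs ship real B3 propagators), and it assumes the header names arrive with exactly the casing shown above.

```python
import re


def _is_hex(value: str) -> bool:
    return re.fullmatch(r"[0-9a-f]+", value) is not None


def b3_to_traceparent(headers: dict):
    """Build a W3C traceparent value from B3 multi-headers, or None if invalid."""
    trace_id = headers.get("X-B3-TraceId", "").lower()
    span_id = headers.get("X-B3-SpanId", "").lower()
    # B3 allows 64-bit trace ids; W3C requires 128 bits, so left-pad with zeros.
    if len(trace_id) == 16:
        trace_id = trace_id.rjust(32, "0")
    if len(trace_id) != 32 or len(span_id) != 16:
        return None
    if not (_is_hex(trace_id) and _is_hex(span_id)):
        return None
    # X-B3-Flags: 1 is the debug flag, which implies the trace is sampled.
    sampled = headers.get("X-B3-Sampled") in ("1", "true") or headers.get("X-B3-Flags") == "1"
    flags = "01" if sampled else "00"
    # The caller's own span id becomes the parent id seen by the next hop.
    return f"00-{trace_id}-{span_id}-{flags}"


b3_headers = {
    "X-B3-TraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "X-B3-SpanId": "00f067aa0ba902b7",
    "X-B3-ParentSpanId": "05e3ac9a4f6e3b90",
    "X-B3-Sampled": "1",
}
print(b3_to_traceparent(b3_headers))
```

Note that X-B3-ParentSpanId has no counterpart in traceparent, which only carries the immediate caller's span id.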
Baggage — "Context carried along with the trace"
Baggage carries key-value data alongside trace_id during propagation. Examples: user.tier=gold, tenant=acme, experiment=ab_42.
baggage: user.tier=gold,tenant=acme
Use cases:
- Attach the same label to all downstream spans automatically.
- Change sampling decisions based on baggage values (e.g. "tenant=enterprise gets 100% sampling").
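- Downstream services use baggage for logging and routing hints. Treat it as untrusted input rather than a basis for authorisation, since any upstream hop can set or change it.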
Warning: baggage is not a secret. If the next hop is external, anyone can read the header. Never put PII in baggage.
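One way to enforce the warning is to filter baggage before it leaves the process. The following standard-library sketch parses a baggage header and re-encodes it while dropping keys on a denylist. `parse_baggage`, `encode_baggage`, and the `DENIED_KEYS` set are hypothetical names and values; a real deployment would more likely use the OTel baggage API plus its own policy list.

```python
from urllib.parse import quote, unquote

# Hypothetical denylist of keys that must never leave the service.
DENIED_KEYS = {"email", "phone", "user.email"}


def parse_baggage(header: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict, ignoring optional ';property' suffixes."""
    entries = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0].strip()
        if "=" not in key_value:
            continue  # skip empty or malformed members
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def encode_baggage(entries: dict, denied=DENIED_KEYS) -> str:
    """Serialise entries for an outgoing request, dropping denied keys."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}"
        for key, value in entries.items()
        if key.lower() not in denied
    )


incoming = parse_baggage("user.tier=gold,tenant=acme")
incoming["email"] = "someone@example.com"  # added by mistake somewhere upstream
print(encode_baggage(incoming))
```

A denylist is the simplest possible policy. An allowlist of known-safe keys is stricter and usually the better default for traffic that crosses a trust boundary.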
5. Jaeger — CNCF Graduated, Uber Original
Jaeger was created at Uber in 2015. CNCF Incubating in 2017, CNCF Graduated in 2019. It was the de facto standard before OTel.
Characteristics:
- Written in Go, modular backend.
- Storage: Cassandra, Elasticsearch/OpenSearch, ClickHouse, Badger (embedded).
- UI: trace view, dependency graph, comparison view.
- Ingest: native SDK (deprecated), Zipkin protocol, OTLP.
Since 2024, the Jaeger project has explicitly pivoted to an "OTel first" stance. The native Jaeger SDK is deprecated; new users are pointed at OTel SDKs and Jaeger becomes a "backend + UI". The v2 backend is built on top of the OTel Collector internally.
Deployment shapes:
- AllInOne: agent + collector + query + UI + memory storage in one container. Demo/dev.
- Production: collector + storage + query/UI split, agents per host.
- Kubernetes Operator: one CR with jaegertracing/jaeger-operator.
Pros: lightweight, OSS, easy install, intuitive UI.
Limits: traces only — no metrics/logs. High-cardinality analysis is weak. UI peaks at trace_id lookup + time-range filter.
6. Grafana Tempo — Parquet-Based, "Object Storage Is the Hot Path"
Grafana Tempo, born at Grafana Labs in 2020, is a trace backend with a precise philosophy — "a trace store without indexes."
Jaeger and Zipkin index every span (service, operation, tags). The indexes are larger and more expensive than the span bodies themselves. Tempo skips that — lookup by trace_id only, with the premise that you pull trace_id out of metrics (Prometheus) and logs (Loki) to navigate. "Click the exemplar and the trace pops up" as a UX pattern.
Storage goes directly to S3 / GCS / Azure Blob style object storage. No indexes means cheap storage. Since 2023 the on-disk format is unified on Parquet — columnar with good compression, and readable by external tools (Athena, DuckDB).
Tempo also has a query language, TraceQL — for finding patterns inside trace bodies.
```
{ resource.service.name = "checkout" && span.http.status_code = 500 && duration > 1s }
```
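To spell out what that query selects, here is the same predicate as a plain-Python filter over in-memory spans. This is only a sketch of the semantics, not how Tempo executes TraceQL; the span dictionaries and the `matches` helper are hypothetical. The detail worth noticing is that all three conditions must hold on the same span.

```python
def matches(span: dict) -> bool:
    """All three conditions must be true for one and the same span."""
    return (
        span["resource"].get("service.name") == "checkout"
        and span["attributes"].get("http.status_code") == 500
        and span["duration_s"] > 1.0
    )


def matching_trace_ids(spans: list) -> list:
    """A trace is returned if at least one of its spans matches."""
    return sorted({s["trace_id"] for s in spans if matches(s)})


spans = [
    # t1: slow checkout span that failed -> match
    {"trace_id": "t1", "resource": {"service.name": "checkout"},
     "attributes": {"http.status_code": 500}, "duration_s": 2.3},
    # t2: failed but fast -> no match
    {"trace_id": "t2", "resource": {"service.name": "checkout"},
     "attributes": {"http.status_code": 500}, "duration_s": 0.2},
    # t3: the slow checkout span succeeded; the 500 is on another service -> no match
    {"trace_id": "t3", "resource": {"service.name": "checkout"},
     "attributes": {"http.status_code": 200}, "duration_s": 4.0},
    {"trace_id": "t3", "resource": {"service.name": "payments"},
     "attributes": {"http.status_code": 500}, "duration_s": 3.0},
]
print(matching_trace_ids(spans))
```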
Pros:
- Cheapest storage of all — object storage directly.
- Naturally integrates with Loki, Mimir, Grafana.
- Parquet means external analytics tools can read it.
Limits:
- If you do not have the "metrics/logs link traces via exemplars" workflow, value drops.
- Generic search (by service name / tag) is weaker than Jaeger (TraceQL helps).
If scale and long-term storage dominate, Tempo wins.
7. Zipkin — Twitter Original, Still Alive
Zipkin was open-sourced by Twitter in 2012. It is the most influential OSS implementation of the Google Dapper paper and the starting point for every tracing system that followed.
Characteristics:
- Java-based (Brave instrumentation).
- Storage: in-memory (dev), Cassandra, Elasticsearch, MySQL.
- Simple — drops in as a single jar.
- The origin of B3 propagation.
Not many fresh adoptions in 2026, but still useful for:
- The backend for existing B3/Brave Java codebases.
- "Even Jaeger feels heavy" small systems.
- A conversion hub via the OTel Collector zipkin receiver/exporter.
Versus Jaeger, the UI is plainer but setup is simpler.
8. Honeycomb — Charity Majors' "Observability 2.0"
Honeycomb, founded in 2016 by Charity Majors and Christine Yen, is a SaaS observability company. Their phrase "observability 2.0" reshaped the vocabulary of the industry.
Their thesis: classical monitoring (separate metrics/logs/traces) only sees predefined dimensions. Real debugging needs wide events — high-cardinality attributes (user_id, request_id, k8s_pod) that you can slice and dice freely.
Honeycomb's data model: every span is essentially an "event". Attach attributes freely; queries jump between BubbleUp (automatic outlier breakdown), heatmap, and the trace view.
BubbleUp result example:
73% of slow (p95) requests have user.tier=enterprise AND
db.host=replica-3 AND
client_country=JP.
<- found automatically; no one had to specify the combination.
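The idea behind that kind of result can be sketched in a few lines: split events into outliers and baseline, then rank attribute values by how much more often they appear among the outliers. This is an illustrative approximation in plain Python, not Honeycomb's actual BubbleUp algorithm; `bubble_up` and the sample events are hypothetical.

```python
from collections import Counter


def bubble_up(events, is_outlier, ignore=(), top=3):
    """Rank (attribute, value) pairs by outlier share minus baseline share."""
    outliers = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    if not outliers or not baseline:
        return []

    def share(group):
        counts = Counter(
            (key, value) for e in group for key, value in e.items() if key not in ignore
        )
        return {pair: n / len(group) for pair, n in counts.items()}

    out_share, base_share = share(outliers), share(baseline)
    gaps = [(pair, out_share[pair] - base_share.get(pair, 0.0)) for pair in out_share]
    # Largest gap first; ties broken by attribute name for a stable order.
    gaps.sort(key=lambda item: (-item[1], item[0][0], str(item[0][1])))
    return gaps[:top]


events = (
    [{"user.tier": "free", "db.host": "replica-1", "duration_ms": 100}] * 8
    + [{"user.tier": "enterprise", "db.host": "replica-3", "duration_ms": 120}] * 2
    + [{"user.tier": "enterprise", "db.host": "replica-3", "duration_ms": 3000}] * 4
)
findings = bubble_up(events, lambda e: e["duration_ms"] > 1000, ignore={"duration_ms"})
for (key, value), gap in findings:
    print(f"{key}={value}: {gap:+.0%} vs baseline")
```

Nobody told the function to look at db.host or user.tier. It simply compares every attribute it sees, which is why wide events with many attributes make this style of debugging work.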
Differentiators:
- No indexing of every dimension — proprietary columnar store (Retriever).
- Tail-sampling proxy (Refinery), a widely used reference point for tail sampling.
- 100% OTel-compatible ingest.
- Pricing: events (span count) and retention.
Pros: debugging speed is different. "Why is this slow" narrows to one or two minutes without pre-built indexes.
Limits: SaaS only (no self-hosting). Overkill for simple monitoring without high-cardinality questions.
9. Lightstep, Now ServiceNow Cloud Observability
Lightstep was founded in 2014 by Ben Sigelman, one of the original Google Dapper authors. "Statistical analysis" and "merging metrics and traces" were early themes.
ServiceNow acquired Lightstep in 2021. The product was rebranded to ServiceNow Cloud Observability. As a standalone it is fading, but it survives bundled with ServiceNow ITSM/AIOps in enterprise accounts (especially ITIL/CMDB-bound shops).
Technical legacy:
- Played a key role in OTel's early standardisation — Sigelman sat on the OTel Governance Committee.
- Introduced statistical noise reduction over traces.
2026 assessment: worth a look if you live in the ServiceNow ecosystem. Otherwise the action is at Honeycomb, Datadog, and the OSS camp.
10. SigNoz — OSS Full-Stack Observability
SigNoz was started in 2020 out of India. OSS APM built on top of ClickHouse, carrying traces, metrics, logs together. Positioned as "the OSS alternative to Datadog."
Characteristics:
- 100% OTel-native — uses the OTel Collector almost like an SDK.
- One UI for traces, metrics, logs, exceptions.
- ClickHouse-based — fast queries, compressed storage.
- Series B funding in 2024 — shipped LogStream, Anomaly Detection, metrics v2.
docker-compose up brings up the full stack in one line, and OTel compatibility keeps migration relatively painless.
2024-2025 updates:
- LogStream: logs in the same UI for search/filter.
- Anomaly Detection: ML-based outlier detection (auto-baseline).
- Trace Funnels: per-step success rates for multi-stage flows like checkout.
Pros: looks polished even though OSS, and full-stack. Cloud SaaS is also available if self-hosting is a burden.
Limits: heavy dependency on ClickHouse — operating at scale needs expertise. Enterprise features (SSO, SOC2 reports) skew to the cloud plan.
11. Apache SkyWalking — APM + Tracing, China-Led
Apache SkyWalking entered Apache Incubating in 2017 and graduated in 2019. Contributors come heavily from Huawei, Alibaba, Tencent, and the user base is dense in East Asia.
Characteristics:
- Own agents (Java, .NET, Node) plus OTel compatibility.
- Not just traces — service topology, metrics, logs, events all in one.
- Backend: in-house OAP server plus Elasticsearch / BanyanDB (their own time-series DB).
- UI: topology visualisation is strong.
- Extensions for service mesh (Istio) visibility and event-driven self-healing (SkyWalking-Rover, eBPF).
2026 positioning: the main OSS choice across China, India, Southeast Asia. Less common in Korea / Japan, but you meet it naturally if your company runs a Chinese subsidiary.
12. Datadog APM, New Relic APM, Elastic APM, Sentry Performance
The commercial APM super-league's tracing lineup.
Datadog APM
- Largest market share. UI polish is unmatched.
- OTel-native in and out: dd-trace exists, but sending directly from the OTel SDK is also supported.
- Trace + APM metrics (throughput, error, latency) + profiling all linked.
- Pricing: per-host plus ingested-spans tier. Big-traffic shops can see scary bills.
- Connects to Watchdog (AIOps), Continuous Profiler, Database Monitoring.
New Relic APM
- The original first-generation APM (2008). Repositioned in the 2020s as the "Telemetry Data Platform".
- Pricing overhaul in 2020 — simplified to per-user plus per-GB ingested.
- OTel-compatible, OTel-only operation is feasible.
- New Relic AI for analytics, automatic log-trace linking.
Elastic APM
- APM sitting on Elasticsearch. Natural if you already run ELK.
- Both self-hosted and Elastic Cloud.
- Often cheaper than competitors while integrating search, dashboards, and ML.
Sentry Performance
- Sentry was originally an error monitoring company. Around 2020 it added Performance (distributed tracing).
- Strength: frontend coupling. Client SDKs for React, Vue, etc. produce traces that continue straight into the backend.
- Weaker as a backend-only APM, but full-stack (especially SPA frontend) is strong.
13. Dynatrace, AppDynamics (Splunk), Splunk Observability Cloud
The enterprise heavyweights.
Dynatrace
- Founded in Linz, Austria, in 2005. The giant of enterprise APM.
- Their own agent, OneAgent, builds automatic topology, metrics, and traces the moment it installs — minimal human work.
- AI engine Davis auto-proposes root-cause candidates.
- OTel input also supported.
- Pricing: enterprise. Fits high-ROI environments (large global operations).
AppDynamics (now under Splunk)
- Cisco acquired AppDynamics in 2017 and then Splunk in 2024, so AppDynamics now sits under the Splunk umbrella within Cisco.
- Java and .NET strong. Business-transaction-centric view.
- 2025-2026 roadmap signals gradual consolidation into Splunk Observability Cloud.
Splunk Observability Cloud
- Built from the 2019 SignalFx acquisition. Metrics, traces, logs plus RUM (real-user monitoring).
- After Cisco's Splunk acquisition (2024), Splunk is unifying the SignalFx and AppDynamics lineages under one SKU.
- Strong candidate for shops already invested in Cisco + Splunk.
14. eBPF Tracing — Pixie and Beyla, Zero-Code Auto-Instrumentation
eBPF lets you run safe code inside the Linux kernel. You can capture network packets, syscalls, HTTP requests at the kernel level — without changing a line of application code.
Pixie
- Started in 2019, acquired by New Relic in late 2020. Open-sourced as a CNCF Sandbox project (2021), Incubating in 2023.
- Installed as a DaemonSet on Kubernetes, captures HTTP/HTTPS/gRPC/DNS/MySQL/Postgres/Redis traffic automatically.
- Data stays in node memory (24h default), only sent out on demand — favourable for privacy and cost.
- Query language is PxL (Python-like).
Beyla — Grafana's eBPF Auto-Instrumenter
- Announced by Grafana Labs in 2023. GA in 2024.
- A single binary or sidecar; uses eBPF to capture HTTP/gRPC and emit OTel spans.
- Slots most smoothly into self-hosted OTel + Tempo stacks.
Where eBPF Tracing Fits
- "Fast visibility without code changes" — polyglot environments, legacy code, "where is it slow right now?"
- Trade-off: it cannot capture business context like user_id. eBPF sees network payloads — if the app does not include the ID, it stays invisible.
Conclusion: run eBPF and OTel SDK together. eBPF gives instant infra/network visibility; OTel SDK adds business spans and high-cardinality attributes.
15. Sampling — Head vs Tail vs Ratio
Trace cost is roughly spans ingested * unit price. Sampling decides how many spans are ingested, so it drives most of the cost curve (roughly eighty per cent, as a rule of thumb).
Three strategies:
Head-Based Sampling
- Decision made at the start of a trace. Propagated via traceparent's trace-flags.
- Decision has little information (you do not know yet what happens).
- Pros: simple, predictable cost, SDK-only.
- Typically a fixed 1-10% pass-through.
Ratio Sampling
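- Sub-type of head sampling. Example: sample_ratio=0.1 (10%).
- Available in every OTel SDK as the TraceIdRatioBased sampler. The SDK default is ParentBased(AlwaysOn), so the ratio has to be configured explicitly.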
- Consistency guarantee: decision is derived from a trace_id hash so the same trace receives the same decision across all services.
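The consistency guarantee is easy to see in code. The sketch below derives the decision from the low 64 bits of the trace_id, which is the approach several SDK implementations take, though the exact algorithm is implementation-defined. `should_sample` is a hypothetical stand-in for the SDK's TraceIdRatioBased sampler.

```python
MASK_64 = (1 << 64) - 1


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head decision: the same trace_id always gives the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))  # ids below this bound are kept
    low64 = int(trace_id_hex, 16) & MASK_64
    return low64 < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Two services evaluating the same trace independently reach the same decision.
gateway_decision = should_sample(trace_id, 0.1)
payments_decision = should_sample(trace_id, 0.1)
print(gateway_decision, payments_decision)
```

Because no random number is drawn at decision time, services do not have to coordinate. Each one can recompute the decision, although in practice the sampled flag in traceparent usually carries it downstream.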
Tail-Based Sampling
- Decision made after the trace finishes (or after a wait window).
- All spans first land in the Collector, which can use rich information (error status, total latency, specific attributes).
- Pros: keep 100% of errors and slow traces, sample 1% of healthy ones. Best cost-to-value.
- Cons: needs buffering (typically 30s to a few minutes), memory and CPU, and all spans of the same trace must reach the same Collector instance (use a loadbalancing exporter that hashes by trace_id).
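The routing requirement in the last bullet boils down to one rule: the target Collector must be a pure function of trace_id. Here is a simplified sketch using a stable hash and a modulo. The Collector's loadbalancing exporter uses consistent hashing instead, so that resizing the gateway pool moves fewer traces; the hostnames and `pick_collector` are hypothetical.

```python
import hashlib

# Hypothetical gateway Collector endpoints.
COLLECTORS = ["gateway-0:4317", "gateway-1:4317", "gateway-2:4317"]


def pick_collector(trace_id: str, collectors: list) -> str:
    """Map a trace_id to one collector, identically on every agent."""
    # hashlib is used because Python's built-in hash() is randomised per process.
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]


spans = [
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "name": name}
    for name in ("POST /checkout", "auth.verify", "charge_card")
]
targets = {pick_collector(s["trace_id"], COLLECTORS) for s in spans}
print(targets)
```

Every span of the trace lands on the same gateway no matter which agent forwarded it, so that gateway sees the whole trace before it decides.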
OTel Collector tail_sampling example:
```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: keep_errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep_slow
        type: latency
        latency: { threshold_ms: 1500 }
      - name: keep_enterprise
        type: string_attribute
        string_attribute: { key: tenant.tier, values: [enterprise] }
      - name: random_2pct
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }
```
This keeps: 100% errors + 100% over 1.5s + 100% enterprise tenants + 2% of the rest.
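The decision logic of those four policies can be modelled as a short function over a completed trace. This is a sketch of the semantics only, not the Collector's implementation: `decide` is a hypothetical name, the span dictionaries are simplified, and the 2% fallback here is derived from the trace_id so the example stays deterministic.

```python
MASK_64 = (1 << 64) - 1


def trace_duration_ms(spans: list) -> int:
    """Latency policies look at the whole trace: earliest start to latest end."""
    return max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)


def decide(spans: list, random_pct: float = 2.0):
    """Return the name of the first matching policy, or None to drop the trace."""
    if any(s.get("status") == "ERROR" for s in spans):
        return "keep_errors"
    if trace_duration_ms(spans) > 1500:
        return "keep_slow"
    if any(s.get("attributes", {}).get("tenant.tier") == "enterprise" for s in spans):
        return "keep_enterprise"
    # Fallback: keep roughly random_pct of everything else, keyed on trace_id.
    low64 = int(spans[0]["trace_id"], 16) & MASK_64
    if low64 < round(random_pct / 100 * (1 << 64)):
        return "random_2pct"
    return None


failed = [{"trace_id": "f" * 32, "start_ms": 0, "end_ms": 100, "status": "ERROR"}]
slow = [{"trace_id": "f" * 32, "start_ms": 0, "end_ms": 2000, "status": "OK"}]
healthy = [{"trace_id": "f" * 32, "start_ms": 0, "end_ms": 100, "status": "OK"}]
print(decide(failed), decide(slow), decide(healthy))
```

A trace is kept as soon as any one policy says yes, which is why the policies add up rather than narrow each other down.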
Which to Pick
| Situation | Recommendation |
|---|---|
| Small traffic (~hundreds rps) | Head 100% or ratio 50% |
| Typical SaaS (~thousands rps) | Head 10% + errors 100% (branched in SDK) |
| Large scale (tens of thousands+ rps) | Tail-based + errors / slow / specific tenants 100% |
| Cost spiralling | Tail-based immediately — usually 5-10x savings |
16. Cost — Tail-Based Sampling Proxies and the Bill
Two patterns that make the tracing bill scary.
- Bill grows linearly with traffic — 100% sampling at large scale.
- High-cardinality attribute explosion — every span tagged with user_id, request_id, k8s_pod, container_id.
Mitigations:
- Put tail sampling in a gateway Collector. Cut before the SaaS ingests. Use a loadbalancing exporter upstream so all spans of one trace reach the same Collector.
- Store in OSS, send only "hot" traces to SaaS. All traces go to Tempo, but errors and slow ones also go to Honeycomb / Datadog.
- Attribute allowlist. Use the attributes processor to drop attributes you do not query.
- Log-trace linking to slim trace bodies. Big long attributes live in logs, spans only hold a log_id reference.
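Before writing an allowlist it helps to measure which attributes actually explode. The following plain-Python sketch counts distinct values per attribute key and then strips everything not on the list. `attribute_cardinality`, `apply_allowlist`, and the sample attributes are hypothetical; in production the dropping itself would be done by the Collector's attributes processor.

```python
from collections import defaultdict


def attribute_cardinality(spans: list) -> dict:
    """Count distinct values seen per attribute key."""
    seen = defaultdict(set)
    for span in spans:
        for key, value in span.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def apply_allowlist(span: dict, allowed: set) -> dict:
    """Keep only the attributes that someone actually queries."""
    return {key: value for key, value in span.items() if key in allowed}


# 100 synthetic spans: one route, two tiers, a unique container id each.
spans = [
    {
        "http.route": "/checkout",
        "user.tier": "gold" if i % 2 else "free",
        "container_id": f"c-{i}",
    }
    for i in range(100)
]
cardinality = attribute_cardinality(spans)
print(cardinality)

ALLOWED = {"http.route", "user.tier"}  # hypothetical allowlist
slim_spans = [apply_allowlist(s, ALLOWED) for s in spans]
```

High cardinality is not bad in itself. The point is to pay for it only on attributes that answer real questions.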
Rules of thumb (author's conservative estimates):
- 100% naive -> 1% head sampling = 90% bill reduction, 30% debugging quality reduction.
- 1% head -> tail (100% errors + 1% rest) = same bill, 200% better debugging quality.
- Gateway Collector + allowlist = an extra 30% off the bill.
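A small worked calculation shows where rules of thumb like these come from. The traffic volume and error rate below are assumptions picked for illustration, not measurements, and `head_sampled` / `tail_sampled` are hypothetical helpers. How close the tail-sampling bill stays to the head-sampling bill depends on how rare errors are.

```python
def head_sampled(traces: int, ratio: float) -> int:
    """Traces kept when a fixed ratio is chosen up front."""
    return round(traces * ratio)


def tail_sampled(traces: int, error_rate: float, healthy_ratio: float) -> int:
    """Traces kept when all errors plus a ratio of healthy traces are retained."""
    errors = traces * error_rate
    healthy = traces - errors
    return round(errors + healthy * healthy_ratio)


TRACES_PER_DAY = 1_000_000  # assumed volume
ERROR_RATE = 0.002          # assumed: 0.2% of traces contain an error

head_kept = head_sampled(TRACES_PER_DAY, 0.01)
head_errors_kept = round(TRACES_PER_DAY * ERROR_RATE * 0.01)

tail_kept = tail_sampled(TRACES_PER_DAY, ERROR_RATE, 0.01)
tail_errors_kept = round(TRACES_PER_DAY * ERROR_RATE)

print(f"head 1%: {head_kept} traces kept, {head_errors_kept} error traces")
print(f"tail   : {tail_kept} traces kept, {tail_errors_kept} error traces")
```

Under these assumptions tail sampling stores about 20% more traces than 1% head sampling while retaining a hundred times as many error traces. With a higher error rate the stored volume grows accordingly, so it is worth plugging in your own numbers.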
17. Korea and Japan — Real Migrations
Korea
- Toss — From in-house trace infra (Kafka consumer plus Spring Sleuth) toward an OTel + Tempo + Grafana stack between 2023 and 2024. Tail-sampled payment / transfer traffic to tame cost. Toss SLASH conferences feature talks on the topic.
- Kakao — Standardised on OTel from an internal APM. KakaoTalk backend is Java-heavy, so OTel Java agent auto-instrumentation plus manual instrumentation on a few critical services. The if(kakao) conference often hosts tracing case studies.
- NAVER: Internal monitoring platforms nLog and Pinpoint are growing OTel-compatible interfaces. Pinpoint is NAVER's own OSS APM (strong Java bytecode instrumentation); it now ships an OTel exporter to align with the standard.
Japan
- Mercari — Migrated from Datadog to an OTel plus in-house backend combination, documented on the Mercari Engineering blog. Traffic cost was the prime driver; tail sampling via the OTel Collector reined in the bill.
- NTT Communications — Adopted OTel for their internal multi-cloud monitoring platform. Unified mixed Datadog / New Relic / SkyWalking environments through OTel — the equivalent in Korea would be the monitoring teams at SK Telecom or KT.
- CyberAgent / DeNA / LINE — Gaming, media, and messenger traffic profiles drive heavy demand for tail sampling and high-cardinality analysis; they mix Honeycomb / Datadog with OSS self-hosting (Tempo, Jaeger).
The common pattern: to avoid vendor lock-in, SDK is OTel, backends are swappable. SaaS pricing negotiation also leverages OTel compatibility.
18. Who Should Pick What
Small Teams / Side Projects
- OTel SDK + Jaeger (AllInOne) or SigNoz (self-hosted), or Honeycomb Free / Sentry Performance.
- Stands up in 30 minutes, cost USD 0 to low-tens.
Mid-Size (Dozens of People, Tens to Hundreds of Thousands of rps)
- OTel SDK + Collector + Tempo + Grafana (OSS full stack). Cost is infra only.
- Or Datadog / New Relic APM (SaaS, quick wins on alerting integration).
- The safe migration pattern is to run both in parallel via the OTel Collector, then prune.
Enterprise
- Dynatrace / Splunk Observability Cloud / Datadog — automatic topology, AIOps, company-wide alerting, audit logging.
- OTel compatibility is mandatory. Negotiate so the vendor accepts OTel-only ingest.
Debugging-First (High Cardinality, Ad Hoc Analysis)
- Honeycomb — observability 2.0 posture. Cost driven by event count and retention.
- SigNoz follows similar ground in OSS form.
Forced Self-Hosting (Regulation, Security)
- OTel + Tempo / Jaeger / SigNoz / SkyWalking. SigNoz is full-stack (traces + metrics + logs + events); Tempo + Loki + Mimir is the Grafana split-stack approach.
Kubernetes Polyglot, Hard to Change Code
- Pixie or Beyla — automatic visibility via eBPF. Add OTel SDK only for business spans.
19. Ten Anti-Patterns
- Skip OTel and hard-code vendor SDK directly — backend change rewrites your code.
- Instrumented but sampling still at 100% — bill explodes.
- "I want every error" without tail sampling — head 1% loses 99% of errors.
- Put user_id / request_id on every span without thought — straight cardinality bill.
- Forget to run W3C and B3 side by side — trace breaks at legacy boundaries.
- PII in baggage — leaks via headers at the next hop.
- Metrics, logs, traces stored in disconnected backends with no trace_id link — exemplar workflow broken.
- No Collector — SDK ships straight to SaaS — backend change means full app redeploy.
- Ignore trace_id routing for tail sampling — same trace splits across Collectors and decisions go wrong.
- Look only at traces, ignore metrics and logs — traces are "the story of one request", metrics are "the trend of the whole". You need both.
Epilogue — Build Freedom on Top of OTel
The 2026 lesson is one sentence.
Unify instrumentation on OTel and keep backends swappable.
OTel is more than a tracing standard — it is the interface that breaks vendor lock-in. On top of it, pick the backend that matches your taste: the OSS freedom of Jaeger / Tempo, the debugging depth of Honeycomb, the unified UX of Datadog, the AIOps of Dynatrace. The moment swap cost shrinks from rewriting code to editing a Collector line, you gain negotiating power, cost control, and better debugging at once.
Next post candidates: OpenTelemetry metrics deep dive — exemplars and trace-metric linking, Linking logs to traces — Loki / OpenSearch / OpenTelemetry Logs, Tail sampling in practice — load-balancing exporter topology and cost curves.
— Distributed Tracing and OpenTelemetry 2026 Deep Dive, fin.
References
- OpenTelemetry — Official site
- OpenTelemetry — CNCF graduation announcement (2024)
- OpenTelemetry Specification
- OpenTelemetry Collector
- OpenTelemetry Collector — GitHub (open-telemetry/opentelemetry-collector)
- Collector Contrib — tail sampling processor
- W3C Trace Context — Recommendation
- W3C Baggage — Specification
- B3 Propagation — openzipkin/b3-propagation
- Jaeger — Official site
- Jaeger — CNCF graduated project
- Jaeger v2 announcement
- Grafana Tempo
- Tempo — Parquet block format
- TraceQL — Tempo query language
- Zipkin — Official site
- Honeycomb — Observability 2.0
- Charity Majors — Observability 2.0 manifesto
- Refinery — Honeycomb tail sampling proxy
- Lightstep — now ServiceNow Cloud Observability
- SigNoz — Official site
- SigNoz — GitHub (SigNoz/signoz)
- Apache SkyWalking
- Datadog APM
- New Relic APM
- Elastic APM
- Sentry Performance Monitoring
- Dynatrace — distributed tracing
- Splunk Observability Cloud
- AppDynamics — Cisco/Splunk
- Pixie — eBPF observability for Kubernetes
- Pixie — CNCF Incubating
- Grafana Beyla — eBPF auto-instrumentation
- OpenTelemetry — Semantic Conventions
- OpenTelemetry Java agent
- Mercari Engineering — observability migration
- Toss SLASH conference — observability talks
- NAVER Pinpoint — open source APM
- Google Dapper paper (2010)