Distributed Tracing & OpenTelemetry 2026 — OTel / Jaeger / Tempo / Zipkin / Honeycomb / Lightstep / SigNoz / SkyWalking / Datadog APM Deep Dive

Prologue — "Why is checkout slow?"

A Friday evening in 2026. The payments-team Slack channel lights up.

"p99 checkout latency 4.2s. Normally 600ms. What got slow?"

Answering that within five minutes is why the industry spent the last decade building distributed tracing. One request fans through the gateway, into auth, into the payments core, out to an external payment gateway (PG), then back through the receipts queue, and every hop is captured in a single trace.

In 2020, the field was fragmented: OpenTracing, OpenCensus, Jaeger clients, Zipkin libraries, each APM vendor's agent. Even inside one company Java used Datadog, Go used Jaeger, Node used New Relic. In 2026 the answer is simpler — OpenTelemetry.

This post is a map of distributed tracing in 2026: the OTel spec and Collector, propagation standards, OSS backends (Jaeger/Tempo/Zipkin), observability 2.0 (Honeycomb, SigNoz), acquired giants (Lightstep), the APM super-league (Datadog, New Relic, Dynatrace, Splunk), eBPF auto-instrumentation (Pixie, Beyla), plus sampling and cost. It also covers how Korean and Japanese companies actually migrated.

1. The 2026 Distributed-Tracing Map — Four Camps

The ecosystem clusters into four camps of tools, all sitting on one shared instrumentation standard (the first row below).

Bucket	Examples	Character
Standard / instrumentation	OpenTelemetry	Shared spec, SDKs, Collector. Almost everyone flows through this
OSS self-hosted backends	Jaeger, Tempo, Zipkin, SigNoz, SkyWalking	Self-operated. No license fees
APM SaaS	Datadog, New Relic, Dynatrace, Splunk, AppDynamics, Sentry, Elastic	Managed. Unified UI, alerting, AIOps
observability 2.0	Honeycomb, Lightstep (ServiceNow)	Wide events, high-cardinality analytics
eBPF auto-instrumentation	Pixie, Beyla, Coroot	No code changes — kernel / network level

Three axes in one picture.

              OSS                               SaaS
        +--------------+                +------------------+
Instr.  | OTel SDK     | --- shared --- | OTel SDK         |
        | (vendor-     |                | (vendor adapter) |
        |  neutral)    |                |                  |
        +------+-------+                +--------+---------+
               |                                 |
        +------v---------------------------------v---------+
        |             OpenTelemetry Collector              |
        |      receivers --> processors --> exporters      |
        +------+---------------------------------+---------+
               |                                 |
       +-------v--------+               +--------v---------+
       | Jaeger / Tempo |               |  Datadog APM     |
       | Zipkin / SigNoz|               |  New Relic APM   |
       | SkyWalking     |               |  Honeycomb etc.  |
       +----------------+               +------------------+

The lesson: unify instrumentation on OTel, keep backends swappable. That is the 2026 default posture.

2. OpenTelemetry — CNCF Graduated (2024)

OpenTelemetry (OTel) was formed in 2019 from the merger of OpenTracing (2016) and OpenCensus (2018). CNCF Incubating in 2021, CNCF Graduated in November 2024. It is now the de facto industry standard — the second most active CNCF project after Kubernetes.

What OTel provides:

Specification: what is a trace, a span, what attributes exist.
APIs: instrumentation APIs per language (Java, Go, Python, .NET, Node.js, Ruby, PHP, Rust, C++, Swift...).
SDKs: reference implementations of the API — sampling, batching, exporters.
Collector: a separate process running the telemetry pipeline.
Semantic Conventions: standard attribute names for HTTP, DB, messaging, etc.
OTLP (OpenTelemetry Protocol): gRPC/HTTP wire format.

Core data model: a trace is a tree of spans. Each span has trace_id, span_id, parent_span_id, start/end, attributes, events, links.

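from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# service.name is what backends group spans by; without it the SDK reports "unknown_service".
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
# insecure=True sends plaintext gRPC; drop it if the Collector terminates TLS.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")


def charge_card(pg, user_id, amount):
    # pg stands in for a payment-gateway client; charge() and .code are illustrative.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("amount", amount)
        result = pg.charge(user_id, amount)
        span.set_attribute("pg.response_code", result.code)
        return result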
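The "tree of spans" data model described above can be sketched without the SDK. This is a simplified illustration, not the OTel data model itself: the Span class keeps only a few of the fields named in the text, and the IDs, names, and timings are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified span: a subset of the fields named in the text."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Return (root, children), where children maps span_id -> child spans."""
    children = {s.span_id: [] for s in spans}
    root = None
    for s in spans:
        if s.parent_span_id is None:
            root = s
        else:
            # setdefault tolerates spans whose parent never arrived.
            children.setdefault(s.parent_span_id, []).append(s)
    return root, children


def render(span, children, depth=0):
    """Return the tree as indented lines, one per span, with durations."""
    lines = [f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms}ms)"]
    for child in sorted(children[span.span_id], key=lambda c: c.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical checkout trace; every value here is made up.
spans = [
    Span("t1", "a", None, "gateway", 0, 4200),
    Span("t1", "b", "a", "auth", 10, 60),
    Span("t1", "c", "a", "payments-core", 70, 4150),
    Span("t1", "d", "c", "external-pg", 100, 4100),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

Reading the rendered tree top-down is the manual version of what a trace view does: the slow hop is the deepest span that still accounts for most of its parent's duration.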

OTel's real value is that metrics, logs, and traces share one SDK. Traces and metrics are stable, and logs reached stable across most languages by 2024-2025. OTel is no longer just a tracing standard; it is a unified telemetry standard.

3. OTel Collector — Receivers, Processors, Exporters

OTel SDK lives inside your application; the Collector runs as a separate process. Reasons for that separation:

Apps do not pull in vendor backend SDKs — dependency hygiene.
Sampling, filtering, routing applied centrally.
Backend swap does not require app restarts.
Surge buffering / retry.

Three component types:

Component	Role	Examples
Receivers	Ingest telemetry from outside	otlp, jaeger, zipkin, prometheus, kafka, filelog
Processors	Transform / filter / sample	batch, memory_limiter, tail_sampling, attributes
Exporters	Send to backends	otlp (also the path into Jaeger and Honeycomb), otlphttp, prometheus, datadog, debug

These three form pipelines.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample_10
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  datadog:
    # The Collector expands ${env:VAR} at startup; DD_API_KEY must be set in its environment.
    api: { site: datadoghq.com, key: "${env:DD_API_KEY}" }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo, datadog]

You can fan the same trace to an OSS backend (Tempo) and a SaaS (Datadog) at once. That is why the Collector is the migration tool of choice — add a new exporter line, run side by side, validate, then cut the old one.

Two deployment modes:

Agent: one per host / node / Pod sidecar. Small traffic, close-by collection.
Gateway: a few per cluster / region. Buffering, tail sampling, central policy.

Large deployments typically run Agent → Gateway → Backend in two stages.

4. Propagation Standards — W3C Trace Context, B3, Baggage

Distributed tracing is "distributed" because the same trace_id crosses service boundaries. Propagation is usually via HTTP headers.

W3C Trace Context (2020 recommendation)

The two headers W3C standardised in 2020 are the de facto default in 2026.

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace-id (16 bytes, 32 hex)      parent-id        trace-flags
             version

tracestate: vendor1=value1,vendor2=value2

traceparent: every OTel SDK sends and receives this by default.
tracestate: vendor-specific add-ons (e.g. sampling decision).

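OTel instrumentation libraries inject these headers automatically into HTTP and gRPC calls and into message headers (Kafka, RabbitMQ). Some database instrumentations can also append the context as a SQL comment, usually as an opt-in.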
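To make the header layout concrete, here is a minimal sketch of parsing a traceparent value and building the one to send downstream. It uses only the standard library and accepts only the four-field layout shown above; the function and class names are illustrative, not part of any OTel API.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, lowercase hex only.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    m = _TRACEPARENT.fullmatch(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Header for the next hop: same trace-id, this service's span as parent-id."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

The trace-id never changes across hops; only the parent-id is replaced by each service's own span ID, which is what lets the backend reassemble the tree.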

B3 Propagation (from Zipkin)

The multi-header format Zipkin created.

X-B3-TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
X-B3-SpanId: 00f067aa0ba902b7
X-B3-ParentSpanId: 05e3ac9a4f6e3b90
X-B3-Sampled: 1

OTel supports both W3C and B3. Where legacy systems mix in, configure to accept both.
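At a legacy boundary the two formats carry the same information, so translating one into the other is mostly string handling. The sketch below converts B3 multi-header context into a traceparent value. It is a standard-library illustration rather than the OTel propagator code, and it simplifies one point: a missing X-B3-Sampled header (which B3 treats as "decision deferred") is mapped to "not sampled".

```python
from typing import Mapping, Optional

_HEX = set("0123456789abcdef")


def _is_hex(value: str, length: int) -> bool:
    return len(value) == length and all(c in _HEX for c in value)


def b3_to_traceparent(headers: Mapping[str, str]) -> Optional[str]:
    """Translate B3 multi-header context into a W3C traceparent value.

    Returns None when the B3 trace ID or span ID is missing or malformed.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    trace_id = h.get("x-b3-traceid", "")
    span_id = h.get("x-b3-spanid", "")

    # B3 allows 64-bit (16 hex) trace IDs; W3C needs 128-bit, so left-pad.
    if len(trace_id) == 16:
        trace_id = "0" * 16 + trace_id
    if not _is_hex(trace_id, 32) or not _is_hex(span_id, 16):
        return None

    # X-B3-Flags: 1 (debug) implies sampled.
    sampled = h.get("x-b3-sampled") in ("1", "true") or h.get("x-b3-flags") == "1"
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```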

Baggage — "Context carried along with the trace"

Baggage carries key-value data alongside trace_id during propagation. Examples: user.tier=gold, tenant=acme, experiment=ab_42.

baggage: user.tier=gold,tenant=acme

Use cases:

Attach the same label to all downstream spans automatically.
Change sampling decisions based on baggage values (e.g. "tenant=enterprise gets 100% sampling").
Downstream services use baggage for authorisation, logging, routing.

Warning: baggage is not a secret. If the next hop is external, anyone can read the header. Never put PII in baggage.
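One way to act on that warning is to filter baggage at the edge before a request leaves your network. The sketch below parses a baggage header and strips denylisted keys for external hops. The PII_KEYS set and the function names are hypothetical; a real deployment would define its own policy and would normally do this in a propagator or proxy.

```python
from urllib.parse import quote, unquote

# Hypothetical denylist; these keys are examples, not a standard.
PII_KEYS = {"user.email", "user.phone", "user.name"}


def parse_baggage(header: str) -> dict:
    """Parse 'k1=v1,k2=v2;prop' into a dict, ignoring per-entry properties."""
    entries = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0]  # drop optional ';property' metadata
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def outbound_baggage(entries: dict, external: bool) -> str:
    """Serialise baggage, stripping denylisted keys when the next hop is external."""
    kept = {k: v for k, v in entries.items() if not (external and k in PII_KEYS)}
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in kept.items())
```

A denylist only catches the keys you thought of, so an allowlist of keys permitted to cross the boundary is the stricter variant of the same idea.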

5. Jaeger — CNCF Graduated, Uber Original

Jaeger was created at Uber in 2015. CNCF Incubating in 2017, CNCF Graduated in 2019. It was the de facto standard before OTel.

Characteristics:

Written in Go, modular backend.
Storage: Cassandra, Elasticsearch/OpenSearch, ClickHouse, Badger (embedded).
UI: trace view, dependency graph, comparison view.
Ingest: native SDK (deprecated), Zipkin protocol, OTLP.

Since 2024, the Jaeger project has explicitly pivoted to an "OTel first" stance. The native Jaeger SDK is deprecated; new users are pointed at OTel SDKs and Jaeger becomes a "backend + UI". The v2 backend is built on top of the OTel Collector internally.

Deployment shapes:

AllInOne: agent + collector + query + UI + memory storage in one container. Demo/dev.
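Production: collector + storage + query/UI split. The per-host jaeger-agent is deprecated; apps send OTLP straight to the collector or through an OTel Collector agent.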
Kubernetes Operator: one CR with jaegertracing/jaeger-operator.

Pros: lightweight, OSS, easy install, intuitive UI.

Limits: traces only, with no metrics or logs. High-cardinality analysis is weak. The UI tops out at trace_id lookup plus service / operation / tag and time-range filters.

6. Grafana Tempo — Parquet-Based, "Object Storage Is the Hot Path"

Grafana Tempo, born at Grafana Labs in 2020, is a trace backend with a precise philosophy — "a trace store without indexes."

Jaeger and Zipkin index every span (service, operation, tags). The indexes are larger and more expensive than the span bodies themselves. Tempo skips that — lookup by trace_id only, with the premise that you pull trace_id out of metrics (Prometheus) and logs (Loki) to navigate. "Click the exemplar and the trace pops up" as a UX pattern.

Storage goes directly to S3 / GCS / Azure Blob style object storage. No indexes means cheap storage. Since 2023 the on-disk format is unified on Parquet — columnar with good compression, and readable by external tools (Athena, DuckDB).

Tempo also has a query language, TraceQL — for finding patterns inside trace bodies.

{ resource.service.name = "checkout"
  && span.http.status_code = 500
  && duration > 1s
}

Pros:

Cheapest storage of all — object storage directly.
Naturally integrates with Loki, Mimir, Grafana.
Parquet means external analytics tools can read it.

Limits:

If you do not have the "metrics/logs link traces via exemplars" workflow, value drops.
Generic search (by service name / tag) is weaker than Jaeger (TraceQL helps).

If scale and long-term storage dominate, Tempo wins.

7. Zipkin — Twitter Original, Still Alive

Zipkin was open-sourced by Twitter in 2012. It is the most influential OSS implementation of the Google Dapper paper and the starting point for every tracing system that followed.

Characteristics:

Java-based (Brave instrumentation).
Storage: in-memory (dev), Cassandra, Elasticsearch, MySQL.
Simple — drops in as a single jar.
The origin of B3 propagation.

Not many fresh adoptions in 2026, but still useful for:

The backend for existing B3/Brave Java codebases.
"Even Jaeger feels heavy" small systems.
A conversion hub via the OTel Collector zipkin receiver/exporter.

Versus Jaeger, the UI is plainer but setup is simpler.

8. Honeycomb — Charity Majors' "Observability 2.0"

Honeycomb, founded in 2016 by Charity Majors and Christine Yen, is a SaaS observability company. Their phrase "observability 2.0" reshaped the vocabulary of the industry.

Their thesis: classical monitoring (separate metrics/logs/traces) only sees predefined dimensions. Real debugging needs wide events — high-cardinality attributes (user_id, request_id, k8s_pod) that you can slice and dice freely.

Honeycomb's data model: every span is essentially an "event". Attach attributes freely; queries jump between BubbleUp (automatic outlier breakdown), heatmap, and the trace view.

BubbleUp result example:
  73% of slow (p95) requests have user.tier=enterprise AND
  db.host=replica-3 AND
  client_country=JP.
  <- found automatically; no one had to specify the combination.
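The idea behind that kind of breakdown can be illustrated with a toy comparison of attribute frequencies between outliers and the baseline. This is not Honeycomb's algorithm: it is a naive sketch that ranks single attribute=value pairs only (no combinations), and the event data is invented.

```python
from collections import Counter


def attribute_skew(events, is_outlier, ignore=("duration_ms",), top=3):
    """Rank attribute=value pairs by how much more often they appear in outliers.

    events: list of dicts, one wide event per span.
    Returns [(key, value, outlier_share, baseline_share), ...], largest gap first.
    """
    outliers = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    if not outliers or not baseline:
        return []

    def shares(group):
        counts = Counter((k, v) for e in group for k, v in e.items() if k not in ignore)
        return {kv: n / len(group) for kv, n in counts.items()}

    out, base = shares(outliers), shares(baseline)
    ranked = sorted(out, key=lambda kv: out[kv] - base.get(kv, 0.0), reverse=True)
    return [(k, v, out[(k, v)], base.get((k, v), 0.0)) for k, v in ranked[:top]]


# Hypothetical wide events: three slow requests, three fast ones.
events = [
    {"duration_ms": 4000, "db.host": "replica-3", "user.tier": "enterprise"},
    {"duration_ms": 3800, "db.host": "replica-3", "user.tier": "enterprise"},
    {"duration_ms": 3900, "db.host": "replica-3", "user.tier": "free"},
    {"duration_ms": 120, "db.host": "replica-1", "user.tier": "enterprise"},
    {"duration_ms": 110, "db.host": "replica-2", "user.tier": "free"},
    {"duration_ms": 130, "db.host": "replica-1", "user.tier": "free"},
]
suspects = attribute_skew(events, lambda e: e["duration_ms"] > 1000)
```

In this made-up data every slow request hits replica-3 and no fast one does, so that pair ranks first. The point is that nobody had to name db.host in advance; any attribute on the event is a candidate.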

Differentiators:

No indexing of every dimension — proprietary columnar store (Retriever).
Sampling proxy (Refinery): an open-source tail sampler that sits in front of ingest and is widely used as a reference point for tail sampling.
100% OTel-compatible ingest.
Pricing: events (span count) and retention.

Pros: debugging speed is different. "Why is this slow" narrows to one or two minutes without pre-built indexes.

Limits: SaaS only (no self-hosting). Overkill for simple monitoring without high-cardinality questions.

9. Lightstep, Now ServiceNow Cloud Observability

Lightstep was founded in 2014 by Ben Sigelman, one of the original Google Dapper authors. "Statistical analysis" and "merging metrics and traces" were early themes.

ServiceNow acquired Lightstep in 2021. The product was rebranded to ServiceNow Cloud Observability. As a standalone it is fading, but it survives bundled with ServiceNow ITSM/AIOps in enterprise accounts (especially ITIL/CMDB-bound shops).

Technical legacy:

Played a key role in OTel's early standardisation — Sigelman sat on the OTel Governance Committee.
Introduced statistical noise reduction over traces.

2026 assessment: worth a look if you live in the ServiceNow ecosystem. Otherwise the action is at Honeycomb, Datadog, and the OSS camp.

10. SigNoz — OSS Full-Stack Observability

SigNoz was started in 2020 out of India. OSS APM built on top of ClickHouse, carrying traces, metrics, logs together. Positioned as "the OSS alternative to Datadog."

Characteristics:

100% OTel-native: ingestion runs through the OTel Collector, with no proprietary agent to install.
One UI for traces, metrics, logs, exceptions.
ClickHouse-based — fast queries, compressed storage.
Series B funding in 2024 — shipped LogStream, Anomaly Detection, metrics v2.

docker-compose up brings up the full stack in one line, and OTel compatibility keeps migration relatively painless.

2024-2025 updates:

LogStream: logs in the same UI for search/filter.
Anomaly Detection: ML-based outlier detection (auto-baseline).
Trace Funnels: per-step success rates for multi-stage flows like checkout.

Pros: looks polished even though OSS, and full-stack. Cloud SaaS is also available if self-hosting is a burden.

Limits: heavy dependency on ClickHouse — operating at scale needs expertise. Enterprise features (SSO, SOC2 reports) skew to the cloud plan.

11. Apache SkyWalking — APM + Tracing, China-Led

Apache SkyWalking entered Apache Incubating in 2017 and graduated in 2019. Contributors come heavily from Huawei, Alibaba, Tencent, and the user base is dense in East Asia.

Characteristics:

Own agents (Java, .NET, Node) plus OTel compatibility.
Not just traces — service topology, metrics, logs, events all in one.
Backend: in-house OAP server plus Elasticsearch / BanyanDB (their own time-series DB).
UI: topology visualisation is strong.
Extensions for service mesh (Istio) visibility and event-driven self-healing (SkyWalking-Rover, eBPF).

2026 positioning: the main OSS choice across China, India, Southeast Asia. Less common in Korea / Japan, but you meet it naturally if your company runs a Chinese subsidiary.

12. Datadog APM, New Relic APM, Elastic APM, Sentry Performance

The commercial APM super-league's tracing lineup.

Datadog APM

Largest market share. UI polish is unmatched.
OTel-native in and out — dd-trace exists, but OTel SDK direct also supported.
Trace + APM metrics (throughput, error, latency) + profiling all linked.
Pricing: per-host plus ingested-spans tier. Big-traffic shops can see scary bills.
Connects to Watchdog (AIOps), Continuous Profiler, Database Monitoring.

New Relic APM

The original first-generation APM (2008). Repositioned in the 2020s as the "Telemetry Data Platform".
Pricing overhaul in 2020 — simplified to per-user plus per-GB ingested.
OTel-compatible, OTel-only operation is feasible.
New Relic AI for analytics, automatic log-trace linking.

Elastic APM

APM sitting on Elasticsearch. Natural if you already run ELK.
Both self-hosted and Elastic Cloud.
Often cheaper than competitors while integrating search, dashboards, and ML.

Sentry Performance

Sentry was originally an error monitoring company. Around 2020 it added Performance (distributed tracing).
Strength: frontend coupling. Client SDKs for React, Vue, etc. produce traces that continue straight into the backend.
Weaker as a backend-only APM, but full-stack (especially SPA frontend) is strong.

13. Dynatrace, AppDynamics (Splunk), Splunk Observability Cloud

The enterprise heavyweights.

Dynatrace

Austrian company, founded in 2005. The giant of enterprise APM.
Their own agent, OneAgent, builds automatic topology, metrics, and traces the moment it installs — minimal human work.
AI engine Davis auto-proposes root-cause candidates.
OTel input also supported.
Pricing: enterprise. Fits high-ROI environments (large global operations).

AppDynamics (now under Splunk)

Cisco acquired AppDynamics in 2017. Cisco then acquired Splunk in 2024, so AppDynamics now sits under the Splunk umbrella.
Java and .NET strong. Business-transaction-centric view.
2025-2026 roadmap signals gradual consolidation into Splunk Observability Cloud.

Splunk Observability Cloud

Built from the 2019 SignalFx acquisition. Metrics, traces, logs plus RUM (real-user monitoring).
After Cisco's Splunk acquisition (2024), Splunk is unifying the SignalFx and AppDynamics lineages under one SKU.
Strong candidate for shops already invested in Cisco + Splunk.

14. eBPF Tracing — Pixie and Beyla, Zero-Code Auto-Instrumentation

eBPF lets you run safe code inside the Linux kernel. You can capture network packets, syscalls, HTTP requests at the kernel level — without changing a line of application code.

Pixie

Started in 2019, acquired by New Relic in December 2020. Open-sourced as a CNCF Sandbox project (2021), Incubating in 2023.
Installed as a DaemonSet on Kubernetes, captures HTTP/HTTPS/gRPC/DNS/MySQL/Postgres/Redis traffic automatically.
Data stays in node memory (24h default), only sent out on demand — favourable for privacy and cost.
Query language is PxL (Python-like).

Beyla — Grafana's eBPF Auto-Instrumenter

Announced by Grafana Labs in 2023, with the 1.0 GA release following in November 2023.
A single binary or sidecar; uses eBPF to capture HTTP/gRPC and emit OTel spans.
Slots most smoothly into self-hosted OTel + Tempo stacks.

Where eBPF Tracing Fits

"Fast visibility without code changes" — polyglot environments, legacy code, "where is it slow right now?"
Trade-off: it cannot capture business context like user_id. eBPF sees network payloads — if the app does not include the ID, it stays invisible.

Conclusion: run eBPF and OTel SDK together. eBPF gives instant infra/network visibility; OTel SDK adds business spans and high-cardinality attributes.

15. Sampling — Head vs Tail vs Ratio

Trace cost = spans ingested * unit price. By the author's rough estimate, about 80% of the cost curve is driven by sampling strategy.

Three strategies:

Head-Based Sampling

Decision made at the start of a trace. Propagated via traceparent's trace-flags.
Decision has little information (you do not know yet what happens).
Pros: simple, predictable cost, SDK-only.
Typically a fixed 1-10% pass-through.

Ratio Sampling

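Sub-type of head sampling. Example: a ratio of 0.1 keeps 10% of traces.
Implemented in OTel SDKs as TraceIdRatioBased, usually wrapped in ParentBased so child services honour the upstream decision. The SDK default sampler is ParentBased(AlwaysOn), i.e. 100%; setting OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 switches to 10%.
Consistency guarantee: the decision is a deterministic function of trace_id, so the same trace receives the same decision across all services that use the same ratio.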
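The consistency guarantee is easy to see in a few lines. The sketch below makes a deterministic keep/drop decision from the trace_id alone, in the spirit of a trace-ID ratio sampler; it is an illustration under that assumption, not the SDK's code, and should_sample is a made-up name.

```python
TRACE_ID_LIMIT = 1 << 64  # only the low 64 bits of the 128-bit trace_id are compared


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head decision: the same trace_id and ratio always agree."""
    bound = round(ratio * TRACE_ID_LIMIT)
    low_64_bits = int(trace_id_hex, 16) & (TRACE_ID_LIMIT - 1)
    return low_64_bits < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
decision_in_gateway = should_sample(trace_id, 0.1)
decision_in_payments = should_sample(trace_id, 0.1)  # no coordination needed
```

Because no random number is drawn, two services that never talk to each other reach the same verdict. The guarantee only holds when they are configured with the same ratio, which is one reason to wrap the sampler in a parent-based one and let the root service decide.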

Tail-Based Sampling

Decision made after the trace finishes (or after a wait window).
All spans first land in the Collector, which can use rich information (error status, total latency, specific attributes).
Pros: keep 100% of errors and slow traces, sample 1% of healthy ones. Best cost-to-value.
Cons: needs buffering (typically 30s to a few minutes), memory and CPU, and all spans of the same trace must reach the same Collector instance (use a loadbalancing exporter that hashes by trace_id).
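The "same trace, same Collector" requirement boils down to routing on a hash of the trace_id. The sketch below shows the simplest possible version; the gateway addresses are hypothetical, and plain modulo hashing is an assumption made for brevity. Production load balancers generally use consistent hashing so that adding or removing a gateway does not remap most traces.

```python
import hashlib


def pick_collector(trace_id_hex: str, collectors: list) -> str:
    """Map a trace_id to one backend so every span of that trace lands together."""
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Hypothetical gateway Collector addresses.
gateways = ["gw-0:4317", "gw-1:4317", "gw-2:4317"]
target = pick_collector("4bf92f3577b34da6a3ce929d0e0e4736", gateways)
```

Every agent that runs this function with the same gateway list picks the same target for a given trace, so the tail sampler there sees the whole trace before deciding.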

OTel Collector tail_sampling example:

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: keep_errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep_slow
        type: latency
        latency: { threshold_ms: 1500 }
      - name: keep_enterprise
        type: string_attribute
        string_attribute: { key: tenant.tier, values: [enterprise] }
      - name: random_2pct
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }

This keeps: 100% errors + 100% over 1.5s + 100% enterprise tenants + 2% of the rest.
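The policy list above is an OR: a trace is kept as soon as any one policy matches. A minimal sketch of that evaluation, assuming spans are plain dicts with the fields shown (this is not the Collector's implementation, and the exact boundary behaviour of the latency check is an assumption):

```python
import random


def keep_trace(spans, rng=random.random):
    """Return the name of the first matching policy, or None to drop the trace.

    spans: list of dicts with 'status', 'start_ms', 'end_ms', 'attributes'.
    """
    if any(s["status"] == "ERROR" for s in spans):
        return "keep_errors"
    # Trace duration: earliest span start to latest span end.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > 1500:
        return "keep_slow"
    if any(s["attributes"].get("tenant.tier") == "enterprise" for s in spans):
        return "keep_enterprise"
    if rng() < 0.02:
        return "random_2pct"
    return None
```

The function needs every span of the trace at once, which is exactly why tail sampling requires a wait window and trace-aware routing.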

Which to Pick

Situation	Recommendation
Small traffic (~hundreds rps)	Head 100% or ratio 50%
Typical SaaS (~thousands rps)	Head 10%; keeping 100% of errors needs a tail-sampling Collector, because the head decision is made before the error happens
Large scale (tens of thousands+ rps)	Tail-based + errors / slow / specific tenants 100%
Cost spiralling	Tail-based immediately — usually 5-10x savings

16. Cost — Tail-Based Sampling Proxies and the Bill

Two patterns that make the tracing bill scary.

Bill grows linearly with traffic — 100% sampling at large scale.
High-cardinality attribute explosion — every span tagged with user_id, request_id, k8s_pod, container_id.

Mitigations:

Put tail sampling in a gateway Collector. Cut before the SaaS ingests. Use a loadbalancing exporter upstream so all spans of one trace reach the same Collector.
Store in OSS, send only "hot" traces to SaaS. All traces go to Tempo, but errors and slow ones also go to Honeycomb / Datadog.
Attribute allowlist. Use the attributes processor to drop attributes you do not query.
Log-trace linking to slim trace bodies. Big long attributes live in logs, spans only hold a log_id reference.
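The allowlist and the "big values live in logs" mitigations can be sketched as one filter applied before export. The ALLOWED set, the size cap, and the function name are all hypothetical; in a real pipeline this logic would normally live in a Collector processor rather than in application code.

```python
# Hypothetical allowlist: the attributes this team actually queries on.
ALLOWED = {"http.request.method", "http.response.status_code", "tenant.tier", "log_id"}


def slim_attributes(attributes: dict, allowed=ALLOWED, max_value_len=256) -> dict:
    """Keep allowlisted attributes and drop oversized string values."""
    kept = {}
    for key, value in attributes.items():
        if key not in allowed:
            continue  # never queried, so not worth paying to store
        if isinstance(value, str) and len(value) > max_value_len:
            continue  # large payloads belong in logs, referenced by log_id
        kept[key] = value
    return kept
```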

Rules of thumb (author's conservative estimates):

100% naive -> 1% head sampling = about 90% bill reduction (span volume drops 99%, but fixed per-host fees do not), 30% debugging quality reduction.
1% head -> tail (100% errors + 1% rest) = about the same bill as long as errors are rare, 200% better debugging quality.
Gateway Collector + allowlist = an extra 30% off the bill.
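The volume side of these rules of thumb is simple arithmetic. The sketch below works it through for a hypothetical service; the traffic figures and the 0.5% always-keep share are assumptions chosen for illustration, not measurements, and pricing is left out because it varies by vendor.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def monthly_spans(rps: float, spans_per_trace: float, kept_fraction: float) -> float:
    """Spans ingested per month after sampling."""
    return rps * spans_per_trace * SECONDS_PER_MONTH * kept_fraction


def tail_kept_fraction(always_keep: float, baseline: float) -> float:
    """Share of traces kept by tail sampling.

    always_keep: share of traces matching error / slow policies (kept at 100%).
    baseline: probabilistic ratio applied to everything else.
    """
    return always_keep + (1.0 - always_keep) * baseline


# Hypothetical service: 1,000 rps, 10 spans per trace, 0.5% errors or slow.
full = monthly_spans(1000, 10, 1.0)
head_1pct = monthly_spans(1000, 10, 0.01)
tail = monthly_spans(1000, 10, tail_kept_fraction(0.005, 0.01))
```

Under these assumptions tail sampling keeps about 1.5% of traces, so it costs somewhat more than a flat 1% head sample, but it retains every error trace, whereas the head sample retains only about 1 in 100 of them. The two bills converge as the always-keep share approaches zero.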

17. Korea and Japan — Real Migrations

Korea

Toss — From in-house trace infra (Kafka consumer plus Spring Sleuth) toward an OTel + Tempo + Grafana stack between 2023 and 2024. Tail-sampled payment / transfer traffic to tame cost. Toss SLASH conferences feature talks on the topic.
Kakao — Standardised on OTel from an internal APM. KakaoTalk backend is Java-heavy, so OTel Java agent auto-instrumentation plus manual instrumentation on a few critical services. The if(kakao) conference often hosts tracing case studies.
NAVER — Internal monitoring platforms nLog and Pinpoint are growing OTel-compatible interfaces. Pinpoint is NAVER's own OSS APM (strong Java bytecode instrumentation); it now ships an OTel exporter to align with the standard.

Japan

Mercari — Migrated from Datadog to an OTel plus in-house backend combination, documented on the Mercari Engineering blog. Traffic cost was the prime driver; tail sampling via the OTel Collector reined in the bill.
NTT Communications — Adopted OTel for their internal multi-cloud monitoring platform. Unified mixed Datadog / New Relic / SkyWalking environments through OTel — the equivalent in Korea would be the monitoring teams at SK Telecom or KT.
CyberAgent / DeNA / LINE — Gaming, media, and messenger traffic profiles drive heavy demand for tail sampling and high-cardinality analysis; they mix Honeycomb / Datadog with OSS self-hosting (Tempo, Jaeger).

The common pattern: to avoid vendor lock-in, SDK is OTel, backends are swappable. SaaS pricing negotiation also leverages OTel compatibility.

18. Who Should Pick What

Small Teams / Side Projects

OTel SDK + Jaeger (AllInOne) or SigNoz (self-hosted), or Honeycomb Free / Sentry Performance.
Stands up in 30 minutes, cost USD 0 to low-tens.

Mid-Size (Dozens of People, Tens to Hundreds of Thousands of rps)

OTel SDK + Collector + Tempo + Grafana (OSS full stack). Cost is infra only.
Or Datadog / New Relic APM (SaaS, quick wins on alerting integration).
The safe migration pattern is to run both in parallel via the OTel Collector, then prune.

Enterprise

Dynatrace / Splunk Observability Cloud / Datadog — automatic topology, AIOps, company-wide alerting, audit logging.
OTel compatibility is mandatory. Negotiate so the vendor accepts OTel-only ingest.

Debugging-First (High Cardinality, Ad Hoc Analysis)

Honeycomb — observability 2.0 posture. Cost driven by event count and retention.
SigNoz follows similar ground in OSS form.

Forced Self-Hosting (Regulation, Security)

OTel + Tempo / Jaeger / SigNoz / SkyWalking. SigNoz is full-stack (traces + metrics + logs + events); Tempo + Loki + Mimir is the Grafana split-stack approach.

Kubernetes Polyglot, Hard to Change Code

Pixie or Beyla — automatic visibility via eBPF. Add OTel SDK only for business spans.

19. Ten Anti-Patterns

Skip OTel and hard-code vendor SDK directly — backend change rewrites your code.
Instrumented but sampling still at 100% — bill explodes.
"I want every error" without tail sampling — head 1% loses 99% of errors.
Put user_id / request_id on every span without thought — straight cardinality bill.
Forget to run W3C and B3 side by side — trace breaks at legacy boundaries.
PII in baggage — leaks via headers at the next hop.
Metrics, logs, traces stored in disconnected backends with no trace_id link — exemplar workflow broken.
No Collector — SDK ships straight to SaaS — backend change means full app redeploy.
Ignore trace_id routing for tail sampling — same trace splits across Collectors and decisions go wrong.
Look only at traces, ignore metrics and logs — traces are "the story of one request", metrics are "the trend of the whole". You need both.

Epilogue — Build Freedom on Top of OTel

The 2026 lesson is one sentence.

Unify instrumentation on OTel and keep backends swappable.

OTel is more than a tracing standard — it is the interface that breaks vendor lock-in. On top of it, pick the backend that matches your taste: the OSS freedom of Jaeger / Tempo, the debugging depth of Honeycomb, the unified UX of Datadog, the AIOps of Dynatrace. The moment swap cost shrinks from rewriting code to editing a Collector line, you gain negotiating power, cost control, and better debugging at once.

Next post candidates: (1) OpenTelemetry metrics deep dive: exemplars and trace-metric linking; (2) Linking logs to traces: Loki / OpenSearch / OpenTelemetry Logs; (3) Tail sampling in practice: load-balancing exporter topology and cost curves.

— Distributed Tracing and OpenTelemetry 2026 Deep Dive, fin.
