
OpenTelemetry 2026 Deep Dive — OTLP, Semantic Conventions, the Collector Pipeline, and Auto-Instrumentation After the Standardization War


Prologue — The Standardization War Is Over

Picture the observability landscape in 2018. OpenTracing and OpenCensus were fighting in parallel universes, and to send trace data you had to install a different SDK for every vendor (Datadog, New Relic, Lightstep, Honeycomb, etc.). Library authors despaired. They wanted to bake in instrumentation, but had to pick a vendor.

In 2019, OpenTracing and OpenCensus merged into OpenTelemetry. That was seven years ago.

May 2026 looks different.

  • OTLP is effectively the only observability wire protocol. Tempo, Jaeger, Honeycomb, Datadog, New Relic, Dynatrace, Grafana Cloud, SigNoz — all accept OTLP. Vendor-specific protocols hang around only for legacy compatibility.
  • Semantic conventions v1 are locked. The attribute names for HTTP, relational DB, messaging, RPC, and system metrics do not change anymore: http.request.method, db.system.name, messaging.system. Dashboard authors exhaled.
  • The log signal is GA. The era of "traces in OTel, logs somewhere else" is over.
  • Profiles entered as the fourth signal. After CNCF incubation, OTLP/profiles data volumes have been growing since late 2024.
  • eBPF auto-instrumentation (Beyla, Coroot, OpenTelemetry eBPF Collector) opened a path where traces appear without any SDK. Trace data for legacy binaries and untouchable services became a reality.

In short, the standardization war is over. The question is now how you deploy OTel. This piece walks from the OTLP wire format through a production Collector pipeline, per-language auto-instrumentation, and the tradeoffs of the eBPF path — exactly the shape that is shipping in May 2026.


1. The Landscape — Why OpenTelemetry Won

Three decisions tipped the fight.

  1. A vendor-neutral wire format. OTLP's protobuf schema is public, and it runs over both gRPC and HTTP. Any vendor that opens an OTLP endpoint becomes OTel-compatible immediately. Vendor lock weakens — or to be precise, lock-in moves from the SDK to the backend (query language, UI, pricing).
  2. A semantic convention lock. How to represent an HTTP request (http.request.method, url.full, http.response.status_code) and a DB query (db.system.name, db.query.text) was agreed upon. Without this, OTel would have been just another SDK.
  3. CNCF graduation. In 2024 OpenTelemetry graduated as the second-most-active CNCF project after Kubernetes. The signal mattered — this is a real neutral standard, not something owned by one company.

Old landscape vs new

Item                    | Pre-2019                             | 2026
Tracing standard        | OpenTracing + OpenCensus in parallel | OpenTelemetry as the single standard
Wire protocol           | One protocol per vendor              | OTLP/gRPC + OTLP/HTTP
Library instrumentation | Baked into a vendor SDK              | Baked into the OTel API, SDK is swappable
Logs                    | Separate pipeline (Fluentd, etc.)    | One OTel signal
Metrics                 | Separate Prometheus camp             | OTLP + Prometheus compatibility
Auto-instrumentation    | A few languages                      | Java, Python, Node, Go, Ruby, .NET, PHP, Rust
Semantics               | Per-vendor                           | http.*, db.*, messaging.* locked

2. OTLP — Why the Wire Protocol Matters Most

The heart of OpenTelemetry is not the SDK; it is OTLP. OTLP is the protobuf schema and transport protocol that ships trace, metric, log, and profile data to a backend.

2.1 Two transports

  • OTLP/gRPC — protobuf over HTTP/2. Default port 4317. Most efficient. The default in server environments.
  • OTLP/HTTP — protobuf or JSON over HTTP/1.1. Default port 4318. Used in browsers, Lambda, and environments where firewalls block gRPC.

The payload (schema) is identical; only the transport differs. This separation matters — sending traces directly from a browser over OTLP/HTTP became possible.
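
To make the "same payload, different transport" point concrete, here is a minimal sketch of pointing the Python SDK at each port. The endpoint hostname is a placeholder, and the exporter packages are assumed to be installed.

# pip install opentelemetry-exporter-otlp-proto-grpc opentelemetry-exporter-otlp-proto-http
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter as GrpcSpanExporter
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter as HttpSpanExporter

# OTLP/gRPC: protobuf over HTTP/2, default port 4317
grpc_exporter = GrpcSpanExporter(endpoint="otelcol-agent:4317", insecure=True)

# OTLP/HTTP: protobuf (or JSON) over HTTP/1.1, default port 4318, path /v1/traces
http_exporter = HttpSpanExporter(endpoint="http://otelcol-agent:4318/v1/traces")

Either exporter plugs into the same BatchSpanProcessor; the spans on the wire are identical.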

2.2 Wire shape (simplified)

message ExportTraceServiceRequest {
  repeated ResourceSpans resource_spans = 1;
}

message ResourceSpans {
  Resource resource = 1;          // service info (service.name, deployment.environment...)
  repeated ScopeSpans scope_spans = 2;
}

message ScopeSpans {
  InstrumentationScope scope = 1; // which library produced the spans
  repeated Span spans = 2;
}

message Span {
  bytes trace_id = 1;
  bytes span_id = 2;
  string name = 5;
  fixed64 start_time_unix_nano = 7;
  fixed64 end_time_unix_nano = 8;
  repeated KeyValue attributes = 9;  // e.g. http.request.method = "GET"
  repeated Event events = 11;
  repeated Link links = 13;
  Status status = 15;
}

The key is the three-level Resource → Scope → Span tree. Spans from the same service and same library are bundled together, which compresses well.

2.3 Metrics and logs

The same pattern repeats for metrics, logs, and profiles.

  • Metrics: ResourceMetrics → ScopeMetrics → Metric (Gauge / Sum / Histogram / ExponentialHistogram / Summary)
  • Logs: ResourceLogs → ScopeLogs → LogRecord
  • Profiles: ResourceProfiles → ScopeProfiles → Profile (pprof-compatible schema)

2.4 Is OTLP push or pull?

OTLP is push. The SDK or Collector pushes to a backend. That is the biggest philosophical break from Prometheus (pull).

OTel nevertheless keeps a foot in both camps for Prometheus compatibility. The Collector ships a prometheusreceiver (to scrape targets) and a prometheusexporter (so Prometheus can scrape the Collector). In real production, a hybrid — Prometheus pull for metrics, OTLP push for traces and logs — is common.
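
The same hybrid shows up at the SDK level: metrics exposed for Prometheus to pull, traces pushed over OTLP. A sketch using the Python Prometheus exporter; port 9464 is a common convention here, not a requirement.

# pip install opentelemetry-exporter-prometheus prometheus-client
from prometheus_client import start_http_server
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

start_http_server(port=9464)   # /metrics endpoint for Prometheus to scrape
provider = MeterProvider(metric_readers=[PrometheusMetricReader()])
meter = provider.get_meter("order-service")

orders_created = meter.create_counter("orders_created")
orders_created.add(1)          # pulled by Prometheus, not pushed anywhere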


3. Semantic Conventions — Why This Is OTel's Real Weapon

Technically OTLP is the most impressive piece, but the operator touches semantic conventions every day.

3.1 What got locked

Lock-down work for v1 stable started in September 2024 and locked the following between 2025 and 2026:

  • HTTP — http.request.method, http.response.status_code, url.path, url.full, url.scheme, server.address, server.port, user_agent.original.
  • Databases — db.system.name, db.namespace, db.query.text, db.collection.name, db.operation.name.
  • Messaging — messaging.system, messaging.destination.name, messaging.operation.type, messaging.message.id.
  • RPC — rpc.system, rpc.service, rpc.method.
  • System metrics — system.cpu.utilization, system.memory.usage, process.runtime.*.
  • Resource — service.name, service.version, service.instance.id, deployment.environment.name, host.name, os.type, cloud.provider, k8s.pod.name, k8s.namespace.name.

Locked stable means the names do not change anymore. Dashboards, alerts, queries, backend integrations — everything stabilizes.
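
To make the lock concrete, a short sketch of tagging a hand-made span with the locked v1 names in Python; the attribute values are illustrative.

from opentelemetry import trace

tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("GET /orders/{id}") as span:
    # Locked v1 HTTP names: dashboards and alerts can rely on these keys
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("url.path", "/orders/1234")
    span.set_attribute("server.address", "orders.internal")
    span.set_attribute("http.response.status_code", 200)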

3.2 Why the lock matters

There was a well-known trap in early OTel. The HTTP attribute name was once http.method, then it changed to http.request.method. Every deployed SDK had to be replaced; every dashboard broke. This happened more than once.

The v1 lock ends that pain. Future changes go into a v2 namespace; v1 is preserved. Same principle as Kubernetes API stability.

3.3 What semantic conventions enforce

  • Vendor portability. Move the same trace data from Tempo to Honeycomb and the queries still work.
  • Shared dashboards. OTel dashboards in the Grafana marketplace assume semantic conventions v1.
  • Trustworthy library instrumentation. The attribute names emitted by auto-instrumentation are standard, so operators can predict them.

4. Traces, Metrics, Logs, Profiles — Current State of the Four Signals

4.1 Traces

The most mature signal. The core of OTel 1.0; very little new arrives in 2026. The unit of a trace is the span — a unit of work with a start and end time, attributes, events, and a parent span.

Spans are grouped by trace ID, so you see one request flowing across a distributed system. The traceparent header (W3C Trace Context) propagates context across service boundaries.
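
A minimal sketch of that propagation in the Python SDK, assuming the default W3C Trace Context propagator; the plain dict stands in for outgoing and incoming HTTP headers.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("gateway")

# Caller side: serialize the current span context into a traceparent header
with tracer.start_as_current_span("call-order-service"):
    headers = {}
    inject(headers)   # e.g. {"traceparent": "00-<trace_id>-<span_id>-01"}

# Callee side: restore the context so the next span joins the same trace
ctx = extract(headers)
with tracer.start_as_current_span("handle-order-request", context=ctx):
    pass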

4.2 Metrics

OTel's metric model differs slightly from Prometheus. Prometheus has gauge / counter / histogram / summary; OTel subdivides into Counter / UpDownCounter / Gauge / Histogram / ExponentialHistogram / ObservableX.

One big change — ExponentialHistogram entered the standard. The old Histogram uses pre-chosen bucket boundaries; ExponentialHistogram adapts to the distribution dynamically. Quantile estimation accuracy (P99 and friends) improves a lot.

The Prometheus camp went the same way with native histograms. Both camps converged.
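
In the Python SDK, switching histogram instruments to the exponential aggregation is a View away. A sketch, assuming a recent opentelemetry-sdk that ships ExponentialBucketHistogramAggregation.

from opentelemetry.sdk.metrics import Histogram, MeterProvider
from opentelemetry.sdk.metrics.view import ExponentialBucketHistogramAggregation, View

provider = MeterProvider(
    views=[
        # Every Histogram instrument uses exponential buckets instead of fixed ones
        View(instrument_type=Histogram, aggregation=ExponentialBucketHistogramAggregation()),
    ],
)
meter = provider.get_meter("order-service")
duration = meter.create_histogram("http.server.request.duration", unit="s")
duration.record(0.042)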

4.3 Logs

GA in late 2024. The point is — OTel alone now ships traces, metrics, and logs without a separate log pipeline.

The killer feature is automatic attachment of trace context. Log records get trace_id and span_id automatically, so trace-to-log and log-to-trace jumps are one click. In the old days you wired this by hand into your logger.
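
For contrast, the by-hand wiring that the automatic attachment replaces looks roughly like this in Python; the trace_id / span_id field names are a common convention, not anything the spec enforces.

import logging

from opentelemetry import trace

logger = logging.getLogger("order-service")

def log_with_trace_context(msg: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    # Format the IDs the way backends display them: 32 and 16 hex characters
    logger.info(msg, extra={
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    })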

That said, Fluent Bit and Vector did not die in 2026. OTel's log receiver is strong, but Fluent Bit is more mature at tailing and parsing container log files. Real production splits two ways: Fluent Bit reads files and pushes OTLP to a Collector, or the Collector's filelogreceiver reads files directly.

4.4 Profiles

After CNCF incubation in 2024, profiling became the fourth signal. Continuous profiling joined OTel alongside traces, metrics, and logs.

The technical core: a pprof-compatible schema lands in OTLP. The reason this matters — Grafana Pyroscope, Polar Signals Parca, and other existing continuous-profiling tools, once they accept OTLP/profiles, let you ship all four signals through one pipeline.

Status as of May 2026:

  • Wire protocol (OTLP/profiles): near-stable beta.
  • SDK support: parts of Go, Python, and Java auto-instrumentation experimentally emit profile signals.
  • Collector support: otlpreceiver and otlpexporter accept the profiles signal. Some backends do not yet receive it.
  • eBPF: Parca and Pyroscope generate pprof via eBPF and ship it over OTLP.

Close to GA but not complete. When you start, pin SDK and Collector versions that can enable profiles.


5. Collector Architecture — The nginx of Observability

The OpenTelemetry Collector is the single most important component in the OTel ecosystem. Its role in an observability pipeline is what nginx is to HTTP traffic — receive, transform, route.

5.1 Three component types

+-------------+      +-------------+      +-------------+
|  Receivers  | ---> | Processors  | ---> |  Exporters  |
+-------------+      +-------------+      +-------------+
   otlp                batch                otlp
   prometheus          memory_limiter        prometheusremotewrite
   filelog             attributes            elasticsearch
   jaeger              k8sattributes         loki
   zipkin              tail_sampling         tempo
   kafka               filter                kafka
                       transform
  • Receivers — adapters that take data in. OTLP is the default, but you can take Prometheus scrapes, Jaeger, Zipkin, Fluent Forward, Kafka, SQS — anything.
  • Processors — transform, filter, and enrich. batch is essentially required, memory_limiter is a safety net, k8sattributes automatically attaches Kubernetes metadata, tail_sampling delays the sampling decision until the trace is complete.
  • Exporters — push to backends. OTLP is the default, but plenty of backend-specific exporters exist.

5.2 Core vs Contrib

Two distributions.

  • otelcol (core) — only the most essential components. Easier to security-audit; smaller binary.
  • otelcol-contrib (contrib) — all community components. 99% of production runs contrib.

Recommendation: start with contrib. Once operations stabilize, use OpenTelemetry Collector Builder (ocb) to build a custom image with exactly the components you need.

5.3 The concept of a pipeline

Collector config is grouped by pipelines. One pipeline per signal type.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/loki]

The same receiver can feed multiple pipelines, and fan-out / fan-in are free.


6. Production Collector Setup — sidecar → gateway → backend

The most common real production pattern is a two-tier layout.

+----------------+   OTLP   +-----------+   OTLP    +-----------+
| App + SDK      | -------> | Collector |---------->| Collector |
| (Pod sidecar)  |          | (sidecar) |           | (gateway) |
+----------------+          +-----------+           +-----------+
                                                          |
                                                          | OTLP / proprietary
                                                          v
                                                 +------------------+
                                                 | Tempo / Jaeger / |
                                                 | Honeycomb / etc. |
                                                 +------------------+
  • sidecar Collector (DaemonSet or pod sidecar) — receives locally, batches, retries. The app sends OTLP and gets a fast response so it can get on with its work.
  • gateway Collector (Deployment) — aggregates cluster-wide for tail sampling, metadata enrichment, routing, and fan-out to multiple backends.
  • backend — Tempo / Jaeger / SigNoz / Honeycomb / Datadog. Wherever OTLP terminates.

6.1 sidecar / agent Collector config

# otelcol-agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'self-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:8888']

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.deployment.name
        - k8s.cluster.uid
  batch:
    timeout: 200ms
    send_batch_size: 8192

exporters:
  otlp/gateway:
    endpoint: otelcol-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/gateway]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/gateway]

6.2 gateway Collector config (with tail sampling)

The gateway usually receives the full trace and decides — which traces to keep, where to send them.

# otelcol-gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-10pct
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
  batch:
    timeout: 1s
    send_batch_size: 16384

exporters:
  otlphttp/tempo:
    endpoint: http://tempo-distributor.monitoring.svc.cluster.local:4318
  prometheusremotewrite:
    endpoint: http://mimir-distributor.monitoring.svc.cluster.local/api/v1/push
  otlphttp/loki:
    endpoint: http://loki-distributor.monitoring.svc.cluster.local:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]

6.3 Operational tips

  • memory_limiter goes first. If memory blows up every other processor is useless.
  • tail sampling only at the gateway. Do it at the agent and traces fracture.
  • k8sattributes belongs on the agent. The gateway may not see pod info.
  • Tune batch size and timeout to backend capacity. Too small and RPS explodes; too large and latency climbs.
  • Scrape the Collector's own metrics. :8888/metrics exposes otelcol_processor_dropped_spans and friends.

7. Per-Language Auto-Instrumentation — How Automatic Is It

One of OTel's biggest selling points is auto-instrumentation. Without touching a line of code, spans appear for HTTP handlers, DB drivers, and gRPC clients.

Maturity varies sharply by language.

7.1 Java — best-in-class

The opentelemetry-javaagent.jar is effectively the industry standard. Using JVM's -javaagent mechanism it rewrites bytecode at runtime and auto-instruments more than 100 libraries (Spring, Hibernate, Apache HttpClient, Kafka, JDBC, Servlet, Reactor, gRPC, AWS SDK, …).

java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.endpoint=http://otelcol-agent:4317 \
     -Dotel.exporter.otlp.protocol=grpc \
     -Dotel.resource.attributes=deployment.environment.name=prod \
     -jar app.jar

That is it. Zero code changes. Spring Boot controllers, every JDBC call, every Kafka producer and consumer span shows up automatically. Trace context propagation is automatic too.

Why Java is best-in-class — JVM allows the freest runtime bytecode manipulation, the OTel Java team is the largest, and the know-how of the Datadog Java agent merged straight into OTel.

7.2 Python — opentelemetry-instrument

Python auto-instruments via the opentelemetry-instrument launcher and the opentelemetry-distro package. It wraps libraries via monkey-patching.

pip install opentelemetry-distro opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-flask \
            opentelemetry-instrumentation-requests \
            opentelemetry-instrumentation-psycopg2

opentelemetry-bootstrap -a install   # auto-detect installed libs

OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol-agent:4317 \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=prod \
opentelemetry-instrument python app.py

Supported libraries include Flask, Django, FastAPI, Starlette, requests, urllib3, httpx, psycopg2, asyncpg, SQLAlchemy, Redis, pymongo, celery, kafka-python, boto3. Not as deep as Java, but it covers the everyday libraries.
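
When the launcher is awkward (pre-forking servers, unusual entry points), the same instrumentations can be applied programmatically. A sketch, assuming the Flask and requests instrumentation packages from the list above are installed.

from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)   # server spans for each incoming request
RequestsInstrumentor().instrument()       # client spans for outgoing HTTP calls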

7.3 Node.js — require-hook based

Node auto-instruments through @opentelemetry/auto-instrumentations-node, which installs a require hook.

// tracing.js — must load before the app entry point
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc')

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otelcol-agent:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})

sdk.start()

Run it:

node --require ./tracing.js app.js

In ESM you need the --import flag and a hook loader (the register() API in Node 20.6+). CommonJS environments still auto-instrument more smoothly.

Supported libraries: Express, Koa, Fastify, Hapi, http, https, pg, mysql, redis, mongoose, ioredis, kafkajs, AWS SDK, GraphQL, etc.

7.4 Go — the compile-time instrumentation revolution

Go was an OTel weakness for a long time. Go is hard to monkey-patch, has no vtable, and runtime interception is messy. For years Go meant manual instrumentation only.

Two paths opened in 2024–2025.

Path 1: code injection at go build time — otel-go-instrumentation

go install github.com/open-telemetry/opentelemetry-go-instrumentation/cli@latest

# Inject auto-instrumentation at build time
otel-instrument go build -o app ./...

The build tool generates wrappers around standard library and popular package calls (net/http, gRPC, database/sql, etc.) and injects them. Zero source changes.

Path 2: outside the process via eBPF — Beyla

Even build-time injection is skipped. Beyla runs as a separate process and uses eBPF to observe syscalls and sockets in the kernel, producing traces. More on this in Section 8.

7.5 .NET, Ruby, PHP, Rust

  • .NET — OpenTelemetry.AutoInstrumentation NuGet package. Rewrites IL via the CoreCLR profiler API. Second-most mature after Java.
  • Ruby — opentelemetry-instrumentation-all gem. Monkey-patch based. Rails, Sinatra, Rack, etc.
  • PHP — open-telemetry/opentelemetry-auto-laravel and friends. Sometimes needs an OPcache extension.
  • Rust — virtually no auto-instrumentation. The OTel adapter for the tracing crate is the standard manual approach.

7.6 Limits of auto-instrumentation

Auto-instrumentation only sees the boundaries — HTTP, DB, message, RPC. The meaningful business units — "create order", "validate payment", "decrement inventory" — are invisible.

The right answer in production OTel is auto-instrumentation plus hand-coded business spans. Auto-instrumentation catches the infrastructure edges; manual spans add domain meaning.
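
A sketch of the manual half in Python. The surrounding HTTP and DB spans come from auto-instrumentation; the domain span is nested inside them. persist() and PaymentError are placeholders for your own code.

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def create_order(cart) -> str:
    # Parent span: the auto-instrumented HTTP server span for POST /orders
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.item_count", len(cart.items))
        try:
            order_id = persist(cart)   # placeholder; the DB driver adds child spans here
            span.set_attribute("order.id", order_id)
            span.add_event("inventory.decremented")
            return order_id
        except PaymentError as exc:    # placeholder exception type
            span.record_exception(exc)
            span.set_status(trace.StatusCode.ERROR, "payment validation failed")
            raise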


8. eBPF Auto-Instrumentation — Traces Without an SDK

One of the largest shifts in 2024–2025. eBPF makes traces appear without any SDK installed.

8.1 How it works

An eBPF auto-instrumentation tool runs as a separate process and hooks into the kernel's syscalls and socket events. It directly observes HTTP requests, gRPC calls, and DB connections.

  • The app is a plain binary. No SDK, no javaagent.
  • An eBPF program hooks accept, connect, read, write, etc., and decodes the packets.
  • Decoded transactions are sent over OTLP to a Collector.

8.2 The tools

  • Grafana Beyla — Grafana Labs' open-source project. Watches Go, Java, Node, Python, and Rust uniformly. Emits both traces and metrics. GA in 2024.
  • Coroot — full-stack observability that collects via its own eBPF agent and ships its own UI. OTel-compatible.
  • OpenTelemetry eBPF Collector — the official OTel project. Originally built by Splunk and donated. Watches system metrics and network-level traces.
  • Pixie (Pixie Labs) — Kubernetes-only. eBPF plus its own query language. OTel output supported.
  • Cilium Hubble — focused on the network layer. Cilium's observability component.

8.3 Strengths of the eBPF path

  • Zero SDK lines. Tracing works for legacy binaries and untouchable third-party services.
  • Zero instrumentation-miss risk. Visible even if a library version is out of sync with OTel.
  • Language-agnostic. One Beyla yields uniform Go, Java, Node, Python, and Rust traces.
  • Low operational pressure. No redeploy required.

8.4 Limits of the eBPF path

  • Distributed trace context propagation is hard. eBPF can read traceparent in HTTP headers, but cannot follow that context into function-level calls inside the app. Result — service-level traces look fine, function-level graphs are thin.
  • Encrypted traffic. To see HTTP/2 inside TLS you need uprobes hooking library functions. Modern tools support this but the compatibility matrix is narrow.
  • Kernel privileges required. Pods need CAP_BPF or privileged. Security policy may block it.
  • Zero business logic. Domain semantics like "create order" are missed.

8.5 eBPF + SDK hybrid

The right production answer is eBPF plus SDK.

  • eBPF — the infrastructure picture (service-to-service calls, DB queries, external API calls, network latency).
  • SDK — business meaning (domain spans, custom metrics, business logs).

Tying the two datasets together with the same trace_id is hard. In practice "eBPF trace graph + SDK trace graph" often run side by side without merging. This is the most actively improved area as of May 2026.


9. Resource Detection

Every piece of OTel data is shipped attached to a Resource — metadata like service.name=order-service, deployment.environment.name=prod, host.name=node-01, k8s.pod.name=order-69b.

You can set all of it by hand, but the OTel SDK and Collector include automatic resource detectors.

9.1 Kinds of detectors

  • Environment variables — OTEL_RESOURCE_ATTRIBUTES=service.name=order,deployment.environment.name=prod.
  • Process — pid, command, runtime version.
  • Host — hostname, OS, architecture.
  • Container — container.id (extracted from cgroup).
  • Kubernetes — env injection via downwardAPI, or the Collector's k8sattributes processor enriching by pod IP.
  • Cloud — AWS EC2/ECS/EKS/Lambda, GCP GCE/GKE/Cloud Run, Azure VM/AKS — detect instance metadata via IMDS.

9.2 Precedence

The OTel spec defines a precedence order: SDK defaults < environment variables < explicit code settings. Cloud detectors run automatically at SDK startup.
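
A minimal sketch of how that precedence plays out in the Python SDK: Resource.create() folds the built-in detectors and OTEL_RESOURCE_ATTRIBUTES into whatever is passed explicitly, and the explicit values win. The attribute values here are placeholders.

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Explicit attributes override OTEL_RESOURCE_ATTRIBUTES, which overrides SDK defaults
resource = Resource.create({
    "service.name": "order-service",
    "deployment.environment.name": "prod",
})

provider = TracerProvider(resource=resource)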

9.3 Production pattern

env:
  # Downward-API values must be defined before any variable that references them
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: OTEL_SERVICE_NAME
    value: order-service
  # Kubernetes only expands $(VAR) references; $(VERSION) is assumed to be defined
  # earlier in this env list (e.g. set at deploy time)
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment.name=prod,service.version=$(VERSION),k8s.pod.name=$(POD_NAME),k8s.node.name=$(NODE_NAME)"

Cleaner: hand it off to the Collector's k8sattributes processor. The SDK only sets service.name and the rest is enriched by the Collector via pod IP → API server lookup.


10. From Vendor SDK to OTel SDK — The Migration Story

The most common migration pattern of 2024–2025.

10.1 Scenario — Datadog dd-trace to OTel

You already ship traces via Datadog dd-trace and want to move to OTel. Two paths.

Path 1: dd-trace's OTel API compatibility mode

dd-trace itself can accept the OTel API. You write code against the OTel API; dd-trace handles the wire format.

[Code: OTel API] -> [dd-trace agent: accepts OTel API] -> [Datadog backend: dd-trace format]

Upside: incremental migration. Switch code to the OTel API now, switch the backend later. Downside: until you peel dd-trace off, you have not gained much.

Path 2: OTel SDK direct emit

[Code: OTel API] -> [OTel SDK] -> [OTLP] -> [OTel Collector] -> [Datadog's OTLP endpoint, or swap the backend]

Upside: real OTel. Free to swap backends. Downside: bigger jump. A gap appears wherever dd-trace had auto-instrumentation that OTel still lacks.

10.2 Staged migration order

  1. Stand up the OTel Collector first. Run alongside the dd-trace agent.
  2. Adopt the OTel SDK in new services. No new dd-trace.
  3. Migrate traces, then metrics, then logs. Traces are smoothest; logs are the hardest for dashboard compatibility.
  4. Rewrite dashboards. Semantic conventions differ (http.status_code vs http.response.status_code). Do new dashboards first, do not flip all at once.
  5. Remove the dd-trace agent. Only after every service is on OTel.

10.3 Honest tradeoffs

OTel SDK is not better than vendor agents on every axis.

Aspect                       | OTel SDK                      | Vendor SDK (e.g. dd-trace)
Auto-instrumentation breadth | Good (Java best-in-class)     | Very broad (long accumulation)
Runtime overhead             | Generally a touch heavier     | Often more tuned, sometimes lighter
Backend flexibility          | Very large                    | Locked to one vendor
Semantic consistency         | Standard semantic conventions | Vendor-specific names
Diagnostics, support         | Community                     | Vendor SRE support
AI/ML workflows              | New (OTel GenAI conventions)  | Some vendors are ahead

In short, OTel trades a little runtime cost to unlock the backend. Acceptable? Usually yes — the value of backend freedom typically dwarfs SDK micro-tuning gains.


11. GenAI Semantic Conventions — The Next Area to Lock

The most active area in the OTel camp through 2025–2026. Semantic conventions for tracing LLM calls are on the verge of being locked.

Core attributes (beta, near stable):

  • gen_ai.system — openai, anthropic, google, ollama, ...
  • gen_ai.request.model — claude-3.5-sonnet, gpt-4o, ...
  • gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — cost and rate-limit tracking.
  • gen_ai.response.finish_reasons — stop, length, tool_calls, ...
  • gen_ai.operation.name — chat, tool_call, embedding, text_completion.
  • Events — per-message inputs and outputs (optional, cost-and-privacy tradeoff).

LangChain, LlamaIndex, OpenLLMetry, Arize Phoenix all track these conventions. You will see LLM-workflow cost, latency, and failure rate in a standardized shape. v1 stable lock in the next quarter or two is likely.
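
A hedged sketch of a hand-made LLM-call span carrying these attributes in Python. Since the conventions are still pre-stable, treat the keys as the list above describes them rather than as a frozen API; model names and token counts are illustrative.

from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")

with tracer.start_as_current_span("chat claude-3.5-sonnet") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-3.5-sonnet")
    # ... call the model here ...
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 127)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])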


12. Common Mistakes and Anti-Patterns

Mistakes seen often in production.

12.1 SDK pushing straight to the backend

App SDK ----(OTLP)----> Datadog / Honeycomb

Works in small environments. In large ones it fails — you cannot change backend policy (sampling, routing, enrichment) without redeploying the app. Always insert a Collector layer.

12.2 Losing error traces to head sampling

If the SDK head-samples at 1%, 99% of error traces vanish. Errors must be preserved, so use tail sampling (in the Collector).

12.3 Generating trace_id and span_id by hand

We saw code that made a 16-byte UUID and stuffed it as a trace_id — it breaks the standard trace context propagation. Use getCurrentSpan().spanContext() from the OTel API.

12.4 Creating a new Tracer per request

// bad
function handler(req) {
  const tracer = trace.getTracer('app')  // looks like a fresh one each time
  tracer.startSpan(...)
}

getTracer is cached, but pulling it once at module top is clearer about intent.

12.5 OTLP without batch

Without a batch processor, every span becomes its own tiny export request, so export volume tracks application RPS. The batch processor is essentially required.

12.6 PII leaks into attributes

Putting http.request.body or user.email directly into attributes lands plain-text PII in the observability backend. Mask via the attributes processor or drop it.
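
Collector-side masking uses the attributes processor; on the SDK side the simplest guard is to never put the raw value on a span in the first place. A sketch that hashes instead of storing the raw email; the attribute key is illustrative.

import hashlib

from opentelemetry import trace

def tag_user(span: trace.Span, email: str) -> None:
    # Store a stable pseudonymous ID, never the raw PII
    digest = hashlib.sha256(email.strip().lower().encode()).hexdigest()
    span.set_attribute("user.id_hash", digest[:16])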

12.7 Computing P99 from a fixed-bucket Histogram

A default Histogram with 5ms / 10ms / 50ms / 100ms / 500ms boundaries gives a sloppy P99. Default to ExponentialHistogram.

12.8 A single Collector instance as SPOF

Run the agent as a DaemonSet and the gateway as an HPA-scaled Deployment. Always multi-instance. If the gateway dies, every signal in that window is gone.


Epilogue — What to Do Once the Standard Is Laid Down

As of May 2026, OpenTelemetry sits in the position of having won the standardization war. But "do you use OTel" is no longer an interesting question. The interesting questions are these:

  • How well is it deployed? Two-tier agent + gateway, or a single tier? Is there tail sampling? Is memory_limiter the first processor?
  • Do you follow semantic conventions? Not just trusting auto-instrumentation, but tagging business spans with the conventions too.
  • What is the balance of auto vs manual? Infrastructure edges automatic, business meaning manual.
  • Where do you put eBPF? Only for legacy and third-party services, or as the main path wherever the SDK side falls short?
  • Are you ready for the profiles signal? Do your backend and SDK versions accept it?

Adoption checklist

  • Is the Collector deployed as a two-tier agent + gateway?
  • Is memory_limiter the first processor in every pipeline?
  • Is tail sampling in the gateway and are errors and slow requests unconditionally preserved?
  • Does the k8sattributes processor run on the agent?
  • Does every pipeline include the batch processor?
  • For instrumented languages, is it via javaagent, opentelemetry-instrument, require hook, or compile-time injection?
  • Do your business spans follow the semantic conventions?
  • Does the attributes processor mask PII out of attributes?
  • Do you scrape the Collector's self-metrics (otelcol_processor_dropped_spans, etc.)?
  • Is ExponentialHistogram the default?
  • Is your backend ready for GenAI semantic conventions?

Anti-pattern summary

  1. SDK direct to backend — insert a Collector layer.
  2. Head sampling only — loses error traces; add tail sampling.
  3. No batch processor — one tiny export request per application request.
  4. memory_limiter last — pointless if memory blows; put it first.
  5. PII in raw attributes — mask via attributes processor.
  6. Auto-instrumentation only, zero business spans — only the infrastructure edge is visible.
  7. Single Collector instance — SPOF. Multi-instance required.
  8. Custom attributes ignoring semantic conventions — dashboard compatibility breaks.
  9. Traces in OTel, metrics and logs in a different pipeline — biggest value is unifying all three signals.
  10. Enabling profiles in the SDK when the backend cannot receive it — data goes nowhere.

What is next

Candidates for the next piece:

  • Operating a Collector in depth — pipeline load, drops, and the trap of metric cardinality.
  • OTel Profiles in practice — catching hotspots in a Go binary with eBPF.
  • One dashboard for LLM cost, latency, and failure rate via GenAI semantic conventions.

