- Published on
OpenTelemetry 2026 Deep Dive — OTLP, Semantic Conventions, the Collector Pipeline, and Auto-Instrumentation After the Standardization War
- Author: Youngju Kim (@fjvbn20031)
Prologue — The Standardization War Is Over
Picture the observability landscape in 2018. OpenTracing and OpenCensus were fighting in parallel universes, and to send trace data you had to install a different SDK for every vendor (Datadog, New Relic, Lightstep, Honeycomb, etc.). Library authors despaired. They wanted to bake in instrumentation, but had to pick a vendor.
In 2019, OpenTracing and OpenCensus merged into OpenTelemetry. That was seven years ago.
May 2026 looks different.
- OTLP is effectively the only observability wire protocol. Tempo, Jaeger, Honeycomb, Datadog, New Relic, Dynatrace, Grafana Cloud, SigNoz — all accept OTLP. Vendor-specific protocols hang around only for legacy compatibility.
- Semantic conventions v1 are locked. The attribute names for HTTP, relational DB, messaging, RPC, and system metrics do not change anymore: `http.request.method`, `db.system.name`, `messaging.system`. Dashboard authors exhaled.
- The log signal is GA. The era of "traces in OTel, logs somewhere else" is over.
- Profiles has entered as the fourth signal. After CNCF incubation, OTLP/profiles data started growing from late 2024.
- eBPF auto-instrumentation (Beyla, Coroot, OpenTelemetry eBPF Collector) opened a path where traces appear without any SDK. A line of trace data for legacy binaries and untouchable services became real.
In short, the standardization war is over. The question is now how you deploy OTel. This piece walks from the OTLP wire format through a production Collector pipeline, per-language auto-instrumentation, and the tradeoffs of the eBPF path — exactly the shape that is shipping in May 2026.
1. The Landscape — Why OpenTelemetry Won
Three decisions tipped the fight.
- A vendor-neutral wire format. OTLP's protobuf schema is public, and it runs over both gRPC and HTTP. Any vendor that opens an OTLP endpoint becomes OTel-compatible immediately. Vendor lock weakens — or to be precise, lock-in moves from the SDK to the backend (query language, UI, pricing).
- A semantic convention lock. How to represent an HTTP request (`http.request.method`, `url.full`, `http.response.status_code`) and a DB query (`db.system.name`, `db.query.text`) got agreed. Without this, OTel would have been just another SDK.
- CNCF graduation. In 2024 OpenTelemetry graduated as the second-most-active CNCF project after Kubernetes. The signal mattered — this is a real neutral standard, not something owned by one company.
Old landscape vs new
| Item | Pre-2019 | 2026 |
|---|---|---|
| Tracing standard | OpenTracing + OpenCensus in parallel | OpenTelemetry single |
| Wire protocol | One per vendor | OTLP/gRPC + OTLP/HTTP |
| Library instrumentation | Baked into a vendor SDK | Baked into the OTel API, SDK is swappable |
| Logs | Separate pipeline (Fluentd, etc.) | One OTel signal |
| Metrics | Separate Prometheus camp | OTLP + Prometheus compat |
| Auto-instrumentation | A few languages | Java, Python, Node, Go, Ruby, .NET, PHP (Rust still mostly manual) |
| Semantics | Per-vendor | http.*, db.*, messaging.* locked |
2. OTLP — Why the Wire Protocol Matters Most
The heart of OpenTelemetry is not the SDK; it is OTLP. OTLP is the protobuf schema and transport protocol that ships trace, metric, log, and profile data to a backend.
2.1 Two transports
- OTLP/gRPC — protobuf over HTTP/2. Default port 4317. Most efficient. The default in server environments.
- OTLP/HTTP — protobuf or JSON over HTTP/1.1. Default port 4318. Used in browsers, Lambda, and environments where firewalls block gRPC.
The payload (schema) is identical; only the transport differs. This separation matters — sending traces directly from a browser over OTLP/HTTP became possible.
2.2 Wire shape (simplified)
message ExportTraceServiceRequest {
repeated ResourceSpans resource_spans = 1;
}
message ResourceSpans {
Resource resource = 1; // service info (service.name, deployment.environment...)
repeated ScopeSpans scope_spans = 2;
}
message ScopeSpans {
InstrumentationScope scope = 1; // which library produced the spans
repeated Span spans = 2;
}
message Span {
bytes trace_id = 1;
bytes span_id = 2;
string name = 5;
fixed64 start_time_unix_nano = 7;
fixed64 end_time_unix_nano = 8;
repeated KeyValue attributes = 9; // e.g. http.request.method = "GET"
repeated Event events = 11;
repeated Link links = 13;
Status status = 15;
}
The key is the three-level Resource → Scope → Span tree. Spans from the same service and same library are bundled together, which compresses well.
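The same three-level tree is easy to see in OTLP/JSON, the form accepted on port 4318. Below is a minimal sketch of a single-span export body; the service name, span name, IDs, and timestamps are placeholder values, not anything from a real system:

```python
import json

# Minimal OTLP/JSON trace export body: Resource -> Scope -> Span.
# In OTLP/JSON, ids are lowercase hex (16-byte trace_id, 8-byte span_id)
# and uint64 nanosecond timestamps are encoded as strings.
export_body = {
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "order-service"}},
        ]},
        "scopeSpans": [{
            "scope": {"name": "example-instrumentation"},
            "spans": [{
                "traceId": "5b8efff798038103d269b633813fc60c",
                "spanId": "eee19b7ec3c1b174",
                "name": "GET /orders",
                "startTimeUnixNano": "1714000000000000000",
                "endTimeUnixNano": "1714000000125000000",
                "attributes": [
                    {"key": "http.request.method",
                     "value": {"stringValue": "GET"}},
                ],
            }],
        }],
    }],
}

payload = json.dumps(export_body)
# POST this to http://<collector>:4318/v1/traces with
# Content-Type: application/json to export it.
```

Spans sharing a resource and scope land in the same sub-tree, which is what makes the format compress well.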
2.3 Metrics and logs
The same pattern repeats for metrics, logs, and profiles.
- Metrics: `ResourceMetrics` → `ScopeMetrics` → `Metric` (Gauge / Sum / Histogram / ExponentialHistogram / Summary)
- Logs: `ResourceLogs` → `ScopeLogs` → `LogRecord`
- Profiles: `ResourceProfiles` → `ScopeProfiles` → `Profile` (pprof-compatible schema)
2.4 Is OTLP push or pull?
OTLP is push. The SDK or Collector pushes to a backend. That is the biggest philosophical break from Prometheus (pull).
OTel keeps both legs in for Prometheus compatibility. The Collector ships a prometheusreceiver (to scrape targets) and a prometheusexporter (so Prometheus can scrape the Collector). In real production, a hybrid — Prometheus for metrics, OTLP push for traces and logs — is common.
3. Semantic Conventions — Why This Is OTel's Real Weapon
Technically OTLP is the most impressive piece, but the operator touches semantic conventions every day.
3.1 What got locked
Lock-down work for v1 stable started in September 2024 and locked the following between 2025 and 2026:
- HTTP — `http.request.method`, `http.response.status_code`, `url.path`, `url.full`, `url.scheme`, `server.address`, `server.port`, `user_agent.original`.
- Databases — `db.system.name`, `db.namespace`, `db.query.text`, `db.collection.name`, `db.operation.name`.
- Messaging — `messaging.system`, `messaging.destination.name`, `messaging.operation.type`, `messaging.message.id`.
- RPC — `rpc.system`, `rpc.service`, `rpc.method`.
- System metrics — `system.cpu.utilization`, `system.memory.usage`, `process.runtime.*`.
- Resource — `service.name`, `service.version`, `service.instance.id`, `deployment.environment.name`, `host.name`, `os.type`, `cloud.provider`, `k8s.pod.name`, `k8s.namespace.name`.
Locked stable means the names do not change anymore. Dashboards, alerts, queries, backend integrations — everything stabilizes.
3.2 Why the lock matters
There was a well-known trap in early OTel. The HTTP attribute name was once http.method, then it changed to http.request.method. Every deployed SDK had to be replaced; every dashboard broke. This happened more than once.
The v1 lock ends that pain. Future changes go into a v2 namespace; v1 is preserved. Same principle as Kubernetes API stability.
3.3 What semantic conventions enforce
- Vendor portability. Move the same trace data from Tempo to Honeycomb and the queries still work.
- Shared dashboards. OTel dashboards in the Grafana marketplace assume semantic conventions v1.
- Trustworthy library instrumentation. The attribute names emitted by auto-instrumentation are standard, so operators can predict them.
4. Traces, Metrics, Logs, Profiles — Current State of the Four Signals
4.1 Traces
The most mature signal. The core of OTel 1.0; very little new arrives in 2026. The unit of a trace is the span — a unit of work with a start and end time, attributes, events, and a parent span.
Spans are grouped by trace ID, so you see one request flowing across a distributed system. The traceparent header (W3C Trace Context) propagates context across service boundaries.
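The traceparent header has a fixed wire shape: a version byte, a 16-byte trace-id, an 8-byte parent span-id, and a flags byte, all lowercase hex and dash-separated. A stdlib-only sketch of building and parsing that shape (this is not the official propagator, just the format):

```python
import re
import secrets

# version - trace-id (32 hex) - span-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header value (version 00)."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 random bytes
    span_id = span_id or secrets.token_hex(8)     # 8 random bytes
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Return (trace_id, span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Per the spec, version ff and all-zero ids are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

A downstream service keeps the incoming trace-id but mints a fresh span-id for its own span, so the whole request chain shares one trace.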
4.2 Metrics
OTel's metric model differs slightly from Prometheus. Prometheus has gauge / counter / histogram / summary; OTel subdivides into Counter / UpDownCounter / Gauge / Histogram instruments, their asynchronous Observable variants, and the ExponentialHistogram aggregation.
One big change — ExponentialHistogram entered the standard. The old Histogram uses pre-chosen bucket boundaries; ExponentialHistogram adapts to the distribution dynamically. Quantile estimation accuracy (P99 and friends) improves a lot.
The Prometheus camp went the same way with native histograms. Both camps converged.
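The mechanics behind the accuracy gain fit in a few lines. An exponential histogram at scale s uses buckets whose boundaries grow by a factor of base = 2^(2^-s), so bucket i covers (base^i, base^(i+1)] and a value maps to index ⌈log_base(value)⌉ - 1. A sketch under the simplifying assumptions of positive values only and no zero bucket:

```python
import math

def exp_bucket_index(value: float, scale: int) -> int:
    """Bucket index for a positive value in an exponential histogram.

    Bucket i covers (base**i, base**(i+1)] with base = 2**(2**-scale).
    A higher scale means a smaller base, finer buckets, and therefore
    tighter quantile estimates (P99 and friends).
    """
    base = 2.0 ** (2.0 ** -scale)
    return math.ceil(math.log(value, base)) - 1

# At scale 3 the base is 2**(1/8), roughly 1.09, so the worst-case
# relative error of a quantile read from the buckets is about 9%,
# regardless of where the latency distribution actually sits.
```

A fixed-bucket histogram only achieves that accuracy if someone guessed the boundaries well in advance; the exponential layout adapts by shifting scale instead.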
4.3 Logs
GA in late 2024. The point is — OTel alone now ships traces, metrics, and logs without a separate log pipeline.
The killer feature is automatic attachment of trace context. Log records get trace_id and span_id automatically, so trace-to-log and log-to-trace jumps are one click. In the old days you wired this by hand into your logger.
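The mechanism is simple to mimic with the stdlib: a logging filter stamps the active trace context onto every record, and the formatter emits it. This is a conceptual sketch (the real OTel logging handlers do this for you), and `current_ctx` is a hypothetical stand-in for the SDK's active span context:

```python
import logging

# Hypothetical stand-in for the SDK's active span context.
current_ctx = {"trace_id": "5b8efff798038103d269b633813fc60c",
               "span_id": "eee19b7ec3c1b174"}

class TraceContextFilter(logging.Filter):
    """Attach trace_id/span_id to each record for log<->trace jumps."""
    def filter(self, record):
        record.trace_id = current_ctx.get("trace_id", "0" * 32)
        record.span_id = current_ctx.get("span_id", "0" * 16)
        return True  # never drops records, only enriches them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))

logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("order created")  # the line now carries the trace context
```

With the IDs on every record, a backend can render the log line inside the matching trace view and vice versa.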
That said, Fluent Bit and Vector did not die in 2026. OTel's log receiver is strong, but Fluent Bit is more mature at tailing and parsing container log files. Real production splits two ways: Fluent Bit reads files and pushes OTLP to a Collector, or the Collector's filelogreceiver reads files directly.
4.4 Profiles
After CNCF incubation in 2024, profiling became the fourth signal. Continuous profiling joined OTel alongside traces, metrics, and logs.
The technical core: a pprof-compatible schema lands in OTLP. The reason this matters — Grafana Pyroscope, Polar Signals Parca, and other existing continuous-profiling tools, once they accept OTLP/profiles, let you ship all four signals through one pipeline.
Status as of May 2026:
- Wire protocol (OTLP/profiles): near-stable beta.
- SDK support: parts of Go, Python, and Java auto-instrumentation experimentally emit profile signals.
- Collector support: `otlpreceiver` and `otlpexporter` accept the profiles signal. Some backends do not yet receive it.
- eBPF: Parca and Pyroscope generate pprof via eBPF and ship it over OTLP.
Close to GA but not complete. When you start, pin SDK and Collector versions that can enable profiles.
5. Collector Architecture — The nginx of Observability
The OpenTelemetry Collector is the single most important component in the OTel ecosystem. Its role in an observability pipeline is what nginx is to HTTP traffic — receive, transform, route.
5.1 Three component types
+-------------+ +-------------+ +-------------+
| Receivers | ---> | Processors | ---> | Exporters |
+-------------+ +-------------+ +-------------+
otlp batch otlp
prometheus memory_limiter prometheusremotewrite
filelog attributes elasticsearch
jaeger k8sattributes loki
zipkin tail_sampling tempo
kafka filter kafka
transform
- Receivers — adapters that take data in. OTLP is the default, but you can take Prometheus scrapes, Jaeger, Zipkin, Fluent Forward, Kafka, SQS — anything.
- Processors — transform, filter, and enrich. `batch` is essentially required, `memory_limiter` is a safety net, `k8sattributes` automatically attaches Kubernetes metadata, and `tail_sampling` delays the sampling decision until the trace is complete.
- Exporters — push to backends. OTLP is the default, but plenty of backend-specific exporters exist.
5.2 Core vs Contrib
Two distributions.
- `otelcol` (core) — only the most essential components. Easier to security-audit; smaller binary.
- `otelcol-contrib` (contrib) — all community components. 99% of production runs contrib.
Recommendation: start with contrib. Once operations stabilize, use OpenTelemetry Collector Builder (ocb) to build a custom image with exactly the components you need.
5.3 The concept of a pipeline
Collector config is grouped by pipelines. One pipeline per signal type.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/loki]
The same receiver can feed multiple pipelines, and fan-out / fan-in are free.
6. Production Collector Setup — sidecar → gateway → backend
The most common real production pattern is a two-tier layout.
+----------------+ OTLP +-----------+ OTLP +-----------+
| App + SDK | -------> | Collector |---------->| Collector |
| (Pod sidecar) | | (sidecar) | | (gateway) |
+----------------+ +-----------+ +-----------+
|
| OTLP / proprietary
v
+------------------+
| Tempo / Jaeger / |
| Honeycomb / etc. |
+------------------+
- sidecar / agent Collector (per-pod sidecar or node-level DaemonSet) — receives locally, batches, retries. The app sends OTLP and gets a fast response so it can get on with its work.
- gateway Collector (Deployment) — aggregates cluster-wide for tail sampling, metadata enrichment, routing, and fan-out to multiple backends.
- backend — Tempo / Jaeger / SigNoz / Honeycomb / Datadog. Wherever OTLP terminates.
6.1 sidecar / agent Collector config
# otelcol-agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'self-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:8888']

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.deployment.name
        - k8s.cluster.uid
  batch:
    timeout: 200ms
    send_batch_size: 8192

exporters:
  otlp/gateway:
    endpoint: otelcol-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/gateway]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/gateway]
6.2 gateway Collector config (with tail sampling)
The gateway usually receives the full trace and decides — which traces to keep, where to send them.
# otelcol-gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-10pct
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
  batch:
    timeout: 1s
    send_batch_size: 16384

exporters:
  otlphttp/tempo:
    endpoint: http://tempo-distributor.monitoring.svc.cluster.local:4318
  prometheusremotewrite:
    endpoint: http://mimir-distributor.monitoring.svc.cluster.local/api/v1/push
  otlphttp/loki:
    endpoint: http://loki-distributor.monitoring.svc.cluster.local:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
6.3 Operational tips
- memory_limiter goes first. If memory blows up every other processor is useless.
- tail sampling only at the gateway. Do it at the agent and traces fracture.
- k8sattributes belongs on the agent. The gateway may not see pod info.
- Tune batch size and timeout to backend capacity. Too small and RPS explodes; too large and latency climbs.
- Scrape the Collector's own metrics. `:8888/metrics` exposes `otelcol_processor_dropped_spans` and friends.
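The batch-tuning tip boils down to a size-or-timeout flush loop: export when the buffer reaches `send_batch_size` or when `timeout` elapses, whichever comes first. A stdlib-only model of that logic (a sketch of the idea, not Collector code):

```python
import time

class Batcher:
    """Flush when the batch is full or the timeout has elapsed."""

    def __init__(self, send_batch_size=8192, timeout_s=0.2, export=print):
        self.size = send_batch_size
        self.timeout = timeout_s
        self.export = export          # downstream exporter callback
        self.buf = []
        self.deadline = time.monotonic() + timeout_s

    def add(self, item):
        self.buf.append(item)
        if len(self.buf) >= self.size or time.monotonic() >= self.deadline:
            self.flush()

    def flush(self):
        if self.buf:
            self.export(self.buf)     # one request instead of len(buf)
        self.buf = []
        self.deadline = time.monotonic() + self.timeout
```

The tradeoff is visible in the two knobs: a smaller `send_batch_size` means more, smaller export requests (RPS pressure on the backend); a longer `timeout_s` means fewer requests but more latency added before data is visible.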
7. Per-Language Auto-Instrumentation — How Automatic Is It
One of OTel's biggest selling points is auto-instrumentation. Without touching a line of code, spans appear for HTTP handlers, DB drivers, and gRPC clients.
Maturity varies sharply by language.
7.1 Java — best-in-class
The opentelemetry-javaagent.jar is effectively the industry standard. Using JVM's -javaagent mechanism it rewrites bytecode at runtime and auto-instruments more than 100 libraries (Spring, Hibernate, Apache HttpClient, Kafka, JDBC, Servlet, Reactor, gRPC, AWS SDK, …).
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=order-service \
-Dotel.exporter.otlp.endpoint=http://otelcol-agent:4317 \
-Dotel.exporter.otlp.protocol=grpc \
-Dotel.resource.attributes=deployment.environment.name=prod \
-jar app.jar
That is it. Zero code changes. Spring Boot controllers, every JDBC call, every Kafka producer and consumer span shows up automatically. Trace context propagation is automatic too.
Why Java is best-in-class — JVM allows the freest runtime bytecode manipulation, the OTel Java team is the largest, and the know-how of the Datadog Java agent merged straight into OTel.
7.2 Python — opentelemetry-instrument
Python auto-instruments via the opentelemetry-instrument launcher and the opentelemetry-distro package. It wraps libraries via monkey-patching.
pip install opentelemetry-distro opentelemetry-exporter-otlp \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-requests \
opentelemetry-instrumentation-psycopg2
opentelemetry-bootstrap -a install # auto-detect installed libs
OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol-agent:4317 \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=prod \
opentelemetry-instrument python app.py
Supported libraries include Flask, Django, FastAPI, Starlette, requests, urllib3, httpx, psycopg2, asyncpg, SQLAlchemy, Redis, pymongo, celery, kafka-python, boto3. Not as deep as Java, but it covers the everyday libraries.
7.3 Node.js — require-hook based
Node auto-instruments through @opentelemetry/auto-instrumentations-node, which installs a require hook.
// tracing.js — must load before the app entry point
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc')
const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otelcol-agent:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})
sdk.start()
Run it:
node --require ./tracing.js app.js
In ESM you need the --import flag and a hook loader (the register() API in Node 20.6+). CommonJS environments still auto-instrument more smoothly.
Supported libraries: Express, Koa, Fastify, Hapi, http, https, pg, mysql, redis, mongoose, ioredis, kafka.js, AWS SDK, GraphQL, etc.
7.4 Go — the compile-time instrumentation revolution
Go was an OTel weakness for a long time. Go is hard to monkey-patch, has no vtable, and runtime interception is messy. For years Go meant manual instrumentation only.
Two paths opened in 2024–2025.
Path 1: code injection at go build time — otel-go-instrumentation
go install github.com/open-telemetry/opentelemetry-go-instrumentation/cli@latest
# Inject auto-instrumentation at build time
otel-instrument go build -o app ./...
The build tool generates wrappers around standard library and popular package calls (net/http, gRPC, database/sql, etc.) and injects them. Zero source changes.
Path 2: outside the process via eBPF — Beyla
Even build-time injection is skipped. Beyla runs as a separate process and uses eBPF to observe syscalls and sockets in the kernel, producing traces. More on this in Section 8.
7.5 .NET, Ruby, PHP, Rust
- .NET — the `OpenTelemetry.AutoInstrumentation` NuGet package. Rewrites IL via the CoreCLR profiler API. Second-most mature after Java.
- Ruby — the `opentelemetry-instrumentation-all` gem. Monkey-patch based. Rails, Sinatra, Rack, etc.
- PHP — `open-telemetry/opentelemetry-auto-laravel` and friends. Sometimes needs an OPcache extension.
- Rust — virtually no auto-instrumentation. The OTel adapter for the `tracing` crate is the standard manual approach.
7.6 Limits of auto-instrumentation
Auto-instrumentation only sees the boundaries — HTTP, DB, message, RPC. The meaningful business units — "create order", "validate payment", "decrement inventory" — are invisible.
The right answer in production OTel is auto-instrumentation plus hand-coded business spans. Auto-instrumentation catches the infrastructure edges; manual spans add domain meaning.
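In practice the combination looks like this: auto-instrumentation opens the HTTP server span around the handler, and inside it you open named child spans for each business step. The sketch below mocks the span API with a stdlib context manager so the shape is visible; real code would use `start_as_current_span` from the OTel SDK, and `span` here is a toy stand-in:

```python
import time
from contextlib import contextmanager

trace_log = []  # collected (name, duration_s, attributes) tuples

@contextmanager
def span(name, **attributes):
    """Toy stand-in for tracer.start_as_current_span(name)."""
    start = time.monotonic()
    try:
        yield attributes
    finally:
        trace_log.append((name, time.monotonic() - start, attributes))

def create_order(order_id):
    # The HTTP server span around this handler comes from
    # auto-instrumentation; these child spans add domain meaning.
    with span("validate-payment", order_id=order_id):
        pass  # call the payment service here
    with span("decrement-inventory", order_id=order_id):
        pass  # update stock here
```

The resulting trace shows the infrastructure edge (the HTTP span) and the domain steps ("validate-payment", "decrement-inventory") as one tree, which is the payoff of the hybrid approach.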
8. eBPF Auto-Instrumentation — Traces Without an SDK
One of the largest shifts in 2024–2025. eBPF makes traces appear without any SDK installed.
8.1 How it works
An eBPF auto-instrumentation tool runs as a separate process and hooks into the kernel's syscalls and socket events. It directly observes HTTP requests, gRPC calls, and DB connections.
- The app is a plain binary. No SDK, no javaagent.
- An eBPF program hooks `accept`, `connect`, `read`, `write`, etc., and decodes the packets.
- Decoded transactions are sent over OTLP to a Collector.
8.2 The tools
- Grafana Beyla — Grafana Labs' open-source project. Watches Go, Java, Node, Python, and Rust uniformly. Emits both traces and metrics. GA in 2024.
- Coroot — full-stack observability on top of Beyla — collects via eBPF and ships its own UI. OTel-compatible.
- OpenTelemetry eBPF Collector — the official OTel project. Originally built by Splunk and donated. Watches system metrics and network-level traces.
- Pixie (Pixie Labs) — Kubernetes-only. eBPF plus its own query language. OTel output supported.
- Cilium Hubble — focused on the network layer. Cilium's observability component.
8.3 Strengths of the eBPF path
- Zero SDK lines. Tracing works for legacy binaries and untouchable third-party services.
- Zero instrumentation-miss risk. Visible even if a library version is out of sync with OTel.
- Language-agnostic. One Beyla yields uniform Go, Java, Node, Python, and Rust traces.
- Low operational pressure. No redeploy required.
8.4 Limits of the eBPF path
- Distributed trace context propagation is hard. eBPF can read `traceparent` in HTTP headers, but cannot follow that context into function-level calls inside the app. Result — service-level traces look fine, function-level graphs are thin.
- Encrypted traffic. To see HTTP/2 inside TLS you need uprobes hooking library functions. Modern tools support this but the compatibility matrix is narrow.
- Kernel privileges required. Pods need `CAP_BPF` or privileged mode. Security policy may block it.
- Zero business logic. Domain semantics like "create order" are missed.
8.5 eBPF + SDK hybrid
The right production answer is eBPF plus SDK.
- eBPF — the infrastructure picture (service-to-service calls, DB queries, external API calls, network latency).
- SDK — business meaning (domain spans, custom metrics, business logs).
Tying the two datasets together with the same trace_id is hard. In practice "eBPF trace graph + SDK trace graph" often run side by side without merging. This is the most actively improved area as of May 2026.
9. Resource Detection
Every piece of OTel data is shipped attached to a Resource — metadata like service.name=order-service, deployment.environment.name=prod, host.name=node-01, k8s.pod.name=order-69b.
You can set all of it by hand, but the OTel SDK and Collector include automatic resource detectors.
9.1 Kinds of detectors
- Environment variables — `OTEL_RESOURCE_ATTRIBUTES=service.name=order,deployment.environment.name=prod`.
- Process — pid, command, runtime version.
- Host — hostname, OS, architecture.
- Container — `container.id` (extracted from cgroup).
- Kubernetes — env injection via the `downwardAPI`, or the Collector's `k8sattributes` processor enriching by pod IP.
- Cloud — AWS EC2/ECS/EKS/Lambda, GCP GCE/GKE/Cloud Run, Azure VM/AKS — detect instance metadata via IMDS.
9.2 Precedence
The OTel spec defines an order: SDK defaults are overridden by environment variables, which are overridden by explicit settings in code. Cloud detectors run automatically at SDK startup.
9.3 Production pattern
env:
  - name: OTEL_SERVICE_NAME
    value: order-service
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  # $(VAR) is Kubernetes dependent-env expansion; POD_NAME and NODE_NAME
  # are defined above, VERSION is assumed injected by the deploy pipeline.
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment.name=prod,service.version=$(VERSION),k8s.pod.name=$(POD_NAME),k8s.node.name=$(NODE_NAME)"
Cleaner: hand it off to the Collector's k8sattributes processor. The SDK only sets service.name and the rest is enriched by the Collector via pod IP → API server lookup.
10. From Vendor SDK to OTel SDK — The Migration Story
The most common migration pattern of 2024–2025.
10.1 Scenario — Datadog dd-trace to OTel
You already ship traces via Datadog dd-trace and want to move to OTel. Two paths.
Path 1: dd-trace's OTel API compatibility mode
dd-trace itself can accept the OTel API. You write code against the OTel API; dd-trace handles the wire format.
[Code: OTel API] -> [dd-trace agent: accepts OTel API] -> [Datadog backend: dd-trace format]
Upside: incremental migration. Switch code to the OTel API now, switch the backend later. Downside: until you peel dd-trace off, you have not gained much.
Path 2: OTel SDK direct emit
[Code: OTel API] -> [OTel SDK] -> [OTLP] -> [OTel Collector] -> [Datadog's OTLP endpoint, or swap the backend]
Upside: real OTel. Free to swap backends. Downside: bigger jump. A gap appears wherever dd-trace had auto-instrumentation that OTel still lacks.
10.2 Staged migration order
- Stand up the OTel Collector first. Run alongside the dd-trace agent.
- Adopt the OTel SDK in new services. No new dd-trace.
- Migrate traces, then metrics, then logs. Traces are smoothest; logs are the hardest for dashboard compatibility.
- Rewrite dashboards. Semantic conventions differ (`http.status_code` vs `http.response.status_code`). Do new dashboards first; do not flip all at once.
- Remove the dd-trace agent. Only after every service is on OTel.
10.3 Honest tradeoffs
OTel SDK is not better than vendor agents on every axis.
| Aspect | OTel SDK | Vendor SDK (e.g. dd-trace) |
|---|---|---|
| Auto-instrumentation breadth | Good (Java best-in-class) | Very broad (long accumulation) |
| Runtime overhead | Generally a touch heavier | Often more tuned, sometimes lighter |
| Backend flexibility | Very large | Locked to one vendor |
| Semantic consistency | Standard semantic conventions | Vendor-specific names |
| Diagnostics, support | Community | Vendor SRE support |
| AI/ML workflows | New (OTel GenAI conventions) | Some vendors are ahead |
In short, OTel trades a little runtime cost to unlock the backend. Acceptable? Usually yes — the value of backend freedom typically dwarfs SDK micro-tuning gains.
11. GenAI Semantic Conventions — The Next Area to Lock
The most active area in the OTel camp through 2025–2026. Semantic conventions for tracing LLM calls are right before lock.
Core attributes (beta, near stable):
- `gen_ai.system` — `openai`, `anthropic`, `google`, `ollama`, ...
- `gen_ai.request.model` — `claude-3.5-sonnet`, `gpt-4o`, ...
- `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` — cost and rate-limit tracking.
- `gen_ai.response.finish_reasons` — `stop`, `length`, `tool_calls`, ...
- `gen_ai.operation.name` — `chat`, `tool_call`, `embedding`, `text_completion`.
- Events — per-message inputs and outputs (optional; a cost-and-privacy tradeoff).
LangChain, LlamaIndex, OpenLLMetry, Arize Phoenix all track these conventions. You will see LLM-workflow cost, latency, and failure rate in a standardized shape. v1 stable lock in the next quarter or two is likely.
12. Common Mistakes and Anti-Patterns
Mistakes seen often in production.
12.1 SDK pushing straight to the backend
App SDK ----(OTLP)----> Datadog / Honeycomb
Works in small environments. In large ones it fails — you cannot change backend policy (sampling, routing, enrichment) without redeploying the app. Always insert a Collector layer.
12.2 Losing error traces to head sampling
If the SDK head-samples at 1%, 99% of error traces vanish. Errors must be preserved, so use tail sampling (in the Collector).
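The gateway policy from Section 6.2 reduces to a small decision function over a completed trace: keep every error, keep everything slow, probabilistically sample the rest. A sketch, assuming a trace is a list of span dicts with `status` and `duration_ms` fields (field names are illustrative):

```python
import random

def keep_trace(spans, slow_ms=1000, sample_pct=10, rng=random.random):
    """Tail-sampling decision over a completed trace.

    Mirrors the gateway policies: errors and slow requests are
    always kept; everything else is sampled at sample_pct percent.
    """
    if any(s["status"] == "ERROR" for s in spans):
        return True          # errors are never dropped
    if any(s["duration_ms"] >= slow_ms for s in spans):
        return True          # slow requests are never dropped
    return rng() * 100 < sample_pct
```

The key property head sampling cannot give you: the decision runs after the whole trace is known, so "is there an error span anywhere in this trace" is answerable.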
12.3 Generating trace_id and span_id by hand
We saw code that generated a random UUID and stuffed it in as the trace_id. The length may be right (16 bytes), but ignoring the propagated parent context fragments every cross-service trace. Read the active span's context from the OTel API (e.g. getCurrentSpan().spanContext()) instead of minting your own IDs.
12.4 Creating a new Tracer per request
// bad
function handler(req) {
  const tracer = trace.getTracer('app') // looks like a fresh one each time
  tracer.startSpan(...)
}
getTracer is cached, but pulling it once at module top is clearer about intent.
12.5 OTLP without batch
Without a batch processor, every span, metric point, and log record can turn into its own tiny export request, and backend RPS scales with your telemetry volume. The batch processor is essentially required.
12.6 PII leaks into attributes
Putting http.request.body or user.email directly into attributes lands plain-text PII in the observability backend. Mask via the attributes processor or drop it.
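What the `attributes` processor does here can be modeled as one pass over the attribute map: drop anything on a deny-list, hash identifiers you still need to join on. The key names below are illustrative examples, not an official list:

```python
import hashlib

# Illustrative deny-lists -- tune these to your own compliance rules.
DROP_KEYS = {"http.request.body"}
HASH_KEYS = {"user.email", "user.id"}

def scrub(attributes: dict) -> dict:
    """Drop raw payloads and replace identifiers with stable hashes."""
    out = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # never export raw request bodies
        if key in HASH_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:16]
            out[key] = f"sha256:{digest}"  # still joinable, not readable
        else:
            out[key] = value
    return out
```

Hashing rather than dropping keeps the attribute usable for grouping ("how many requests did this user make") without landing the raw value in the backend; note that unsalted hashes of low-entropy values remain guessable, so dropping is the safer default.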
12.7 Computing P99 from a fixed-bucket Histogram
A default Histogram with 5ms / 10ms / 50ms / 100ms / 500ms boundaries gives a sloppy P99. Default to ExponentialHistogram.
12.8 A single Collector instance as SPOF
Run the agent as a DaemonSet and the gateway as a multi-replica Deployment behind an HPA. Always multi-instance. If a lone gateway dies, every signal in that window is gone.
Epilogue — What to Do Once the Standard Is Laid Down
OpenTelemetry sits in May 2026 in the spot of having won the standardization war. But "do you use OTel" is no longer an interesting question. The interesting ones are these:
- How well did you deploy it. Two-tier agent + gateway, or single? Is there tail sampling? Is memory_limiter the first processor?
- Do you follow semantic conventions. Not just trusting auto-instrumentation, but tagging business spans by the conventions too.
- The balance of auto vs manual. Infrastructure edges automatic, business meaning manual.
- Where you put eBPF. Only for legacy and third-party, or as the main path where the SDK side fails?
- Are you ready for the profiles signal. Do your backend and SDK versions accept it?
Adoption checklist
- Is the Collector deployed as a two-tier agent + gateway?
- Is memory_limiter the first processor in every pipeline?
- Is tail sampling in the gateway and are errors and slow requests unconditionally preserved?
- Does the k8sattributes processor run on the agent?
- Does every pipeline include the batch processor?
- For instrumented languages, is it via javaagent, opentelemetry-instrument, require hook, or compile-time injection?
- Do your business spans follow the semantic conventions?
- Does the attributes processor mask PII out of attributes?
- Do you scrape the Collector's self-metrics (`otelcol_processor_dropped_spans`, etc.)?
- Is ExponentialHistogram the default?
- Is your backend ready for GenAI semantic conventions?
Anti-pattern summary
- SDK direct to backend — insert a Collector layer.
- Head sampling only — loses error traces; add tail sampling.
- No batch processor — one tiny export request per telemetry item; backend RPS explodes.
- memory_limiter last — pointless if memory blows; put it first.
- PII in raw attributes — mask via attributes processor.
- Auto-instrumentation only, zero business spans — only the infrastructure edge is visible.
- Single Collector instance — SPOF. Multi-instance required.
- Custom attributes ignoring semantic conventions — dashboard compatibility breaks.
- Traces in OTel, metrics and logs in a different pipeline — biggest value is unifying all three signals.
- Enabling profiles in the SDK when the backend cannot receive it — data goes nowhere.
What is next
Next-piece candidates:
- Operating a Collector at depth — pipeline load, drops, and the trap of metric cardinality.
- OTel Profiles in practice — catching hotspots in a Go binary with eBPF.
- One dashboard for LLM cost, latency, and failure via GenAI semantic conventions.
References
- OpenTelemetry official site
- OpenTelemetry GitHub org
- OTLP Specification
- OTLP protobuf schema
- Semantic conventions v1
- Semantic conventions GitHub
- OpenTelemetry Collector
- otelcol-contrib repository
- OpenTelemetry Collector Builder (ocb)
- Java auto-instrumentation
- Python instrumentation
- Node.js auto-instrumentations
- Go auto-instrumentation (compile-time)
- .NET auto-instrumentation
- Grafana Beyla — eBPF auto-instrumentation
- Beyla GitHub
- Coroot
- OpenTelemetry eBPF Collector
- OpenTelemetry Profiles specification
- Profiles data model
- Grafana Tempo
- Jaeger
- SigNoz
- Honeycomb OTel adoption guide
- Datadog OTLP intake
- W3C Trace Context
- GenAI semantic conventions (beta)
- OpenLLMetry
- Arize Phoenix
- CNCF OpenTelemetry graduation announcement