Network & Service Observability 2026 Deep Dive — eBPF · Cilium Hubble · Pixie · Pyroscope · Grafana Loki + Tempo + Mimir · Netdata · OpenTelemetry

Prologue — In 2026, Observability Has to Answer "Why?"

In 2015, observability arrived with a three-pillar vocabulary: metrics, logs, traces. By around 2023, continuous profiling was widely being described as the fourth pillar. And in 2026 the real shift happened somewhere else entirely: instrumentation disappeared.

Five years ago, attaching distributed tracing to a Java app meant importing the OpenTelemetry SDK, annotating every method, and threading context propagation through the code. The 2026 default is to let the kernel do it via eBPF. Cilium Hubble sees every packet in the cluster. Pixie sees every HTTP/gRPC/SQL call after one DaemonSet install. Beyla, Coroot, and Caretta pull out the golden signals without a single line of code change.

That does not mean the SDK era is over. OpenTelemetry has become one of the most active projects in the CNCF, and OTLP is now the de facto wire protocol. Think of it this way: eBPF gives you "80% of visibility for free", and OTel SDKs fill in "the remaining 20% of business meaning". The real shape of the 2026 stack is a hybrid of eBPF and OTel.

This post draws that map — the four pillars, every eBPF tool worth knowing (Cilium / Hubble / Tetragon / Pixie / Inspektor Gadget / Coroot / Beyla / Caretta), Grafana LGTM (Loki + Tempo + Mimir + Pyroscope), SaaS giants like Datadog/New Relic/Dynatrace, network-specific tools (Suzieq, ntopng, ThousandEyes), and how Korean and Japanese companies actually use them in production.

Observability is a superset of monitoring. If monitoring is "watching whether a known metric crosses a threshold", observability is "the property of a system that lets you answer questions you didn't think to ask in advance". The 2026 difference isn't the tools — it's a stack design that lets you ask those questions.

What this post covers:

The four pillars (metrics / logs / traces / profiles) and golden signals
The eBPF revolution — Cilium 1.16, Hubble, Tetragon, Pixie
OpenTelemetry 2026 — Collector, OTLP, auto-instrument
Metrics stack — Prometheus 3.0, VictoriaMetrics, Mimir
Logs stack — Loki 3, Elastic, Vector, Quickwit, OpenObserve, SigNoz
Traces stack — Tempo 2, Jaeger 2, Zipkin, Honeycomb
Continuous profiling — Pyroscope, Parca, Polar Signals
Network observability — Suzieq, Skydive, ntopng, ThousandEyes
RUM & synthetic monitoring — Cloudflare, Checkly, Grafana Synthetic
APM comparison — Datadog, New Relic, Dynatrace, AppDynamics
K8s observability — Prometheus Operator, k9s, Lens
Service mesh + observability — Kiali, Linkerd dashboard
DevSecOps + observability — Falco + OTel
Storage backends — VictoriaMetrics, Mimir, ClickHouse, MinIO
Cost model — Datadog $35-70/host vs self-host LGTM
Korean adoption — NCsoft Pixie, Coupang Datadog, Naver OTel, Kakao Grafana
Japanese adoption — Mercari, LINE Yahoo, CyberAgent
SLO/SLI and error budget operations
Alerting and PagerDuty/Opsgenie/Incident.io
The shape of AI-native observability
Adoption roadmap — where to start
References

1. The Four Pillars and Golden Signals — What to Measure

The starting question for any observability project is "what do we look at?" The 2026 consensus is four pillars.

Metrics — Time series numbers. Counters, gauges, histograms like http_requests_total and cpu_usage_seconds_total. The cheapest and longest-retained of all.
Logs — Event text. Rich context at the moment something happens, but expensive and slow to search.
Traces — Causality across a distributed request. Shows how a single user call hops from service A → B → C → DB.
Profiles — Function-level breakdown of CPU, memory, lock contention. Tracks continuously which function is consuming the most resources.

Mapping these to Google's SRE four golden signals is the canonical operational view.

Golden Signal	Definition	Example Metric
Latency	Time to process a request	p50/p95/p99 response time
Traffic	Load on the system	RPS, QPS, MB/s
Errors	Failure rate	5xx ratio, exception count
Saturation	Resource fullness	CPU utilisation, queue depth, disk IOPS
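
To make the table concrete, here is a minimal Python sketch that derives three of the four signals from a batch of request records. The Request type, the nearest-rank percentile, and the sample data are illustrative assumptions. Saturation is left out because it comes from resource metrics rather than from request records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds):
    """Latency, traffic, and errors for one observation window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
    }


sample = [Request(20, 200), Request(35, 200), Request(50, 200), Request(900, 503)]
print(golden_signals(sample, window_seconds=2))
```

In production these numbers usually come from histogram metrics rather than raw records, but the definitions are the same.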

USE vs RED, revisited — Brendan Gregg's USE (Utilisation / Saturation / Errors) is resource-oriented; Tom Wilkie's RED (Rate / Errors / Duration) is request-oriented. Both are cousins of the golden signals, and the lens you start with shapes what your first dashboard looks like.

2. The eBPF Revolution — Cilium, Hubble, Tetragon, Pixie

The real inflection point of 2026 observability is eBPF (extended Berkeley Packet Filter): a safe in-kernel VM that can hook packets, syscalls, and socket events. Its effect on observability has been decisive.

What eBPF changed:

Language-agnostic instrumentation — the kernel sees every syscall whether your app is Go, Java, Python, or Rust.
Zero-code-change — install one DaemonSet and you're done.
Low overhead — kernel-level, typically 1-3% CPU.
L7 visibility — eBPF probes capture socket traffic, and HTTP/gRPC/SQL protocol parsers (usually running in user space or in a node-level proxy) surface method, path, and status code.
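
As a rough illustration of the L7 step, the sketch below parses the first line of a captured HTTP/1.x payload. It assumes an eBPF probe has already copied the socket bytes out of the kernel. The function names are hypothetical, and real parsers also have to handle HTTP/2, gRPC framing, TLS, and partial reads.

```python
def _first_line(payload: bytes) -> str:
    """Return the first CRLF-terminated line of a captured payload as text."""
    return payload.split(b"\r\n", 1)[0].decode("ascii", errors="replace")


def parse_http_request(payload: bytes):
    """Extract method and path from an HTTP/1.x request line, or None if it is not HTTP."""
    parts = _first_line(payload).split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/1."):
        return None
    return {"method": parts[0], "path": parts[1]}


def parse_http_response(payload: bytes):
    """Extract the status code from an HTTP/1.x status line, or None if it is not HTTP."""
    parts = _first_line(payload).split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/1.") or not parts[1].isdigit():
        return None
    return {"status": int(parts[1])}


print(parse_http_request(b"GET /api/users?id=7 HTTP/1.1\r\nHost: example\r\n\r\n"))
print(parse_http_response(b"HTTP/1.1 503 Service Unavailable\r\n\r\n"))
```

Because the parser only looks at bytes on the wire, it works the same way whether the service was written in Go, Java, Python, or Rust.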

Key tools:

Cilium 1.16 (Isovalent, acquired by Cisco in 2024) — Kubernetes networking + observability. CNI, service mesh (Cilium Service Mesh), security, and visibility in one package.
Cilium Hubble — flow visibility layer on top of Cilium. Run hubble observe from the Hubble CLI to see in real time which pod talks to which, and on what port.
Tetragon — Cilium's runtime-security sibling. Tracks which process opened which file and called which syscall.
Pixie (acquired by New Relic, CNCF Sandbox) — install one DaemonSet on a K8s cluster and every HTTP/gRPC/Postgres/Redis call is captured automatically. Comes with its own query language, PxL.
Inspektor Gadget — eBPF-based debugging toolkit for K8s. Commands like kubectl gadget trace dns give instant traces.
Coroot, Caretta, Beyla — next-gen tools that auto-generate service maps and golden signals via eBPF. Beyla comes from Grafana Labs.
bpftrace / BCC Tools — Brendan Gregg classics. Single-purpose tracers like execsnoop, tcptracer, biolatency.

# Install Cilium and enable the Hubble UI with the cilium CLI
cilium install --version 1.16.0
cilium hubble enable --ui
cilium hubble ui
# Port-forwards the UI and opens http://localhost:12000 for real-time flow visualisation

The Pixie magic — typically you have to embed an OTel SDK in every service to get distributed traces. With Pixie, installing one PEM (Pixie Edge Module) DaemonSet is all it takes. In five minutes you have the HTTP/gRPC call graph for all your microservices.
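
Conceptually, a call graph like this is an aggregation over observed flows. The sketch below is not Pixie's or Hubble's implementation; it assumes hypothetical flow records with source, destination, and status fields and folds them into per-edge call and error counts.

```python
from collections import defaultdict


def build_service_map(flows):
    """Aggregate flow records into edges keyed by (source, destination)."""
    edges = defaultdict(lambda: {"calls": 0, "errors": 0})
    for flow in flows:
        edge = edges[(flow["source"], flow["destination"])]
        edge["calls"] += 1
        if flow["status"] >= 500:  # count server-side failures per edge
            edge["errors"] += 1
    return dict(edges)


# Hypothetical flows, as an eBPF agent might report them
flows = [
    {"source": "frontend", "destination": "cart", "status": 200},
    {"source": "frontend", "destination": "cart", "status": 503},
    {"source": "cart", "destination": "postgres", "status": 200},
]

for (src, dst), stats in build_service_map(flows).items():
    print(f"{src} -> {dst}: {stats['calls']} calls, {stats['errors']} errors")
```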

3. OpenTelemetry 2026 — The De Facto Standard

OpenTelemetry (OTel) is the second most active project in the CNCF after Kubernetes. As of 2026, OTel can fairly be called "the wire standard for observation data."

Components:

OTel SDKs — official support for C++ / .NET / Erlang-Elixir / Go / Java / JS / PHP / Python / Ruby / Rust / Swift, eleven languages total
OTel Collector — the data pipeline: receivers → processors → exporters
Auto-instrumentation — -javaagent for Java, opentelemetry-instrument for Python, single-flag install
OTLP protocol — gRPC + HTTP, single transport for metrics, logs, and traces, with profiles as the newest signal (check its current stability level before relying on it)

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  # Loki 3.x ingests OTLP natively; the dedicated loki exporter is deprecated
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
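
The batch processor in the pipeline above buffers telemetry and hands it to the exporter when either a size limit or the timeout is reached. The sketch below shows that idea only. It is not the Collector's implementation, and the class name, parameters, and injected clock are assumptions made for illustration.

```python
import time


class BatchProcessor:
    """Buffers items and exports them when the batch is full or too old."""

    def __init__(self, export, max_batch_size=3, timeout_s=10.0, clock=time.monotonic):
        self.export = export              # callable that receives a list of items
        self.max_batch_size = max_batch_size
        self.timeout_s = timeout_s
        self.clock = clock                # injectable so the timeout can be tested
        self.buffer = []
        self.first_item_at = None

    def consume(self, item):
        if not self.buffer:
            self.first_item_at = self.clock()
        self.buffer.append(item)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def tick(self):
        """Call periodically: flushes once the oldest buffered item exceeds the timeout."""
        if self.buffer and self.clock() - self.first_item_at >= self.timeout_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.export(list(self.buffer))
            self.buffer.clear()


processor = BatchProcessor(export=print, max_batch_size=3)
for span in ["span-1", "span-2", "span-3"]:
    processor.consume(span)  # the third item fills the batch and triggers an export
```

Batching trades a little latency for far fewer export calls, which is why it sits in front of every exporter in the config.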

Why OTel won — five years ago, Datadog Agent, New Relic Agent, Jaeger Client, Zipkin Brave each pushed their own SDK. Migrating a company from Datadog to New Relic was a multi-month project. OTel broke that lock-in with a simple promise: "standardise the data, free the backend." By 2026 even Datadog treats OTLP as a first-class citizen.

4. Metrics Stack — Prometheus 3.0 and Friends

Metrics is the oldest and most mature pillar. The 2026 standard is unambiguously the Prometheus ecosystem.

Prometheus 3.0 (released November 2024) — UTF-8 labels, native histograms, built-in OTLP receiver
VictoriaMetrics — Prometheus-compatible high-performance alternative. Single binary, 10x compression
Grafana Mimir — horizontally scalable Prometheus backend. Long-term storage on AWS S3 / GCS / MinIO
Thanos — a separate CNCF project in the same space as Mimir, not its predecessor. Prometheus + S3 + multi-region global view
Cortex — Mimir's ancestor. Still found in some installs
Datadog / New Relic / Dynatrace — commercial metrics stacks, dashboards included

# Prometheus 3.0 scrape + OTLP receive
global:
  scrape_interval: 15s

# OTLP ingestion settings; start Prometheus with --web.enable-otlp-receiver to accept OTLP
otlp:
  promote_resource_attributes:
    - service.name
    - service.namespace
    - deployment.environment

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

VictoriaMetrics vs Mimir:

Aspect	VictoriaMetrics	Grafana Mimir
Architecture	Single binary	Distributed microservices
Storage	Local disk	S3 / GCS / MinIO
Query language	PromQL + MetricsQL	PromQL
Compression	Very high	Moderate
Operational cost	Low	Medium
Best scale	Small to mid	Multi-tenant, large

5. Logs Stack — Loki, Elastic, Vector, Quickwit

Logs is the most expensive and most error-prone pillar. The 2026 direction is clear: less indexing, more column stores.

Grafana Loki 3.x — "index only labels" log DB. Same model as Prometheus, S3 backend
Elasticsearch / OpenSearch — classic. Full-text indexing. Powerful but expensive
Quickwit — Rust-written search engine. S3-first, full-text indexing, Elastic alternative
VictoriaLogs — log DB from the VictoriaMetrics team. Fast full-text
OpenObserve — Rust-based all-in-one (metrics, logs, traces)
SigNoz — open-source observability on ClickHouse (metrics, logs, traces)
ClickHouse — column store. Adopted by Uber and Cloudflare as a log backend
Fluentd / Fluent Bit / Vector — log shippers. Vector is the Rust-based next-gen, acquired by Datadog in 2021

# Fluent Bit -> Loki pipeline example
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*

[OUTPUT]
    Name                  loki
    Match                 kube.*
    Host                  loki.observability.svc
    Port                  3100
    Labels                job=fluentbit

Loki vs Elasticsearch, in one sentence each:

Want logs treated like metrics → Loki
Full-text search, SIEM, security analytics first → Elasticsearch / OpenSearch
Rust stack, S3-first → Quickwit / VictoriaLogs
Unified metrics, logs, traces in one project → SigNoz / OpenObserve
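
The "index only labels" trade-off can be shown with a toy store. This sketch is an assumption-laden illustration, not Loki's storage engine: labels select streams through a cheap lookup, and the text filter is a brute-force scan over only the selected streams.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy log store: log lines are grouped into streams keyed by their label set."""

    def __init__(self):
        self.streams = defaultdict(list)  # frozenset of label pairs -> list of lines

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # label match: the only indexed step
                # line content is never indexed, so it is scanned at query time
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedLogStore()
store.push({"app": "api", "env": "prod"}, "GET /health 200")
store.push({"app": "api", "env": "prod"}, "POST /orders 500 timeout")
store.push({"app": "web", "env": "prod"}, "timeout talking to api")
print(store.query({"app": "api"}, contains="timeout"))
```

A full-text engine such as Elasticsearch does the opposite: it pays indexing cost on every line at write time so that the text filter becomes a lookup too.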

6. Traces Stack — Tempo 2, Jaeger 2

Distributed tracing answers the question "where did this user request get slow?"

Grafana Tempo 2.x — indexes almost nothing beyond the trace ID, stores Parquet blocks on S3, and queries them with TraceQL. Same low-index philosophy as Loki
Jaeger 2 (CNCF Graduated, full rewrite in 2024) — rebuilt on top of OTel Collector. v1-compatible
Zipkin — Twitter original. Still a fine choice when you want a simple install
Honeycomb — the originator of event-based observability. Charity Majors' influence is everywhere
Lightstep (acquired by ServiceNow) — large-scale trace analytics
AWS X-Ray / GCP Cloud Trace / Azure Monitor — cloud-native options
Datadog APM / New Relic APM / Dynatrace — commercial integrations

# Python OTel auto-instrumentation, no code changes
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument --traces_exporter otlp \
  --exporter_otlp_endpoint http://tempo:4317 python app.py

What Jaeger 2 means — back in the 1.x days Jaeger shipped its own agent, collector, and client SDKs. Jaeger 2, released in 2024, was rebuilt on the OTel Collector framework and ingests OTLP natively, while storage backends such as Cassandra and Elasticsearch remain available. That's a statement: even the trace backend is no longer a lock-in.
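
What ties spans from different services into one trace is context propagation, most commonly the W3C Trace Context traceparent header. The sketch below parses a version 00 header and derives the header a service would send on its outgoing call. The helper names are hypothetical, and OTel SDKs and auto-instrumentation normally do this for you.

```python
import re
import secrets

# version-trace_id-parent_id-flags, all lowercase hex (W3C Trace Context, version 00)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace context fields, or None when the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # forbidden values per the specification
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build the outgoing header: same trace ID, fresh span ID, sampled flag preserved."""
    ctx = parse_traceparent(header)
    if ctx is None:
        return None
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

Every hop keeps the trace ID and replaces the span ID, which is how a backend such as Tempo or Jaeger can reassemble the A → B → C → DB path.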

7. Continuous Profiling — Pyroscope, Parca

Settling in as the fourth pillar of observability since 2023, continuous profiling captures CPU / memory / lock profiles in production all the time, at 1-5% overhead.

Grafana Pyroscope — open-source continuous profiling database. Grafana Labs acquired Pyroscope in 2023 and merged it with its own Phlare project
Parca — Polar Signals' open-source eBPF-based profiler. A separate project from Pyroscope
Polar Signals Cloud — Parca's commercial hosted offering
Datadog Continuous Profiler — commercial option integrated with Datadog APM
Pyroscope OSS — self-hostable

Typical use cases:

CPU hotspots — which function burns 60% of CPU, at a glance
Memory leak hunting — alloc/free graphs over time
Lock contention — which mutex blocks the most threads

# Deploy Pyroscope to Kubernetes (Helm)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install pyroscope grafana/pyroscope \
  --namespace pyroscope --create-namespace

eBPF + Pyroscope combo — Pyroscope's eBPF profiler runs at the node level and pulls CPU profiles for every process automatically. Without touching code, you get Go, Rust, C++, and Python profiles in one place.
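
A sampling profiler ultimately produces stack samples with counts, and "which function burns the CPU" is an aggregation over them. The sketch below assumes input in the folded-stack text format used by flame graph tooling (frames joined by semicolons, then a sample count) and computes each function's share of self time. The sample data is made up.

```python
from collections import Counter


def self_time_share(folded_lines):
    """Share of samples in which each function was the leaf (on-CPU) frame."""
    samples = Counter()
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        leaf = stack.split(";")[-1]  # last frame is the one actually running
        samples[leaf] += int(count)
    total = sum(samples.values())
    return {fn: n / total for fn, n in samples.most_common()}


# Hypothetical profile: 100 samples in total
folded = [
    "main;handle_request;parse_json 60",
    "main;handle_request;db_query 30",
    "main;gc 10",
]

for fn, share in self_time_share(folded).items():
    print(f"{fn}: {share:.0%}")
```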

8. Network-Specific Observability Tools

Distinct from service observability is network observability — the visibility of the network itself.

Suzieq — network observability. Query switch / router / firewall state through a SQL-like CLI, a Python API, or a GUI
Skydive — real-time network topology + traffic visualisation
ntopng / ntop — flow analytics classic (NetFlow / IPFIX / sFlow)
Netflix Atlas — Netflix's in-memory dimensional time-series database for near-real-time operational metrics
Wireshark / tcpdump — packet capture standard. Still the final answer in the eBPF era
Cisco ThousandEyes — internet path monitoring SaaS. BGP, DNS, CDN visibility
Catchpoint / Pingdom — commercial synthetic and internet monitoring
Kentik — large-scale network analytics SaaS
Arista CloudVision — datacentre network observability

# Suzieq example: list BGP sessions that are not established
$ suzieq-cli
suzieq> bgp show state=NotEstd
namespace  hostname  vrf      peer      state
prod       leaf01    default  10.0.0.2  Active
prod       leaf03    default  10.0.0.6  Connect

Why network observability got important again — five years ago "the cloud vendor handles it" was the answer. By 2026, multi-cloud, multi-region, service mesh, and zero-trust networking are routine, and "who is talking to whom and about what?" is once again the central question.

9. RUM (Real User Monitoring) and Synthetic Monitoring

Just as important as server-side observability is what the user actually sees on screen.

RUM (Real User Monitoring):

Datadog RUM — most mature. Includes session replay
New Relic Browser — browser monitoring classic
Cloudflare Browser Insights — RUM tied to the CDN
Sentry Performance — RUM integrated with error tracking (covered in another post)
PostHog Session Replay — product analytics + session replay

Synthetic Monitoring:

Checkly — code-first synthetic monitoring. Playwright-native
Datadog Synthetics — multi-step API/browser checks
Grafana Synthetic Monitoring — bundled with Grafana Cloud
Pingdom — classic availability monitoring
UptimeRobot, Better Stack — low-cost ping monitoring

// Checkly synthetic monitor example (Playwright)
import { expect, test } from '@playwright/test'

test('homepage loads', async ({ page }) => {
  const res = await page.goto('https://example.com')
  expect(res.status()).toBeLessThan(400)
  await expect(page.locator('h1')).toContainText('Welcome')
})

The 2026 RUM standard — push Core Web Vitals (LCP, INP, CLS) over OTLP into an OTel Collector and view them in Grafana. Cloudflare's Web Analytics also reports Core Web Vitals on its free tier.
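
Once the raw values arrive, they are usually bucketed with Google's published Core Web Vitals thresholds. A minimal sketch of that rating step follows. The function name is hypothetical, LCP and INP are in milliseconds, and CLS is unitless.

```python
# (good_upper_bound, poor_lower_bound) per metric, from Google's Core Web Vitals guidance
THRESHOLDS = {
    "LCP": (2500, 4000),  # Largest Contentful Paint, ms
    "INP": (200, 500),    # Interaction to Next Paint, ms
    "CLS": (0.1, 0.25),   # Cumulative Layout Shift, unitless
}


def rate_vital(name, value):
    """Classify a single measurement as good, needs-improvement, or poor."""
    good, poor = THRESHOLDS[name]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs-improvement"
    return "poor"


for metric, value in [("LCP", 1800), ("INP", 350), ("CLS", 0.31)]:
    print(metric, value, rate_vital(metric, value))
```

Google assesses these at the 75th percentile of page loads, so a dashboard should rate the p75 of each metric rather than the mean.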

10. APM Comparison — Datadog vs New Relic vs Dynatrace

The commercial APM Big Three look much like they did a few years ago.

Aspect	Datadog	New Relic	Dynatrace
Strength	Breadth, UX	Pricing, AI	Davis AI, auto-detect
Pricing	35-70 USD per host per month	~0.30 USD per GB ingested (effective)	DPS units (complex)
OTel	First-class	First-class	First-class
Auto-instr	Very strong	Very strong	Strongest (OneAgent)
K8s	Strong	Strong	Strong
AI analysis	Bits AI	New Relic AI	Davis AI (the original)

Also-rans worth knowing:

AppDynamics (Cisco/Splunk) — enterprise
Elastic Observability — Elastic Stack-integrated
Splunk Observability (formerly SignalFx) — Splunk Cloud integrated
Honeycomb — the prestige brand of event-based observability
Logz.io / Sumo Logic — SaaS ELK and unified

The 2026 trend — fewer companies do single-vendor "all Datadog" lock-ins. More are hybrid: self-hosted Grafana LGTM for the bulk, with Datadog or New Relic only for select areas. Cost pressure is the main driver.

11. K8s Observability — Operator, k9s, Lens

Kubernetes observability is a full sub-category in its own right.

Prometheus Operator — run Prometheus the K8s-native way. ServiceMonitor, PodMonitor CRDs
kube-prometheus-stack — Helm chart that bundles Prometheus + Grafana + Alertmanager + node-exporter
kube-state-metrics — exposes K8s resource state as Prometheus metrics
node-exporter — node-level system metrics
k9s — terminal K8s client par excellence
Lens / OpenLens — GUI K8s IDE
k0s, k0sctl — a lightweight K8s distribution and its cluster bootstrap tool
Cilium + Hubble + OTel + Pixie — the full eBPF stack

# ServiceMonitor example for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s

12. Service Mesh and Observability

A service mesh is, by virtue of having a sidecar intercept all traffic, essentially an automatic trace and metric generator.

Istio + Kiali — Kiali is the standard for visualising Istio service mesh topology
Linkerd dashboard — simplicity is the selling point. CNCF Graduated
Consul Connect + Consul UI — the HashiCorp stack
Cilium Service Mesh — eBPF-based, sidecar-less mesh
Envoy Admin — the data plane behind every mesh. /stats endpoint exposes 1,000+ metrics

Sidecar-less mesh — with Istio Ambient Mode and Cilium Service Mesh going GA in 2024-2025, the "node-level data plane" is now the standard. From an observability standpoint, the key insight is that even without sidecars you get the same data.

13. DevSecOps Meets Observability

Runtime security and observability share data, but ask different questions.

Falco — eBPF/syscall runtime security (CNCF Graduated)
Tetragon — Cilium's security sibling
OPA / Gatekeeper — policy engine
Kubescape — K8s security scanner (CNCF Incubating)
Cloud SIEM — Datadog Cloud SIEM, Elastic Security, Sumo Logic Cloud SIEM
CrowdStrike Falcon / Wiz / Lacework — commercial CSPM/CNAPP

# Falco rule example — alert when a shell is spawned inside a container
- rule: Terminal shell in container
  desc: A shell was used as the entrypoint/exec point in a container
  condition: spawned_process and container and shell_procs
  output: "A shell was spawned in a container (user=%user.name container=%container.id)"
  priority: WARNING

14. Storage Backends — VictoriaMetrics, Mimir, ClickHouse, MinIO

The volume of observability data is explosive. Where and how long you store it is the central design question.

VictoriaMetrics — long-term metric storage. Very high compression
Grafana Mimir — distributed backend for Prometheus metrics. S3-first
Grafana Loki — logs. S3-first
Grafana Tempo — traces. S3-first
ClickHouse — column store. Adopted aggressively as a log/trace backend (SigNoz, OpenObserve, Cloudflare)
MinIO — S3-compatible object storage. The de facto backend for self-hosted observability
AWS S3 / GCS / Azure Blob — cloud options

Retention guidelines:

Data type	Hot	Warm	Cold
Metrics	15 days	90 days	13 months+
Logs	7 days	30 days	1 year+ (S3 IA)
Traces	3-7 days	30 days	90 days+
Profiles	7 days	30 days	usually discarded
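
As a sketch of how a lifecycle job might apply the table, the function below maps a data type and age to a tier. The boundaries are copied from the table, taking the upper end of the 3-7 day range for traces. The function name is hypothetical, and "cold" for profiles means a candidate for deletion.

```python
# (hot_days, warm_days) per data type, taken from the retention table above
RETENTION_DAYS = {
    "metrics": (15, 90),
    "logs": (7, 30),
    "traces": (7, 30),
    "profiles": (7, 30),
}


def storage_tier(data_type, age_days):
    """Return the tier a block of telemetry of this age belongs to."""
    hot, warm = RETENTION_DAYS[data_type]
    if age_days <= hot:
        return "hot"
    if age_days <= warm:
        return "warm"
    return "cold"


for data_type, age in [("logs", 3), ("logs", 20), ("metrics", 200)]:
    print(data_type, age, storage_tier(data_type, age))
```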

15. Cost Model — SaaS vs Self-Host

Observability typically consumes 5-15% of infrastructure cost. At scale, 30% is not unusual.

SaaS pricing (2026 list prices):

Datadog APM: ~35-70 USD/host/month, custom metrics separately
New Relic: 99 USD/user/month + ~0.30 USD/GB effective ingest price (the actual model is more complex)
Dynatrace: DPS (Dynatrace Platform Subscription) units, usually 80-200 USD/host/month
Honeycomb: event-volume based, free tier then per-GB
Grafana Cloud Pro: usage-based, free tier then per-series-and-GB

Self-hosted LGTM (order-of-magnitude estimate):

Assuming 100 hosts, 1 TB/day of logs, and 100 GB/day of traces:

Compute: ~10 K8s nodes, ~3,000 USD/month
S3 storage: ~500-800 USD/month (1-year retention)
Operations: 0.5+ FTE (the largest hidden cost)

Rule of thumb — under 50 hosts, SaaS is almost always cheaper. Over 200 hosts, self-hosted is almost always cheaper. In between it depends on the company. Cost-pressured Korean and Japanese companies are crossing that threshold to self-hosting quickly.
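
The arithmetic behind that rule of thumb can be sketched for the 100-host scenario above. The per-host prices and the compute and storage figures come from this section. The fully loaded engineer cost of 12,000 USD/month is purely an assumption, and the SaaS figure covers per-host APM only, without custom metrics or other usage-based charges.

```python
def saas_monthly_cost(hosts, per_host_usd):
    """Per-host SaaS list price only; usage-based add-ons are not modelled."""
    return hosts * per_host_usd


def self_hosted_monthly_cost(compute_usd, storage_usd, fte_fraction, fte_monthly_usd):
    """Infrastructure plus the share of an engineer needed to run the stack."""
    return compute_usd + storage_usd + fte_fraction * fte_monthly_usd


HOSTS = 100
FTE_MONTHLY_USD = 12_000  # assumption, not from the article

saas_low = saas_monthly_cost(HOSTS, 35)
saas_high = saas_monthly_cost(HOSTS, 70)
self_hosted = self_hosted_monthly_cost(
    compute_usd=3_000,   # ~10 K8s nodes
    storage_usd=650,     # midpoint of 500-800 USD/month
    fte_fraction=0.5,
    fte_monthly_usd=FTE_MONTHLY_USD,
)

print(f"SaaS APM:    {saas_low:,.0f}-{saas_high:,.0f} USD/month")
print(f"Self-hosted: {self_hosted:,.0f} USD/month")
```

Under these assumptions the operations line dominates the self-hosted bill at 100 hosts. Since that line grows far more slowly than the host count, the comparison flips as the fleet grows.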

16. Korean Adoption — NCsoft, Coupang, Naver, Kakao

Korea's big-tech observability story tends to follow this arc: Datadog adoption around 2020 → gradual move toward self-hosting from 2024.

NCsoft — presented Pixie and eBPF adoption at KubeCon (2023). Used for microservice tracing of game backends
Coupang — major Datadog user. Unified APM and infrastructure monitoring
Naver — OpenTelemetry adoption. Talked about building in-house metric/trace backends at DEVIEW
Kakao — self-hosted Grafana stack. Heavy use in messenger backend metric visibility
Woowa Brothers (Baemin) — Datadog + self-hosted Grafana hybrid
KakaoBank / Toss — high self-hosting share due to financial regulation. ELK + Grafana common
Daangn Market — Datadog APM + Sentry combo
Line Plus Korea — OpenTelemetry + Grafana

The common pattern — Korean companies tend to (1) start with Datadog/New Relic for fast initial visibility, and (2) move the bulk to LGTM once cost crosses a threshold while keeping RUM/Sentry as SaaS.

17. Japanese Adoption — Mercari, LINE Yahoo, CyberAgent

Japan is generally considered one step ahead of Korea in adopting OpenTelemetry and eBPF.

Mercari — major Datadog user. Presented OTel migration at SRECon 2024
LINE Yahoo — post-merger acceleration of OpenTelemetry standardisation. In-house metric backend
CyberAgent — large-scale Cilium deployment. Multiple eBPF talks at KubeCon
DeNA — Datadog + Grafana hybrid
Rakuten — major Splunk user. APM with New Relic
PayPay — AWS-native observability + Datadog
Smartbank / Kyash — fintech startups adopting OTel
Mercari ML team — Pyroscope/Parca for profiling machine learning workloads

Japanese SRE culture — the SRE community around Mercari and CyberAgent (SRE Lounge, the SRE NEXT conference) plays a major role in spreading observability best practices. That community is one reason Japan adopted OTel a beat ahead of Korea.

18. SLO/SLI and Error Budget Operations

The final destination of observability is SLOs (Service Level Objectives). Define "p99 response time is under 200 ms and availability is 99.9%", monitor it with metrics, and burn down an error budget when you miss.

Tools:

Sloth — Prometheus-based SLO generator. Define SLOs in YAML
Pyrra — SLO controller in the same category as Sloth
OpenSLO — YAML standard for SLO definitions
Nobl9 — dedicated SLO SaaS
Datadog SLO / Honeycomb SLO — APM-integrated options

# Sloth SLO definition example
version: "prometheus/v1"
service: "my-api"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        # Sloth substitutes {{.window}} with each rule's time window
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: MyApiHighErrorRate  # illustrative alert name
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
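
The numbers behind an SLO like this are simple to work out by hand. The sketch below computes the error budget for an objective, the burn rate for an observed error ratio, and how much of the budget a given burn consumes. The function names are hypothetical, and the 30-day window is an assumption.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of full outage allowed in the window before the SLO is missed."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def burn_rate(error_ratio, slo_percent):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_percent / 100)


def budget_consumed(rate, hours, window_days=30):
    """Fraction of the window's error budget used by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


print(f"99.9% over 30 days allows {error_budget_minutes(99.9):.1f} minutes of downtime")
print(f"1.44% errors is a burn rate of {burn_rate(0.0144, 99.9):.1f}")
print(f"One hour at that rate uses {budget_consumed(14.4, 1):.0%} of the budget")
```

A burn rate of 14.4 sustained for one hour uses 2% of a 30-day budget, which is the fast-burn paging threshold popularised by the Google SRE Workbook. A slower burn over a longer window is a better fit for the ticket alert.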

19. Alerting and Incident Management

The final stage of observability is alerting and incident response.

Alertmanager — Prometheus-ecosystem standard
Grafana Alerting — Grafana-integrated option (multi-datasource)
PagerDuty — SaaS standard for incident management
Opsgenie (Atlassian) — PagerDuty's main competitor
Incident.io — chat-first next-gen incident management. Strong Slack integration
FireHydrant, Rootly — newer incident-management SaaS
Better Stack — uptime monitoring + incident management bundle

The 2026 direction — Slack and Teams have become the standard incident room. Tools like Incident.io spin up a channel automatically, update the status page, and kick off the post-mortem. PagerDuty is evolving in the same direction.

20. The Shape of AI-Native Observability

The biggest topic of 2026 is how AI changes observability.

Automatic Root Cause Analysis (RCA) — Dynatrace Davis, New Relic AI, Datadog Bits AI
Natural language query — Honeycomb Query Assistant, Grafana LLM, Datadog AI Assistant
Anomaly detection — every major APM ships ML-based anomaly detection now
LLM observability itself — LangSmith, Helicone, Phoenix, Arize, OpenLLMetry. Trace LLM-response latency, cost, and quality
Agent tracing — OTel's GenAI semantic conventions, still evolving, give a standard vocabulary for tracing model calls and agent tool calls

Observability for AI workloads — LLMs are non-deterministic and expensive. "Did this user request go to GPT-4o or Claude 3.5? How many tokens did it use? Was the query cached?" — these are the new golden signals.
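
Those questions map onto a small per-model rollup. The sketch below aggregates hypothetical call records into calls, tokens, cache hit ratio, and cost. The record fields, the model names, and the prices are all made up for illustration, and it assumes cache hits incur no provider cost.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; substitute your provider's price list
PRICES = {
    "model-a": {"input": 2.0, "output": 8.0},
    "model-b": {"input": 1.0, "output": 4.0},
}


def summarize_llm_calls(calls, prices):
    """Roll up LLM call records into per-model usage and cost."""
    summary = defaultdict(lambda: {
        "calls": 0, "cache_hits": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0,
    })
    for call in calls:
        row = summary[call["model"]]
        row["calls"] += 1
        row["input_tokens"] += call["input_tokens"]
        row["output_tokens"] += call["output_tokens"]
        if call["cached"]:
            row["cache_hits"] += 1
            continue  # assumption: served from cache, so no provider charge
        price = prices[call["model"]]
        row["cost_usd"] += (
            call["input_tokens"] * price["input"] + call["output_tokens"] * price["output"]
        ) / 1_000_000
    return dict(summary)


calls = [
    {"model": "model-a", "input_tokens": 1_000_000, "output_tokens": 500_000, "cached": False},
    {"model": "model-a", "input_tokens": 1_000, "output_tokens": 0, "cached": True},
    {"model": "model-b", "input_tokens": 2_000_000, "output_tokens": 0, "cached": False},
]

for model, row in summarize_llm_calls(calls, PRICES).items():
    hit_ratio = row["cache_hits"] / row["calls"]
    print(f"{model}: {row['calls']} calls, cache hit {hit_ratio:.0%}, {row['cost_usd']:.2f} USD")
```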

21. Adoption Roadmap — Where to Start

If you were rebuilding the stack from scratch in 2026, this is roughly the recommended order:

Metrics first — Prometheus + Grafana. node-exporter, kube-state-metrics
Standardise logs — Loki or Elastic. Unify shipping with Vector or Fluent Bit
Adopt traces — Tempo + OTel Collector. Start with auto-instrumentation
Fill 80% visibility with eBPF — Cilium Hubble or Pixie, pick one
Add continuous profiling — Pyroscope, integrated into Grafana
Define SLOs — three to five core services in YAML via Sloth
AI/LLM tracing — if you use LLMs, add LangSmith or Phoenix to the OTel pipeline
Automate incident management — Incident.io or PagerDuty, Slack-integrated

Startup (1-20 people):

Lean on SaaS — Datadog or New Relic + Sentry + Better Stack
Don't buy a learning curve, ship code

Mid-size (20-200 people):

Grafana Cloud + Sentry + Incident.io
Watch cost and migrate to self-hosted LGTM progressively

Large (200+ people):

Self-hosted LGTM + SaaS only in select areas (RUM, APM)
Have a dedicated observability platform team