Network & Service Observability 2026 Deep Dive — eBPF · Cilium Hubble · Pixie · Pyroscope · Grafana Loki + Tempo + Mimir · Netdata · OpenTelemetry
Author: Youngju Kim (@fjvbn20031)
Prologue — In 2026, Observability Has to Answer "Why?"
In 2015, observability arrived with a three-pillar vocabulary: metrics, logs, traces. In 2023 the CNCF formally added continuous profiling as the fourth pillar. And in 2026 the real shift happened somewhere else entirely: instrumentation disappeared.
Five years ago, attaching distributed tracing to a Java app meant importing the OpenTelemetry SDK, annotating every method, and threading context propagation through the code. The 2026 default is to let the kernel do it via eBPF. Cilium Hubble sees every packet in the cluster. Pixie sees every HTTP/gRPC/SQL call after one DaemonSet install. Beyla, Coroot, and Caretta pull out the golden signals without a single line of code change.
That does not mean the SDK era is over. OpenTelemetry graduated from the CNCF in 2024 and OTLP is now the de facto wire protocol. Think of it this way: eBPF gives you "80% of visibility for free", and OTel SDKs fill in "the remaining 20% of business meaning". The real shape of the 2026 stack is a hybrid of eBPF and OTel.
This post draws that map — the four pillars, every eBPF tool worth knowing (Cilium / Hubble / Tetragon / Pixie / Inspektor Gadget / Coroot / Beyla / Caretta), Grafana LGTM (Loki + Tempo + Mimir + Pyroscope), SaaS giants like Datadog/New Relic/Dynatrace, network-specific tools (Suzieq, ntopng, ThousandEyes), and how Korean and Japanese companies actually use them in production.
Observability is a superset of monitoring. If monitoring is "watching whether a known metric crosses a threshold", observability is "the property of a system that lets you answer questions you didn't think to ask in advance". The 2026 difference isn't the tools — it's a stack design that lets you ask those questions.
What this post covers:
- The four pillars (metrics / logs / traces / profiles) and golden signals
- The eBPF revolution — Cilium 1.16, Hubble, Tetragon, Pixie
- OpenTelemetry 2026 — Collector, OTLP, auto-instrument
- Metrics stack — Prometheus 3.0, VictoriaMetrics, Mimir
- Logs stack — Loki 3, Elastic, Vector, Quickwit, OpenObserve, SigNoz
- Traces stack — Tempo 2, Jaeger 2, Zipkin, Honeycomb
- Continuous profiling — Pyroscope, Parca, Polar Signals
- Network observability — Suzieq, Skydive, ntopng, ThousandEyes
- RUM & synthetic monitoring — Cloudflare, Checkly, Grafana Synthetic
- APM comparison — Datadog, New Relic, Dynatrace, AppDynamics
- K8s observability — Prometheus Operator, k9s, Lens
- Service mesh + observability — Kiali, Linkerd dashboard
- DevSecOps + observability — Falco + OTel
- Storage backends — VictoriaMetrics, Mimir, ClickHouse, MinIO
- Cost model — Datadog $35-70/host vs self-host LGTM
- Korean adoption — NCsoft Pixie, Coupang Datadog, Naver OTel, Kakao Grafana
- Japanese adoption — Mercari, LINE Yahoo, CyberAgent
- SLO/SLI and error budget operations
- Alerting and PagerDuty/Opsgenie/Incident.io
- The shape of AI-native observability
- Adoption roadmap — where to start
- References
1. The Four Pillars and Golden Signals — What to Measure
The starting question for any observability project is "what do we look at?" The 2026 consensus is four pillars.
- Metrics: time-series numbers. Counters, gauges, and histograms like `http_requests_total` and `cpu_usage_seconds_total`. The cheapest and longest-retained of all.
- Logs: event text. Rich context at the moment something happens, but expensive and slow to search.
- Traces: causality across a distributed request. Shows how a single user call hops from service A → B → C → DB.
- Profiles: function-level breakdown of CPU, memory, and lock contention. Tracks continuously which function is consuming the most resources.
Mapping these to Google's SRE four golden signals is the canonical operational view.
| Golden Signal | Definition | Example Metric |
|---|---|---|
| Latency | Time to process a request | p50/p95/p99 response time |
| Traffic | Load on the system | RPS, QPS, MB/s |
| Errors | Failure rate | 5xx ratio, exception count |
| Saturation | Resource fullness | CPU utilisation, queue depth, disk IOPS |
USE vs RED, 2023 revisited — Brendan Gregg's USE (Utilisation / Saturation / Errors) is resource-oriented; Tom Wilkie's RED (Rate / Errors / Duration) is request-oriented. Both are cousins of the golden signals — choosing which lens to start with shapes what your first dashboard looks like.
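As a rough illustration of the request-oriented signals above, the sketch below derives traffic, error ratio, and latency percentiles from one window of request samples. The `Request` record and the 10-second window are assumptions made for the example, not a format any particular tool emits. Saturation is left out because it needs resource data (CPU, queue depth) rather than request data, and the function assumes a non-empty window.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def percentile(values, q):
    """Nearest-rank percentile; q is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds):
    """Summarise one non-empty observation window of requests."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
    }


# 97 fast successes and 3 slow server errors in a 10-second window
sample = [Request(20.0, 200)] * 97 + [Request(900.0, 500)] * 3
signals = golden_signals(sample, window_seconds=10)
print(signals)
```

Note how the median hides the three slow failures while p99 exposes them, which is why dashboards usually track both.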
2. The eBPF Revolution — Cilium, Hubble, Tetragon, Pixie
The real inflection point of 2026 observability is eBPF (extended Berkeley Packet Filter): a sandboxed in-kernel VM that can hook packets, syscalls, and socket events. Its effect on observability has been decisive.
What eBPF changed:
- Language-agnostic instrumentation — the kernel sees every syscall whether your app is Go, Java, Python, or Rust.
- Zero-code-change — install one DaemonSet and you're done.
- Low overhead — kernel-level, typically 1-3% CPU.
- L7 visibility — HTTP/gRPC/SQL parsers run inside eBPF, surfacing method, path, and status code.
Key tools:
- Cilium 1.16 (Isovalent, acquired by Cisco in 2024): Kubernetes networking + observability. CNI, service mesh (Cilium Service Mesh), security, and visibility in one package.
- Cilium Hubble: flow visibility layer on top of Cilium. `kubectl exec` into the Hubble CLI to see in real time which pod talks to which on what port.
- Tetragon: Cilium's runtime-security sibling. Tracks which process opened which file and called which syscall.
- Pixie (acquired by New Relic, CNCF Sandbox): install one DaemonSet on a K8s cluster and every HTTP/gRPC/Postgres/Redis call is captured automatically. Comes with its own query language, PxL.
- Inspektor Gadget: eBPF-based debugging toolkit for K8s. Commands like `kubectl gadget trace dns` give instant traces.
- Coroot, Caretta, Beyla: next-gen tools that auto-generate service maps and golden signals via eBPF. Beyla comes from Grafana Labs.
- bpftrace / BCC Tools: Brendan Gregg classics. Single-purpose tracers like `execsnoop`, `tcptracer`, `biolatency`.
# Install Cilium and open the Hubble UI with the cilium CLI
cilium install --version 1.16.0
cilium hubble enable --ui
cilium hubble ui
# Port-forwards the Hubble UI to http://localhost:12000 for real-time flow visualisation
The Pixie magic — typically you have to embed an OTel SDK in every service to get distributed traces. With Pixie, installing one PEM (Pixie Edge Module) DaemonSet is all it takes. In five minutes you have the HTTP/gRPC call graph for all your microservices.
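To make the idea of an automatically derived call graph concrete, here is a minimal sketch of how per-request flow events could be folded into a service map with per-edge call counts, error ratio, and average latency. The flow record shape (`src`, `dst`, `status`, `latency_ms`) is hypothetical and chosen for the example; it is not the schema Pixie or Hubble actually export.

```python
from collections import defaultdict

# Hypothetical flow events, loosely shaped like what an eBPF agent might emit.
flows = [
    {"src": "frontend", "dst": "cart", "status": 200, "latency_ms": 10.0},
    {"src": "frontend", "dst": "cart", "status": 500, "latency_ms": 30.0},
    {"src": "cart", "dst": "postgres", "status": 200, "latency_ms": 5.0},
]


def build_service_map(flows):
    """Aggregate flow events into one summary per (caller, callee) edge."""
    edges = defaultdict(lambda: {"calls": 0, "errors": 0, "total_latency_ms": 0.0})
    for flow in flows:
        edge = edges[(flow["src"], flow["dst"])]
        edge["calls"] += 1
        edge["errors"] += 1 if flow["status"] >= 500 else 0
        edge["total_latency_ms"] += flow["latency_ms"]
    return {
        pair: {
            "calls": e["calls"],
            "error_ratio": e["errors"] / e["calls"],
            "avg_latency_ms": e["total_latency_ms"] / e["calls"],
        }
        for pair, e in edges.items()
    }


service_map = build_service_map(flows)
for (src, dst), stats in service_map.items():
    print(f"{src} -> {dst}: {stats}")
```

The point of the sketch is that nothing here needs application code changes: if the kernel can observe each request, the graph falls out of simple aggregation.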
3. OpenTelemetry 2026 — The De Facto Standard
OpenTelemetry (OTel) graduated from the CNCF in 2024. It is also widely cited as the second most active CNCF project after Kubernetes, measured by contributor activity. As of 2026, OTel can fairly be called "the wire standard for observation data."
Components:
- OTel SDKs: official support for eleven languages in total, including Go / Java / JS / Python / Rust / .NET / PHP / Ruby / Swift
- OTel Collector: the data pipeline (receivers → processors → exporters)
- Auto-instrumentation: `-javaagent` for Java, `opentelemetry-instrument` for Python, single-flag install
- OTLP protocol: gRPC + HTTP, single transport for metrics, logs, traces, and profiles (the latter GA in 2025)
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  # Loki 3.x ingests OTLP natively, so the generic otlphttp exporter is used
  # here instead of the deprecated dedicated loki exporter
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
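The receivers → processors → exporters flow in the configuration above can be pictured with a toy batching stage. This is only a sketch of the idea behind a batch processor: the class name and `max_batch_size` parameter are made up for the example, and the timeout-based flush that the real Collector processor performs is deliberately left out.

```python
class BatchProcessor:
    """Toy batching stage: buffers items and hands full batches to an exporter."""

    def __init__(self, export, max_batch_size=3):
        self.export = export              # callable that receives a list of items
        self.max_batch_size = max_batch_size
        self.buffer = []

    def consume(self, item):
        """Called by a receiver for every incoming span, log, or metric point."""
        self.buffer.append(item)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; a real processor also does this on a timer."""
        if self.buffer:
            self.export(list(self.buffer))
            self.buffer.clear()


exported_batches = []                     # stands in for a network exporter
processor = BatchProcessor(exported_batches.append, max_batch_size=3)

for span_id in range(7):                  # stands in for an OTLP receiver
    processor.consume({"span_id": span_id})
processor.flush()                         # final flush on shutdown

print([len(batch) for batch in exported_batches])
```

Batching trades a little latency for far fewer outbound requests, which is why it sits in nearly every pipeline.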
Why OTel won — five years ago, Datadog Agent, New Relic Agent, Jaeger Client, Zipkin Brave each pushed their own SDK. Migrating a company from Datadog to New Relic was a multi-month project. OTel broke that lock-in with a simple promise: "standardise the data, free the backend." By 2026 even Datadog treats OTLP as a first-class citizen.
4. Metrics Stack — Prometheus 3.0 and Friends
Metrics is the oldest and most mature pillar. The 2026 standard is unambiguously the Prometheus ecosystem.
- Prometheus 3.0 (released November 2024): UTF-8 labels, native histograms, built-in OTLP receiver
- VictoriaMetrics: Prometheus-compatible high-performance alternative. Single binary, 10x compression
- Grafana Mimir: horizontally scalable Prometheus backend. Long-term storage on AWS S3 / GCS / MinIO
- Thanos: an independent CNCF project (not part of Mimir's lineage). Adds S3 long-term storage and a multi-region global view on top of Prometheus
- Cortex: the project Mimir was forked from. Still found in some installs
- Datadog / New Relic / Dynatrace: commercial metrics stacks, dashboards included
# Prometheus 3.0 scrape + OTLP receive
# (start Prometheus with --web.enable-otlp-receiver to accept OTLP pushes)
global:
  scrape_interval: 15s
# OTLP receiver settings
otlp:
  promote_resource_attributes:
    - service.name
    - service.namespace
    - deployment.environment
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
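Latency percentiles such as p99 are usually estimated from histogram buckets rather than from raw samples. The sketch below mimics, in simplified form, how a quantile can be interpolated from classic cumulative buckets, in the spirit of PromQL's `histogram_quantile`. It is an approximation written for illustration: it assumes the first bucket starts at 0, expects `q` in (0, 1], and does not cover native histograms.

```python
INF = float("inf")


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last upper_bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == INF:
                # The quantile lies beyond the last finite bucket boundary.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# 100 requests: 50 under 0.1 s, 90 under 0.5 s, 99 under 1 s, 1 slower
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (INF, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
print(round(p95, 3))
```

Because the answer is interpolated, its accuracy depends entirely on how well the bucket boundaries match the real latency distribution.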
VictoriaMetrics vs Mimir:
| Aspect | VictoriaMetrics | Grafana Mimir |
|---|---|---|
| Architecture | Single binary | Distributed microservices |
| Storage | Local disk | S3 / GCS / MinIO |
| Query language | PromQL + MetricsQL | PromQL |
| Compression | Very high | Moderate |
| Operational cost | Low | Medium |
| Best scale | Small to mid | Multi-tenant, large |
5. Logs Stack — Loki, Elastic, Vector, Quickwit
Logs is the most expensive and most error-prone pillar. The 2026 direction is clear: less indexing, more column stores.
- Grafana Loki 3.x — "index only labels" log DB. Same model as Prometheus, S3 backend
- Elasticsearch / OpenSearch — classic. Full-text indexing. Powerful but expensive
- Quickwit — Rust-written search engine. S3-first, full-text indexing, Elastic alternative
- VictoriaLogs — log DB from the VictoriaMetrics team. Fast full-text
- OpenObserve — Rust-based all-in-one (metrics, logs, traces)
- SigNoz — open-source observability on ClickHouse (metrics, logs, traces)
- ClickHouse — column store. Adopted by Uber and Cloudflare as a log backend
- Fluentd / Fluent Bit / Vector — log shippers. Vector is the Rust-based next-gen, acquired by Datadog in 2021
# Fluent Bit -> Loki pipeline example
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  docker
    Tag     kube.*
[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.observability.svc
    Port    3100
    Labels  job=fluentbit
Loki vs Elasticsearch, in one sentence each:
- Want logs treated like metrics → Loki
- Full-text search, SIEM, security analytics first → Elasticsearch / OpenSearch
- Rust stack, S3-first → Quickwit / VictoriaLogs
- Unified metrics, logs, traces in one project → SigNoz / OpenObserve
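The "index only labels" model is easier to weigh against full-text indexing with a small example. In this sketch, streams are looked up by label set and the log lines themselves are only scanned at query time. The class and method names are invented for illustration and do not mirror Loki's API or storage format.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy log store: indexes streams by labels, scans line content on read."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Pick streams by label match, then brute-force filter their lines."""
        wanted = set(selector.items())
        matches = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:
                matches.extend(line for line in lines if contains in line)
        return matches


store = LabelIndexedLogStore()
store.push({"app": "cart", "env": "prod"}, "GET /cart 200")
store.push({"app": "cart", "env": "prod"}, "POST /cart 500 timeout")
store.push({"app": "search", "env": "prod"}, "GET /search 500 timeout")

print(store.query({"app": "cart"}, contains="timeout"))
```

Writes stay cheap because nothing inside the line is indexed; the cost moves to queries, which is the trade-off that separates this model from Elasticsearch-style full-text indexing.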
6. Traces Stack — Tempo 2, Jaeger 2
Distributed tracing answers the question "where did this user request get slow?"
- Grafana Tempo 2.x — label-only indexing, S3 backend, same philosophy as Loki
- Jaeger 2 (CNCF Graduated, full rewrite in 2024) — rebuilt on top of OTel Collector. v1-compatible
- Zipkin — Twitter original. Still a fine choice when you want a simple install
- Honeycomb — the originator of event-based observability. Charity Majors' influence is everywhere
- Lightstep (acquired by ServiceNow) — large-scale trace analytics
- AWS X-Ray / GCP Cloud Trace / Azure Monitor — cloud-native options
- Datadog APM / New Relic APM / Dynatrace — commercial integrations
# Python OTel auto-instrumentation in three commands
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument --traces_exporter otlp \
    --exporter_otlp_endpoint http://tempo:4317 python app.py
What Jaeger 2 means — back in the 1.x days Jaeger had its own Cassandra/Elasticsearch backends and its own SDK. In 2024 Jaeger 2 was rebuilt entirely on the OTel Collector. That's a statement: even the trace backend is no longer a lock-in.
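Whichever backend stores the spans, the causality between services is carried by a propagated context header. The sketch below builds and continues a W3C Trace Context `traceparent` value (version `00`, 32-hex trace ID, 16-hex span ID, 2-hex flags). The helper names are made up for the example; in practice an OTel SDK or auto-instrumentation agent handles this.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent span-id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent_header):
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        # Malformed or missing header: start a fresh trace instead.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()             # header set by service A
outgoing = child_traceparent(incoming)   # header service B sends to service C
print(incoming)
print(outgoing)
```

The shared trace ID is what lets a backend stitch the A → B → C hops into one trace.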
7. Continuous Profiling — Pyroscope, Parca
Settling in as the fourth pillar of observability since 2023, continuous profiling captures CPU / memory / lock profiles in production all the time, at 1-5% overhead.
- Grafana Pyroscope: open-source continuous profiling database from Grafana Labs, formed in 2023 when Grafana acquired Pyroscope and merged it with its own Phlare project
- Parca: open-source eBPF-based profiler from Polar Signals. A separate project from Pyroscope
- Polar Signals Cloud: Parca's commercial hosted offering
- Datadog Continuous Profiler: commercial option integrated with Datadog APM
- Pyroscope OSS: self-hostable
Typical use cases:
- CPU hotspots — which function burns 60% of CPU, at a glance
- Memory leak hunting — alloc/free graphs over time
- Lock contention — which mutex blocks the most threads
# Deploy Pyroscope to Kubernetes (Helm)
helm repo add grafana https://grafana.github.io/helm-charts
# The --set value is quoted so the shell does not treat [0] as a glob pattern
helm install pyroscope grafana/pyroscope \
    --set 'pyroscope.config.scrape_configs[0].job_name=k8s-pods'
eBPF + Pyroscope combo — Pyroscope's eBPF profiler runs at the node level and pulls CPU profiles for every process automatically. Without touching code, you get Go, Rust, C++, and Python profiles in one place.
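A continuous profiler ultimately produces a stream of sampled call stacks. As a rough sketch of what happens next, the code below turns samples into the "folded" one-line-per-stack form commonly fed to flame graph tools, and computes which function was on top of the stack most often. The sample data and function names are invented for the example.

```python
from collections import Counter

# Each sample is one captured call stack, outermost frame first (made-up data).
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "db_query"),
    ("main", "gc"),
]


def folded(samples):
    """Collapse identical stacks into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


def self_time_share(samples):
    """Fraction of samples in which each function was the innermost frame."""
    leaf_counts = Counter(stack[-1] for stack in samples)
    return {fn: count / len(samples) for fn, count in leaf_counts.most_common()}


print("\n".join(folded(samples)))
shares = self_time_share(samples)
print(shares)
```

With these samples, `parse_json` accounts for 60% of on-CPU samples, which is the "which function burns the CPU" answer a flame graph shows visually.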
8. Network-Specific Observability Tools
Distinct from service observability is network observability — the visibility of the network itself.
- Suzieq: network observability. Query switch / router / firewall state with SQL-like filters from a CLI
- Skydive: real-time network topology + traffic visualisation
- ntopng / ntop: flow analytics classic (NetFlow / IPFIX / sFlow)
- Netflix Atlas: Netflix's time-series database for monitoring. Strong on cloud network metrics
- Wireshark / tcpdump: packet capture standard. Still the final answer in the eBPF era
- Cisco ThousandEyes: internet path monitoring SaaS. BGP, DNS, CDN visibility
- Catchpoint / Pingdom: commercial synthetic and internet monitoring
- Kentik: large-scale network analytics SaaS
- Arista CloudVision: datacentre network observability
# Suzieq example: find non-established BGP sessions from the CLI
$ suzieq-cli
suzieq> bgp show state=NotEstd
namespace  hostname  vrf      peer      state
prod       leaf01    default  10.0.0.2  Active
prod       leaf03    default  10.0.0.6  Connect
Why network observability got important again — five years ago "the cloud vendor handles it" was the answer. By 2026, multi-cloud, multi-region, service mesh, and zero-trust networking are routine, and "who is talking to whom and about what?" is once again the central question.
9. RUM (Real User Monitoring) and Synthetic Monitoring
Just as important as server-side observability is what the user actually sees on screen.
RUM (Real User Monitoring):
- Datadog RUM — most mature. Includes session replay
- New Relic Browser — browser monitoring classic
- Cloudflare Browser Insights — RUM tied to the CDN
- Sentry Performance — RUM integrated with error tracking (covered in another post)
- PostHog Session Replay — product analytics + session replay
Synthetic Monitoring:
- Checkly — code-first synthetic monitoring. Playwright-native
- Datadog Synthetics — multi-step API/browser checks
- Grafana Synthetic Monitoring — bundled with Grafana Cloud
- Pingdom — classic availability monitoring
- UptimeRobot, Better Stack — low-cost ping monitoring
// Checkly synthetic monitor example (Playwright)
import { expect, test } from '@playwright/test'

test('homepage loads', async ({ page }) => {
  const res = await page.goto('https://example.com')
  // page.goto() can resolve to null, so guard before reading the status
  expect(res?.status()).toBeLessThan(400)
  await expect(page.locator('h1')).toContainText('Welcome')
})
The 2026 RUM standard: collect Core Web Vitals (LCP, INP, CLS) in the browser, push them over OTLP into an OTel Collector, and view them in Grafana. Cloudflare's free tier also reports Core Web Vitals.
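Once Core Web Vitals arrive in the backend, they are typically judged at the 75th percentile against Google's published thresholds (LCP 2.5 s / 4 s, INP 200 ms / 500 ms, CLS 0.1 / 0.25). The sketch below applies that rule to a handful of made-up page-view measurements; the function names and sample values are assumptions for illustration.

```python
from math import ceil

# (good upper limit, poor lower limit); LCP and INP in milliseconds, CLS unitless
THRESHOLDS = {
    "LCP": (2500, 4000),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
}


def p75(values):
    """Nearest-rank 75th percentile of the collected measurements."""
    ordered = sorted(values)
    rank = max(1, ceil(0.75 * len(ordered)))
    return ordered[rank - 1]


def rate(metric, values):
    """Classify a metric as good / needs improvement / poor at p75."""
    good_limit, poor_limit = THRESHOLDS[metric]
    value = p75(values)
    if value <= good_limit:
        return "good"
    if value <= poor_limit:
        return "needs improvement"
    return "poor"


# Made-up measurements from four page views
print(rate("LCP", [1200, 1800, 2100, 5200]))
print(rate("INP", [100, 250, 300, 800]))
print(rate("CLS", [0.05, 0.3, 0.3, 0.3]))
```

Using p75 rather than the mean keeps a single very slow page view from dominating the rating while still reflecting what a sizeable minority of users experience.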
10. APM Comparison — Datadog vs New Relic vs Dynatrace
The commercial APM Big Three look much like they did a few years ago.
| Aspect | Datadog | New Relic | Dynatrace |
|---|---|---|---|
| Strength | Breadth, UX | Pricing, AI | Davis AI, auto-detect |
| Pricing | 35-70 USD per host per month | ~0.30 USD per GB ingested (effective) | DPS units (complex) |
| OTel | First-class | First-class | First-class |
| Auto-instr | Very strong | Very strong | Strongest (OneAgent) |
| K8s | Strong | Strong | Strong |
| AI analysis | Bits AI | New Relic AI | Davis AI (the original) |
Also-rans worth knowing:
- AppDynamics (Cisco/Splunk) — enterprise
- Elastic Observability — Elastic Stack-integrated
- Splunk Observability (formerly SignalFx) — Splunk Cloud integrated
- Honeycomb — the prestige brand of event-based observability
- Logz.io / Sumo Logic — SaaS ELK and unified
The 2026 trend — fewer companies do single-vendor "all Datadog" lock-ins. More are hybrid: self-hosted Grafana LGTM for the bulk, with Datadog or New Relic only for select areas. Cost pressure is the main driver.
11. K8s Observability — Operator, k9s, Lens
Kubernetes observability is a full sub-category in its own right.
- Prometheus Operator: run Prometheus the K8s-native way, via the `ServiceMonitor` and `PodMonitor` CRDs
- kube-prometheus-stack: Helm chart that bundles Prometheus + Grafana + Alertmanager + node-exporter
- kube-state-metrics: exposes K8s resource state as Prometheus metrics
- node-exporter: node-level system metrics
- k9s: terminal K8s client par excellence
- Lens / OpenLens: GUI K8s IDE
- k0s, k0sctl: a lightweight K8s distribution and its cluster lifecycle tool
- Cilium + Hubble + OTel + Pixie: the full eBPF stack
# ServiceMonitor example for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
12. Service Mesh and Observability
A service mesh is, by virtue of having a sidecar intercept all traffic, essentially an automatic trace and metric generator.
- Istio + Kiali: Kiali is the standard for visualising Istio service mesh topology
- Linkerd dashboard: simplicity is the selling point. CNCF Graduated
- Consul Connect + Consul UI: the HashiCorp stack
- Cilium Service Mesh: eBPF-based, sidecar-less mesh
- Envoy Admin: Envoy is the data plane behind Istio, Consul, and many other meshes. Its `/stats` endpoint exposes 1,000+ metrics
Sidecar-less mesh — with Istio Ambient Mode and Cilium Service Mesh going GA in 2024-2025, the "node-level data plane" is now the standard. From an observability standpoint, the key insight is that even without sidecars you get the same data.
13. DevSecOps Meets Observability
Runtime security and observability share data, but ask different questions.
- Falco — eBPF/syscall runtime security (CNCF Graduated)
- Tetragon — Cilium's security sibling
- OPA / Gatekeeper — policy engine
- Kubescape — K8s security scanner (CNCF Incubating)
- Cloud SIEM — Datadog Cloud SIEM, Elastic Security, Sumo Logic Cloud SIEM
- CrowdStrike Falcon / Wiz / Lacework — commercial CSPM/CNAPP
# Falco rule example: alert when a shell is spawned inside a container
- rule: Terminal shell in container
  desc: A shell was used as the entrypoint/exec point in a container
  condition: spawned_process and container and shell_procs
  output: "A shell was spawned in a container (user=%user.name container=%container.id)"
  priority: WARNING
14. Storage Backends — VictoriaMetrics, Mimir, ClickHouse, MinIO
The volume of observability data is explosive. Where and how long you store it is the central design question.
- VictoriaMetrics: long-term metric storage. Very high compression
- Grafana Mimir: distributed backend for Prometheus metrics. S3-first
- Grafana Loki: logs. S3-first
- Grafana Tempo: traces. S3-first
- ClickHouse: column store. Adopted aggressively as a log/trace backend (SigNoz, Uber, Cloudflare)
- MinIO: S3-compatible object storage. The de facto backend for self-hosted observability
- AWS S3 / GCS / Azure Blob: cloud options
Retention guidelines:
| Data type | Hot | Warm | Cold |
|---|---|---|---|
| Metrics | 15 days | 90 days | 13 months+ |
| Logs | 7 days | 30 days | 1 year+ (S3 IA) |
| Traces | 3-7 days | 30 days | 90 days+ |
| Profiles | 7 days | 30 days | usually discarded |
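One way to read the retention table is as an age-based tiering rule: data younger than the hot limit stays on fast storage, data up to the warm limit moves to cheaper storage, and older data goes cold or is dropped. The sketch below encodes that reading. The interpretation of the columns as cumulative age limits, and the choice of 7 days as the hot limit for traces, are assumptions made for the example.

```python
# (hot limit in days, warm limit in days), taken from the table above
RETENTION_DAYS = {
    "metrics": (15, 90),
    "logs": (7, 30),
    "traces": (7, 30),
    "profiles": (7, 30),
}

# Data types the table lists as "usually discarded" after the warm tier
DISCARD_AFTER_WARM = {"profiles"}


def storage_tier(data_type, age_days):
    """Return the tier a piece of telemetry belongs to, given its age."""
    hot_days, warm_days = RETENTION_DAYS[data_type]
    if age_days <= hot_days:
        return "hot"
    if age_days <= warm_days:
        return "warm"
    return "discard" if data_type in DISCARD_AFTER_WARM else "cold"


for data_type, age in [("metrics", 10), ("logs", 20), ("traces", 60), ("profiles", 45)]:
    print(data_type, age, storage_tier(data_type, age))
```

In a real deployment the same rule is usually expressed as object-storage lifecycle policies or per-tenant retention settings rather than application code.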
15. Cost Model — SaaS vs Self-Host
Observability typically consumes 5-15% of infrastructure cost. At scale, 30% is not unusual.
SaaS pricing (2026 list prices):
- Datadog APM: ~35-70 USD/host/month, custom metrics separately
- New Relic: 99 USD/user/month + ~0.30 USD/GB effective (the actual model is more complex)
- Dynatrace: DPS (Dynatrace Platform Subscription) units, usually 80-200 USD/host/month
- Honeycomb: event-volume based, free tier then per-GB
- Grafana Cloud Pro: usage-based, free tier then per-series-and-GB
Self-hosted LGTM (rough order of magnitude), assuming 100 hosts, 1 TB/day of logs, and 100 GB/day of traces:
- Compute: ~10 K8s nodes, ~3,000 USD/month
- S3 storage: ~500-800 USD/month (1-year retention)
- Operations: 0.5+ FTE (the largest hidden cost)
Rule of thumb — under 50 hosts, SaaS is almost always cheaper. Over 200 hosts, self-hosted is almost always cheaper. In between it depends on the company. Cost-pressured Korean and Japanese companies are crossing that threshold to self-hosting quickly.
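The rule of thumb can be sanity-checked with back-of-the-envelope arithmetic. The sketch below compares a per-host SaaS price with the self-hosted figures listed above. The fully loaded engineer cost of 12,000 USD/month is a placeholder assumption, and the model treats the self-hosted cost as fixed, which understates it for fleets much larger than the 100-host baseline.

```python
from math import ceil


def saas_monthly_usd(hosts, usd_per_host):
    """SaaS bill that scales linearly with host count."""
    return hosts * usd_per_host


def self_hosted_monthly_usd(compute_usd=3000, storage_usd=650,
                            fte_fraction=0.5, fte_monthly_usd=12000):
    """Compute + storage + operations; fte_monthly_usd is a placeholder."""
    return compute_usd + storage_usd + fte_fraction * fte_monthly_usd


def break_even_hosts(usd_per_host):
    """Smallest host count at which SaaS costs at least as much as self-hosting."""
    return ceil(self_hosted_monthly_usd() / usd_per_host)


print(self_hosted_monthly_usd())      # monthly self-hosted estimate in USD
print(saas_monthly_usd(100, 70))      # 100 hosts at the upper SaaS price
print(break_even_hosts(70), break_even_hosts(35))
```

Under these assumptions the break-even point lands between roughly 140 and 280 hosts depending on the per-host price, which is consistent with the "in between it depends" zone described above.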
16. Korean Adoption — NCsoft, Coupang, Naver, Kakao
Korea's big-tech observability story tends to follow this arc: Datadog adoption around 2020 → gradual move toward self-hosting from 2024.
- NCsoft — presented Pixie and eBPF adoption at KubeCon (2023). Used for microservice tracing of game backends
- Coupang — major Datadog user. Unified APM and infrastructure monitoring
- Naver — OpenTelemetry adoption. Talked about building in-house metric/trace backends at DEVIEW
- Kakao — self-hosted Grafana stack. Heavy use in messenger backend metric visibility
- Woowa Brothers (Baemin) — Datadog + self-hosted Grafana hybrid
- KakaoBank / Toss — high self-hosting share due to financial regulation. ELK + Grafana common
- Daangn Market — Datadog APM + Sentry combo
- Line Plus Korea — OpenTelemetry + Grafana
The common pattern — Korean companies tend to (1) start with Datadog/New Relic for fast initial visibility, and (2) move the bulk to LGTM once cost crosses a threshold while keeping RUM/Sentry as SaaS.
17. Japanese Adoption — Mercari, LINE Yahoo, CyberAgent
Japan is generally considered one step ahead of Korea in adopting OpenTelemetry and eBPF.
- Mercari — major Datadog user. Presented OTel migration at SRECon 2024
- LINE Yahoo — post-merger acceleration of OpenTelemetry standardisation. In-house metric backend
- CyberAgent — large-scale Cilium deployment. Multiple eBPF talks at KubeCon
- DeNA — Datadog + Grafana hybrid
- Rakuten — major Splunk user. APM with New Relic
- PayPay — AWS-native observability + Datadog
- Smartbank / Kyash — fintech startups adopting OTel
- Mercari ML team — Pyroscope/Parca for profiling machine learning workloads
Japanese SRE culture — the SRE community around Mercari and CyberAgent (SRE Lounge, the SRE NEXT conference) plays a major role in spreading observability best practices. That community is one reason Japan adopted OTel a beat ahead of Korea.
18. SLO/SLI and Error Budget Operations
The final destination of observability is SLOs (Service Level Objectives). Define "p99 response time is under 200 ms and availability is 99.9%", monitor it with metrics, and burn down an error budget when you miss.
Tools:
- Sloth — Prometheus-based SLO generator. Define SLOs in YAML
- Pyrra — SLO controller in the same category as Sloth
- OpenSLO — YAML standard for SLO definitions
- Nobl9 — dedicated SLO SaaS
- Datadog SLO / Honeycomb SLO — APM-integrated options
# Sloth SLO definition example
version: "prometheus/v1"
service: "my-api"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        # Sloth substitutes {{.window}} for each rate window it generates
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: MyApiHighErrorRate  # placeholder alert name
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
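The arithmetic behind an error budget is short enough to show directly. The sketch below converts an availability objective into allowed downtime and computes a burn rate, meaning how many times faster than sustainable the budget is being consumed. The 30-day window and function names are assumptions for the example; tools like Sloth generate equivalent Prometheus rules instead.

```python
def error_budget_minutes(objective_percent, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    allowed_error_ratio = 1 - objective_percent / 100
    return allowed_error_ratio * window_days * 24 * 60


def burn_rate(error_ratio, objective_percent):
    """1.0 means the budget lasts exactly the window; higher burns it faster."""
    return error_ratio / (1 - objective_percent / 100)


budget = error_budget_minutes(99.9)        # roughly 43 minutes per 30 days
rate = burn_rate(0.0144, 99.9)             # 1.44% of requests failing

print(f"budget: {budget:.1f} min, burn rate: {rate:.1f}x")
```

A burn rate of about 14.4 sustained for one hour consumes roughly 2% of a 30-day budget, which is a commonly used threshold for paging alerts.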
19. Alerting and Incident Management
The final stage of observability is alerting and incident response.
- Alertmanager — Prometheus-ecosystem standard
- Grafana Alerting — Grafana-integrated option (multi-datasource)
- PagerDuty — SaaS standard for incident management
- Opsgenie (Atlassian) — PagerDuty's main competitor
- Incident.io — chat-first next-gen incident management. Strong Slack integration
- FireHydrant, Rootly — newer incident-management SaaS
- Better Stack — uptime monitoring + incident management bundle
The 2026 direction — Slack and Teams have become the standard incident room. Tools like Incident.io spin up a channel automatically, update the status page, and kick off the post-mortem. PagerDuty is evolving in the same direction.
20. The Shape of AI-Native Observability
The biggest topic of 2026 is how AI changes observability.
- Automatic Root Cause Analysis (RCA) — Dynatrace Davis, New Relic AI, Datadog Bits AI
- Natural language query — Honeycomb Query Assistant, Grafana LLM, Datadog AI Assistant
- Anomaly detection — every major APM ships ML-based anomaly detection now
- LLM observability itself — LangSmith, Helicone, Phoenix, Arize, OpenLLMetry. Trace LLM-response latency, cost, and quality
- Agent tracing — OTel's GenAI semantic conventions launched in 2025. A standard for tracing agent tool calls
Observability for AI workloads — LLMs are non-deterministic and expensive. "Did this user request go to GPT-4o or Claude 3.5? How many tokens did it use? Was the query cached?" — these are the new golden signals.
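Those questions map naturally onto span attributes. The sketch below assembles an attribute dictionary for one LLM call. The `gen_ai.*` keys follow the OTel GenAI semantic conventions as commonly documented and should be checked against the current spec; the `app.llm.*` keys, the model names, and the per-token prices are all hypothetical.

```python
# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICE_PER_1K_TOKENS_USD = {
    "model-a": {"input": 0.0025, "output": 0.01},
    "model-b": {"input": 0.003, "output": 0.015},
}


def llm_span_attributes(model, input_tokens, output_tokens, cache_hit):
    """Build attributes that could be attached to the span for one LLM call."""
    price = PRICE_PER_1K_TOKENS_USD[model]
    if cache_hit:
        cost_usd = 0.0  # assume a cached response incurs no provider charge
    else:
        cost_usd = (input_tokens * price["input"]
                    + output_tokens * price["output"]) / 1000
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "app.llm.cache_hit": cache_hit,   # hypothetical custom attribute
        "app.llm.cost_usd": cost_usd,     # hypothetical custom attribute
    }


attrs = llm_span_attributes("model-a", input_tokens=1000, output_tokens=500,
                            cache_hit=False)
print(attrs)
```

Recording these as attributes on the existing request trace means model choice, token usage, and cost can be sliced with the same tools used for latency and errors.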
21. Adoption Roadmap — Where to Start
If you were rebuilding the stack from scratch in 2026, this is roughly the recommended order:
- Metrics first — Prometheus + Grafana. node-exporter, kube-state-metrics
- Standardise logs — Loki or Elastic. Unify shipping with Vector or Fluent Bit
- Adopt traces — Tempo + OTel Collector. Start with auto-instrumentation
- Fill 80% visibility with eBPF — Cilium Hubble or Pixie, pick one
- Add continuous profiling — Pyroscope, integrated into Grafana
- Define SLOs — three to five core services in YAML via Sloth
- AI/LLM tracing — if you use LLMs, add LangSmith or Phoenix to the OTel pipeline
- Automate incident management — Incident.io or PagerDuty, Slack-integrated
Startup (1-20 people):
- Lean on SaaS — Datadog or New Relic + Sentry + Better Stack
- Don't buy a learning curve, ship code
Mid-size (20-200 people):
- Grafana Cloud + Sentry + Incident.io
- Watch cost and migrate to self-hosted LGTM progressively
Large (200+ people):
- Self-hosted LGTM + SaaS only in select areas (RUM, APM)
- Have a dedicated observability platform team
22. References
- OpenTelemetry Project
- OpenTelemetry CNCF Graduation Announcement
- Cilium Project
- Hubble — Cilium Flow Observability
- Tetragon — eBPF Security Observability
- Pixie — Auto-instrument K8s with eBPF
- Inspektor Gadget
- Grafana Pyroscope
- Grafana Tempo
- Grafana Loki
- Grafana Mimir
- Jaeger 2 Announcement
- Prometheus 3.0 Release Notes
- VictoriaMetrics
- Quickwit — Cloud-native Search
- SigNoz — Open Source Observability
- Coroot — eBPF Observability
- Beyla — eBPF Auto-instrumentation
- Brendan Gregg — BPF Performance Tools
- Google SRE Book — Service Level Objectives
- Honeycomb — Observability Engineering Book
- Datadog Pricing
- Sloth — Prometheus SLO Generator
- Falco — Runtime Security