Observability 2026 Complete Guide: OpenTelemetry, Datadog, Grafana Stack (LGTM+Beyla), Honeycomb, Prometheus, Jaeger, eBPF & SLO Deep Dive


Intro — In May 2026, observability defaults to OpenTelemetry

Back in 2024, "which APM should we buy" was still a vendor choice. By May 2026, the upstream question has converged. **Instrumentation is OpenTelemetry**, and only the **storage, visualization, alerting, and root-cause analysis** behind it varies by vendor. OTel is the second most active CNCF project after Kubernetes, and OTLP (gRPC/HTTP) is now the de facto wire format. Datadog, New Relic, Splunk, Dynatrace, Honeycomb, Chronosphere, Grafana Cloud, Coralogix, and Logz.io all accept OTLP as a first-class input.

The other major shift is **AI for observability**. In 2025, Datadog Bits AI, New Relic AI (NRAI), and Dynatrace Davis AI all shipped natural-language root-cause summaries, automatic anomaly detection, and incident timeline generation as baseline features. The interface for "why did p99 spike" is moving from query languages to natural language.

This article is not a marketing matrix. It is an honest walk-through of what goes where in 2026 production, how to compose a stack on top of OTel, how to choose SaaS vs self-hosted, and how to operate SLOs and error budgets.

The three pillars — and what's beyond

The classical three pillars are **metrics, logs, traces**. In 2026, two more are first-class.

1. **Continuous profiling**: 24/7 CPU/memory profiles (Pyroscope, Parca, Polar Signals, Datadog Continuous Profiler).

2. **RUM (Real User Monitoring) & Synthetics**: LCP, INP, FCP, long tasks, network waterfalls captured from browsers/mobile.

The industry now also uses the term **MELT** (Metrics, Events, Logs, Traces), or MELT+P with profiling. More important than the pillars is **interconnection**: one click should hop from metric to trace, trace to log, log to profile. This "exemplar → trace → log → profile" chain is the 2026 standard observability UX.

OpenTelemetry — the standardized instrumentation layer

OTel has three parts.

- **API/SDK**: SDKs for 11 languages (Java, Go, Python, .NET, JS, Ruby, Rust, C++, PHP, Erlang, Swift), with stability varying by language and signal, plus betas.

- **Collector**: a pluggable router for collection, transformation, and forwarding. Runs as Agent or Gateway.

- **Semantic Conventions**: agreed names for attributes (`http.request.method`, `db.system`, `k8s.pod.name`, etc.). GA'd 1.0 in 2024 and rolling through 1.30+ in 2026.

A Python auto-instrumentation example is this short:

```text
# requirements.txt
# distro 0.51b0 is the contrib release that pairs with core 1.30.0
opentelemetry-distro==0.51b0
opentelemetry-exporter-otlp==1.30.0
```

```shell
# One-liner auto-instrumentation
$ opentelemetry-bootstrap -a install
$ OTEL_SERVICE_NAME=checkout-api \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
  opentelemetry-instrument python app.py
```

```python
# app.py
from fastapi import FastAPI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
app = FastAPI()


@app.get("/checkout/{order_id}")
def checkout(order_id: str):
    # Manual span for a business-domain step; HTTP spans come from auto-instrumentation.
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        # business logic
    return {"ok": True, "order_id": order_id}
```

Add `opentelemetry-instrument` and over 50 libraries — FastAPI, requests, SQLAlchemy, psycopg, Redis, Kafka clients — gain traces and metrics automatically. **Manual instrumentation should be reserved for business-domain spans only.**

OTel Collector — the heart of the pipeline

The Collector declares a receivers → processors → exporters pipeline in YAML. The 2026 standard deployment is a two-tier "agent + gateway" topology.

```yaml
# otel-collector-gateway.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  resourcedetection:
    detectors: [env, system, eks, gcp]
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp/datadog:
    endpoint: api.datadoghq.com:443
    # ${env:VAR} is the Collector's environment-variable syntax
    headers: {dd-api-key: "${env:DD_API_KEY}"}
  otlphttp/grafana:
    endpoint: https://otlp-gateway.grafana.net/otlp
  prometheus:
    endpoint: 0.0.0.0:9464

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, tail_sampling, batch]
      exporters: [otlp/datadog, otlphttp/grafana]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [prometheus, otlphttp/grafana]
```

The critical knob is **tail sampling**. Keep 100% of error and slow traces and probabilistically sample around 5% of the rest to control cost. Datadog's trace API is mostly head-sampled, while OTel Collector lets you make the tail decision at the gateway, which is far more flexible.
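The three policies in the gateway config reduce to a small decision function. The sketch below is a plain-Python illustration of that logic, not the Collector's implementation; `TraceSummary` and `keep_trace` are hypothetical names, and hashing the trace ID is just one assumed way to make the baseline decision repeatable.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What the gateway knows once every span of a trace has arrived."""
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary, slow_ms: float = 500.0,
               baseline_pct: float = 5.0) -> bool:
    """Tail decision: keep all errors, all slow traces, and a small baseline."""
    if trace.has_error:
        return True
    if trace.duration_ms > slow_ms:
        return True
    # Hash the trace ID so the same trace always gets the same verdict.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < baseline_pct * 100


if __name__ == "__main__":
    print(keep_trace(TraceSummary("abc", has_error=True, duration_ms=12.0)))
    print(keep_trace(TraceSummary("def", has_error=False, duration_ms=900.0)))
```

The decision is only possible at the tail because `has_error` and the full duration are unknown when the first span starts, which is why the gateway has to buffer spans for `decision_wait`.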

Metrics — Prometheus and beyond

Prometheus remains the metrics standard in 2026. PromQL is the industry lingua franca, and OTel Collector → Prometheus Remote Write is GA. A single Prometheus has clear cardinality and retention limits, though, so it is usually paired with a **long-term storage backend**.

- **Grafana Mimir** (née Cortex): multi-tenant, S3-backed, PromQL/MQL-compatible, Grafana Labs OSS.

- **Thanos**: sidecar + global query + S3, community standard.

- **Cortex**: the original before the Mimir split; still maintained.

- **VictoriaMetrics**: high compression and single-binary deployment. Strong OSS momentum in 2026.

- **Chronosphere**: cardinality control is the core value proposition. Team from the M3DB project.

A typical PromQL recording rule (5xx ratio SLI):

```yaml
# recording-rules.yaml
groups:
  - name: checkout-sli
    interval: 30s
    rules:
      - record: job:http_request_errors:ratio_5m
        expr: |
          sum(rate(http_server_request_duration_seconds_count{job="checkout-api",http_response_status_code=~"5.."}[5m]))
          /
          sum(rate(http_server_request_duration_seconds_count{job="checkout-api"}[5m]))
      - record: job:http_request_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_server_request_duration_seconds_bucket{job="checkout-api"}[5m]))
          )
      - alert: HighErrorRate
        expr: job:http_request_errors:ratio_5m > 0.01
        for: 10m
        labels: {severity: page, service: checkout}
        annotations:
          summary: "checkout 5xx > 1% for 10m"
```

As of mid-2025, `histogram_quantile` also works on Prometheus native histograms, which map onto OTel exponential histograms. They use 5-10x less memory than classic `le`-bucket histograms and give better accuracy.
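For classic `le` buckets, the p99 in the recording rule is an estimate: the quantile is interpolated linearly inside the bucket that contains the target rank. The sketch below mirrors that idea in plain Python under simplifying assumptions (cumulative counts, a final `+Inf` bucket, a lower bound of 0 for the first bucket); it is an illustration, not Prometheus's code.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to be (+inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.7, latency))
```

The even-spread assumption is why bucket boundaries matter: a p99 that lands in a wide bucket can be off by most of that bucket's width.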

Cardinality crisis and cost control

An OTel Collector example to drop unused metrics and attributes:

```yaml
# cardinality drop rules
processors:
  transform/metrics:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "request_id")
          - delete_key(attributes, "build_sha")
  # Dropping a whole metric is the filter processor's job, not the transform processor's.
  filter/metrics:
    metrics:
      metric:
        - 'name == "unused_metric_legacy"'
  filter/spans:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: http.url
            value: "/healthz|/metrics|/readyz"
```

Since 2024, the industry headache has been the **cardinality explosion**. Slap user_id, request_id, build_sha onto labels and time series blow up into the millions. Datadog, Splunk Observability, and New Relic all charge per custom metric, so bills doubling quarter over quarter is a common incident.

The 2026 playbook has three prongs.

1. **High-cardinality belongs in traces/logs**: keep it out of metric labels and put it on events (Honeycomb's philosophy).

2. **Adaptive sampling + drop rules**: use the OTel Collector's `transformprocessor` or Datadog's metric pipelines to drop unused labels automatically.

3. **Cardinality control SaaS**: Chronosphere Control Plane, Grafana Adaptive Metrics, and Cribl Stream identify and remove unused series.

Charity Majors's "wide, structured events" pitch from 2017 became the actual standard in 2026. Metrics are reserved for SLI/SLA-grade counters and gauges; high-dimensional data flows into events/traces.
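The explosion is simple multiplication: in the worst case, a metric produces one time series per combination of label values. A minimal sketch, with made-up cardinalities chosen only for illustration:

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single request-duration metric.
base = {"service": 20, "http_route": 30, "http_response_status_code": 8}
with_user = {**base, "user_id": 100_000}

if __name__ == "__main__":
    print(worst_case_series(base))       # bounded, low-cardinality labels
    print(worst_case_series(with_user))  # one unbounded label multiplies everything
```

Real series counts are lower because not every combination occurs, but the bound shows why a single unbounded label such as `user_id` dominates the bill.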

Datadog — the undisputed unified SaaS

Datadog hit 3 billion-plus dollars in annual revenue in 2026 with 800-plus integrations, and is effectively the category definer. APM, Logs, RUM, Synthetic, Continuous Profiler, Database Monitoring, Network Performance Monitoring, and CWPP/CNAPP all live in one console. Bits AI, GA in 2025, handles natural-language trace analysis, incident summarization, and IaC code generation.

The strengths are clear.

- Integration paradise. One click for Kubernetes, Lambda, Postgres, Kafka, Stripe.

- Strong UI/UX and fast queries. Even p99 is index-friendly.

- Watchdog's AI/ML anomaly detection actually works.

- LLM Observability (GA 2025): prompts, tokens, and per-model cost aggregation.

The weaknesses are equally clear.

- **Expensive.** USD 23-40 per host per month plus indexed logs plus custom metrics plus RUM sessions. Skip the quarterly bill simulation at your peril.

- Lock-in. Moving off DD agent and DogStatsD onto OTel is a one-to-two-quarter project.

Grafana Stack — the OSS LGTM+Beyla

Grafana Labs has a full stack with Loki (logs), Mimir (metrics), Tempo (traces), Pyroscope (profiles), and Beyla (eBPF auto-instrumentation). Apache 2.0 + AGPL licensing (edition-dependent) means both self-hosting and Grafana Cloud SaaS are options.

Loki's LogQL feels like PromQL.

Loki LogQL: checkout 5xx log rate over a 5-minute window

```logql
sum(rate({service_name="checkout-api"} |= "ERROR" | json | http_response_status_code >= 500 [5m]))
```

Tempo TraceQL: checkout error traces slower than 1 second

```traceql
{ resource.service.name = "checkout-api" && duration > 1s && status = error }
```

Mimir PromQL: node CPU utilization

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Tempo TraceQL, GA in 2024, is widely seen as the most expressive trace query language by 2026. Span attributes, resource attributes, and child-span conditions all fit in a single query.

**Beyla** is a zero-instrumentation eBPF auto-instrumenter that reached 1.0 in late 2023. It extracts HTTP/gRPC spans and RED metrics without touching a single line of code. Go, Java, Python, Node.js, and Rust are all supported. The Beyla 1.10 line in 2026 even decodes mTLS traffic.

Honeycomb — wide events and BubbleUp

Honeycomb was founded in 2016 by Charity Majors and Christine Yen, drawing on Facebook Scuba. The unit of work is a **"wide event" with dozens to hundreds of dimensions**, and every query can group or filter on any dimension on the fly. There is no cardinality limit.

- **BubbleUp**: automatically diffs an anomalous slow-trace cluster against a healthy baseline across every dimension. The most-copied UX pattern of 2026.

- **SLOs**: error budgets are first-class citizens. Burn-rate alerts are smarter than naive thresholds.

- **OpenTelemetry-native**: built around OTLP from day one. Refinery, its tail-sampling proxy, is open source.

Pricing is event-count based and tends to be cheaper than Datadog, though processing metrics and logs separately adds cost. In 2026 it is the preferred Datadog/New Relic alternative for teams who are serious about debugging and SLOs.
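The idea behind BubbleUp can be illustrated in a few lines: for every (dimension, value) pair, compare how often it appears in the anomalous events with how often it appears in the baseline. This is a simplified sketch of the concept, not Honeycomb's algorithm; `bubble_up` and the sample events are hypothetical.

```python
from collections import Counter


def bubble_up(anomalous: list[dict], baseline: list[dict], top: int = 3):
    """Rank (dimension, value) pairs by over-representation in the anomalous set."""

    def share(events: list[dict]) -> dict:
        counts = Counter((k, v) for event in events for k, v in event.items())
        return {pair: n / len(events) for pair, n in counts.items()}

    bad, good = share(anomalous), share(baseline)
    diffs = {pair: s - good.get(pair, 0.0) for pair, s in bad.items()}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    slow = [{"region": "eu", "build": "v42"},
            {"region": "us", "build": "v42"},
            {"region": "eu", "build": "v42"}]
    healthy = [{"region": "eu", "build": "v41"},
               {"region": "us", "build": "v41"},
               {"region": "us", "build": "v42"},
               {"region": "eu", "build": "v41"}]
    for pair, diff in bubble_up(slow, healthy):
        print(pair, round(diff, 2))
```

This only works when events carry many dimensions, which is the practical argument for wide events over pre-aggregated metrics.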

Jaeger / Tempo / Zipkin — the OSS tracing trio

The OSS trace backends boil down to three.

- **Jaeger**: from Uber, CNCF Graduated. v2, launched in 2024, is a distro on top of OTel Collector. ClickHouse, Cassandra, and Elasticsearch storage all supported.

- **Tempo**: Grafana Labs, object-storage friendly. Minimal indexing with trace-ID-only lookup keeps costs low. TraceQL is the killer feature.

- **Zipkin**: from Twitter, around since 2012. Simplicity is the strength. More legacy compatibility than new adoption.

Jaeger v2 + Tempo + OTel Collector is the standard self-hosted stack in 2026. A growing cohort (Uber, Cloudflare) uses ClickHouse directly as the trace backend.

eBPF Observability — instrumentation without code changes

eBPF carved out a new observability category. By intercepting syscall, network, and CPU events directly in the kernel, it extracts RED metrics, distributed traces, and profiles with zero code changes.

- **Pixie** (acquired by New Relic): drop a DaemonSet onto a K8s cluster and HTTP/gRPC/DNS/MySQL traffic is automatically captured. PxL is its scripting language.

- **Cilium Tetragon**: a security + observability blend. Combines syscalls, policy, and traces in a single data model.

- **Grafana Beyla**: the auto-instrumenter from above. Integrates with Cilium Hubble.

- **Parca**: cluster-wide continuous profiling. Always-on CPU profiles in pprof format.

- **Coroot**: SaaS + OSS dual model. eBPF-based service map and auto-generated SLOs.

- **Polar Signals**: SaaS built on Parca. The New Relic of continuous profiling.

The weakness of eBPF is **limited parameter extraction in user-space runtimes like Go**, which means business-logic spans still need an OTel SDK. The 2026 division of labor is "eBPF for RED and golden signals, OTel SDK for domain spans".

Operating Prometheus — Mimir vs Thanos vs Cortex vs VictoriaMetrics

The long-term storage debate is still alive in 2026.

| Item | Mimir | Thanos | Cortex | VictoriaMetrics |
| --- | --- | --- | --- | --- |
| License | AGPL v3 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Multi-tenancy | Strong | Decent | Strong | Strong |
| Compression | Good | Good | Good | Excellent |
| Single binary | No | No | No | Yes |
| Operational cost | High | Medium | High | Low |
| Commercial support | Grafana | Community | Community | VictoriaMetrics |
| Cardinality ceiling | 100M+ | 50M | 100M+ | 500M+ |

Large SRE teams gravitate to Mimir, lean ops teams to VictoriaMetrics, and existing Thanos users rarely have a reason to migrate.

Continuous profiling — Pyroscope, Parca, Polar Signals

Continuous profiling crystallized as a category through 2024-2025. CPU/memory/lock profiles are captured 24/7 to catch regressions and optimize cost.

- **Pyroscope** (acquired by Grafana Labs): supports Go, Python, Java, Ruby, Node.js. The flamegraph diff UX is the killer feature.

- **Parca**: Polar Signals OSS. eBPF-based, so any language profiles with zero config.

- **Polar Signals**: SaaS Parca. Cluster-wide cost optimization analytics.

- **Datadog Continuous Profiler**: production-grade for over a year. Supports ASP.NET, JVM, Go, Python.

- **AWS CodeGuru Profiler**: Java/Python only. AWS native.

The most common issues teams discover are (1) JSON serialization hotspots, (2) regex compilation in tight loops, (3) untuned Go GC GOGC, and (4) Python GIL contention.

RUM and Synthetic — observability from the user's view

Server metrics alone do not tell you "why the user says it's slow". **RUM (Real User Monitoring)** and **synthetic monitoring** close the gap.

- **Datadog RUM**: automatic session replay, INP/LCP/CLS capture, OTel RUM compatible.

- **New Relic Browser**: integrated with APM. Distributed traces reach the browser.

- **Sentry** (error tracking that became observability): JS/mobile RUM, replay, performance.

- **Grafana Faro** (GA 2024): OSS RUM SDK with Loki/Tempo for self-hosting.

- **SpeedCurve, Calibre**: Core Web Vitals specialist SaaS.

- **Synthetic**: Catchpoint, Checkly, Datadog Synthetic, k6 Cloud (Grafana). Checkly leads new adoption thanks to Playwright.

When INP replaced FID as a Core Web Vital in March 2024, joining LCP and CLS, long-task and input-delay measurement in RUM became mandatory.

SLOs and error budgets — the Google SRE legacy

SLOs (Service Level Objectives) became the industry lingua franca after Google's SRE book (2016) and the follow-up SRE Workbook (2018). By 2026 there is a dedicated SLO tooling category.

- **Nobl9**: SLOs across multiple data sources (Datadog, Prometheus, NR). Open-sourced a BSD-based core in 2025.

- **Datadog SLO**: native integration with automatic burn-rate alerts.

- **Grafana Cloud SLO**: runs on top of Mimir/Loki/Tempo.

- **OpenSLO**: a spec. Declares SLOs in YAML so they port across tools.

- **Sloth**: a PromQL SLO rule generator. Run SLOs directly on Prometheus.

A canonical SLO definition:

```yaml
# slo.yaml (Sloth format)
version: prometheus/v1
service: checkout-api
slos:
  - name: availability
    objective: 99.9
    description: "checkout 5xx ratio under 0.1%"
    sli:
      events:
        error_query: sum(rate(http_server_request_duration_seconds_count{job="checkout-api",http_response_status_code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_server_request_duration_seconds_count{job="checkout-api"}[{{.window}}]))
    alerting:
      page_alert:
        labels: {severity: page}
      ticket_alert:
        labels: {severity: ticket}
```

A 99.9% SLO grants a 43.2-minute downtime budget per 30-day month. Tracking that budget via **burn rate** is far smarter than a static threshold alert. A burn rate of 14.4x over a 1-hour window, confirmed on a 5-minute short window, consumes 2% of the monthly budget in an hour and pages immediately; 6x over a 6-hour window consumes 5% and opens a ticket.
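The budget arithmetic is easy to check by hand. A minimal sketch, assuming a 30-day window; the function names are illustrative and do not come from any SLO tool:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return window_days * 24 * 60 * (1 - slo_percent / 100)


def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo_percent / 100)


def budget_consumed(burn: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent by a given burn rate held for `hours`."""
    return burn * hours / (window_days * 24)


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 1))   # monthly budget in minutes
    print(round(burn_rate(0.0144, 99.9), 1))      # 1.44% errors against a 99.9% SLO
    print(round(budget_consumed(14.4, 1), 3))     # share of budget gone after one hour
    print(round(budget_consumed(6, 6), 3))        # share of budget gone after six hours
```

A burn rate of 1 means the budget runs out exactly at the end of the window, so the alert thresholds are really statements about how much of the budget you are willing to lose before someone is notified.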

RED, USE, LETS, Golden Signals — methodology cheat sheet

There are four common methodologies for picking observability metrics.

- **RED** (Tom Wilkie): **R**ate, **E**rrors, **D**uration. Standard for request/response services.

- **USE** (Brendan Gregg): **U**tilization, **S**aturation, **E**rrors. Standard for system resources.

- **LETS** (Tammy Bryant): **L**atency, **E**rrors, **T**raffic, **S**aturation. A Honeycomb variant.

- **Google Four Golden Signals**: Latency, Traffic, Errors, Saturation. The SRE book standard.

In practice teams combine "user-facing surface (API, page) = RED + saturation" with "infrastructure resources (node, DB, queue) = USE".
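As a concrete reading of RED, the three numbers can be derived from a window of raw request records. This is a sketch with a hypothetical `Request` record and a nearest-rank p99; production systems compute the same values from counters and histograms rather than raw lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate (req/s), Errors (5xx ratio), Duration (nearest-rank p99 in ms)."""
    n = len(requests)
    if n == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) using integer math
    return {
        "rate": n / window_s,
        "error_ratio": sum(r.status >= 500 for r in requests) / n,
        "p99_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    sample = [Request(500 if i < 2 else 200, float(i + 1)) for i in range(100)]
    print(red_metrics(sample, window_s=10.0))
```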

Distributed trace context propagation — W3C TraceContext + Baggage

Distributed tracing requires that trace ID, span ID, and sampling decision propagate across every service hop. The 2026 standard is **W3C TraceContext + W3C Baggage**.

- `traceparent` header: `version-trace-id-parent-id-trace-flags`, for example `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`

- `tracestate` header: vendor extensions

- `baggage` header: user-defined context (user_id, tenant_id, etc.)

The OTel SDK handles propagation automatically through default propagators, but non-HTTP paths like gRPC metadata, Kafka message headers, and Lambda contexts need explicit instrumentation.
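When a transport has no built-in propagator, the header has to be read and written by hand. The sketch below parses a version-00 `traceparent` value following the W3C field layout; `TraceParent` and `parse_traceparent` are illustrative names, and real services should prefer the OTel SDK's propagators where they exist.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of trace-flags carries the sampling decision.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

On a Kafka consumer, for example, the same function would be applied to the message header value before starting the consumer span, so the trace continues instead of restarting.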

A typical query when using ClickHouse as the trace backend:

```sql
-- slowest error spans for checkout-api in the last hour
-- (Duration / 1e6 assumes Duration is stored in nanoseconds)
SELECT
    TraceId,
    SpanName,
    Duration / 1e6 AS duration_ms,
    SpanAttributes['http.url'] AS url,
    ResourceAttributes['service.name'] AS service
FROM otel_traces
WHERE Timestamp > now() - INTERVAL 1 HOUR
  AND ResourceAttributes['service.name'] = 'checkout-api'
  AND StatusCode = 'STATUS_CODE_ERROR'
ORDER BY duration_ms DESC
LIMIT 50;
```

Log pipelines — Vector, Fluent Bit, Logstash, Elastic, Splunk

Logs are pricier and trickier than metrics or traces. The 2026 standard refines, drops, and extracts fields at the collector tier, with cold storage separated out.

- **Vector** (OSS by Datadog): Rust-based, the throughput and memory leader. VRL is its transform language.

- **Fluent Bit**: CNCF Graduated. The de facto lightweight K8s sidecar. Lua and Wasm filters.

- **Fluentd**: parent of Fluent Bit. Still common as a node agent.

- **Logstash**: ETL from the Elastic camp. Heavy but rich in grok patterns.

- **OTel Collector**: log receivers and exporters are GA. Same pipeline as metrics and traces.

Storage backends split four ways.

- **Elasticsearch / OpenSearch**: powerful full-text search and aggregations. Elastic Observability is polished.

- **Loki**: only labels are indexed; the body is compressed in chunks. Best cost-efficiency but weak full-text search.

- **ClickHouse**: a columnar store. Cloudflare, Uber, and Discord run their own. Excellent cost and speed with high operational lift.

- **Splunk**: the long-standing champion. SPL is powerful, and the bill is famously the highest.

Incident management — Alertmanager, PagerDuty, Opsgenie, Incident.io, FireHydrant, Squadcast

The real work begins after the alert fires. The 2026 incident management lineup:

- **PagerDuty**: the category definer since 2009. Rebranded as PagerDuty Operations Cloud in 2024.

- **Opsgenie** (Atlassian): strong Jira/Confluence/Statuspage integration. New adoption shrank in 2026 as customers moved to PagerDuty or Incident.io.

- **Incident.io**: 2021 UK startup, Slack-native incident management. Added on-call in 2024 and positioned itself as a PagerDuty replacement.

- **FireHydrant**: US-based incident management and runbooks. Slack workflow automation is a strength.

- **Squadcast**: India/SF, an affordable on-call + incident option.

- **Rootly**: Slack-native with status-page integration.

- **Grafana OnCall** (GA 2023, AGPLv3): the OSS alternative. For teams determined to leave PagerDuty.

- **Alertmanager**: the Prometheus standard. Routing, grouping, silencing.

A canonical Alertmanager routing config:

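```yaml
# alertmanager.yml: severity-based routing
route:
  receiver: default
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty-critical
      continue: true   # keep evaluating later routes after this match
    - matchers: [severity="ticket"]
      receiver: jira-ops
    - matchers: [service="checkout"]
      receiver: slack-checkout

receivers:
  # Every receiver named in a route must be defined, even as an empty stub.
  - name: default
  - name: jira-ops   # stub: attach a webhook_configs entry for your Jira bridge
  - name: pagerduty-critical
    pagerduty_configs:
      # Alertmanager does not expand environment variables; render these
      # placeholders at deploy time or use the *_file options instead.
      - service_key: '$PAGERDUTY_KEY'
  - name: slack-checkout
    slack_configs:
      - api_url: '$SLACK_WEBHOOK'
        channel: '#checkout-alerts'
```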
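The effect of `continue: true` is easiest to see by tracing an alert through the routes. The sketch below models only this flat, one-level route list with exact label matches; it is an illustration of the evaluation order, not Alertmanager's routing code, and `Route` and `receivers_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Route:
    receiver: str
    match: dict[str, str]
    cont: bool = False  # mirrors `continue: true`


# Same order as the routing config above.
ROUTES = [
    Route("pagerduty-critical", {"severity": "page"}, cont=True),
    Route("jira-ops", {"severity": "ticket"}),
    Route("slack-checkout", {"service": "checkout"}),
]


def receivers_for(labels: dict[str, str], routes: list[Route] = ROUTES,
                  default: str = "default") -> list[str]:
    """Walk routes in order; stop at the first match unless it says continue."""
    matched = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route.match.items()):
            matched.append(route.receiver)
            if not route.cont:
                break
    # No child route matched: the root receiver handles the alert.
    return matched or [default]


if __name__ == "__main__":
    print(receivers_for({"severity": "page", "service": "checkout"}))
    print(receivers_for({"severity": "ticket", "service": "checkout"}))
    print(receivers_for({"severity": "info", "service": "search"}))
```

Note the asymmetry: a paging checkout alert reaches both PagerDuty and Slack, while a ticket-level checkout alert stops at the Jira route and never reaches the Slack channel.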

Slack-based ChatOps is the default for every tool in 2026: automatic incident channel creation, automatic role (IC, comms, scribe) assignment, and even postmortem template generation.

AI for observability — Bits AI, Davis AI, NRAI

Every major SaaS shipped an LLM-based assistant in 2025.

- **Datadog Bits AI**: natural-language trace/log analysis, incident summarization, IaC code generation, runbook automation.

- **New Relic AI (NRAI)**: query generation, anomaly-detection explanations, error clustering.

- **Dynatrace Davis AI**: a causation engine since 2017. The LLM-augmented Davis CoPilot landed in 2025.

- **Honeycomb Query Assistant**: natural language to BubbleUp queries.

- **Grafana ML / LLM**: anomaly detection and natural-language dashboard generation.

- **Splunk AI Assistant for SPL**: natural language to SPL.

The honest verdict in 2026 is "prediction was overhyped; search and summarization actually save time". Root-cause analysis is still done by humans.

Korea adoption — PinPoint, Naver D2, Toss, Kakao

Korea's observability scene is unexpectedly strong on the OSS contribution side.

- **Naver PinPoint** (open-sourced 2015): a Java APM under Apache 2.0. Java/PHP/Python/Go agents. Actively maintained in 2026.

- **Naver D2 metrics culture**: operations know-how from the Search and LINE era. Public ELK-based log analysis guides.

- **Toss SRE**: home-grown OTel + Grafana Stack + Pyroscope. The Toss tech blog has several operational case studies.

- **Kakao**: in-house monitoring KAMS evolving into a Datadog/Grafana hybrid. Many trace and metric posts on tech.kakao.com.

- **NHN**: the in-house Watch service inside NHN Cloud. PinPoint-compatible.

- **Coupang**: a large Datadog customer; one of Korea's largest deployments.

- **Karrot (Daangn)**: Datadog + Sentry. Mobile RUM-centric.

PinPoint remains the dominant agent-based OSS APM on the JVM with strong MSA transaction visualization.

Japan adoption — LINE Promgen, Mercari Datadog, GREE Elastic

Japanese internet companies have deep observability stacks too.

- **LINE Promgen**: LINE's homegrown Prometheus operations automation, open-sourced in 2018. Manages tens of thousands of targets.

- **Mercari**: a major Datadog customer since 2017. The SRE blog has many posts on KPI operations and Bits AI adoption.

- **GREE**: runs the Elastic Stack (ELK + APM) for gaming infrastructure.

- **DeNA**: AWS + Datadog + in-house monitoring. Shares observability know-how from real-time services like Pococha.

- **CyberAgent / AbemaTV**: Datadog and New Relic in parallel. AbemaTV has published live streaming observability cases.

- **Rakuten**: large multi-purpose microservices with a mix of in-house + New Relic + Datadog.

- **SmartHR / freee**: newer SaaS players; Datadog + Sentry as standard.

The LINE Engineering and Mercari Engineering blogs are the primary Japanese-language observability sources.

SaaS vs self-hosted decision guide

There is no universal answer, but there are patterns.

- **Annual observability spend under USD 1M**: SaaS is overwhelmingly cheaper. Including staffing and opportunity cost, self-hosting usually loses.

- **Annual spend USD 1M-5M**: the break-even zone. Typically SaaS plus some self-hosted (log cold tier).

- **Annual spend over USD 5M**: self-hosting is justified. A growing number of teams run a ClickHouse + Tempo + Mimir + Pyroscope + Beyla full stack.

**Data sovereignty** demands (EU GDPR, Korea's cloud security certifications, Japanese government cases) can also force self-hosting.

Cost-control checklist

To avoid bill explosions, check these seven.

1. **No high-cardinality labels**: don't shove user_id or request_id into metric labels.

2. **Adaptive sampling**: 100% on errors, 1-5% on success.

3. **Log tiering**: hot 7-14 days, warm 30-90 days, cold 1 year-plus. Use S3 Glacier.

4. **Drop rules**: drop unused series and logs automatically via OTel Collector or Cribl.

5. **Explicit TTL**: a retention policy on every data type.

6. **Quarterly bill simulation**: compute cost at 2x traffic.

7. **Cardinality alarms**: auto-alarm when a new label appears.
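Item 6, the quarterly bill simulation, can start as a spreadsheet-sized script. In the sketch below the host price uses the low end of the per-host range quoted earlier; the log and custom-metric prices are hypothetical placeholders, as are the `Usage` fields, and all of them should be replaced with the numbers from your own contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    hosts: int
    indexed_log_gb: float
    custom_metrics: int


# Hypothetical unit prices in USD per month; not taken from any vendor price list.
PRICES = {"host": 23.0, "indexed_log_gb": 2.0, "custom_metric": 0.05}


def monthly_cost(usage: Usage, prices: dict[str, float] = PRICES) -> float:
    return (usage.hosts * prices["host"]
            + usage.indexed_log_gb * prices["indexed_log_gb"]
            + usage.custom_metrics * prices["custom_metric"])


def scaled(usage: Usage, factor: float) -> Usage:
    """Naive assumption: every usage dimension grows linearly with traffic."""
    return Usage(round(usage.hosts * factor),
                 usage.indexed_log_gb * factor,
                 round(usage.custom_metrics * factor))


if __name__ == "__main__":
    today = Usage(hosts=100, indexed_log_gb=500.0, custom_metrics=20_000)
    print(monthly_cost(today))
    print(monthly_cost(scaled(today, 2.0)))
```

Linear scaling is the pessimistic simple case; the useful part of the exercise is seeing which line item dominates at 2x, since that is where sampling or drop rules pay off first.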

Five observability trends for 2026

1. **OTel by default**: new projects start with the OTel SDK from day one.

2. **eBPF zero-instrumentation**: Beyla and Pixie cover Java/Go/Python RED metrics without code changes.

3. **AI assistants in the daily flow**: natural-language to query, incident summarization, runbook generation.

4. **Cardinality SaaS rises**: Chronosphere, Adaptive Metrics, and Cribl sharpen their value proposition.

5. **OpenTelemetry Logs GA**: the logs signal goes 1.0 alongside metrics and traces. One SDK for all three signals.

References

- [OpenTelemetry official](https://opentelemetry.io/) — standard spec, SDK, Collector

- [Datadog](https://www.datadoghq.com/) — unified SaaS APM

- [Grafana](https://grafana.com/) — Loki/Mimir/Tempo/Pyroscope/Beyla

- [Honeycomb](https://www.honeycomb.io/) — wide-events observability

- [New Relic](https://newrelic.com/) — APM + Pixie + NRAI

- [Dynatrace](https://www.dynatrace.com/) — Davis AI, OneAgent

- [Splunk](https://www.splunk.com/) — Splunk Observability Cloud

- [Prometheus](https://prometheus.io/) — the metrics standard

- [Jaeger](https://www.jaegertracing.io/) — OSS distributed tracing

- [Google SRE Book](https://sre.google/) — SLO/SLI canon

- [Google SRE — Embracing Risk](https://sre.google/sre-book/embracing-risk/) — error budget primer

- [OpenTelemetry Community](https://github.com/open-telemetry/community) — governance

- [Chronosphere](https://chronosphere.io/) — cardinality control

- [Coralogix](https://coralogix.com/) — log analytics SaaS

- [Sentry](https://sentry.io/) — error tracking and performance
