OSS Monitoring Stack 2026 Deep Dive — Replacing Datadog with SigNoz, Coroot, OpenObserve, Sentry, Grafana, Uptrace

Author: Youngju Kim (@fjvbn20031)
Prologue — The Datadog bill and what comes after
A line from a Series B startup's 2025 quarterly review: "Last month's Datadog bill was $140k." The CTO turned the laptop around and the room went silent. That was 7% of revenue. Indexed logs were half of it, custom metrics 30%, APM hosts the rest. Cutting that by 90% over the next quarter became an OKR.
This scene plays out quarterly in mid-to-late stage SaaS shops. On one side: the Datadog/New Relic/Splunk SaaS matrix. On the other: an OSS observability scene that OpenTelemetry standardization has finally made viable. The 2026 answer is no longer "pick one of the two." It's a modular OSS stack that absorbs the core signals while keeping SaaS only where the marginal value is high.
This is an honest map of the 2026 OSS monitoring landscape.
- SigNoz — OpenTelemetry-native, ClickHouse-backed, unified traces/metrics/logs/exceptions. The first OSS that deserves the label "real Datadog alternative."
- Coroot — eBPF-based, zero-instrumentation APM. Get service topology and SLO dashboards without changing a single line of code.
- OpenObserve — Next-gen logs/traces/metrics in Rust. S3 + Parquet backend pushes storage cost to 5%–30% of legacy stacks at petabyte scale.
- Sentry self-hosted — The canon of error tracking. If Sentry SaaS pricing is a problem, this is still the first answer.
- Uptrace — Lightweight ClickHouse-based APM. The easiest single-node OpenTelemetry on-ramp.
- Grafana stack — Loki/Tempo/Mimir/Pyroscope/k6/Beyla/Alloy. Still the largest single OSS observability ecosystem.
Where each shines, why the storage backend decides everything, what OpenTelemetry permanently changed, and the true cost of self-hosting — with no illusions.
1. OpenTelemetry — The standard that changed everything
The single reason OSS observability got dangerous in 2026 is OpenTelemetry (OTel). Before it, every SaaS forced its own agent and your code lived inside that agent. Once you were tied to Datadog APM, escaping meant refactoring.
What OTel changed:
- Instrumentation is vendor-neutral. Emit metrics, traces, and logs via OTLP/gRPC in one format and the receiver — SigNoz, Datadog, Grafana — does not matter.
- Semantic conventions are standardized. Names like `http.request.method`, `db.system`, and `messaging.destination.name` mean the same thing in every tool.
- Auto-instrumentation libraries matured. Python, Node, Java, Go, .NET, Ruby — all support zero-code instrumentation.
- OpenTelemetry Collector became the de facto router/filter/aggregator. Changing the receiver leaves the application code alone.
In 2024 OTel became a CNCF Graduated project. In 2025 the logs signal went GA. As of 2026, traces, metrics, and logs are all stable; exceptions ride on them as span events and structured log records. Profiling entered beta as OTel Profiles.
One-line takeaway: install OTel once and the backend becomes an interchangeable option. This is why every OSS tool is surviving simultaneously.
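The "interchangeable backend" idea can be sketched as a toy fan-out: the application emits one span shape, and a config list decides where it goes. Everything below — the dict layout, the endpoint names — is a simplified stand-in for real OTLP, not an actual SDK call:

```python
# Toy illustration of OTel's decoupling: the app emits one span shape,
# and swapping backends is a config change, not a code change.
# The dict layout and endpoints are simplified stand-ins for OTLP.

def make_span(name: str, trace_id: str, attributes: dict) -> dict:
    """One span, using OTel semantic-convention attribute names."""
    return {
        "name": name,
        "trace_id": trace_id,
        "attributes": {"http.request.method": "GET", **attributes},
    }

def export(span: dict, backends: list) -> list:
    """Fan the same span out to every configured backend endpoint."""
    return [(endpoint, span) for endpoint in backends]

span = make_span("GET /checkout", "abc123", {"http.response.status_code": 200})

# Swapping or adding a receiver is a list edit — the span above is untouched.
deliveries = export(span, ["signoz:4317", "datadog-agent:4317"])
print([endpoint for endpoint, _ in deliveries])  # → ['signoz:4317', 'datadog-agent:4317']
```

In real deployments the fan-out lives in the OTel Collector's pipeline config, not in application code; the point is that the span never changes.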
2. The 2026 OSS observability landscape at a glance
A comparison matrix first. Wide table — view sideways.
| Tool | Primary signals | Storage backend | Strengths | Weaknesses | License | Sweet spot |
|---|---|---|---|---|---|---|
| SigNoz | Traces, metrics, logs, exceptions | ClickHouse | OTel-native, unified UI | ClickHouse ops burden | MIT + some Enterprise | Mid/large |
| Coroot | Metrics, traces (eBPF) | Prometheus, ClickHouse | Zero instrumentation, auto topology | No Windows | Apache 2.0 + Enterprise | Mid-size K8s |
| OpenObserve | Logs, traces, metrics | S3 + Parquet | 5%–30% storage cost | Newer UI, partial compatibility | AGPL-3.0 | Large log volume |
| Sentry self-hosted | Errors, sessions, traces | PostgreSQL, ClickHouse | Best error UX | Self-host policy risk | FSL/BSL | All sizes |
| Uptrace | Traces, metrics, logs | ClickHouse | Easy single-node bootstrap | Small community | AGPL-3.0 | Small/mid |
| Grafana stack | 5 separate tools | Mixed | Largest eco, LGTM | Five-tool integration burden | AGPL-3.0 | All sizes |
The real axis of this table is the storage backend. ClickHouse for compression and query speed, S3 + Parquet for cheap long retention, Prometheus TSDB optimized for metrics. The trade-off triangle returns below.
3. SigNoz — A genuine Datadog alternative
SigNoz started in 2021 as an OTel-native APM/logs/metrics tool. By 2026 it has 40k GitHub stars, SOC 2 Type II, and a follow-on Series A. Self-hosted is free (Community Edition); SaaS is priced per GB/CPU/host.
Tech stack:
- Instrumentation — OTel SDK directly. No agent.
- Ingestion — OpenTelemetry Collector full stack, or direct OTLP.
- Storage — ClickHouse (required). Traces, metrics, and logs in one DB.
- Query — SigNoz UI + PromQL + ClickHouse SQL.
- Alerts — Grafana-style rules or native UI.
Why ClickHouse? Columnar OLAP databases are a natural fit for observability. A single trace row carries 30–80 attributes, and almost every query is "filter, then aggregate." Columnar engines read only the required columns, LZ4 compression averages 8–12×, and ClickHouse ingests hundreds of millions of rows per minute.
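That compression ratio translates directly into disk sizing. A back-of-envelope sketch with illustrative assumptions (1 KB raw per span, 10× average compression):

```python
# Back-of-envelope ClickHouse disk sizing for trace data.
# All inputs are illustrative assumptions, not benchmarks.

def trace_disk_gb(spans_per_sec: int, raw_bytes_per_span: int,
                  compression_ratio: float, retention_days: int) -> float:
    """Compressed on-disk footprint for a given retention window."""
    raw_bytes = spans_per_sec * 86_400 * retention_days * raw_bytes_per_span
    return raw_bytes / compression_ratio / 1e9

# 10k spans/sec, 1 KB per span, 10x compression, 7-day TTL:
print(round(trace_disk_gb(10_000, 1_000, 10.0, 7)))  # → 605
```

Around 600 GB compressed for that workload — which is why the single-node sizing below pairs sub-10k/sec ingest with a 1 TB NVMe disk.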
Key differentiators:
- Hop across traces, metrics, and logs in one UI — the same trace_id is embedded in all three signals. One click moves you.
- Exception signal — like Sentry, errors are grouped and aggregated independently.
- PromQL plus ClickHouse SQL — familiar to Grafana users, with SQL for deep analytics.
- Built-in tail-based sampling — preserving only errors and slow traces is a first-class feature.
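The core of tail-based sampling — buffer the finished trace, then keep it only if it is interesting — fits in a few lines. A sketch with assumed thresholds; in SigNoz or the OTel Collector this is declarative policy config, not hand-written code:

```python
# Sketch of a tail-based sampling decision: the whole trace is buffered
# until it finishes, then kept only if it contains an error or was slow.
# Thresholds, the span shape, and the baseline rate are assumptions.

SLOW_MS = 500          # keep traces slower than this
BASELINE_RATE = 0.01   # keep 1% of boring traces as a baseline

def keep_trace(spans: list, baseline_draw: float) -> bool:
    if any(s.get("status") == "error" for s in spans):
        return True
    if max(s["duration_ms"] for s in spans) > SLOW_MS:
        return True
    return baseline_draw < BASELINE_RATE  # use a uniform random draw in production

fast_ok = [{"status": "ok", "duration_ms": 42}]
slow    = [{"status": "ok", "duration_ms": 1800}]
errored = [{"status": "error", "duration_ms": 30}]

print([keep_trace(t, 0.5) for t in (fast_ok, slow, errored)])  # → [False, True, True]
```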
Single-node docker-compose:
```shell
git clone -b main https://github.com/SigNoz/signoz.git
cd signoz/deploy
./install.sh
# Browser: http://localhost:3301
```
Kubernetes via Helm:
```shell
helm repo add signoz https://charts.signoz.io
helm install signoz signoz/signoz \
  --namespace platform \
  --create-namespace \
  --set otelCollector.replicaCount=3 \
  --set clickhouse.persistence.size=500Gi
```
Sizing guide:
- Below 10k traces/second — single ClickHouse node works. 16 vCPU / 64 GB / 1 TB NVMe.
- 10k to 100k traces/second — a 3-node ClickHouse cluster with ZooKeeper or ClickHouse Keeper.
- Above 100k — you need a dedicated SRE team. At that point re-run the Datadog vs self-host ROI with the new labor cost.
Operational traps:
- ClickHouse disk pressure is a silent killer. Set TTL policies precisely (default: 7 days traces, 30 days metrics, 14 days logs).
- A single OTel Collector becomes a single point of failure. Use the DaemonSet + Gateway pattern.
- Cardinality explosion. Do not pin `user_id` or `request_id` as labels; keep them as attributes.
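The cardinality trap is pure multiplication — the series count is the product of every label's cardinality. A quick illustration with assumed sizes:

```python
# Time-series count is the product of label cardinalities.
# One high-cardinality label multiplies everything else.
# The label names and sizes below are illustrative.
from math import prod

def series_count(label_cardinalities: dict) -> int:
    return prod(label_cardinalities.values())

safe = {"service": 80, "endpoint": 50, "status_code": 8}
print(series_count(safe))              # → 32000

# Pinning user_id (say, 1M users) as a label multiplies that by a million:
exploded = {**safe, "user_id": 1_000_000}
print(series_count(exploded))          # → 32000000000
```

Attributes live on individual rows and are merely stored; labels define distinct series and are what the storage engine must index, which is why the same `user_id` is harmless as one and fatal as the other.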
4. Coroot — Zero-instrumentation APM via eBPF
Coroot's goal is to start APM without changing a line of code. eBPF hooks TCP connections at the kernel level and classifies HTTP/gRPC/DB traffic automatically. It is attractive in environments where you have no time or no permission to bolt SDKs into every service — legacy Java, in-house PHP, Node.
Core architecture:
- coroot-node-agent — a DaemonSet eBPF probe on every node.
- coroot-cluster-agent — collects Kubernetes metadata.
- coroot — UI plus rule engine. Prometheus metrics, ClickHouse traces, inventory all in one.
What eBPF catches:
- Service topology — automatic.
- SLO/SLI estimation — response time and error rate computed automatically.
- DB and message queue calls — Postgres, MySQL, Redis, Kafka protocol decoding.
- Per-container CPU and memory — via cgroup.
- Node network losses and retransmits — TCP statistics.
What eBPF cannot catch:
- Business context such as domain attributes (order id, payment amount).
- Function-level traces (the call graph inside the code).
- Windows (eBPF is a Linux kernel feature).
- Traffic inside TLS — seeing the plaintext HTTP under TLS requires uprobe hooks into SSL libraries (Coroot partially supports this).
In 2026 Coroot's licensing is two-track. Core is Apache 2.0 OSS; AI-based RCA, SSO, and RBAC are Enterprise. The free self-host covers 80 percent of needs.
Kubernetes install:
```shell
helm repo add coroot https://coroot.github.io/helm-charts
helm install coroot coroot/coroot \
  --namespace coroot \
  --create-namespace \
  --set clickhouse.shards=1 \
  --set prometheus.retention=15d
```
Open the UI once and you suddenly see what is actually running in the cluster — that revelation is Coroot's real value. Zero instrumentation effort, instant visibility.
A great pattern: Coroot for the automatic baseline, then SigNoz or Sentry on top with manual instrumentation only where business context demands it.
5. OpenObserve — Rust plus S3 for petabytes at 5 percent cost
OpenObserve appeared in 2023 as a Rust-based logs/traces/metrics platform. By 2026 it has 16k GitHub stars, a Series A round, and an AGPL-3.0 license. Its biggest weapon is storage cost.
One headline benchmark from the project: for the same log volume, OpenObserve claims roughly 140× lower storage cost than Elasticsearch, and a similar multiple against Splunk's operating cost. How? S3 plus Parquet plus indexless full-text search.
Traditional log systems (Elasticsearch, Splunk):
- Index every field — disks explode.
- Cache indexes in memory — expensive instances required.
- Store on local disk — long-term cold storage requires a separate path.
OpenObserve's alternative:
- Indexless. Logs are compressed into Parquet and dropped straight into S3.
- In-memory bloom filters for the first pass, columnar scans for the actual match.
- S3 itself is effectively infinite + cheap + 11 nines of durability. Cold/hot tiering is automatic.
Trade-offs:
- Without indexes, "find X across every log ever" can be slower than Elasticsearch.
- But time range plus one or two fields queries are fast (the 90% real-world pattern).
- Bloom filter plus Parquet statistics handle most full-text needs.
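The two-pass query path — cheap membership check first, columnar scan second — can be sketched with stdlib Python. The hash set below is a stand-in for a real bloom filter, and the in-memory lists stand in for Parquet files:

```python
# Toy version of an indexless two-pass search: a per-file membership
# check prunes files, then only the surviving files are scanned.
# A plain set stands in for a bloom filter; lists stand in for Parquet.

files = {
    "2026-05-01.parquet": ["payment failed user=7", "checkout ok"],
    "2026-05-02.parquet": ["login ok", "cache miss"],
}
# First-pass structure: one token set per file, built at ingest time.
token_sets = {f: {tok for line in lines for tok in line.split()}
              for f, lines in files.items()}

def search(term: str) -> list:
    hits = []
    for f, lines in files.items():
        if term not in token_sets[f]:               # cheap pruning pass
            continue
        hits += [ln for ln in lines if term in ln]  # "columnar scan"
    return hits

print(search("failed"))  # → ['payment failed user=7']
```

The design bet is visible even in the toy: most files are eliminated without being read, so the expensive scan only touches data that probably matches — exactly the time-bounded, few-field query pattern described above.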
In 2026 OpenObserve supports:
- Logs — JSON, Syslog, Fluentbit, OTLP.
- Traces — OTLP/gRPC plus Jaeger compatibility.
- Metrics — Prometheus Remote Write plus OTLP.
- Alerts and dashboards — native UI.
- Multi-tenancy, SSO, RBAC — Enterprise tier.
Single-line docker boot:
```shell
docker run -d \
  -v $PWD/data:/data \
  -p 5080:5080 \
  -e ZO_ROOT_USER_EMAIL=admin@example.com \
  -e ZO_ROOT_USER_PASSWORD=admin \
  public.ecr.aws/zinclabs/openobserve:latest
```
Kubernetes plus S3 mode:
```yaml
# values.yaml (Helm)
config:
  ZO_S3_BUCKET_NAME: my-o2-logs
  ZO_S3_REGION_NAME: us-east-1
  ZO_DATA_STREAM_TYPE: s3
ingester:
  replicaCount: 3
querier:
  replicaCount: 2
```
If you operate at petabyte scale and logs are the primary signal — security teams, fintechs, game studios — OpenObserve is the real answer. Compared with Loki plus S3, OpenObserve's single-tool integration lowers operational burden further.
6. Sentry self-hosted — Still the canon of errors, but policies wobbled
Sentry has been the de facto error-tracking standard since 2008. Self-hosting was free for a long time. Then the licensing picture shifted.
- In 2023 the license switched to FSL (Functional Source License) — a BSL variant that auto-converts to Apache 2.0 after two years.
- In 2024 some new features (Replay, Crons) were SaaS-first and OSS-delayed.
- In 2025 the self-host package grew heavier (50 GB disk, many components).
- Official line from Sentry: "self-host stays alive, but SaaS-first development pace will accelerate."
As of May 2026 self-host is still alive. Monthly releases ship, and core features — Error, Performance, Profiling, Crons — are all included. But the operational burden is not light.
Self-host components:
- Snuba — query layer on top of ClickHouse
- Relay — event collection and filter gateway
- Kafka — event queue
- Redis — cache and rate limiting
- PostgreSQL — metadata
- Sentry core — Django plus Celery
- Symbolicator — source map and symbol resolution
That is 8 to 12 containers, 50 GB of disk as a starting line. A single node handles up to one million events per day without strain. Above that, Kafka and ClickHouse start to split out.
Install:
```shell
git clone https://github.com/getsentry/self-hosted.git
cd self-hosted
./install.sh
# Tune options like SENTRY_EVENT_RETENTION_DAYS=90 in .env
```
Three scenarios:
- Pure error tracking — Sentry self-host dominates. No OSS peer matches its grouping, trend, and release comparison UX.
- Errors plus full-stack observability — SigNoz wins for integration. Sentry-like exceptions are first-class signals in SigNoz too.
- Errors plus heavy cost pressure — when Sentry SaaS Developer or Team pricing is the pain point, self-host is the answer.
A common question: "Is Sentry OTel-compatible?" Partially yes. You can send OTel traces to Sentry (Performance signal), but errors still use the Sentry SDK as the canonical path. As of 2026, work to align OTel exception signals with Sentry grouping is underway.
7. Uptrace — Lightweight ClickHouse APM
Uptrace is a ClickHouse-backed OTel APM like SigNoz, but its virtue is starting light on a single node. AGPL-3.0, free self-host, paid SaaS.
Differences from SigNoz:
- Single-binary boot possible — a single container in Docker.
- One ClickHouse node is enough — handles up to 50M spans per day without strain.
- A simpler UI — trace-focused, metrics secondary.
When you pick Uptrace:
- Small team and even one self-host operator is a stretch.
- ClickHouse is new but you want to try it once.
- You want to look at traces deeply, one signal at a time.
When you go to SigNoz instead:
- You need traces + metrics + logs + exceptions in one place.
- The team has grown enough to accept ClickHouse cluster operations.
One-line docker:
```shell
docker run -d --name uptrace \
  -p 14317:14317 -p 14318:14318 \
  -v $PWD/uptrace.yml:/etc/uptrace/uptrace.yml \
  uptrace/uptrace:latest
```
Uptrace also bundles the OTel Collector in-process (Standalone mode). For teams new to OTel, this is the lowest barrier to entry.
8. Grafana stack — Biggest eco, most modular choice
The stack Grafana Labs has built since 2017 is the largest single OSS observability faction in 2026. By tool:
- Prometheus — the de facto metrics standard. 2024 added OTLP receive for OTel compatibility.
- Mimir — Prometheus's horizontal-scale backend. Multi-tenant, S3-backed.
- Loki — logs. "Index labels only" philosophy keeps storage cost down.
- Tempo — distributed tracing. S3, GCS, MinIO backends.
- Pyroscope — continuous profiling. eBPF plus language-specific SDKs.
- Grafana — the dashboard standard. LGTM data on one canvas.
- Alloy — OpenTelemetry Collector distribution. Successor to Grafana Agent in 2024.
- Beyla — eBPF auto-instrumentation. Lighter than Coroot, single-service granularity.
- k6 — load testing. Couples naturally with observability signals.
Strengths:
- Each tool deep in one signal. A great fit for teams with dedicated observability staff.
- Federated operation possible. Mimir here, Tempo there, Loki somewhere else.
- The Grafana dashboard ecosystem is enormous; visualizing anything is familiar.
Weaknesses:
- Operating five tools. Grafana is the unified UI but the backends are separate.
- Heavy on day one. A single docker-compose will not finish the job.
- Alerts and rules differ per tool. Standardization required.
When to pick the Grafana stack:
- Already running Prometheus and the natural extension wins.
- A dedicated SRE/observability team.
- Multi-tenant or federated cluster requirements.
When SigNoz or Uptrace win instead:
- You want everything in one unified UI.
- Operational staff is thin.
- This is a brand new start.
Grafana Cloud's free tier is generous enough that "self-host vs Grafana Cloud" must always be on the comparison sheet. The 2026 free tier is 10k metric series, 50 GB logs, 50 GB traces, 14-day retention.
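Whether that free tier covers you is one multiplication against the series budget. A sketch with illustrative workload numbers:

```python
# Quick fit check against the free-tier metric-series budget quoted
# above. All workload inputs are illustrative assumptions.

FREE_SERIES = 10_000  # free-tier active-series budget cited in the text

def fits_free_tier(services: int, metrics_per_service: int,
                   avg_label_combos: int) -> bool:
    return services * metrics_per_service * avg_label_combos <= FREE_SERIES

print(fits_free_tier(10, 40, 20))  # 8,000 series  → True
print(fits_free_tier(40, 40, 20))  # 32,000 series → False
```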
9. Storage backends — The real decision
Eighty percent of tool choice is backend choice. Three patterns split the market.
9.1 ClickHouse — Columnar OLAP
Used by: SigNoz, Uptrace, Sentry (via Snuba), Coroot (for traces).
Strengths:
- Compression 8–12 times. Low disk cost.
- Hundreds of millions of rows per minute ingest.
- Columnar scans make narrow queries fast.
- Native SQL.
Weaknesses:
- Operational difficulty is high. Backups, replicas, Keeper, all require learning.
- Merge and mutation cost. A bad fit for frequent updates.
- Cardinality explosion risk. Bad label design will blow up disk.
9.2 S3 + Parquet — Indexless object storage
Used by: OpenObserve, Tempo, Loki (blocks), Mimir (blocks).
Strengths:
- Effectively infinite, 11 nines durable, cheap.
- Cold storage automatic. Hot/cold tiering options.
- Decoupled compute. Scale query nodes independently.
Weaknesses:
- Slower than local disk; a caching layer is mandatory.
- Only efficient when queries are bounded by time plus a few narrow fields.
- Heavy full-text analytics are a weak spot.
9.3 Prometheus TSDB — Time-series only
Used by: Prometheus, Mimir, Cortex, VictoriaMetrics.
Strengths:
- Overwhelmingly optimized for metrics. Small disk, fast queries.
- PromQL ecosystem is the standard.
- Operationally simple in single-node form.
Weaknesses:
- Wrong fit for traces and logs. Time series only.
- Horizontal scale needs separate tools like Mimir or Thanos.
Combination guide:
- Metrics dominate 80 percent — Prometheus plus Mimir.
- Logs dominate 80 percent — OpenObserve, or Loki plus S3.
- Traces dominate 80 percent — SigNoz/Uptrace plus ClickHouse, or Tempo plus S3.
- All three balanced — SigNoz alone.
10. Datadog to SigNoz — A real migration
A reconstructed migration journal from a 50k-host SaaS in late 2025.
10.1 Starting state
- Monthly Datadog bill of roughly $120k (APM $20k, custom metrics $80k, logs $20k).
- 80 backend services in Java, Go, and Node.
- Auto-instrumented by the Datadog Agent.
- Custom metrics growing 12 percent per month due to cardinality drift.
10.2 Phased migration
Phase 1 — Abstract instrumentation (4 weeks)
Move to the OTel SDK. Prefer auto-instrumentation to minimize code changes. Manually add only business attributes.
Phase 2 — Dual backend (6 weeks)
The OTel Collector ships to Datadog and SigNoz simultaneously. SigNoz first in dev and staging, then a weekly rollout across production services. Validate data parity.
Phase 3 — SLO verification (4 weeks)
Rebuild critical alerts and dashboards in SigNoz. Confirm response time and error rate SLOs behave the same as Datadog Monitor. Give on-call time to get familiar with the SigNoz UI.
Phase 4 — Remove Datadog (2 weeks)
Cut over service by service. The last step is removing the Datadog Agent.
10.3 Results
- The bill fell from roughly $120k to $12k; even counting the added SRE labor, the net cut was about 90 percent.
- Trace retention 30 days to 15 days (re-evaluation showed it was enough).
- Custom metrics dropped 50 percent via cardinality policy.
- On-call MTTR unchanged.
10.4 Regrets
- The dual-run window was too short. Eight weeks recommended.
- Logs migration should have moved in parallel. Doing it separately broke context.
- Forgot ClickHouse disk monitoring early on and had one disk-full incident.
11. The honest cost of self-hosting
"OSS means free" is a lie. The honest cost model is this.
- Infrastructure: ClickHouse, S3, Prometheus nodes, backups, multi-AZ. Roughly $5k–$50k per month for 50–500 hosts.
- People: 0.3–1 SRE FTE, roughly $50k–$150k per year.
- Downtime risk: if observability itself dies, troubleshooting is blind. HA design is mandatory.
- Learning curve: ClickHouse, Kafka, S3 lifecycle, OTel Collector tuning, retention policy. The first six months are trial and error.
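The cost model above reduces to a break-even check: the SaaS bill versus infrastructure plus fractional SRE labor. A minimal sketch where every input, including the assumed salary, is a knob to tune:

```python
# Minimal self-host vs SaaS break-even model for the cost items
# listed in this section. All figures are assumptions to adjust.

def selfhost_monthly(infra_monthly: float, sre_fte: float,
                     sre_salary_yearly: float) -> float:
    """Monthly self-host cost: infrastructure plus prorated SRE labor."""
    return infra_monthly + sre_fte * sre_salary_yearly / 12

def self_host_wins(saas_bill_monthly: float, infra_monthly: float,
                   sre_fte: float, sre_salary_yearly: float = 180_000) -> bool:
    return selfhost_monthly(infra_monthly, sre_fte, sre_salary_yearly) < saas_bill_monthly

# $40k/month SaaS bill vs $10k infra + 0.5 SRE FTE:
print(self_host_wins(40_000, 10_000, 0.5))  # → True  ($17.5k < $40k)
# A $5k/month bill rarely justifies even fractional labor:
print(self_host_wins(5_000, 2_000, 0.3))    # → False ($6.5k > $5k)
```

Note what the model deliberately omits — downtime risk and the six-month learning curve — which is exactly why the $30k threshold below is a floor, not a trigger.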
When self-host wins:
- Bill is above $30k per month, and
- You have SRE capacity, and
- Compliance requires data not to leave the perimeter, and
- You want to control cardinality and retention policy yourself.
When SaaS wins:
- Small team, no SRE.
- Bill is below $5k per month.
- Security and compliance do not block SaaS.
- Speed of getting started is the higher-value lever.
Middle option: Grafana Cloud or SigNoz Cloud. OSS backends as a managed service. Bills hover at 30–50 percent of Datadog with the operational burden removed.
12. Anti-patterns to avoid
Long-accumulated traps.
- "Observability equals logs." Logs alone, without traces and metrics, make causal tracing impossible.
- Cardinality explosion — the moment `user_id` and `request_id` become labels, you are finished. Attributes only.
- Full throttle without sampling — 10k RPS with 100 percent trace retention? Disk explodes. Tail-based sampling keeps only errors and slow paths.
- Alert flood — 30 alerts per hour per person become noise. Use SLO-based alerts for things that actually matter.
- Single OTel Collector — single point of failure. DaemonSet plus Gateway, two tiers.
- No TTL — you only notice when disk is full. Set this in the first week of self-host.
- 200 dashboards — nobody looks at them. Five golden dashboards plus automatic alerts beat two hundred stale ones.
- Deploying the tool without making instrumentation mandatory — the tool is in place but there is no data. Add instrumentation gates and CI validation.
- Ignoring OTel semantic conventions — a week later even you cannot find your own keys. Follow the conventions.
- All-in on one backend — when ClickHouse goes down, SigNoz, Sentry, and Uptrace all die. Separate.
13. Decision flowchart
Answer in order.
- Current monthly observability cost? Under $10k — keep SaaS.
- Do you have an SRE or observability lead? No — SaaS or managed (Grafana Cloud, SigNoz Cloud).
- Does compliance ban external transfer? Yes — self-host decision is made.
- Which signal dominates?
- Unified traces+metrics+logs — SigNoz.
- Logs dominate — OpenObserve or Loki.
- Start zero-instrumentation via eBPF — Coroot.
- Errors are the focus — Sentry self-host.
- Lightweight single node — Uptrace.
- Modular operation — Grafana stack.
- Three-year scale forecast? Petabyte logs likely — OpenObserve or Loki. Smaller — SigNoz.
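The same flow, transcribed as straight-line code so the ordering of the questions is explicit. Branch labels are simplified, and the `dominant_signal` keys are my shorthand for the article's options:

```python
# The decision flow from this section as straight-line code.
# Branches mirror the article; edge cases are deliberately simplified.

def pick_stack(monthly_cost: int, has_sre: bool, compliance_ban: bool,
               dominant_signal: str, petabyte_logs: bool) -> str:
    if not compliance_ban:               # compliance ban forces self-host
        if monthly_cost < 10_000:
            return "keep SaaS"
        if not has_sre:
            return "managed OSS (Grafana Cloud / SigNoz Cloud)"
    # Self-host path: pick by dominant signal and scale forecast.
    if petabyte_logs or dominant_signal == "logs":
        return "OpenObserve or Loki"
    return {
        "unified": "SigNoz",
        "ebpf":    "Coroot",
        "errors":  "Sentry self-host",
        "light":   "Uptrace",
        "modular": "Grafana stack",
    }[dominant_signal]

print(pick_stack(40_000, True, False, "unified", False))  # → SigNoz
print(pick_stack(3_000, False, False, "errors", False))   # → keep SaaS
```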
14. Where the wind blows after 2026
Where OSS observability is heading.
- OTel Profiles GA — beta in 2025, GA expected late 2026. Continuous profiling becomes the fourth signal.
- AI-driven RCA — Coroot, SigNoz, and Grafana are all shipping RCA features split across OSS and Enterprise tiers. LLM-generated first-cut hypotheses for incidents becomes standard.
- eBPF goes mainstream — beyond Beyla and Coroot, more tools will adopt eBPF auto-instrumentation.
- Decoupled storage standardization — Iceberg and Parquet-based common formats are loosening tool lock-in.
- End of cardinality pricing — Datadog and New Relic face pressure to retreat from label cardinality pricing. The freedom of OSS is pressuring the entire pricing model.
Three-year forecast: OTel + S3-Parquet + ClickHouse + eBPF will be the de facto baseline OSS observability stack. Datadog and New Relic will still exist, but market share is likely to dip into the 30–40 percent range.
Epilogue — Self-host is a right and a responsibility
An old truth: observability is not insurance, it is product. Outsourcing that product's backend to SaaS is a legitimate choice; operating it yourself is a legitimate choice. Both cost something. SaaS in the form of a bill; self-host in the form of time and SRE salary.
The interesting thing about 2026 is that the option set genuinely expanded. OpenTelemetry broke vendor lock-in, ClickHouse and S3-Parquet bent the storage cost curve, and eBPF drove instrumentation effort toward zero. The result is that the era of "one tool does it all" is over and the era of modular tool composition has begun.
14.1 Adoption checklist
- Inventory current observability cost, signal types, retention policy.
- Begin instrumentation abstraction via OpenTelemetry SDK (remove vendor lock-in).
- Design OTel Collector as a two-tier DaemonSet plus Gateway.
- Pick a backend — SigNoz, Coroot, OpenObserve, Sentry, Uptrace, or Grafana.
- Decide on the core storage layer — ClickHouse, S3, or Prometheus.
- Dual-run backends for at least 8 weeks.
- Rebuild SLOs, alerts, and dashboards in the new tool.
- Configure TTL, disk, and cardinality monitoring.
- Give on-call time to adapt to the new UI.
- Re-evaluate honest ROI one quarter in.
14.2 Anti-pattern summary
- Mistaking observability for logs.
- Ignoring cardinality explosion.
- 100 percent retention with no sampling.
- Alert flood that destroys trust.
- Operating a single OTel Collector.
- Forgetting TTL.
- 200 dashboards.
- Deploying tools without mandating instrumentation.
- Ignoring semantic conventions.
- All-in on one backend.
14.3 Next post preview
The next post is a deep dive on the OpenTelemetry Collector. DaemonSet vs Gateway, processors, filters, routers, real tail-based sampling implementation, Kafka backing, retries, DLQ. The real operational starting point for OSS observability lives there.
References
- OpenTelemetry: https://opentelemetry.io/
- OTel Collector: https://opentelemetry.io/docs/collector/
- SigNoz: https://signoz.io/ · https://github.com/SigNoz/signoz
- Coroot: https://coroot.com/ · https://github.com/coroot/coroot
- OpenObserve: https://openobserve.ai/ · https://github.com/openobserve/openobserve
- Sentry self-hosted: https://develop.sentry.dev/self-hosted/ · https://github.com/getsentry/self-hosted
- Uptrace: https://uptrace.dev/ · https://github.com/uptrace/uptrace
- Grafana Loki: https://grafana.com/oss/loki/
- Grafana Tempo: https://grafana.com/oss/tempo/
- Grafana Mimir: https://grafana.com/oss/mimir/
- Grafana Pyroscope: https://grafana.com/oss/pyroscope/
- Grafana Beyla: https://grafana.com/oss/beyla-ebpf/
- Grafana Alloy: https://grafana.com/oss/alloy-opentelemetry-collector/
- ClickHouse: https://clickhouse.com/
- Prometheus: https://prometheus.io/
- CNCF Observability TAG: https://github.com/cncf/tag-observability