Observability Deep Dive: Logs, Tracing, and LLM Monitoring

Introduction — What Observability Really Means
The Three Pillars — Logs, Metrics, Traces
How Signals Connect — trace_id and Exemplars
Logs — Loki vs OpenSearch
Distributed Tracing — OpenTelemetry Becomes the Standard
LLM Observability — Why It Needs Special Tools
- Langfuse — Traces of LLM Calls
- Other Options — LangSmith, Helicone, Phoenix
Putting It Together — An OpenTelemetry-Centered Stack
- Common Traps — Cost and Cardinality
- Decision Checklist
Conclusion
References

Introduction — What Observability Really Means

The system got slow. Users say "payments aren't going through." But the servers are running fine, CPU looks normal, and there are no obvious error logs. Where is the problem? The ability to answer that question is exactly what observability is.

Observability is often confused with monitoring. Monitoring watches whether predefined metrics stay in their normal range. Observability is whether you can answer arbitrary questions about a system's internal state from the outside. The former guards against known unknowns; the latter lets you dig into unknown unknowns.

The raw materials for this ability are commonly called the three pillars: logs, metrics, and traces. This post walks through each pillar, contrasts the choices for log storage (Loki vs OpenSearch) and distributed tracing (OpenTelemetry, Jaeger, Tempo), and then moves on to the newly emerging field of LLM observability (Langfuse and friends).

The Three Pillars — Logs, Metrics, Traces

The three signals record the same events from different angles.

Logs: records of individual events at a point in time. "At 12:03:11, payment for order 4821 failed, reason: card declined." The most detailed, but the volume explodes easily.
Metrics: numeric values aggregated over time. "Requests per second," "p99 latency," "error rate." Cheap to store and ideal for dashboards and alerts, but they carry no context about individual events.
Traces: the full journey of a single request across many services. Starting at the API gateway, passing through the order service, payment service, and the database, showing how long each segment took.

Recently, continuous profiling is frequently cited as a fourth pillar. It continuously samples CPU and memory at the function level in production, answering "which function is eating the CPU" at a code level deeper than traces. Tools like Grafana Pyroscope and Parca are representative.

The key is not to view the three signals in isolation but to view them linked together. You spot "the error rate spiked" in a metric, jump to the trace at that moment to confirm "the payment service was slow," and read the log attached to that trace to find the cause: "card issuer timeout." This flow is the essence of observability.

How Signals Connect — trace_id and Exemplars

The glue that links the three signals is a correlation ID, above all the trace_id.

The most practical link is embedding the trace_id in logs. If you emit the trace_id as a field of your structured logs, you can look at a trace and filter down to "only the logs this request emitted." Conversely, from an error log you can open the full trace by its trace_id.

{
  "timestamp": "2026-07-03T12:03:11.482Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "card declined",
  "order_id": 4821
}
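
As a rough illustration of how a log line like the one above could be produced, here is a minimal sketch in plain Node.js. The helper name formatLog and the hand-built context object are assumptions made for this example; in a real service the trace context would come from the active span of your tracing library rather than being passed in by hand.

```javascript
// Hypothetical helper: builds one structured log line that carries the
// trace context, so logs can later be joined to traces by trace_id.
function formatLog(service, traceContext, level, message, fields = {}) {
  return JSON.stringify({
    // Extra fields go first so they cannot overwrite the fixed ones below.
    ...fields,
    timestamp: new Date().toISOString(),
    level,
    service,
    trace_id: traceContext.traceId,
    span_id: traceContext.spanId,
    message,
  });
}

// Passed in by hand here; normally read from the currently active span.
const ctx = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
};

const line = formatLog("payment-service", ctx, "ERROR", "card declined", {
  order_id: 4821,
});
console.log(line);
```

Because every line carries trace_id and span_id as ordinary fields, the log store can filter on them without any special support.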

Metrics and traces connect through exemplars. An exemplar attaches one representative trace_id to a metric data point. For example, if you attach the trace_id of an actually slow request to the slowest bucket of a latency histogram (the one feeding your p99 panel), you can click the spiking point on a dashboard and jump straight to that trace. It is a bridge from metrics (what) to traces (why).
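
To make the idea concrete, here is a toy histogram that remembers one exemplar per bucket. It is only a sketch of the concept, not how any particular metrics library stores exemplars; the class name, bucket bounds, and the two extra trace IDs are made up for illustration.

```javascript
// Toy latency histogram that keeps one exemplar trace_id per bucket.
// Bucket bounds are in milliseconds and are illustrative only.
class ExemplarHistogram {
  constructor(upperBounds) {
    // A final +Infinity bucket catches everything above the last bound.
    this.upperBounds = [...upperBounds, Infinity];
    this.counts = this.upperBounds.map(() => 0);
    this.exemplars = this.upperBounds.map(() => null);
  }

  // Record one observation and keep its trace_id as the bucket's exemplar.
  observe(valueMs, traceId) {
    const i = this.upperBounds.findIndex((bound) => valueMs <= bound);
    this.counts[i] += 1;
    this.exemplars[i] = { traceId, valueMs };
  }

  // Return the exemplar of the slowest non-empty bucket, or null.
  slowestExemplar() {
    for (let i = this.exemplars.length - 1; i >= 0; i--) {
      if (this.exemplars[i]) return this.exemplars[i];
    }
    return null;
  }
}

const latency = new ExemplarHistogram([100, 500, 2000]);
latency.observe(42, "0af7651916cd43dd8448eb211c80319c");
latency.observe(310, "a3ce929d0e0e47364bf92f3577b34da6");
latency.observe(3100, "4bf92f3577b34da6a3ce929d0e0e4736");

// The dashboard's "jump to trace" link would use this trace_id.
console.log(latency.slowestExemplar());
```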

Making these connections work is half of observability design. No matter how many signals you have, if they aren't linked, you're left with the pain of eyeballing three separate windows whenever a problem hits.

Logs — Loki vs OpenSearch

Where and how you store logs strongly shapes both cost and the search experience. The two representative camps are Grafana Loki and OpenSearch (an open-source fork of Elasticsearch).

Loki — "Prometheus for Logs"

Grafana Loki's philosophy is, in one sentence, "don't index the log body." Loki builds no inverted index over the full text of logs; instead it indexes only the labels. The log body itself is compressed and thrown into cheap object storage (S3, GCS, and the like) as-is. So just as Prometheus treats metrics by labels, Loki treats logs by labels. That's where the nickname "Prometheus for logs" comes from.

The query language is LogQL. You first narrow the log streams by label, then filter text within them.

{app="payment-service", level="error"} |= "card declined" | json | order_id="4821"

Here the leading curly-brace part is the label selector (the indexed part), and everything from |= onward is a pipeline that runs over the log body: a line filter, a JSON parser, and a filter on the parsed order_id field. Thanks to this structure the index stays small, so storage cost is low and operation is light. On the other hand, if you don't narrow enough by label and full-text search a vast time range, it can get slow. Label design determines success or failure.

The trap to watch for is high-cardinality labels. If you use something whose value space explodes — like user_id or trace_id — as a label, the streams shard into millions and Loki degrades sharply. Such values belong in the log body, found via a filter, not as labels.
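
A back-of-the-envelope calculation shows why. Every distinct combination of label values is its own stream, so the worst-case stream count is the product of the per-label value counts. The counts below are invented purely for illustration.

```javascript
// Upper bound on the number of streams: distinct label-value combinations
// multiply, so one unbounded label dominates everything else.
function maxStreams(labelValueCounts) {
  return Object.values(labelValueCounts).reduce((product, n) => product * n, 1);
}

// Hypothetical numbers of distinct values per label.
const bounded = { app: 20, level: 4, region: 3 };
const withUserId = { ...bounded, user_id: 100000 };

console.log(maxStreams(bounded));    // 240
console.log(maxStreams(withUserId)); // 24000000
```

Adding a single user_id label turns a few hundred streams into tens of millions in the worst case, which is exactly the degradation described above.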

OpenSearch — Powerful Full-Text Search

OpenSearch (like Elasticsearch, from which it was forked) takes the opposite approach. It builds an inverted index over nearly every field of a log. This makes complex full-text search, aggregation, and faceted analysis over arbitrary fields fast, and it comes with a powerful exploration UI in OpenSearch Dashboards (the counterpart of Kibana).

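A roughly equivalent query in the OpenSearch query DSL looks like this. It assumes the default dynamic mapping, where string fields get a .keyword subfield for exact matching. Exact-value conditions go in filter, and match_phrase keeps "card declined" together instead of matching either word on its own.

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "service.keyword": "payment-service" } },
        { "term": { "level.keyword": "ERROR" } },
        { "term": { "order_id": 4821 } }
      ],
      "must": [
        { "match_phrase": { "message": "card declined" } }
      ]
    }
  }
}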

The price is weight. Because it indexes everything, it uses a lot of storage and memory, and cluster operation (shards, heap, rebalancing) is hands-on. As log volume grows, infrastructure cost climbs steeply.

When to Choose Which

Loki: natural if you're cost-sensitive, use logs mostly by "narrowing by label and scanning recent ranges," and already live in the Grafana / Prometheus ecosystem.
OpenSearch: powerful if you frequently need strong full-text search and complex aggregation/analysis over arbitrary fields (security log analysis, auditing, SIEM), and can afford the infrastructure cost.

Structured Logging and Log Levels

Whichever storage you pick, there's a shared principle. Logs are better emitted not as human-readable sentences but as structured data (JSON) for machines to parse. Compared to a string like "2026-07-03 payment failed for order 4821", the field-separated JSON above is overwhelmingly better for filtering and aggregation.

Log levels (DEBUG, INFO, WARN, ERROR) are the first dial for controlling noise. And for high-traffic services, sampling is essential. You don't need to keep 100% of the INFO logs of successful requests. For instance, sample only 1% of normal requests while keeping 100% of errors, striking a balance between cost and information.
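
The policy in the previous paragraph can be sketched as one small function. The name shouldKeep, the numeric level values, and the option names are assumptions for this example; the random source is injectable only so the behavior can be checked deterministically.

```javascript
// Numeric ranks so levels can be compared; the values are arbitrary.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

// Decide whether to keep a log record:
// - below minLevel: always dropped
// - WARN and above: always kept
// - everything in between: kept with probability sampleRate
function shouldKeep(level, options = {}) {
  const { minLevel = "INFO", sampleRate = 0.01, random = Math.random } = options;
  if (LEVELS[level] < LEVELS[minLevel]) return false;
  if (LEVELS[level] >= LEVELS.WARN) return true;
  return random() < sampleRate;
}

console.log(shouldKeep("ERROR")); // true
console.log(shouldKeep("DEBUG")); // false
```

Sampling per log line is the simplest variant. If you want all logs of a kept request to survive together, the decision would have to be made once per request (for example from its trace_id) instead.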

Distributed Tracing — OpenTelemetry Becomes the Standard

In microservices, a single user request passes through many services. Reconstructing that journey is distributed tracing, and the de facto standard here is OpenTelemetry (OTel).

Why OpenTelemetry

OpenTelemetry is a vendor-neutral standard under the CNCF. Previously, every observability tool had its own instrumentation SDK, so switching backends meant rewriting all your instrumentation. OTel standardized this instrumentation layer. You instrument your code with OTel once, and decide where the data goes (Jaeger, Tempo, a commercial vendor, etc.) later, by changing only exporter configuration. Breaking lock-in is its core value.

The basic concepts of tracing are these.

Trace: the full journey of one request. Identified by a unique trace_id.
Span: one unit of work that makes up a trace (e.g., "DB query," "call payment API"). It has start and end times.
Parent-child relationship: spans nest. Inside an "order processing" span, "check inventory" and "payment" spans go in as children. This tree gives a trace its shape.
Attributes and events: key-value metadata attached to a span (e.g., http.method, db.system) and timestamped records of things that happened during a span's life.
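
To see how the parent-child relationship gives a trace its shape, here is a small sketch that rebuilds the tree from a flat list of spans. The field names (id, parentId, startMs, endMs) and the timings are invented for this example and do not mirror any backend's storage format.

```javascript
// Flat spans of one trace; parentId is null for the root span.
const spans = [
  { id: "a", parentId: null, name: "order processing", startMs: 0, endMs: 480 },
  { id: "b", parentId: "a", name: "check inventory", startMs: 10, endMs: 60 },
  { id: "c", parentId: "a", name: "payment", startMs: 70, endMs: 470 },
  { id: "d", parentId: "c", name: "call payment API", startMs: 90, endMs: 460 },
];

// Group spans by their parent so the tree can be walked from the root.
function groupByParent(flatSpans) {
  const children = new Map();
  for (const span of flatSpans) {
    if (!children.has(span.parentId)) children.set(span.parentId, []);
    children.get(span.parentId).push(span);
  }
  return children;
}

// Render the tree depth-first as indented "name (duration ms)" lines.
function render(children, parentId = null, depth = 0) {
  const lines = [];
  for (const span of children.get(parentId) ?? []) {
    const duration = span.endMs - span.startMs;
    lines.push(`${"  ".repeat(depth)}${span.name} (${duration} ms)`);
    lines.push(...render(children, span.id, depth + 1));
  }
  return lines;
}

const lines = render(groupByParent(spans));
console.log(lines.join("\n"));
```

The indented output is essentially what a trace UI draws as a waterfall: the payment span accounts for most of the root's duration, and the external API call accounts for most of the payment span.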

Context Propagation — the traceparent Header

When service A calls service B, the trace_id must cross over for the two services' spans to belong to the same trace. The standard for this context propagation is W3C Trace Context — that is, the traceparent header carried on the HTTP request. Its value is roughly the version, trace_id, parent span_id, and flags joined by hyphens.

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
        version  trace_id (128-bit)            parent span_id     flags (sampled)

The receiving service reads this header and attaches its own span as a child under the same trace_id. That's how a trace stays unified across service boundaries.
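
The receiving side can be sketched in a few lines. This is a simplified illustration that accepts only version 00 and ignores the optional tracestate header; the function names are made up, and in practice the OTel SDK's propagator does this work for you.

```javascript
// Parse a version-00 traceparent header into its fields.
// Returns null when the value is malformed.
function parseTraceparent(header) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim()
  );
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid according to the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // The lowest flag bit carries the "sampled" decision.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header for an outgoing call: same trace_id, but the caller's
// own span_id becomes the parent for the next service.
function childTraceparent(incoming, ownSpanId) {
  const flags = incoming.sampled ? "01" : "00";
  return `00-${incoming.traceId}-${ownSpanId}-${flags}`;
}

const incoming = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
// The span ID below is a fixed placeholder; a real service generates a
// fresh random 8-byte ID for every span.
const outgoing = childTraceparent(incoming, "b7ad6b7169203331");
console.log(incoming);
console.log(outgoing);
```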

The OTel Collector and Instrumentation

The OTel Collector is a standalone process placed between your applications and your backends. It receives telemetry sent by many services, processes it (batching, sampling, editing attributes), and exports it to the backends you want. Apps only need to know about the Collector, and you manage backend swaps, multi-destination fan-out, and sampling policy centrally in the Collector.

There are two styles of instrumentation.

Auto-instrumentation: a language agent automatically hooks popular libraries (HTTP servers, DB drivers, gRPC, etc.) and creates spans for you. With almost no code changes, you can get started quickly.
Manual instrumentation: you open spans and attach attributes directly via the SDK. Needed when you want to capture meaningful segments of business logic in detail.

A minimal OTel SDK initialization looks roughly like this (Node.js example).

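// Load this file before any other module so auto-instrumentation can
// patch libraries as they are required.
const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-http");

const sdk = new NodeSDK({
  // Send spans over OTLP/HTTP to the Collector (4318 is the default OTLP/HTTP port).
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  // Hook popular libraries (HTTP, gRPC, DB drivers, ...) without code changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans before the process exits.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});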

Backends — Jaeger and Tempo

Collected traces are stored and queried in a tracing backend.

Jaeger: a mature open-source tracing system that started at Uber. Its own UI offers trace search, dependency graphs, and timelines.
Grafana Tempo: the trace counterpart to Loki. It keeps indexing to a minimum and stores whole traces in cheap object storage, drastically lowering cost. Lookup by trace_id is the primary access path (recent versions add search through TraceQL), and it integrates smoothly with Loki, Prometheus, and Grafana.

Sampling — Head vs Tail

Storing 100% of traces makes volume and cost unmanageable, so you sample.

Head-based sampling: decides "keep this, drop that" immediately at the start of a trace. For example, "keep only 1%." Simple and cheap, but it can drop the very traces that had problems (slow or errored).
Tail-based sampling: decides after the trace finishes, having seen the whole thing. You can set rules like "always keep traces that have an error or are slower than p99," dropping normal traffic while reliably keeping problematic traffic. In exchange, you must buffer until the trace ends, which costs the Collector more load and memory.
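
The difference between the two is easiest to see side by side. The sketch below is a conceptual illustration only, not the algorithm of any specific SDK or Collector processor; the function names, the span fields, and the thresholds are assumptions, and tailSample expects a non-empty list of spans.

```javascript
// Head-based: decide from the trace_id alone, before any span has finished.
// Deriving the decision from the ID (rather than a fresh random number)
// lets every service reach the same verdict for the same trace.
function headSample(traceId, ratio) {
  // Map the last 32 bits of the ID to a number in [0, 1).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}

// Tail-based: decide after the trace is complete and all spans are known.
function tailSample(spans, { slowMs, baselineRatio, random = Math.random }) {
  const hasError = spans.some((span) => span.status === "error");
  const start = Math.min(...spans.map((span) => span.startMs));
  const end = Math.max(...spans.map((span) => span.endMs));
  // Always keep problematic traces; sample the rest at the baseline ratio.
  if (hasError || end - start > slowMs) return true;
  return random() < baselineRatio;
}

const failedTrace = [
  { name: "order", status: "ok", startMs: 0, endMs: 120 },
  { name: "payment", status: "error", startMs: 20, endMs: 110 },
];

console.log(headSample("4bf92f3577b34da6a3ce929d0e0e4736", 0.01));
console.log(tailSample(failedTrace, { slowMs: 1000, baselineRatio: 0.01 }));
```

With a 1% ratio the head-based decision drops this trace even though its payment span failed, while the tail-based rule keeps it. The price is that tailSample can only run once every span of the trace has arrived.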

LLM Observability — Why It Needs Special Tools

Now we move to a new field that has surged recently. LLM applications (chatbots, RAG, agents) aren't well served by traditional observability tools alone. The reasons are clear.

Non-determinism: the same input can yield different output on each call. A binary "success/failure" judgment can't capture quality.
Prompt and version drift: changing a prompt by one line and watching quality collapse is common. You need to track which prompt version produced which result.
Cost: LLM calls are billed per token. You need to see which request, which chain step, eats how many tokens to control cost.
Quality evaluation: even when latency is normal, the answer can be wrong or hallucinated. Correctness must be evaluated separately (evals).
RAG debugging: when an answer is off, you must distinguish whether retrieval fetched the wrong documents or generation ignored the documents.
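
The RAG point can be made concrete with a tiny triage function. It assumes you already have, for each interaction, the retrieved document IDs, a set of IDs judged relevant, and a verdict on the answer (from a human reviewer or an automated eval). Those inputs and the function name are hypothetical, and a real evaluation would be more nuanced than a three-way label.

```javascript
// Classify one evaluated RAG interaction:
// - "ok":         the answer was judged correct
// - "retrieval":  wrong answer and no relevant document was retrieved
// - "generation": wrong answer even though a relevant document was available
function classifyRagResult({ retrievedIds, relevantIds, answerCorrect }) {
  if (answerCorrect) return "ok";
  const hit = retrievedIds.some((id) => relevantIds.includes(id));
  return hit ? "generation" : "retrieval";
}

console.log(
  classifyRagResult({
    retrievedIds: ["doc-12", "doc-45"],
    relevantIds: ["doc-77"],
    answerCorrect: false,
  })
); // "retrieval"
```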

Langfuse — Traces of LLM Calls

Langfuse is an open-source LLM observability platform that tackles this head-on. Conceptually it resembles distributed tracing: it captures one user interaction as one trace and records each step inside as nested spans. But the span content is LLM-specific. Each LLM call carries the prompt, completion, token usage, cost, and latency, and chain, agent, and retrieval steps nest as parent-child.

A heavily simplified shape of a single RAG pipeline trace looks like this.

{
  "trace_id": "t_9f2c",
  "name": "rag-chat",
  "input": "What is your refund policy?",
  "spans": [
    {
      "name": "retrieval",
      "type": "retriever",
      "input": "refund policy",
      "output": ["doc-12", "doc-45"],
      "latency_ms": 82
    },
    {
      "name": "llm-answer",
      "type": "generation",
      "model": "claude-x",
      "prompt_tokens": 1240,
      "completion_tokens": 180,
      "cost_usd": 0.0042,
      "latency_ms": 1310
    }
  ],
  "scores": [{ "name": "helpfulness", "value": 0.9 }]
}
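
Given a trace in this simplified shape, per-step cost and latency fall out of a short aggregation. The sketch below works on the illustrative JSON above, not on Langfuse's actual data model or API, and the function name summarize is an assumption.

```javascript
// The simplified trace from above, reduced to the fields used here.
const trace = {
  name: "rag-chat",
  spans: [
    { name: "retrieval", type: "retriever", latency_ms: 82 },
    {
      name: "llm-answer",
      type: "generation",
      prompt_tokens: 1240,
      completion_tokens: 180,
      cost_usd: 0.0042,
      latency_ms: 1310,
    },
  ],
};

// Roll up tokens, cost, and latency per step and for the whole trace.
function summarize(t) {
  const steps = t.spans.map((span) => ({
    name: span.name,
    latency_ms: span.latency_ms,
    // Non-LLM steps carry no token or cost fields, so default to zero.
    tokens: (span.prompt_tokens ?? 0) + (span.completion_tokens ?? 0),
    cost_usd: span.cost_usd ?? 0,
  }));
  const total = steps.reduce(
    (acc, step) => ({
      latency_ms: acc.latency_ms + step.latency_ms,
      tokens: acc.tokens + step.tokens,
      cost_usd: acc.cost_usd + step.cost_usd,
    }),
    { latency_ms: 0, tokens: 0, cost_usd: 0 }
  );
  // The slowest step is usually the first place to look.
  const slowest = steps.reduce((a, b) => (b.latency_ms > a.latency_ms ? b : a));
  return { steps, total, slowest: slowest.name };
}

const summary = summarize(trace);
console.log(summary.total, summary.slowest);
```

Here the generation step dominates both latency and cost, which is the typical pattern and the reason per-step visibility matters.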

The core features Langfuse gives you are these.

Nested traces: visualizes chains, agents, tool calls, and retrieval as a hierarchy. You see at a glance which tools an agent called and how many times.
Prompt version management: store and deploy prompts by version, and link which version produced which trace.
Evals and scores: attach scores to traces. LLM-as-a-judge automated evaluation, rule-based scores, and human labeling can all live here.
User feedback: connect end-user feedback like thumbs up/down to the relevant trace and use it as a real quality signal.

Other Options — LangSmith, Helicone, Phoenix

Langfuse isn't the only choice.

LangSmith: a commercial platform built by the LangChain team. It integrates tightly with LangChain/LangGraph and offers dense tracing, evaluation, and dataset management.
Helicone: distinguished by a proxy approach. Placing it as a proxy in front of your LLM API lets you add logging, caching, and cost tracking with minimal code changes.
Arize Phoenix: open source, and its strength is being OpenTelemetry-based (OpenInference). On top of tracing and evaluation, it's strong on ML observability like embedding and drift analysis.

The selection criteria are roughly these. If you are LangChain-centric, go with LangSmith; if you want to add observability quickly with minimal code changes, Helicone; if you value open source and the OTel standard, Phoenix or Langfuse feel natural.

Putting It Together — An OpenTelemetry-Centered Stack

Now let's tie the whole thing into one picture. A common setup today puts OpenTelemetry at the hub.

  apps (services) ── OTel SDK ──▶ [ OTel Collector ]
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
        [ Loki (logs) ]        [ Tempo (traces) ]      [ Prometheus (metrics) ]
              └───────────── unified query in Grafana ──────────────┘

  LLM calls ────────────────▶ [ Langfuse (LLM traces, cost, evals) ]

The core principle is the trace_id correlation stressed earlier. Apps are instrumented with OTel and send to the Collector; the Collector fans logs out to Loki, traces to Tempo, and metrics to Prometheus. In Grafana you place all three on one screen and move from log to trace by trace_id, and from metric to trace by exemplar. The LLM part is captured separately by Langfuse, but if you carry the same trace_id, you can link it to your ordinary traces too.

Common Traps — Cost and Cardinality

Cardinality explosion: putting values that grow without bound — user_id, request_id, trace_id — into metric labels or Loki labels blows up the time series and streams and topples your storage. Put high-cardinality values in the log body or trace attributes, and keep only bounded-value things (service name, region, status code) as labels.
Reckless 100% retention: storing every log and trace in full becomes unaffordable. Set a policy early that samples the normal and reliably keeps the problematic (errors, slow requests).
Unconnected signals: if you don't embed the trace_id in logs, the three pillars drift apart. Make sure to inject the trace_id early in instrumentation.
Neglected LLM cost: as prompts grow longer or retries increase, token cost quietly explodes. Continuously watch per-request and per-step cost with a tool like Langfuse.

Decision Checklist

Log storage: Loki for cost- and label-centric use; OpenSearch for powerful full-text search and analysis.
Instrumentation: if starting fresh, go OpenTelemetry from the start. It keeps you free of vendor lock-in.
Trace backend: Tempo for low cost and the Grafana ecosystem; Jaeger for a standalone, mature UI.
Sampling: tail-based if you hate missing problem traces; head-based for simplicity and low cost.
LLM: if you run an LLM app, definitely adopt one of Langfuse / LangSmith / Helicone / Phoenix.
Correlation: whatever you use, embed the trace_id in every log to link the signals.

Conclusion

Observability isn't finished by any single tool. It's the work of weaving the three pillars (logs, metrics, and traces, plus continuous profiling as an emerging fourth) together by trace_id to build "a system that can answer arbitrary questions." For logs, Loki and OpenSearch form the two poles of cost and search power; for tracing, OpenTelemetry unifies the standard while Jaeger and Tempo receive it; and on top, tools like Langfuse fill in the new LLM layer.

The point is not "what is best" but "what fits this system," and above all, "are the signals connected to each other?" If you stand up the three pillars well and bind them firmly by trace_id, then in the next 3 a.m. incident, instead of eyeballing three windows, you can click your way down to the root cause.

References

OpenTelemetry official docs: https://opentelemetry.io/docs/
W3C Trace Context spec: https://www.w3.org/TR/trace-context/
Grafana Loki docs: https://grafana.com/docs/loki/latest/
Grafana Tempo docs: https://grafana.com/docs/tempo/latest/
Jaeger official site: https://www.jaegertracing.io/
OpenSearch docs: https://opensearch.org/docs/latest/
Langfuse docs: https://langfuse.com/docs
Arize Phoenix: https://docs.arize.com/phoenix