LLM Evaluation & Observability: Eval Harness, LLM-as-Judge, Tracing, Regression Prevention (2025)

Season 4 Ep 6 — Ep 1–5 covered "how to build". Ep 6 covers "how do you know it actually works". An LLM product without evaluation is a car driven blindfolded.

Prologue — The End of "LGTM-Driven Development"

Until 2023, many LLM products were built via "Looks Good To Me" development: tweak the prompt, run it a few times, say "oh, seems better", merge. If a regression occurred, it ended with "no idea why".

In 2025, that no longer works:

  1. Product scale: hundreds of millions of calls per month; one regression hits tens of thousands of users.
  2. Team scale: multiple engineers touch the same prompts and models → diffused responsibility.
  3. Competition: one month of inaction lets rivals pull ahead → fast iteration is mandatory, which only evaluation-driven workflows enable.

In short, in LLM products evaluation must be as routine as test automation in software engineering, not a one-off ML evaluation event.


1. Four Layers of Evaluation

1.1 Layer 0 — Structural / Smoke Tests

  • Is the response valid JSON?
  • Are all required fields present?
  • Did it respect length/format constraints?
  • Did the API return 200?

The basics. Must run on every PR in CI.
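
As a minimal sketch, a Layer 0 check can be a plain function run under pytest or any CI step; the required fields and length limit below are placeholders, not a recommended schema.

```python
import json

REQUIRED_FIELDS = {"title", "summary", "tags"}   # placeholder schema
MAX_CHARS = 2000                                 # placeholder length limit

def smoke_check(raw_response: str) -> list[str]:
    """Return a list of structural failures; an empty list means the response passes."""
    failures: list[str] = []
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if len(raw_response) > MAX_CHARS:
        failures.append(f"response exceeds {MAX_CHARS} characters")
    return failures

if __name__ == "__main__":
    print(smoke_check('{"title": "t", "summary": "s", "tags": ["a"]}'))  # []
    print(smoke_check("not json"))  # ['response is not valid JSON']
```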

1.2 Layer 1 — Golden Set Ground-Truth Comparison

  • Classification/extraction with clear answers: Accuracy/F1
  • Natural-language generation: BLEU/ROUGE/METEOR (reference only)
  • Feels like unit tests.
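
For tasks with a single right answer, Layer 1 can stay this small; a sketch using scikit-learn over a hand-written golden set (the labels are purely illustrative).

```python
from sklearn.metrics import accuracy_score, f1_score

# Golden labels and model predictions on the same items (illustrative values).
golden      = ["refund", "shipping", "refund", "other", "shipping"]
predictions = ["refund", "shipping", "other",  "other", "shipping"]

print("accuracy:", accuracy_score(golden, predictions))
print("macro F1:", f1_score(golden, predictions, average="macro"))
```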

1.3 Layer 2 — Quality Judgment

  • Human evaluators or LLM-as-judge
  • Subjective axes: "is this response accurate / useful / safe?"
  • Requires statistical-significance checks.

1.4 Layer 3 — Production Metrics

  • Thumbs up/down, re-ask rate, conversation abandonment
  • Task success rate (conversion / resolution)
  • Latency, cost, error rate — ops metrics

Higher layers give a stronger signal but run less often, cost more, and take longer. Combine them accordingly.


2. Building the Eval Dataset

2.1 Diversify Sources

  • Production log sampling: reflect real distribution
  • Past incident replay: "this must never happen again"
  • Synthetic by difficulty: LLM-generate Easy/Medium/Hard
  • Edge-case only: safety / bias tests

2.2 Size

  • Start: 100–300 is enough
  • Steady state: 1,000–3,000
  • Regression smoke set: 10–30 items that run very fast

2.3 Labeling

  • Easy when a single answer exists
  • When multiple answers are valid, define a rubric:
    • "Correct if it contains ...", "Wrong if it contains ..."
  • Target inter-labeler agreement (Kappa) above 0.7.
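
Kappa is cheap to track; a sketch with scikit-learn's cohen_kappa_score on made-up labels from two annotators.

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same items (illustrative).
annotator_a = ["good", "bad", "good", "good", "bad", "good"]
annotator_b = ["good", "bad", "good", "bad",  "bad", "good"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # flag the rubric for revision if this stays below 0.7
```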

2.4 Splits

  • Train (prompt tuning / few-shot / fine-tune)
  • Dev (iteration during development)
  • Test (final decisions. Never expose to training/tuning)

3. LLM-as-Judge — Blessing and Curse

3.1 Basic Idea

Ask a large model (GPT-4o, Claude 3.5/4, etc.) "is this response correct?" for automated evaluation.

System: You are a fair evaluator.
User:   [question] [response]
        Does this response answer the question correctly? yes/no with a one-line reason.
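
Wired up through the OpenAI Python SDK, that judge prompt might look like the sketch below; the model name and the "starts with yes" parsing are assumptions, not a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_SYSTEM = "You are a fair evaluator."

def judge(question: str, response: str, model: str = "gpt-4o") -> bool:
    """Ask the judge model for a yes/no verdict; returns True if it answers 'yes'."""
    prompt = (
        f"[question]\n{question}\n\n[response]\n{response}\n\n"
        "Does this response answer the question correctly? "
        "Answer yes or no with a one-line reason."
    )
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```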

3.2 Upsides

  • Hundreds of times faster and cheaper than humans
  • Consistent criteria
  • Handles subjective judgment reasonably

3.3 Fatal Pitfalls

  1. Position bias: in "which is better?" pairwise, flipping A/B can flip the verdict.
  2. Length bias: longer answers judged better.
  3. Self-preference: GPT rates GPT-generated text higher (cross-model bias).
  4. Rubric variance: even with the same rubric, scores vary across runs.
  5. Easy-to-game: if the eval prompt is known, models learn to "fool" the judge.

3.4 Calibration Techniques

  • Position swap: evaluate in both A/B orders → accept only if they agree
  • Multiple judges: three different models → majority vote
  • Pairwise > scalar: "A vs B" is more stable than "score out of 10"
  • Chain-of-Thought: require reasoning before verdict
  • Rubric anchoring: fix 10 canonical "clearly good/bad" examples
  • Calibration set: measure judge accuracy on 200 human-labeled items → apply correction
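
A sketch of the first two techniques, position swap and multi-judge majority vote; the pairwise_judge helper (returning "first" or "second") is hypothetical and must be wired to a real LLM client.

```python
from collections import Counter

def pairwise_judge(prompt: str, first: str, second: str, model: str) -> str:
    """Hypothetical judge call: returns 'first' or 'second'. Wire this to your LLM client."""
    raise NotImplementedError

def position_swapped_verdict(prompt: str, a: str, b: str, model: str) -> str | None:
    """Judge with A/B in both orders; return 'a', 'b', or None if the two runs disagree."""
    run1 = pairwise_judge(prompt, a, b, model)   # A shown first
    run2 = pairwise_judge(prompt, b, a, model)   # B shown first
    if run1 == "first" and run2 == "second":
        return "a"
    if run1 == "second" and run2 == "first":
        return "b"
    return None  # position bias detected: discard or escalate to a human

def multi_judge(prompt: str, a: str, b: str, models: list[str]) -> str | None:
    """Majority vote over several judge models, ignoring inconsistent verdicts."""
    votes = [v for m in models
             if (v := position_swapped_verdict(prompt, a, b, m)) is not None]
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > len(votes) / 2 else None
```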

3.5 Keep Humans in the Loop

  • Always review 10–20% by humans
  • Track monthly correlation between judge and human scores
  • If correlation drops below 0.7, revise judge prompt/model
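
Judge-to-human correlation can be tracked with a simple rank correlation; a sketch with scipy on placeholder 1-5 scores.

```python
from scipy.stats import spearmanr

# Scores on the same sampled responses (placeholder values on a 1-5 scale).
human_scores = [5, 4, 4, 2, 1, 3, 5, 2]
judge_scores = [5, 5, 4, 2, 2, 3, 4, 1]

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"spearman rho = {rho:.2f}")
if rho < 0.7:
    print("below threshold: revise the judge prompt or model")
```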

4. Observability — Trace / Span / Metric

4.1 Terminology

  • Trace: full execution flow of one user request
  • Span: one unit within a trace (LLM call, tool call, DB query)
  • Metric: time-series aggregate (QPS, p95 latency, cost/day)

4.2 OpenTelemetry — the De Facto Standard

As of 2025, LLM observability is converging on OpenTelemetry schemas (OpenLLMetry, OpenInference). Instrumenting with OTEL SDKs makes vendor switching trivial.

4.3 LLM-Specific Attributes

In addition to standard trace attributes:

  • gen_ai.system: openai / anthropic / google / ...
  • gen_ai.request.model, gen_ai.response.model
  • gen_ai.usage.input_tokens, output_tokens, cache_read, cache_write
  • gen_ai.request.temperature, top_p
  • gen_ai.prompt (optional, PII-masked): prompt sample
  • gen_ai.response.content (optional): response sample
  • Cost in USD: derived automatically from token counts and model pricing
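
A sketch of attaching these attributes to a span with the OpenTelemetry Python SDK; the console exporter and the usage numbers are placeholders for demonstration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for demonstration; point this at your collector in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-service")

with tracer.start_as_current_span("chat_completion") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... call the model here ...
    span.set_attribute("gen_ai.response.model", "gpt-4o-2024-08-06")
    span.set_attribute("gen_ai.usage.input_tokens", 812)    # placeholder values
    span.set_attribute("gen_ai.usage.output_tokens", 143)
```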

4.4 Aggregate Metrics

  • p50/p95/p99 latency (first_token / total)
  • QPS, error rate, retry rate
  • Cost/day, cost/user, cost/feature
  • Quality: daily trend of eval-set scores

5. Vendor Comparison

5.1 LangSmith

  • Official LangChain product, hosted SaaS.
  • Trace + eval sets + feedback + prompt hub integrated.
  • First choice for LangGraph/LangChain users.

5.2 LangFuse

  • Open source (self-hostable); SaaS also available.
  • Auto-instrumentation for OpenTelemetry / OpenAI SDK.
  • Prompt versioning and eval support.
  • Strong for enterprise self-hosting.

5.3 Arize Phoenix

  • Open source; integrates with Arize production platform.
  • Strong at embedding/drift/RAG retrieval-quality visualization.
  • Local UI you can run instantly.

5.4 Helicone

  • Gateway-style (proxy). Just change the URL instead of adding an SDK.
  • Built-in cost/latency savings (cache, routing).
  • Focus on low-latency observability.

5.5 Weights & Biases Weave

  • W&B's LLM-specific module.
  • Unifies experiments, evaluation, and production tracking.

5.6 Selection Guide

  • LangChain/LangGraph stack → LangSmith
  • Self-hosting required (regulated industry) → LangFuse
  • RAG-heavy, embedding drift matters → Phoenix
  • Gateway for fast rollout → Helicone
  • Research + experiments → W&B Weave
  • Already on Datadog/New Relic → that vendor's LLM extension (OpenLLMetry)

6. RAG Evaluation

6.1 Measure Separately

  • Retrieval: did we find the right docs? (Hit@k, MRR, NDCG)
  • Generation: did we use them well? (Faithfulness, Answer Relevancy)
  • Context Quality: few irrelevant docs? (Context Precision/Recall)

Without separating retrieval vs. generation failures, tuning is guesswork.
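
Once golden doc IDs exist, retrieval metrics need no framework; a sketch of Hit@k and MRR over illustrative data.

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return float(any(doc_id in relevant for doc_id in retrieved[:k]))

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Illustrative runs: (retrieved doc IDs in ranked order, golden doc IDs).
runs = [
    (["d3", "d7", "d1"], {"d1"}),
    (["d2", "d9", "d4"], {"d5"}),
]
print("Hit@3:", sum(hit_at_k(r, g, 3) for r, g in runs) / len(runs))   # 0.5
print("MRR:  ", sum(reciprocal_rank(r, g) for r, g in runs) / len(runs))
```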

6.2 Frameworks like RAGAS

  • Faithfulness: is the answer grounded in the documents?
  • Answer Relevancy: does the answer actually address the question?
  • Context Precision: ratio of retrieved docs that are actual evidence
  • Context Recall: how much of the required evidence did we retrieve?

Open-source frameworks such as RAGAS and DeepEval compute these metrics automatically.
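
To make the context metrics concrete, here is an illustrative computation against golden evidence doc IDs; it follows the definitions above, not RAGAS's actual implementation.

```python
def context_precision(retrieved: list[str], evidence: set[str]) -> float:
    """Share of retrieved docs that are actual evidence."""
    if not retrieved:
        return 0.0
    return sum(doc in evidence for doc in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], evidence: set[str]) -> float:
    """Share of required evidence that was retrieved."""
    if not evidence:
        return 1.0
    return len(evidence & set(retrieved)) / len(evidence)

retrieved = ["d1", "d4", "d9"]   # illustrative retrieval result
evidence  = {"d1", "d2"}         # golden evidence for the question
print(context_precision(retrieved, evidence))  # 1/3
print(context_recall(retrieved, evidence))     # 1/2
```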

6.3 Golden Q&A Set

  • Question, ground-truth doc IDs, expected answer or required keywords.
  • Without this set, RAG tuning is pure guesswork.

7. Agent Evaluation

7.1 Metrics

  • Task success rate: final result correct?
  • Step efficiency: how many steps? (fewer is better, but not too few)
  • Tool selection accuracy: correct tool chosen?
  • Cost / Latency distribution: p50/p95
  • Safety: attempts and successes at forbidden actions
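
A sketch of rolling these up from per-run records; the AgentRun schema is hypothetical and stands in for whatever your harness logs.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    """Hypothetical record of one agent execution, logged by the eval harness."""
    success: bool
    steps: int
    correct_tool_calls: int
    total_tool_calls: int
    cost_usd: float

def summarize(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate task success, step count, tool accuracy, and cost over a batch of runs."""
    n = len(runs)
    return {
        "task_success_rate": sum(r.success for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "tool_accuracy": (
            sum(r.correct_tool_calls for r in runs)
            / max(sum(r.total_tool_calls for r in runs), 1)
        ),
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
    }

print(summarize([
    AgentRun(True, 4, 3, 4, 0.012),    # illustrative runs
    AgentRun(False, 9, 5, 9, 0.031),
]))
```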

7.2 Trajectory Evaluation

Not just the outcome — the trajectory matters.

  • The same result reached via "5 unnecessary tool calls" instead of "2 clean calls" still costs roughly 3x as much to operate.

7.3 Replay-Based Evaluation

  • Replay past runs via LangGraph/LangSmith checkpoints.
  • Simulate new model/prompt on the same trajectory → detect regressions.

8. Wiring Eval into CI/CD

8.1 PR Stage

  • Layer 0 (smoke) 20 items — within 2 min
  • Layer 1 (golden) 100 items — within 10 min
  • Auto-fail if thresholds missed.
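
In CI this can be one script that exits non-zero when thresholds are missed; the suite runners below are stubs standing in for the real harness, and the thresholds are placeholders.

```python
import sys

SMOKE_THRESHOLD = 1.00    # every structural check must pass
GOLDEN_THRESHOLD = 0.85   # placeholder accuracy floor on the golden set

def run_smoke_suite(n_items: int) -> float:
    """Stub: replace with the real Layer 0 suite; returns a pass rate in [0, 1]."""
    return 1.0

def run_golden_suite(n_items: int) -> float:
    """Stub: replace with the real Layer 1 suite; returns golden-set accuracy."""
    return 0.91

def main() -> int:
    smoke = run_smoke_suite(n_items=20)
    golden = run_golden_suite(n_items=100)
    print(f"smoke pass rate: {smoke:.2%}, golden accuracy: {golden:.2%}")
    # A non-zero exit code fails the PR check.
    return 0 if smoke >= SMOKE_THRESHOLD and golden >= GOLDEN_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main())
```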

8.2 Post-Main-Merge

  • Layer 2 (quality, LLM-judge) 500 items — 30 min
  • Push results to Slack/Discord team channel.

8.3 Weekly Full Run

  • Layer 3 (production metrics) weekly report
  • Month-over-month quality / cost / latency changes.

8.4 Shadow & A/B

  • Shadow-call new model/prompt in parallel; log responses (user sees the old response).
  • After collection, compare → promote via A/B.

8.5 Canary

  • Traffic 1–5% → stabilize → 50% → 100%
  • Auto-rollback (p95 latency 2x or quality -5pp).
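
The rollback rule above as a sketch: roll back when p95 latency at least doubles or quality drops 5 or more points against the baseline; metric names and values are placeholders.

```python
def should_rollback(baseline: dict, canary: dict) -> bool:
    """Auto-rollback rule: p95 latency >= 2x baseline, or quality down 5+ points."""
    latency_blown = canary["p95_latency_ms"] >= 2 * baseline["p95_latency_ms"]
    quality_drop = baseline["quality_score"] - canary["quality_score"] >= 5.0
    return latency_blown or quality_drop

baseline = {"p95_latency_ms": 1200, "quality_score": 82.0}   # illustrative values
canary   = {"p95_latency_ms": 2600, "quality_score": 81.0}
print(should_rollback(baseline, canary))  # True: latency more than doubled
```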

9. Safety, Bias, Hallucination

9.1 Safety Benchmarks

  • RealToxicityPrompts, ToxiGen (English)
  • Korean public benches are scarce → custom internal set mandatory.
  • Jailbreak / prompt-injection automation (Garak, PyRIT).

9.2 Bias

  • Gender, age, region, occupation axes: "same question, different answer?"
  • Account for Korea-specific axes (university, military service, region).

9.3 Hallucination

  • Fact-verification benches (FEVER, etc.).
  • In RAG, Faithfulness proxies hallucination.

9.4 Refusal Appropriateness

  • Don't over-refuse legitimate requests.
  • Monitor false-refusal rate.
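
Monitoring over-refusal can start as simply as running a benign prompt set and counting refusals; the keyword check below is a crude stand-in for a real classifier or LLM judge.

```python
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm unable to")  # crude heuristic

def is_refusal(response: str) -> bool:
    """Crude keyword check; in practice use a small classifier or an LLM judge."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def false_refusal_rate(benign_responses: list[str]) -> float:
    """Share of responses to legitimate requests that were refused."""
    return sum(is_refusal(r) for r in benign_responses) / len(benign_responses)

print(false_refusal_rate([
    "Sure, here is a summary of the document.",
    "I'm unable to help with that request.",   # over-refusal of a benign request
]))  # 0.5
```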

10. Incident Response Playbook

10.1 SEV Definitions

  • SEV1: widespread impact, wrong/dangerous answers (immediate)
  • SEV2: 10%+ quality regression (within 6h)
  • SEV3: cost/latency regression, small impact (weekly)

10.2 Response Order

  1. Block: turn off routing to the problematic model/prompt/version.
  2. Isolate: identify which config is at fault (logs / traces).
  3. Mitigate: roll back to last stable version.
  4. Root cause: combine eval set + logs.
  5. Prevent recurrence: permanently add the failure to the eval set.
  6. Postmortem: internal publication within 24–72h.

10.3 Checkpoints

  • Every deploy instantly rollback-able?
  • Traffic gating adjustable per second?
  • Clear on-call ownership?

11. User Feedback Loop

11.1 Collection

  • Thumbs up/down + optional reason (checkbox + free text).
  • Inline buttons get higher engagement than dashboards.
  • Collect right after task completion.

11.2 Usage

  • Thumbs-down cases → eval-set candidates.
  • Thumbs-up cases → DPO / preference data.
  • Repeated same complaint → triggers prompt/RAG tuning.
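
A sketch of routing feedback into those two buckets; the Feedback record is hypothetical and stands in for a feedback row joined with its trace.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Hypothetical feedback record joined with the original trace."""
    prompt: str
    response: str
    thumbs_up: bool
    reason: str | None = None

def split_feedback(rows: list[Feedback]) -> tuple[list[Feedback], list[Feedback]]:
    """Thumbs-down rows become eval-set candidates; thumbs-up rows become preference data."""
    eval_candidates = [r for r in rows if not r.thumbs_up]
    preference_data = [r for r in rows if r.thumbs_up]
    return eval_candidates, preference_data

eval_candidates, preference_data = split_feedback([
    Feedback("q1", "a1", thumbs_up=False, reason="wrong figure"),
    Feedback("q2", "a2", thumbs_up=True),
])
```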

11.3 Privacy

  • Consent for input/output storage, deletion API, retention policy.
  • User-ID hashing, masking of sensitive domains.
  • Comply with Korean law (PIPA, pseudonymization).

12. Korean / Korean-Market Observability

12.1 Per-Language Metrics

  • Same product often differs in Korean vs. English quality.
  • Dashboards must filter by language.

12.2 Korean Eval Resources

  • KMMLU, HAE-RAE, LogicKor, KoBench, Ko-MT-Bench.
  • Internal benches remain the final decision signal.

12.3 Regulation & Audit

  • Finance, healthcare: retain audit logs 5–10 years.
  • Store external API call time/content/response in immutable storage.

12.4 On-prem Observability

  • Self-host LangFuse, Phoenix, OpenTelemetry Collector.
  • Block external egress; internal audit only.

13. Ten Anti-Patterns

13.1 Deciding by "the numbers look good"

Without checking significance (sample size, variance).

13.2 Single-model LLM-judge

Self-preference bias. Use multi-judge + human sampling.

13.3 Eval/training set leakage

Check via hash-based dedup.

13.4 No regression tests in PRs

Regressions reach main before detection. Layer 0/1 mandatory in CI.

13.5 Storing raw PII

Regulatory violation + secondary damage on incident.

13.6 Watching cost, not quality

"Cheaper but satisfaction dropped" goes unnoticed.

13.7 No trace

Agent failures unexplained. Debugging impossible.

13.8 No user feedback

Throwing away the most valuable free signal.

13.9 Exposing the "test" set to tuning

Inflated results, disappointment next quarter.

13.10 No recurrence prevention

A team hitting the same failure three times has no eval system.


14. Pre-Launch Checklist (12 items)

  • Eval Layers 0/1/2/3 defined with execution cadence
  • 300+ eval items, Train/Dev/Test split
  • LLM-judge with position swap + human review
  • OpenTelemetry-based Trace/Span/Metric instrumentation
  • Cost/latency dashboard p50/p95/p99
  • User feedback (thumbs up/down) + reason option
  • Incident playbook + on-call
  • Canary + auto-rollback policy
  • Monthly safety/bias/hallucination benches
  • Privacy policy (log retention, deletion API)
  • Regression tests in CI with alerts on failure
  • Quarterly large-scale human evaluation (qualitative included)

15. Next — Season 4 Ep 7: "The Local LLM Era"

2025 is also the year local LLMs became practical.

  • Models: Llama 3.1/3.3, Qwen2.5/3, Mistral, Gemma 3, Phi
  • Engines: vLLM, TGI, SGLang, llama.cpp, Ollama, LMDeploy
  • Hardware: RTX 4090/5090, H100, Apple Silicon (M3/M4 Ultra)
  • Quantization: INT4/INT8, AWQ, GPTQ, SmoothQuant, EXL2
  • Real use: internal knowledge bots, code assistants, doc processing
  • Privacy-first products: personal data never leaves the boundary
  • Cost/power calculations
  • Korean-language local picks (Solar, Qwen, EXAONE)
  • Real benchmarks (tokens/sec, latency, quality)

The end of "depending on external APIs for everything". We draw a sharp line between when local LLMs make sense and when they don't.

See you next time.


TL;DR: Evaluation and observability are foundational infrastructure for LLM products. Split Layers 0–3 with matching frequency and cost; calibrate LLM-as-judge with position swap, multi-judge, and human review; instrument OpenTelemetry-based Trace/Span/Metric from day one. RAG, agents, and fine-tuning each require distinct evaluations, and incident playbooks plus user feedback loops power continuous improvement. "AI without measurement is a car without a steering wheel."