LLM Evaluation & Observability: Eval Harness, LLM-as-Judge, Tracing, Regression Prevention (2025)

Season 4 Ep 6 — Ep 1–5 covered "how to build". Ep 6 covers "how do you know it actually works". An LLM product without evaluation is a car driven blindfolded.

Prologue — The End of "LGTM-Driven Development"

Until 2023, many LLM products were built via "Looks Good To Me" development: tweak the prompt, run it a few times, say "oh, seems better", merge. If a regression occurred, it ended with "no idea why".

In 2025, that no longer works:

  1. Product scale: hundreds of millions of calls per month; one regression hits tens of thousands of users.
  2. Team scale: multiple engineers touch the same prompts and models → diffused responsibility.
  3. Competition: one month of inaction lets rivals pull ahead → fast iteration is mandatory, which only evaluation-driven workflows enable.

In short, in LLM products evaluation must be as routine as test automation in software engineering, not a one-off ML evaluation event.


1. Four Layers of Evaluation

1.1 Layer 0 — Structural / Smoke Tests

  • Is the response valid JSON?
  • Are all required fields present?
  • Did it respect length/format constraints?
  • Did the API return 200?

The basics. Must run on every PR in CI.
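
As a minimal sketch, a Layer 0 check can be a plain function run under pytest or any CI step; the required fields and length limit below are placeholders, not a recommended schema.

```python
import json

REQUIRED_FIELDS = {"title", "summary", "tags"}   # placeholder schema
MAX_CHARS = 2000                                 # placeholder length limit

def smoke_check(raw_response: str) -> list[str]:
    """Return a list of structural failures; an empty list means the response passes."""
    failures: list[str] = []
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if len(raw_response) > MAX_CHARS:
        failures.append(f"response exceeds {MAX_CHARS} characters")
    return failures

if __name__ == "__main__":
    print(smoke_check('{"title": "t", "summary": "s", "tags": ["a"]}'))  # []
    print(smoke_check("not json"))  # ['response is not valid JSON']
```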

1.2 Layer 1 — Golden Set Ground-Truth Comparison

  • Classification/extraction with clear answers: Accuracy/F1
  • Natural-language generation: BLEU/ROUGE/METEOR (reference only)
  • Feels like unit tests.
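
For tasks with a single right answer, Layer 1 can stay this small; a sketch using scikit-learn over a hand-written golden set (the labels are purely illustrative).

```python
from sklearn.metrics import accuracy_score, f1_score

# Golden labels and model predictions on the same items (illustrative values).
golden      = ["refund", "shipping", "refund", "other", "shipping"]
predictions = ["refund", "shipping", "other",  "other", "shipping"]

print("accuracy:", accuracy_score(golden, predictions))
print("macro F1:", f1_score(golden, predictions, average="macro"))
```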

1.3 Layer 2 — Quality Judgment

  • Human evaluators or LLM-as-judge
  • Subjective axes: "is this response accurate / useful / safe?"
  • Requires statistical-significance checks.

1.4 Layer 3 — Production Metrics

  • Thumbs up/down, re-ask rate, conversation abandonment
  • Task success rate (conversion / resolution)
  • Latency, cost, error rate — ops metrics

Higher layers give a stronger signal but run less often, cost more, and take longer. Combine them accordingly.


2. Building the Eval Dataset

2.1 Diversify Sources

  • Production log sampling: reflect real distribution
  • Past incident replay: "this must never happen again"
  • Synthetic by difficulty: LLM-generate Easy/Medium/Hard
  • Edge-case only: safety / bias tests

2.2 Size

  • Start: 100–300 is enough
  • Steady state: 1,000–3,000
  • Regression smoke set: 10–30 items that run very fast

2.3 Labeling

  • Easy when a single answer exists
  • When multiple answers are valid, define a rubric:
    • "Correct if it contains ...", "Wrong if it contains ..."
  • Target inter-labeler agreement (Kappa) above 0.7.
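
Kappa is cheap to track; a sketch with scikit-learn's cohen_kappa_score on made-up labels from two annotators.

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same items (illustrative).
annotator_a = ["good", "bad", "good", "good", "bad", "good"]
annotator_b = ["good", "bad", "good", "bad",  "bad", "good"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # flag the rubric for revision if this stays below 0.7
```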

2.4 Splits

  • Train (prompt tuning / few-shot / fine-tune)
  • Dev (iteration during development)
  • Test (final decisions. Never expose to training/tuning)

3. LLM-as-Judge — Blessing and Curse

3.1 Basic Idea

Ask a large model (GPT-4o, Claude 3.5/4, etc.) "is this response correct?" for automated evaluation.

System: You are a fair evaluator.
User:   [question] [response]
        Does this response answer the question correctly? yes/no with a one-line reason.
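
Wired up through the OpenAI Python SDK, that judge prompt might look like the sketch below; the model name and the "starts with yes" parsing are assumptions, not a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_SYSTEM = "You are a fair evaluator."

def judge(question: str, response: str, model: str = "gpt-4o") -> bool:
    """Ask the judge model for a yes/no verdict; returns True if it answers 'yes'."""
    prompt = (
        f"[question]\n{question}\n\n[response]\n{response}\n\n"
        "Does this response answer the question correctly? "
        "Answer yes or no with a one-line reason."
    )
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```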

3.2 Upsides

  • Hundreds of times faster and cheaper than humans
  • Consistent criteria
  • Handles subjective judgment reasonably

3.3 Fatal Pitfalls

  1. Position bias: in "which is better?" pairwise, flipping A/B can flip the verdict.
  2. Length bias: longer answers judged better.
  3. Self-preference: GPT rates GPT-generated text higher (cross-model bias).
  4. Rubric variance: even with the same rubric, scores vary across runs.
  5. Easy-to-game: if the eval prompt is known, models learn to "fool" the judge.

3.4 Calibration Techniques

  • Position swap: evaluate in both A/B orders → accept only if they agree
  • Multiple judges: three different models → majority vote
  • Pairwise > scalar: "A vs B" is more stable than "score out of 10"
  • Chain-of-Thought: require reasoning before verdict
  • Rubric anchoring: fix 10 canonical "clearly good/bad" examples
  • Calibration set: measure judge accuracy on 200 human-labeled items → apply correction
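
A sketch of the first two techniques, position swap and multi-judge majority vote; the pairwise_judge helper (returning "first" or "second") is hypothetical and must be wired to a real LLM client.

```python
from collections import Counter

def pairwise_judge(prompt: str, first: str, second: str, model: str) -> str:
    """Hypothetical judge call: returns 'first' or 'second'. Wire this to your LLM client."""
    raise NotImplementedError

def position_swapped_verdict(prompt: str, a: str, b: str, model: str) -> str | None:
    """Judge with A/B in both orders; return 'a', 'b', or None if the two runs disagree."""
    run1 = pairwise_judge(prompt, a, b, model)   # A shown first
    run2 = pairwise_judge(prompt, b, a, model)   # B shown first
    if run1 == "first" and run2 == "second":
        return "a"
    if run1 == "second" and run2 == "first":
        return "b"
    return None  # position bias detected: discard or escalate to a human

def multi_judge(prompt: str, a: str, b: str, models: list[str]) -> str | None:
    """Majority vote over several judge models, ignoring inconsistent verdicts."""
    votes = [v for m in models
             if (v := position_swapped_verdict(prompt, a, b, m)) is not None]
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > len(votes) / 2 else None
```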

3.5 Keep Humans in the Loop

  • Always review 10–20% by humans
  • Track monthly correlation between judge and human scores
  • If correlation drops below 0.7, revise judge prompt/model
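
Judge-to-human correlation can be tracked with a simple rank correlation; a sketch with scipy on placeholder 1-5 scores.

```python
from scipy.stats import spearmanr

# Scores on the same sampled responses (placeholder values on a 1-5 scale).
human_scores = [5, 4, 4, 2, 1, 3, 5, 2]
judge_scores = [5, 5, 4, 2, 2, 3, 4, 1]

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"spearman rho = {rho:.2f}")
if rho < 0.7:
    print("below threshold: revise the judge prompt or model")
```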

4. Observability — Trace / Span / Metric

4.1 Terminology

  • Trace: full execution flow of one user request
  • Span: one unit within a trace (LLM call, tool call, DB query)
  • Metric: time-series aggregate (QPS, p95 latency, cost/day)

4.2 OpenTelemetry — the De Facto Standard

As of 2025, LLM observability is converging on OpenTelemetry schemas (OpenLLMetry, OpenInference). Instrumenting with OTEL SDKs makes vendor switching trivial.

4.3 LLM-Specific Attributes

In addition to standard trace attributes:

  • gen_ai.system: openai / anthropic / google / ...
  • gen_ai.request.model, gen_ai.response.model
  • gen_ai.usage.input_tokens, output_tokens, cache_read, cache_write
  • gen_ai.request.temperature, top_p
  • gen_ai.prompt (optional, PII-masked): prompt sample
  • gen_ai.response.content (optional): response sample
  • Cost in USD: derived automatically from token counts and model pricing
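
A sketch of attaching these attributes to a span with the OpenTelemetry Python SDK; the console exporter and the usage numbers are placeholders for demonstration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for demonstration; point this at your collector in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-service")

with tracer.start_as_current_span("chat_completion") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... call the model here ...
    span.set_attribute("gen_ai.response.model", "gpt-4o-2024-08-06")
    span.set_attribute("gen_ai.usage.input_tokens", 812)    # placeholder values
    span.set_attribute("gen_ai.usage.output_tokens", 143)
```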

4.4 Aggregate Metrics

  • p50/p95/p99 latency (first_token / total)
  • QPS, error rate, retry rate
  • Cost/day, cost/user, cost/feature
  • Quality: daily trend of eval-set scores

5. Vendor Comparison

5.1 LangSmith

  • Official LangChain product, hosted SaaS.
  • Trace + eval sets + feedback + prompt hub integrated.
  • First choice for LangGraph/LangChain users.

5.2 LangFuse

  • Open source (self-hostable); SaaS also available.
  • Auto-instrumentation for OpenTelemetry / OpenAI SDK.
  • Prompt versioning and eval support.
  • Strong for enterprise self-hosting.

5.3 Arize Phoenix

  • Open source; integrates with Arize production platform.
  • Strong at embedding/drift/RAG retrieval-quality visualization.
  • Local UI you can run instantly.

5.4 Helicone

  • Gateway-style (proxy). Just change the URL instead of adding an SDK.
  • Built-in cost/latency savings (cache, routing).
  • Focus on low-latency observability.

5.5 Weights & Biases Weave

  • W&B's LLM-specific module.
  • Unifies experiments, evaluation, and production tracking.

5.6 Selection Guide

  • LangChain/LangGraph stack → LangSmith
  • Self-hosting required (regulated industry) → LangFuse
  • RAG-heavy, embedding drift matters → Phoenix
  • Gateway for fast rollout → Helicone
  • Research + experiments → W&B Weave
  • Already on Datadog/New Relic → that vendor's LLM extension (OpenLLMetry)

6. RAG Evaluation

6.1 Measure Separately

  • Retrieval: did we find the right docs? (Hit@k, MRR, NDCG)
  • Generation: did we use them well? (Faithfulness, Answer Relevancy)
  • Context Quality: few irrelevant docs? (Context Precision/Recall)

Without separating retrieval vs. generation failures, tuning is guesswork.
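
Once golden doc IDs exist, retrieval metrics need no framework; a sketch of Hit@k and MRR over illustrative data.

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return float(any(doc_id in relevant for doc_id in retrieved[:k]))

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Illustrative runs: (retrieved doc IDs in ranked order, golden doc IDs).
runs = [
    (["d3", "d7", "d1"], {"d1"}),
    (["d2", "d9", "d4"], {"d5"}),
]
print("Hit@3:", sum(hit_at_k(r, g, 3) for r, g in runs) / len(runs))   # 0.5
print("MRR:  ", sum(reciprocal_rank(r, g) for r, g in runs) / len(runs))
```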

6.2 Frameworks like RAGAS

  • Faithfulness: is the answer grounded in the documents?
  • Answer Relevancy: does the answer actually address the question?
  • Context Precision: ratio of retrieved docs that are actual evidence
  • Context Recall: how much of the required evidence did we retrieve?

Open-source frameworks such as RAGAS and DeepEval compute these metrics automatically.
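
To make the context metrics concrete, here is an illustrative computation against golden evidence doc IDs; it follows the definitions above, not RAGAS's actual implementation.

```python
def context_precision(retrieved: list[str], evidence: set[str]) -> float:
    """Share of retrieved docs that are actual evidence."""
    if not retrieved:
        return 0.0
    return sum(doc in evidence for doc in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], evidence: set[str]) -> float:
    """Share of required evidence that was retrieved."""
    if not evidence:
        return 1.0
    return len(evidence & set(retrieved)) / len(evidence)

retrieved = ["d1", "d4", "d9"]   # illustrative retrieval result
evidence  = {"d1", "d2"}         # golden evidence for the question
print(context_precision(retrieved, evidence))  # 1/3
print(context_recall(retrieved, evidence))     # 1/2
```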

6.3 Golden Q&A Set

  • Question, ground-truth doc IDs, expected answer or required keywords.
  • Without this set, RAG tuning is pure guesswork.

7. Agent Evaluation

7.1 Metrics

  • Task success rate: final result correct?
  • Step efficiency: how many steps? (fewer is better, but not too few)
  • Tool selection accuracy: correct tool chosen?
  • Cost / Latency distribution: p50/p95
  • Safety: attempts and successes at forbidden actions
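
A sketch of rolling these up from per-run records; the AgentRun schema is hypothetical and stands in for whatever your harness logs.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    """Hypothetical record of one agent execution, logged by the eval harness."""
    success: bool
    steps: int
    correct_tool_calls: int
    total_tool_calls: int
    cost_usd: float

def summarize(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate task success, step count, tool accuracy, and cost over a batch of runs."""
    n = len(runs)
    return {
        "task_success_rate": sum(r.success for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "tool_accuracy": (
            sum(r.correct_tool_calls for r in runs)
            / max(sum(r.total_tool_calls for r in runs), 1)
        ),
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
    }

print(summarize([
    AgentRun(True, 4, 3, 4, 0.012),    # illustrative runs
    AgentRun(False, 9, 5, 9, 0.031),
]))
```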

7.2 Trajectory Evaluation

Not just the outcome — the trajectory matters.

  • The same result reached via "5 unnecessary tool calls" instead of "2 clean calls" still costs roughly 3x as much to operate.

7.3 Replay-Based Evaluation

  • Replay past runs via LangGraph/LangSmith checkpoints.
  • Simulate new model/prompt on the same trajectory → detect regressions.

8. Wiring Eval into CI/CD

8.1 PR Stage

  • Layer 0 (smoke) 20 items — within 2 min
  • Layer 1 (golden) 100 items — within 10 min
  • Auto-fail if thresholds missed.
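
In CI this can be one script that exits non-zero when thresholds are missed; the suite runners below are stubs standing in for the real harness, and the thresholds are placeholders.

```python
import sys

SMOKE_THRESHOLD = 1.00    # every structural check must pass
GOLDEN_THRESHOLD = 0.85   # placeholder accuracy floor on the golden set

def run_smoke_suite(n_items: int) -> float:
    """Stub: replace with the real Layer 0 suite; returns a pass rate in [0, 1]."""
    return 1.0

def run_golden_suite(n_items: int) -> float:
    """Stub: replace with the real Layer 1 suite; returns golden-set accuracy."""
    return 0.91

def main() -> int:
    smoke = run_smoke_suite(n_items=20)
    golden = run_golden_suite(n_items=100)
    print(f"smoke pass rate: {smoke:.2%}, golden accuracy: {golden:.2%}")
    # A non-zero exit code fails the PR check.
    return 0 if smoke >= SMOKE_THRESHOLD and golden >= GOLDEN_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main())
```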

8.2 Post-Main-Merge

  • Layer 2 (quality, LLM-judge) 500 items — 30 min
  • Push results to Slack/Discord team channel.

8.3 Weekly Full Run

  • Layer 3 (production metrics) weekly report
  • Month-over-month quality / cost / latency changes.

8.4 Shadow & A/B

  • Shadow-call new model/prompt in parallel; log responses (user sees the old response).
  • After collection, compare → promote via A/B.

8.5 Canary

  • Traffic 1–5% → stabilize → 50% → 100%
  • Auto-rollback (p95 latency 2x or quality -5pp).
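
The rollback rule above as a sketch: roll back when p95 latency at least doubles or quality drops 5 or more points against the baseline; metric names and values are placeholders.

```python
def should_rollback(baseline: dict, canary: dict) -> bool:
    """Auto-rollback rule: p95 latency >= 2x baseline, or quality down 5+ points."""
    latency_blown = canary["p95_latency_ms"] >= 2 * baseline["p95_latency_ms"]
    quality_drop = baseline["quality_score"] - canary["quality_score"] >= 5.0
    return latency_blown or quality_drop

baseline = {"p95_latency_ms": 1200, "quality_score": 82.0}   # illustrative values
canary   = {"p95_latency_ms": 2600, "quality_score": 81.0}
print(should_rollback(baseline, canary))  # True: latency more than doubled
```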

9. Safety, Bias, Hallucination

9.1 Safety Benchmarks

  • RealToxicityPrompts, ToxiGen (English)
  • Korean public benches are scarce → custom internal set mandatory.
  • Jailbreak / prompt-injection automation (Garak, PyRIT).

9.2 Bias

  • Gender, age, region, occupation axes: "same question, different answer?"
  • Account for Korea-specific axes (university, military service, region).

9.3 Hallucination

  • Fact-verification benches (FEVER, etc.).
  • In RAG, Faithfulness proxies hallucination.

9.4 Refusal Appropriateness

  • Don't over-refuse legitimate requests.
  • Monitor false-refusal rate.
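
Monitoring over-refusal can start as simply as running a benign prompt set and counting refusals; the keyword check below is a crude stand-in for a real classifier or LLM judge.

```python
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm unable to")  # crude heuristic

def is_refusal(response: str) -> bool:
    """Crude keyword check; in practice use a small classifier or an LLM judge."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def false_refusal_rate(benign_responses: list[str]) -> float:
    """Share of responses to legitimate requests that were refused."""
    return sum(is_refusal(r) for r in benign_responses) / len(benign_responses)

print(false_refusal_rate([
    "Sure, here is a summary of the document.",
    "I'm unable to help with that request.",   # over-refusal of a benign request
]))  # 0.5
```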

10. Incident Response Playbook

10.1 SEV Definitions

  • SEV1: widespread impact, wrong/dangerous answers (immediate)
  • SEV2: 10%+ quality regression (within 6h)
  • SEV3: cost/latency regression, small impact (weekly)

10.2 Response Order

  1. Block: turn off routing to the problematic model/prompt/version.
  2. Isolate: identify which config is at fault (logs / traces).
  3. Mitigate: roll back to last stable version.
  4. Root cause: combine eval set + logs.
  5. Prevent recurrence: permanently add the failure to the eval set.
  6. Postmortem: internal publication within 24–72h.

10.3 Checkpoints

  • Every deploy instantly rollback-able?
  • Traffic gating adjustable per second?
  • Clear on-call ownership?

11. User Feedback Loop

11.1 Collection

  • Thumbs up/down + optional reason (checkbox + free text).
  • Inline buttons get higher engagement than dashboards.
  • Collect right after task completion.

11.2 Usage

  • Thumbs-down cases → eval-set candidates.
  • Thumbs-up cases → DPO / preference data.
  • Repeated same complaint → triggers prompt/RAG tuning.
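
A sketch of routing feedback into those two buckets; the Feedback record is hypothetical and stands in for a feedback row joined with its trace.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Hypothetical feedback record joined with the original trace."""
    prompt: str
    response: str
    thumbs_up: bool
    reason: str | None = None

def split_feedback(rows: list[Feedback]) -> tuple[list[Feedback], list[Feedback]]:
    """Thumbs-down rows become eval-set candidates; thumbs-up rows become preference data."""
    eval_candidates = [r for r in rows if not r.thumbs_up]
    preference_data = [r for r in rows if r.thumbs_up]
    return eval_candidates, preference_data

eval_candidates, preference_data = split_feedback([
    Feedback("q1", "a1", thumbs_up=False, reason="wrong figure"),
    Feedback("q2", "a2", thumbs_up=True),
])
```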

11.3 Privacy

  • Consent for input/output storage, deletion API, retention policy.
  • User-ID hashing, masking of sensitive domains.
  • Comply with Korean law (PIPA, pseudonymization).

12. Korean / Korean-Market Observability

12.1 Per-Language Metrics

  • Same product often differs in Korean vs. English quality.
  • Dashboards must filter by language.

12.2 Korean Eval Resources

  • KMMLU, HAE-RAE, LogicKor, KoBench, Ko-MT-Bench.
  • Internal benches remain the final decision signal.

12.3 Regulation & Audit

  • Finance, healthcare: retain audit logs 5–10 years.
  • Store external API call time/content/response in immutable storage.

12.4 On-prem Observability

  • Self-host LangFuse, Phoenix, OpenTelemetry Collector.
  • Block external egress; internal audit only.

13. Ten Anti-Patterns

13.1 Deciding by "the numbers look good"

Without checking significance (sample size, variance).

13.2 Single-model LLM-judge

Self-preference bias. Use multi-judge + human sampling.

13.3 Eval/training set leakage

Check via hash-based dedup.

13.4 No regression tests in PRs

Regressions reach main before detection. Layer 0/1 mandatory in CI.

13.5 Storing raw PII

Regulatory violation + secondary damage on incident.

13.6 Watching cost, not quality

"Cheaper but satisfaction dropped" goes unnoticed.

13.7 No trace

Agent failures unexplained. Debugging impossible.

13.8 No user feedback

Throwing away the most valuable free signal.

13.9 Exposing the "test" set to tuning

Inflated results, disappointment next quarter.

13.10 No recurrence prevention

A team hitting the same failure three times has no eval system.


14. Pre-Launch Checklist (12 items)

  • Eval Layers 0/1/2/3 defined with execution cadence
  • 300+ eval items, Train/Dev/Test split
  • LLM-judge with position swap + human review
  • OpenTelemetry-based Trace/Span/Metric instrumentation
  • Cost/latency dashboard p50/p95/p99
  • User feedback (thumbs up/down) + reason option
  • Incident playbook + on-call
  • Canary + auto-rollback policy
  • Monthly safety/bias/hallucination benches
  • Privacy policy (log retention, deletion API)
  • Regression tests in CI with alerts on failure
  • Quarterly large-scale human evaluation (qualitative included)

15. Next — Season 4 Ep 7: "The Local LLM Era"

2025 is also the year local LLMs became practical.

  • Models: Llama 3.1/3.3, Qwen2.5/3, Mistral, Gemma 3, Phi
  • Engines: vLLM, TGI, SGLang, llama.cpp, Ollama, LMDeploy
  • Hardware: RTX 4090/5090, H100, Apple Silicon (M3/M4 Ultra)
  • Quantization: INT4/INT8, AWQ, GPTQ, SmoothQuant, EXL2
  • Real use: internal knowledge bots, code assistants, doc processing
  • Privacy-first products: personal data never leaves the boundary
  • Cost/power calculations
  • Korean-language local picks (Solar, Qwen, EXAONE)
  • Real benchmarks (tokens/sec, latency, quality)

The end of "depending on external APIs for everything". We draw a sharp line between when local LLMs make sense and when they don't.

See you next time.


TL;DR: Evaluation and observability are foundational infrastructure for LLM products. Split Layers 0–3 with matching frequency and cost; calibrate LLM-as-judge with position swap, multi-judge, and human review; instrument OpenTelemetry-based Trace/Span/Metric from day one. RAG, agents, and fine-tuning each require distinct evaluations, and incident playbooks plus user feedback loops power continuous improvement. "AI without measurement is a car without a steering wheel."