LLM Evaluation & Observability: Eval Harness, LLM-as-Judge, Tracing, Regression Prevention (2025)

Season 4 Ep 6 — Ep 1–5 covered "how to build". Ep 6 covers "how do you know it actually works". An LLM product without evaluation is a car driven blindfolded.

Prologue — The End of "LGTM-Driven Development"
1. Four Layers of Evaluation
2. Building the Eval Dataset
3. LLM-as-Judge — Blessing and Curse
4. Observability — Trace / Span / Metric
5. Vendor Comparison
6. RAG Evaluation
7. Agent Evaluation
8. Wiring Eval into CI/CD
9. Safety, Bias, Hallucination
10. Incident Response Playbook
11. User Feedback Loop
12. Korean / Korean-Market Observability
13. Ten Anti-Patterns
14. Pre-Launch Checklist (12 items)
15. Next — Season 4 Ep 7: "The Local LLM Era"

Prologue — The End of "LGTM-Driven Development"

Until 2023, many LLM products were built via "Looks Good To Me" development: tweak the prompt, run it a few times, say "oh, seems better", merge. If a regression occurred, it ended with "no idea why".

In 2025, that no longer works, for three reasons:

Product scale: hundreds of millions of calls per month; one regression hits tens of thousands of users.
Team scale: multiple engineers touch the same prompts and models → diffused responsibility.
Competition: one month of inaction lets rivals pull ahead → fast iteration is mandatory, which only evaluation-driven workflows enable.

In short, evaluation in LLM products must be as routine as test automation in software engineering, not a one-off ML evaluation event. This post lays out how to get there.

1. Four Layers of Evaluation

1.1 Layer 0 — Structural / Smoke Tests

Is the response valid JSON?
Are all required fields present?
Did it respect length/format constraints?
Did the API return 200?

The basics. Must run on every PR in CI.
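
A minimal sketch of the checks above, assuming a response schema with `answer` and `sources` fields and a 2,000-character limit. Both are hypothetical; substitute your own contract.

```python
import json

REQUIRED_FIELDS = ("answer", "sources")  # hypothetical schema
MAX_ANSWER_CHARS = 2000                  # hypothetical length limit


def smoke_check(raw_response: str, status_code: int = 200) -> list[str]:
    """Return a list of structural failures; an empty list means the response passed."""
    failures = []
    if status_code != 200:
        failures.append(f"unexpected status {status_code}")
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        failures.append("response is not valid JSON")
        return failures
    if not isinstance(payload, dict):
        failures.append("top-level JSON value is not an object")
        return failures
    for field in REQUIRED_FIELDS:
        if field not in payload:
            failures.append(f"missing field: {field}")
    answer = payload.get("answer")
    if isinstance(answer, str) and len(answer) > MAX_ANSWER_CHARS:
        failures.append("answer exceeds length limit")
    return failures
```

Returning a list of failures rather than a boolean keeps the CI log readable when several constraints break at once.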

1.2 Layer 1 — Golden Set Ground-Truth Comparison

Classification/extraction with clear answers: Accuracy/F1
Natural-language generation: BLEU/ROUGE/METEOR (reference only)
Feels like unit tests.
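
For the classification case, accuracy and macro-averaged F1 need nothing beyond the standard library. This is one possible sketch; the label names in the usage note are made up.

```python
def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of items where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold: list[str], pred: list[str], label: str) -> float:
    """F1 for one label, treating it as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over the labels present in the gold set."""
    labels = sorted(set(gold))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)
```

With gold labels `["spam", "ham", "spam", "ham"]` and predictions `["spam", "spam", "spam", "ham"]`, accuracy is 0.75 while macro F1 is lower, because the missed `ham` item is weighted per class.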

1.3 Layer 2 — Quality Judgment

Human evaluators or LLM-as-judge
Subjective axes: "is this response accurate / useful / safe?"
Requires statistical-significance checks.
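
One way to run the significance check, assuming each eval item gets a pass/fail verdict for both the baseline and the candidate: an exact McNemar test on the items where the two disagree. The choice of test is an assumption of this sketch, not something the layer definition prescribes.

```python
from math import comb


def mcnemar_exact_p(baseline_pass: list[bool], candidate_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results on the same items."""
    # Only discordant pairs carry information about which variant is better.
    only_baseline = sum(b and not c for b, c in zip(baseline_pass, candidate_pass))
    only_candidate = sum(c and not b for b, c in zip(baseline_pass, candidate_pass))
    n = only_baseline + only_candidate
    if n == 0:
        return 1.0
    k = min(only_baseline, only_candidate)
    # Binomial tail under the null hypothesis that either direction is equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A candidate that fixes 9 items and breaks 1 yields p ≈ 0.021, while one that fixes 5 and breaks 0 yields p = 0.0625: a clean sweep on too few items is still weak evidence.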

1.4 Layer 3 — Production Metrics

Thumbs up/down, re-ask rate, conversation abandonment
Task success rate (conversion / resolution)
Latency, cost, error rate — ops metrics

Higher layers give a stronger signal but are slower, costlier, and run less often, so no single layer is enough. Combine them.

2. Building the Eval Dataset

2.1 Diversify Sources

Production log sampling: reflect real distribution
Past incident replay: "this must never happen again"
Synthetic by difficulty: LLM-generate Easy/Medium/Hard
Edge-case only: safety / bias tests

2.2 Size

Start: 100–300 is enough
Steady state: 1,000–3,000
Regression smoke: 10–30 items that run very fast

2.3 Labeling

Easy when a single answer exists
When multiple answers are valid, define a rubric:
- "Correct if it contains ...", "Wrong if it contains ..."
Target inter-labeler agreement (Kappa) above 0.7.
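
For two labelers, the agreement figure can be computed as Cohen's kappa. The sketch assumes exactly two labelers and categorical labels; more labelers would need a different statistic.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: probability both labelers pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Two labelers who agree on 3 of 4 items can still land at kappa 0.5, well below the 0.7 target, once chance agreement is subtracted.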

2.4 Splits

Train (prompt tuning / few-shot / fine-tune)
Dev (iteration during development)
Test (final decisions. Never expose to training/tuning)
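
A sketch of one way to keep the splits stable over time: derive the split from a hash of the example ID, so an item never migrates from Test into Train when the dataset grows. The 60/20/20 ratio and the ID format are assumptions.

```python
import hashlib


def assign_split(example_id: str, train: float = 0.6, dev: float = 0.2) -> str:
    """Map an example ID to train/dev/test deterministically via its hash."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"
```

Because the assignment depends only on the ID, re-running the split on a larger dataset leaves every existing item where it was.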

3. LLM-as-Judge — Blessing and Curse

3.1 Basic Idea

Ask a large model (GPT-4o, Claude 3.5/4, etc.) "is this response correct?" for automated evaluation.

System: You are a fair evaluator.
User:   [question] [response]
        Does this response answer the question correctly? yes/no with a one-line reason.

3.2 Upsides

Hundreds of times faster and cheaper than humans
Consistent criteria
Handles subjective judgment reasonably

3.3 Fatal Pitfalls

Position bias: in "which is better?" pairwise, flipping A/B can flip the verdict.
Length bias: longer answers judged better.
Self-preference: a judge tends to rate text from its own model family higher (for example, GPT rating GPT-generated text).
Rubric variance: even with the same rubric, scores vary across runs.
Easy-to-game: if the eval prompt is known, models learn to "fool" the judge.

3.4 Calibration Techniques

Position swap: evaluate in both A/B orders → accept only if they agree
Multiple judges: three different models → majority vote
Pairwise > scalar: "A vs B" is more stable than "score out of 10"
Chain-of-Thought: require reasoning before verdict
Rubric anchoring: fix 10 canonical "clearly good/bad" examples
Calibration set: measure judge accuracy on 200 human-labeled items → apply correction
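
Position swap and multi-judge voting can be sketched together. The `Judge` interface below is hypothetical: any callable that takes a question and two answers and returns `"first"` or `"second"`. In practice it would wrap an LLM call.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> Optional[str]:
    """Ask twice with the A/B order flipped; return "A" or "B" only if both runs agree."""
    forward = "A" if judge(question, answer_a, answer_b) == "first" else "B"
    backward = "B" if judge(question, answer_b, answer_a) == "first" else "A"
    return forward if forward == backward else None


def majority_verdict(judges: list[Judge], question: str, answer_a: str, answer_b: str) -> Optional[str]:
    """Majority vote over judges, counting only position-consistent verdicts."""
    verdicts = [swapped_verdict(j, question, answer_a, answer_b) for j in judges]
    votes = Counter(v for v in verdicts if v is not None)
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between A and B
    return ranked[0][0]
```

A judge that always answers `"first"` produces `None` from `swapped_verdict`, so pure position bias is discarded instead of being counted as a vote.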

3.5 Keep Humans in the Loop

Always have humans review 10–20% of judged samples
Track monthly correlation between judge and human scores
If correlation drops below 0.7, revise judge prompt/model
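
The monthly check can be as small as a Pearson correlation over items scored by both the judge and a human. Pearson is an assumption here; a rank correlation may suit ordinal scores better.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side has no variance")
    return cov / sqrt(var_x * var_y)


def judge_needs_revision(judge_scores: list[float], human_scores: list[float],
                         threshold: float = 0.7) -> bool:
    """True when judge/human correlation falls below the revision threshold."""
    return pearson(judge_scores, human_scores) < threshold
```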

4. Observability — Trace / Span / Metric

4.1 Terminology

Trace: the full execution flow of one user request, spanning multiple services and steps
Span: one unit within a trace (LLM call, tool call, DB query)
Metric: time-series aggregate (QPS, p95 latency, cost/day)

4.2 OpenTelemetry — the De Facto Standard

As of 2025, LLM observability is converging on OpenTelemetry-based conventions (OpenLLMetry, OpenInference). Whichever vendor you use, instrumenting with the OTel SDKs makes it easier to migrate later or run several backends side by side.

4.3 LLM-Specific Attributes

In addition to standard trace attributes:

gen_ai.system: openai / anthropic / google / ...
gen_ai.request.model, gen_ai.response.model
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and cache read/write token counts where the provider reports them
gen_ai.request.temperature, gen_ai.request.top_p
gen_ai.prompt (optional, PII-masked): prompt sample
gen_ai.response.content (optional): response sample
Cost: auto-computed in USD
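
A sketch of assembling these attributes for one LLM call, including the cost calculation. The price table and the `app.llm.cost_usd` key are hypothetical; cost is not a standard `gen_ai.*` attribute, so it needs a name of your own.

```python
# Hypothetical per-million-token prices in USD; replace with your provider's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 1.00, "output": 4.00}}


def llm_span_attributes(system: str, model: str, input_tokens: int,
                        output_tokens: int, temperature: float) -> dict:
    """Build a flat attribute dict for one LLM call, with cost derived from token usage."""
    price = PRICE_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.request.temperature": temperature,
        "app.llm.cost_usd": round(cost, 6),  # custom key, not part of any convention
    }
```

The resulting dict is flat with primitive values, so it can be attached to a span as-is. Prompt and response text are deliberately left out, since those should only be recorded after PII masking.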

4.4 Aggregate Metrics

p50/p95/p99 latency (first_token / total)
QPS, error rate, retry rate
Cost/day, cost/user, cost/feature
Quality: daily trend of eval-set scores
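
The latency percentiles can be computed with the nearest-rank method. This is a sketch for offline analysis; a metrics backend would normally compute them from histograms.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict:
    """p50/p95/p99 over a list of latency samples in milliseconds."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Tracking p95 and p99 next to p50 matters because a handful of slow calls barely moves the median but dominates the tail.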

5. Vendor Comparison

5.1 LangSmith

Official LangChain product, hosted SaaS.
Trace + eval sets + feedback + prompt hub integrated.
First choice for LangGraph/LangChain users.

5.2 LangFuse

Open source (self-hostable); SaaS also available.
Auto-instrumentation for OpenTelemetry / OpenAI SDK.
Prompt versioning and eval support.
Strong for enterprise self-hosting.

5.3 Arize Phoenix

Open source; integrates with Arize production platform.
Strong at embedding/drift/RAG retrieval-quality visualization.
Local UI you can run instantly.

5.4 Helicone

Gateway-style (proxy). Just change the URL instead of adding an SDK.
Built-in cost/latency savings (cache, routing).
Focus on low-latency observability.

5.5 Weights & Biases Weave

W&B's LLM-specific module.
Unifies experiments, evaluation, and production tracking.

5.6 Selection Guide

Situation	Pick
LangChain/LangGraph stack	LangSmith
Self-hosting required (regulated)	LangFuse
RAG-heavy, embedding drift matters	Phoenix
Gateway for fast rollout	Helicone
Research + experiments	W&B Weave
Already on Datadog/New Relic	That vendor's LLM monitoring, fed by OpenTelemetry instrumentation such as OpenLLMetry

6. RAG Evaluation

6.1 Measure Separately

Retrieval: did we find the right docs? (Hit@k, MRR, NDCG)
Generation: did we use them well? (Faithfulness, Answer Relevancy)
Context Quality: few irrelevant docs? (Context Precision/Recall)

Without separating retrieval vs. generation failures, tuning is guesswork.
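
The retrieval-side metrics are easy to compute once each query has a ranked list of retrieved doc IDs and a set of relevant IDs. A minimal sketch with binary relevance; the doc IDs in the usage note are invented.

```python
import math


def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG@k with binary relevance: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For `retrieved = ["d3", "d1", "d7"]` and `relevant = {"d1"}`, Hit@1 is 0.0, Hit@3 is 1.0, and the reciprocal rank is 0.5: the right doc was found, just not first.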

6.2 Frameworks like RAGAS

Faithfulness: is the answer grounded in the documents?
Answer Relevancy: does the answer actually address the question?
Context Precision: ratio of retrieved docs that are actual evidence
Context Recall: how much of the required evidence did we retrieve?

Open-source RAGAS, DeepEval compute these metrics automatically.
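
When ground-truth doc IDs are available, the two context metrics can be approximated by plain set overlap. Note the assumption: this is an ID-based simplification for illustration, not how RAGAS or DeepEval compute their scores.

```python
def context_precision(retrieved_ids: list[str], evidence_ids: set[str]) -> float:
    """Share of retrieved docs that are actual evidence for the answer."""
    if not retrieved_ids:
        return 0.0
    return sum(doc in evidence_ids for doc in retrieved_ids) / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], evidence_ids: set[str]) -> float:
    """Share of required evidence docs that were retrieved."""
    if not evidence_ids:
        return 1.0  # nothing was required, so nothing was missed
    return len(set(retrieved_ids) & evidence_ids) / len(evidence_ids)
```

Low precision with high recall points to noisy retrieval (too many docs); the reverse points to missing evidence.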

6.3 Golden Q&A Set

Question, ground-truth doc IDs, expected answer or required keywords.
Without this set, RAG tuning is pure guesswork.

7. Agent Evaluation

7.1 Metrics

Task success rate: final result correct?
Step efficiency: how many steps? (fewer is better, but too few can mean the agent cut corners)
Tool selection accuracy: correct tool chosen?
Cost / Latency distribution: p50/p95
Safety: attempts and successes at forbidden actions

7.2 Trajectory Evaluation

Not just the outcome — the trajectory matters.

The same result reached through "5 unnecessary tool calls" rather than "2 clean calls" can cost roughly 3x as much to operate.
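
A sketch of scoring a trajectory against a reference tool sequence. The `Trajectory` shape and the idea of a per-task reference sequence are assumptions; real agent traces carry more detail (arguments, intermediate outputs).

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    tool_calls: list[str]      # tools the agent actually called, in order
    expected_tools: list[str]  # reference tool sequence from the golden set
    succeeded: bool            # whether the final result was correct


def tool_selection_accuracy(t: Trajectory) -> float:
    """Fraction of calls that used a tool from the reference set."""
    if not t.tool_calls:
        return 0.0
    expected = set(t.expected_tools)
    return sum(call in expected for call in t.tool_calls) / len(t.tool_calls)


def step_overhead(t: Trajectory) -> float:
    """Actual steps divided by reference steps; 1.0 means as efficient as the reference."""
    return len(t.tool_calls) / len(t.expected_tools)
```

A run can score 1.0 on tool selection and still show a high step overhead, which is exactly the case a result-only metric hides.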

7.3 Replay-Based Evaluation

Replay past runs via LangGraph/LangSmith checkpoints.
Simulate new model/prompt on the same trajectory → detect regressions.

8. Wiring Eval into CI/CD

8.1 PR Stage

Layer 0 (smoke) 20 items — within 2 min
Layer 1 (golden) 100 items — within 10 min
Auto-fail if thresholds missed.
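
The auto-fail step can be a small gate script whose exit code the CI system reads. The metric names and thresholds in the usage note are hypothetical.

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that falls below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name, 0.0)  # a missing metric counts as a failure
        if value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def exit_code(failures: list[str]) -> int:
    """0 when the gate passes, 1 otherwise, so the CI job fails automatically."""
    return 1 if failures else 0
```

In a CI job this would end with something like `sys.exit(exit_code(gate(scores, {"smoke_pass_rate": 1.0, "golden_accuracy": 0.9})))`, after printing the failure messages.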

8.2 Post-Main-Merge

Layer 2 (quality, LLM-judge) 500 items — 30 min
Push results to Slack/Discord team channel.

8.3 Weekly Full Run

Layer 3 (production metrics) weekly report
Month-over-month quality / cost / latency changes.

8.4 Shadow & A/B

Shadow-call new model/prompt in parallel; log responses (user sees the old response).
After collection, compare → promote via A/B.

8.5 Canary

Traffic 1–5% → stabilize → 50% → 100%
Auto-rollback if p95 latency doubles or quality drops by 5 percentage points.
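
The rollback rule reduces to a small decision function. The sketch assumes quality is expressed in percent (0–100) so that a drop can be read directly in percentage points.

```python
def should_rollback(baseline_p95_ms: float, canary_p95_ms: float,
                    baseline_quality: float, canary_quality: float,
                    latency_factor: float = 2.0, quality_drop_pp: float = 5.0) -> bool:
    """True when the canary breaches either the latency or the quality guardrail."""
    latency_breach = canary_p95_ms >= latency_factor * baseline_p95_ms
    quality_breach = (baseline_quality - canary_quality) >= quality_drop_pp
    return latency_breach or quality_breach
```

Evaluating this on every metrics window during the 1–5% stage keeps the blast radius small if the new version misbehaves.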

9. Safety, Bias, Hallucination

9.1 Safety Benchmarks

RealToxicityPrompts, ToxiGen (English)
Korean public benches are scarce → custom internal set mandatory.
Jailbreak / prompt-injection automation (Garak, PyRIT).

9.2 Bias

Gender, age, region, occupation axes: "same question, different answer?"
Account for Korea-specific axes (university, military service, place of origin).

9.3 Hallucination

Fact-verification benches (FEVER, etc.).
In RAG, Faithfulness proxies hallucination.

9.4 Refusal Appropriateness

Also measure whether the model over-refuses legitimate requests.
Monitor false-refusal rate.
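
Given records labeled with whether the request was legitimate and whether the model refused, the false-refusal rate is a simple ratio. How those two labels are produced (human review, a classifier) is outside this sketch.

```python
def false_refusal_rate(records: list[tuple[bool, bool]]) -> float:
    """Share of legitimate requests that were refused.

    Each record is (is_legitimate, was_refused).
    """
    refusals_on_legit = [refused for is_legit, refused in records if is_legit]
    if not refusals_on_legit:
        return 0.0
    return sum(refusals_on_legit) / len(refusals_on_legit)
```

Refusals of illegitimate requests are excluded on purpose: those are the refusals you want.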

10. Incident Response Playbook

10.1 SEV Definitions

SEV1: widespread impact, wrong/dangerous answers (immediate)
SEV2: 10%+ quality regression (within 6h)
SEV3: cost/latency regression, small impact (weekly)

10.2 Response Order

Block: turn off routing to the problematic model/prompt/version.
Isolate: identify which config is at fault (logs / traces).
Mitigate: roll back to last stable version.
Root cause: combine the eval set and logs to pin down the cause.
Prevent recurrence: permanently add the failure to the eval set.
Postmortem: internal publication within 24–72h.

10.3 Checkpoints

Can every deploy be rolled back instantly?
Can traffic gating be adjusted within seconds?
Is it clear who has the authority to act (on-call)?

11. User Feedback Loop

11.1 Collection

Thumbs up/down + optional reason (checkbox + free text).
Inline buttons get higher engagement than dashboards.
Collect right after task completion.

11.2 Usage

Thumbs-down cases → eval-set candidates.
Thumbs-up cases → DPO / preference data.
Repeated same complaint → triggers prompt/RAG tuning.

11.3 Privacy

Consent for input/output storage, deletion API, retention policy.
User-ID hashing, masking of sensitive domains.
Comply with Korean law (PIPA, pseudonymization).
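
A sketch of the hashing and masking steps. The regexes are illustrative only (email addresses and Korean mobile numbers) and will miss many PII forms; the keyed hash assumes a secret stored outside the logging pipeline. None of this is a statement about what the law requires.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KR_MOBILE_RE = re.compile(r"\b01[016789]-?\d{3,4}-?\d{4}\b")


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Keyed hash of the user ID, so raw IDs cannot be recovered by brute-forcing a plain hash."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens before storage."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return KR_MOBILE_RE.sub("[PHONE]", text)
```

Masking should run before the text reaches the trace store, not as a cleanup job afterwards.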

12. Korean / Korean-Market Observability

12.1 Per-Language Metrics

The same product often shows different quality in Korean and English.
Dashboards must filter by language.

12.2 Korean Eval Resources

KMMLU, HAE-RAE, LogicKor, KoBench, Ko-MT-Bench.
Internal benches remain the final decision signal.

12.3 Regulation & Audit

Finance, healthcare: retain audit logs 5–10 years.
Store external API call time/content/response in immutable storage.

12.4 On-prem Observability

Self-host LangFuse, Phoenix, OpenTelemetry Collector.
Block external egress; internal audit only.

13. Ten Anti-Patterns

13.1 Deciding by "the numbers look good"

Without checking significance (sample size, variance).

13.2 Single-model LLM-judge

Self-preference bias. Use multi-judge + human sampling.

13.3 Eval/training set leakage

Check via hash-based dedup.
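
A minimal sketch of such a check: hash a normalized form of each text and flag test items whose hash also appears in the training data. The normalization (lowercasing, whitespace collapsing) is an assumption, and paraphrased duplicates will slip through.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(train_texts: list[str], test_texts: list[str]) -> list[str]:
    """Return the test items that also appear (after normalization) in the training data."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in test_texts if fingerprint(t) in train_hashes]
```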

13.4 No regression tests in PRs

Regressions reach main before detection. Layer 0/1 mandatory in CI.

13.5 Storing raw PII

Regulatory violation + secondary damage on incident.

13.6 Watching cost, not quality

"Cheaper but satisfaction dropped" goes unnoticed.

13.7 No trace

Agent failures unexplained. Debugging impossible.

13.8 No user feedback

Throwing away the most valuable free signal.

13.9 Exposing the "test" set to tuning

Inflated results, disappointment next quarter.

13.10 No recurrence prevention

A team hitting the same failure three times has no eval system.

14. Pre-Launch Checklist (12 items)

15. Next — Season 4 Ep 7: "The Local LLM Era"

2025 is also the year local LLMs became practical.

Models: Llama 3.1/3.3, Qwen2.5/3, Mistral, Gemma 3, Phi
Engines: vLLM, TGI, SGLang, llama.cpp, Ollama, LMDeploy
Hardware: RTX 4090/5090, H100, Apple Silicon (M3/M4 Ultra)
Quantization: INT4/INT8, AWQ, GPTQ, SmoothQuant, EXL2
Real use: internal knowledge bots, code assistants, doc processing
Privacy-first products: personal data never leaves the boundary
Cost/power calculations
Korean-language local picks (Solar, Qwen, EXAONE)
Real benchmarks (tokens/sec, latency, quality)

The end of "depending on external APIs for everything". We draw a sharp line between when local LLMs make sense and when they don't.

See you next time.

TL;DR: Evaluation and observability are foundational infrastructure for LLM products. Split Layers 0–3 with matching frequency and cost; calibrate LLM-as-judge with position swap, multi-judge, and human review; instrument OpenTelemetry-based Trace/Span/Metric from day one. RAG, agents, and fine-tuning each require distinct evaluations, and incident playbooks plus user feedback loops power continuous improvement. "AI without measurement is a car without a steering wheel."