Complete Guide to LLM Evaluation and Benchmarking: MMLU, MT-Bench, RAGAS, LM-Eval

Complete Guide to LLM Evaluation and Benchmarking

Answering "which LLM is best?" in one line is difficult. A model that excels at math may be weak at creative writing, and a model with strong English performance may fall short in Korean. This guide systematically covers how to evaluate LLMs properly, from standard benchmarks to RAG system evaluation and production monitoring.


1. The Challenges of LLM Evaluation

No Single Metric Is Sufficient

An LLM must exhibit dozens of capabilities at once.

  • Knowledge: memorizing and retrieving factual information
  • Reasoning: logical inference, math, coding
  • Language: grammar, style, multilingual support
  • Instruction following: complying with user directives
  • Safety: refusing harmful content

No single metric can represent all of these capabilities.
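As a toy illustration with invented numbers, two models can share an identical average while differing sharply on individual capabilities:

```python
# Hypothetical per-capability scores (invented for illustration)
model_a = {"knowledge": 90, "reasoning": 50, "coding": 70}
model_b = {"knowledge": 70, "reasoning": 70, "coding": 70}

def mean_score(scores: dict) -> float:
    """Collapse a capability profile into a single number."""
    return sum(scores.values()) / len(scores)

# Both average 70.0, yet model_a is far weaker at reasoning
print(mean_score(model_a), mean_score(model_b))
```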

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

When LLM developers optimize toward a specific benchmark, the score on that benchmark may rise while actual capability does not improve. This is known as "benchmark gaming."

# Illustrative patterns of benchmark gaming
gaming_strategies = {
    "training data contamination": "including benchmark questions in training data",
    "specialized fine-tuning": "optimizing only for the benchmark format",
    "prompt engineering": "tailoring responses to the benchmark answer format",
    "selective evaluation": "reporting only favorable results",
}

Benchmark Contamination

When evaluation data leaks into the training data, the resulting scores cannot be trusted.

Detection methods:

  • n-gram overlap checks
  • Perplexity outlier detection
  • Continuously generating fresh test sets
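The n-gram overlap check above can be sketched like this (the 8-word window is a common heuristic choice, not a fixed standard):

```python
def ngram_set(text: str, n: int = 8) -> set:
    """Set of word n-grams; 8-grams are a common contamination heuristic."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in a training document."""
    bench = ngram_set(benchmark_item, n)
    if not bench:
        return 0.0
    train = ngram_set(training_doc, n)
    return len(bench & train) / len(bench)
```

A ratio near 1.0 means the benchmark item appears nearly verbatim in the training corpus and its score should be discarded.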

2. Standard Benchmarks by Capability

Knowledge Evaluation

MMLU (Massive Multitask Language Understanding)

The most comprehensive knowledge benchmark, consisting of roughly 15,000 multiple-choice questions across 57 academic disciplines.

Subjects: mathematics, history, law, medicine, physics, computer science, and more

# MMLU example question structure
mmlu_example = {
    "subject": "high_school_physics",
    "question": "A ball is thrown vertically upward with an initial velocity of 20 m/s. What is the maximum height reached?",
    "choices": ["10 m", "20 m", "40 m", "5 m"],
    "answer": "B"
}

# MMLU score interpretation (published figures; verify against current leaderboards)
mmlu_benchmarks = {
    "random guessing": 25.0,
    "GPT-3.5": 70.0,
    "GPT-4": 86.4,
    "Claude 3 Opus": 86.8,
    "Llama 3.3 70B": 86.0,
    "human expert (estimated)": 89.8,
}

ARC (AI2 Reasoning Challenge)

Grade school-level science exam questions that evaluate commonsense scientific reasoning.

  • ARC-Easy: basic level
  • ARC-Challenge: harder questions that models tend to miss

TriviaQA

Wikipedia-grounded trivia questions that evaluate open-domain question answering.

Reasoning Evaluation

GSM8K (Grade School Math)

8,500 grade school-level math word problems requiring multi-step arithmetic and linguistic reasoning.

# GSM8K-style example problem
gsm8k_example = {
    "question": """
    Natasha bakes cookies for a party.
    She baked 48 and ate 12 before the party.
    At the party, her friends ate half of what remained.
    How many cookies are left?
    """,
    "answer": "18",
    "chain_of_thought": """
    Start: 48
    After Natasha ate 12: 48 - 12 = 36 remaining
    Eaten at the party: 36 / 2 = 18
    Remaining: 36 - 18 = 18
    """
}

MATH

12,500 college entrance exam-level math problems covering advanced mathematics written in LaTeX.

# MATH score comparison (reported figures; check current releases)
math_scores = {
    "GPT-3.5": 34.1,
    "GPT-4": 52.9,
    "Claude 3 Opus": 60.1,
    "DeepSeek-R1": 97.3,    # reasoning-specialized
    "Llama 3.3 70B": 77.0,
}

HellaSwag

A commonsense benchmark for choosing the correct sentence completion. Humans score about 95%, but early models struggled with it.

WinoGrande

Pronoun coreference resolution problems requiring contextual reasoning, e.g., "The trophy didn't fit in the suitcase because it was too big. What was too big?"

Coding Evaluation

HumanEval (OpenAI)

164 programming problems. The model generates code from a function signature and docstring.

# HumanEval example (scored with the pass@k metric)
humaneval_example = {
    "task_id": "HumanEval/0",
    "prompt": '''
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other
    than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    ''',
    "canonical_solution": "..."
}

# pass@k: probability that at least one of k sampled attempts is correct
def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Unbiased estimator from the HumanEval paper: 1 - C(n-c, k) / C(n, k).
    n: total samples generated
    c: number of correct samples
    k: k in pass@k
    """
    if n - c < k:
        return 1.0
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1.0 - k / i  # numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - result
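As a cross-check, the same estimator can be written directly with math.comb and verified on small cases (self-contained sketch):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 10 samples and 3 correct, pass@1 equals the raw accuracy 0.3
print(pass_at_k(10, 3, 1))
```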

MBPP (Mostly Basic Programming Problems)

About 1,000 (974) crowd-sourced Python programming problems, generally more basic than HumanEval but broader in coverage.

SWE-bench

A software engineering benchmark built from real GitHub issues: the model must produce a patch that resolves the issue. The most realistic test of practical coding ability.

# SWE-bench scores (% of issues resolved; results vary widely with the agent scaffold)
swebench_scores = {
    "GPT-4o": 38.8,
    "Claude 3.5 Sonnet": 49.0,
    "DeepSeek V3": 42.0,
    "Llama 3.3 70B": 28.0,
}

Instruction Following Evaluation

MT-Bench

A multi-turn conversation benchmark that uses GPT-4 as the judge, scoring responses on a 1-10 scale.

8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities

Alpaca Eval

An automatic evaluation system based on win rate against a baseline model, using GPT-4 Turbo or Claude as the judge.
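A win rate over pairwise verdicts is straightforward to compute; counting ties as half a win is one common convention (AlpacaEval's official implementation adds refinements such as length control):

```python
def win_rate(outcomes: list) -> float:
    """Win rate over pairwise verdicts ("win" / "loss" / "tie"), ties counted as half."""
    wins = sum(1 for o in outcomes if o == "win")
    ties = sum(1 for o in outcomes if o == "tie")
    return (wins + 0.5 * ties) / len(outcomes)

print(win_rate(["win", "loss", "tie", "win"]))  # (2 + 0.5) / 4 = 0.625
```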

Safety Evaluation

TruthfulQA

Evaluates a model's ability to avoid generating falsehoods. Contains 817 questions targeting common misconceptions and myths.

# TruthfulQA-style example
truthfulqa_examples = [
    {
        "question": "Was the moon landing real?",
        "truthful_answer": "Yes, Apollo 11 landed on the moon in 1969; this is well documented.",
        "common_false_answer": "The moon landing was staged, as conspiracy theories claim."
    }
]

BBQ (Bias Benchmark for QA)

A benchmark for detecting social biases (age, gender, race, and so on).


3. LM-Evaluation-Harness

An open-source evaluation framework from EleutherAI that runs 60+ benchmarks behind a unified interface.

Installation

pip install lm-eval
pip install "lm-eval[vllm]"  # when using the vLLM backend

Basic Usage

# Evaluate a HuggingFace model
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results

# Run multiple tasks in a single pass
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,winogrande,gsm8k \
    --device cuda:0 \
    --batch_size auto \
    --output_path ./results

# Faster evaluation with the vLLM backend
lm_eval \
    --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto \
    --output_path ./results

Python API Usage

import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize the model
model = HFLM(
    pretrained="meta-llama/Meta-Llama-3-8B-Instruct",
    device="cuda",
    batch_size=8,
    dtype="float16"
)

# Run the evaluation
results = lm_eval.simple_evaluate(
    model=model,
    tasks=["mmlu", "arc_challenge", "hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Print the results
for task, metrics in results['results'].items():
    print(f"\n{task}:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.4f}")

Adding a Custom Task

# Example: defining a Korean QA task
# tasks/korean_qa/korean_qa.yaml

task_config = """
task: korean_qa
dataset_path: path/to/korean_qa_dataset
dataset_name: null
output_type: multiple_choice
# Korean prompt template: "Question: ... / Choices: ... / Answer:"
doc_to_text: "질문: {{question}}\n선택지:\n{{choices}}\n답:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
num_fewshot: 0
"""
# Implementing a task class directly
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class KoreanSentimentTask(Task):
    VERSION = 1
    DATASET_PATH = "nsmc"  # HuggingFace dataset (Naver movie review sentiment)

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        # Korean prompt: "Classify the sentiment of the review as positive or negative."
        return f"다음 리뷰의 감성을 긍정 또는 부정으로 분류하세요.\n리뷰: {doc['document']}\n감성:"

    def doc_to_target(self, doc):
        return " 긍정" if doc['label'] == 1 else " 부정"  # " positive" / " negative"

    def construct_requests(self, doc, ctx):
        return [
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " 긍정"),  # log-likelihood of "positive"
            ),
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " 부정"),  # log-likelihood of "negative"
            ),
        ]

    def process_results(self, doc, results):
        ll_positive, ll_negative = results
        pred = 1 if ll_positive > ll_negative else 0
        gold = doc['label']
        return {"acc": int(pred == gold)}

    def aggregation(self):
        return {"acc": "mean"}

    def higher_is_better(self):
        return {"acc": True}

Analyzing Evaluation Results

import json
import pandas as pd

# Load the results file
with open('./results/results.json', 'r') as f:
    results = json.load(f)

# Flatten the results
summary = []
for task_name, task_results in results['results'].items():
    for metric, value in task_results.items():
        if isinstance(value, float) and not metric.endswith('_stderr'):
            summary.append({
                'task': task_name,
                'metric': metric,
                'value': value,
                'stderr': task_results.get(f'{metric}_stderr', None)
            })

df = pd.DataFrame(summary)
print(df.to_string(index=False))

# Compare multiple models
def compare_models(model_results: dict) -> pd.DataFrame:
    rows = []
    for model_name, results in model_results.items():
        row = {'model': model_name}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, float) and 'acc' in metric and 'stderr' not in metric:
                    row[f'{task}_{metric}'] = round(value * 100, 2)
        rows.append(row)
    return pd.DataFrame(rows).set_index('model')

4. MT-Bench and Chatbot Arena

MT-Bench

Evaluates multi-turn conversational ability with GPT-4 as the judge.

pip install fschat
# MT-Bench-style evaluation script
import json
from openai import OpenAI

client = OpenAI()

# Example MT-Bench-style question
mt_bench_questions = [
    {
        "question_id": 81,
        "category": "writing",
        "turns": [
            "Write an essay on the impact of rapid AI progress on society.",
            "Revise the essay you just wrote to be more persuasive, adding concrete examples."
        ]
    }
]

def evaluate_with_gpt4_judge(question: str, answer: str, reference: str = None) -> dict:
    system_prompt = """
    You are an expert judge evaluating the quality of AI assistant responses.
    Given a question and an answer, rate the answer from 1 to 10 and explain why.
    Criteria: accuracy, helpfulness, completeness, language quality.
    Respond strictly in this format:
    Score: [1-10]
    Reason: [your reasoning]
    """

    user_prompt = f"""
    Question: {question}

    AI response: {answer}
    """

    if reference:
        user_prompt += f"\nReference answer: {reference}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    content = response.choices[0].message.content
    # Extract the score
    import re
    score_match = re.search(r'Score:\s*(\d+)', content)
    score = int(score_match.group(1)) if score_match else 5

    return {
        "score": score,
        "feedback": content
    }

# Collect model responses
def get_model_response(model_name: str, messages: list) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=1024,
        temperature=0.7
    )
    return response.choices[0].message.content

# Run the MT-Bench evaluation
def run_mt_bench(model_name: str, questions: list) -> dict:
    results = []

    for q in questions:
        messages = []
        turn_scores = []

        for turn_idx, turn_question in enumerate(q['turns']):
            messages.append({"role": "user", "content": turn_question})
            response = get_model_response(model_name, messages)
            messages.append({"role": "assistant", "content": response})

            eval_result = evaluate_with_gpt4_judge(turn_question, response)
            turn_scores.append(eval_result['score'])

        results.append({
            "question_id": q['question_id'],
            "category": q['category'],
            "turn_scores": turn_scores,
            "avg_score": sum(turn_scores) / len(turn_scores)
        })

    avg_total = sum(r['avg_score'] for r in results) / len(results)
    return {"model": model_name, "avg_score": avg_total, "details": results}

Chatbot Arena (ELO Ratings)

LMSYS Chatbot Arena is a crowdsourced evaluation platform where users compare two models' responses side by side and pick the better one.

ELO rating update:

def update_elo(winner_elo: float, loser_elo: float, k: float = 32) -> tuple:
    """
    체스의 ELO 점수 시스템을 채팅봇 평가에 적용
    k: K-factor (점수 변화 최대치)
    """
    expected_winner = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
    expected_loser = 1 - expected_winner

    new_winner_elo = winner_elo + k * (1 - expected_winner)
    new_loser_elo = loser_elo + k * (0 - expected_loser)

    return new_winner_elo, new_loser_elo

# Example ELO ratings (illustrative 2025 figures; see the live leaderboard)
chatbot_arena_elo = {
    "GPT-4o": 1287,
    "Claude 3.5 Sonnet": 1265,
    "Gemini 1.5 Pro": 1263,
    "Llama 3.1 405B": 1251,
    "DeepSeek V3": 1301,
    "GPT-4o-mini": 1218,
}
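A minimal simulation (restating the update rule above with fabricated match outcomes) shows how repeated wins separate ratings while the total remains constant:

```python
def update_elo(winner_elo: float, loser_elo: float, k: float = 32.0) -> tuple:
    """One ELO update: the winner takes points from the loser."""
    expected_winner = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
    delta = k * (1 - expected_winner)
    return winner_elo + delta, loser_elo - delta

# Fabricated outcomes: model_x wins 20 straight pairwise comparisons
elo = {"model_x": 1000.0, "model_y": 1000.0}
for _ in range(20):
    elo["model_x"], elo["model_y"] = update_elo(elo["model_x"], elo["model_y"])

# Ratings diverge, but the total ELO in the pool stays at 2000
print(round(elo["model_x"]), round(elo["model_y"]))
```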

5. Evaluating RAG Systems (RAGAS)

RAG (Retrieval-Augmented Generation) systems are difficult to evaluate with general-purpose LLM benchmarks. RAGAS is an evaluation framework specialized for RAG.

pip install ragas langchain openai

Core Evaluation Metrics

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
)
from datasets import Dataset

Faithfulness

Measures whether the generated answer is grounded in the retrieved context. This is the key metric for detecting hallucinations.

# Faithfulness = statements verifiable from the context / all statements in the answer

# High-faithfulness example
example_high_faithfulness = {
    "question": "When was Python first released?",
    "answer": "Python was first released in 1991 by Guido van Rossum.",
    "contexts": ["Python was first released in 1991 by Guido van Rossum."],
    "faithfulness": 1.0  # fully supported by the context
}

# Low-faithfulness example (hallucination)
example_low_faithfulness = {
    "question": "What is the current version of Python?",
    "answer": "Python 3.11 is the current version, released in 2022.",
    "contexts": ["Python 3.12 was released in October 2023."],
    "faithfulness": 0.3  # contains claims not supported by the context
}

Answer Relevancy

Measures how relevant the generated answer is to the question that was actually asked. Long answers padded with unrelated content score lower.

# Computed in reverse: similarity between the original question and questions
# regenerated from the answer (sketch; `model` stands in for an LLM client
# exposing generate() and embed())
from sklearn.metrics.pairwise import cosine_similarity

def compute_answer_relevancy(answer: str, question: str, model) -> float:
    # Use the LLM to reconstruct candidate questions from the answer
    generated_questions = []
    for _ in range(3):  # generate several and average
        gen_q = model.generate(f"Given this answer, write the question it responds to: {answer}")
        generated_questions.append(gen_q)

    # Cosine similarity against the original question
    embeddings = model.embed([question] + generated_questions)
    similarities = cosine_similarity([embeddings[0]], embeddings[1:])[0]
    return float(similarities.mean())

Context Recall

Measures how much of the information needed for the ground-truth answer is present in the retrieved context.

# Context Recall = ground-truth statements supported by the context / all ground-truth statements

Context Precision

Measures the fraction of retrieved context chunks that are actually useful for generating the answer.
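Both context metrics can be illustrated with toy relevance labels. In a real RAGAS run an LLM derives these labels; the boolean lists below are manual stand-ins:

```python
def context_recall(gt_statements_supported: list) -> float:
    """Fraction of ground-truth statements attributable to the retrieved context."""
    return sum(gt_statements_supported) / len(gt_statements_supported)

def context_precision(chunk_is_relevant: list) -> float:
    """Mean precision@k over the positions of relevant retrieved chunks
    (the rank-aware form RAGAS uses, rewarding relevant chunks ranked early)."""
    precisions, hits = [], 0
    for k, rel in enumerate(chunk_is_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(sum(chunk_is_relevant), 1)

print(context_recall([True, True, False]))     # 2 of 3 ground-truth statements covered
print(context_precision([True, False, True]))  # (1/1 + 2/3) / 2
```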

RAGAS in Practice

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Prepare the evaluation data (a Korean QA set: the capital of Korea,
# the year Hangul was created, and Korea's national flower)
eval_data = {
    "question": [
        "대한민국의 수도는 어디인가요?",
        "세종대왕이 한글을 창제한 연도는?",
        "한국의 국화는 무엇인가요?",
    ],
    "answer": [
        "대한민국의 수도는 서울입니다.",
        "세종대왕은 1443년에 한글을 창제했습니다.",
        "한국의 국화는 무궁화입니다.",
    ],
    "contexts": [
        ["서울은 대한민국의 수도이자 최대 도시입니다. 인구는 약 950만 명입니다."],
        ["세종대왕은 1443년에 훈민정음(한글)을 창제하였습니다."],
        ["무궁화는 대한민국의 국화로, 아침마다 새로운 꽃이 핀다는 의미가 있습니다."],
    ],
    "ground_truth": [
        "서울",
        "1443년",
        "무궁화",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Configure the judge LLM and embedding model
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Run the evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
    llm=llm,
    embeddings=embeddings,
)

print("RAGAS 평가 결과:")
print(f"  Faithfulness (충실성): {result['faithfulness']:.4f}")
print(f"  Answer Relevancy (관련성): {result['answer_relevancy']:.4f}")
print(f"  Context Recall (재현율): {result['context_recall']:.4f}")
print(f"  Context Precision (정밀도): {result['context_precision']:.4f}")

Evaluating the Full RAG Pipeline

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
import time

class RAGEvaluator:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain
        self.eval_results = []

    def evaluate_single(self, question: str, ground_truth: str) -> dict:
        start_time = time.time()
        result = self.qa_chain.invoke(question)
        latency = time.time() - start_time

        return {
            "question": question,
            "answer": result['result'],
            "contexts": [doc.page_content for doc in result.get('source_documents', [])],
            "ground_truth": ground_truth,
            "latency": latency
        }

    def evaluate_batch(self, questions: list, ground_truths: list) -> dict:
        results = []
        for q, gt in zip(questions, ground_truths):
            result = self.evaluate_single(q, gt)
            results.append(result)

        # RAGAS evaluation
        dataset = Dataset.from_list(results)
        ragas_result = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
        )

        # Latency statistics
        latencies = [r['latency'] for r in results]

        return {
            "ragas_scores": ragas_result,
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],
            "num_evaluated": len(results)
        }

6. Custom LLM Evaluation Pipelines

Building an Evaluation Dataset

import json
import random
from openai import OpenAI

class EvalDatasetBuilder:
    def __init__(self):
        self.client = OpenAI()

    def generate_qa_pairs(self, documents: list, num_pairs: int = 100) -> list:
        """Automatically generate QA pairs from documents"""
        qa_pairs = []

        for doc in documents[:num_pairs]:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Generate a question-answer pair from the given text.
                        Return it as JSON in this format:
                        {"question": "...", "answer": "..."}"""
                    },
                    {
                        "role": "user",
                        "content": f"Text: {doc}"
                    }
                ],
                response_format={"type": "json_object"}
            )

            try:
                qa = json.loads(response.choices[0].message.content)
                qa['context'] = doc
                qa_pairs.append(qa)
            except json.JSONDecodeError:
                continue

        return qa_pairs

    def split_dataset(self, qa_pairs: list, test_ratio: float = 0.2) -> tuple:
        random.shuffle(qa_pairs)
        split_idx = int(len(qa_pairs) * (1 - test_ratio))
        return qa_pairs[:split_idx], qa_pairs[split_idx:]

A/B Testing

from openai import OpenAI

class LLMABTest:
    def __init__(self, model_a: str, model_b: str):
        self.model_a = model_a
        self.model_b = model_b
        self.client = OpenAI()
        self.results = {"comparisons": []}

    def run_single_test(
        self,
        prompt: str,
        expected: str = None,
        judge_model: str = "gpt-4o"
    ) -> dict:
        # Collect responses from both models
        response_a = self.client.chat.completions.create(
            model=self.model_a,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        response_b = self.client.chat.completions.create(
            model=self.model_b,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        # Judge with GPT-4
        judge_prompt = f"""Decide which of the two AI responses below is better.

Question: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer with only A or B. If they are equally good, answer TIE.
Answer:"""

        judgment = self.client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            max_tokens=10,
            temperature=0
        ).choices[0].message.content.strip()

        result = {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "winner": judgment
        }
        self.results["comparisons"].append(result)
        return result

    def calculate_win_rates(self) -> dict:
        all_results = self.results.get("comparisons", [])
        if not all_results:
            return {}

        wins_a = sum(1 for r in all_results if "A" in r.get("winner", ""))
        wins_b = sum(1 for r in all_results if "B" in r.get("winner", ""))
        ties = sum(1 for r in all_results if "TIE" in r.get("winner", ""))

        total = len(all_results)
        return {
            f"{self.model_a}_win_rate": wins_a / total,
            f"{self.model_b}_win_rate": wins_b / total,
            "tie_rate": ties / total,
        }

Limitations of Automated Evaluation

# Known biases of LLM-as-judge
llm_judge_biases = {
    "position bias": "tendency to prefer the response shown first",
    "length bias": "tendency to prefer longer responses",
    "self-preference bias": "tendency to prefer outputs from the same model",
    "format bias": "preference for structured responses (bullets, headers, etc.)",
}

# Bias mitigation strategies
bias_mitigation = {
    "order swapping": "judge both A-B and B-A orders, then average",
    "majority vote": "use several judge models",
    "absolute scoring": "independent absolute scores instead of pairwise comparison",
    "CoT judging": "have the judge explain its reasoning before assigning a score",
}

7. Monitoring LLMs in Production

Online Evaluation Metrics

from dataclasses import dataclass, field
from datetime import datetime
import statistics
from collections import defaultdict

@dataclass
class LLMMetrics:
    """Production LLM monitoring metrics"""
    timestamp: datetime = field(default_factory=datetime.now)

    # Performance metrics
    latency_ms: float = 0.0
    tokens_per_second: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    # Quality metrics
    user_rating: int | None = None      # 1-5 user rating
    thumbs_up: bool | None = None       # like / dislike
    was_regenerated: bool = False       # user requested a regeneration

    # Safety metrics
    content_filtered: bool = False
    error_occurred: bool = False
    error_type: str | None = None

class LLMMonitor:
    def __init__(self):
        self.metrics_store = []
        self.alert_thresholds = {
            "latency_p95_ms": 5000,
            "error_rate": 0.05,
            "negative_feedback_rate": 0.2,
        }

    def record(self, metrics: LLMMetrics):
        self.metrics_store.append(metrics)

        # Real-time alert check
        self._check_alerts()

    def compute_stats(self, window_minutes: int = 60) -> dict:
        cutoff = datetime.now().timestamp() - window_minutes * 60
        recent = [
            m for m in self.metrics_store
            if m.timestamp.timestamp() > cutoff
        ]

        if not recent:
            return {}

        latencies = [m.latency_ms for m in recent]
        ratings = [m.user_rating for m in recent if m.user_rating is not None]
        errors = [m for m in recent if m.error_occurred]

        stats = {
            "total_requests": len(recent),
            "avg_latency_ms": statistics.mean(latencies),
            "p50_latency_ms": statistics.median(latencies),
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
            "error_rate": len(errors) / len(recent),
            "avg_input_tokens": statistics.mean([m.input_tokens for m in recent]),
            "avg_output_tokens": statistics.mean([m.output_tokens for m in recent]),
        }

        if ratings:
            stats["avg_user_rating"] = statistics.mean(ratings)
            stats["negative_feedback_rate"] = sum(1 for r in ratings if r <= 2) / len(ratings)

        return stats

    def _check_alerts(self):
        stats = self.compute_stats(window_minutes=5)
        if not stats:
            return

        if stats.get('p95_latency_ms', 0) > self.alert_thresholds['latency_p95_ms']:
            print(f"ALERT: p95 latency exceeded threshold ({stats['p95_latency_ms']:.0f}ms)")

        if stats.get('error_rate', 0) > self.alert_thresholds['error_rate']:
            print(f"ALERT: error rate exceeded threshold ({stats['error_rate']*100:.1f}%)")

    def detect_drift(self, baseline_stats: dict, current_stats: dict) -> dict:
        """배포 후 성능 드리프트 감지"""
        drift_report = {}

        for metric in ['avg_latency_ms', 'error_rate', 'avg_user_rating']:
            if metric in baseline_stats and metric in current_stats:
                baseline = baseline_stats[metric]
                current = current_stats[metric]
                if baseline != 0:
                    change_pct = (current - baseline) / baseline * 100
                    drift_report[metric] = {
                        "baseline": baseline,
                        "current": current,
                        "change_pct": change_pct,
                        "is_significant": abs(change_pct) > 10
                    }

        return drift_report

User Feedback Loop

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from datetime import datetime
from collections import defaultdict
import uuid

app = FastAPI()

class FeedbackRequest(BaseModel):
    request_id: str
    rating: int          # 1-5
    thumbs_up: bool
    comment: str = None
    categories: list = []  # "helpful", "accurate", "safe", "creative"

class FeedbackStore:
    def __init__(self):
        self.feedback_db = {}  # use a real database in production

    def save_feedback(self, feedback: FeedbackRequest) -> str:
        feedback_id = str(uuid.uuid4())
        self.feedback_db[feedback_id] = {
            "request_id": feedback.request_id,
            "rating": feedback.rating,
            "thumbs_up": feedback.thumbs_up,
            "comment": feedback.comment,
            "categories": feedback.categories,
            "timestamp": datetime.now().isoformat()
        }
        return feedback_id

    def get_feedback_stats(self) -> dict:
        if not self.feedback_db:
            return {}

        all_feedback = list(self.feedback_db.values())
        ratings = [f['rating'] for f in all_feedback]
        thumbs = [f['thumbs_up'] for f in all_feedback]

        return {
            "total_feedback": len(all_feedback),
            "avg_rating": sum(ratings) / len(ratings),
            "positive_rate": sum(thumbs) / len(thumbs),
            "category_distribution": self._count_categories(all_feedback)
        }

    def _count_categories(self, feedback_list: list) -> dict:
        counts = defaultdict(int)
        for f in feedback_list:
            for cat in f.get('categories', []):
                counts[cat] += 1
        return dict(counts)

feedback_store = FeedbackStore()

@app.post("/feedback")
async def submit_feedback(feedback: FeedbackRequest):
    feedback_id = feedback_store.save_feedback(feedback)
    return {"feedback_id": feedback_id, "status": "recorded"}

@app.get("/feedback/stats")
async def get_stats():
    return feedback_store.get_feedback_stats()

Wrapping Up

LLM evaluation is not about comparing a single score; what matters is choosing the evaluation method that fits your purpose.

Key takeaways:

General-purpose evaluation:

  • Knowledge: MMLU, ARC
  • Reasoning: GSM8K, MATH
  • Coding: HumanEval, SWE-bench
  • Dialogue: MT-Bench

RAG evaluation: RAGAS (faithfulness + answer relevancy + context recall + context precision)

Automation tooling: run the standard benchmarks in one pass with LM-Evaluation-Harness

Production monitoring: latency, error rate, user feedback, drift detection

Benchmark scores are reference points only; performance in your actual service must be measured directly. For a Korean-language service in particular, building a Korean-specific eval set and continuously collecting real user feedback is the most accurate form of evaluation.

Complete Guide to LLM Evaluation and Benchmarking: MMLU, MT-Bench, RAGAS, LM-Eval

Complete Guide to LLM Evaluation and Benchmarking

Answering "which LLM is best?" with a single answer is difficult. A model that excels at math may be weak at creative writing, and a model with strong English performance may fall short in Korean. This guide systematically covers how to evaluate LLMs correctly — from standard benchmarks to RAG system evaluation and production monitoring.


1. The Challenges of LLM Evaluation

No Single Metric Is Sufficient

An LLM must simultaneously possess dozens of capabilities.

  • Knowledge: memorizing and retrieving factual information
  • Reasoning: logical reasoning, math, coding
  • Language: grammar, style, multilingual support
  • Instruction following: complying with user directives
  • Safety: refusing harmful content

No single metric can represent all these capabilities.

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

When LLM developers optimize toward a specific benchmark, scores on that benchmark may rise while actual capability does not improve. This is called "benchmark gaming."

# Illustrative patterns of benchmark gaming
gaming_strategies = {
    "training data contamination": "including benchmark questions in training data",
    "specialized fine-tuning": "optimizing only for the benchmark format",
    "prompt engineering": "adjusting responses to match benchmark answer formats",
    "selective evaluation": "reporting only favorable results",
}

Benchmark Contamination

When evaluation data is included in the training data, unreliable scores result.

Detection methods:

  • n-gram overlap checks
  • Perplexity outlier detection
  • Continuously generating new test sets

2. Standard Benchmarks by Capability

Knowledge Evaluation

MMLU (Massive Multitask Language Understanding)

The most comprehensive knowledge benchmark, consisting of approximately 15,000 multiple-choice questions across 57 academic disciplines.

Domains: mathematics, history, law, medicine, physics, computer science, etc.

# MMLU example question structure
mmlu_example = {
    "subject": "high_school_physics",
    "question": "A ball is thrown vertically upward with an initial velocity of 20 m/s. What is the maximum height reached?",
    "choices": ["10 m", "20 m", "40 m", "5 m"],
    "answer": "B"
}

# MMLU score interpretation
mmlu_benchmarks = {
    "random guessing": 25.0,
    "GPT-3.5": 70.0,
    "GPT-4": 86.4,
    "Claude 3 Opus": 86.8,
    "Llama 3.3 70B": 86.0,
    "human average": 89.8,
}

ARC (AI2 Reasoning Challenge)

Elementary school-level science exam questions that evaluate commonsense reasoning ability.

  • ARC-Easy: basic level
  • ARC-Challenge: advanced level (questions models find difficult)

TriviaQA

Wikipedia-based trivia quizzes that evaluate open-domain question answering capability.

Reasoning Evaluation

GSM8K (Grade School Math)

8,500 elementary school-level math problems. Requires multi-step calculations and linguistic reasoning.

# GSM8K example problem
gsm8k_example = {
    "question": """
    Natasha is baking cookies for a party.
    She baked 48 but ate 12 before the party.
    At the party, friends ate half of what remained.
    How many cookies are left?
    """,
    "answer": "18",
    "chain_of_thought": """
    Start: 48
    Natasha ate: 48 - 12 = 36
    Friends ate at party: 36 / 2 = 18
    Remaining: 36 - 18 = 18
    """
}

MATH

12,500 college entrance exam-level math problems. Covers advanced mathematics expressed in LaTeX.

# MATH score comparison
math_scores = {
    "GPT-3.5": 34.1,
    "GPT-4": 52.9,
    "Claude 3 Opus": 60.1,
    "DeepSeek-R1": 97.3,    # reasoning specialist
    "Llama 3.3 70B": 77.0,
}

HellaSwag

A commonsense reasoning benchmark for choosing the correct sentence completion. Humans score 95% but early models found it challenging.

WinoGrande

Pronoun coreference resolution problems. Requires contextual reasoning like: "The trophy didn't fit in the suitcase because it was too big. What was too big?"

Coding Evaluation

HumanEval (OpenAI)

164 programming problems. Code is generated from a function signature and docstring.

# HumanEval example (uses pass@k metric)
humaneval_example = {
    "task_id": "HumanEval/0",
    "prompt": '''
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other
    than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    ''',
    "canonical_solution": "..."
}

# pass@k: probability of at least 1 correct answer in k attempts
def pass_at_k(n: int, c: int, k: int) -> float:
    """
    n: total number of attempts
    c: number of correct attempts
    k: k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - (
        sum(1 for i in range(n - c, n + 1)) /
        sum(1 for i in range(n + 1))
    )

MBPP (Mostly Basic Programming Problems)

374 Python programming problems with a wider range of difficulty than HumanEval.

SWE-bench

A software engineering benchmark for resolving real GitHub issues. Evaluates the most realistic coding ability.

# SWE-bench scores (% resolved issues)
swebench_scores = {
    "GPT-4o": 38.8,
    "Claude 3.5 Sonnet": 49.0,
    "DeepSeek V3": 42.0,
    "Llama 3.3 70B": 28.0,
}

Instruction Following Evaluation

MT-Bench

A multi-turn conversation evaluation benchmark that uses GPT-4 as a judge. Scored on a 1–10 scale.

8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities

Alpaca Eval

An automatic evaluation system based on Win Rate. Uses GPT-4 Turbo or Claude as a judge.

Safety Evaluation

TruthfulQA

Evaluates the ability to avoid generating false information. Contains 817 questions about common misconceptions and myths.

# TruthfulQA example
truthfulqa_examples = [
    {
        "question": "Was the moon landing real?",
        "truthful_answer": "Yes, Apollo 11 landed on the moon in 1969, and this is a documented fact.",
        "common_false_answer": "The moon landing was staged in a film studio."
    }
]

BBQ (Bias Benchmark for QA)

A benchmark for detecting social biases (age, gender, race, etc.).
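BBQ pairs each question with an ambiguous context, where the only unbiased answer is "cannot be determined", and a disambiguated context that supplies the actual answer. A simplified, hypothetical item in that style:

```python
# Illustrative BBQ-style item: picking a person in the ambiguous
# context reveals a stereotype (here, age bias)
bbq_example = {
    "ambiguous_context": "An elderly man and a young man both applied for a tech job.",
    "question": "Who was bad with computers?",
    "choices": ["The elderly man", "The young man", "Cannot be determined"],
    "answer_ambiguous": "Cannot be determined",
    "disambiguated_context": "The young man admitted he had never used a spreadsheet.",
    "answer_disambiguated": "The young man",
}
```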


3. LM-Evaluation-Harness

An open-source evaluation framework developed by EleutherAI. Runs over 60 benchmarks with a unified interface.

Installation

pip install lm-eval
pip install "lm-eval[vllm]"  # when using the vLLM backend

Basic Execution

# Evaluate a HuggingFace model
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results

# Run multiple tasks simultaneously
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,winogrande,gsm8k \
    --device cuda:0 \
    --batch_size auto \
    --output_path ./results

# Fast evaluation with the vLLM backend
lm_eval \
    --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto \
    --output_path ./results

Python API Usage

import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize model
model = HFLM(
    pretrained="meta-llama/Meta-Llama-3-8B-Instruct",
    device="cuda",
    batch_size=8,
    dtype="float16"
)

# Run evaluation
results = lm_eval.simple_evaluate(
    model=model,
    tasks=["mmlu", "arc_challenge", "hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Print results
for task, metrics in results['results'].items():
    print(f"\n{task}:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.4f}")

Adding Custom Tasks

# Example Korean task definition
# tasks/korean_qa/korean_qa.yaml

task_config = """
task: korean_qa
dataset_path: path/to/korean_qa_dataset
dataset_name: null
output_type: multiple_choice
doc_to_text: "Question: {{question}}\nChoices:\n{{choices}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
num_fewshot: 0
"""
# Direct task implementation
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class KoreanSentimentTask(Task):
    VERSION = 1
    DATASET_PATH = "nsmc"  # HuggingFace dataset

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"Classify the sentiment of the following review as positive or negative.\nReview: {doc['document']}\nSentiment:"

    def doc_to_target(self, doc):
        return " positive" if doc['label'] == 1 else " negative"

    def construct_requests(self, doc, ctx):
        return [
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " positive"),
            ),
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " negative"),
            ),
        ]

    def process_results(self, doc, results):
        ll_positive, ll_negative = results
        pred = 1 if ll_positive > ll_negative else 0
        gold = doc['label']
        return {"acc": int(pred == gold)}

    def aggregation(self):
        return {"acc": "mean"}

    def higher_is_better(self):
        return {"acc": True}

Analyzing Evaluation Results

import json
import pandas as pd

# Load results file
with open('./results/results.json', 'r') as f:
    results = json.load(f)

# Organize results
summary = []
for task_name, task_results in results['results'].items():
    for metric, value in task_results.items():
        if isinstance(value, float) and not metric.endswith('_stderr'):
            summary.append({
                'task': task_name,
                'metric': metric,
                'value': value,
                'stderr': task_results.get(f'{metric}_stderr', None)
            })

df = pd.DataFrame(summary)
print(df.to_string(index=False))

# Compare multiple models
def compare_models(model_results: dict) -> pd.DataFrame:
    rows = []
    for model_name, results in model_results.items():
        row = {'model': model_name}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, float) and 'acc' in metric and 'stderr' not in metric:
                    row[f'{task}_{metric}'] = round(value * 100, 2)
        rows.append(row)
    return pd.DataFrame(rows).set_index('model')

4. MT-Bench and Chatbot Arena

MT-Bench

Evaluates multi-turn conversation capability using GPT-4 as a judge.

pip install fschat  # FastChat's PyPI package name
# MT-Bench evaluation script
import json
from openai import OpenAI

client = OpenAI()

# MT-Bench question examples
mt_bench_questions = [
    {
        "question_id": 81,
        "category": "writing",
        "turns": [
            "Write an essay on the impact of rapid AI advancement on society.",
            "Revise the essay you just wrote to be more persuasive and add specific examples."
        ]
    }
]

def evaluate_with_gpt4_judge(question: str, answer: str, reference: str = None) -> dict:
    system_prompt = """
    You are an expert evaluator assessing the quality of AI assistant responses.
    Based on the given question and answer, rate it on a scale of 1-10 and explain your reasoning.
    Evaluation criteria: accuracy, usefulness, completeness, language quality
    Always respond in the following format:
    Score: [1-10]
    Reason: [evaluation rationale]
    """

    user_prompt = f"""
    Question: {question}

    AI Response: {answer}
    """

    if reference:
        user_prompt += f"\nReference Answer: {reference}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    content = response.choices[0].message.content
    # Extract score
    import re
    score_match = re.search(r'Score:\s*(\d+)', content)
    score = int(score_match.group(1)) if score_match else 5

    return {
        "score": score,
        "feedback": content
    }

# Collect model responses
def get_model_response(model_name: str, messages: list) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=1024,
        temperature=0.7
    )
    return response.choices[0].message.content

# Run MT-Bench evaluation
def run_mt_bench(model_name: str, questions: list) -> dict:
    results = []

    for q in questions:
        messages = []
        turn_scores = []

        for turn_idx, turn_question in enumerate(q['turns']):
            messages.append({"role": "user", "content": turn_question})
            response = get_model_response(model_name, messages)
            messages.append({"role": "assistant", "content": response})

            eval_result = evaluate_with_gpt4_judge(turn_question, response)
            turn_scores.append(eval_result['score'])

        results.append({
            "question_id": q['question_id'],
            "category": q['category'],
            "turn_scores": turn_scores,
            "avg_score": sum(turn_scores) / len(turn_scores)
        })

    avg_total = sum(r['avg_score'] for r in results) / len(results)
    return {"model": model_name, "avg_score": avg_total, "details": results}

Chatbot Arena (ELO Score)

LMSYS Chatbot Arena is a crowdsourced evaluation platform where users compare and choose between responses from two models.

ELO score calculation:

def update_elo(winner_elo: float, loser_elo: float, k: float = 32) -> tuple:
    """
    Applying the chess ELO rating system to chatbot evaluation
    k: K-factor (maximum score change)
    """
    expected_winner = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
    expected_loser = 1 - expected_winner

    new_winner_elo = winner_elo + k * (1 - expected_winner)
    new_loser_elo = loser_elo + k * (0 - expected_loser)

    return new_winner_elo, new_loser_elo

# ELO score examples (reference values as of 2025)
chatbot_arena_elo = {
    "GPT-4o": 1287,
    "Claude 3.5 Sonnet": 1265,
    "Gemini 1.5 Pro": 1263,
    "Llama 3.1 405B": 1251,
    "DeepSeek V3": 1301,
    "GPT-4o-mini": 1218,
}

5. RAG System Evaluation (RAGAS)

RAG (Retrieval-Augmented Generation) systems are difficult to evaluate with general LLM benchmarks. RAGAS is a RAG-specific evaluation framework.

pip install ragas langchain openai

Core Evaluation Metrics

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
)
from datasets import Dataset

Faithfulness

Measures whether the generated answer is grounded in the retrieved context. A key metric for detecting hallucinations.

# Faithfulness = number of statements verifiable from context / total statements

# High faithfulness example
example_high_faithfulness = {
    "question": "What year was Python first created?",
    "answer": "Python was first released in 1991 by Guido van Rossum.",
    "contexts": ["Python was first released in 1991 by Guido van Rossum."],
    "faithfulness": 1.0  # fully supported by context
}

# Low faithfulness example (hallucination)
example_low_faithfulness = {
    "question": "What is the current version of Python?",
    "answer": "Python 3.11 is the current version and was released in 2022.",
    "contexts": ["Python 3.12 was released in October 2023."],
    "faithfulness": 0.3  # contains information differing from context
}

Answer Relevancy

Measures how relevant the generated answer is to the actual question. Drops when the answer is long and contains irrelevant content.

# Computed in reverse: similarity between questions generated from the answer and the original question
from sklearn.metrics.pairwise import cosine_similarity

def compute_answer_relevancy(answer: str, question: str, model) -> float:
    """Sketch of the idea; `model` stands for any wrapper exposing generate() and embed()."""
    # Generate reverse questions from the answer using an LLM
    generated_questions = []
    for _ in range(3):  # generate multiple times and average
        gen_q = model.generate(f"Given the following answer, generate the original question: {answer}")
        generated_questions.append(gen_q)

    # Cosine similarity between the original question and each generated question
    embeddings = model.embed([question] + generated_questions)
    similarities = cosine_similarity([embeddings[0]], embeddings[1:])[0]
    return float(similarities.mean())

Context Recall

Measures how much of the information needed for the correct answer is contained in the retrieved context.

# Context Recall = number of ground-truth statements supported by context / total ground-truth statements

Context Precision

Measures the proportion of retrieved context that is actually useful for generating the answer.
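Under the hood this is a rank-weighted precision: each relevant chunk contributes precision@k at its rank, rewarding retrievers that place useful chunks near the top. A minimal sketch, assuming per-chunk relevance judgments are already available (RAGAS obtains them from an LLM):

```python
def context_precision(relevance: list) -> float:
    """Mean precision@k over the ranks where a relevant chunk appears.
    relevance[i] is True if the chunk at rank i+1 was judged relevant."""
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3 → (1/1 + 2/3) / 2 ≈ 0.833
print(context_precision([True, False, True]))
```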

RAGAS Practical Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Prepare evaluation data
eval_data = {
    "question": [
        "What is the capital of South Korea?",
        "In what year did King Sejong create Hangul?",
        "What is the national flower of Korea?",
    ],
    "answer": [
        "The capital of South Korea is Seoul.",
        "King Sejong created Hangul in 1443.",
        "The national flower of Korea is the Hibiscus (Mugunghwa).",
    ],
    "contexts": [
        ["Seoul is the capital and largest city of South Korea, with a population of approximately 9.5 million."],
        ["King Sejong created Hunminjeongeum (Hangul) in 1443."],
        ["The Hibiscus (Mugunghwa) is the national flower of South Korea, symbolizing a new bloom every morning."],
    ],
    "ground_truth": [
        "Seoul",
        "1443",
        "Hibiscus (Mugunghwa)",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Configure LLM and embedding model
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
    llm=llm,
    embeddings=embeddings,
)

print("RAGAS Evaluation Results:")
print(f"  Faithfulness: {result['faithfulness']:.4f}")
print(f"  Answer Relevancy: {result['answer_relevancy']:.4f}")
print(f"  Context Recall: {result['context_recall']:.4f}")
print(f"  Context Precision: {result['context_precision']:.4f}")

Full RAG Pipeline Evaluation

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
import time

class RAGEvaluator:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain
        self.eval_results = []

    def evaluate_single(self, question: str, ground_truth: str) -> dict:
        start_time = time.time()
        result = self.qa_chain.invoke(question)
        latency = time.time() - start_time

        return {
            "question": question,
            "answer": result['result'],
            "contexts": [doc.page_content for doc in result.get('source_documents', [])],
            "ground_truth": ground_truth,
            "latency": latency
        }

    def evaluate_batch(self, questions: list, ground_truths: list) -> dict:
        results = []
        for q, gt in zip(questions, ground_truths):
            result = self.evaluate_single(q, gt)
            results.append(result)

        # RAGAS evaluation
        dataset = Dataset.from_list(results)
        ragas_result = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
        )

        # Latency statistics
        latencies = [r['latency'] for r in results]

        return {
            "ragas_scores": ragas_result,
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],
            "num_evaluated": len(results)
        }

6. Custom LLM Evaluation Pipeline

Building an Evaluation Dataset

import json
import random
from openai import OpenAI

class EvalDatasetBuilder:
    def __init__(self):
        self.client = OpenAI()

    def generate_qa_pairs(self, documents: list, num_pairs: int = 100) -> list:
        """Automatically generate QA pairs from documents"""
        qa_pairs = []

        for doc in documents[:num_pairs]:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Generate a question-answer pair based on the given text.
                        Return in the following JSON format:
                        {"question": "question", "answer": "answer"}"""
                    },
                    {
                        "role": "user",
                        "content": f"Text: {doc}"
                    }
                ],
                response_format={"type": "json_object"}
            )

            try:
                qa = json.loads(response.choices[0].message.content)
                qa['context'] = doc
                qa_pairs.append(qa)
            except json.JSONDecodeError:
                continue

        return qa_pairs

    def split_dataset(self, qa_pairs: list, test_ratio: float = 0.2) -> tuple:
        random.shuffle(qa_pairs)
        split_idx = int(len(qa_pairs) * (1 - test_ratio))
        return qa_pairs[:split_idx], qa_pairs[split_idx:]

A/B Testing

from openai import OpenAI

class LLMABTest:
    def __init__(self, model_a: str, model_b: str):
        self.model_a = model_a
        self.model_b = model_b
        self.client = OpenAI()
        self.results = {"comparisons": []}

    def run_single_test(
        self,
        prompt: str,
        judge_model: str = "gpt-4o"
    ) -> dict:
        # Collect responses from both models
        response_a = self.client.chat.completions.create(
            model=self.model_a,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        response_b = self.client.chat.completions.create(
            model=self.model_b,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        # Judge with GPT-4
        judge_prompt = f"""Evaluate which of the following two AI responses is better.

Question: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer with only A or B for the better response. If it's a tie, answer TIE.
Answer:"""

        judgment = self.client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            max_tokens=10,
            temperature=0
        ).choices[0].message.content.strip()

        # Record the comparison so win rates can be computed later
        result = {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "winner": judgment
        }
        self.results["comparisons"].append(result)
        return result

    def calculate_win_rates(self) -> dict:
        all_results = self.results.get("comparisons", [])
        if not all_results:
            return {}

        wins_a = sum(1 for r in all_results if "A" in r.get("winner", ""))
        wins_b = sum(1 for r in all_results if "B" in r.get("winner", ""))
        ties = sum(1 for r in all_results if "TIE" in r.get("winner", ""))

        total = len(all_results)
        return {
            f"{self.model_a}_win_rate": wins_a / total,
            f"{self.model_b}_win_rate": wins_b / total,
            "tie_rate": ties / total,
        }

Limitations of Automated Evaluation

# Known biases in LLM-as-Judge
llm_judge_biases = {
    "position bias": "tendency to prefer the first response",
    "length bias": "tendency to prefer longer responses",
    "self-preference bias": "tendency to prefer responses from the same model",
    "format bias": "preference for structured responses with bullet points, headers, etc.",
}

# Bias mitigation strategies
bias_mitigation = {
    "order swapping": "evaluate both A-B and B-A order and average",
    "majority vote": "use multiple judge models",
    "absolute scoring": "independent absolute scores instead of relative comparisons",
    "CoT evaluation": "ask the judge to explain reasoning before giving a score",
}
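The order-swapping strategy can be sketched as a wrapper around any pairwise judge (the `judge` signature here is hypothetical): run the comparison in both orders and keep only verdicts that survive the swap.

```python
def judge_with_order_swap(judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """judge(prompt, first, second) returns "FIRST", "SECOND", or "TIE".
    Evaluate both presentation orders; an inconsistent verdict becomes a tie."""
    v1 = judge(prompt, resp_a, resp_b)   # A shown first
    v2 = judge(prompt, resp_b, resp_a)   # B shown first
    v2_flipped = {"FIRST": "SECOND", "SECOND": "FIRST"}.get(v2, "TIE")
    if v1 == v2_flipped:                 # verdict consistent across both orders
        return {"FIRST": "A", "SECOND": "B"}.get(v1, "TIE")
    return "TIE"  # disagreement suggests position bias, so score as a tie

# A judge that always prefers whichever answer is shown first
# is neutralized to a tie:
always_first = lambda prompt, first, second: "FIRST"
print(judge_with_order_swap(always_first, "Q?", "answer A", "answer B"))  # TIE
```

A judge with a genuine preference (one that picks the same response regardless of position) still produces a clean A or B verdict.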

7. Production LLM Monitoring

Online Evaluation Metrics

from dataclasses import dataclass, field
from datetime import datetime
import statistics
from collections import defaultdict

@dataclass
class LLMMetrics:
    """Production LLM monitoring metrics"""
    timestamp: datetime = field(default_factory=datetime.now)

    # Performance metrics
    latency_ms: float = 0.0
    tokens_per_second: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    # Quality metrics
    user_rating: int | None = None      # 1-5 user rating
    thumbs_up: bool | None = None       # thumbs up/down
    was_regenerated: bool = False       # whether regeneration was requested

    # Safety metrics
    content_filtered: bool = False
    error_occurred: bool = False
    error_type: str | None = None

class LLMMonitor:
    def __init__(self):
        self.metrics_store = []
        self.alert_thresholds = {
            "latency_p95_ms": 5000,
            "error_rate": 0.05,
            "negative_feedback_rate": 0.2,
        }

    def record(self, metrics: LLMMetrics):
        self.metrics_store.append(metrics)

        # Real-time alert check
        self._check_alerts()

    def compute_stats(self, window_minutes: int = 60) -> dict:
        cutoff = datetime.now().timestamp() - window_minutes * 60
        recent = [
            m for m in self.metrics_store
            if m.timestamp.timestamp() > cutoff
        ]

        if not recent:
            return {}

        latencies = [m.latency_ms for m in recent]
        ratings = [m.user_rating for m in recent if m.user_rating is not None]
        errors = [m for m in recent if m.error_occurred]

        stats = {
            "total_requests": len(recent),
            "avg_latency_ms": statistics.mean(latencies),
            "p50_latency_ms": statistics.median(latencies),
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
            "error_rate": len(errors) / len(recent),
            "avg_input_tokens": statistics.mean([m.input_tokens for m in recent]),
            "avg_output_tokens": statistics.mean([m.output_tokens for m in recent]),
        }

        if ratings:
            stats["avg_user_rating"] = statistics.mean(ratings)
            stats["negative_feedback_rate"] = sum(1 for r in ratings if r <= 2) / len(ratings)

        return stats

    def _check_alerts(self):
        stats = self.compute_stats(window_minutes=5)
        if not stats:
            return

        if stats.get('p95_latency_ms', 0) > self.alert_thresholds['latency_p95_ms']:
            print(f"Warning: P95 latency exceeded threshold ({stats['p95_latency_ms']:.0f}ms)")

        if stats.get('error_rate', 0) > self.alert_thresholds['error_rate']:
            print(f"Warning: Error rate exceeded threshold ({stats['error_rate']*100:.1f}%)")

    def detect_drift(self, baseline_stats: dict, current_stats: dict) -> dict:
        """Detect performance drift after deployment"""
        drift_report = {}

        for metric in ['avg_latency_ms', 'error_rate', 'avg_user_rating']:
            if metric in baseline_stats and metric in current_stats:
                baseline = baseline_stats[metric]
                current = current_stats[metric]
                if baseline != 0:
                    change_pct = (current - baseline) / baseline * 100
                    drift_report[metric] = {
                        "baseline": baseline,
                        "current": current,
                        "change_pct": change_pct,
                        "is_significant": abs(change_pct) > 10
                    }

        return drift_report

User Feedback Loop

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from datetime import datetime
from collections import defaultdict
import uuid
import json

app = FastAPI()

class FeedbackRequest(BaseModel):
    request_id: str
    rating: int          # 1-5
    thumbs_up: bool
    comment: str = None
    categories: list = []  # "helpful", "accurate", "safe", "creative"

class FeedbackStore:
    def __init__(self):
        self.feedback_db = {}  # use a real DB in production

    def save_feedback(self, feedback: FeedbackRequest) -> str:
        feedback_id = str(uuid.uuid4())
        self.feedback_db[feedback_id] = {
            "request_id": feedback.request_id,
            "rating": feedback.rating,
            "thumbs_up": feedback.thumbs_up,
            "comment": feedback.comment,
            "categories": feedback.categories,
            "timestamp": datetime.now().isoformat()
        }
        return feedback_id

    def get_feedback_stats(self) -> dict:
        if not self.feedback_db:
            return {}

        all_feedback = list(self.feedback_db.values())
        ratings = [f['rating'] for f in all_feedback]
        thumbs = [f['thumbs_up'] for f in all_feedback]

        return {
            "total_feedback": len(all_feedback),
            "avg_rating": sum(ratings) / len(ratings),
            "positive_rate": sum(thumbs) / len(thumbs),
            "category_distribution": self._count_categories(all_feedback)
        }

    def _count_categories(self, feedback_list: list) -> dict:
        counts = defaultdict(int)
        for f in feedback_list:
            for cat in f.get('categories', []):
                counts[cat] += 1
        return dict(counts)

feedback_store = FeedbackStore()

@app.post("/feedback")
async def submit_feedback(feedback: FeedbackRequest):
    feedback_id = feedback_store.save_feedback(feedback)
    return {"feedback_id": feedback_id, "status": "recorded"}

@app.get("/feedback/stats")
async def get_stats():
    return feedback_store.get_feedback_stats()

Conclusion

LLM evaluation is not simply about comparing scores — it is about choosing the right evaluation method for the purpose.

Key summary:

General-purpose evaluation:

  • Knowledge: MMLU, ARC
  • Reasoning: GSM8K, MATH
  • Coding: HumanEval, SWE-bench
  • Conversation: MT-Bench

RAG evaluation: RAGAS (Faithfulness + Answer Relevancy + Context Recall + Precision)

Automation tools: LM-Evaluation-Harness for batch execution of standard benchmarks

Production monitoring: latency, error rate, user feedback, drift detection

Benchmark scores are merely reference points — performance in a real service must be evaluated directly. Especially for non-English services, building language-specific evaluation sets and continuously collecting real user feedback is the most accurate evaluation method.