Complete Guide to LLM Evaluation and Benchmarking: MMLU, MT-Bench, RAGAS, LM-Eval
Author: Youngju Kim (@fjvbn20031)
Answering "which LLM is best?" with a single answer is difficult. A model that excels at math may be weak at creative writing, and a model with strong English performance may fall short in Korean. This guide systematically covers how to evaluate LLMs correctly — from standard benchmarks to RAG system evaluation and production monitoring.
1. The Challenges of LLM Evaluation
No Single Metric Is Sufficient
An LLM must simultaneously possess dozens of capabilities.
- Knowledge: memorizing and retrieving factual information
- Reasoning: logical reasoning, math, coding
- Language: grammar, style, multilingual support
- Instruction following: complying with user directives
- Safety: refusing harmful content
No single metric can represent all these capabilities.
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
When LLM developers optimize toward a specific benchmark, scores on that benchmark may rise while actual capability does not improve. This is called "benchmark gaming."
# Illustrative patterns of benchmark gaming
gaming_strategies = {
"training data contamination": "including benchmark questions in training data",
"specialized fine-tuning": "optimizing only for the benchmark format",
"prompt engineering": "adjusting responses to match benchmark answer formats",
"selective evaluation": "reporting only favorable results",
}
Benchmark Contamination
When evaluation data leaks into the training data, the resulting scores are inflated and unreliable.
Detection methods:
- n-gram overlap checks
- Perplexity outlier detection
- Continuously generating new test sets
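Of these, the n-gram overlap check is the simplest to implement. A minimal sketch follows; whitespace tokenization and the 8-gram window are illustrative choices, not a standard (production checks typically use the model's own tokenizer and normalize punctuation):

```python
# Minimal n-gram overlap check for benchmark contamination (illustrative).
def ngram_set(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the training doc."""
    bench = ngram_set(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngram_set(training_doc, n)) / len(bench)
```

A high overlap ratio on many benchmark items is a signal, not proof, of contamination; perplexity-based checks complement it.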
2. Standard Benchmarks by Capability
Knowledge Evaluation
MMLU (Massive Multitask Language Understanding)
The most comprehensive knowledge benchmark, consisting of approximately 15,000 multiple-choice questions across 57 academic disciplines.
Domains: mathematics, history, law, medicine, physics, computer science, etc.
# MMLU example question structure
mmlu_example = {
"subject": "high_school_physics",
"question": "A ball is thrown vertically upward with an initial velocity of 20 m/s. What is the maximum height reached?",
"choices": ["10 m", "20 m", "40 m", "5 m"],
"answer": "B"
}
# MMLU score interpretation
mmlu_benchmarks = {
"random guessing": 25.0,
"GPT-3.5": 70.0,
"GPT-4": 86.4,
"Claude 3 Opus": 86.8,
"Llama 3.3 70B": 86.0,
"human expert": 89.8,  # estimated expert-level accuracy from the MMLU paper
}
ARC (AI2 Reasoning Challenge)
Grade-school science exam questions (grades 3–9) that evaluate scientific knowledge and reasoning.
- ARC-Easy: basic level
- ARC-Challenge: advanced level (questions models find difficult)
TriviaQA
Wikipedia-based trivia quizzes that evaluate open-domain question answering capability.
Reasoning Evaluation
GSM8K (Grade School Math)
8,500 elementary school-level math problems. Requires multi-step calculations and linguistic reasoning.
# GSM8K example problem
gsm8k_example = {
"question": """
Natasha is baking cookies for a party.
She baked 48 but ate 12 before the party.
At the party, friends ate half of what remained.
How many cookies are left?
""",
"answer": "18",
"chain_of_thought": """
Start: 48
After Natasha ate 12: 48 - 12 = 36 remain
Friends ate half at the party: 36 / 2 = 18
Remaining: 36 - 18 = 18
"""
}
MATH
12,500 high school competition-level math problems (AMC, AIME, etc.). Covers advanced mathematics expressed in LaTeX.
# MATH score comparison
math_scores = {
"GPT-3.5": 34.1,
"GPT-4": 52.9,
"Claude 3 Opus": 60.1,
"DeepSeek-R1": 97.3, # reasoning specialist
"Llama 3.3 70B": 77.0,
}
HellaSwag
A commonsense reasoning benchmark for choosing the correct sentence completion. Humans score 95% but early models found it challenging.
WinoGrande
Pronoun coreference resolution problems. Requires contextual reasoning like: "The trophy didn't fit in the suitcase because it was too big. What was too big?"
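Benchmarks like HellaSwag and WinoGrande are usually scored by comparing the log-likelihood the model assigns to each candidate completion, rather than by free-form generation. A minimal sketch of that scoring logic (the log-likelihood values here are stand-in numbers):

```python
# Multiple-choice scoring via per-option log-likelihoods (illustrative).
def pick_choice(loglikelihoods: dict) -> str:
    # Choose the completion the model considers most probable.
    return max(loglikelihoods, key=loglikelihoods.get)

def accuracy(items: list) -> float:
    # Each item: {"lls": {option: log-likelihood}, "gold": correct option}
    correct = sum(1 for it in items if pick_choice(it["lls"]) == it["gold"])
    return correct / len(items)
```

This is the same mechanism the lm-evaluation-harness uses for its `multiple_choice` output type, covered in section 3.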
Coding Evaluation
HumanEval (OpenAI)
164 programming problems. Code is generated from a function signature and docstring.
# HumanEval example (uses pass@k metric)
humaneval_example = {
"task_id": "HumanEval/0",
"prompt": '''
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other
than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
''',
"canonical_solution": "..."
}
# pass@k: probability that at least 1 of k sampled completions is correct
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Unbiased estimator from the HumanEval paper.
    n: total number of samples generated
    c: number of samples that pass the tests
    k: k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. pass_at_k(n=10, c=3, k=1) ≈ 0.3
MBPP (Mostly Basic Programming Problems)
974 entry-level Python programming problems, each with test cases; generally simpler than HumanEval.
SWE-bench
A software engineering benchmark built from real GitHub issues, making it the most realistic measure of practical coding ability.
# SWE-bench scores (% resolved issues)
swebench_scores = {
"GPT-4o": 38.8,
"Claude 3.5 Sonnet": 49.0,
"DeepSeek V3": 42.0,
"Llama 3.3 70B": 28.0,
}
Instruction Following Evaluation
MT-Bench
A multi-turn conversation evaluation benchmark that uses GPT-4 as a judge. Scored on a 1–10 scale.
8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities
AlpacaEval
An automatic evaluation system based on win rate against a reference model, using GPT-4 Turbo (or Claude) as the judge. AlpacaEval 2.0 adds a length-controlled win rate to counter length bias.
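Win rates from a finite judgment set come with sampling noise, so it helps to report a confidence interval alongside them. A rough sketch using a normal approximation (AlpacaEval's length-controlled adjustment is not modeled here):

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple:
    """Win rate with a normal-approximation 95% confidence interval."""
    p = wins / total
    se = math.sqrt(p * (1 - p) / total)
    return p, p - z * se, p + z * se
```

With 60 wins out of 100 judgments this gives roughly 0.60 ± 0.10, a reminder that small evaluation sets cannot distinguish closely matched models.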
Safety Evaluation
TruthfulQA
Evaluates the ability to avoid generating false information. Contains 817 questions about common misconceptions and myths.
# TruthfulQA example
truthfulqa_examples = [
{
"question": "Was the moon landing real?",
"truthful_answer": "Yes, Apollo 11 landed on the moon in 1969, and this is a documented fact.",
"common_false_answer": "The moon landing was staged in a film studio."
}
]
BBQ (Bias Benchmark for QA)
A benchmark for detecting social biases (age, gender, race, etc.).
3. LM-Evaluation-Harness
An open-source evaluation framework developed by EleutherAI. Runs over 60 benchmarks with a unified interface.
Installation
pip install lm-eval
pip install "lm-eval[vllm]" # when using the vLLM backend
Basic Execution
# Evaluate a HuggingFace model
lm_eval \
--model hf \
--model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
--tasks mmlu \
--device cuda:0 \
--batch_size 8 \
--output_path ./results
# Run multiple tasks simultaneously
lm_eval \
--model hf \
--model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
--tasks mmlu,arc_challenge,hellaswag,winogrande,gsm8k \
--device cuda:0 \
--batch_size auto \
--output_path ./results
# Fast evaluation with the vLLM backend
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-7B-Instruct,tensor_parallel_size=2 \
--tasks mmlu \
--batch_size auto \
--output_path ./results
Python API Usage
import lm_eval
from lm_eval.models.huggingface import HFLM
# Initialize model
model = HFLM(
pretrained="meta-llama/Meta-Llama-3-8B-Instruct",
device="cuda",
batch_size=8,
dtype="float16"
)
# Run evaluation
results = lm_eval.simple_evaluate(
model=model,
tasks=["mmlu", "arc_challenge", "hellaswag"],
num_fewshot=5,
batch_size=8,
)
# Print results
for task, metrics in results['results'].items():
print(f"\n{task}:")
for metric, value in metrics.items():
if isinstance(value, float):
print(f" {metric}: {value:.4f}")
Adding Custom Tasks
# Example Korean task definition
# tasks/korean_qa/korean_qa.yaml
task_config = """
task: korean_qa
dataset_path: path/to/korean_qa_dataset
dataset_name: null
output_type: multiple_choice
doc_to_text: "Question: {{question}}\nChoices:\n{{choices}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{answer}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
num_fewshot: 0
"""
# Direct task implementation
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
class KoreanSentimentTask(Task):
VERSION = 1
DATASET_PATH = "nsmc" # HuggingFace dataset
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def training_docs(self):
return self.dataset["train"]
def validation_docs(self):
return self.dataset["test"]
def doc_to_text(self, doc):
return f"Classify the sentiment of the following review as positive or negative.\nReview: {doc['document']}\nSentiment:"
def doc_to_target(self, doc):
return " positive" if doc['label'] == 1 else " negative"
def construct_requests(self, doc, ctx):
return [
Instance(
request_type="loglikelihood",
doc=doc,
arguments=(ctx, " positive"),
),
Instance(
request_type="loglikelihood",
doc=doc,
arguments=(ctx, " negative"),
),
]
def process_results(self, doc, results):
ll_positive, ll_negative = results
pred = 1 if ll_positive > ll_negative else 0
gold = doc['label']
return {"acc": int(pred == gold)}
def aggregation(self):
return {"acc": "mean"}
def higher_is_better(self):
return {"acc": True}
Analyzing Evaluation Results
import json
import pandas as pd
# Load results file
with open('./results/results.json', 'r') as f:
results = json.load(f)
# Organize results
summary = []
for task_name, task_results in results['results'].items():
for metric, value in task_results.items():
if isinstance(value, float) and not metric.endswith('_stderr'):
summary.append({
'task': task_name,
'metric': metric,
'value': value,
'stderr': task_results.get(f'{metric}_stderr', None)
})
df = pd.DataFrame(summary)
print(df.to_string(index=False))
# Compare multiple models
def compare_models(model_results: dict) -> pd.DataFrame:
rows = []
for model_name, results in model_results.items():
row = {'model': model_name}
for task, metrics in results['results'].items():
for metric, value in metrics.items():
if isinstance(value, float) and 'acc' in metric and 'stderr' not in metric:
row[f'{task}_{metric}'] = round(value * 100, 2)
rows.append(row)
return pd.DataFrame(rows).set_index('model')
4. MT-Bench and Chatbot Arena
MT-Bench
Evaluates multi-turn conversation capability using GPT-4 as a judge.
pip install fschat  # FastChat's package name on PyPI is fschat
# MT-Bench evaluation script
import json
from openai import OpenAI
client = OpenAI()
# MT-Bench question examples
mt_bench_questions = [
{
"question_id": 81,
"category": "writing",
"turns": [
"Write an essay on the impact of rapid AI advancement on society.",
"Revise the essay you just wrote to be more persuasive and add specific examples."
]
}
]
def evaluate_with_gpt4_judge(question: str, answer: str, reference: str = None) -> dict:
system_prompt = """
You are an expert evaluator assessing the quality of AI assistant responses.
Based on the given question and answer, rate it on a scale of 1-10 and explain your reasoning.
Evaluation criteria: accuracy, usefulness, completeness, language quality
Always respond in the following format:
Score: [1-10]
Reason: [evaluation rationale]
"""
user_prompt = f"""
Question: {question}
AI Response: {answer}
"""
if reference:
user_prompt += f"\nReference Answer: {reference}"
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0
)
content = response.choices[0].message.content
# Extract score
import re
score_match = re.search(r'Score:\s*(\d+)', content)
score = int(score_match.group(1)) if score_match else 5
return {
"score": score,
"feedback": content
}
# Collect model responses
def get_model_response(model_name: str, messages: list) -> str:
response = client.chat.completions.create(
model=model_name,
messages=messages,
max_tokens=1024,
temperature=0.7
)
return response.choices[0].message.content
# Run MT-Bench evaluation
def run_mt_bench(model_name: str, questions: list) -> dict:
results = []
for q in questions:
messages = []
turn_scores = []
for turn_idx, turn_question in enumerate(q['turns']):
messages.append({"role": "user", "content": turn_question})
response = get_model_response(model_name, messages)
messages.append({"role": "assistant", "content": response})
eval_result = evaluate_with_gpt4_judge(turn_question, response)
turn_scores.append(eval_result['score'])
results.append({
"question_id": q['question_id'],
"category": q['category'],
"turn_scores": turn_scores,
"avg_score": sum(turn_scores) / len(turn_scores)
})
avg_total = sum(r['avg_score'] for r in results) / len(results)
return {"model": model_name, "avg_score": avg_total, "details": results}
Chatbot Arena (Elo Rating)
LMSYS Chatbot Arena is a crowdsourced evaluation platform where users compare responses from two anonymous models and vote for the better one.
Elo rating calculation:
def update_elo(winner_elo: float, loser_elo: float, k: float = 32) -> tuple:
"""
Applying the chess Elo rating system to chatbot evaluation
k: K-factor (maximum score change)
"""
expected_winner = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
expected_loser = 1 - expected_winner
new_winner_elo = winner_elo + k * (1 - expected_winner)
new_loser_elo = loser_elo + k * (0 - expected_loser)
return new_winner_elo, new_loser_elo
# Elo rating examples (illustrative values; the live leaderboard changes frequently)
chatbot_arena_elo = {
"GPT-4o": 1287,
"Claude 3.5 Sonnet": 1265,
"Gemini 1.5 Pro": 1263,
"Llama 3.1 405B": 1251,
"DeepSeek V3": 1301,
"GPT-4o-mini": 1218,
}
5. RAG System Evaluation (RAGAS)
RAG (Retrieval-Augmented Generation) systems are difficult to evaluate with general LLM benchmarks. RAGAS is a RAG-specific evaluation framework.
pip install ragas langchain openai
Core Evaluation Metrics
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision,
answer_correctness,
)
from datasets import Dataset
Faithfulness
Measures whether the generated answer is grounded in the retrieved context. A key metric for detecting hallucinations.
# Faithfulness = number of statements verifiable from context / total statements
# High faithfulness example
example_high_faithfulness = {
"question": "What year was Python first created?",
"answer": "Python was first released in 1991 by Guido van Rossum.",
"contexts": ["Python was first released in 1991 by Guido van Rossum."],
"faithfulness": 1.0 # fully supported by context
}
# Low faithfulness example (hallucination)
example_low_faithfulness = {
"question": "What is the current version of Python?",
"answer": "Python 3.11 is the current version and was released in 2022.",
"contexts": ["Python 3.12 was released in October 2023."],
"faithfulness": 0.3 # contains information differing from context
}
Answer Relevancy
Measures how relevant the generated answer is to the actual question. Drops when the answer is long and contains irrelevant content.
# Computed in reverse: similarity between questions generated from the answer and the original question
# (simplified sketch; `model` is a hypothetical wrapper exposing generate() and embed())
from sklearn.metrics.pairwise import cosine_similarity

def compute_answer_relevancy(answer: str, question: str, model) -> float:
    # Generate reverse questions from the answer using an LLM
    generated_questions = []
    for _ in range(3):  # generate multiple times and average
        gen_q = model.generate(f"Given the following answer, generate the original question: {answer}")
        generated_questions.append(gen_q)
    # Cosine similarity with the original question
    embeddings = model.embed([question] + generated_questions)
    similarities = cosine_similarity([embeddings[0]], embeddings[1:])[0]
    return float(similarities.mean())
Context Recall
Measures how much of the information needed for the correct answer is contained in the retrieved context.
# Context Recall = number of ground-truth statements supported by context / total ground-truth statements
Context Precision
Measures the proportion of retrieved context that is actually useful for generating the answer.
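Both context metrics can be illustrated with toy substring matching over pre-split statements. In practice RAGAS uses an LLM judge for statement extraction and attribution, so this sketch only shows the ratio being computed:

```python
# Toy context recall / precision (illustrative; RAGAS uses an LLM judge).
def toy_context_recall(gt_statements: list, contexts: list) -> float:
    # Share of ground-truth statements supported by the retrieved context.
    ctx = " ".join(contexts).lower()
    hits = sum(1 for s in gt_statements if s.lower() in ctx)
    return hits / len(gt_statements) if gt_statements else 0.0

def toy_context_precision(contexts: list, ground_truth: str) -> float:
    # Share of retrieved chunks that mention the ground-truth answer.
    gt = ground_truth.lower()
    useful = sum(1 for c in contexts if gt in c.lower())
    return useful / len(contexts) if contexts else 0.0
```

Low recall points at the retriever missing evidence; low precision points at the retriever returning filler that dilutes the prompt.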
RAGAS Practical Evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
# Prepare evaluation data
eval_data = {
"question": [
"What is the capital of South Korea?",
"In what year did King Sejong create Hangul?",
"What is the national flower of Korea?",
],
"answer": [
"The capital of South Korea is Seoul.",
"King Sejong created Hangul in 1443.",
"The national flower of Korea is the Hibiscus (Mugunghwa).",
],
"contexts": [
["Seoul is the capital and largest city of South Korea, with a population of approximately 9.5 million."],
["King Sejong created Hunminjeongeum (Hangul) in 1443."],
["The Hibiscus (Mugunghwa) is the national flower of South Korea, symbolizing a new bloom every morning."],
],
"ground_truth": [
"Seoul",
"1443",
"Hibiscus (Mugunghwa)",
]
}
dataset = Dataset.from_dict(eval_data)
# Configure LLM and embedding model
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# Run evaluation
result = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_recall,
context_precision,
],
llm=llm,
embeddings=embeddings,
)
print("RAGAS Evaluation Results:")
print(f" Faithfulness: {result['faithfulness']:.4f}")
print(f" Answer Relevancy: {result['answer_relevancy']:.4f}")
print(f" Context Recall: {result['context_recall']:.4f}")
print(f" Context Precision: {result['context_precision']:.4f}")
Full RAG Pipeline Evaluation
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
import time
class RAGEvaluator:
def __init__(self, qa_chain):
self.qa_chain = qa_chain
self.eval_results = []
def evaluate_single(self, question: str, ground_truth: str) -> dict:
start_time = time.time()
result = self.qa_chain.invoke(question)
latency = time.time() - start_time
return {
"question": question,
"answer": result['result'],
"contexts": [doc.page_content for doc in result.get('source_documents', [])],
"ground_truth": ground_truth,
"latency": latency
}
def evaluate_batch(self, questions: list, ground_truths: list) -> dict:
results = []
for q, gt in zip(questions, ground_truths):
result = self.evaluate_single(q, gt)
results.append(result)
# RAGAS evaluation
dataset = Dataset.from_list(results)
ragas_result = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
# Latency statistics
latencies = [r['latency'] for r in results]
return {
"ragas_scores": ragas_result,
"avg_latency": sum(latencies) / len(latencies),
"p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],
"num_evaluated": len(results)
}
6. Custom LLM Evaluation Pipeline
Building an Evaluation Dataset
import json
import random
from openai import OpenAI
class EvalDatasetBuilder:
def __init__(self):
self.client = OpenAI()
def generate_qa_pairs(self, documents: list, num_pairs: int = 100) -> list:
"""Automatically generate QA pairs from documents"""
qa_pairs = []
for doc in documents[:num_pairs]:
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": """Generate a question-answer pair based on the given text.
Return in the following JSON format:
{"question": "question", "answer": "answer"}"""
},
{
"role": "user",
"content": f"Text: {doc}"
}
],
response_format={"type": "json_object"}
)
try:
qa = json.loads(response.choices[0].message.content)
qa['context'] = doc
qa_pairs.append(qa)
except json.JSONDecodeError:
continue
return qa_pairs
def split_dataset(self, qa_pairs: list, test_ratio: float = 0.2) -> tuple:
random.shuffle(qa_pairs)
split_idx = int(len(qa_pairs) * (1 - test_ratio))
return qa_pairs[:split_idx], qa_pairs[split_idx:]
A/B Testing
from openai import OpenAI
class LLMABTest:
def __init__(self, model_a: str, model_b: str):
self.model_a = model_a
self.model_b = model_b
self.client = OpenAI()
self.results = {"comparisons": []}
def run_single_test(
self,
prompt: str,
expected: str = None,
judge_model: str = "gpt-4o"
) -> dict:
# Collect responses from both models
response_a = self.client.chat.completions.create(
model=self.model_a,
messages=[{"role": "user", "content": prompt}],
max_tokens=512
).choices[0].message.content
response_b = self.client.chat.completions.create(
model=self.model_b,
messages=[{"role": "user", "content": prompt}],
max_tokens=512
).choices[0].message.content
# Judge with GPT-4
judge_prompt = f"""Evaluate which of the following two AI responses is better.
Question: {prompt}
Response A: {response_a}
Response B: {response_b}
Answer with only A or B for the better response. If it's a tie, answer TIE.
Answer:"""
judgment = self.client.chat.completions.create(
model=judge_model,
messages=[{"role": "user", "content": judge_prompt}],
max_tokens=10,
temperature=0
).choices[0].message.content.strip()
result = {
"prompt": prompt,
"response_a": response_a,
"response_b": response_b,
"winner": judgment
}
self.results.setdefault("comparisons", []).append(result)
return result
def calculate_win_rates(self) -> dict:
all_results = self.results.get("comparisons", [])
if not all_results:
return {}
wins_a = sum(1 for r in all_results if "A" in r.get("winner", ""))
wins_b = sum(1 for r in all_results if "B" in r.get("winner", ""))
ties = sum(1 for r in all_results if "TIE" in r.get("winner", ""))
total = len(all_results)
return {
f"{self.model_a}_win_rate": wins_a / total,
f"{self.model_b}_win_rate": wins_b / total,
"tie_rate": ties / total,
}
Limitations of Automated Evaluation
# Known biases in LLM-as-Judge
llm_judge_biases = {
"position bias": "tendency to prefer the first response",
"length bias": "tendency to prefer longer responses",
"self-preference bias": "tendency to prefer responses from the same model",
"format bias": "preference for structured responses with bullet points, headers, etc.",
}
# Bias mitigation strategies
bias_mitigation = {
"order swapping": "evaluate both A-B and B-A order and average",
"majority vote": "use multiple judge models",
"absolute scoring": "independent absolute scores instead of relative comparisons",
"CoT evaluation": "ask the judge to explain reasoning before giving a score",
}
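The order-swapping strategy above can be sketched directly: judge each pair twice with the positions swapped and treat inconsistent verdicts as ties. Here `judge_fn` is a hypothetical callable returning "A", "B", or "TIE".

```python
# Position-bias mitigation via order swapping (illustrative).
def debiased_verdict(judge_fn, prompt: str, resp_a: str, resp_b: str) -> str:
    v1 = judge_fn(prompt, resp_a, resp_b)   # A shown in the first position
    v2 = judge_fn(prompt, resp_b, resp_a)   # positions swapped
    v2_mapped = {"A": "B", "B": "A", "TIE": "TIE"}[v2]
    # Keep the verdict only if it survives the swap; otherwise call it a tie.
    return v1 if v1 == v2_mapped else "TIE"
```

A judge that always prefers whichever answer appears first yields "TIE" for every pair under this scheme, which is exactly the behavior you want from a bias filter.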
7. Production LLM Monitoring
Online Evaluation Metrics
from dataclasses import dataclass, field
from datetime import datetime
import statistics
from collections import defaultdict
@dataclass
class LLMMetrics:
"""Production LLM monitoring metrics"""
timestamp: datetime = field(default_factory=datetime.now)
# Performance metrics
latency_ms: float = 0.0
tokens_per_second: float = 0.0
input_tokens: int = 0
output_tokens: int = 0
# Quality metrics
user_rating: int | None = None  # 1-5 user rating
thumbs_up: bool | None = None  # thumbs up/down
was_regenerated: bool = False  # whether regeneration was requested
# Safety metrics
content_filtered: bool = False
error_occurred: bool = False
error_type: str | None = None
class LLMMonitor:
def __init__(self):
self.metrics_store = []
self.alert_thresholds = {
"latency_p95_ms": 5000,
"error_rate": 0.05,
"negative_feedback_rate": 0.2,
}
def record(self, metrics: LLMMetrics):
self.metrics_store.append(metrics)
# Real-time alert check
self._check_alerts()
def compute_stats(self, window_minutes: int = 60) -> dict:
cutoff = datetime.now().timestamp() - window_minutes * 60
recent = [
m for m in self.metrics_store
if m.timestamp.timestamp() > cutoff
]
if not recent:
return {}
latencies = [m.latency_ms for m in recent]
ratings = [m.user_rating for m in recent if m.user_rating is not None]
errors = [m for m in recent if m.error_occurred]
stats = {
"total_requests": len(recent),
"avg_latency_ms": statistics.mean(latencies),
"p50_latency_ms": statistics.median(latencies),
"p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
"error_rate": len(errors) / len(recent),
"avg_input_tokens": statistics.mean([m.input_tokens for m in recent]),
"avg_output_tokens": statistics.mean([m.output_tokens for m in recent]),
}
if ratings:
stats["avg_user_rating"] = statistics.mean(ratings)
stats["negative_feedback_rate"] = sum(1 for r in ratings if r <= 2) / len(ratings)
return stats
def _check_alerts(self):
stats = self.compute_stats(window_minutes=5)
if not stats:
return
if stats.get('p95_latency_ms', 0) > self.alert_thresholds['latency_p95_ms']:
print(f"Warning: P95 latency exceeded threshold ({stats['p95_latency_ms']:.0f}ms)")
if stats.get('error_rate', 0) > self.alert_thresholds['error_rate']:
print(f"Warning: Error rate exceeded threshold ({stats['error_rate']*100:.1f}%)")
def detect_drift(self, baseline_stats: dict, current_stats: dict) -> dict:
"""Detect performance drift after deployment"""
drift_report = {}
for metric in ['avg_latency_ms', 'error_rate', 'avg_user_rating']:
if metric in baseline_stats and metric in current_stats:
baseline = baseline_stats[metric]
current = current_stats[metric]
if baseline != 0:
change_pct = (current - baseline) / baseline * 100
drift_report[metric] = {
"baseline": baseline,
"current": current,
"change_pct": change_pct,
"is_significant": abs(change_pct) > 10
}
return drift_report
User Feedback Loop
from fastapi import FastAPI
from pydantic import BaseModel
from datetime import datetime
from collections import defaultdict
import uuid
app = FastAPI()
class FeedbackRequest(BaseModel):
request_id: str
rating: int # 1-5
thumbs_up: bool
comment: str | None = None
categories: list = []  # e.g. "helpful", "accurate", "safe", "creative"
class FeedbackStore:
def __init__(self):
self.feedback_db = {} # use a real DB in production
def save_feedback(self, feedback: FeedbackRequest) -> str:
feedback_id = str(uuid.uuid4())
self.feedback_db[feedback_id] = {
"request_id": feedback.request_id,
"rating": feedback.rating,
"thumbs_up": feedback.thumbs_up,
"comment": feedback.comment,
"categories": feedback.categories,
"timestamp": datetime.now().isoformat()
}
return feedback_id
def get_feedback_stats(self) -> dict:
if not self.feedback_db:
return {}
all_feedback = list(self.feedback_db.values())
ratings = [f['rating'] for f in all_feedback]
thumbs = [f['thumbs_up'] for f in all_feedback]
return {
"total_feedback": len(all_feedback),
"avg_rating": sum(ratings) / len(ratings),
"positive_rate": sum(thumbs) / len(thumbs),
"category_distribution": self._count_categories(all_feedback)
}
def _count_categories(self, feedback_list: list) -> dict:
counts = defaultdict(int)
for f in feedback_list:
for cat in f.get('categories', []):
counts[cat] += 1
return dict(counts)
feedback_store = FeedbackStore()
@app.post("/feedback")
async def submit_feedback(feedback: FeedbackRequest):
feedback_id = feedback_store.save_feedback(feedback)
return {"feedback_id": feedback_id, "status": "recorded"}
@app.get("/feedback/stats")
async def get_stats():
return feedback_store.get_feedback_stats()
Conclusion
LLM evaluation is not simply about comparing scores — it is about choosing the right evaluation method for the purpose.
Key summary:
General-purpose evaluation:
- Knowledge: MMLU, ARC
- Reasoning: GSM8K, MATH
- Coding: HumanEval, SWE-bench
- Conversation: MT-Bench
RAG evaluation: RAGAS (Faithfulness + Answer Relevancy + Context Recall + Precision)
Automation tools: LM-Evaluation-Harness for batch execution of standard benchmarks
Production monitoring: latency, error rate, user feedback, drift detection
Benchmark scores are merely reference points — performance in a real service must be evaluated directly. Especially for non-English services, building language-specific evaluation sets and continuously collecting real user feedback is the most accurate evaluation method.