Complete Guide to LLM Evaluation and Benchmarking: MMLU, MT-Bench, RAGAS, LM-Eval


There is no single answer to "which LLM is best?" A model that excels at math may be weak at creative writing, and a model with strong English performance may fall short in Korean. This guide systematically covers how to evaluate LLMs correctly, from standard benchmarks to RAG system evaluation and production monitoring.


1. The Challenges of LLM Evaluation

No Single Metric Is Sufficient

An LLM must simultaneously possess dozens of capabilities.

  • Knowledge: memorizing and retrieving factual information
  • Reasoning: logical reasoning, math, coding
  • Language: grammar, style, multilingual support
  • Instruction following: complying with user directives
  • Safety: refusing harmful content

No single metric can represent all these capabilities.

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

When LLM developers optimize toward a specific benchmark, scores on that benchmark may rise while actual capability does not improve. This is called "benchmark gaming."

# Illustrative patterns of benchmark gaming
gaming_strategies = {
    "training data contamination": "including benchmark questions in training data",
    "specialized fine-tuning": "optimizing only for the benchmark format",
    "prompt engineering": "adjusting responses to match benchmark answer formats",
    "selective evaluation": "reporting only favorable results",
}

Benchmark Contamination

When evaluation data is included in the training data, unreliable scores result.

Detection methods:

  • n-gram overlap checks
  • Perplexity outlier detection
  • Continuously generating new test sets
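The n-gram overlap check can be sketched in a few lines (illustrative only; real contamination audits work at corpus scale with tokenizer-level normalization):

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 8) -> float:
    """Fraction of the test text's word-level n-grams that also occur in the
    training text. High overlap on benchmark items suggests contamination."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    test_ngrams = ngrams(test_text)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & ngrams(train_text)) / len(test_ngrams)

# A benchmark question that appears verbatim in the training data overlaps fully
doc = "a ball is thrown vertically upward with an initial velocity of twenty meters per second"
print(ngram_overlap(doc, doc))  # → 1.0
```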

2. Standard Benchmarks by Capability

Knowledge Evaluation

MMLU (Massive Multitask Language Understanding)

The most comprehensive knowledge benchmark, consisting of approximately 15,000 multiple-choice questions across 57 academic disciplines.

Domains: mathematics, history, law, medicine, physics, computer science, etc.

# MMLU example question structure
mmlu_example = {
    "subject": "high_school_physics",
    "question": "A ball is thrown vertically upward with an initial velocity of 20 m/s. What is the maximum height reached?",
    "choices": ["10 m", "20 m", "40 m", "5 m"],
    "answer": "B"
}

# MMLU score interpretation
mmlu_benchmarks = {
    "random guessing": 25.0,
    "GPT-3.5": 70.0,
    "GPT-4": 86.4,
    "Claude 3 Opus": 86.8,
    "Llama 3.3 70B": 86.0,
    "human expert (estimated)": 89.8,
}

ARC (AI2 Reasoning Challenge)

Grade-school science exam questions (grades 3–9) that evaluate scientific and commonsense reasoning ability.

  • ARC-Easy: basic level
  • ARC-Challenge: advanced level (questions models find difficult)

TriviaQA

Wikipedia-based trivia quizzes that evaluate open-domain question answering capability.

Reasoning Evaluation

GSM8K (Grade School Math)

8,500 elementary school-level math problems. Requires multi-step calculations and linguistic reasoning.

# GSM8K example problem
gsm8k_example = {
    "question": """
    Natasha is baking cookies for a party.
    She baked 48 but ate 12 before the party.
    At the party, friends ate half of what remained.
    How many cookies are left?
    """,
    "answer": "18",
    "chain_of_thought": """
    Start: 48
    After Natasha ate 12: 48 - 12 = 36
    Friends ate half at the party: 36 / 2 = 18
    Remaining: 36 - 18 = 18
    """
}

MATH

12,500 competition mathematics problems (AMC, AIME, and similar), with problems and solutions written in LaTeX.

# MATH score comparison
math_scores = {
    "GPT-3.5": 34.1,
    "GPT-4": 52.9,
    "Claude 3 Opus": 60.1,
    "DeepSeek-R1": 97.3,    # reasoning specialist
    "Llama 3.3 70B": 77.0,
}

HellaSwag

A commonsense reasoning benchmark for choosing the correct sentence completion. Humans score 95% but early models found it challenging.
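Benchmarks of this kind are usually scored by log-likelihood rather than by asking the model to name a letter: each candidate ending is appended to the context, and the completion with the highest (length-normalized) score wins. A schematic sketch, with a toy scoring function standing in for a real model:

```python
def pick_completion(context: str, endings: list[str], loglikelihood) -> int:
    """Return the index of the ending with the highest length-normalized score."""
    scores = []
    for ending in endings:
        ll = loglikelihood(context, ending)              # sum of token log-probs
        scores.append(ll / max(len(ending.split()), 1))  # normalize by length
    return scores.index(max(scores))

# Toy stand-in scorer: rewards endings that reuse words from the context
def toy_loglikelihood(context: str, ending: str) -> float:
    ctx = set(context.lower().split())
    return sum(0.0 if w in ctx else -1.0 for w in ending.lower().split())

idx = pick_completion(
    "The chef cracked the eggs into the bowl and",
    ["whisked the eggs until smooth.", "drove the car to the airport."],
    toy_loglikelihood,
)
print(idx)  # → 0
```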

WinoGrande

Pronoun coreference resolution problems. Requires contextual reasoning like: "The trophy didn't fit in the suitcase because it was too big. What was too big?"

Coding Evaluation

HumanEval (OpenAI)

164 programming problems. Code is generated from a function signature and docstring.

# HumanEval example (uses pass@k metric)
humaneval_example = {
    "task_id": "HumanEval/0",
    "prompt": '''
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other
    than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    ''',
    "canonical_solution": "..."
}

# pass@k: probability that at least 1 of k sampled completions is correct
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    n: total number of samples generated
    c: number of correct samples
    k: k in pass@k
    Unbiased estimator: 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. pass_at_k(n=10, c=3, k=1) == 0.3

MBPP (Mostly Basic Programming Problems)

974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Broader in coverage but generally easier than HumanEval.

SWE-bench

A software engineering benchmark for resolving real GitHub issues. Evaluates the most realistic coding ability.
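For reference, a SWE-bench instance bundles the repository, the issue text, the gold patch, and the tests that must flip from failing to passing (field names follow the published dataset; the values below are abbreviated placeholders):

```python
# Illustrative shape of a SWE-bench instance
swebench_instance = {
    "repo": "django/django",
    "instance_id": "django__django-12345",                     # hypothetical ID
    "problem_statement": "QuerySet.union() crashes when ...",  # the GitHub issue text
    "patch": "diff --git a/django/db/models/query.py ...",     # gold fix
    "test_patch": "diff --git a/tests/queries/tests.py ...",   # tests added for the fix
    "FAIL_TO_PASS": ["tests.queries.tests.UnionTests.test_union"],
}
# A model "resolves" the issue if its generated patch makes the FAIL_TO_PASS
# tests pass without breaking the existing PASS_TO_PASS tests.
```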

# SWE-bench scores (% of issues resolved; approximate, and heavily
# dependent on the agent scaffold used)
swebench_scores = {
    "GPT-4o": 38.8,
    "Claude 3.5 Sonnet": 49.0,
    "DeepSeek V3": 42.0,
    "Llama 3.3 70B": 28.0,
}

Instruction Following Evaluation

MT-Bench

A multi-turn conversation evaluation benchmark that uses GPT-4 as a judge. Scored on a 1–10 scale.

8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities

AlpacaEval

An automatic evaluation system based on win rate against a reference model's responses. Uses GPT-4 Turbo or Claude as a judge.
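Win rate itself is straightforward to compute; a minimal sketch, counting ties as half a win (one common convention):

```python
def win_rate(outcomes: list[str]) -> float:
    """outcomes: per-comparison judgments, each 'win', 'loss', or 'tie'."""
    if not outcomes:
        return 0.0
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)

print(win_rate(["win", "win", "tie", "loss"]))  # → 0.625
```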

Safety Evaluation

TruthfulQA

Evaluates the ability to avoid generating false information. Contains 817 questions about common misconceptions and myths.

# TruthfulQA example
truthfulqa_examples = [
    {
        "question": "Was the moon landing real?",
        "truthful_answer": "Yes, Apollo 11 landed on the moon in 1969, and this is a documented fact.",
        "common_false_answer": "There are conspiracy theories claiming the moon landing was faked."
    }
]

BBQ (Bias Benchmark for QA)

A benchmark for detecting social biases (age, gender, race, etc.).


3. LM-Evaluation-Harness

An open-source evaluation framework developed by EleutherAI. Runs over 60 benchmarks with a unified interface.

Installation

pip install lm-eval
pip install "lm-eval[vllm]"  # when using the vLLM backend

Basic Execution

# Evaluate a HuggingFace model
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results

# Run multiple tasks simultaneously
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,winogrande,gsm8k \
    --device cuda:0 \
    --batch_size auto \
    --output_path ./results

# Fast evaluation with the vLLM backend
lm_eval \
    --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto \
    --output_path ./results

Python API Usage

import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize model
model = HFLM(
    pretrained="meta-llama/Meta-Llama-3-8B-Instruct",
    device="cuda",
    batch_size=8,
    dtype="float16"
)

# Run evaluation
results = lm_eval.simple_evaluate(
    model=model,
    tasks=["mmlu", "arc_challenge", "hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Print results
for task, metrics in results['results'].items():
    print(f"\n{task}:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.4f}")

Adding Custom Tasks

# Example Korean task definition
# tasks/korean_qa/korean_qa.yaml

task: korean_qa
dataset_path: path/to/korean_qa_dataset
dataset_name: null
output_type: multiple_choice
doc_to_text: "Question: {{question}}\nChoices:\n{{choices}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
num_fewshot: 0
# Direct task implementation
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class KoreanSentimentTask(Task):
    VERSION = 1
    DATASET_PATH = "nsmc"  # HuggingFace dataset

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"Classify the sentiment of the following review as positive or negative.\nReview: {doc['document']}\nSentiment:"

    def doc_to_target(self, doc):
        return " positive" if doc['label'] == 1 else " negative"

    def construct_requests(self, doc, ctx):
        return [
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " positive"),
            ),
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " negative"),
            ),
        ]

    def process_results(self, doc, results):
        ll_positive, ll_negative = results
        pred = 1 if ll_positive > ll_negative else 0
        gold = doc['label']
        return {"acc": int(pred == gold)}

    def aggregation(self):
        return {"acc": "mean"}

    def higher_is_better(self):
        return {"acc": True}

Analyzing Evaluation Results

import json
import pandas as pd

# Load results file
with open('./results/results.json', 'r') as f:
    results = json.load(f)

# Organize results
summary = []
for task_name, task_results in results['results'].items():
    for metric, value in task_results.items():
        if isinstance(value, float) and not metric.endswith('_stderr'):
            summary.append({
                'task': task_name,
                'metric': metric,
                'value': value,
                'stderr': task_results.get(f'{metric}_stderr', None)
            })

df = pd.DataFrame(summary)
print(df.to_string(index=False))

# Compare multiple models
def compare_models(model_results: dict) -> pd.DataFrame:
    rows = []
    for model_name, results in model_results.items():
        row = {'model': model_name}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, float) and 'acc' in metric and 'stderr' not in metric:
                    row[f'{task}_{metric}'] = round(value * 100, 2)
        rows.append(row)
    return pd.DataFrame(rows).set_index('model')

4. MT-Bench and Chatbot Arena

MT-Bench

Evaluates multi-turn conversation capability using GPT-4 as a judge.

pip install fschat
# MT-Bench evaluation script
import json
from openai import OpenAI

client = OpenAI()

# MT-Bench question examples
mt_bench_questions = [
    {
        "question_id": 81,
        "category": "writing",
        "turns": [
            "Write an essay on the impact of rapid AI advancement on society.",
            "Revise the essay you just wrote to be more persuasive and add specific examples."
        ]
    }
]

def evaluate_with_gpt4_judge(question: str, answer: str, reference: str = None) -> dict:
    system_prompt = """
    You are an expert evaluator assessing the quality of AI assistant responses.
    Based on the given question and answer, rate it on a scale of 1-10 and explain your reasoning.
    Evaluation criteria: accuracy, usefulness, completeness, language quality
    Always respond in the following format:
    Score: [1-10]
    Reason: [evaluation rationale]
    """

    user_prompt = f"""
    Question: {question}

    AI Response: {answer}
    """

    if reference:
        user_prompt += f"\nReference Answer: {reference}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    content = response.choices[0].message.content
    # Extract score
    import re
    score_match = re.search(r'Score:\s*(\d+)', content)
    score = int(score_match.group(1)) if score_match else 5

    return {
        "score": score,
        "feedback": content
    }

# Collect model responses
def get_model_response(model_name: str, messages: list) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=1024,
        temperature=0.7
    )
    return response.choices[0].message.content

# Run MT-Bench evaluation
def run_mt_bench(model_name: str, questions: list) -> dict:
    results = []

    for q in questions:
        messages = []
        turn_scores = []

        for turn_idx, turn_question in enumerate(q['turns']):
            messages.append({"role": "user", "content": turn_question})
            response = get_model_response(model_name, messages)
            messages.append({"role": "assistant", "content": response})

            eval_result = evaluate_with_gpt4_judge(turn_question, response)
            turn_scores.append(eval_result['score'])

        results.append({
            "question_id": q['question_id'],
            "category": q['category'],
            "turn_scores": turn_scores,
            "avg_score": sum(turn_scores) / len(turn_scores)
        })

    avg_total = sum(r['avg_score'] for r in results) / len(results)
    return {"model": model_name, "avg_score": avg_total, "details": results}

Chatbot Arena (ELO Score)

LMSYS Chatbot Arena is a crowdsourced evaluation platform where users compare and choose between responses from two models.

ELO score calculation:

def update_elo(winner_elo: float, loser_elo: float, k: float = 32) -> tuple:
    """
    Applying the chess ELO rating system to chatbot evaluation
    k: K-factor (maximum score change)
    """
    expected_winner = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
    expected_loser = 1 - expected_winner

    new_winner_elo = winner_elo + k * (1 - expected_winner)
    new_loser_elo = loser_elo + k * (0 - expected_loser)

    return new_winner_elo, new_loser_elo

# ELO score examples (reference values as of 2025)
chatbot_arena_elo = {
    "GPT-4o": 1287,
    "Claude 3.5 Sonnet": 1265,
    "Gemini 1.5 Pro": 1263,
    "Llama 3.1 405B": 1251,
    "DeepSeek V3": 1301,
    "GPT-4o-mini": 1218,
}
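An ELO gap maps directly to an expected head-to-head win probability via the same logistic formula used in `update_elo` above:

```python
def expected_win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that model A beats model B under the ELO model."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# A ~69-point gap corresponds to roughly a 60/40 split
p = expected_win_prob(1287, 1218)
print(f"{p:.3f}")  # → 0.598
```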

5. RAG System Evaluation (RAGAS)

RAG (Retrieval-Augmented Generation) systems are difficult to evaluate with general LLM benchmarks. RAGAS is a RAG-specific evaluation framework.

pip install ragas langchain openai

Core Evaluation Metrics

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
)
from datasets import Dataset

Faithfulness

Measures whether the generated answer is grounded in the retrieved context. A key metric for detecting hallucinations.

# Faithfulness = number of statements verifiable from context / total statements

# High faithfulness example
example_high_faithfulness = {
    "question": "What year was Python first created?",
    "answer": "Python was first released in 1991 by Guido van Rossum.",
    "contexts": ["Python was first released in 1991 by Guido van Rossum."],
    "faithfulness": 1.0  # fully supported by context
}

# Low faithfulness example (hallucination)
example_low_faithfulness = {
    "question": "What is the current version of Python?",
    "answer": "Python 3.11 is the current version and was released in 2022.",
    "contexts": ["Python 3.12 was released in October 2023."],
    "faithfulness": 0.3  # contains information differing from context
}

Answer Relevancy

Measures how relevant the generated answer is to the actual question. Drops when the answer is long and contains irrelevant content.

# Computed in reverse: generate candidate questions from the answer, then
# compare them with the original question (illustrative pseudocode; `model`
# is an assumed wrapper exposing generate() and embed())
from sklearn.metrics.pairwise import cosine_similarity

def compute_answer_relevancy(answer: str, question: str, model) -> float:
    # Generate reverse questions from the answer using an LLM
    generated_questions = []
    for _ in range(3):  # generate several candidates and average
        gen_q = model.generate(f"Given the following answer, generate the original question: {answer}")
        generated_questions.append(gen_q)

    # Cosine similarity between the original question and each candidate
    embeddings = model.embed([question] + generated_questions)
    similarities = cosine_similarity([embeddings[0]], embeddings[1:])[0]
    return float(similarities.mean())

Context Recall

Measures how much of the information needed for the correct answer is contained in the retrieved context.

# Context Recall = number of ground-truth statements supported by context / total ground-truth statements

Context Precision

Measures the proportion of retrieved context that is actually useful for generating the answer.
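RAGAS computes context precision in a rank-aware way, roughly the mean of precision@k at each rank where a relevant chunk appears, so relevant chunks ranked higher count for more. A simplified sketch with binary relevance labels:

```python
def context_precision(relevance: list[int]) -> float:
    """relevance: 1/0 label per retrieved chunk, in rank order.
    Rank-weighted precision: average of precision@k at each relevant rank."""
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(context_precision([1, 1, 0]))  # relevant chunks ranked first → 1.0
print(context_precision([0, 0, 1]))  # only the last chunk relevant → ~0.33
```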

RAGAS Practical Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Prepare evaluation data
eval_data = {
    "question": [
        "What is the capital of South Korea?",
        "In what year did King Sejong create Hangul?",
        "What is the national flower of Korea?",
    ],
    "answer": [
        "The capital of South Korea is Seoul.",
        "King Sejong created Hangul in 1443.",
        "The national flower of Korea is the Hibiscus (Mugunghwa).",
    ],
    "contexts": [
        ["Seoul is the capital and largest city of South Korea, with a population of approximately 9.5 million."],
        ["King Sejong created Hunminjeongeum (Hangul) in 1443."],
        ["The Hibiscus (Mugunghwa) is the national flower of South Korea, symbolizing a new bloom every morning."],
    ],
    "ground_truth": [
        "Seoul",
        "1443",
        "Hibiscus (Mugunghwa)",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Configure LLM and embedding model
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
    llm=llm,
    embeddings=embeddings,
)

print("RAGAS Evaluation Results:")
print(f"  Faithfulness: {result['faithfulness']:.4f}")
print(f"  Answer Relevancy: {result['answer_relevancy']:.4f}")
print(f"  Context Recall: {result['context_recall']:.4f}")
print(f"  Context Precision: {result['context_precision']:.4f}")

Full RAG Pipeline Evaluation

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
import time

class RAGEvaluator:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain
        self.eval_results = []

    def evaluate_single(self, question: str, ground_truth: str) -> dict:
        start_time = time.time()
        result = self.qa_chain.invoke(question)
        latency = time.time() - start_time

        return {
            "question": question,
            "answer": result['result'],
            "contexts": [doc.page_content for doc in result.get('source_documents', [])],
            "ground_truth": ground_truth,
            "latency": latency
        }

    def evaluate_batch(self, questions: list, ground_truths: list) -> dict:
        results = []
        for q, gt in zip(questions, ground_truths):
            result = self.evaluate_single(q, gt)
            results.append(result)

        # RAGAS evaluation
        dataset = Dataset.from_list(results)
        ragas_result = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
        )

        # Latency statistics
        latencies = [r['latency'] for r in results]

        return {
            "ragas_scores": ragas_result,
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],
            "num_evaluated": len(results)
        }

6. Custom LLM Evaluation Pipeline

Building an Evaluation Dataset

import json
import random
from openai import OpenAI

class EvalDatasetBuilder:
    def __init__(self):
        self.client = OpenAI()

    def generate_qa_pairs(self, documents: list, num_pairs: int = 100) -> list:
        """Automatically generate QA pairs from documents"""
        qa_pairs = []

        for doc in documents[:num_pairs]:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Generate a question-answer pair based on the given text.
                        Return in the following JSON format:
                        {"question": "question", "answer": "answer"}"""
                    },
                    {
                        "role": "user",
                        "content": f"Text: {doc}"
                    }
                ],
                response_format={"type": "json_object"}
            )

            try:
                qa = json.loads(response.choices[0].message.content)
                qa['context'] = doc
                qa_pairs.append(qa)
            except json.JSONDecodeError:
                continue

        return qa_pairs

    def split_dataset(self, qa_pairs: list, test_ratio: float = 0.2) -> tuple:
        random.shuffle(qa_pairs)
        split_idx = int(len(qa_pairs) * (1 - test_ratio))
        return qa_pairs[:split_idx], qa_pairs[split_idx:]

A/B Testing

import asyncio
from typing import Callable
import statistics

class LLMABTest:
    def __init__(self, model_a: str, model_b: str):
        self.model_a = model_a
        self.model_b = model_b
        self.client = OpenAI()
        self.results = {"comparisons": []}

    def run_single_test(
        self,
        prompt: str,
        expected: str = None,
        judge_model: str = "gpt-4o"
    ) -> dict:
        # Collect responses from both models
        response_a = self.client.chat.completions.create(
            model=self.model_a,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        response_b = self.client.chat.completions.create(
            model=self.model_b,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        # Judge with GPT-4
        judge_prompt = f"""Evaluate which of the following two AI responses is better.

Question: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer with only A or B for the better response. If it's a tie, answer TIE.
Answer:"""

        judgment = self.client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            max_tokens=10,
            temperature=0
        ).choices[0].message.content.strip()

        result = {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "winner": judgment
        }
        # Store the comparison so calculate_win_rates can aggregate it
        self.results.setdefault("comparisons", []).append(result)
        return result

    def calculate_win_rates(self) -> dict:
        all_results = self.results.get("comparisons", [])
        if not all_results:
            return {}

        wins_a = sum(1 for r in all_results if "A" in r.get("winner", ""))
        wins_b = sum(1 for r in all_results if "B" in r.get("winner", ""))
        ties = sum(1 for r in all_results if "TIE" in r.get("winner", ""))

        total = len(all_results)
        return {
            f"{self.model_a}_win_rate": wins_a / total,
            f"{self.model_b}_win_rate": wins_b / total,
            "tie_rate": ties / total,
        }

Limitations of Automated Evaluation

# Known biases in LLM-as-Judge
llm_judge_biases = {
    "position bias": "tendency to prefer the first response",
    "length bias": "tendency to prefer longer responses",
    "self-preference bias": "tendency to prefer responses from the same model",
    "format bias": "preference for structured responses with bullet points, headers, etc.",
}

# Bias mitigation strategies
bias_mitigation = {
    "order swapping": "evaluate both A-B and B-A order and average",
    "majority vote": "use multiple judge models",
    "absolute scoring": "independent absolute scores instead of relative comparisons",
    "CoT evaluation": "ask the judge to explain reasoning before giving a score",
}
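The order-swapping strategy can be sketched as follows; `judge` is an assumed callable returning 'FIRST', 'SECOND', or 'TIE', and a verdict only counts when both orderings agree:

```python
def debiased_judgment(question: str, answer_a: str, answer_b: str, judge) -> str:
    """Run the judge on both orderings; positional bias shows up as disagreement.
    judge(question, first, second) -> 'FIRST', 'SECOND', or 'TIE' (assumed API)."""
    v1 = judge(question, answer_a, answer_b)   # A shown first
    v2 = judge(question, answer_b, answer_a)   # B shown first
    if v1 == "FIRST" and v2 == "SECOND":
        return "A"   # A preferred in both orderings
    if v1 == "SECOND" and v2 == "FIRST":
        return "B"   # B preferred in both orderings
    return "TIE"     # disagreement or explicit tie is treated as a tie

# A judge that always prefers whichever answer is shown first
positional_judge = lambda q, first, second: "FIRST"
print(debiased_judgment("q", "ans a", "ans b", positional_judge))  # → TIE
```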

7. Production LLM Monitoring

Online Evaluation Metrics

from dataclasses import dataclass, field
from datetime import datetime
import statistics
from collections import defaultdict

@dataclass
class LLMMetrics:
    """Production LLM monitoring metrics"""
    timestamp: datetime = field(default_factory=datetime.now)

    # Performance metrics
    latency_ms: float = 0.0
    tokens_per_second: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    # Quality metrics
    user_rating: int | None = None     # 1-5 user rating
    thumbs_up: bool | None = None      # thumbs up / thumbs down
    was_regenerated: bool = False      # whether regeneration was requested

    # Safety metrics
    content_filtered: bool = False
    error_occurred: bool = False
    error_type: str | None = None

class LLMMonitor:
    def __init__(self):
        self.metrics_store = []
        self.alert_thresholds = {
            "latency_p95_ms": 5000,
            "error_rate": 0.05,
            "negative_feedback_rate": 0.2,
        }

    def record(self, metrics: LLMMetrics):
        self.metrics_store.append(metrics)

        # Real-time alert check
        self._check_alerts()

    def compute_stats(self, window_minutes: int = 60) -> dict:
        cutoff = datetime.now().timestamp() - window_minutes * 60
        recent = [
            m for m in self.metrics_store
            if m.timestamp.timestamp() > cutoff
        ]

        if not recent:
            return {}

        latencies = [m.latency_ms for m in recent]
        ratings = [m.user_rating for m in recent if m.user_rating is not None]
        errors = [m for m in recent if m.error_occurred]

        stats = {
            "total_requests": len(recent),
            "avg_latency_ms": statistics.mean(latencies),
            "p50_latency_ms": statistics.median(latencies),
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
            "error_rate": len(errors) / len(recent),
            "avg_input_tokens": statistics.mean([m.input_tokens for m in recent]),
            "avg_output_tokens": statistics.mean([m.output_tokens for m in recent]),
        }

        if ratings:
            stats["avg_user_rating"] = statistics.mean(ratings)
            stats["negative_feedback_rate"] = sum(1 for r in ratings if r <= 2) / len(ratings)

        return stats

    def _check_alerts(self):
        stats = self.compute_stats(window_minutes=5)
        if not stats:
            return

        if stats.get('p95_latency_ms', 0) > self.alert_thresholds['latency_p95_ms']:
            print(f"Warning: P95 latency exceeded threshold ({stats['p95_latency_ms']:.0f}ms)")

        if stats.get('error_rate', 0) > self.alert_thresholds['error_rate']:
            print(f"Warning: Error rate exceeded threshold ({stats['error_rate']*100:.1f}%)")

    def detect_drift(self, baseline_stats: dict, current_stats: dict) -> dict:
        """Detect performance drift after deployment"""
        drift_report = {}

        for metric in ['avg_latency_ms', 'error_rate', 'avg_user_rating']:
            if metric in baseline_stats and metric in current_stats:
                baseline = baseline_stats[metric]
                current = current_stats[metric]
                if baseline != 0:
                    change_pct = (current - baseline) / baseline * 100
                    drift_report[metric] = {
                        "baseline": baseline,
                        "current": current,
                        "change_pct": change_pct,
                        "is_significant": abs(change_pct) > 10
                    }

        return drift_report

User Feedback Loop

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from datetime import datetime
from collections import defaultdict
import uuid

app = FastAPI()

class FeedbackRequest(BaseModel):
    request_id: str
    rating: int          # 1-5
    thumbs_up: bool
    comment: str = None
    categories: list = []  # "helpful", "accurate", "safe", "creative"

class FeedbackStore:
    def __init__(self):
        self.feedback_db = {}  # use a real DB in production

    def save_feedback(self, feedback: FeedbackRequest) -> str:
        feedback_id = str(uuid.uuid4())
        self.feedback_db[feedback_id] = {
            "request_id": feedback.request_id,
            "rating": feedback.rating,
            "thumbs_up": feedback.thumbs_up,
            "comment": feedback.comment,
            "categories": feedback.categories,
            "timestamp": datetime.now().isoformat()
        }
        return feedback_id

    def get_feedback_stats(self) -> dict:
        if not self.feedback_db:
            return {}

        all_feedback = list(self.feedback_db.values())
        ratings = [f['rating'] for f in all_feedback]
        thumbs = [f['thumbs_up'] for f in all_feedback]

        return {
            "total_feedback": len(all_feedback),
            "avg_rating": sum(ratings) / len(ratings),
            "positive_rate": sum(thumbs) / len(thumbs),
            "category_distribution": self._count_categories(all_feedback)
        }

    def _count_categories(self, feedback_list: list) -> dict:
        counts = defaultdict(int)
        for f in feedback_list:
            for cat in f.get('categories', []):
                counts[cat] += 1
        return dict(counts)

feedback_store = FeedbackStore()

@app.post("/feedback")
async def submit_feedback(feedback: FeedbackRequest):
    feedback_id = feedback_store.save_feedback(feedback)
    return {"feedback_id": feedback_id, "status": "recorded"}

@app.get("/feedback/stats")
async def get_stats():
    return feedback_store.get_feedback_stats()

Conclusion

LLM evaluation is not simply about comparing scores — it is about choosing the right evaluation method for the purpose.

Key summary:

General-purpose evaluation:

  • Knowledge: MMLU, ARC
  • Reasoning: GSM8K, MATH
  • Coding: HumanEval, SWE-bench
  • Conversation: MT-Bench

RAG evaluation: RAGAS (Faithfulness + Answer Relevancy + Context Recall + Precision)

Automation tools: LM-Evaluation-Harness for batch execution of standard benchmarks

Production monitoring: latency, error rate, user feedback, drift detection

Benchmark scores are merely reference points — performance in a real service must be evaluated directly. Especially for non-English services, building language-specific evaluation sets and continuously collecting real user feedback is the most accurate evaluation method.