Complete Guide to LLM Evaluation and Benchmarking: MMLU, MT-Bench, RAGAS, LM-Eval


There is no single answer to "which LLM is best?" A model that excels at math may be weak at creative writing, and a model with strong English performance may fall short in Korean. This guide systematically covers how to evaluate LLMs correctly, from standard benchmarks to RAG system evaluation and production monitoring.


1. The Challenges of LLM Evaluation

No Single Metric Is Sufficient

An LLM must simultaneously possess dozens of capabilities.

  • Knowledge: memorizing and retrieving factual information
  • Reasoning: logical reasoning, math, coding
  • Language: grammar, style, multilingual support
  • Instruction following: complying with user directives
  • Safety: refusing harmful content

No single metric can represent all these capabilities.

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

When LLM developers optimize toward a specific benchmark, scores on that benchmark may rise while actual capability does not improve. This is called "benchmark gaming."

# Illustrative patterns of benchmark gaming
gaming_strategies = {
    "training data contamination": "including benchmark questions in training data",
    "specialized fine-tuning": "optimizing only for the benchmark format",
    "prompt engineering": "adjusting responses to match benchmark answer formats",
    "selective evaluation": "reporting only favorable results",
}

Benchmark Contamination

When evaluation data is included in the training data, unreliable scores result.

Detection methods:

  • n-gram overlap checks
  • Perplexity outlier detection
  • Continuously generating new test sets
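The n-gram overlap check can be sketched in a few lines (illustrative only; real contamination audits work at corpus scale with tokenizer-level normalization):

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 8) -> float:
    """Fraction of the test text's word-level n-grams that also occur in the
    training text. High overlap on benchmark items suggests contamination."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    test_ngrams = ngrams(test_text)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & ngrams(train_text)) / len(test_ngrams)

# A benchmark question that appears verbatim in the training data overlaps fully
doc = "a ball is thrown vertically upward with an initial velocity of twenty meters per second"
print(ngram_overlap(doc, doc))  # → 1.0
```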

2. Standard Benchmarks by Capability

Knowledge Evaluation

MMLU (Massive Multitask Language Understanding)

The most comprehensive knowledge benchmark, consisting of approximately 15,000 multiple-choice questions across 57 academic disciplines.

Domains: mathematics, history, law, medicine, physics, computer science, etc.

# MMLU example question structure
mmlu_example = {
    "subject": "high_school_physics",
    "question": "A ball is thrown vertically upward with an initial velocity of 20 m/s. What is the maximum height reached?",
    "choices": ["10 m", "20 m", "40 m", "5 m"],
    "answer": "B"
}

# MMLU score interpretation
mmlu_benchmarks = {
    "random guessing": 25.0,
    "GPT-3.5": 70.0,
    "GPT-4": 86.4,
    "Claude 3 Opus": 86.8,
    "Llama 3.3 70B": 86.0,
    "human expert (estimated)": 89.8,
}

ARC (AI2 Reasoning Challenge)

Grade-school science exam questions (grades 3–9) that evaluate scientific and commonsense reasoning ability.

  • ARC-Easy: basic level
  • ARC-Challenge: advanced level (questions models find difficult)

TriviaQA

Wikipedia-based trivia quizzes that evaluate open-domain question answering capability.

Reasoning Evaluation

GSM8K (Grade School Math)

8,500 elementary school-level math problems. Requires multi-step calculations and linguistic reasoning.

# GSM8K example problem
gsm8k_example = {
    "question": """
    Natasha is baking cookies for a party.
    She baked 48 but ate 12 before the party.
    At the party, friends ate half of what remained.
    How many cookies are left?
    """,
    "answer": "18",
    "chain_of_thought": """
    Start: 48
    After Natasha ate 12: 48 - 12 = 36
    Friends ate half at the party: 36 / 2 = 18
    Remaining: 36 - 18 = 18
    """
}

MATH

12,500 competition mathematics problems (AMC, AIME, and similar), with problems and solutions written in LaTeX.

# MATH score comparison
math_scores = {
    "GPT-3.5": 34.1,
    "GPT-4": 52.9,
    "Claude 3 Opus": 60.1,
    "DeepSeek-R1": 97.3,    # reasoning specialist
    "Llama 3.3 70B": 77.0,
}

HellaSwag

A commonsense reasoning benchmark for choosing the correct sentence completion. Humans score 95% but early models found it challenging.
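Benchmarks of this kind are usually scored by log-likelihood rather than by asking the model to name a letter: each candidate ending is appended to the context, and the completion with the highest (length-normalized) score wins. A schematic sketch, with a toy scoring function standing in for a real model:

```python
def pick_completion(context: str, endings: list[str], loglikelihood) -> int:
    """Return the index of the ending with the highest length-normalized score."""
    scores = []
    for ending in endings:
        ll = loglikelihood(context, ending)              # sum of token log-probs
        scores.append(ll / max(len(ending.split()), 1))  # normalize by length
    return scores.index(max(scores))

# Toy stand-in scorer: rewards endings that reuse words from the context
def toy_loglikelihood(context: str, ending: str) -> float:
    ctx = set(context.lower().split())
    return sum(0.0 if w in ctx else -1.0 for w in ending.lower().split())

idx = pick_completion(
    "The chef cracked the eggs into the bowl and",
    ["whisked the eggs until smooth.", "drove the car to the airport."],
    toy_loglikelihood,
)
print(idx)  # → 0
```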

WinoGrande

Pronoun coreference resolution problems. Requires contextual reasoning like: "The trophy didn't fit in the suitcase because it was too big. What was too big?"

Coding Evaluation

HumanEval (OpenAI)

164 programming problems. Code is generated from a function signature and docstring.

# HumanEval example (uses pass@k metric)
humaneval_example = {
    "task_id": "HumanEval/0",
    "prompt": '''
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other
    than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    ''',
    "canonical_solution": "..."
}

# pass@k: probability that at least 1 of k sampled completions is correct
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    n: total number of samples generated
    c: number of correct samples
    k: k in pass@k
    Unbiased estimator: 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. pass_at_k(n=10, c=3, k=1) == 0.3

MBPP (Mostly Basic Programming Problems)

974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Broader in coverage but generally easier than HumanEval.

SWE-bench

A software engineering benchmark for resolving real GitHub issues. Evaluates the most realistic coding ability.
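For reference, a SWE-bench instance bundles the repository, the issue text, the gold patch, and the tests that must flip from failing to passing (field names follow the published dataset; the values below are abbreviated placeholders):

```python
# Illustrative shape of a SWE-bench instance
swebench_instance = {
    "repo": "django/django",
    "instance_id": "django__django-12345",                     # hypothetical ID
    "problem_statement": "QuerySet.union() crashes when ...",  # the GitHub issue text
    "patch": "diff --git a/django/db/models/query.py ...",     # gold fix
    "test_patch": "diff --git a/tests/queries/tests.py ...",   # tests added for the fix
    "FAIL_TO_PASS": ["tests.queries.tests.UnionTests.test_union"],
}
# A model "resolves" the issue if its generated patch makes the FAIL_TO_PASS
# tests pass without breaking the existing PASS_TO_PASS tests.
```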

# SWE-bench scores (% of issues resolved; approximate, and heavily
# dependent on the agent scaffold used)
swebench_scores = {
    "GPT-4o": 38.8,
    "Claude 3.5 Sonnet": 49.0,
    "DeepSeek V3": 42.0,
    "Llama 3.3 70B": 28.0,
}

Instruction Following Evaluation

MT-Bench

A multi-turn conversation evaluation benchmark that uses GPT-4 as a judge. Scored on a 1–10 scale.

8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities

AlpacaEval

An automatic evaluation system based on win rate against a reference model's responses. Uses GPT-4 Turbo or Claude as a judge.
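Win rate itself is straightforward to compute; a minimal sketch, counting ties as half a win (one common convention):

```python
def win_rate(outcomes: list[str]) -> float:
    """outcomes: per-comparison judgments, each 'win', 'loss', or 'tie'."""
    if not outcomes:
        return 0.0
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)

print(win_rate(["win", "win", "tie", "loss"]))  # → 0.625
```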

Safety Evaluation

TruthfulQA

Evaluates the ability to avoid generating false information. Contains 817 questions about common misconceptions and myths.

# TruthfulQA example
truthfulqa_examples = [
    {
        "question": "Was the moon landing real?",
        "truthful_answer": "Yes, Apollo 11 landed on the moon in 1969, and this is a documented fact.",
        "common_false_answer": "There are conspiracy theories claiming the moon landing was faked."
    }
]

BBQ (Bias Benchmark for QA)

A benchmark for detecting social biases (age, gender, race, etc.).


3. LM-Evaluation-Harness

An open-source evaluation framework developed by EleutherAI. Runs over 60 benchmarks with a unified interface.

Installation

pip install lm-eval
pip install "lm-eval[vllm]"  # when using the vLLM backend

Basic Execution

# Evaluate a HuggingFace model
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results

# Run multiple tasks simultaneously
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,winogrande,gsm8k \
    --device cuda:0 \
    --batch_size auto \
    --output_path ./results

# Fast evaluation with the vLLM backend
lm_eval \
    --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto \
    --output_path ./results

Python API Usage

import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize model
model = HFLM(
    pretrained="meta-llama/Meta-Llama-3-8B-Instruct",
    device="cuda",
    batch_size=8,
    dtype="float16"
)

# Run evaluation
results = lm_eval.simple_evaluate(
    model=model,
    tasks=["mmlu", "arc_challenge", "hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Print results
for task, metrics in results['results'].items():
    print(f"\n{task}:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.4f}")

Adding Custom Tasks

# Example Korean task definition
# tasks/korean_qa/korean_qa.yaml

task: korean_qa
dataset_path: path/to/korean_qa_dataset
dataset_name: null
output_type: multiple_choice
doc_to_text: "Question: {{question}}\nChoices:\n{{choices}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
num_fewshot: 0
# Direct task implementation
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class KoreanSentimentTask(Task):
    VERSION = 1
    DATASET_PATH = "nsmc"  # HuggingFace dataset

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"Classify the sentiment of the following review as positive or negative.\nReview: {doc['document']}\nSentiment:"

    def doc_to_target(self, doc):
        return " positive" if doc['label'] == 1 else " negative"

    def construct_requests(self, doc, ctx):
        return [
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " positive"),
            ),
            Instance(
                request_type="loglikelihood",
                doc=doc,
                arguments=(ctx, " negative"),
            ),
        ]

    def process_results(self, doc, results):
        ll_positive, ll_negative = results
        pred = 1 if ll_positive > ll_negative else 0
        gold = doc['label']
        return {"acc": int(pred == gold)}

    def aggregation(self):
        return {"acc": "mean"}

    def higher_is_better(self):
        return {"acc": True}

Analyzing Evaluation Results

import json
import pandas as pd

# Load results file
with open('./results/results.json', 'r') as f:
    results = json.load(f)

# Organize results
summary = []
for task_name, task_results in results['results'].items():
    for metric, value in task_results.items():
        if isinstance(value, float) and not metric.endswith('_stderr'):
            summary.append({
                'task': task_name,
                'metric': metric,
                'value': value,
                'stderr': task_results.get(f'{metric}_stderr', None)
            })

df = pd.DataFrame(summary)
print(df.to_string(index=False))

# Compare multiple models
def compare_models(model_results: dict) -> pd.DataFrame:
    rows = []
    for model_name, results in model_results.items():
        row = {'model': model_name}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, float) and 'acc' in metric and 'stderr' not in metric:
                    row[f'{task}_{metric}'] = round(value * 100, 2)
        rows.append(row)
    return pd.DataFrame(rows).set_index('model')

4. MT-Bench and Chatbot Arena

MT-Bench

Evaluates multi-turn conversation capability using GPT-4 as a judge.

pip install fschat
# MT-Bench evaluation script
import json
from openai import OpenAI

client = OpenAI()

# MT-Bench question examples
mt_bench_questions = [
    {
        "question_id": 81,
        "category": "writing",
        "turns": [
            "Write an essay on the impact of rapid AI advancement on society.",
            "Revise the essay you just wrote to be more persuasive and add specific examples."
        ]
    }
]

def evaluate_with_gpt4_judge(question: str, answer: str, reference: str = None) -> dict:
    system_prompt = """
    You are an expert evaluator assessing the quality of AI assistant responses.
    Based on the given question and answer, rate it on a scale of 1-10 and explain your reasoning.
    Evaluation criteria: accuracy, usefulness, completeness, language quality
    Always respond in the following format:
    Score: [1-10]
    Reason: [evaluation rationale]
    """

    user_prompt = f"""
    Question: {question}

    AI Response: {answer}
    """

    if reference:
        user_prompt += f"\nReference Answer: {reference}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    content = response.choices[0].message.content
    # Extract score
    import re
    score_match = re.search(r'Score:\s*(\d+)', content)
    score = int(score_match.group(1)) if score_match else 5

    return {
        "score": score,
        "feedback": content
    }

# Collect model responses
def get_model_response(model_name: str, messages: list) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=1024,
        temperature=0.7
    )
    return response.choices[0].message.content

# Run MT-Bench evaluation
def run_mt_bench(model_name: str, questions: list) -> dict:
    results = []

    for q in questions:
        messages = []
        turn_scores = []

        for turn_idx, turn_question in enumerate(q['turns']):
            messages.append({"role": "user", "content": turn_question})
            response = get_model_response(model_name, messages)
            messages.append({"role": "assistant", "content": response})

            eval_result = evaluate_with_gpt4_judge(turn_question, response)
            turn_scores.append(eval_result['score'])

        results.append({
            "question_id": q['question_id'],
            "category": q['category'],
            "turn_scores": turn_scores,
            "avg_score": sum(turn_scores) / len(turn_scores)
        })

    avg_total = sum(r['avg_score'] for r in results) / len(results)
    return {"model": model_name, "avg_score": avg_total, "details": results}

Chatbot Arena (ELO Score)

LMSYS Chatbot Arena is a crowdsourced evaluation platform where users compare and choose between responses from two models.

ELO score calculation:

def update_elo(winner_elo: float, loser_elo: float, k: float = 32) -> tuple:
    """
    Applying the chess ELO rating system to chatbot evaluation
    k: K-factor (maximum score change)
    """
    expected_winner = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
    expected_loser = 1 - expected_winner

    new_winner_elo = winner_elo + k * (1 - expected_winner)
    new_loser_elo = loser_elo + k * (0 - expected_loser)

    return new_winner_elo, new_loser_elo

# ELO score examples (reference values as of 2025)
chatbot_arena_elo = {
    "GPT-4o": 1287,
    "Claude 3.5 Sonnet": 1265,
    "Gemini 1.5 Pro": 1263,
    "Llama 3.1 405B": 1251,
    "DeepSeek V3": 1301,
    "GPT-4o-mini": 1218,
}
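An ELO gap maps directly to an expected head-to-head win probability via the same logistic formula used in `update_elo` above:

```python
def expected_win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that model A beats model B under the ELO model."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# A ~69-point gap corresponds to roughly a 60/40 split
p = expected_win_prob(1287, 1218)
print(f"{p:.3f}")  # → 0.598
```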

5. RAG System Evaluation (RAGAS)

RAG (Retrieval-Augmented Generation) systems are difficult to evaluate with general LLM benchmarks. RAGAS is a RAG-specific evaluation framework.

pip install ragas langchain openai

Core Evaluation Metrics

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
)
from datasets import Dataset

Faithfulness

Measures whether the generated answer is grounded in the retrieved context. A key metric for detecting hallucinations.

# Faithfulness = number of statements verifiable from context / total statements

# High faithfulness example
example_high_faithfulness = {
    "question": "What year was Python first created?",
    "answer": "Python was first released in 1991 by Guido van Rossum.",
    "contexts": ["Python was first released in 1991 by Guido van Rossum."],
    "faithfulness": 1.0  # fully supported by context
}

# Low faithfulness example (hallucination)
example_low_faithfulness = {
    "question": "What is the current version of Python?",
    "answer": "Python 3.11 is the current version and was released in 2022.",
    "contexts": ["Python 3.12 was released in October 2023."],
    "faithfulness": 0.3  # contains information differing from context
}

Answer Relevancy

Measures how relevant the generated answer is to the actual question. Drops when the answer is long and contains irrelevant content.

# Computed in reverse: generate candidate questions from the answer, then
# compare them with the original question (illustrative pseudocode; `model`
# is an assumed wrapper exposing generate() and embed())
from sklearn.metrics.pairwise import cosine_similarity

def compute_answer_relevancy(answer: str, question: str, model) -> float:
    # Generate reverse questions from the answer using an LLM
    generated_questions = []
    for _ in range(3):  # generate several candidates and average
        gen_q = model.generate(f"Given the following answer, generate the original question: {answer}")
        generated_questions.append(gen_q)

    # Cosine similarity between the original question and each candidate
    embeddings = model.embed([question] + generated_questions)
    similarities = cosine_similarity([embeddings[0]], embeddings[1:])[0]
    return float(similarities.mean())

Context Recall

Measures how much of the information needed for the correct answer is contained in the retrieved context.

# Context Recall = number of ground-truth statements supported by context / total ground-truth statements

Context Precision

Measures the proportion of retrieved context that is actually useful for generating the answer.
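RAGAS computes context precision in a rank-aware way, roughly the mean of precision@k at each rank where a relevant chunk appears, so relevant chunks ranked higher count for more. A simplified sketch with binary relevance labels:

```python
def context_precision(relevance: list[int]) -> float:
    """relevance: 1/0 label per retrieved chunk, in rank order.
    Rank-weighted precision: average of precision@k at each relevant rank."""
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(context_precision([1, 1, 0]))  # relevant chunks ranked first → 1.0
print(context_precision([0, 0, 1]))  # only the last chunk relevant → ~0.33
```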

RAGAS Practical Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Prepare evaluation data
eval_data = {
    "question": [
        "What is the capital of South Korea?",
        "In what year did King Sejong create Hangul?",
        "What is the national flower of Korea?",
    ],
    "answer": [
        "The capital of South Korea is Seoul.",
        "King Sejong created Hangul in 1443.",
        "The national flower of Korea is the Hibiscus (Mugunghwa).",
    ],
    "contexts": [
        ["Seoul is the capital and largest city of South Korea, with a population of approximately 9.5 million."],
        ["King Sejong created Hunminjeongeum (Hangul) in 1443."],
        ["The Hibiscus (Mugunghwa) is the national flower of South Korea, symbolizing a new bloom every morning."],
    ],
    "ground_truth": [
        "Seoul",
        "1443",
        "Hibiscus (Mugunghwa)",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Configure LLM and embedding model
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
    llm=llm,
    embeddings=embeddings,
)

print("RAGAS Evaluation Results:")
print(f"  Faithfulness: {result['faithfulness']:.4f}")
print(f"  Answer Relevancy: {result['answer_relevancy']:.4f}")
print(f"  Context Recall: {result['context_recall']:.4f}")
print(f"  Context Precision: {result['context_precision']:.4f}")

Full RAG Pipeline Evaluation

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
import time

class RAGEvaluator:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain
        self.eval_results = []

    def evaluate_single(self, question: str, ground_truth: str) -> dict:
        start_time = time.time()
        result = self.qa_chain.invoke(question)
        latency = time.time() - start_time

        return {
            "question": question,
            "answer": result['result'],
            "contexts": [doc.page_content for doc in result.get('source_documents', [])],
            "ground_truth": ground_truth,
            "latency": latency
        }

    def evaluate_batch(self, questions: list, ground_truths: list) -> dict:
        results = []
        for q, gt in zip(questions, ground_truths):
            result = self.evaluate_single(q, gt)
            results.append(result)

        # RAGAS evaluation
        dataset = Dataset.from_list(results)
        ragas_result = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
        )

        # Latency statistics
        latencies = [r['latency'] for r in results]

        return {
            "ragas_scores": ragas_result,
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],
            "num_evaluated": len(results)
        }

6. Custom LLM Evaluation Pipeline

Building an Evaluation Dataset

import json
import random
from openai import OpenAI

class EvalDatasetBuilder:
    def __init__(self):
        self.client = OpenAI()

    def generate_qa_pairs(self, documents: list, num_pairs: int = 100) -> list:
        """Automatically generate QA pairs from documents"""
        qa_pairs = []

        for doc in documents[:num_pairs]:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Generate a question-answer pair based on the given text.
                        Return in the following JSON format:
                        {"question": "question", "answer": "answer"}"""
                    },
                    {
                        "role": "user",
                        "content": f"Text: {doc}"
                    }
                ],
                response_format={"type": "json_object"}
            )

            try:
                qa = json.loads(response.choices[0].message.content)
                qa['context'] = doc
                qa_pairs.append(qa)
            except json.JSONDecodeError:
                continue

        return qa_pairs

    def split_dataset(self, qa_pairs: list, test_ratio: float = 0.2) -> tuple:
        random.shuffle(qa_pairs)
        split_idx = int(len(qa_pairs) * (1 - test_ratio))
        return qa_pairs[:split_idx], qa_pairs[split_idx:]

A/B Testing

import asyncio
from typing import Callable
import statistics

class LLMABTest:
    def __init__(self, model_a: str, model_b: str):
        self.model_a = model_a
        self.model_b = model_b
        self.client = OpenAI()
        self.results = {"comparisons": []}

    def run_single_test(
        self,
        prompt: str,
        expected: str = None,
        judge_model: str = "gpt-4o"
    ) -> dict:
        # Collect responses from both models
        response_a = self.client.chat.completions.create(
            model=self.model_a,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        response_b = self.client.chat.completions.create(
            model=self.model_b,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        ).choices[0].message.content

        # Judge with GPT-4
        judge_prompt = f"""Evaluate which of the following two AI responses is better.

Question: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer with only A or B for the better response. If it's a tie, answer TIE.
Answer:"""

        judgment = self.client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            max_tokens=10,
            temperature=0
        ).choices[0].message.content.strip()

        result = {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "winner": judgment
        }
        # Store the comparison so calculate_win_rates can aggregate it
        self.results.setdefault("comparisons", []).append(result)
        return result

    def calculate_win_rates(self) -> dict:
        all_results = self.results.get("comparisons", [])
        if not all_results:
            return {}

        wins_a = sum(1 for r in all_results if "A" in r.get("winner", ""))
        wins_b = sum(1 for r in all_results if "B" in r.get("winner", ""))
        ties = sum(1 for r in all_results if "TIE" in r.get("winner", ""))

        total = len(all_results)
        return {
            f"{self.model_a}_win_rate": wins_a / total,
            f"{self.model_b}_win_rate": wins_b / total,
            "tie_rate": ties / total,
        }

Limitations of Automated Evaluation

# Known biases in LLM-as-Judge
llm_judge_biases = {
    "position bias": "tendency to prefer the first response",
    "length bias": "tendency to prefer longer responses",
    "self-preference bias": "tendency to prefer responses from the same model",
    "format bias": "preference for structured responses with bullet points, headers, etc.",
}

# Bias mitigation strategies
bias_mitigation = {
    "order swapping": "evaluate both A-B and B-A order and average",
    "majority vote": "use multiple judge models",
    "absolute scoring": "independent absolute scores instead of relative comparisons",
    "CoT evaluation": "ask the judge to explain reasoning before giving a score",
}
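The order-swapping strategy can be sketched as follows; `judge` is an assumed callable returning 'FIRST', 'SECOND', or 'TIE', and a verdict only counts when both orderings agree:

```python
def debiased_judgment(question: str, answer_a: str, answer_b: str, judge) -> str:
    """Run the judge on both orderings; positional bias shows up as disagreement.
    judge(question, first, second) -> 'FIRST', 'SECOND', or 'TIE' (assumed API)."""
    v1 = judge(question, answer_a, answer_b)   # A shown first
    v2 = judge(question, answer_b, answer_a)   # B shown first
    if v1 == "FIRST" and v2 == "SECOND":
        return "A"   # A preferred in both orderings
    if v1 == "SECOND" and v2 == "FIRST":
        return "B"   # B preferred in both orderings
    return "TIE"     # disagreement or explicit tie is treated as a tie

# A judge that always prefers whichever answer is shown first
positional_judge = lambda q, first, second: "FIRST"
print(debiased_judgment("q", "ans a", "ans b", positional_judge))  # → TIE
```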

7. Production LLM Monitoring

Online Evaluation Metrics

from dataclasses import dataclass, field
from datetime import datetime
import statistics
from collections import defaultdict

@dataclass
class LLMMetrics:
    """Production LLM monitoring metrics"""
    timestamp: datetime = field(default_factory=datetime.now)

    # Performance metrics
    latency_ms: float = 0.0
    tokens_per_second: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    # Quality metrics
    user_rating: int | None = None     # 1-5 user rating
    thumbs_up: bool | None = None      # thumbs up / thumbs down
    was_regenerated: bool = False      # whether regeneration was requested

    # Safety metrics
    content_filtered: bool = False
    error_occurred: bool = False
    error_type: str | None = None

class LLMMonitor:
    def __init__(self):
        self.metrics_store = []
        self.alert_thresholds = {
            "latency_p95_ms": 5000,
            "error_rate": 0.05,
            "negative_feedback_rate": 0.2,
        }

    def record(self, metrics: LLMMetrics):
        self.metrics_store.append(metrics)

        # Real-time alert check
        self._check_alerts()

    def compute_stats(self, window_minutes: int = 60) -> dict:
        cutoff = datetime.now().timestamp() - window_minutes * 60
        recent = [
            m for m in self.metrics_store
            if m.timestamp.timestamp() > cutoff
        ]

        if not recent:
            return {}

        latencies = [m.latency_ms for m in recent]
        ratings = [m.user_rating for m in recent if m.user_rating is not None]
        errors = [m for m in recent if m.error_occurred]

        stats = {
            "total_requests": len(recent),
            "avg_latency_ms": statistics.mean(latencies),
            "p50_latency_ms": statistics.median(latencies),
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
            "error_rate": len(errors) / len(recent),
            "avg_input_tokens": statistics.mean([m.input_tokens for m in recent]),
            "avg_output_tokens": statistics.mean([m.output_tokens for m in recent]),
        }

        if ratings:
            stats["avg_user_rating"] = statistics.mean(ratings)
            stats["negative_feedback_rate"] = sum(1 for r in ratings if r <= 2) / len(ratings)

        return stats

    def _check_alerts(self):
        stats = self.compute_stats(window_minutes=5)
        if not stats:
            return

        if stats.get('p95_latency_ms', 0) > self.alert_thresholds['latency_p95_ms']:
            print(f"Warning: P95 latency exceeded threshold ({stats['p95_latency_ms']:.0f}ms)")

        if stats.get('error_rate', 0) > self.alert_thresholds['error_rate']:
            print(f"Warning: Error rate exceeded threshold ({stats['error_rate']*100:.1f}%)")

    def detect_drift(self, baseline_stats: dict, current_stats: dict) -> dict:
        """Detect performance drift after deployment"""
        drift_report = {}

        for metric in ['avg_latency_ms', 'error_rate', 'avg_user_rating']:
            if metric in baseline_stats and metric in current_stats:
                baseline = baseline_stats[metric]
                current = current_stats[metric]
                if baseline != 0:
                    change_pct = (current - baseline) / baseline * 100
                    drift_report[metric] = {
                        "baseline": baseline,
                        "current": current,
                        "change_pct": change_pct,
                        "is_significant": abs(change_pct) > 10
                    }

        return drift_report

User Feedback Loop

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from datetime import datetime
from collections import defaultdict
import uuid

app = FastAPI()

class FeedbackRequest(BaseModel):
    request_id: str
    rating: int          # 1-5
    thumbs_up: bool
    comment: str = None
    categories: list = []  # "helpful", "accurate", "safe", "creative"

class FeedbackStore:
    def __init__(self):
        self.feedback_db = {}  # use a real DB in production

    def save_feedback(self, feedback: FeedbackRequest) -> str:
        feedback_id = str(uuid.uuid4())
        self.feedback_db[feedback_id] = {
            "request_id": feedback.request_id,
            "rating": feedback.rating,
            "thumbs_up": feedback.thumbs_up,
            "comment": feedback.comment,
            "categories": feedback.categories,
            "timestamp": datetime.now().isoformat()
        }
        return feedback_id

    def get_feedback_stats(self) -> dict:
        if not self.feedback_db:
            return {}

        all_feedback = list(self.feedback_db.values())
        ratings = [f['rating'] for f in all_feedback]
        thumbs = [f['thumbs_up'] for f in all_feedback]

        return {
            "total_feedback": len(all_feedback),
            "avg_rating": sum(ratings) / len(ratings),
            "positive_rate": sum(thumbs) / len(thumbs),
            "category_distribution": self._count_categories(all_feedback)
        }

    def _count_categories(self, feedback_list: list) -> dict:
        counts = defaultdict(int)
        for f in feedback_list:
            for cat in f.get('categories', []):
                counts[cat] += 1
        return dict(counts)

feedback_store = FeedbackStore()

@app.post("/feedback")
async def submit_feedback(feedback: FeedbackRequest):
    feedback_id = feedback_store.save_feedback(feedback)
    return {"feedback_id": feedback_id, "status": "recorded"}

@app.get("/feedback/stats")
async def get_stats():
    return feedback_store.get_feedback_stats()

Conclusion

LLM evaluation is not simply about comparing scores — it is about choosing the right evaluation method for the purpose.

Key summary:

General-purpose evaluation:

  • Knowledge: MMLU, ARC
  • Reasoning: GSM8K, MATH
  • Coding: HumanEval, SWE-bench
  • Conversation: MT-Bench

RAG evaluation: RAGAS (Faithfulness + Answer Relevancy + Context Recall + Precision)

Automation tools: LM-Evaluation-Harness for batch execution of standard benchmarks

Production monitoring: latency, error rate, user feedback, drift detection

Benchmark scores are merely reference points — performance in a real service must be evaluated directly. Especially for non-English services, building language-specific evaluation sets and continuously collecting real user feedback is the most accurate evaluation method.