- Introduction
- Major LLM Benchmarks: What They Measure and Where They Fall Short
- Evaluation Pipeline Architecture
- Building a Custom Evaluation Pipeline
- Statistical Significance Testing for Model Comparison
- Common Pitfalls and Operational Warnings
- CI/CD Integration: Automated Eval Workflows
- Production Monitoring: Continuous Evaluation
- Summary
- References

Introduction
Choosing the right LLM for a production workload is not a matter of reading a leaderboard. Public benchmarks like MMLU and HumanEval provide useful starting points, but they rarely capture the nuances of your specific domain, user population, or latency requirements. Meanwhile, blindly trusting a single score can lead to deploying a model that excels at trivia but fails at the exact task your application needs.
This guide covers the entire evaluation lifecycle: understanding what the major benchmarks actually measure, recognizing their limitations, building custom evaluation pipelines tailored to your use case, applying statistical rigor to model comparisons, and integrating evaluations into CI/CD workflows for continuous quality assurance.
Major LLM Benchmarks: What They Measure and Where They Fall Short
Benchmark Comparison Table
| Benchmark | Domain | Format | Size | Key Metric | Strengths | Limitations |
|---|---|---|---|---|---|---|
| MMLU | 57 academic subjects | Multiple choice (4-way) | 15,908 questions | Accuracy (%) | Broad knowledge coverage, widely adopted | Saturated (~90%+ for top models), known ground-truth errors |
| HumanEval | Python coding | Code generation | 164 problems | pass@k | Functional correctness via tests | Small dataset, Python-only, no complex system design |
| HELM | 16+ core scenarios | Mixed (MC, generation) | Varies per scenario | 7 metrics (accuracy, fairness, etc.) | Holistic multi-metric evaluation | Computationally expensive, complex setup |
| MT-Bench | 8 conversation categories | Multi-turn dialogue | 80 questions | GPT-4 judge score (1-10) | Tests conversation quality | Judge model bias, limited question count |
| GPQA | Graduate-level STEM | Multiple choice | 448 questions | Accuracy (%) | Expert-level difficulty | Very small, narrow domain coverage |
| MMLU-Pro | Extended academic | 10-way multiple choice | 12,032 questions | Accuracy (%) | Harder than MMLU, less saturation | Still multiple-choice format |
| BigCodeBench | Complex coding tasks | Code generation | 1,140 tasks | pass@1 | Real-world library usage | Execution environment complexity |
| IFEval | Instruction following | Constrained generation | 541 prompts | Strict/Loose accuracy | Tests precise instruction adherence | Narrow evaluation scope |
MMLU: The Standard (and Its Problems)
MMLU (Massive Multitask Language Understanding) was introduced by Hendrycks et al. in 2020 as a challenge benchmark. When released, GPT-3 175B scored just 43.9%. By mid-2024, leading models like Claude 3.5 Sonnet, GPT-4o, and Llama 3.1 405B consistently scored above 88%.
A critical 2024 analysis of 5,700 MMLU questions revealed a significant number of ground-truth errors in the dataset. This means that models are sometimes penalized for giving correct answers, or rewarded for memorizing incorrect ones.
Key takeaway: MMLU is useful for rough capability triage but should never be the sole evaluation criterion.
HumanEval: Measuring Code Generation
HumanEval, introduced by Chen et al. at OpenAI in 2021, consists of 164 hand-written Python programming problems. Each problem includes a function signature, docstring, reference implementation, and an average of 7.7 unit tests. The pass@k metric measures the probability that at least one of k generated samples passes all tests.
```python
# Example: How HumanEval pass@k is calculated
import numpy as np


def estimate_pass_at_k(
    num_samples: int,
    num_correct: int,
    k: int,
) -> float:
    """
    Estimates pass@k using the unbiased estimator from Chen et al. (2021).

    This avoids the high-variance naive estimator (sampling k and checking).
    Instead, it computes: 1 - C(n-c, k) / C(n, k)
    where n = num_samples, c = num_correct.
    """
    if num_correct == 0:
        return 0.0
    if num_samples - num_correct < k:
        return 1.0
    # Use logarithms for numerical stability
    log_numerator = sum(
        np.log(num_samples - num_correct - i) for i in range(k)
    )
    log_denominator = sum(
        np.log(num_samples - i) for i in range(k)
    )
    return 1.0 - np.exp(log_numerator - log_denominator)


# Example: model got 120 correct out of 164 problems, k=1
score = estimate_pass_at_k(num_samples=164, num_correct=120, k=1)
print(f"pass@1: {score:.4f}")  # pass@1: 0.7317
```
HELM: Holistic Evaluation
Stanford CRFM's HELM framework evaluates models across 7 dimensions: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The 2025 HELM Capabilities update added newer models and scenarios including MMLU-Pro, GPQA, IFEval, and WildBench.
MT-Bench: Conversational Quality
MT-Bench uses 80 multi-turn questions across 8 categories (writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, humanities). A strong LLM judge (typically GPT-4) scores responses on a 1-10 scale. Research by Zheng et al. showed that GPT-4 judgments achieve over 80% agreement with human evaluators.
Evaluation Pipeline Architecture
Building a production evaluation pipeline requires more than running a benchmark script. Below is the architecture for a robust, automated evaluation system.
```
+------------------+     +-------------------+     +------------------+
|   Eval Dataset   |     |  Model Registry   |     |   Eval Config    |
|   (versioned,    |---->|  (model versions, |---->|  (metrics, k,    |
|    immutable)    |     |   endpoints)      |     |   thresholds)    |
+------------------+     +-------------------+     +------------------+
         |                         |                        |
         v                         v                        v
+--------------------------------------------------------------+
|                       Eval Orchestrator                      |
|  - Loads dataset splits                                      |
|  - Dispatches inference requests (async, rate-limited)       |
|  - Collects raw outputs                                      |
|  - Routes to appropriate graders                             |
+--------------------------------------------------------------+
         |                         |                        |
         v                         v                        v
+------------------+     +-------------------+     +------------------+
|   Exact Match    |     |   LLM-as-Judge    |     |  Code Execution  |
|      Grader      |     |      Grader       |     |      Grader      |
+------------------+     +-------------------+     +------------------+
         |                         |                        |
         v                         v                        v
+--------------------------------------------------------------+
|                      Results Aggregator                      |
|  - Per-category scores                                       |
|  - Confidence intervals                                      |
|  - Statistical significance tests                            |
|  - Regression detection                                      |
+--------------------------------------------------------------+
                |                              |
                v                              v
       +------------------+          +-------------------+
       |   Dashboard /    |          |    CI/CD Gate     |
       |    Alerting      |          |    (pass/fail)    |
       +------------------+          +-------------------+
```
Building a Custom Evaluation Pipeline
Step 1: Define Your Eval Dataset
Your evaluation dataset must be versioned, immutable, and representative of production traffic. Never evaluate on training data.
```python
import json
import hashlib
from dataclasses import dataclass, field, asdict
from typing import List
from datetime import datetime


@dataclass
class EvalSample:
    """A single evaluation example."""
    id: str
    input_prompt: str
    expected_output: str
    category: str
    difficulty: str  # "easy", "medium", "hard"
    metadata: dict = field(default_factory=dict)


@dataclass
class EvalDataset:
    """Versioned, immutable evaluation dataset."""
    name: str
    version: str
    created_at: str
    samples: List[EvalSample]
    description: str = ""

    def __post_init__(self):
        self._checksum = self._compute_checksum()

    def _compute_checksum(self) -> str:
        content = json.dumps(
            [asdict(s) for s in self.samples],
            sort_keys=True
        )
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    @property
    def checksum(self) -> str:
        return self._checksum

    def validate_integrity(self, expected_checksum: str) -> bool:
        return self._checksum == expected_checksum

    def split_by_category(self) -> dict:
        categories = {}
        for sample in self.samples:
            categories.setdefault(sample.category, []).append(sample)
        return categories

    def save(self, path: str):
        data = {
            "name": self.name,
            "version": self.version,
            "created_at": self.created_at,
            "description": self.description,
            "checksum": self.checksum,
            "sample_count": len(self.samples),
            "samples": [asdict(s) for s in self.samples],
        }
        with open(path, "w") as f:
            json.dump(data, f, indent=2)


# Example: Create an evaluation dataset for a customer support bot
samples = [
    EvalSample(
        id="cs-001",
        input_prompt="Customer says: 'My order #12345 hasn't arrived.' "
                     "Generate a support response.",
        expected_output="apologize, look up order status, provide ETA",
        category="order_inquiry",
        difficulty="easy",
    ),
    EvalSample(
        id="cs-002",
        input_prompt="Customer says: 'I want a refund AND a replacement.' "
                     "Generate a support response following policy.",
        expected_output="explain policy (refund OR replacement, not both), "
                        "offer alternatives",
        category="refund_policy",
        difficulty="hard",
    ),
]

dataset = EvalDataset(
    name="customer_support_eval",
    version="1.0.0",
    created_at=datetime.now().isoformat(),
    samples=samples,
    description="Evaluation set for customer support chatbot v2",
)

print(f"Dataset: {dataset.name} v{dataset.version}")
print(f"Checksum: {dataset.checksum}")
print(f"Categories: {list(dataset.split_by_category().keys())}")
```
Step 2: Implement Graders
Different tasks require different grading strategies. Here are the three most common patterns.
```python
import re
import json
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradeResult:
    score: float  # 0.0 to 1.0
    passed: bool
    reasoning: str
    grader_type: str
    metadata: Optional[dict] = None


class BaseGrader(ABC):
    """Base class for all graders."""

    @abstractmethod
    async def grade(
        self, prompt: str, expected: str, actual: str
    ) -> GradeResult:
        pass


class ExactMatchGrader(BaseGrader):
    """For tasks with deterministic correct answers."""

    def __init__(self, normalize: bool = True):
        self.normalize = normalize

    def _normalize(self, text: str) -> str:
        text = text.strip().lower()
        text = re.sub(r'\s+', ' ', text)
        return text

    async def grade(
        self, prompt: str, expected: str, actual: str
    ) -> GradeResult:
        if self.normalize:
            expected = self._normalize(expected)
            actual = self._normalize(actual)
        match = expected == actual
        return GradeResult(
            score=1.0 if match else 0.0,
            passed=match,
            reasoning=f"Exact match: {match}",
            grader_type="exact_match",
        )


class LLMJudgeGrader(BaseGrader):
    """Uses a strong LLM to judge response quality."""

    JUDGE_PROMPT_TEMPLATE = """You are an expert evaluator. Score the following
response on a scale of 1-5 based on these criteria:
- Accuracy: Does it contain correct information?
- Completeness: Does it address all parts of the question?
- Helpfulness: Is it actionable and useful?

Question: {prompt}
Expected answer should cover: {expected}
Actual response: {actual}

Respond with ONLY a JSON object:
{{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    def __init__(self, judge_client, judge_model: str = "claude-sonnet-4-20250514"):
        self.client = judge_client
        self.model = judge_model

    async def grade(
        self, prompt: str, expected: str, actual: str
    ) -> GradeResult:
        judge_prompt = self.JUDGE_PROMPT_TEMPLATE.format(
            prompt=prompt, expected=expected, actual=actual
        )
        response = await self.client.messages.create(
            model=self.model,
            max_tokens=256,
            messages=[{"role": "user", "content": judge_prompt}],
        )
        result = json.loads(response.content[0].text)
        normalized_score = result["score"] / 5.0
        return GradeResult(
            score=normalized_score,
            passed=normalized_score >= 0.6,
            reasoning=result["reasoning"],
            grader_type="llm_judge",
            metadata={"judge_model": self.model, "raw_score": result["score"]},
        )


class CodeExecutionGrader(BaseGrader):
    """Grades code generation by running test cases."""

    def __init__(self, timeout_seconds: int = 10):
        self.timeout = timeout_seconds

    async def grade(
        self, prompt: str, expected: str, actual: str
    ) -> GradeResult:
        # Append the test cases (in `expected`) to the generated code
        test_code = f"{actual}\n\n{expected}"
        try:
            proc = await asyncio.create_subprocess_exec(
                "python", "-c", test_code,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, stderr = await asyncio.wait_for(
                proc.communicate(), timeout=self.timeout
            )
            passed = proc.returncode == 0
            return GradeResult(
                score=1.0 if passed else 0.0,
                passed=passed,
                reasoning=stderr.decode() if not passed else "All tests passed",
                grader_type="code_execution",
            )
        except asyncio.TimeoutError:
            return GradeResult(
                score=0.0,
                passed=False,
                reasoning=f"Execution timed out after {self.timeout}s",
                grader_type="code_execution",
            )
```
Step 3: The Eval Orchestrator
The orchestrator ties everything together: it loads the dataset, dispatches inference requests with rate limiting, and routes outputs to the appropriate graders.
```python
import asyncio
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalConfig:
    model_id: str
    model_endpoint: str
    dataset_path: str
    grader_type: str  # "exact_match", "llm_judge", "code_execution"
    max_concurrent: int = 10
    timeout_per_request: int = 30
    temperature: float = 0.0
    max_tokens: int = 1024


@dataclass
class EvalResult:
    sample_id: str
    category: str
    grade: GradeResult
    latency_ms: float
    token_count: int = 0


class EvalOrchestrator:
    """Orchestrates the evaluation pipeline with rate limiting."""

    def __init__(self, config: EvalConfig, client, grader: BaseGrader):
        self.config = config
        self.client = client
        self.grader = grader
        self.semaphore = asyncio.Semaphore(config.max_concurrent)
        self.results: List[EvalResult] = []

    async def evaluate_sample(self, sample: EvalSample) -> EvalResult:
        async with self.semaphore:
            start = time.monotonic()
            response = await self.client.messages.create(
                model=self.config.model_id,
                max_tokens=self.config.max_tokens,
                temperature=self.config.temperature,
                messages=[{"role": "user", "content": sample.input_prompt}],
            )
            latency_ms = (time.monotonic() - start) * 1000
            actual_output = response.content[0].text
            grade = await self.grader.grade(
                sample.input_prompt,
                sample.expected_output,
                actual_output,
            )
            return EvalResult(
                sample_id=sample.id,
                category=sample.category,
                grade=grade,
                latency_ms=latency_ms,
                token_count=response.usage.output_tokens,
            )

    async def run(self, dataset: EvalDataset) -> Dict:
        tasks = [
            self.evaluate_sample(sample) for sample in dataset.samples
        ]
        self.results = await asyncio.gather(*tasks, return_exceptions=True)
        # Filter out exceptions
        valid = [r for r in self.results if isinstance(r, EvalResult)]
        errors = [r for r in self.results if isinstance(r, Exception)]
        # Aggregate by category
        by_category = {}
        for r in valid:
            by_category.setdefault(r.category, []).append(r)
        summary = {
            "model": self.config.model_id,
            "total_samples": len(dataset.samples),
            "completed": len(valid),
            "errors": len(errors),
            "overall_score": (
                sum(r.grade.score for r in valid) / len(valid)
                if valid else 0
            ),
            "overall_pass_rate": (
                sum(1 for r in valid if r.grade.passed) / len(valid)
                if valid else 0
            ),
            "avg_latency_ms": (
                sum(r.latency_ms for r in valid) / len(valid)
                if valid else 0
            ),
            "by_category": {
                cat: {
                    "score": sum(r.grade.score for r in rs) / len(rs),
                    "pass_rate": sum(
                        1 for r in rs if r.grade.passed
                    ) / len(rs),
                    "count": len(rs),
                }
                for cat, rs in by_category.items()
            },
        }
        return summary
```
Statistical Significance Testing for Model Comparison
One of the most common pitfalls in LLM evaluation is declaring one model "better" than another based on a small score difference without testing whether that difference is statistically significant. Anthropic's research team has specifically highlighted the importance of paired-differences tests for model comparisons, noting that they eliminate variance from question difficulty and focus on response-level differences.
```python
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ComparisonResult:
    model_a: str
    model_b: str
    mean_a: float
    mean_b: float
    difference: float
    p_value: float
    confidence_interval: Tuple[float, float]
    significant: bool
    effect_size: float  # Cohen's d
    test_used: str
    n_samples: int


def compare_models(
    scores_a: List[float],
    scores_b: List[float],
    model_a_name: str = "Model A",
    model_b_name: str = "Model B",
    alpha: float = 0.05,
) -> ComparisonResult:
    """
    Paired comparison of two models on the same evaluation set.
    Uses paired t-test when assumptions hold, Wilcoxon signed-rank otherwise.
    """
    a = np.array(scores_a)
    b = np.array(scores_b)
    assert len(a) == len(b), "Both models must be evaluated on identical samples"
    differences = a - b
    n = len(differences)

    # Check normality of differences (Shapiro-Wilk test)
    if n >= 20:
        _, normality_p = stats.shapiro(differences)
    else:
        normality_p = 0  # Too few samples, use non-parametric

    if normality_p > 0.05:
        # Paired t-test
        t_stat, p_value = stats.ttest_rel(a, b)
        test_name = "paired_t_test"
    else:
        # Wilcoxon signed-rank test (non-parametric alternative)
        try:
            w_stat, p_value = stats.wilcoxon(differences)
            test_name = "wilcoxon_signed_rank"
        except ValueError:
            # All differences are zero
            p_value = 1.0
            test_name = "wilcoxon_signed_rank (degenerate)"

    # Effect size (Cohen's d for paired samples)
    diff_std = np.std(differences, ddof=1)
    effect_size = np.mean(differences) / diff_std if diff_std > 0 else 0.0

    # Bootstrap confidence interval for the mean difference
    rng = np.random.default_rng(42)
    boot_means = []
    for _ in range(10000):
        boot_sample = rng.choice(differences, size=n, replace=True)
        boot_means.append(np.mean(boot_sample))
    ci_lower = np.percentile(boot_means, 100 * alpha / 2)
    ci_upper = np.percentile(boot_means, 100 * (1 - alpha / 2))

    return ComparisonResult(
        model_a=model_a_name,
        model_b=model_b_name,
        mean_a=float(np.mean(a)),
        mean_b=float(np.mean(b)),
        difference=float(np.mean(differences)),
        p_value=float(p_value),
        confidence_interval=(float(ci_lower), float(ci_upper)),
        significant=p_value < alpha,
        effect_size=float(effect_size),
        test_used=test_name,
        n_samples=n,
    )


# Usage example
np.random.seed(42)
model_a_scores = np.random.binomial(1, 0.82, size=500).astype(float)
model_b_scores = np.random.binomial(1, 0.78, size=500).astype(float)

result = compare_models(
    model_a_scores.tolist(),
    model_b_scores.tolist(),
    "Claude 3.5 Sonnet",
    "GPT-4o",
)
print(f"Comparison: {result.model_a} vs {result.model_b}")
print(f"  Scores: {result.mean_a:.4f} vs {result.mean_b:.4f}")
print(f"  Difference: {result.difference:+.4f}")
print(f"  p-value: {result.p_value:.4f} ({result.test_used})")
print(f"  95% CI: [{result.confidence_interval[0]:+.4f}, "
      f"{result.confidence_interval[1]:+.4f}]")
print(f"  Cohen's d: {result.effect_size:.3f}")
print(f"  Significant: {result.significant}")
```
Multiple Comparison Correction
When comparing more than two models, you must apply correction for multiple comparisons to avoid false positives.
```python
from itertools import combinations
from typing import Dict, List


def compare_multiple_models(
    model_scores: Dict[str, List[float]],
    alpha: float = 0.05,
) -> List[ComparisonResult]:
    """
    Compare all pairs of models with Bonferroni correction.
    """
    model_names = list(model_scores.keys())
    pairs = list(combinations(model_names, 2))
    n_comparisons = len(pairs)
    # Bonferroni-corrected alpha
    corrected_alpha = alpha / n_comparisons
    results = []
    for name_a, name_b in pairs:
        result = compare_models(
            model_scores[name_a],
            model_scores[name_b],
            name_a,
            name_b,
            alpha=corrected_alpha,
        )
        results.append(result)
    print(f"Bonferroni-corrected alpha: {corrected_alpha:.4f} "
          f"({n_comparisons} comparisons)")
    for r in results:
        sig_marker = "*" if r.significant else ""
        print(f"  {r.model_a} vs {r.model_b}: "
              f"diff={r.difference:+.4f}, "
              f"p={r.p_value:.4f}{sig_marker}")
    return results
```
Common Pitfalls and Operational Warnings
1. Data Contamination
Warning: If your model was trained on data that includes your evaluation set, scores will be inflated and meaningless.
- Always check for overlap between training data and eval data.
- Use recently created evaluation samples that postdate the model's training cutoff.
- Include "canary strings" in eval datasets that you can search for in training dumps.
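One way to implement the canary-string idea: derive a unique, greppable token per dataset version, embed it in every dataset file, and search training corpora for it later. A minimal sketch (the `EVAL-CANARY-` prefix and helper names are arbitrary choices, not a standard):

```python
import hashlib


def make_canary(dataset_name: str, version: str) -> str:
    """Derive a unique, searchable canary string for a dataset version."""
    digest = hashlib.sha256(f"{dataset_name}:{version}".encode()).hexdigest()[:32]
    return f"EVAL-CANARY-{digest}"


def is_contaminated(training_text: str, canary: str) -> bool:
    """True if a training corpus chunk contains the dataset's canary string."""
    return canary in training_text


canary = make_canary("customer_support_eval", "1.0.0")
print(canary)  # e.g. EVAL-CANARY-<32 hex chars>, deterministic per version
```

Because the canary is derived from the dataset name and version, anyone holding the dataset metadata can reproduce it and scan a training dump without access to the eval samples themselves.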
2. Benchmark Saturation
When top models score 88%+ on MMLU, a 1-point difference is statistically meaningless. MMLU-Pro (10-way multiple choice, 12,032 questions) and GPQA (graduate-level STEM) were specifically designed to address this saturation.
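To put numbers on that: the standard error of an accuracy estimate is sqrt(p(1-p)/n), so even sampling noise alone puts a visible band around a benchmark score. A quick back-of-envelope using the dataset sizes from the comparison table (the normal-approximation interval here is a rough sketch, not a rigorous treatment):

```python
import math


def accuracy_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for an accuracy estimate."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


# MMLU (15,908 questions) at 88% accuracy
lo, hi = accuracy_ci(0.88, 15908)
print(f"MMLU: [{lo:.3f}, {hi:.3f}]")  # MMLU: [0.875, 0.885]

# GPQA (448 questions) at 50% accuracy
lo, hi = accuracy_ci(0.50, 448)
print(f"GPQA: [{lo:.3f}, {hi:.3f}]")  # GPQA: [0.454, 0.546]
```

On a small benchmark like GPQA the interval spans nearly ten points, so single-point gaps between models are well within noise.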
3. Evaluation Set Size
Minimum sample sizes for reliable results:
| Desired Margin of Error | Required Samples (95% CI) |
|---|---|
| +/- 1% | ~9,604 |
| +/- 2% | ~2,401 |
| +/- 5% | ~384 |
| +/- 10% | ~96 |
For pass/fail metrics, use the formula: n = (Z^2 * p * (1-p)) / E^2, where Z=1.96 for 95% confidence, p=estimated proportion, E=margin of error.
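The table above follows directly from that formula; a small helper (assuming the worst case p = 0.5 when you have no prior estimate, and rounding to the nearest integer as the table does):

```python
import math


def required_samples(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """n = Z^2 * p * (1-p) / E^2, rounded to the nearest integer."""
    return round(z**2 * p * (1 - p) / margin**2)


for e in (0.01, 0.02, 0.05, 0.10):
    print(f"+/- {e:.0%}: {required_samples(e)} samples")
```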
4. Temperature and Sampling Settings
Always fix temperature=0 for reproducible evaluations. If your production system uses temperature greater than 0, run each sample k times and report pass@k with confidence intervals.
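When repeating each sample k times, the per-sample pass rate needs a confidence interval of its own. One option (a suggestion, not prescribed above) is the Wilson score interval, which behaves better than the normal approximation at the small k values typical of repeated runs:

```python
import math


def wilson_interval(passes: int, k: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate estimated from k repeated runs."""
    if k == 0:
        return 0.0, 1.0
    p = passes / k
    denom = 1 + z**2 / k
    center = (p + z**2 / (2 * k)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / k + z**2 / (4 * k**2))
    return max(0.0, center - half), min(1.0, center + half)


# e.g. a sample passed 7 of 10 repeated runs at production temperature
lo, hi = wilson_interval(passes=7, k=10)
print(f"pass rate 0.70, 95% CI [{lo:.2f}, {hi:.2f}]")
```

At k=10 the interval is wide, which is exactly the point: a single stochastic run tells you very little about a sample's true pass rate.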
5. LLM-as-Judge Pitfalls
- Position bias: The judge may prefer the first or last response in pairwise comparison.
- Verbosity bias: Longer responses tend to receive higher scores.
- Self-enhancement bias: A model used as judge may prefer its own style of output.
- Mitigation: Randomize response order, control for response length, use multiple judge models.
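The order-randomization mitigation can be wired into pairwise judging directly: ask the judge twice with the responses swapped, and count a win only when both orderings agree. A sketch (the `judge_pair` callable stands in for your actual judge request and is an assumed interface):

```python
def debiased_pairwise_verdict(judge_pair, prompt: str, resp_a: str, resp_b: str) -> str:
    """
    Run the judge on both orderings; return "A", "B", or "tie".
    judge_pair(prompt, first, second) is assumed to return
    "first", "second", or "tie".
    """
    forward = judge_pair(prompt, resp_a, resp_b)   # A shown first
    backward = judge_pair(prompt, resp_b, resp_a)  # B shown first
    a_wins = forward == "first" and backward == "second"
    b_wins = forward == "second" and backward == "first"
    if a_wins:
        return "A"
    if b_wins:
        return "B"
    return "tie"  # orderings disagree -> position bias suspected


# Toy judge that always prefers whichever response is shown first
biased_judge = lambda prompt, first, second: "first"
print(debiased_pairwise_verdict(biased_judge, "q", "resp1", "resp2"))  # tie
```

A purely position-biased judge collapses to all ties under this scheme, which is the desired failure mode.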
6. Prompt Format Sensitivity
A model's score can vary by 5-15% depending on the exact prompt format (e.g., "Answer:" vs "The answer is" vs few-shot examples). Always document and version your prompt templates alongside your evaluation datasets.
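Versioning prompt templates can reuse the same checksum discipline as the dataset class earlier. A minimal sketch (the `PromptTemplate` class is illustrative, not a library API):

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """An immutable, versioned prompt template tracked alongside eval data."""
    name: str
    version: str
    template: str  # e.g. "Question: {question}\nAnswer:"

    @property
    def checksum(self) -> str:
        content = f"{self.name}:{self.version}:{self.template}"
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)


tpl = PromptTemplate("mc_answer", "2.1.0", "Question: {question}\nAnswer:")
print(tpl.checksum)  # stable across runs; log it with every eval result
print(tpl.render(question="What is 2+2?"))
```

Logging the template checksum with every result makes score differences attributable: if two runs disagree, you can immediately check whether the prompt changed.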
CI/CD Integration: Automated Eval Workflows
```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation Pipeline

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'model_config/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly Monday 6AM UTC

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements-eval.txt

      - name: Run evaluation suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m eval.run \
            --dataset eval/datasets/production_v2.json \
            --model claude-sonnet-4-20250514 \
            --output results/latest.json \
            --threshold 0.85

      - name: Compare with baseline
        run: |
          python -m eval.compare \
            --baseline results/baseline.json \
            --current results/latest.json \
            --alpha 0.05

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/

      - name: Gate check
        run: |
          python -m eval.gate \
            --results results/latest.json \
            --min-score 0.85 \
            --max-regression 0.02
```
Production Monitoring: Continuous Evaluation
Static benchmarks are a snapshot. Production traffic changes over time, and model behavior can drift. Implement continuous evaluation on a sample of live traffic.
Key metrics to monitor:
- Response quality scores (sampled and graded periodically)
- Latency percentiles (p50, p95, p99)
- Token usage (input and output per request)
- Error rates (API errors, parsing failures, guardrail triggers)
- User feedback correlation (thumbs up/down vs automated eval scores)
Set alerts when any metric deviates by more than 2 standard deviations from the rolling baseline.
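The 2-standard-deviation rule can be implemented as a rolling z-score over a window of recent observations. A minimal sketch (window size and warm-up count are arbitrary starting points, not recommendations):

```python
from collections import deque
import statistics


class RollingAlert:
    """Flags values more than `threshold` std devs from a rolling baseline."""

    def __init__(self, window: int = 100, threshold: float = 2.0):
        self.values: deque = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record a value; return True if it should trigger an alert."""
        alert = False
        if len(self.values) >= 10:  # need a minimal baseline first
            mean = statistics.fmean(self.values)
            std = statistics.pstdev(self.values)
            if std > 0 and abs(value - mean) / std > self.threshold:
                alert = True
        self.values.append(value)
        return alert


monitor = RollingAlert(window=50)
for latency in [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]:
    monitor.observe(latency)
print(monitor.observe(180))  # large spike vs. the ~100ms baseline -> True
```

The bounded deque keeps the baseline rolling, so the detector adapts to slow drift while still catching sudden jumps.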
Summary
LLM evaluation is not a one-time activity. It is a continuous engineering discipline. The key principles are:
- Understand what benchmarks measure -- MMLU tests knowledge breadth, HumanEval tests coding correctness, MT-Bench tests conversational quality. None of them test your specific use case.
- Build custom evals that mirror your production workload, with versioned datasets, appropriate graders, and sufficient sample sizes.
- Apply statistical rigor -- always use paired tests, report confidence intervals, and correct for multiple comparisons.
- Automate everything -- integrate evaluations into CI/CD, run on prompt changes, and monitor production quality continuously.
- Watch for pitfalls -- data contamination, benchmark saturation, judge bias, and prompt sensitivity can all invalidate your results.
The investment in a solid evaluation pipeline pays for itself many times over by preventing costly production regressions and giving you confidence in model selection decisions.
References
- Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2020). "Measuring Massive Multitask Language Understanding." arXiv:2009.03300.
- Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.O., Kaplan, J., et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374.
- Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., et al. (2022). "Holistic Evaluation of Language Models." arXiv:2211.09110. Stanford CRFM.
- Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685.
- OpenAI Evals Framework. GitHub: https://github.com/openai/evals
- Anthropic. (2025). "A Statistical Approach to Model Evaluations." https://www.anthropic.com/research/statistical-approach-to-model-evals
- Wang, Y., Ma, X., Zhang, G., et al. (2024). "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv:2406.01574.