Author: Youngju Kim (@fjvbn20031)
Table of Contents
- Why AI Benchmarks Matter
- Computer Vision Benchmarks
- NLP Benchmarks
- LLM Capability Benchmarks
- Comprehensive LLM Evaluation
- Korean Language Benchmarks
- Multimodal Benchmarks
- Using LM-Evaluation-Harness
1. Why AI Benchmarks Matter
The Need for Standardized Evaluation
How should AI models be compared? When two image classification models exist, a common standard is needed to determine which is better. Benchmark datasets provide exactly that common ground.
Without standardized benchmarks, each team could evaluate only on data favorable to them, making objective comparison impossible. Standard benchmarks like ImageNet, GLUE, and MMLU have enabled the AI research community to compete on the same test, measuring progress and setting direction.
Leaderboards and Competition
Benchmarks make AI progress visible through leaderboards.
- ImageNet (ILSVRC): AlexNet cut Top-5 error from the runner-up's 26.2% to 15.3% in 2012, launching the deep learning revolution.
- GLUE/SuperGLUE: Documented the journey of BERT, RoBERTa, T5, and others surpassing human-level performance.
- HumanEval: Became the arena where GPT-4, Claude, Gemini, and others compete on code generation.
- LMSYS Chatbot Arena: Real human users blindly compare two models and vote, producing ELO ratings.
Limitations and Biases of Benchmarks
Benchmarks are powerful tools with clear limitations.
1. Dataset Contamination
LLMs are trained on vast internet text. If benchmark test data is present in training data, the model may be memorizing answers rather than genuinely solving problems. Even the GPT-4 technical report acknowledged this issue.
2. Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." Researchers who focus only on improving specific benchmark scores can raise scores without genuine capability improvements.
3. Bias and Representativeness
Many benchmarks are heavily weighted toward English and Western cultural data. Performance in Korean, Arabic, Swahili, and other languages can differ substantially from English benchmark scores.
4. Static Standards
Benchmarks do not change once created, but AI models continually improve. A difficult benchmark in 2023 can reach near-saturation by 2025.
5. Gap from Real-World Performance
High benchmark scores do not guarantee good performance in actual deployment. User experience, creativity, safety, and other hard-to-quantify factors matter just as much.
2. Computer Vision Benchmarks
ImageNet (ILSVRC)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is the most influential benchmark in computer vision history. Originating from the ImageNet project (2009) led by Professor Fei-Fei Li at Stanford, it ran as an annual competition from 2010 to 2017.
Dataset Characteristics:
- 1,000 classes (everyday objects: dogs, cats, cars, etc.)
- Training data: approximately 1.2 million images
- Validation data: 50,000 images
- Test data: 100,000 images
- Average of about 1,200 images per class
Key Metrics:
- Top-1 Accuracy: Fraction of predictions where the top-1 predicted class is the correct label
- Top-5 Accuracy: Fraction where the correct label appears in the top 5 predictions
Historical Progress:
| Year | Model | Top-5 Error |
|---|---|---|
| 2010 | NEC-UIUC | 28.2% |
| 2012 | AlexNet | 15.3% |
| 2014 | VGG-16 | 7.3% |
| 2015 | ResNet-152 | 3.57% |
| 2017 | SENet | 2.25% |
| 2021 | CoAtNet | 0.95% |
| 2023 | ViT-22B | ~0.6% |
Human Top-5 error is estimated at about 5.1%. After ResNet surpassed human performance in 2015, research expanded to harder variants: ImageNet-A, ImageNet-R, and ImageNet-C.
```python
# Measuring ImageNet validation accuracy with PyTorch
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader

def evaluate_imagenet(model, val_dir, batch_size=256):
    # Standard preprocessing (ImageNet validation standard)
    val_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
    val_dataset = datasets.ImageFolder(val_dir, transform=val_transform)
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images = images.cuda()
            labels = labels.cuda()
            outputs = model(images)
            # Take the 5 highest-scoring classes per image
            _, predicted = outputs.topk(5, 1, True, True)
            predicted = predicted.t()
            correct = predicted.eq(labels.view(1, -1).expand_as(predicted))
            top1_correct += correct[:1].reshape(-1).float().sum(0)
            top5_correct += correct[:5].reshape(-1).float().sum(0)
            total += labels.size(0)

    top1_acc = top1_correct / total * 100
    top5_acc = top5_correct / total * 100
    print(f"Top-1 Accuracy: {top1_acc:.2f}%")
    print(f"Top-5 Accuracy: {top5_acc:.2f}%")
    return top1_acc, top5_acc

# Example: Evaluate ResNet-50
model = models.resnet50(pretrained=True).cuda()
evaluate_imagenet(model, '/path/to/imagenet/val')
```
COCO (Common Objects in Context)
COCO is a large-scale object detection, segmentation, and image captioning benchmark released by Microsoft in 2014.
Dataset Characteristics:
- 80 categories of everyday objects
- 330,000+ images
- 1.5+ million object instances
- 5 captions per image (for captioning tasks)
- Detailed instance segmentation masks
Key Metrics:
mAP (mean Average Precision) is COCO's primary metric, averaged over IoU (Intersection over Union) thresholds from 0.50 to 0.95 in steps of 0.05. AP at fixed thresholds (0.50, 0.75) and broken down by object size are also reported.
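IoU itself is straightforward to compute. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format (a standalone helper, not the pycocotools implementation):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box exceeds that threshold, which is why AP@0.75 is much stricter than AP@0.50.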
```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_coco_detection(annotation_file, result_file):
    # Load COCO ground truth
    coco_gt = COCO(annotation_file)
    # Load predictions
    coco_dt = coco_gt.loadRes(result_file)

    # Bounding box evaluation
    coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()

    stats = coco_eval.stats
    print("\n=== COCO Detection Results ===")
    print(f"AP @ IoU=0.50:0.95 (COCO primary): {stats[0]:.3f}")
    print(f"AP @ IoU=0.50 (PASCAL VOC style): {stats[1]:.3f}")
    print(f"AP @ IoU=0.75 (strict): {stats[2]:.3f}")
    print(f"AP small (area < 32^2): {stats[3]:.3f}")
    print(f"AP medium: {stats[4]:.3f}")
    print(f"AP large: {stats[5]:.3f}")
    print(f"AR (max=1 per image): {stats[6]:.3f}")
    print(f"AR (max=10 per image): {stats[7]:.3f}")
    print(f"AR (max=100 per image): {stats[8]:.3f}")
    return stats

# Explore COCO annotations
coco = COCO('instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person', 'car', 'dog'])
img_ids = coco.getImgIds(catIds=cat_ids[:1])
img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(ann_ids)
print(f"Image: {img['file_name']}, Annotations: {len(anns)}")
for ann in anns[:3]:
    cat = coco.loadCats(ann['category_id'])[0]
    print(f"  Category: {cat['name']}, Area: {ann['area']:.0f}px^2")
```
State-of-the-Art COCO Performance (2025):
| Model | AP (box) | AP (mask) | Parameters |
|---|---|---|---|
| YOLOv8x | 53.9 | - | 68M |
| DINO (Swin-L) | 63.3 | - | 218M |
| Co-DINO (Swin-L) | 64.1 | 54.0 | 218M |
| InternImage-H | 65.4 | 56.1 | 2.18B |
ADE20K - Semantic Segmentation
ADE20K, built by MIT CSAIL, is a semantic segmentation benchmark covering 150 categories across 25,000 images.
Key Metrics:
- mIoU (mean Intersection over Union): Average IoU between predicted and ground-truth masks
- aAcc: Pixel-level overall accuracy
- mAcc: Per-class mean accuracy
```python
import numpy as np

def compute_miou(pred_mask, gt_mask, num_classes=150):
    """Compute mIoU over all classes present in the image."""
    iou_list = []
    for cls in range(num_classes):
        pred_cls = (pred_mask == cls)
        gt_cls = (gt_mask == cls)
        intersection = np.logical_and(pred_cls, gt_cls).sum()
        union = np.logical_or(pred_cls, gt_cls).sum()
        if union == 0:
            continue  # Skip if class not present in image
        iou_list.append(intersection / union)
    return np.mean(iou_list) if iou_list else 0.0

# Evaluation with mmsegmentation
from mmseg.apis import inference_segmentor, init_segmentor

config_file = 'configs/segformer/segformer_mit-b5_8xb2-160k_ade20k-512x512.py'
checkpoint_file = 'segformer_mit-b5_8x2_512x512_160k_ade20k_20220617_203542-745f14da.pth'
model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'test_image.jpg')
```
Kinetics - Video Classification
Kinetics, provided by Google DeepMind, is a video action recognition benchmark.
- Kinetics-400: 400 action classes, ~300,000 clips
- Kinetics-600: 600 classes, ~500,000 clips
- Kinetics-700: 700 classes
Primary metrics: Top-1 and Top-5 accuracy. Video-level predictions are typically obtained by averaging scores over multiple clips sampled from each video.
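The clip-averaging step can be sketched in plain Python (the per-clip class probabilities here are hypothetical placeholders for a real model's softmax outputs):

```python
def video_top1(clip_probs):
    """Average per-clip class probabilities, then take the argmax class."""
    num_classes = len(clip_probs[0])
    avg = [sum(clip[c] for clip in clip_probs) / len(clip_probs)
           for c in range(num_classes)]
    return max(range(num_classes), key=avg.__getitem__)

# Three clips sampled from one video, three classes (hypothetical scores)
clips = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.4, 0.1]]
print(video_top1(clips))  # class 0 wins after averaging
```

Averaging before the argmax lets a video be classified correctly even when individual clips are ambiguous.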
CIFAR-10/100
Small-scale image classification benchmarks widely used for rapid prototyping and paper validation.
```python
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

def evaluate_cifar10(model, batch_size=128):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])
    testset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform
    )
    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)

    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in testloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"CIFAR-10 Accuracy: {accuracy:.2f}%")
    return accuracy
```
3. NLP Benchmarks
GLUE (General Language Understanding Evaluation)
GLUE, published in 2018 by researchers from NYU, the University of Washington, and DeepMind, is an NLP model evaluation benchmark consisting of 9 different language understanding tasks.
GLUE Task Composition:
| Task | Description | Dataset | Metric |
|---|---|---|---|
| CoLA | Grammatical acceptability | 8,551 | Matthews Corr. |
| SST-2 | Sentiment classification | 67K | Accuracy |
| MRPC | Semantic equivalence | 3,700 | F1/Accuracy |
| STS-B | Sentence similarity score | 7K | Pearson/Spearman |
| QQP | Question pair similarity | 400K | F1/Accuracy |
| MNLI | Natural language inference (3-way) | 393K | Accuracy |
| QNLI | Question-answer inference | 105K | Accuracy |
| RTE | Textual entailment | 2,500 | Accuracy |
| WNLI | Winograd NLI | 634 | Accuracy |
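CoLA is scored with the Matthews correlation coefficient rather than accuracy because its acceptable/unacceptable labels are imbalanced. A minimal sketch of the metric itself, computed from the binary confusion matrix (equivalent to sklearn's `matthews_corrcoef` for 0/1 labels):

```python
import math

def matthews_corr(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(matthews_corr([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.0 — no better than chance
```

MCC ranges from -1 to +1; a majority-class predictor scores 0, which accuracy would not reveal.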
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import matthews_corrcoef

def evaluate_glue_cola(model_name="bert-base-uncased"):
    """Evaluate CoLA (grammatical acceptability)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )
    dataset = load_dataset("glue", "cola")
    val_data = dataset["validation"]

    predictions = []
    labels = []
    model.eval()
    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
        pred = outputs.logits.argmax(dim=-1).item()
        predictions.append(pred)
        labels.append(item['label'])

    mcc = matthews_corrcoef(labels, predictions)
    print(f"CoLA Matthews Correlation: {mcc:.4f}")
    return mcc

def evaluate_glue_sst2(model_name="textattack/bert-base-uncased-SST-2"):
    """Evaluate SST-2 (sentiment classification)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    dataset = load_dataset("glue", "sst2")
    val_data = dataset["validation"]

    correct = 0
    total = len(val_data)
    model.eval()
    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            truncation=True,
            max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
        pred = outputs.logits.argmax(dim=-1).item()
        if pred == item['label']:
            correct += 1

    acc = correct / total
    print(f"SST-2 Accuracy: {acc:.4f}")
    return acc
```
SuperGLUE
When GLUE approached saturation (near-human performance), SuperGLUE was introduced in 2019 with harder tasks.
SuperGLUE Tasks:
- BoolQ: Yes/no question answering (9,427)
- CB: Commitment/entailment (250, 3-way)
- COPA: Cause/effect reasoning (1,000)
- MultiRC: Multi-sentence reading comprehension (9,693)
- ReCoRD: Cloze-style reading comprehension (120K)
- RTE: Textual entailment recognition (5,749)
- WiC: Word-in-context disambiguation (9,600)
- WSC: Winograd Schema Challenge (554)
Human baseline: 89.8 / GPT-4-class models: 90+ (surpassing humans)
SQuAD 1.1 & 2.0
SQuAD (Stanford Question Answering Dataset) is a machine reading comprehension benchmark where answers are extracted from Wikipedia passages.
- SQuAD 1.1: 536 Wikipedia articles, 107,785 question-answer pairs. All answers exist within the passage.
- SQuAD 2.0: SQuAD 1.1 + 53,775 unanswerable questions added.
Evaluation Metrics:
- EM (Exact Match): Fraction of predictions exactly matching the gold answer
- F1 Score: Token-level partial match score
```python
from datasets import load_dataset
from transformers import pipeline

def evaluate_squad(model_name="deepset/roberta-base-squad2"):
    """Evaluate SQuAD 2.0."""
    qa_pipeline = pipeline("question-answering", model=model_name)
    dataset = load_dataset("squad_v2", split="validation")

    em_scores = []
    f1_scores = []
    no_answer_correct = 0
    no_answer_total = 0

    for item in dataset.select(range(200)):
        context = item['context']
        question = item['question']
        answers = item['answers']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].lower().strip()
        has_answer = len(answers['text']) > 0

        if not has_answer:
            # Unanswerable question: count a low-confidence prediction as abstaining
            no_answer_total += 1
            abstained = result['score'] < 0.1
            if abstained:
                no_answer_correct += 1
            em_scores.append(int(abstained))
            f1_scores.append(int(abstained))
        else:
            gold_answers = [a.lower().strip() for a in answers['text']]
            em = max(int(predicted == gold) for gold in gold_answers)
            em_scores.append(em)

            best_f1 = 0
            for gold in gold_answers:
                pred_tokens = set(predicted.split())
                gold_tokens = set(gold.split())
                common = pred_tokens & gold_tokens
                if len(common) == 0:
                    f1 = 0
                else:
                    precision = len(common) / len(pred_tokens)
                    recall = len(common) / len(gold_tokens)
                    f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
            f1_scores.append(best_f1)

    print("SQuAD 2.0 Results (200 samples):")
    print(f"  EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"  F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
    if no_answer_total > 0:
        print(f"  No-Answer Accuracy: {no_answer_correct/no_answer_total*100:.1f}%")
```
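The official SQuAD scorer normalizes answers before comparison (lowercasing, removing punctuation, English articles, and extra whitespace), so "The Eiffel Tower!" and "eiffel tower" count as an exact match. A sketch of that normalization step:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation, articles (a/an/the), and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

print(normalize_answer("The Eiffel Tower!"))  # "eiffel tower"
```

Applying this to both prediction and gold answer before the EM/F1 comparison noticeably reduces spurious mismatches.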
WMT - Machine Translation
WMT (Workshop on Machine Translation, now the Conference on Machine Translation) is an annual shared task evaluating machine translation models across multiple language pairs (English-German, English-Chinese, and others).
Key Metrics:
- BLEU (Bilingual Evaluation Understudy): Automatic evaluation based on n-gram precision
- COMET: Neural metric with high correlation to human judgment
- chrF: Character-level n-gram F-score
```python
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

def compute_bleu(predictions, references):
    """Compute corpus-level BLEU with sacrebleu."""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU: {bleu.score:.2f}")
    print(f"BP: {bleu.bp:.3f}")
    print(f"Ratio: {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu.score

def evaluate_translation(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-en-de"):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    predictions = []
    for text in src_texts[:100]:
        inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
        translated = model.generate(**inputs, max_length=512)
        pred = tokenizer.decode(translated[0], skip_special_tokens=True)
        predictions.append(pred)

    return compute_bleu(predictions, tgt_texts[:100])
```
4. LLM Capability Benchmarks
MMLU (Massive Multitask Language Understanding)
MMLU, published by Dan Hendrycks and collaborators at UC Berkeley in 2020, evaluates LLM knowledge and reasoning with multiple-choice questions across 57 subjects, ranging in difficulty from high-school level to advanced professional level.
Domain Breakdown:
- STEM: Mathematics, physics, chemistry, computer science, engineering
- Humanities: History, philosophy, law, ethics
- Social Sciences: Psychology, economics, political science, sociology
- Other: Medicine, nutrition, moral scenarios, professional accounting
Each question is four-choice, with approximately 14,000 questions total.
MMLU Performance by Model:
| Model | MMLU Score | Year |
|---|---|---|
| GPT-3 (175B) | 43.9% | 2020 |
| Gopher (280B) | 60.0% | 2021 |
| GPT-4 | 86.4% | 2023 |
| Claude 3 Opus | 86.8% | 2024 |
| Gemini Ultra | 90.0% | 2024 |
| GPT-4o | 88.7% | 2024 |
| Human expert estimate | ~90% | - |
```python
from datasets import load_dataset

def evaluate_mmlu(model_fn, subjects=None, num_few_shot=5):
    """MMLU evaluation function."""
    if subjects is None:
        subjects = ['abstract_algebra', 'anatomy', 'astronomy', 'college_mathematics']

    results = {}
    for subject in subjects:
        dataset = load_dataset("lukaemon/mmlu", subject)
        test_data = dataset['test']
        dev_data = dataset['dev']

        correct = 0
        total = 0

        # Build few-shot prompt from the dev split
        few_shot_examples = ""
        for item in dev_data.select(range(num_few_shot)):
            few_shot_examples += f"Q: {item['input']}\n"
            few_shot_examples += f"(A) {item['A']} (B) {item['B']} (C) {item['C']} (D) {item['D']}\n"
            few_shot_examples += f"Answer: {item['target']}\n\n"

        for item in test_data:
            prompt = few_shot_examples
            prompt += f"Q: {item['input']}\n"
            prompt += f"(A) {item['A']} (B) {item['B']} (C) {item['C']} (D) {item['D']}\n"
            prompt += "Answer:"

            response = model_fn(prompt)
            pred = response.strip()[0] if response.strip() else 'A'
            if pred == item['target']:
                correct += 1
            total += 1

        accuracy = correct / total
        results[subject] = accuracy
        print(f"{subject}: {accuracy:.3f} ({correct}/{total})")

    overall = sum(results.values()) / len(results)
    print(f"\nOverall average: {overall:.3f}")
    return results
```
BIG-Bench (Beyond the Imitation Game Benchmark)
BIG-Bench, led by Google, consists of 204 diverse tasks designed to probe the limits of LLMs. It includes creative reasoning, common sense, mathematics, and code that language models still struggle with.
BIG-Bench Hard: 23 difficult tasks where chain-of-thought prompting dramatically improves performance.
```python
from lm_eval import evaluator

# BIG-Bench evaluation via lm-evaluation-harness
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["bigbench_causal_judgment", "bigbench_date_understanding"],
    num_fewshot=3,
    batch_size="auto"
)
print(results['results'])
```
HellaSwag - Commonsense Reasoning
HellaSwag, published in 2019, is a commonsense reasoning benchmark where the model selects the most natural sentence continuation from four choices.
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMultipleChoice

def evaluate_hellaswag(model_name="microsoft/deberta-v2-xxlarge"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)
    dataset = load_dataset("hellaswag", split="validation")

    correct = 0
    total = min(500, len(dataset))
    for item in dataset.select(range(total)):
        context = item['ctx']
        endings = item['endings']
        label = int(item['label'])

        # Pair the context with each of the four candidate endings
        encoding = tokenizer(
            [context] * 4,
            endings,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=256
        )
        with torch.no_grad():
            # AutoModelForMultipleChoice expects (batch, num_choices, seq_len)
            outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
        predicted = outputs.logits.argmax(dim=-1).item()
        if predicted == label:
            correct += 1

    accuracy = correct / total
    print(f"HellaSwag Accuracy: {accuracy:.4f}")
    return accuracy
```
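Generative LLMs are usually evaluated on HellaSwag differently: each ending is scored by the (often length-normalized) log-likelihood the model assigns to it, and the highest-scoring ending wins. A sketch of that selection step, with hypothetical per-token log-probabilities standing in for real model output:

```python
def pick_ending(per_token_logprobs):
    """Choose the ending with the highest mean per-token log-probability."""
    scores = [sum(lp) / len(lp) for lp in per_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical log-probs for 4 candidate endings of different lengths
endings = [
    [-2.0, -1.5, -3.0],        # mean ≈ -2.17
    [-0.5, -0.8],              # mean = -0.65  (most likely)
    [-1.0, -1.2, -1.1, -4.0],  # mean ≈ -1.83
    [-3.5],                    # mean = -3.5
]
print(pick_ending(endings))  # 1
```

Normalizing by length prevents shorter endings from winning simply because they accumulate fewer negative log-probabilities.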
ARC (AI2 Reasoning Challenge)
ARC, published by AI2 (Allen Institute for AI), is an elementary-to-high-school level science question benchmark.
- ARC-Easy: Relatively straightforward questions (5,197)
- ARC-Challenge: Difficult questions that even retrieval-based models get wrong (1,172)
TruthfulQA - Factuality Evaluation
TruthfulQA evaluates how accurately a model responds to questions about widely held misconceptions, myths, and biases.
```python
from datasets import load_dataset
from transformers import pipeline

def evaluate_truthfulqa(model_name="gpt2-xl"):
    """TruthfulQA MC1 (single correct answer) evaluation."""
    dataset = load_dataset("truthful_qa", "multiple_choice")
    val_data = dataset["validation"]
    generator = pipeline("text-generation", model=model_name)

    correct = 0
    total = min(100, len(val_data))
    for item in val_data.select(range(total)):
        question = item['question']
        choices = item['mc1_targets']['choices']
        labels = item['mc1_targets']['labels']
        correct_idx = labels.index(1)

        prompt = f"Q: {question}\nOptions:\n"
        for i, choice in enumerate(choices):
            letter = chr(65 + i)
            prompt += f"{letter}. {choice}\n"
        prompt += "Answer:"

        response = generator(prompt, max_new_tokens=5, do_sample=False)
        generated = response[0]['generated_text'][len(prompt):].strip()
        pred_letter = generated[0] if generated else 'A'
        pred_idx = ord(pred_letter) - 65
        if pred_idx == correct_idx:
            correct += 1

    accuracy = correct / total
    print(f"TruthfulQA MC1 Accuracy: {accuracy:.4f}")
    return accuracy
```
GSM8K - Grade School Math
GSM8K (Grade School Math 8K), published by OpenAI in 2021, consists of 8,500 grade school math word problems evaluating step-by-step mathematical reasoning.
```python
from datasets import load_dataset
import re

def extract_number(text):
    """Extract the final numeric answer from text."""
    numbers = re.findall(r'-?\d+\.?\d*', text)
    return numbers[-1] if numbers else None

def evaluate_gsm8k_chain_of_thought(model_fn, num_shot=8):
    """Evaluate GSM8K with Chain-of-Thought prompting."""
    dataset = load_dataset("gsm8k", "main")
    test_data = dataset['test']

    few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans times 3 balls = 6 balls. 5 + 6 = 11 balls. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A: They started with 23. Used 20: 23 - 20 = 3. Then bought 6: 3 + 6 = 9. The answer is 9.

"""

    correct = 0
    total = min(200, len(test_data))
    for item in test_data.select(range(total)):
        question = item['question']
        gold_answer = item['answer'].split('####')[-1].strip()

        prompt = few_shot_prompt + f"Q: {question}\nA:"
        response = model_fn(prompt, max_tokens=256)

        pred = extract_number(response)
        gold = extract_number(gold_answer)
        if pred and gold and abs(float(pred) - float(gold)) < 0.01:
            correct += 1

    accuracy = correct / total
    print(f"GSM8K Accuracy (Chain-of-Thought): {accuracy:.4f}")
    return accuracy
```
HumanEval - Code Generation
HumanEval, published by OpenAI in 2021, consists of 164 Python function signatures with docstrings where the model must write the complete function.
Metric: pass@k
pass@k is the probability that at least one of k generated samples passes all of the problem's unit tests.
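Estimating pass@k by literally drawing k samples is noisy, so the HumanEval paper instead generates n ≥ k samples, counts how many (c) pass, and uses the unbiased estimator pass@k = 1 − C(n−c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (n samples generated, c of them passing)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # 0.5
print(pass_at_k(10, 5, 2))  # 1 - C(5,2)/C(10,2) = 1 - 10/45 ≈ 0.778
```

The subtracted term is the probability that a random size-k subset of the n samples contains only failures.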
```python
from datasets import load_dataset
import subprocess
import tempfile
import os

def evaluate_humaneval(model_fn, k=1, n=10, temperature=0.8):
    """HumanEval pass@k evaluation."""
    dataset = load_dataset("openai_humaneval")
    test_data = dataset['test']

    task_results = {}
    for item in test_data.select(range(20)):
        task_id = item['task_id']
        prompt = item['prompt']
        tests = item['test']
        entry_point = item['entry_point']

        passes = 0
        for attempt in range(n):
            code = model_fn(prompt, temperature=temperature)
            full_code = prompt + code + "\n" + tests + f"\ncheck({entry_point})"

            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                f.write(full_code)
                tmp_path = f.name
            try:
                result = subprocess.run(
                    ['python', tmp_path],
                    timeout=10,
                    capture_output=True,
                    text=True
                )
                if result.returncode == 0:
                    passes += 1
            except subprocess.TimeoutExpired:
                pass
            finally:
                os.unlink(tmp_path)

        task_results[task_id] = passes / n

    pass_at_1 = sum(task_results.values()) / len(task_results)
    print(f"pass@1: {pass_at_1:.4f}")
    return pass_at_1

# HumanEval pass@1 of major models (2025, approximate)
humaneval_scores = {
    "GPT-3 (175B)": 0.0,
    "Codex (12B)": 0.288,
    "GPT-4": 0.870,
    "Claude 3.5 Sonnet": 0.900,
    "DeepSeek-Coder-33B": 0.823,
    "Llama 3.1 70B": 0.803,
}
```
MBPP - Python Programming
MBPP (Mostly Basic Python Problems), published by Google, consists of 974 entry-level Python programming problems; they are generally simpler than HumanEval's but cover a broader set of everyday tasks.
```python
from datasets import load_dataset

def explore_mbpp():
    """Explore the MBPP dataset."""
    dataset = load_dataset("mbpp")
    test_data = dataset['test']

    print("Sample MBPP problems:")
    for item in test_data.select(range(3)):
        print(f"\nTask ID: {item['task_id']}")
        print(f"Problem: {item['text']}")
        print(f"Test cases: {item['test_list'][:2]}")
        print(f"Reference code:\n{item['code']}")
        print("-" * 50)
```
5. Comprehensive LLM Evaluation
MT-Bench - Multi-Turn Dialogue Evaluation
MT-Bench, developed by the LMSYS team at UC Berkeley, is a multi-turn dialogue evaluation benchmark that uses GPT-4 as a judge, scoring responses on a 1-10 scale.
8 categories, 10 questions each:
- Writing
- Roleplay
- Reasoning
- Math
- Coding
- Extraction
- STEM
- Humanities
```python
from openai import OpenAI

def mt_bench_judge(question, answer, reference_answer=None):
    """Evaluate an MT-Bench response using GPT-4 as judge."""
    client = OpenAI()
    system_prompt = """You are a helpful assistant that evaluates AI responses.
Rate the response on a scale of 1-10 based on: accuracy, relevance, completeness, and clarity.
Output format: Score: X/10\nRationale: [brief explanation]"""
    user_prompt = f"""Question: {question}

AI Response: {answer}

{f'Reference Answer: {reference_answer}' if reference_answer else ''}
Please evaluate this response."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.0
    )
    judge_response = response.choices[0].message.content
    print(f"Evaluation:\n{judge_response}")
    return judge_response
```

Official MT-Bench usage with FastChat:

```bash
git clone https://github.com/lm-sys/FastChat
python -m fastchat.llm_judge.gen_model_answer --model-path your-model
python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4
python -m fastchat.llm_judge.show_result
```
LMSYS Chatbot Arena
Chatbot Arena has real users compare responses from two anonymous models and vote for the better one. By aggregating millions of votes into Elo-style ratings, it reflects genuine human preferences.
Top Models by Elo (March 2025, approximate):
| Rank | Model | ELO |
|---|---|---|
| 1 | GPT-4.5 | ~1370 |
| 2 | Gemini 2.0 Ultra | ~1360 |
| 3 | Claude 3.7 Sonnet | ~1350 |
| 4 | GPT-4o | ~1340 |
| 5 | Llama 3.3 70B | ~1250 |
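The rating update behind these numbers can be sketched with the standard Elo rule; the K-factor here is an illustrative assumption, and the Arena leaderboard nowadays fits ratings over all votes at once rather than updating sequentially:

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo update after a head-to-head vote (k is the update step size)."""
    # Expected win probability of the winner given the rating gap
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)  # winner gains exactly what the loser loses
    return r_winner + delta, r_loser - delta

new_a, new_b = elo_update(1350, 1250)
print(new_a, new_b)  # the favorite gains only a few points for beating an underdog
```

A 400-point rating gap corresponds to roughly 10:1 expected odds, which is why upsets move ratings much more than expected wins.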
HELM (Holistic Evaluation of Language Models)
HELM, developed by Stanford CRFM, evaluates models across 7 dimensions beyond simple accuracy:
- Accuracy
- Calibration
- Robustness
- Fairness
- Bias
- Toxicity
- Efficiency
```bash
# Run HELM evaluation
pip install crfm-helm
helm-run \
  --conf src/helm/benchmark/presentation/run_specs_lite.conf \
  --local \
  --max-eval-instances 1000 \
  --num-train-trials 1

# View results
helm-summarize --suite v1
helm-server
```
Open LLM Leaderboard (HuggingFace)
The HuggingFace Open LLM Leaderboard is a public leaderboard evaluating open-source LLMs on consistent benchmarks.
Evaluation Tasks:
- MMLU (5-shot)
- ARC Challenge (25-shot)
- HellaSwag (10-shot)
- TruthfulQA (0-shot)
- Winogrande (5-shot)
- GSM8K (5-shot)
```python
from huggingface_hub import HfApi

def fetch_leaderboard_data():
    """Fetch Open LLM Leaderboard data."""
    api = HfApi()
    dataset_info = api.dataset_info("open-llm-leaderboard/results")
    print(f"Last updated: {dataset_info.lastModified}")

    files = api.list_repo_files(
        repo_id="open-llm-leaderboard/results",
        repo_type="dataset"
    )
    for f in list(files)[:5]:
        print(f"File: {f}")
```
6. Korean Language Benchmarks
KLUE (Korean Language Understanding Evaluation)
KLUE, released in 2021 by a consortium of Korean research groups and companies, is a Korean language understanding benchmark consisting of 8 tasks.
KLUE Tasks:
| Task | Type | Data Size | Metric |
|---|---|---|---|
| TC (Topic Classification) | Document classification | 60K | Accuracy |
| STS (Semantic Textual Similarity) | Sentence similarity | 13K | Pearson |
| NLI (Natural Language Inference) | 3-way classification | 30K | Accuracy |
| NER (Named Entity Recognition) | Entity extraction | 21K | Entity F1 |
| RE (Relation Extraction) | Relation classification | 32K | micro-F1 |
| DP (Dependency Parsing) | Syntactic analysis | 23K | UAS/LAS |
| MRC (Machine Reading Comprehension) | Reading comprehension | 24K | EM/F1 |
| DST (Dialogue State Tracking) | Dialogue tracking | 10K | JGA |
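The DST metric in the table, Joint Goal Accuracy (JGA), counts a dialogue turn as correct only when the entire predicted belief state matches the gold state exactly. A minimal sketch (the slot names below are hypothetical examples):

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns where the full predicted slot-value state matches exactly."""
    exact = sum(1 for p, g in zip(pred_states, gold_states) if p == g)
    return exact / len(gold_states)

gold = [{"hotel-area": "east", "hotel-stars": "4"},
        {"hotel-area": "east", "hotel-stars": "4", "hotel-parking": "yes"}]
pred = [{"hotel-area": "east", "hotel-stars": "4"},
        {"hotel-area": "east", "hotel-stars": "5", "hotel-parking": "yes"}]
print(joint_goal_accuracy(pred, gold))  # 0.5 — one wrong slot fails the whole turn
```

This all-or-nothing scoring makes JGA much harsher than per-slot accuracy.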
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

def evaluate_klue_nli(model_name="klue/roberta-large"):
    """Evaluate KLUE-NLI."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3
    )
    dataset = load_dataset("klue", "nli")
    val_data = dataset['validation']

    correct = 0
    total = min(500, len(val_data))
    model.eval()
    for item in val_data.select(range(total)):
        premise = item['premise']
        hypothesis = item['hypothesis']
        gold_label = item['label']

        inputs = tokenizer(
            premise, hypothesis,
            return_tensors='pt',
            truncation=True,
            max_length=512
        )
        with torch.no_grad():
            outputs = model(**inputs)
        pred = outputs.logits.argmax(dim=-1).item()
        if pred == gold_label:
            correct += 1

    accuracy = correct / total
    print(f"KLUE-NLI Accuracy: {accuracy:.4f}")
    return accuracy

def evaluate_klue_mrc(model_name="klue/roberta-large"):
    """Evaluate KLUE-MRC (Machine Reading Comprehension)."""
    qa_pipeline = pipeline(
        "question-answering",
        model=model_name,
        tokenizer=model_name
    )
    dataset = load_dataset("klue", "mrc")
    val_data = dataset['validation']

    em_scores = []
    f1_scores = []
    for item in val_data.select(range(100)):
        context = item['context']
        question = item['question']
        answers = item['answers']['text']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].strip()

        em = max(int(predicted == a) for a in answers)
        em_scores.append(em)

        best_f1 = 0
        for gold in answers:
            # Character-set overlap: a rough approximation of the official
            # character-level F1 (which also counts duplicate characters)
            pred_chars = set(predicted)
            gold_chars = set(gold)
            common = pred_chars & gold_chars
            if common:
                precision = len(common) / len(pred_chars)
                recall = len(common) / len(gold_chars)
                f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
        f1_scores.append(best_f1)

    print(f"KLUE-MRC EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"KLUE-MRC F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
```
KoBEST
KoBEST (Korean Balanced Evaluation of Significant Tasks) is a Korean benchmark suite covering 5 tasks:
- BoolQ: Yes/no question answering
- COPA: Cause/effect reasoning
- WiC: Word-in-context disambiguation
- HellaSwag: Commonsense completion
- SentiNeg: Negation sentiment understanding
KMMLU (Korean MMLU)
KMMLU is a Korean counterpart to MMLU. Rather than translating MMLU, it is built from original Korean exam questions, and it includes Korea-specific subjects (Korean history, Korean law, and so on) alongside general academic topics.
```python
from datasets import load_dataset

def evaluate_kmmlu_sample():
    """Explore the KMMLU dataset."""
    dataset = load_dataset("HAERAE-HUB/KMMLU")
    test_data = dataset['test']
    print(f"Total questions: {len(test_data)}")

    subjects = set(test_data['subject'])
    print(f"Number of subjects: {len(subjects)}")
    print(f"Sample subjects: {list(subjects)[:10]}")

    item = test_data[0]
    print(f"\nSubject: {item['subject']}")
    print(f"Question: {item['question']}")
    print(f"A: {item['A']}")
    print(f"B: {item['B']}")
    print(f"C: {item['C']}")
    print(f"D: {item['D']}")
    print(f"Answer: {item['answer']}")
```
7. Multimodal Benchmarks
VQA (Visual Question Answering)
VQA is the task of answering natural language questions about images.
- VQA v2: ~1.1M (image, question, answer) triples over COCO images. Each question is paired with two similar images that yield different answers, so models cannot rely on language priors alone.
- Metric: Accuracy = min(#matching annotators / 3, 1), computed against 10 human annotators per question
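The consensus metric itself is easy to state in code. A simplified sketch (the official scorer additionally averages over all leave-one-out subsets of the 10 annotators; this version uses all 10 directly):

```python
def vqa_accuracy(prediction, annotator_answers):
    """Simplified VQA accuracy: min(#matching annotators / 3, 1)."""
    matches = sum(1 for a in annotator_answers if a == prediction)
    return min(matches / 3, 1.0)

answers = ["blue", "blue", "blue", "blue", "navy",
           "blue", "blue", "dark blue", "blue", "blue"]
print(vqa_accuracy("blue", answers))  # 1.0 (8 annotators agree)
print(vqa_accuracy("navy", answers))  # ≈0.33 (only 1 annotator)
```

An answer given by at least 3 of the 10 annotators scores full credit, which tolerates legitimate disagreement ("navy" vs "blue") without treating it as all-or-nothing.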
```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

def evaluate_vqa_blip():
    """VQA inference with BLIP."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.eval()

    questions = [
        ("What color is the car?", "test_car.jpg"),
        ("How many people are in the image?", "test_crowd.jpg"),
        ("Is it raining?", "test_outdoor.jpg")
    ]
    for question, image_path in questions:
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(image, question, return_tensors="pt")
            with torch.no_grad():
                out = model.generate(**inputs, max_length=20)
            answer = processor.decode(out[0], skip_special_tokens=True)
            print(f"Q: {question}")
            print(f"A: {answer}\n")
        except FileNotFoundError:
            print(f"Image not found: {image_path}")
```
MMBench
MMBench, published by Shanghai AI Lab, is a multimodal LLM evaluation benchmark covering 20 capability dimensions with 3,000 multiple-choice questions.
Sample Dimensions:
- Attribute Recognition
- Spatial Relationship
- Action Recognition
- OCR
- Commonsense Reasoning
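One notable detail of MMBench's protocol is CircularEval: each question is re-asked with its answer choices circularly shifted, and the model only gets credit if it answers correctly on every shifted variant, which penalizes positional guessing. Generating the variants is straightforward (`circular_variants` is my own helper, not MMBench code):

```python
def circular_variants(options: list[str]) -> list[list[str]]:
    """All circular shifts of an option list, one per original position."""
    n = len(options)
    return [options[i:] + options[:i] for i in range(n)]

for variant in circular_variants(["cat", "dog", "bird", "fish"]):
    print(variant)  # 4 rotations of the choice list
```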
MMMU (Massive Multidiscipline Multimodal Understanding)
MMMU evaluates university-level multimodal understanding across 6 core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering), 30 subjects, and 11,550 questions.
from datasets import load_dataset
import ast
import torch

def explore_mmmu():
    """Explore the MMMU dataset."""
    dataset = load_dataset("MMMU/MMMU", "Accounting")
    print(f"Accounting validation: {len(dataset['validation'])} questions")
    item = dataset['validation'][0]
    # MMMU stores the choices as a single stringified list in 'options'
    options = ast.literal_eval(item['options'])
    print(f"\nQuestion: {item['question']}")
    for letter, option in zip("ABCDE", options):
        print(f"Option {letter}: {option}")
    print(f"Answer: {item['answer']}")
    if item['image_1']:
        print("Image included")
def evaluate_mmmu_with_llava(model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
    processor = LlavaNextProcessor.from_pretrained(model_name)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )
    dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")
    correct = 0
    total = min(50, len(dataset))
    for item in dataset.select(range(total)):
        question = item['question']
        options = ast.literal_eval(item['options'])
        gold = item['answer']
        if item['image_1']:
            image = item['image_1']
            # LLaVA-NeXT expects the literal <image> placeholder token in the prompt
            prompt = f"[INST] <image>\nQuestion: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device)
        else:
            prompt = f"[INST] Question: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, return_tensors='pt').to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=10)
        response = processor.decode(output[0], skip_special_tokens=True)
        # The decoded text includes the prompt; keep only what follows [/INST]
        completion = response.split("[/INST]")[-1].strip()
        pred = completion[0].upper() if completion else 'A'
        if pred == gold:
            correct += 1
    acc = correct / total
    print(f"MMMU-Accounting Accuracy: {acc:.3f}")
    return acc
8. Using LM-Evaluation-Harness
EleutherAI's lm-evaluation-harness is the standard tool for LLM evaluation, supporting 100+ benchmarks.
Installation and Basic Usage
# Install
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
# Evaluate MMLU with GPT-2
lm_eval --model hf \
--model_args pretrained=gpt2 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/gpt2_mmlu
# Evaluate multiple tasks simultaneously
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
--tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,gsm8k \
--num_fewshot 5 \
--batch_size 4 \
--output_path results/llama3.2_3b
# Run with 4-bit quantization
lm_eval --model hf \
--model_args pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 1
Python API
import lm_eval
from lm_eval import evaluator
import json
import os
def run_comprehensive_evaluation(model_path, output_dir="./results"):
"""Run comprehensive LM-Evaluation-Harness evaluation."""
os.makedirs(output_dir, exist_ok=True)
task_groups = {
"knowledge": ["mmlu", "arc_challenge", "arc_easy"],
"reasoning": ["hellaswag", "winogrande", "piqa"],
"truthfulness": ["truthfulqa_mc1"],
"math": ["gsm8k"],
"coding": ["humaneval"],
}
all_results = {}
fewshot_map = {
"mmlu": 5, "arc_challenge": 25, "arc_easy": 25,
"hellaswag": 10, "winogrande": 5, "piqa": 0,
"truthfulqa_mc1": 0, "gsm8k": 5, "humaneval": 0
}
for group, tasks in task_groups.items():
print(f"\n=== Evaluating {group.upper()} ===")
results = evaluator.simple_evaluate(
model="hf",
model_args=f"pretrained={model_path}",
tasks=tasks,
num_fewshot=fewshot_map.get(tasks[0], 0),
batch_size="auto",
device="cuda" if __import__("torch").cuda.is_available() else "cpu",
)
all_results[group] = results['results']
for task, metrics in results['results'].items():
if 'acc,none' in metrics:
print(f" {task}: {metrics['acc,none']*100:.1f}%")
elif 'exact_match,strict-match' in metrics:
print(f" {task}: {metrics['exact_match,strict-match']*100:.1f}%")
with open(f"{output_dir}/evaluation_results.json", "w", encoding="utf-8") as f:
json.dump(all_results, f, ensure_ascii=False, indent=2)
print(f"\nResults saved to: {output_dir}/evaluation_results.json")
return all_results
def compare_models(model_paths, tasks=None):
    """Compare multiple models on the same tasks."""
    if tasks is None:
        tasks = ["mmlu", "arc_challenge", "hellaswag", "gsm8k"]
    comparison = {}
    for model_path in model_paths:
        print(f"\nEvaluating: {model_path}")
        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            num_fewshot=5,
            batch_size="auto"
        )
        model_scores = {}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)) and not metric.endswith('_stderr'):
                    model_scores[f"{task}/{metric}"] = round(value * 100, 2)
        comparison[model_path.split('/')[-1]] = model_scores
    print("\n" + "=" * 80)
    print("Model Comparison Results:")
    print("=" * 80)
    all_metrics = sorted(set().union(*[s.keys() for s in comparison.values()]))
    header = f"{'Metric':<40}" + "".join(f"{m[:15]:<18}" for m in comparison.keys())
    print(header)
    print("-" * 80)
    for metric in all_metrics:
        if 'acc,none' in metric or 'exact_match' in metric:
            row = f"{metric:<40}"
            for model_name in comparison:
                score = comparison[model_name].get(metric, "N/A")
                row += f"{score:<18}"
            print(row)
    return comparison
Adding a Custom Task
# custom_task.py
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class CustomQATask(Task):
    """Custom Q&A task for lm-evaluation-harness."""
    VERSION = 1.0
    DATASET_PATH = "your-org/your-qa-dataset"
    DATASET_NAME = None

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        return " " + doc['answer']

    def construct_requests(self, doc, ctx):
        return [Instance(
            request_type="generate_until",
            doc=doc,
            arguments=(ctx, {"until": ["\n", "Question:"]}),
            idx=0
        )]

    def process_results(self, doc, results):
        gold = doc['answer'].lower().strip()
        pred = results[0].lower().strip()
        return {"exact_match": int(gold == pred)}

    def aggregation(self):
        return {"exact_match": "mean"}

    def higher_is_better(self):
        return {"exact_match": True}
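The process_results comparison above only lowercases and strips; most QA scorers go further and normalize answers in the style of the official SQuAD evaluator (drop punctuation and English articles, collapse whitespace) before checking equality. A standalone sketch of such a normalizer:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

print(normalize_answer("The Eiffel Tower!"))  # eiffel tower
print(normalize_answer("an   apple") == normalize_answer("Apple"))  # True
```

Plugging this into process_results makes the exact-match metric far less sensitive to harmless surface variation in generated answers.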
Summary
AI benchmark datasets are the compass guiding AI research and development. Key takeaways:
Computer Vision:
- ImageNet: The gold standard for 1,000-class image classification
- COCO: The standard for object detection and segmentation
- ADE20K: Primary benchmark for semantic segmentation
NLP:
- GLUE/SuperGLUE: Comprehensive language understanding evaluation
- SQuAD: The standard benchmark for machine reading comprehension
LLM Capabilities:
- MMLU: Knowledge evaluation across 57 disciplines (broadest scope)
- HumanEval: Code generation capability evaluation
- GSM8K: Mathematical reasoning evaluation
Comprehensive Evaluation:
- HELM: Balanced evaluation across 7 dimensions
- Chatbot Arena: Human preference-based ELO ratings
- Open LLM Leaderboard: Comparing open-source LLMs
Korean Language:
- KLUE: 8-task Korean language understanding evaluation
- KMMLU: Korean knowledge evaluation
When interpreting benchmark results, always consider the possibility of data contamination, measurement bias, and the gap from real-world deployment conditions. A single benchmark cannot capture the full picture — comprehensive evaluation across multiple dimensions better reflects a model's genuine capabilities.