AI Benchmark Datasets Complete Guide: ImageNet, COCO, GLUE, MMLU, HumanEval

Table of Contents

  1. Why AI Benchmarks Matter
  2. Computer Vision Benchmarks
  3. NLP Benchmarks
  4. LLM Capability Benchmarks
  5. Comprehensive LLM Evaluation
  6. Korean Language Benchmarks
  7. Multimodal Benchmarks
  8. Using LM-Evaluation-Harness

1. Why AI Benchmarks Matter

The Need for Standardized Evaluation

How should AI models be compared? When two image classification models exist, a common standard is needed to determine which is better. Benchmark datasets provide exactly that common ground.

Without standardized benchmarks, each team could evaluate only on data favorable to them, making objective comparison impossible. Standard benchmarks like ImageNet, GLUE, and MMLU have enabled the AI research community to compete on the same test, measuring progress and setting direction.

Leaderboards and Competition

Benchmarks make AI progress visible through leaderboards.

  • ImageNet LSVRC: AlexNet reduced Top-5 error from 26% to 15.3% in 2012, launching the deep learning revolution.
  • GLUE/SuperGLUE: Documented the journey of BERT, RoBERTa, T5, and others surpassing human-level performance.
  • HumanEval: Became the arena where GPT-4, Claude, Gemini, and others compete on code generation.
  • LMSYS Chatbot Arena: Real human users blindly compare two models and vote, producing Elo ratings.

Limitations and Biases of Benchmarks

Benchmarks are powerful tools with clear limitations.

1. Dataset Contamination

LLMs are trained on vast internet text. If benchmark test data is present in training data, the model may be memorizing answers rather than genuinely solving problems. Even the GPT-4 technical report acknowledged this issue.

2. Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." Researchers who focus only on improving specific benchmark scores can raise scores without genuine capability improvements.

3. Bias and Representativeness

Many benchmarks are heavily weighted toward English and Western cultural data. Performance in Korean, Arabic, Swahili, and other languages can differ substantially from English benchmark scores.

4. Static Standards

Benchmarks do not change once created, but AI models continually improve. A difficult benchmark in 2023 can reach near-saturation by 2025.

5. Gap from Real-World Performance

High benchmark scores do not guarantee good performance in actual deployment. User experience, creativity, safety, and other hard-to-quantify factors matter just as much.


2. Computer Vision Benchmarks

ImageNet (ILSVRC)

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is the most influential benchmark in computer vision history. Originating from the ImageNet project (2009) led by Professor Fei-Fei Li at Stanford, it ran as an annual competition from 2010 to 2017.

Dataset Characteristics:

  • 1,000 classes (everyday objects: dogs, cats, cars, etc.)
  • Training data: approximately 1.2 million images
  • Validation data: 50,000 images
  • Test data: 100,000 images
  • Average of about 1,200 images per class

Key Metrics:

  • Top-1 Accuracy: Fraction of predictions where the top-1 predicted class is the correct label
  • Top-5 Accuracy: Fraction where the correct label appears in the top 5 predictions

Historical Progress:

Year | Model | Top-5 Error
2010 | NEC-UIUC | 28.2%
2012 | AlexNet | 15.3%
2014 | VGG-16 | 7.3%
2015 | ResNet-152 | 3.57%
2017 | SENet | 2.25%
2021 | CoAtNet | 0.95%
2023 | ViT-22B | ~0.6%

Human Top-5 error is estimated at about 5.1%. After ResNet surpassed human performance in 2015, research expanded to harder variants: ImageNet-A, ImageNet-R, and ImageNet-C.

# Measuring ImageNet validation accuracy with PyTorch
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader

def evaluate_imagenet(model, val_dir, batch_size=256):
    # Standard preprocessing (ImageNet validation standard)
    val_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    val_dataset = datasets.ImageFolder(val_dir, transform=val_transform)
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in val_loader:
            images = images.cuda()
            labels = labels.cuda()

            outputs = model(images)
            _, predicted = outputs.topk(5, 1, True, True)
            predicted = predicted.t()
            correct = predicted.eq(labels.view(1, -1).expand_as(predicted))

            top1_correct += correct[:1].reshape(-1).float().sum(0)
            top5_correct += correct[:5].reshape(-1).float().sum(0)
            total += labels.size(0)

    top1_acc = (top1_correct / total * 100).item()
    top5_acc = (top5_correct / total * 100).item()
    print(f"Top-1 Accuracy: {top1_acc:.2f}%")
    print(f"Top-5 Accuracy: {top5_acc:.2f}%")
    return top1_acc, top5_acc

# Example: Evaluate ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).cuda()
evaluate_imagenet(model, '/path/to/imagenet/val')

COCO (Common Objects in Context)

COCO is a large-scale object detection, segmentation, and image captioning benchmark released by Microsoft in 2014.

Dataset Characteristics:

  • 80 categories of everyday objects
  • 330,000+ images
  • 1.5+ million object instances
  • 5 captions per image (for captioning tasks)
  • Detailed instance segmentation masks

Key Metrics:

mAP (mean Average Precision) is COCO's primary metric. Various metrics exist depending on IoU (Intersection over Union) thresholds.
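Since every COCO AP variant is defined by an IoU threshold, it helps to see IoU itself. A minimal sketch for axis-aligned boxes (the `[x1, y1, x2, y2]` format here is an illustrative assumption):

```python
def box_iou(box_a, box_b):
    """Compute IoU between two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Half-overlapping boxes: intersection 25, union 175 -> 1/7
print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # → ~0.143
```

A detection counts as a true positive at AP@0.50 if this value exceeds 0.5 against some ground-truth box of the same class.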

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json

def evaluate_coco_detection(annotation_file, result_file):
    # Load COCO ground truth
    coco_gt = COCO(annotation_file)

    # Load predictions
    coco_dt = coco_gt.loadRes(result_file)

    # Bounding box evaluation
    coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()

    stats = coco_eval.stats
    print(f"\n=== COCO Detection Results ===")
    print(f"AP @ IoU=0.50:0.95 (COCO primary): {stats[0]:.3f}")
    print(f"AP @ IoU=0.50 (PASCAL VOC style): {stats[1]:.3f}")
    print(f"AP @ IoU=0.75 (strict): {stats[2]:.3f}")
    print(f"AP small (area < 32^2): {stats[3]:.3f}")
    print(f"AP medium: {stats[4]:.3f}")
    print(f"AP large: {stats[5]:.3f}")
    print(f"AR (max=1 per image): {stats[6]:.3f}")
    print(f"AR (max=10 per image): {stats[7]:.3f}")
    print(f"AR (max=100 per image): {stats[8]:.3f}")
    return stats

# Explore COCO annotations
coco = COCO('instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person', 'car', 'dog'])
img_ids = coco.getImgIds(catIds=cat_ids[:1])

img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(ann_ids)
print(f"Image: {img['file_name']}, Annotations: {len(anns)}")
for ann in anns[:3]:
    cat = coco.loadCats(ann['category_id'])[0]
    print(f"  Category: {cat['name']}, Area: {ann['area']:.0f}px^2")

State-of-the-Art COCO Performance (2025):

Model | AP (box) | AP (mask) | Parameters
YOLOv8x | 53.9 | - | 68M
DINO (Swin-L) | 63.3 | - | 218M
Co-DINO (Swin-L) | 64.1 | 54.0 | 218M
InternImage-H | 65.4 | 56.1 | 2.18B

ADE20K - Semantic Segmentation

ADE20K, built by MIT CSAIL, is a semantic segmentation benchmark covering 150 categories across 25,000 images.

Key Metrics:

  • mIoU (mean Intersection over Union): Average IoU between predicted and ground-truth masks
  • aAcc: Pixel-level overall accuracy
  • mAcc: Per-class mean accuracy

import numpy as np

def compute_miou(pred_mask, gt_mask, num_classes=150):
    """Compute mIoU."""
    iou_list = []

    for cls in range(num_classes):
        pred_cls = (pred_mask == cls)
        gt_cls = (gt_mask == cls)

        intersection = np.logical_and(pred_cls, gt_cls).sum()
        union = np.logical_or(pred_cls, gt_cls).sum()

        if union == 0:
            continue  # Skip if class not present in image

        iou = intersection / union
        iou_list.append(iou)

    return np.mean(iou_list) if iou_list else 0.0

# Evaluation with mmsegmentation (0.x API; in 1.x these are init_model/inference_model)
from mmseg.apis import inference_segmentor, init_segmentor

config_file = 'configs/segformer/segformer_mit-b5_8xb2-160k_ade20k-512x512.py'
checkpoint_file = 'segformer_mit-b5_8x2_512x512_160k_ade20k_20220617_203542-745f14da.pth'

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'test_image.jpg')

Kinetics - Video Classification

Kinetics, provided by Google DeepMind, is a video action recognition benchmark.

  • Kinetics-400: 400 action classes, ~300,000 clips
  • Kinetics-600: 600 classes, ~500,000 clips
  • Kinetics-700: 700 classes

Primary metrics: Top-1 and Top-5 accuracy (averaged per clip).
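Video models are typically scored by pooling predictions from several clips sampled per video before taking top-k. A minimal pure-Python sketch of that aggregation (shapes and sampling are simplified for illustration):

```python
import math

def video_topk_from_clips(clip_logits, label, k=5):
    """Average softmax scores across clips, then check top-k membership.
    clip_logits: per-clip logit lists, shape [num_clips][num_classes]."""
    num_classes = len(clip_logits[0])
    avg = [0.0] * num_classes
    for logits in clip_logits:
        m = max(logits)  # stabilized softmax per clip
        exp = [math.exp(x - m) for x in logits]
        s = sum(exp)
        for i in range(num_classes):
            avg[i] += exp[i] / s / len(clip_logits)
    topk = sorted(range(num_classes), key=lambda i: avg[i], reverse=True)[:k]
    return label in topk

# Two clips, three classes: class 2 dominates on average
print(video_topk_from_clips([[0.1, 0.2, 2.0], [0.0, 0.5, 1.5]], label=2, k=1))  # → True
```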

CIFAR-10/100

Small-scale image classification benchmarks widely used for rapid prototyping and paper validation.

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

def evaluate_cifar10(model, batch_size=128):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    testset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform
    )
    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)

    model.eval()
    device = next(model.parameters()).device  # run on whatever device the model uses
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"CIFAR-10 Accuracy: {accuracy:.2f}%")
    return accuracy

3. NLP Benchmarks

GLUE (General Language Understanding Evaluation)

GLUE, introduced in 2018 by researchers from NYU, the University of Washington, and DeepMind, is an NLP model evaluation benchmark consisting of 9 different language understanding tasks.

GLUE Task Composition:

Task | Description | Dataset Size | Metric
CoLA | Grammatical acceptability | 8,551 | Matthews Corr.
SST-2 | Sentiment classification | 67K | Accuracy
MRPC | Semantic equivalence | 3,700 | F1/Accuracy
STS-B | Sentence similarity score | 7K | Pearson/Spearman
QQP | Question pair similarity | 400K | F1/Accuracy
MNLI | Natural language inference (3-way) | 393K | Accuracy
QNLI | Question-answer inference | 105K | Accuracy
RTE | Textual entailment | 2,500 | Accuracy
WNLI | Winograd NLI | 634 | Accuracy

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
from sklearn.metrics import matthews_corrcoef

def evaluate_glue_cola(model_name="bert-base-uncased"):
    """Evaluate CoLA (grammatical acceptability).

    Note: this base checkpoint has a randomly initialized classification
    head; use a CoLA fine-tuned checkpoint for meaningful scores.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )

    dataset = load_dataset("glue", "cola")
    val_data = dataset["validation"]

    predictions = []
    labels = []

    model.eval()
    import torch

    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=128
        )

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        predictions.append(pred)
        labels.append(item['label'])

    mcc = matthews_corrcoef(labels, predictions)
    print(f"CoLA Matthews Correlation: {mcc:.4f}")
    return mcc

def evaluate_glue_sst2(model_name="textattack/bert-base-uncased-SST-2"):
    """Evaluate SST-2 (sentiment classification)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = load_dataset("glue", "sst2")
    val_data = dataset["validation"]

    correct = 0
    total = len(val_data)

    model.eval()
    import torch

    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            truncation=True,
            max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()
        if pred == item['label']:
            correct += 1

    acc = correct / total
    print(f"SST-2 Accuracy: {acc:.4f}")
    return acc

SuperGLUE

When GLUE approached saturation (near-human performance), SuperGLUE was introduced in 2019 with harder tasks.

SuperGLUE Tasks:

  • BoolQ: Yes/no question answering (9,427)
  • CB: Commitment/entailment (250, 3-way)
  • COPA: Cause/effect reasoning (1,000)
  • MultiRC: Multi-sentence reading comprehension (9,693)
  • ReCoRD: Cloze-style reading comprehension (120K)
  • RTE: Textual entailment recognition (5,749)
  • WiC: Word-in-context disambiguation (9,600)
  • WSC: Winograd Schema Challenge (554)

Human baseline: 89.8 / GPT-4-class models: 90+ (surpassing humans)
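As an example of working with a SuperGLUE task directly, a BoolQ-style record can be turned into a zero-shot prompt in a few lines. A hedged sketch (the record and prompt template below are illustrative; the field names follow the BoolQ schema of passage, question, and a 1 = yes label):

```python
def format_boolq_prompt(item):
    """Build a yes/no prompt from a BoolQ-style record
    (fields: 'passage', 'question', 'label' with 1 = yes)."""
    return (
        f"{item['passage']}\n"
        f"Question: {item['question']}?\n"
        "Answer (yes or no):"
    )

# Hand-made record for illustration
example = {
    "passage": "The Eiffel Tower is in Paris.",
    "question": "is the eiffel tower in france",
    "label": 1,
}
prompt = format_boolq_prompt(example)
print(prompt)
```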

SQuAD 1.1 & 2.0

SQuAD (Stanford Question Answering Dataset) is a machine reading comprehension benchmark where answers are extracted from Wikipedia passages.

  • SQuAD 1.1: 536 Wikipedia articles, 107,785 question-answer pairs. All answers exist within the passage.
  • SQuAD 2.0: SQuAD 1.1 + 53,775 unanswerable questions added.

Evaluation Metrics:

  • EM (Exact Match): Fraction of predictions exactly matching the gold answer
  • F1 Score: Token-level partial match score

from datasets import load_dataset
from transformers import pipeline

def evaluate_squad(model_name="deepset/roberta-base-squad2"):
    """Evaluate SQuAD 2.0."""
    qa_pipeline = pipeline("question-answering", model=model_name)
    dataset = load_dataset("squad_v2", split="validation")

    em_scores = []
    f1_scores = []
    no_answer_correct = 0
    no_answer_total = 0

    for item in dataset.select(range(200)):
        context = item['context']
        question = item['question']
        answers = item['answers']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].lower().strip()

        has_answer = len(answers['text']) > 0

        if not has_answer:
            no_answer_total += 1
            abstained = result['score'] < 0.1  # heuristic "no answer" threshold
            if abstained:
                no_answer_correct += 1
            # SQuAD 2.0 credits a correct abstention as a full match
            em_scores.append(int(abstained))
            f1_scores.append(int(abstained))
        else:
            gold_answers = [a.lower().strip() for a in answers['text']]
            em = max(int(predicted == gold) for gold in gold_answers)
            em_scores.append(em)

            best_f1 = 0
            for gold in gold_answers:
                pred_tokens = set(predicted.split())
                gold_tokens = set(gold.split())
                common = pred_tokens & gold_tokens
                if len(common) == 0:
                    f1 = 0
                else:
                    precision = len(common) / len(pred_tokens)
                    recall = len(common) / len(gold_tokens)
                    f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
            f1_scores.append(best_f1)

    print(f"SQuAD 2.0 Results (200 samples):")
    print(f"  EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"  F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
    if no_answer_total > 0:
        print(f"  No-Answer Accuracy: {no_answer_correct/no_answer_total*100:.1f}%")

WMT - Machine Translation

WMT (Workshop on Machine Translation) is an annual competition evaluating machine translation models across multiple language pairs, such as English-German and English-Chinese.

Key Metrics:

  • BLEU (Bilingual Evaluation Understudy): Automatic evaluation based on n-gram precision
  • COMET: Neural metric with high correlation to human judgment
  • chrF: Character-level n-gram F-score
import sacrebleu

def compute_bleu(predictions, references):
    """Compute BLEU score."""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU: {bleu.score:.2f}")
    print(f"BP: {bleu.bp:.3f}")
    print(f"Ratio: {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu.score

from transformers import MarianMTModel, MarianTokenizer

def evaluate_translation(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-en-de"):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    predictions = []
    for text in src_texts[:100]:
        inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
        translated = model.generate(**inputs, max_length=512)
        pred = tokenizer.decode(translated[0], skip_special_tokens=True)
        predictions.append(pred)

    bleu_score = compute_bleu(predictions, tgt_texts[:100])
    return bleu_score

4. LLM Capability Benchmarks

MMLU (Massive Multitask Language Understanding)

MMLU, published by Dan Hendrycks and collaborators at UC Berkeley in 2020, evaluates LLM knowledge and reasoning with multiple-choice questions spanning 57 subjects, ranging from elementary to professional difficulty.

Domain Breakdown:

  • STEM: Mathematics, physics, chemistry, computer science, engineering
  • Humanities: History, philosophy, law, ethics
  • Social Sciences: Psychology, economics, political science, sociology
  • Other: Medicine, nutrition, moral scenarios, professional accounting

Each question is four-choice, with approximately 14,000 questions total.

MMLU Performance by Model:

Model | MMLU Score | Year
GPT-3 (175B) | 43.9% | 2020
Gopher (280B) | 60.0% | 2021
GPT-4 | 86.4% | 2023
Claude 3 Opus | 86.8% | 2024
Gemini Ultra | 90.0% | 2024
GPT-4o | 88.7% | 2024
Human expert estimate | ~90% | -

from datasets import load_dataset

def evaluate_mmlu(model_fn, subjects=None, num_few_shot=5):
    """MMLU evaluation function."""
    if subjects is None:
        subjects = ['abstract_algebra', 'anatomy', 'astronomy', 'college_mathematics']

    results = {}

    for subject in subjects:
        dataset = load_dataset("lukaemon/mmlu", subject)
        test_data = dataset['test']
        dev_data = dataset['dev']

        correct = 0
        total = 0

        # Build few-shot prompt
        few_shot_examples = ""
        for i, item in enumerate(dev_data.select(range(num_few_shot))):
            few_shot_examples += f"Q: {item['input']}\n"
            few_shot_examples += f"(A) {item['A']}  (B) {item['B']}  (C) {item['C']}  (D) {item['D']}\n"
            few_shot_examples += f"Answer: {item['target']}\n\n"

        for item in test_data:
            prompt = few_shot_examples
            prompt += f"Q: {item['input']}\n"
            prompt += f"(A) {item['A']}  (B) {item['B']}  (C) {item['C']}  (D) {item['D']}\n"
            prompt += "Answer:"

            response = model_fn(prompt)
            pred = response.strip()[0] if response.strip() else 'A'

            if pred == item['target']:
                correct += 1
            total += 1

        accuracy = correct / total
        results[subject] = accuracy
        print(f"{subject}: {accuracy:.3f} ({correct}/{total})")

    overall = sum(results.values()) / len(results)
    print(f"\nOverall average: {overall:.3f}")
    return results

BIG-Bench (Beyond the Imitation Game Benchmark)

BIG-Bench, led by Google, consists of 204 diverse tasks designed to probe the limits of LLMs, spanning creative reasoning, common sense, mathematics, and coding tasks that language models still struggle with.

BIG-Bench Hard: 23 difficult tasks where chain-of-thought prompting dramatically improves performance.

from lm_eval import evaluator

# BIG-Bench evaluation via lm-evaluation-harness
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["bigbench_causal_judgment", "bigbench_date_understanding"],
    num_fewshot=3,
    batch_size="auto"
)
print(results['results'])

HellaSwag - Commonsense Reasoning

HellaSwag, published in 2019, is a commonsense reasoning benchmark where the model selects the most natural sentence continuation from four choices.

from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

def evaluate_hellaswag(model_name="microsoft/deberta-v2-xxlarge"):
    # Note: this base checkpoint has no trained multiple-choice head;
    # substitute a HellaSwag fine-tuned checkpoint for meaningful accuracy.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)

    dataset = load_dataset("hellaswag", split="validation")

    correct = 0
    total = min(500, len(dataset))

    for item in dataset.select(range(total)):
        context = item['ctx']
        endings = item['endings']
        label = int(item['label'])

        # Pair the shared context with each candidate ending (context passed once)
        encoding = tokenizer(
            [context] * 4,
            endings,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=256
        )

        with torch.no_grad():
            outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
            logits = outputs.logits
            predicted = logits.argmax(dim=-1).item()

        if predicted == label:
            correct += 1

    accuracy = correct / total
    print(f"HellaSwag Accuracy: {accuracy:.4f}")
    return accuracy

ARC (AI2 Reasoning Challenge)

ARC, published by AI2 (Allen Institute for AI), is an elementary-to-high-school level science question benchmark.

  • ARC-Easy: Relatively straightforward questions (5,197)
  • ARC-Challenge: Difficult questions that even retrieval-based models get wrong (1,172)
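ARC records follow the usual multiple-choice shape, so a prompt can be built the same way as for MMLU. A hedged sketch over an ARC-style record (the dict below is hand-made for illustration; on the HuggingFace Hub the dataset is commonly published as `allenai/ai2_arc`):

```python
def format_arc_prompt(item):
    """Build a multiple-choice prompt from an ARC-style record
    (fields: 'question', 'choices' {'text', 'label'}, 'answerKey')."""
    lines = [f"Q: {item['question']}"]
    for text, label in zip(item["choices"]["text"], item["choices"]["label"]):
        lines.append(f"({label}) {text}")
    lines.append("Answer:")
    return "\n".join(lines)

# Hand-made record for illustration
example = {
    "question": "Which gas do plants absorb during photosynthesis?",
    "choices": {"text": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
                "label": ["A", "B", "C", "D"]},
    "answerKey": "B",
}
print(format_arc_prompt(example))
```

The model's first generated letter is then compared against `answerKey`, exactly as in the MMLU loop above.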

TruthfulQA - Factuality Evaluation

TruthfulQA evaluates how accurately a model responds to questions about widely held misconceptions, myths, and biases.

from datasets import load_dataset
from transformers import pipeline

def evaluate_truthfulqa(model_name="gpt2-xl"):
    """TruthfulQA MC1 (single correct answer) evaluation."""
    dataset = load_dataset("truthful_qa", "multiple_choice")
    val_data = dataset["validation"]

    generator = pipeline("text-generation", model=model_name)
    correct = 0
    total = min(100, len(val_data))

    for item in val_data.select(range(total)):
        question = item['question']
        choices = item['mc1_targets']['choices']
        labels = item['mc1_targets']['labels']
        correct_idx = labels.index(1)

        prompt = f"Q: {question}\nOptions:\n"
        for i, choice in enumerate(choices):
            letter = chr(65 + i)
            prompt += f"{letter}. {choice}\n"
        prompt += "Answer:"

        response = generator(prompt, max_new_tokens=5, do_sample=False)
        generated = response[0]['generated_text'][len(prompt):].strip()
        pred_letter = generated[0].upper() if generated else 'A'
        pred_idx = ord(pred_letter) - 65
        if not 0 <= pred_idx < len(choices):
            pred_idx = 0  # fall back when the model emits a non-letter

        if pred_idx == correct_idx:
            correct += 1

    accuracy = correct / total
    print(f"TruthfulQA MC1 Accuracy: {accuracy:.4f}")
    return accuracy

GSM8K - Grade School Math

GSM8K (Grade School Math 8K), published by OpenAI in 2021, consists of 8,500 grade school math word problems evaluating step-by-step mathematical reasoning.

from datasets import load_dataset
import re

def extract_number(text):
    """Extract the final numeric answer from text."""
    numbers = re.findall(r'-?\d+\.?\d*', text)
    return numbers[-1] if numbers else None

def evaluate_gsm8k_chain_of_thought(model_fn):
    """Evaluate GSM8K with Chain-of-Thought prompting (2-shot here for brevity)."""
    dataset = load_dataset("gsm8k", "main")
    test_data = dataset['test']

    few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans times 3 balls = 6 balls. 5 + 6 = 11 balls. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A: They started with 23. Used 20: 23 - 20 = 3. Then bought 6: 3 + 6 = 9. The answer is 9.

"""

    correct = 0
    total = min(200, len(test_data))

    for item in test_data.select(range(total)):
        question = item['question']
        gold_answer = item['answer'].split('####')[-1].strip()

        prompt = few_shot_prompt + f"Q: {question}\nA:"
        response = model_fn(prompt, max_tokens=256)

        pred = extract_number(response)
        gold = extract_number(gold_answer)

        if pred and gold and abs(float(pred) - float(gold)) < 0.01:
            correct += 1

    accuracy = correct / total
    print(f"GSM8K Accuracy (Chain-of-Thought): {accuracy:.4f}")
    return accuracy

HumanEval - Code Generation

HumanEval, published by OpenAI in 2021, consists of 164 Python function signatures with docstrings where the model must write the complete function.

Metric: pass@k

The probability that at least one of k generated samples passes all unit tests.
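With n samples per task of which c pass, the unbiased estimator from the HumanEval paper is pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total samples, c = samples that passed, k = attempt budget."""
    if n - c < k:
        return 1.0  # any k draws must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # → 0.3  (reduces to c/n when k=1)
print(pass_at_k(10, 3, 5))
```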

from datasets import load_dataset
import subprocess
import tempfile
import os

def evaluate_humaneval(model_fn, k=1, n=10, temperature=0.8):
    """HumanEval pass@k evaluation.

    Warning: executes model-generated code directly; run it inside a
    sandbox or container in practice.
    """
    dataset = load_dataset("openai_humaneval")
    test_data = dataset['test']

    task_results = {}

    for item in test_data.select(range(20)):
        task_id = item['task_id']
        prompt = item['prompt']
        tests = item['test']
        entry_point = item['entry_point']

        passes = 0

        for attempt in range(n):
            code = model_fn(prompt, temperature=temperature)

            full_code = prompt + code + "\n" + tests + f"\ncheck({entry_point})"

            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                f.write(full_code)
                tmp_path = f.name

            try:
                result = subprocess.run(
                    ['python', tmp_path],
                    timeout=10,
                    capture_output=True,
                    text=True
                )
                if result.returncode == 0:
                    passes += 1
            except subprocess.TimeoutExpired:
                pass
            finally:
                os.unlink(tmp_path)

        task_results[task_id] = passes / n

    pass_at_1 = sum(task_results.values()) / len(task_results)
    print(f"pass@1: {pass_at_1:.4f}")
    return pass_at_1

# HumanEval performance of major models (2025)
humaneval_scores = {
    "GPT-3 (175B)": 0.0,
    "Codex (12B)": 0.288,
    "GPT-4": 0.870,
    "Claude 3.5 Sonnet": 0.900,
    "DeepSeek-Coder-33B": 0.823,
    "Llama 3.1 70B": 0.803,
}

MBPP - Python Programming

MBPP (Mostly Basic Python Problems), published by Google, consists of 974 Python programming problems covering a wider range of difficulty than HumanEval.

from datasets import load_dataset

def explore_mbpp():
    """Explore the MBPP dataset."""
    dataset = load_dataset("mbpp")
    test_data = dataset['test']

    print("Sample MBPP problems:")
    for item in test_data.select(range(3)):
        print(f"\nTask ID: {item['task_id']}")
        print(f"Problem: {item['text']}")
        print(f"Test cases: {item['test_list'][:2]}")
        print(f"Reference code:\n{item['code']}")
        print("-" * 50)

5. Comprehensive LLM Evaluation

MT-Bench - Multi-Turn Dialogue Evaluation

MT-Bench, developed by the LMSYS team at UC Berkeley, is a multi-turn dialogue evaluation benchmark that uses GPT-4 as a judge, scoring responses on a 1-10 scale.

8 categories, 10 questions each:

  • Writing
  • Roleplay
  • Reasoning
  • Math
  • Coding
  • Extraction
  • STEM
  • Humanities

from openai import OpenAI

def mt_bench_judge(question, answer, reference_answer=None):
    """Evaluate MT-Bench response using GPT-4."""
    client = OpenAI()

    system_prompt = """You are a helpful assistant that evaluates AI responses.
Rate the response on a scale of 1-10 based on: accuracy, relevance, completeness, and clarity.
Output format: Score: X/10\nRationale: [brief explanation]"""

    user_prompt = f"""Question: {question}

AI Response: {answer}

{f'Reference Answer: {reference_answer}' if reference_answer else ''}

Please evaluate this response."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.0
    )

    judge_response = response.choices[0].message.content
    print(f"Evaluation:\n{judge_response}")
    return judge_response

# Official MT-Bench usage with FastChat:
# git clone https://github.com/lm-sys/FastChat
# python -m fastchat.llm_judge.gen_model_answer --model-path your-model
# python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4
# python -m fastchat.llm_judge.show_result

LMSYS Chatbot Arena

Chatbot Arena has real users compare responses from two anonymous models and vote for the better one. Ratings are computed with the Elo system, so the leaderboard reflects genuine human preferences.
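The Elo mechanics behind these ratings fit in a few lines. A minimal sketch of one pairwise vote (K = 32 is an illustrative constant, not Arena's actual configuration):

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Update two Elo ratings after one comparison (a_won: 1 if A preferred, 0 if B)."""
    # Expected score of A from the rating gap (logistic curve, base 10, scale 400)
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (a_won - expected_a)
    return rating_a + delta, rating_b - delta

# An upset: the lower-rated model wins and gains rating; the update is zero-sum
a, b = elo_update(1200, 1300, a_won=1)
print(round(a), round(b))  # → 1220 1280
```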

Top Models by Elo (March 2025, approximate):

Rank | Model | Elo
1 | GPT-4.5 | ~1370
2 | Gemini 2.0 Ultra | ~1360
3 | Claude 3.7 Sonnet | ~1350
4 | GPT-4o | ~1340
5 | Llama 3.3 70B | ~1250

HELM (Holistic Evaluation of Language Models)

HELM, developed by Stanford CRFM, evaluates models across 7 dimensions beyond simple accuracy:

  1. Accuracy
  2. Calibration
  3. Robustness
  4. Fairness
  5. Bias
  6. Toxicity
  7. Efficiency

# Run HELM evaluation
pip install crfm-helm

helm-run \
    --conf src/helm/benchmark/presentation/run_specs_lite.conf \
    --local \
    --max-eval-instances 1000 \
    --num-train-trials 1

# View results
helm-summarize --suite v1
helm-server

Open LLM Leaderboard (HuggingFace)

The HuggingFace Open LLM Leaderboard is a public leaderboard evaluating open-source LLMs on consistent benchmarks.

Evaluation Tasks:

  • MMLU (5-shot)
  • ARC Challenge (25-shot)
  • HellaSwag (10-shot)
  • TruthfulQA (0-shot)
  • Winogrande (5-shot)
  • GSM8K (5-shot)

from huggingface_hub import HfApi

def fetch_leaderboard_data():
    """Fetch Open LLM Leaderboard data."""
    api = HfApi()

    dataset_info = api.dataset_info("open-llm-leaderboard/results")
    print(f"Last updated: {dataset_info.lastModified}")

    files = api.list_repo_files(
        repo_id="open-llm-leaderboard/results",
        repo_type="dataset"
    )

    for f in list(files)[:5]:
        print(f"File: {f}")

6. Korean Language Benchmarks

KLUE (Korean Language Understanding Evaluation)

KLUE, released in 2021 by a consortium of Korean companies and research groups, is a Korean language understanding benchmark consisting of 8 tasks.

KLUE Tasks:

Task | Type | Data Size | Metric
TC (Topic Classification) | Document classification | 60K | Accuracy
STS (Semantic Textual Similarity) | Sentence similarity | 13K | Pearson
NLI (Natural Language Inference) | 3-way classification | 30K | Accuracy
NER (Named Entity Recognition) | Entity extraction | 21K | Entity F1
RE (Relation Extraction) | Relation classification | 32K | micro-F1
DP (Dependency Parsing) | Syntactic analysis | 23K | UAS/LAS
MRC (Machine Reading Comprehension) | Reading comprehension | 24K | EM/F1
DST (Dialogue State Tracking) | Dialogue tracking | 10K | JGA

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def evaluate_klue_nli(model_name="klue/roberta-large"):
    """Evaluate KLUE-NLI.

    Note: the base checkpoint has a randomly initialized classification
    head; use an NLI fine-tuned checkpoint for meaningful accuracy.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3
    )

    dataset = load_dataset("klue", "nli")
    val_data = dataset['validation']

    correct = 0
    total = min(500, len(val_data))

    model.eval()
    for item in val_data.select(range(total)):
        premise = item['premise']
        hypothesis = item['hypothesis']
        gold_label = item['label']

        inputs = tokenizer(
            premise, hypothesis,
            return_tensors='pt',
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        if pred == gold_label:
            correct += 1

    accuracy = correct / total
    print(f"KLUE-NLI Accuracy: {accuracy:.4f}")
    return accuracy

def evaluate_klue_mrc(model_name="klue/roberta-large"):
    """Evaluate KLUE-MRC (Machine Reading Comprehension)."""
    from transformers import pipeline

    qa_pipeline = pipeline(
        "question-answering",
        model=model_name,
        tokenizer=model_name
    )

    dataset = load_dataset("klue", "mrc")
    val_data = dataset['validation']

    em_scores = []
    f1_scores = []

    for item in val_data.select(range(100)):
        context = item['context']
        question = item['question']
        answers = item['answers']['text']
        if not answers:
            continue  # skip unanswerable questions in this simple sketch

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].strip()

        em = max(int(predicted == a) for a in answers)
        em_scores.append(em)

        # Character-set overlap is only a rough proxy for the official
        # token-level F1 metric.
        best_f1 = 0
        for gold in answers:
            pred_chars = set(predicted)
            gold_chars = set(gold)
            common = pred_chars & gold_chars
            if common:
                precision = len(common) / len(pred_chars)
                recall = len(common) / len(gold_chars)
                f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
        f1_scores.append(best_f1)

    print(f"KLUE-MRC EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"KLUE-MRC F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
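The character-set F1 in the code above is only a rough proxy; the standard SQuAD-style metric compares token multisets. A sketch of that version:

```python
from collections import Counter

def token_f1(prediction, gold):
    """SQuAD-style token-level F1 using multiset token overlap."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat", "the cat"), 3))  # 0.8
```

For Korean, tokenization choices (whitespace vs. morpheme-level) noticeably affect this score, which is one reason KLUE specifies its own evaluation script.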

KoBEST

KoBEST (Korean Balanced Evaluation of Significant Tasks) is a Korean-language benchmark that includes 5 tasks:

  • BoolQ: Yes/no question answering
  • COPA: Cause/effect reasoning
  • WiC: Word-in-context disambiguation
  • HellaSwag: Commonsense completion
  • SentiNeg: Negation sentiment understanding
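COPA items are usually scored zero-shot by asking which alternative forms the more probable continuation of the premise. A minimal prompt-building sketch (English placeholders for readability; KoBEST's actual text is Korean and the field names here are assumptions):

```python
def copa_prompts(premise, alternative_1, alternative_2, question_type):
    """Build the two candidate sentences for a COPA item.

    question_type is "cause" or "effect"; a model would score both
    candidates and pick the more probable continuation.
    """
    connector = "because" if question_type == "cause" else "so"
    prem = premise.rstrip(".")

    def join(alt):
        # lowercase the alternative's first character to splice it in
        return f"{prem} {connector} {alt[0].lower() + alt[1:]}"

    return [join(alternative_1), join(alternative_2)]

cands = copa_prompts("The man broke his toe.",
                     "He got a hole in his sock.",
                     "He dropped a hammer on his foot.", "cause")
print(cands[1])
# The man broke his toe because he dropped a hammer on his foot.
```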

KMMLU (Korean MMLU)

KMMLU is the Korean counterpart to MMLU, covering 45 subjects. Unlike translation-based benchmarks, its questions are sourced from original Korean exams and include Korea-specific subjects (Korean history, Korean law) alongside general academic topics.

from datasets import load_dataset

def evaluate_kmmlu_sample():
    """Explore the KMMLU dataset."""
    # Depending on the datasets library / Hub version, KMMLU may need a
    # per-subject configuration, e.g. load_dataset("HAERAE-HUB/KMMLU", "Accounting")
    dataset = load_dataset("HAERAE-HUB/KMMLU")
    test_data = dataset['test']

    print(f"Total questions: {len(test_data)}")

    subjects = set(test_data['subject'])  # column may be named 'Category' in some versions
    print(f"Number of subjects: {len(subjects)}")
    print(f"Sample subjects: {list(subjects)[:10]}")

    item = test_data[0]
    print(f"\nSubject: {item['subject']}")
    print(f"Question: {item['question']}")
    print(f"A: {item['A']}")
    print(f"B: {item['B']}")
    print(f"C: {item['C']}")
    print(f"D: {item['D']}")
    print(f"Answer: {item['answer']}")
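KMMLU is typically evaluated with the same few-shot multiple-choice prompt format as MMLU. A minimal prompt builder under that convention (assuming items shaped like the code above, with the gold answer already mapped to a letter):

```python
def build_mcqa_prompt(item, fewshot=()):
    """MMLU/KMMLU-style prompt: question, lettered choices, 'Answer:' cue.

    Each item is a dict with 'question' and 'A'..'D'; few-shot examples
    additionally carry 'answer' as the gold letter.
    """
    def block(ex, with_answer):
        lines = [ex['question']]
        lines += [f"{c}. {ex[c]}" for c in "ABCD"]
        lines.append(f"Answer: {ex['answer']}" if with_answer else "Answer:")
        return "\n".join(lines)

    parts = [block(ex, True) for ex in fewshot] + [block(item, False)]
    return "\n\n".join(parts)

shot = {'question': 'What is 2+2?', 'A': '2', 'B': '3', 'C': '4', 'D': '5', 'answer': 'C'}
q = {'question': 'What is 1+1?', 'A': '1', 'B': '2', 'C': '3', 'D': '4'}
print(build_mcqa_prompt(q, fewshot=[shot]))
```

The model's next-token probabilities over "A"–"D" after the final "Answer:" are then compared to pick the prediction.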

7. Multimodal Benchmarks

VQA (Visual Question Answering)

VQA is the task of answering natural language questions about images.

  • VQA v2: ~1.1M (image, question, answer) triples. Two complementary questions per image.
  • Metric: Accuracy = min((# of the 10 annotators who gave the predicted answer) / 3, 1) — matching at least 3 of the 10 human answers earns full credit
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image

def evaluate_vqa_blip():
    """VQA evaluation with BLIP."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.eval()

    questions = [
        ("What color is the car?", "test_car.jpg"),
        ("How many people are in the image?", "test_crowd.jpg"),
        ("Is it raining?", "test_outdoor.jpg")
    ]

    for question, image_path in questions:
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(image, question, return_tensors="pt")

            with torch.no_grad():
                out = model.generate(**inputs, max_length=20)

            answer = processor.decode(out[0], skip_special_tokens=True)
            print(f"Q: {question}")
            print(f"A: {answer}\n")
        except FileNotFoundError:
            print(f"Image not found: {image_path}")
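The accuracy formula above translates directly into code. The official evaluator additionally averages this over every leave-one-annotator-out subset of the 10 answers; this sketch uses the common simplified form:

```python
def vqa_accuracy(prediction, annotator_answers):
    """Simplified VQA accuracy: min(#annotators agreeing / 3, 1)."""
    matches = sum(a == prediction for a in annotator_answers)
    return min(matches / 3.0, 1.0)

answers = ["red"] * 4 + ["dark red"] * 6  # 10 human answers for one question
print(vqa_accuracy("red", answers))       # 4 matches -> 1.0
print(vqa_accuracy("dark red", answers))  # 6 matches -> 1.0
print(vqa_accuracy("blue", answers))      # 0 matches -> 0.0
```

The soft scoring rewards answers that only some annotators gave (e.g. "red" vs. "dark red") with partial or full credit instead of treating them as wrong.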

MMBench

MMBench, published by Shanghai AI Lab, is a multimodal LLM evaluation benchmark covering 20 fine-grained capability dimensions with roughly 3,000 multiple-choice questions.

Sample Dimensions:

  • Attribute Recognition
  • Spatial Relationship
  • Action Recognition
  • OCR
  • Commonsense Reasoning
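MMBench's distinctive protocol is CircularEval: a question counts as correct only if the model picks the right option under every circular shift of the answer choices, which filters out position-bias guessing. A sketch with a stand-in `predict_fn`:

```python
def circular_eval(predict_fn, question, options, answer_idx):
    """Return True only if predict_fn answers correctly under every rotation.

    predict_fn(question, options) -> index of the chosen option; here it
    is a placeholder for a real VLM call.
    """
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        gold = (answer_idx - shift) % n  # where the gold option landed
        if predict_fn(question, rotated) != gold:
            return False
    return True

# A toy predictor that genuinely finds the answer string:
predict = lambda q, opts: opts.index("cat")
print(circular_eval(predict, "Which animal?", ["cat", "dog", "cow", "hen"], 0))  # True
```

A model that always picks option A would pass a single-pass evaluation 25% of the time but fails CircularEval, since the gold option moves to a different slot in each rotation.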

MMMU (Massive Multidiscipline Multimodal Understanding)

MMMU evaluates college-level multimodal understanding across 6 core disciplines (Art &amp; Design, Business, Science, Health &amp; Medicine, Humanities &amp; Social Science, Tech &amp; Engineering), 30 subjects, and about 11.5K questions.

from datasets import load_dataset
import ast
import torch

def explore_mmmu():
    """Explore the MMMU dataset."""
    dataset = load_dataset("MMMU/MMMU", "Accounting")
    print(f"Accounting validation: {len(dataset['validation'])} questions")

    item = dataset['validation'][0]
    print(f"\nQuestion: {item['question']}")
    # MMMU stores the choices as a stringified Python list in 'options'
    options = ast.literal_eval(item['options'])
    for letter, option in zip("ABCDE", options):
        print(f"Option {letter}: {option}")
    print(f"Answer: {item['answer']}")

    if item['image_1'] is not None:
        print("Image included")

def evaluate_mmmu_with_llava(model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    processor = LlavaNextProcessor.from_pretrained(model_name)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")
    correct = 0
    total = min(50, len(dataset))

    for item in dataset.select(range(total)):
        question = item['question']
        options = ast.literal_eval(item['options'])
        option_text = "\n".join(f"{l}. {o}" for l, o in zip("ABCDE", options))
        gold = item['answer']

        if item['image_1'] is not None:
            # LLaVA-NeXT Mistral prompt format; <image> marks the image position
            prompt = f"[INST] <image>\nQuestion: {question}\n{option_text}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, images=item['image_1'],
                               return_tensors='pt').to(model.device)
        else:
            prompt = f"[INST] Question: {question}\n{option_text}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, return_tensors='pt').to(model.device)

        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=10)

        # decode only the newly generated tokens, not the echoed prompt
        new_tokens = output[0][inputs['input_ids'].shape[1]:]
        response = processor.decode(new_tokens, skip_special_tokens=True).strip()
        pred = response[:1].upper() if response else 'A'

        if pred == gold:
            correct += 1

    acc = correct / total
    print(f"MMMU-Accounting Accuracy: {acc:.3f}")
    return acc

8. Using LM-Evaluation-Harness

EleutherAI's lm-evaluation-harness is the standard tool for LLM evaluation, supporting 100+ benchmarks.

Installation and Basic Usage

# Install
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Evaluate MMLU with GPT-2
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path results/gpt2_mmlu

# Evaluate multiple tasks simultaneously
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,gsm8k \
    --num_fewshot 5 \
    --batch_size 4 \
    --output_path results/llama3.2_3b

# Run with 4-bit quantization
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 1

Python API

from lm_eval import evaluator
import torch
import json
import os

def run_comprehensive_evaluation(model_path, output_dir="./results"):
    """Run comprehensive LM-Evaluation-Harness evaluation."""
    os.makedirs(output_dir, exist_ok=True)

    task_groups = {
        "knowledge": ["mmlu", "arc_challenge", "arc_easy"],
        "reasoning": ["hellaswag", "winogrande", "piqa"],
        "truthfulness": ["truthfulqa_mc1"],
        "math": ["gsm8k"],
        "coding": ["humaneval"],
    }

    all_results = {}

    fewshot_map = {
        "mmlu": 5, "arc_challenge": 25, "arc_easy": 25,
        "hellaswag": 10, "winogrande": 5, "piqa": 0,
        "truthfulqa_mc1": 0, "gsm8k": 5, "humaneval": 0
    }

    for group, tasks in task_groups.items():
        print(f"\n=== Evaluating {group.upper()} ===")

        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            # simple_evaluate applies one few-shot setting per call, so each
            # group uses its first task's setting in this sketch
            num_fewshot=fewshot_map.get(tasks[0], 0),
            batch_size="auto",
            device="cuda" if torch.cuda.is_available() else "cpu",
        )

        all_results[group] = results['results']

        for task, metrics in results['results'].items():
            if 'acc,none' in metrics:
                print(f"  {task}: {metrics['acc,none']*100:.1f}%")
            elif 'exact_match,strict-match' in metrics:
                print(f"  {task}: {metrics['exact_match,strict-match']*100:.1f}%")

    with open(f"{output_dir}/evaluation_results.json", "w", encoding="utf-8") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)

    print(f"\nResults saved to: {output_dir}/evaluation_results.json")
    return all_results


def compare_models(model_paths, tasks=None):
    """Compare multiple models on the same tasks."""
    if tasks is None:
        tasks = ["mmlu", "arc_challenge", "hellaswag", "gsm8k"]

    comparison = {}

    for model_path in model_paths:
        print(f"\nEvaluating: {model_path}")
        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            num_fewshot=5,
            batch_size="auto"
        )

        model_scores = {}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)) and not metric.endswith('_stderr'):
                    model_scores[f"{task}/{metric}"] = round(value * 100, 2)

        comparison[model_path.split('/')[-1]] = model_scores

    print("\n" + "="*80)
    print("Model Comparison Results:")
    print("="*80)

    all_metrics = sorted(set().union(*[s.keys() for s in comparison.values()]))
    header = f"{'Metric':<40}" + "".join(f"{m[:15]:<18}" for m in comparison.keys())
    print(header)
    print("-" * 80)

    for metric in all_metrics:
        if 'acc,none' in metric or 'exact_match' in metric:
            row = f"{metric:<40}"
            for model_name in comparison:
                score = comparison[model_name].get(metric, "N/A")
                row += f"{score:<18}"
            print(row)

    return comparison

Adding a Custom Task

# custom_task.py
# Note: recent lm-evaluation-harness releases (v0.4+) typically define tasks
# via YAML configs; this Python subclass follows the programmatic API and may
# need adapting to your installed version.
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class CustomQATask(Task):
    """Custom Q&A task for lm-evaluation-harness."""
    VERSION = 1.0
    DATASET_PATH = "your-org/your-qa-dataset"
    DATASET_NAME = None

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        return " " + doc['answer']

    def construct_requests(self, doc, ctx):
        return [Instance(
            request_type="generate_until",
            doc=doc,
            arguments=(ctx, {"until": ["\n", "Question:"]}),
            idx=0
        )]

    def process_results(self, doc, results):
        gold = doc['answer'].lower().strip()
        pred = results[0].lower().strip()
        return {"exact_match": int(gold == pred)}

    def aggregation(self):
        return {"exact_match": "mean"}

    def higher_is_better(self):
        return {"exact_match": True}

Summary

AI benchmark datasets are the compass guiding AI research and development. Key takeaways:

Computer Vision:

  • ImageNet: The gold standard for 1,000-class image classification
  • COCO: The standard for object detection and segmentation
  • ADE20K: Primary benchmark for semantic segmentation

NLP:

  • GLUE/SuperGLUE: Comprehensive language understanding evaluation
  • SQuAD: The standard benchmark for machine reading comprehension

LLM Capabilities:

  • MMLU: Knowledge evaluation across 57 disciplines (broadest scope)
  • HumanEval: Code generation capability evaluation
  • GSM8K: Mathematical reasoning evaluation

Comprehensive Evaluation:

  • HELM: Balanced evaluation across 7 dimensions
  • Chatbot Arena: Human preference-based ELO ratings
  • Open LLM Leaderboard: Comparing open-source LLMs

Korean Language:

  • KLUE: 8-task Korean language understanding evaluation
  • KMMLU: Korean knowledge evaluation

When interpreting benchmark results, always consider the possibility of data contamination, measurement bias, and the gap from real-world deployment conditions. A single benchmark cannot capture the full picture — comprehensive evaluation across multiple dimensions better reflects a model's genuine capabilities.


References