AI 벤치마크 데이터셋 완전 가이드: ImageNet, COCO, GLUE, MMLU, HumanEval

목차

  1. AI 벤치마크의 중요성
  2. 컴퓨터 비전 벤치마크
  3. NLP 벤치마크
  4. LLM 능력 벤치마크
  5. LLM 종합 평가
  6. 한국어 벤치마크
  7. 멀티모달 벤치마크
  8. LM-Evaluation-Harness 사용법

1. AI 벤치마크의 중요성

표준화된 평가의 필요성

AI 모델은 어떻게 비교해야 할까요? 두 개의 이미지 분류 모델이 있을 때, 어느 것이 더 우수한지 판단하려면 공통된 기준이 필요합니다. 벤치마크 데이터셋은 바로 이 공통된 기준을 제공합니다.

표준화된 벤치마크가 없다면 각 팀이 자신에게 유리한 데이터셋으로만 평가를 진행할 수 있으므로, 결과를 객관적으로 비교하기 어렵습니다. ImageNet, GLUE, MMLU 같은 표준 벤치마크는 AI 연구 커뮤니티가 동일한 시험지로 경쟁하도록 만들어 진보를 측정하고 방향을 설정하는 데 기여했습니다.

리더보드와 경쟁

벤치마크는 리더보드를 통해 AI 발전을 가시적으로 보여줍니다.

  • ImageNet LSVRC: 2012년 AlexNet이 Top-5 오류율을 26%에서 15.3%로 낮추면서 딥러닝 혁명이 시작되었습니다.
  • GLUE/SuperGLUE: BERT, RoBERTa, T5 등이 인간 수준 성능을 넘어서는 과정을 기록했습니다.
  • HumanEval: GPT-4, Claude, Gemini 등 최신 LLM들이 코드 생성 능력을 경쟁하는 무대가 되었습니다.
  • LMSYS Chatbot Arena: 실제 인간 사용자가 두 모델을 블라인드 테스트하여 ELO 점수를 매깁니다.

벤치마크의 한계와 편향

벤치마크는 강력한 도구이지만 한계도 명확합니다.

1. 데이터셋 오염 (Contamination)

LLM은 인터넷의 방대한 텍스트로 학습됩니다. 벤치마크 테스트 데이터가 학습 데이터에 포함되어 있다면 모델은 실제로 문제를 이해하는 것이 아니라 답을 암기한 것일 수 있습니다. GPT-4 기술 보고서에서도 이 문제를 인정했습니다.

2. 굿하트의 법칙

"측정 지표가 목표가 되면, 더 이상 좋은 측정 지표가 아니다." 연구자들이 특정 벤치마크 점수를 올리는 데만 집중하면, 실제 능력 향상 없이 점수만 높아질 수 있습니다.

3. 편향과 대표성

많은 벤치마크가 영어와 서양 문화권 데이터에 편중되어 있습니다. 한국어, 아랍어, 스와힐리어 등에서의 성능은 영어 벤치마크 점수와 크게 다를 수 있습니다.

4. 정적인 기준

벤치마크는 한번 만들어지면 변하지 않지만, AI 모델은 계속 발전합니다. 2023년에 어려웠던 벤치마크가 2025년에는 포화 상태(near-saturation)에 도달하기도 합니다.

5. 실제 성능과의 괴리

벤치마크 점수가 높다고 해서 실제 사용 환경에서 좋은 성능을 보장하지 않습니다. 사용자 경험, 창의성, 안전성 등 수치화하기 어려운 요소들도 중요합니다.


2. 컴퓨터 비전 벤치마크

ImageNet (ILSVRC)

ImageNet Large Scale Visual Recognition Challenge(ILSVRC)는 컴퓨터 비전 역사상 가장 영향력 있는 벤치마크입니다. 스탠퍼드 대학교의 Fei-Fei Li 교수가 주도한 ImageNet 프로젝트(2009)에서 시작되었으며, 2010년부터 2017년까지 연례 대회로 진행되었습니다.

데이터셋 특성:

  • 1,000개 클래스 (개, 고양이, 자동차 등 일상적 사물)
  • 학습 데이터: 약 120만 장
  • 검증 데이터: 50,000장
  • 테스트 데이터: 100,000장
  • 평균 클래스당 약 1,200장

주요 평가 지표:

  • Top-1 Accuracy: 모델이 예측한 1위 클래스가 실제 정답인 비율
  • Top-5 Accuracy: 모델이 예측한 상위 5개 클래스 중 실제 정답이 포함된 비율

역사적 발전:

| 연도 | 모델 | Top-5 오류율 |
|------|------|--------------|
| 2010 | NEC-UIUC | 28.2% |
| 2012 | AlexNet | 15.3% |
| 2014 | VGG-16 | 7.3% |
| 2015 | ResNet-152 | 3.57% |
| 2017 | SENet | 2.25% |
| 2021 | CoAtNet | 0.95% |
| 2023 | ViT-22B | ~0.6% |

사람의 Top-5 오류율은 약 5.1%로 추정됩니다. ResNet(2015년)이 이미 인간 수준을 넘어선 이후 연구는 더욱 어려운 변형 벤치마크(ImageNet-A, ImageNet-R, ImageNet-C)로 확장되었습니다.

# PyTorch로 ImageNet 검증 정확도 측정
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader

def evaluate_imagenet(model, val_dir, batch_size=256):
    # 표준 전처리 (ImageNet 검증 기준)
    val_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    val_dataset = datasets.ImageFolder(val_dir, transform=val_transform)
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in val_loader:
            images = images.cuda()
            labels = labels.cuda()

            outputs = model(images)
            _, predicted = outputs.topk(5, 1, True, True)
            predicted = predicted.t()
            correct = predicted.eq(labels.view(1, -1).expand_as(predicted))

            top1_correct += correct[:1].reshape(-1).float().sum(0)
            top5_correct += correct[:5].reshape(-1).float().sum(0)
            total += labels.size(0)

    top1_acc = top1_correct / total * 100
    top5_acc = top5_correct / total * 100
    print(f"Top-1 Accuracy: {top1_acc:.2f}%")
    print(f"Top-5 Accuracy: {top5_acc:.2f}%")
    return top1_acc, top5_acc

# 예시: ResNet-50 평가
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).cuda()  # torchvision 0.13+ 권장 방식
evaluate_imagenet(model, '/path/to/imagenet/val')

COCO (Common Objects in Context)

COCO는 Microsoft가 2014년 공개한 대규모 객체 탐지, 세그멘테이션, 이미지 캡셔닝 벤치마크입니다.

데이터셋 특성:

  • 80개 카테고리의 일상적 객체
  • 330,000장 이상의 이미지
  • 150만 개 이상의 객체 인스턴스
  • 각 이미지에 5개의 캡션 (캡셔닝 태스크용)
  • 세밀한 세그멘테이션 마스크 포함

주요 평가 지표:

mAP (mean Average Precision)는 COCO의 핵심 지표입니다. IoU(Intersection over Union) 임계값에 따라 다양한 지표가 있습니다.
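mAP의 기반이 되는 IoU는 두 박스의 교집합 면적을 합집합 면적으로 나눈 값입니다. 개념 확인용 최소 구현은 다음과 같습니다 (pycocotools는 내부적으로 더 최적화된 구현을 사용합니다).

```python
def box_iou(box_a, box_b):
    """(x1, y1, x2, y2) 형식 두 박스의 IoU 계산"""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # 교집합이 없으면 폭/높이가 음수가 되므로 0으로 클램프
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# 예: 대각선으로 일부 겹치는 두 2x2 박스
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143
```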

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json

def evaluate_coco_detection(annotation_file, result_file):
    # COCO GT 로드
    coco_gt = COCO(annotation_file)

    # 예측 결과 로드
    coco_dt = coco_gt.loadRes(result_file)

    # bbox 평가
    coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()

    # 주요 지표 출력
    stats = coco_eval.stats
    print(f"\n=== COCO Detection Results ===")
    print(f"AP @ IoU=0.50:0.95 (COCO primary): {stats[0]:.3f}")
    print(f"AP @ IoU=0.50 (PASCAL VOC style): {stats[1]:.3f}")
    print(f"AP @ IoU=0.75 (strict): {stats[2]:.3f}")
    print(f"AP (small objects, area < 32^2): {stats[3]:.3f}")
    print(f"AP (medium objects): {stats[4]:.3f}")
    print(f"AP (large objects): {stats[5]:.3f}")
    print(f"AR (max=1 det/image): {stats[6]:.3f}")
    print(f"AR (max=10 det/image): {stats[7]:.3f}")
    print(f"AR (max=100 det/image): {stats[8]:.3f}")
    return stats

# COCO 데이터셋 탐색
coco = COCO('instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person', 'car', 'dog'])
img_ids = coco.getImgIds(catIds=cat_ids[:1])

# 특정 이미지의 어노테이션 확인
img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(ann_ids)
print(f"이미지: {img['file_name']}, 어노테이션 수: {len(anns)}")
for ann in anns[:3]:
    cat = coco.loadCats(ann['category_id'])[0]
    print(f"  카테고리: {cat['name']}, 면적: {ann['area']:.0f}px²")

최신 COCO 성능 (2025년 기준):

| 모델 | AP (box) | AP (mask) | 파라미터 |
|------|----------|-----------|----------|
| YOLOv8x | 53.9 | - | 68M |
| DINO (Swin-L) | 63.3 | - | 218M |
| Co-DINO (Swin-L) | 64.1 | 54.0 | 218M |
| InternImage-H | 65.4 | 56.1 | 2.18B |

ADE20K - 시맨틱 세그멘테이션

ADE20K는 MIT CSAIL이 구축한 시맨틱 세그멘테이션 벤치마크로, 150개 카테고리에 걸쳐 25,000장의 이미지를 포함합니다.

주요 지표:

  • mIoU (mean Intersection over Union): 예측 마스크와 실제 마스크 간의 평균 IoU
  • aAcc (allAcc): 픽셀 수준 전체 정확도
  • mAcc: 클래스별 평균 정확도

import numpy as np

def compute_iou(pred_mask, gt_mask, num_classes=150):
    """mIoU 계산"""
    iou_list = []

    for cls in range(num_classes):
        pred_cls = (pred_mask == cls)
        gt_cls = (gt_mask == cls)

        intersection = np.logical_and(pred_cls, gt_cls).sum()
        union = np.logical_or(pred_cls, gt_cls).sum()

        if union == 0:
            continue  # 이 클래스가 이미지에 없으면 건너뜀

        iou = intersection / union
        iou_list.append(iou)

    return np.mean(iou_list) if iou_list else 0.0

# mmsegmentation으로 ADE20K 평가
# pip install mmsegmentation
from mmseg.apis import inference_segmentor, init_segmentor

config_file = 'configs/segformer/segformer_mit-b5_8xb2-160k_ade20k-512x512.py'
checkpoint_file = 'segformer_mit-b5_8x2_512x512_160k_ade20k_20220617_203542-745f14da.pth'

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'test_image.jpg')

Kinetics - 동영상 분류

Kinetics는 Google DeepMind가 제공하는 동영상 행동 인식 벤치마크입니다.

  • Kinetics-400: 400개 행동 클래스, 약 30만 개 클립
  • Kinetics-600: 600개 클래스, 약 50만 개 클립
  • Kinetics-700: 700개 클래스

주요 지표: Top-1, Top-5 정확도 (각 클립에서의 평균)

CIFAR-10/100

소규모 이미지 분류 벤치마크로, 빠른 프로토타이핑과 논문 검증에 자주 사용됩니다.

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# CIFAR-10 로드 및 평가
def evaluate_cifar10(model, batch_size=128):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    testset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform
    )
    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)

    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in testloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"CIFAR-10 정확도: {accuracy:.2f}%")
    return accuracy

3. NLP 벤치마크

GLUE (General Language Understanding Evaluation)

GLUE는 2018년 뉴욕대(NYU), 워싱턴대, DeepMind 연구진이 공동 발표한 NLP 모델 평가 벤치마크로, 9가지 서로 다른 언어 이해 태스크로 구성됩니다.

GLUE 태스크 구성:

| 태스크 | 설명 | 데이터 규모 | 지표 |
|--------|------|-------------|------|
| CoLA | 문법성 판단 | 8,551 | Matthews Corr. |
| SST-2 | 감성 분류 (긍정/부정) | 67K | 정확도 |
| MRPC | 문장 의미 동일성 | 3,700 | F1/정확도 |
| STS-B | 문장 유사도 점수 | 7K | Pearson/Spearman |
| QQP | 질문 유사성 | 400K | F1/정확도 |
| MNLI | 자연어 추론 (3분류) | 393K | 정확도 |
| QNLI | 질문-답변 추론 | 105K | 정확도 |
| RTE | 텍스트 함의 인식 | 2,500 | 정확도 |
| WNLI | Winograd NLI | 634 | 정확도 |

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score

def evaluate_glue_cola(model_name="bert-base-uncased"):
    """CoLA 태스크 평가 (문법성 판단)"""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )

    dataset = load_dataset("glue", "cola")
    val_data = dataset["validation"]

    predictions = []
    labels = []

    model.eval()
    import torch

    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=128
        )

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        predictions.append(pred)
        labels.append(item['label'])

    mcc = matthews_corrcoef(labels, predictions)
    print(f"CoLA Matthews Correlation: {mcc:.4f}")
    return mcc

# SST-2 (감성 분류)
def evaluate_glue_sst2(model_name="textattack/bert-base-uncased-SST-2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = load_dataset("glue", "sst2")
    val_data = dataset["validation"]

    correct = 0
    total = len(val_data)

    model.eval()
    import torch

    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            truncation=True,
            max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()
        if pred == item['label']:
            correct += 1

    acc = correct / total
    print(f"SST-2 정확도: {acc:.4f}")
    return acc

SuperGLUE

GLUE가 포화 상태(인간 수준 달성)에 가까워지자, 2019년 더 어려운 태스크로 구성된 SuperGLUE가 등장했습니다.

SuperGLUE 태스크:

  • BoolQ: 예/아니오 질의응답 (9,427개)
  • CB: 주장-전제 함의 (250개, 3분류)
  • COPA: 원인/결과 추론 (1,000개)
  • MultiRC: 멀티 문장 독해 (9,693개)
  • ReCoRD: 클로즈 스타일 독해 (120K개)
  • RTE: 텍스트 함의 인식 (5,749개)
  • WiC: 단어 의미 중의성 해소 (9,600개)
  • WSC: Winograd 스키마 도전 (554개)

인간 베이스라인: 89.8 / GPT-4 수준 모델: 90+ (인간 수준 초과)

SQuAD 1.1 & 2.0

SQuAD(Stanford Question Answering Dataset)는 위키피디아 단락에서 질문에 대한 답을 추출하는 기계 독해 벤치마크입니다.

  • SQuAD 1.1: 536개 위키피디아 문서, 107,785개 질문-답변 쌍. 모든 질문에 답이 단락 내에 존재
  • SQuAD 2.0: SQuAD 1.1 + 53,775개 대답 불가능한 질문 추가

평가 지표:

  • EM (Exact Match): 예측 답변이 정답과 완전히 일치하는 비율
  • F1 Score: 단어 수준의 부분 일치 점수

from datasets import load_dataset
from transformers import pipeline

def evaluate_squad(model_name="deepset/roberta-base-squad2"):
    """SQuAD 2.0 평가"""
    qa_pipeline = pipeline("question-answering", model=model_name)
    dataset = load_dataset("squad_v2", split="validation")

    em_scores = []
    f1_scores = []
    no_answer_correct = 0
    no_answer_total = 0

    for item in dataset.select(range(200)):  # 빠른 평가를 위해 200개만
        context = item['context']
        question = item['question']
        answers = item['answers']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].lower().strip()

        has_answer = len(answers['text']) > 0

        if not has_answer:
            no_answer_total += 1
            if result['score'] < 0.1:  # 모델이 답 없음을 인식한 경우
                no_answer_correct += 1
            em_scores.append(0)
            f1_scores.append(0)
        else:
            gold_answers = [a.lower().strip() for a in answers['text']]

            # EM 계산
            em = max(int(predicted == gold) for gold in gold_answers)
            em_scores.append(em)

            # F1 계산
            best_f1 = 0
            for gold in gold_answers:
                pred_tokens = set(predicted.split())
                gold_tokens = set(gold.split())
                common = pred_tokens & gold_tokens
                if len(common) == 0:
                    f1 = 0
                else:
                    precision = len(common) / len(pred_tokens)
                    recall = len(common) / len(gold_tokens)
                    f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
            f1_scores.append(best_f1)

    print(f"SQuAD 2.0 결과 (샘플 200개):")
    print(f"  EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"  F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
    if no_answer_total > 0:
        print(f"  대답 불가 정확도: {no_answer_correct/no_answer_total*100:.1f}%")

WMT - 기계 번역

WMT(Workshop on Machine Translation)는 기계 번역 모델을 평가하는 연례 대회로, 여러 언어 쌍(영-독, 영-중, 영-한 등)에 대한 번역 품질을 평가합니다.

주요 평가 지표:

  • BLEU (Bilingual Evaluation Understudy): n-gram 정밀도 기반 자동 평가
  • COMET: 인간 평가와 높은 상관성을 보이는 신경망 기반 지표
  • chrF: 문자 수준 n-gram F 점수

from datasets import load_dataset
import sacrebleu

def compute_bleu(predictions, references):
    """BLEU 점수 계산"""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU: {bleu.score:.2f}")
    print(f"BP: {bleu.bp:.3f}")
    print(f"Ratio: {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu.score

# 번역 모델 평가
from transformers import MarianMTModel, MarianTokenizer

def evaluate_translation(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-en-ko"):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    predictions = []
    for text in src_texts[:100]:  # 100개 샘플
        inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
        translated = model.generate(**inputs, max_length=512)
        pred = tokenizer.decode(translated[0], skip_special_tokens=True)
        predictions.append(pred)

    bleu_score = compute_bleu(predictions, tgt_texts[:100])
    return bleu_score
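chrF의 동작 원리를 보여주는 단순화된 구현도 덧붙입니다. 실제 평가에는 sacrebleu의 chrF 구현을 사용하는 것이 좋으며, 아래 코드는 공식 구현과 세부 수치가 다를 수 있는 개념 스케치입니다.

```python
from collections import Counter

def char_ngrams(text, n):
    """공백을 제거한 문자 n-gram 빈도"""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """단순화된 chrF: 문자 1~max_n-gram F-beta 점수의 평균 (0~100)"""
    f_scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # 해당 차수의 n-gram이 없으면 건너뜀
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * precision * recall
                            / (beta**2 * precision + recall))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(simple_chrf("hello world", "hello there"))  # 0과 100 사이의 점수
```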

4. LLM 능력 벤치마크

MMLU (Massive Multitask Language Understanding)

MMLU는 UC Berkeley의 Dan Hendrycks 연구진이 2020년 발표한 벤치마크로, 57개 학문 분야에 걸친 객관식 문제로 LLM의 지식과 추론 능력을 평가합니다. 난이도는 고등학교 수준부터 전문가 수준까지 다양합니다.

분야별 구성:

  • STEM: 수학, 물리, 화학, 컴퓨터과학, 엔지니어링
  • 인문학: 역사, 철학, 법학, 윤리학
  • 사회과학: 심리학, 경제학, 정치학, 사회학
  • 기타: 의학, 영양학, 도덕적 시나리오, 전문 회계

각 문제는 4지선다형이며, 총 약 14,000개의 문제로 구성됩니다.

모델별 MMLU 성능:

| 모델 | MMLU 점수 | 발표 연도 |
|------|-----------|-----------|
| GPT-3 (175B) | 43.9% | 2020 |
| Gopher (280B) | 60.0% | 2021 |
| GPT-4 | 86.4% | 2023 |
| Claude 3 Opus | 86.8% | 2024 |
| Gemini Ultra | 90.0% | 2024 |
| GPT-4o | 88.7% | 2024 |
| 인간 전문가 추정 | ~90% | - |

from datasets import load_dataset
import anthropic  # 또는 openai

def evaluate_mmlu(model_fn, subjects=None, num_few_shot=5):
    """MMLU 평가 함수"""
    if subjects is None:
        subjects = ['abstract_algebra', 'anatomy', 'astronomy', 'college_mathematics']

    results = {}

    for subject in subjects:
        dataset = load_dataset("lukaemon/mmlu", subject)
        test_data = dataset['test']
        dev_data = dataset['dev']  # few-shot 예시용

        correct = 0
        total = 0

        # Few-shot 프롬프트 구성
        few_shot_examples = ""
        for i, item in enumerate(dev_data.select(range(num_few_shot))):
            few_shot_examples += f"Q: {item['input']}\n"
            few_shot_examples += f"(A) {item['A']}  (B) {item['B']}  (C) {item['C']}  (D) {item['D']}\n"
            few_shot_examples += f"Answer: {item['target']}\n\n"

        for item in test_data:
            prompt = few_shot_examples
            prompt += f"Q: {item['input']}\n"
            prompt += f"(A) {item['A']}  (B) {item['B']}  (C) {item['C']}  (D) {item['D']}\n"
            prompt += "Answer:"

            response = model_fn(prompt)

            # 응답에서 A/B/C/D 추출
            pred = response.strip()[0] if response.strip() else 'A'

            if pred == item['target']:
                correct += 1
            total += 1

        accuracy = correct / total
        results[subject] = accuracy
        print(f"{subject}: {accuracy:.3f} ({correct}/{total})")

    overall = sum(results.values()) / len(results)
    print(f"\n전체 평균: {overall:.3f}")
    return results

BIG-Bench (Beyond the Imitation Game Benchmark)

Google이 주도한 BIG-Bench는 LLM의 경계를 탐색하는 204개의 다양한 태스크로 구성됩니다. 언어 모델이 아직 잘 수행하지 못하는 창의적 추론, 상식, 수학, 코드 등을 포함합니다.

BIG-Bench Hard: 23개의 어려운 태스크로, 체인-오브-소트(Chain-of-Thought) 프롬프팅으로 성능이 크게 향상됩니다.

from lm_eval.api.task import Task
from lm_eval import evaluator

# lm-evaluation-harness를 통한 BIG-Bench 평가
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["bigbench_causal_judgment", "bigbench_date_understanding"],
    num_fewshot=3,
    batch_size="auto"
)
print(results['results'])

HellaSwag - 상식 추론

HellaSwag는 2019년 발표된 상식 추론 벤치마크로, 이야기의 다음 문장으로 가장 자연스러운 것을 4개 중에 고르는 형태입니다.

from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

def evaluate_hellaswag(model_name="microsoft/deberta-v2-xxlarge"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)

    dataset = load_dataset("hellaswag", split="validation")

    correct = 0
    total = min(500, len(dataset))  # 빠른 평가

    for item in dataset.select(range(total)):
        context = item['ctx']
        endings = item['endings']
        label = int(item['label'])

        # 각 선택지를 (컨텍스트, 엔딩) 문장 쌍으로 인코딩
        encoding = tokenizer(
            [context] * 4,
            endings,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=256
        )

        with torch.no_grad():
            outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
            logits = outputs.logits
            predicted = logits.argmax(dim=-1).item()

        if predicted == label:
            correct += 1

    accuracy = correct / total
    print(f"HellaSwag 정확도: {accuracy:.4f}")
    return accuracy

ARC (AI2 Reasoning Challenge)

ARC는 AI2(Allen Institute for AI)가 발표한 초등-고등학교 수준의 과학 문제 벤치마크입니다.

  • ARC-Easy: 비교적 쉬운 문제 (5,197개)
  • ARC-Challenge: 검색 기반 모델도 틀리는 어려운 문제 (1,172개)
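ARC는 Hugging Face Hub의 allenai/ai2_arc 데이터셋으로 불러올 수 있습니다. 아래는 문항을 객관식 프롬프트로 변환하는 간단한 예시로, 필드 구조(question, choices, answerKey)는 해당 데이터셋 기준입니다.

```python
def format_arc_prompt(item):
    """ARC 문항 딕셔너리를 객관식 프롬프트 문자열로 변환"""
    lines = [f"Question: {item['question']}"]
    for label, text in zip(item['choices']['label'], item['choices']['text']):
        lines.append(f"({label}) {text}")
    lines.append("Answer:")
    return "\n".join(lines)

# 사용 예 (네트워크 필요):
# from datasets import load_dataset
# dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
# print(format_arc_prompt(dataset[0]))  # 정답은 item['answerKey']와 비교
```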

TruthfulQA - 사실성 평가

TruthfulQA는 모델이 널리 알려진 미신, 오해, 편견에 대해 얼마나 정확하게 답변하는지를 평가합니다.

from datasets import load_dataset
from transformers import pipeline

def evaluate_truthfulqa(model_name="gpt2-xl"):
    """TruthfulQA MC1 (단일 정답 선택) 평가"""
    dataset = load_dataset("truthful_qa", "multiple_choice")
    val_data = dataset["validation"]

    generator = pipeline("text-generation", model=model_name)
    correct = 0
    total = min(100, len(val_data))

    for item in val_data.select(range(total)):
        question = item['question']
        choices = item['mc1_targets']['choices']
        labels = item['mc1_targets']['labels']
        correct_idx = labels.index(1)

        # 프롬프트 구성
        prompt = f"Q: {question}\nOptions:\n"
        for i, choice in enumerate(choices):
            letter = chr(65 + i)  # A, B, C, ...
            prompt += f"{letter}. {choice}\n"
        prompt += "Answer:"

        response = generator(prompt, max_new_tokens=5, do_sample=False)
        generated = response[0]['generated_text'][len(prompt):].strip()
        pred_letter = generated[0] if generated else 'A'
        pred_idx = ord(pred_letter) - 65

        if pred_idx == correct_idx:
            correct += 1

    accuracy = correct / total
    print(f"TruthfulQA MC1 정확도: {accuracy:.4f}")
    return accuracy

GSM8K - 초등 수학

GSM8K(Grade School Math 8K)는 OpenAI가 2021년 발표한 초등학교 수준의 수학 문제 8,500개로 구성된 벤치마크입니다. 각 문제는 자연어로 서술되며, 모델이 단계별로 수학적 추론을 수행하는 능력을 평가합니다.

from datasets import load_dataset
import re

def extract_number(text):
    """텍스트에서 최종 숫자 답 추출"""
    numbers = re.findall(r'-?\d+\.?\d*', text)
    return numbers[-1] if numbers else None

def evaluate_gsm8k_chain_of_thought(model_fn, num_shot=8):
    """Chain-of-Thought로 GSM8K 평가"""
    dataset = load_dataset("gsm8k", "main")
    test_data = dataset['test']

    few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A: They started with 23. Used 20: 23 - 20 = 3. Then bought 6: 3 + 6 = 9. The answer is 9.

"""

    correct = 0
    total = min(200, len(test_data))

    for item in test_data.select(range(total)):
        question = item['question']
        gold_answer = item['answer'].split('####')[-1].strip()

        prompt = few_shot_prompt + f"Q: {question}\nA:"
        response = model_fn(prompt, max_tokens=256)

        pred = extract_number(response)
        gold = extract_number(gold_answer)

        if pred and gold and abs(float(pred) - float(gold)) < 0.01:
            correct += 1

    accuracy = correct / total
    print(f"GSM8K 정확도 (Chain-of-Thought): {accuracy:.4f}")
    return accuracy

HumanEval - 코드 생성 평가

HumanEval은 OpenAI가 2021년 발표한 코드 생성 벤치마크로, 164개의 파이썬 함수 시그니처와 독스트링이 주어지고 모델이 완성된 함수를 작성해야 합니다.

평가 지표: pass@k

k번 시도 중 적어도 한 번 테스트를 통과할 확률입니다.
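HumanEval 원 논문(Chen et al., 2021)은 문제당 n개의 샘플을 생성해 그중 c개가 통과했을 때 pass@k를 다음과 같이 비편향 추정합니다.

```python
from math import comb

def pass_at_k(n, c, k):
    """비편향 pass@k 추정치: 1 - C(n-c, k) / C(n, k)"""
    if n - c < k:
        return 1.0  # 실패 샘플이 k개 미만이면 k개 중 적어도 하나는 반드시 통과
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # 0.5
```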

from datasets import load_dataset
import subprocess
import tempfile
import os

def evaluate_humaneval(model_fn, k=1, n=10, temperature=0.8):
    """HumanEval pass@k 평가"""
    dataset = load_dataset("openai_humaneval")
    test_data = dataset['test']

    task_results = {}

    for item in test_data.select(range(20)):  # 20개만 빠르게 평가
        task_id = item['task_id']
        prompt = item['prompt']
        tests = item['test']
        entry_point = item['entry_point']

        passes = 0

        for attempt in range(n):
            code = model_fn(prompt, temperature=temperature)

            # 코드 실행 및 테스트
            full_code = prompt + code + "\n" + tests + f"\ncheck({entry_point})"

            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                f.write(full_code)
                tmp_path = f.name

            try:
                result = subprocess.run(
                    ['python', tmp_path],
                    timeout=10,
                    capture_output=True,
                    text=True
                )
                if result.returncode == 0:
                    passes += 1
            except subprocess.TimeoutExpired:
                pass
            finally:
                os.unlink(tmp_path)

        task_results[task_id] = passes / n

    # pass@1 계산
    pass_at_1 = sum(task_results.values()) / len(task_results)
    print(f"pass@1: {pass_at_1:.4f}")
    return pass_at_1

# 주요 모델 HumanEval 성능 (2025년 기준)
humaneval_scores = {
    "GPT-3 (175B)": 0.0,   # 원본 논문 기준
    "Codex (12B)": 0.288,
    "GPT-4": 0.870,
    "Claude 3.5 Sonnet": 0.900,
    "DeepSeek-Coder-33B": 0.823,
    "Llama 3.1 70B": 0.803,
}

MBPP - 파이썬 프로그래밍

MBPP(Mostly Basic Python Problems)는 Google이 발표한 974개의 파이썬 프로그래밍 문제로, HumanEval보다 더 다양한 난이도를 포함합니다.

from datasets import load_dataset

def evaluate_mbpp_sample():
    """MBPP 데이터셋 탐색"""
    dataset = load_dataset("mbpp")
    test_data = dataset['test']

    print("MBPP 샘플 문제:")
    for item in test_data.select(range(3)):
        print(f"\n태스크 ID: {item['task_id']}")
        print(f"문제: {item['text']}")
        print(f"테스트 케이스: {item['test_list'][:2]}")
        print(f"참고 코드:\n{item['code']}")
        print("-" * 50)

5. LLM 종합 평가

MT-Bench - 멀티턴 대화 평가

MT-Bench는 UC Berkeley의 LMSYS 팀이 개발한 멀티턴 대화 평가 벤치마크로, GPT-4를 심판으로 사용하여 1-10점 척도로 채점합니다.

8개 카테고리, 각 10개 질문:

  • Writing (글쓰기)
  • Roleplay (역할극)
  • Reasoning (추론)
  • Math (수학)
  • Coding (코딩)
  • Extraction (정보 추출)
  • STEM
  • Humanities

import json
from openai import OpenAI

def mt_bench_judge(question, answer, reference_answer=None):
    """GPT-4로 MT-Bench 답변 평가"""
    client = OpenAI()

    system_prompt = """You are a helpful assistant that evaluates AI responses.
Rate the response on a scale of 1-10 based on: accuracy, relevance, completeness, and clarity.
Output format: Score: X/10\nRationale: [brief explanation]"""

    user_prompt = f"""Question: {question}

AI Response: {answer}

{f'Reference Answer: {reference_answer}' if reference_answer else ''}

Please evaluate this response."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.0
    )

    judge_response = response.choices[0].message.content
    print(f"평가 결과:\n{judge_response}")
    return judge_response

# FastChat의 공식 MT-Bench 사용법
# git clone https://github.com/lm-sys/FastChat
# python -m fastchat.llm_judge.gen_model_answer --model-path your-model
# python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4
# python -m fastchat.llm_judge.show_result

LMSYS Chatbot Arena

Chatbot Arena는 실제 사용자가 두 모델의 응답을 비교하여 더 나은 쪽에 투표하는 방식입니다. ELO 레이팅 시스템을 사용하므로 인간의 실제 선호도를 반영합니다.
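ELO 갱신 규칙 자체는 간단하게 스케치할 수 있습니다. 실제 리더보드는 이후 온라인 ELO 대신 Bradley-Terry 모델 기반 계산으로 전환한 것으로 알려져 있으므로, 아래는 개념 설명용 예시입니다.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """단일 비교 결과로 ELO 갱신. score_a: A 승=1.0, 무승부=0.5, 패=0.0"""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# 동일 레이팅에서 A가 이기면 +16 / -16 (k=32 기준)
print(elo_update(1200, 1200, 1.0))  # (1216.0, 1184.0)
```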

2025년 3월 ELO 상위 모델 (참고용):

| 순위 | 모델 | ELO |
|------|------|-----|
| 1 | GPT-4.5 | ~1370 |
| 2 | Gemini 2.0 Ultra | ~1360 |
| 3 | Claude 3.7 Sonnet | ~1350 |
| 4 | GPT-4o | ~1340 |
| 5 | Llama 3.3 70B | ~1250 |

HELM (Holistic Evaluation of Language Models)

Stanford CRFM이 개발한 HELM은 단순 정확도를 넘어 다음 7개 측면을 종합 평가합니다:

  1. Accuracy (정확도)
  2. Calibration (확신도 보정)
  3. Robustness (견고성)
  4. Fairness (공정성)
  5. Bias (편향성)
  6. Toxicity (독성)
  7. Efficiency (효율성)

# HELM 평가 실행
pip install crfm-helm

# 기본 평가 (mmlu + summarization + qa)
helm-run \
    --conf src/helm/benchmark/presentation/run_specs_lite.conf \
    --local \
    --max-eval-instances 1000 \
    --num-train-trials 1

# 결과 확인
helm-summarize --suite v1
helm-server

Open LLM Leaderboard (HuggingFace)

HuggingFace의 Open LLM Leaderboard는 오픈소스 LLM을 일관된 기준으로 평가하는 공개 리더보드입니다.

평가 태스크:

  • MMLU (5-shot)
  • ARC Challenge (25-shot)
  • HellaSwag (10-shot)
  • TruthfulQA (0-shot)
  • Winogrande (5-shot)
  • GSM8K (5-shot)

# huggingface_hub로 리더보드 데이터 접근
from huggingface_hub import HfApi
import pandas as pd

def fetch_leaderboard_data():
    """Open LLM Leaderboard 데이터 가져오기"""
    api = HfApi()

    # 리더보드 데이터셋
    dataset_info = api.dataset_info("open-llm-leaderboard/results")
    print(f"마지막 업데이트: {dataset_info.lastModified}")

    # 결과 파일 목록
    files = api.list_repo_files(
        repo_id="open-llm-leaderboard/results",
        repo_type="dataset"
    )

    model_results = []
    for f in list(files)[:5]:  # 처음 5개만
        print(f"파일: {f}")

    return model_results

6. 한국어 벤치마크

KLUE (Korean Language Understanding Evaluation)

KLUE는 2021년 Upstage, NAVER, KAIST 등 국내 산학 연구진이 공동 개발한 한국어 자연어 이해 벤치마크로, 8개 태스크로 구성됩니다.

KLUE 태스크:

| 태스크 | 유형 | 데이터 규모 | 지표 |
|--------|------|-------------|------|
| TC (Topic Classification) | 문서 분류 | 60K | 정확도 |
| STS (Semantic Textual Similarity) | 문장 유사도 | 13K | Pearson |
| NLI (Natural Language Inference) | 자연어 추론 | 30K | 정확도 |
| NER (Named Entity Recognition) | 개체명 인식 | 21K | Entity F1 |
| RE (Relation Extraction) | 관계 추출 | 32K | micro-F1 |
| DP (Dependency Parsing) | 의존 구문 분석 | 23K | UAS/LAS |
| MRC (Machine Reading Comprehension) | 기계 독해 | 24K | EM/F1 |
| DST (Dialogue State Tracking) | 대화 상태 추적 | 10K | JGA |

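DST의 지표인 JGA(Joint Goal Accuracy)는 예측한 대화 상태(슬롯-값 집합)가 정답과 완전히 일치하는 턴의 비율입니다. 최소 스케치:

```python
def joint_goal_accuracy(pred_states, gold_states):
    """각 턴의 예측 상태 dict가 정답 dict와 완전히 일치하는 턴의 비율"""
    assert len(pred_states) == len(gold_states)
    exact = sum(1 for p, g in zip(pred_states, gold_states) if p == g)
    return exact / len(gold_states)

# 예시 슬롯-값은 설명용 가상의 데이터입니다
preds = [{"식당-지역": "서울 북쪽"}, {"식당-음식": "한식"}]
golds = [{"식당-지역": "서울 북쪽"}, {"식당-음식": "양식"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```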
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def evaluate_klue_nli(model_name="klue/roberta-large"):
    """KLUE-NLI 평가"""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3
    )

    dataset = load_dataset("klue", "nli")
    val_data = dataset['validation']

    label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}
    correct = 0
    total = min(500, len(val_data))

    model.eval()
    for item in val_data.select(range(total)):
        premise = item['premise']
        hypothesis = item['hypothesis']
        gold_label = item['label']

        inputs = tokenizer(
            premise, hypothesis,
            return_tensors='pt',
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        if pred == gold_label:
            correct += 1

    accuracy = correct / total
    print(f"KLUE-NLI 정확도: {accuracy:.4f}")
    return accuracy

def evaluate_klue_mrc(model_name="klue/roberta-large"):
    """KLUE-MRC (기계 독해) 평가"""
    from transformers import AutoModelForQuestionAnswering, pipeline

    qa_pipeline = pipeline(
        "question-answering",
        model=model_name,
        tokenizer=model_name
    )

    dataset = load_dataset("klue", "mrc")
    val_data = dataset['validation']

    em_scores = []
    f1_scores = []

    for item in val_data.select(range(100)):
        context = item['context']
        question = item['question']
        answers = item['answers']['text']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].strip()

        # EM
        em = max(int(predicted == a) for a in answers)
        em_scores.append(em)

        # F1
        best_f1 = 0
        for gold in answers:
            pred_chars = set(predicted)
            gold_chars = set(gold)
            common = pred_chars & gold_chars
            if common:
                precision = len(common) / len(pred_chars)
                recall = len(common) / len(gold_chars)
                f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
        f1_scores.append(best_f1)

    print(f"KLUE-MRC EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"KLUE-MRC F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")

KoBEST

KoBEST(Korean Balanced Evaluation of Significant Tasks)는 SKT 연구진 등이 공개한 한국어 벤치마크로 5개 태스크를 포함합니다:

  • BoolQ: 예/아니오 질의응답
  • COPA: 원인/결과 추론
  • WiC: 단어 의미 중의성 해소
  • HellaSwag: 상식 완성
  • SentiNeg: 부정 감성 이해
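KoBEST는 Hugging Face Hub의 skt/kobest_v1 데이터셋으로 제공됩니다. 아래는 COPA 문항을 2지선다 프롬프트로 바꾸는 예시이며, 필드명(premise, question, alternative_1/2)은 해당 데이터셋 스키마를 가정한 것입니다.

```python
def format_copa_prompt(item):
    """KoBEST COPA 문항을 2지선다 프롬프트로 변환 (필드명은 skt/kobest_v1 기준 가정)"""
    connective = "원인은" if item["question"] == "원인" else "결과는"
    return (
        f"전제: {item['premise']}\n"
        f"다음 중 가장 자연스러운 {connective}?\n"
        f"1. {item['alternative_1']}\n"
        f"2. {item['alternative_2']}\n"
        f"정답:"
    )

# 사용 예 (네트워크 필요):
# from datasets import load_dataset
# copa = load_dataset("skt/kobest_v1", "copa", split="validation")
# print(format_copa_prompt(copa[0]))  # 정답 인덱스는 copa[0]['label']
```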

KMMLU (한국어 MMLU)

KMMLU는 MMLU 형식을 따르는 한국어 벤치마크입니다. 영어 MMLU의 번역본이 아니라 한국어 시험·자격증 문제에서 직접 수집한 문항으로 구성되며, 한국사·한국 법률 등 한국어 특화 지식을 함께 평가합니다.

from datasets import load_dataset

def evaluate_kmmlu_sample():
    """KMMLU 샘플 탐색"""
    # KMMLU 로드 (2024년 공개)
    dataset = load_dataset("HAERAE-HUB/KMMLU")
    test_data = dataset['test']

    print(f"총 문제 수: {len(test_data)}")

    subjects = set(test_data['subject'])
    print(f"과목 수: {len(subjects)}")
    print(f"샘플 과목: {list(subjects)[:10]}")

    # 첫 번째 문제 출력
    item = test_data[0]
    print(f"\n과목: {item['subject']}")
    print(f"문제: {item['question']}")
    print(f"A: {item['A']}")
    print(f"B: {item['B']}")
    print(f"C: {item['C']}")
    print(f"D: {item['D']}")
    print(f"정답: {item['answer']}")

7. 멀티모달 벤치마크

VQA (Visual Question Answering)

VQA는 이미지를 보고 자연어 질문에 답하는 태스크입니다.

  • VQA v2: 약 1.1M개의 (이미지, 질문, 답) 쌍. 이미지당 두 개의 보완적 질문
  • 평가 지표: Accuracy = min(예측과 일치한 어노테이터 답변 수 / 3, 1). 질문당 10명의 어노테이터 답변과 비교합니다.

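위 지표는 다음처럼 구현할 수 있습니다 (공식 평가는 10개 답변 중 9개씩 뽑아 평균하지만, 여기서는 그 절차를 생략한 단순화 버전입니다).

```python
def vqa_accuracy(predicted, human_answers):
    """VQA 정확도의 단순화 버전: min(일치한 어노테이터 수 / 3, 1)"""
    pred = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3, 1.0)

print(vqa_accuracy("blue", ["blue"] * 4 + ["navy"] * 6))  # 1.0 (3명 이상 일치)
```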
from datasets import load_dataset
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image

def evaluate_vqa_blip():
    """BLIP으로 VQA 평가"""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.eval()

    # VQA v2 검증 데이터 (로컬 이미지 경로 필요)
    questions = [
        ("What color is the car?", "test_car.jpg"),
        ("How many people are in the image?", "test_crowd.jpg"),
        ("Is it raining?", "test_outdoor.jpg")
    ]

    for question, image_path in questions:
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(image, question, return_tensors="pt")

            with torch.no_grad():
                out = model.generate(**inputs, max_length=20)

            answer = processor.decode(out[0], skip_special_tokens=True)
            print(f"Q: {question}")
            print(f"A: {answer}\n")
        except FileNotFoundError:
            print(f"이미지 없음: {image_path}")

MMBench

MMBench는 상하이 AI 연구소가 발표한 멀티모달 LLM 평가 벤치마크로, 20개 능력 차원에 걸쳐 3,000개의 객관식 문제를 포함합니다.

평가 차원 (예시):

  • Attribute Recognition (속성 인식)
  • Spatial Relationship (공간 관계)
  • Action Recognition (행동 인식)
  • OCR (광학 문자 인식)
  • Commonsense Reasoning (상식 추론)
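MMBench는 선택지 순서를 순환 이동시켜도 같은 문제를 맞히는지 확인하는 CircularEval 방식으로 채점합니다. 아래는 그 아이디어를 보여주는 스케치이며, `answer_fn`(질문과 선택지를 받아 선택 인덱스를 반환)은 설명용으로 가정한 인터페이스입니다:

```python
def circular_eval(answer_fn, question, options, gold_idx):
    """선택지를 순환 이동한 모든 배열에서 정답을 맞혀야 1점을 주는
    CircularEval 스케치. answer_fn(question, options) -> 선택 인덱스."""
    n = len(options)
    for shift in range(n):
        shifted = options[shift:] + options[:shift]
        new_gold = (gold_idx - shift) % n  # 이동 후 정답 위치
        if answer_fn(question, shifted) != new_gold:
            return 0  # 한 배열이라도 틀리면 오답 처리
    return 1
```

순서 편향(예: 항상 A를 고르는 모델)을 걸러내기 위한 장치입니다.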

MMMU (Massive Multidiscipline Multimodal Understanding)

MMMU는 대학 수준의 멀티모달 이해를 평가하는 벤치마크로, 6개 핵심 분야(Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering)의 30개 과목에 걸친 11,550개 문제를 포함합니다.

import ast
from datasets import load_dataset

def explore_mmmu():
    """MMMU 데이터셋 탐색"""
    dataset = load_dataset("MMMU/MMMU", "Accounting")
    print(f"Accounting 태스크 검증 데이터: {len(dataset['validation'])} 문제")

    item = dataset['validation'][0]
    # 선택지는 리스트의 문자열 표현("['...', ...]")으로 저장되어 있습니다
    options = ast.literal_eval(item['options'])
    print(f"\n질문: {item['question']}")
    for letter, option in zip("ABCDE", options):
        print(f"선택지 {letter}: {option}")
    print(f"정답: {item['answer']}")

    # 이미지 포함 여부 확인
    if item['image_1'] is not None:
        print("이미지 포함 문제")

# 멀티모달 모델 평가 (예: LLaVA)
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

import ast

def evaluate_mmmu_with_llava(model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
    processor = LlavaNextProcessor.from_pretrained(model_name)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")
    correct = 0
    total = min(50, len(dataset))

    for item in dataset.select(range(total)):
        question = item['question']
        options = ast.literal_eval(item['options'])  # 문자열로 저장된 선택지 리스트
        gold = item['answer']

        if item['image_1'] is not None:
            # LLaVA-NeXT(Mistral)의 이미지 자리표시 토큰은 <image>
            prompt = f"[INST] <image>\nQuestion: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, images=item['image_1'],
                               return_tensors='pt').to(model.device)
        else:
            prompt = f"[INST] Question: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, return_tensors='pt').to(model.device)

        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=10)

        response = processor.decode(output[0], skip_special_tokens=True)
        # 프롬프트 이후 생성분에서 첫 글자를 선택지 문자로 사용
        answer_part = response.split("[/INST]")[-1].strip()
        pred = answer_part[0].upper() if answer_part else 'A'

        if pred == gold:
            correct += 1

    acc = correct / total
    print(f"MMMU-Accounting 정확도: {acc:.3f}")
    return acc

8. LM-Evaluation-Harness 사용법

EleutherAI의 lm-evaluation-harness는 LLM 평가의 표준 도구로, 100개 이상의 벤치마크를 지원합니다.

설치 및 기본 사용법

# 설치
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# MMLU 평가 (GPT-2)
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path results/gpt2_mmlu

# 여러 태스크 동시 평가
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,gsm8k \
    --num_fewshot 5 \
    --batch_size 4 \
    --output_path results/llama3.2_3b

# HuggingFace 모델 (4비트 양자화로 실행)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 1

Python API 사용

import lm_eval
from lm_eval import evaluator, utils
from lm_eval.models.huggingface import HFLM

def run_comprehensive_evaluation(model_path, output_dir="./results"):
    """LM-Evaluation-Harness 종합 평가"""
    import os
    os.makedirs(output_dir, exist_ok=True)

    # 평가할 태스크 정의
    task_groups = {
        "knowledge": ["mmlu", "arc_challenge", "arc_easy"],
        "reasoning": ["hellaswag", "winogrande", "piqa"],
        "truthfulness": ["truthfulqa_mc1"],
        "math": ["gsm8k"],
        "coding": ["humaneval"],
    }

    all_results = {}

    import torch

    # 그룹의 첫 태스크를 기준으로 few-shot 수를 정합니다
    # (태스크별로 다르게 주려면 태스크 단위로 나누어 호출해야 합니다)
    fewshot_map = {"mmlu": 5, "arc_challenge": 25, "hellaswag": 10,
                   "truthfulqa_mc1": 0, "gsm8k": 5, "winogrande": 5,
                   "piqa": 0, "humaneval": 0, "arc_easy": 25}

    for group, tasks in task_groups.items():
        print(f"\n=== {group.upper()} 평가 중 ===")

        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            num_fewshot=fewshot_map.get(tasks[0], 0),
            batch_size="auto",
            device="cuda" if torch.cuda.is_available() else "cpu",
        )

        all_results[group] = results['results']

        # 결과 출력
        for task, metrics in results['results'].items():
            if 'acc,none' in metrics:
                print(f"  {task}: {metrics['acc,none']*100:.1f}%")
            elif 'exact_match,strict-match' in metrics:
                print(f"  {task}: {metrics['exact_match,strict-match']*100:.1f}%")

    # 종합 결과 저장
    import json
    with open(f"{output_dir}/evaluation_results.json", "w", encoding="utf-8") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)

    print(f"\n결과 저장 완료: {output_dir}/evaluation_results.json")
    return all_results


def compare_models(model_paths, tasks=None):
    """여러 모델 비교 평가"""
    if tasks is None:
        tasks = ["mmlu", "arc_challenge", "hellaswag", "gsm8k"]

    comparison = {}

    for model_path in model_paths:
        print(f"\n평가 중: {model_path}")
        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            num_fewshot=5,
            batch_size="auto"
        )

        model_scores = {}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)) and not metric.endswith('_stderr'):
                    model_scores[f"{task}/{metric}"] = round(value * 100, 2)

        comparison[model_path.split('/')[-1]] = model_scores

    # 비교 테이블 출력
    print("\n" + "="*80)
    print("모델 비교 결과:")
    print("="*80)

    all_metrics = sorted(set().union(*[s.keys() for s in comparison.values()]))
    header = f"{'메트릭':<40}" + "".join(f"{m[:15]:<18}" for m in comparison.keys())
    print(header)
    print("-" * 80)

    for metric in all_metrics:
        if 'acc,none' in metric or 'exact_match' in metric:
            row = f"{metric:<40}"
            for model_name in comparison:
                score = comparison[model_name].get(metric, "N/A")
                row += f"{score:<18}"
            print(row)

    return comparison

커스텀 태스크 추가

# custom_task.py
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class KoreanQATask(Task):
    """한국어 QA 커스텀 태스크"""
    VERSION = 1.0
    DATASET_PATH = "your-org/korean-qa-dataset"
    DATASET_NAME = None

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"질문: {doc['question']}\n답변:"

    def doc_to_target(self, doc):
        return " " + doc['answer']

    def construct_requests(self, doc, ctx):
        return [Instance(
            request_type="generate_until",
            doc=doc,
            arguments=(ctx, {"until": ["\n", "질문:"]}),
            idx=0
        )]

    def process_results(self, doc, results):
        gold = doc['answer'].lower().strip()
        pred = results[0].lower().strip()
        return {"exact_match": int(gold == pred)}

    def aggregation(self):
        return {"exact_match": "mean"}

    def higher_is_better(self):
        return {"exact_match": True}

마무리

AI 벤치마크 데이터셋은 AI 연구와 개발의 나침반 역할을 합니다. 주요 내용을 정리하면:

컴퓨터 비전:

  • ImageNet: 1,000개 클래스 분류의 황금 기준
  • COCO: 객체 탐지와 세그멘테이션의 표준
  • ADE20K: 시맨틱 세그멘테이션 주요 벤치마크

NLP:

  • GLUE/SuperGLUE: 언어 이해 능력 종합 평가
  • SQuAD: 기계 독해의 표준 벤치마크

LLM 능력:

  • MMLU: 57개 분야 지식 평가 (가장 광범위)
  • HumanEval: 코드 생성 능력 평가
  • GSM8K: 수학적 추론 능력

종합 평가:

  • HELM: 7개 차원 균형 평가
  • Chatbot Arena: 실제 인간 선호도 ELO 기반
  • Open LLM Leaderboard: 오픈소스 LLM 비교

한국어:

  • KLUE: 8개 태스크 한국어 이해 평가
  • KMMLU: 한국어 지식 능력 평가

벤치마크를 해석할 때는 항상 데이터셋 오염 가능성, 측정 편향, 실제 사용 환경과의 괴리를 염두에 두어야 합니다. 단일 벤치마크가 아닌 다양한 측면에서의 종합 평가가 모델의 실제 능력을 더 잘 반영합니다.



AI Benchmark Datasets Complete Guide: ImageNet, COCO, GLUE, MMLU, HumanEval

Table of Contents

  1. Why AI Benchmarks Matter
  2. Computer Vision Benchmarks
  3. NLP Benchmarks
  4. LLM Capability Benchmarks
  5. Comprehensive LLM Evaluation
  6. Korean Language Benchmarks
  7. Multimodal Benchmarks
  8. Using LM-Evaluation-Harness

1. Why AI Benchmarks Matter

The Need for Standardized Evaluation

How should AI models be compared? When two image classification models exist, a common standard is needed to determine which is better. Benchmark datasets provide exactly that common ground.

Without standardized benchmarks, each team could evaluate only on data favorable to them, making objective comparison impossible. Standard benchmarks like ImageNet, GLUE, and MMLU have enabled the AI research community to compete on the same test, measuring progress and setting direction.

Leaderboards and Competition

Benchmarks make AI progress visible through leaderboards.

  • ImageNet ILSVRC: AlexNet reduced Top-5 error from 26% to 15.3% in 2012, launching the deep learning revolution.
  • GLUE/SuperGLUE: Documented the journey of BERT, RoBERTa, T5, and others surpassing human-level performance.
  • HumanEval: Became the arena where GPT-4, Claude, Gemini, and others compete on code generation.
  • LMSYS Chatbot Arena: Real human users blindly compare two models and vote, producing ELO ratings.

Limitations and Biases of Benchmarks

Benchmarks are powerful tools with clear limitations.

1. Dataset Contamination

LLMs are trained on vast internet text. If benchmark test data is present in training data, the model may be memorizing answers rather than genuinely solving problems. Even the GPT-4 technical report acknowledged this issue.

2. Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." Researchers who focus only on improving specific benchmark scores can raise scores without genuine capability improvements.

3. Bias and Representativeness

Many benchmarks are heavily weighted toward English and Western cultural data. Performance in Korean, Arabic, Swahili, and other languages can differ substantially from English benchmark scores.

4. Static Standards

Benchmarks do not change once created, but AI models continually improve. A difficult benchmark in 2023 can reach near-saturation by 2025.

5. Gap from Real-World Performance

High benchmark scores do not guarantee good performance in actual deployment. User experience, creativity, safety, and other hard-to-quantify factors matter just as much.


2. Computer Vision Benchmarks

ImageNet (ILSVRC)

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is the most influential benchmark in computer vision history. Originating from the ImageNet project (2009) led by Professor Fei-Fei Li at Stanford, it ran as an annual competition from 2010 to 2017.

Dataset Characteristics:

  • 1,000 classes (everyday objects: dogs, cats, cars, etc.)
  • Training data: approximately 1.2 million images
  • Validation data: 50,000 images
  • Test data: 100,000 images
  • Average of about 1,200 images per class

Key Metrics:

  • Top-1 Accuracy: Fraction of predictions where the top-1 predicted class is the correct label
  • Top-5 Accuracy: Fraction where the correct label appears in the top 5 predictions

Historical Progress:

Year    Model        Top-5 Error
2010    NEC-UIUC     28.2%
2012    AlexNet      15.3%
2014    VGG-16       7.3%
2015    ResNet-152   3.57%
2017    SENet        2.25%
2021    CoAtNet      0.95%
2023    ViT-22B      ~0.6%

Human Top-5 error is estimated at about 5.1%. After ResNet surpassed human performance in 2015, research expanded to harder variants: ImageNet-A, ImageNet-R, and ImageNet-C.

# Measuring ImageNet validation accuracy with PyTorch
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader

def evaluate_imagenet(model, val_dir, batch_size=256):
    # Standard preprocessing (ImageNet validation standard)
    val_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    val_dataset = datasets.ImageFolder(val_dir, transform=val_transform)
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in val_loader:
            images = images.cuda()
            labels = labels.cuda()

            outputs = model(images)
            _, predicted = outputs.topk(5, 1, True, True)
            predicted = predicted.t()
            correct = predicted.eq(labels.view(1, -1).expand_as(predicted))

            top1_correct += correct[:1].reshape(-1).float().sum(0)
            top5_correct += correct[:5].reshape(-1).float().sum(0)
            total += labels.size(0)

    top1_acc = top1_correct / total * 100
    top5_acc = top5_correct / total * 100
    print(f"Top-1 Accuracy: {top1_acc:.2f}%")
    print(f"Top-5 Accuracy: {top5_acc:.2f}%")
    return top1_acc, top5_acc

# Example: Evaluate ResNet-50 (torchvision >= 0.13 weights API)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).cuda()
evaluate_imagenet(model, '/path/to/imagenet/val')

COCO (Common Objects in Context)

COCO is a large-scale object detection, segmentation, and image captioning benchmark released by Microsoft in 2014.

Dataset Characteristics:

  • 80 categories of everyday objects
  • 330,000+ images
  • 1.5+ million object instances
  • 5 captions per image (for captioning tasks)
  • Detailed instance segmentation masks

Key Metrics:

mAP (mean Average Precision) is COCO's primary metric. Various metrics exist depending on IoU (Intersection over Union) thresholds.
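mAP is built on IoU, which for two axis-aligned boxes is straightforward to compute. A minimal sketch (boxes in `[x1, y1, x2, y2]` format):

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

COCO's primary AP averages precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05, which is what `pycocotools` computes below.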

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json

def evaluate_coco_detection(annotation_file, result_file):
    # Load COCO ground truth
    coco_gt = COCO(annotation_file)

    # Load predictions
    coco_dt = coco_gt.loadRes(result_file)

    # Bounding box evaluation
    coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()

    stats = coco_eval.stats
    print(f"\n=== COCO Detection Results ===")
    print(f"AP @ IoU=0.50:0.95 (COCO primary): {stats[0]:.3f}")
    print(f"AP @ IoU=0.50 (PASCAL VOC style): {stats[1]:.3f}")
    print(f"AP @ IoU=0.75 (strict): {stats[2]:.3f}")
    print(f"AP small (area < 32^2): {stats[3]:.3f}")
    print(f"AP medium: {stats[4]:.3f}")
    print(f"AP large: {stats[5]:.3f}")
    print(f"AR (max=1 per image): {stats[6]:.3f}")
    print(f"AR (max=10 per image): {stats[7]:.3f}")
    print(f"AR (max=100 per image): {stats[8]:.3f}")
    return stats

# Explore COCO annotations
coco = COCO('instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person', 'car', 'dog'])
img_ids = coco.getImgIds(catIds=cat_ids[:1])

img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(ann_ids)
print(f"Image: {img['file_name']}, Annotations: {len(anns)}")
for ann in anns[:3]:
    cat = coco.loadCats(ann['category_id'])[0]
    print(f"  Category: {cat['name']}, Area: {ann['area']:.0f}px^2")

State-of-the-Art COCO Performance (2025):

Model              AP (box)   AP (mask)   Parameters
YOLOv8x            53.9       -           68M
DINO (Swin-L)      63.3       -           218M
Co-DINO (Swin-L)   64.1       54.0        218M
InternImage-H      65.4       56.1        2.18B

ADE20K - Semantic Segmentation

ADE20K, built by MIT CSAIL, is a semantic segmentation benchmark covering 150 categories across 25,000 images.

Key Metrics:

  • mIoU (mean Intersection over Union): Average IoU between predicted and ground-truth masks
  • aAcc: Pixel-level overall accuracy
  • mAcc: Per-class mean accuracy
import numpy as np

def compute_miou(pred_mask, gt_mask, num_classes=150):
    """Compute mIoU."""
    iou_list = []

    for cls in range(num_classes):
        pred_cls = (pred_mask == cls)
        gt_cls = (gt_mask == cls)

        intersection = np.logical_and(pred_cls, gt_cls).sum()
        union = np.logical_or(pred_cls, gt_cls).sum()

        if union == 0:
            continue  # Skip if class not present in image

        iou = intersection / union
        iou_list.append(iou)

    return np.mean(iou_list) if iou_list else 0.0

# Evaluation with mmsegmentation
from mmseg.apis import inference_segmentor, init_segmentor

config_file = 'configs/segformer/segformer_mit-b5_8xb2-160k_ade20k-512x512.py'
checkpoint_file = 'segformer_mit-b5_8x2_512x512_160k_ade20k_20220617_203542-745f14da.pth'

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'test_image.jpg')

Kinetics - Video Classification

Kinetics, provided by Google DeepMind, is a video action recognition benchmark.

  • Kinetics-400: 400 action classes, ~300,000 clips
  • Kinetics-600: 600 classes, ~500,000 clips
  • Kinetics-700: 700 classes

Primary metrics: Top-1 and Top-5 accuracy (averaged per clip).
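Clip-level Top-k accuracy can be computed directly from per-class scores; a minimal sketch assuming plain Python score lists rather than any particular framework:

```python
def topk_accuracy(scores, labels, k=5):
    """Clip-level Top-k accuracy.
    scores: list of per-class score lists (one per clip); labels: gold class indices."""
    correct = 0
    for clip_scores, gold in zip(scores, labels):
        # Indices of the k highest-scoring classes
        topk = sorted(range(len(clip_scores)),
                      key=lambda i: clip_scores[i], reverse=True)[:k]
        correct += int(gold in topk)
    return correct / len(labels)
```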

CIFAR-10/100

Small-scale image classification benchmarks widely used for rapid prototyping and paper validation.

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

def evaluate_cifar10(model, batch_size=128):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    testset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform
    )
    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)

    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in testloader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"CIFAR-10 Accuracy: {accuracy:.2f}%")
    return accuracy

3. NLP Benchmarks

GLUE (General Language Understanding Evaluation)

GLUE, introduced in 2018 by researchers from NYU, the University of Washington, and DeepMind, is an NLP evaluation benchmark consisting of 9 different language understanding tasks.

GLUE Task Composition:

Task    Description                          Size    Metric
CoLA    Grammatical acceptability            8,551   Matthews Corr.
SST-2   Sentiment classification             67K     Accuracy
MRPC    Semantic equivalence                 3,700   F1/Accuracy
STS-B   Sentence similarity score            7K      Pearson/Spearman
QQP     Question pair similarity             400K    F1/Accuracy
MNLI    Natural language inference (3-way)   393K    Accuracy
QNLI    Question-answer inference            105K    Accuracy
RTE     Textual entailment                   2,500   Accuracy
WNLI    Winograd NLI                         634     Accuracy
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
from sklearn.metrics import matthews_corrcoef

def evaluate_glue_cola(model_name="textattack/bert-base-uncased-CoLA"):
    """Evaluate CoLA (grammatical acceptability).
    A checkpoint fine-tuned on CoLA is assumed; loading a base model with a
    freshly initialized classification head would score near-zero MCC."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = load_dataset("glue", "cola")
    val_data = dataset["validation"]

    predictions = []
    labels = []

    model.eval()
    import torch

    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=128
        )

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        predictions.append(pred)
        labels.append(item['label'])

    mcc = matthews_corrcoef(labels, predictions)
    print(f"CoLA Matthews Correlation: {mcc:.4f}")
    return mcc

def evaluate_glue_sst2(model_name="textattack/bert-base-uncased-SST-2"):
    """Evaluate SST-2 (sentiment classification)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = load_dataset("glue", "sst2")
    val_data = dataset["validation"]

    correct = 0
    total = len(val_data)

    model.eval()
    import torch

    for item in val_data:
        inputs = tokenizer(
            item['sentence'],
            return_tensors='pt',
            truncation=True,
            max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()
        if pred == item['label']:
            correct += 1

    acc = correct / total
    print(f"SST-2 Accuracy: {acc:.4f}")
    return acc

SuperGLUE

When GLUE approached saturation (near-human performance), SuperGLUE was introduced in 2019 with harder tasks.

SuperGLUE Tasks:

  • BoolQ: Yes/no question answering (9,427)
  • CB: Commitment/entailment (250, 3-way)
  • COPA: Cause/effect reasoning (1,000)
  • MultiRC: Multi-sentence reading comprehension (9,693)
  • ReCoRD: Cloze-style reading comprehension (120K)
  • RTE: Textual entailment recognition (5,749)
  • WiC: Word-in-context disambiguation (9,600)
  • WSC: Winograd Schema Challenge (554)

Human baseline: 89.8 / GPT-4-class models: 90+ (surpassing humans)
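SuperGLUE is available through Hugging Face `datasets` (dataset id `super_glue`, one config per task). The formatter below is a sketch assuming the BoolQ schema (`passage`, `question`, `label`):

```python
def superglue_boolq_prompt(doc):
    """Format a SuperGLUE BoolQ example as a yes/no prompt.
    Field names follow the HF `super_glue`/`boolq` schema (assumed)."""
    return (f"Passage: {doc['passage']}\n"
            f"Question: {doc['question']}?\n"
            "Answer (yes or no):")

# Loading the real data (network required):
# from datasets import load_dataset
# data = load_dataset("super_glue", "boolq", split="validation")
# print(superglue_boolq_prompt(data[0]))
```

Accuracy is then the fraction of yes/no generations matching the integer `label` field.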

SQuAD 1.1 & 2.0

SQuAD (Stanford Question Answering Dataset) is a machine reading comprehension benchmark where answers are extracted from Wikipedia passages.

  • SQuAD 1.1: 536 Wikipedia articles, 107,785 question-answer pairs. All answers exist within the passage.
  • SQuAD 2.0: SQuAD 1.1 + 53,775 unanswerable questions added.

Evaluation Metrics:

  • EM (Exact Match): Fraction of predictions exactly matching the gold answer
  • F1 Score: Token-level partial match score
from datasets import load_dataset
from transformers import pipeline

def evaluate_squad(model_name="deepset/roberta-base-squad2"):
    """Evaluate SQuAD 2.0."""
    qa_pipeline = pipeline("question-answering", model=model_name)
    dataset = load_dataset("squad_v2", split="validation")

    em_scores = []
    f1_scores = []
    no_answer_correct = 0
    no_answer_total = 0

    for item in dataset.select(range(200)):
        context = item['context']
        question = item['question']
        answers = item['answers']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].lower().strip()

        has_answer = len(answers['text']) > 0

        if not has_answer:
            no_answer_total += 1
            abstained = result['score'] < 0.1  # heuristic "no answer" threshold
            if abstained:
                no_answer_correct += 1
            # Credit a correct abstention as an exact match
            em_scores.append(int(abstained))
            f1_scores.append(int(abstained))
        else:
            gold_answers = [a.lower().strip() for a in answers['text']]
            em = max(int(predicted == gold) for gold in gold_answers)
            em_scores.append(em)

            best_f1 = 0
            for gold in gold_answers:
                pred_tokens = set(predicted.split())
                gold_tokens = set(gold.split())
                common = pred_tokens & gold_tokens
                if len(common) == 0:
                    f1 = 0
                else:
                    precision = len(common) / len(pred_tokens)
                    recall = len(common) / len(gold_tokens)
                    f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
            f1_scores.append(best_f1)

    print(f"SQuAD 2.0 Results (200 samples):")
    print(f"  EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"  F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
    if no_answer_total > 0:
        print(f"  No-Answer Accuracy: {no_answer_correct/no_answer_total*100:.1f}%")

WMT - Machine Translation

WMT (Workshop on Machine Translation) is an annual competition evaluating machine translation models across multiple language pairs (English-German, English-Chinese, English-Korean, etc.).

Key Metrics:

  • BLEU (Bilingual Evaluation Understudy): Automatic evaluation based on n-gram precision
  • COMET: Neural metric with high correlation to human judgment
  • chrF: Character-level n-gram F-score
import sacrebleu

def compute_bleu(predictions, references):
    """Compute BLEU score."""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU: {bleu.score:.2f}")
    print(f"BP: {bleu.bp:.3f}")
    print(f"Ratio: {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu.score

from transformers import MarianMTModel, MarianTokenizer

def evaluate_translation(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-en-de"):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    predictions = []
    for text in src_texts[:100]:
        inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
        translated = model.generate(**inputs, max_length=512)
        pred = tokenizer.decode(translated[0], skip_special_tokens=True)
        predictions.append(pred)

    bleu_score = compute_bleu(predictions, tgt_texts[:100])
    return bleu_score

4. LLM Capability Benchmarks

MMLU (Massive Multitask Language Understanding)

MMLU, published by Dan Hendrycks and collaborators at UC Berkeley in 2020, evaluates LLM knowledge and reasoning with multiple-choice questions spanning high-school to professional difficulty across 57 academic subjects.

Domain Breakdown:

  • STEM: Mathematics, physics, chemistry, computer science, engineering
  • Humanities: History, philosophy, law, ethics
  • Social Sciences: Psychology, economics, political science, sociology
  • Other: Medicine, nutrition, moral scenarios, professional accounting

Each question is four-choice, with approximately 14,000 questions total.

MMLU Performance by Model:

Model                     MMLU Score   Year
GPT-3 (175B)              43.9%        2020
Gopher (280B)             60.0%        2021
GPT-4                     86.4%        2023
Claude 3 Opus             86.8%        2024
Gemini Ultra              90.0%        2024
GPT-4o                    88.7%        2024
Human expert (estimate)   ~90%         -
from datasets import load_dataset

def evaluate_mmlu(model_fn, subjects=None, num_few_shot=5):
    """MMLU evaluation function."""
    if subjects is None:
        subjects = ['abstract_algebra', 'anatomy', 'astronomy', 'college_mathematics']

    results = {}

    for subject in subjects:
        dataset = load_dataset("lukaemon/mmlu", subject)
        test_data = dataset['test']
        dev_data = dataset['dev']

        correct = 0
        total = 0

        # Build few-shot prompt
        few_shot_examples = ""
        for i, item in enumerate(dev_data.select(range(num_few_shot))):
            few_shot_examples += f"Q: {item['input']}\n"
            few_shot_examples += f"(A) {item['A']}  (B) {item['B']}  (C) {item['C']}  (D) {item['D']}\n"
            few_shot_examples += f"Answer: {item['target']}\n\n"

        for item in test_data:
            prompt = few_shot_examples
            prompt += f"Q: {item['input']}\n"
            prompt += f"(A) {item['A']}  (B) {item['B']}  (C) {item['C']}  (D) {item['D']}\n"
            prompt += "Answer:"

            response = model_fn(prompt)
            pred = response.strip()[0] if response.strip() else 'A'

            if pred == item['target']:
                correct += 1
            total += 1

        accuracy = correct / total
        results[subject] = accuracy
        print(f"{subject}: {accuracy:.3f} ({correct}/{total})")

    overall = sum(results.values()) / len(results)
    print(f"\nOverall average: {overall:.3f}")
    return results

BIG-Bench (Beyond the Imitation Game Benchmark)

BIG-Bench, led by Google, consists of 204 diverse tasks designed to probe the limits of LLMs. It includes creative reasoning, common sense, mathematics, and code that language models still struggle with.

BIG-Bench Hard: 23 difficult tasks where chain-of-thought prompting dramatically improves performance.

from lm_eval import evaluator

# BIG-Bench evaluation via lm-evaluation-harness
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["bigbench_causal_judgment", "bigbench_date_understanding"],
    num_fewshot=3,
    batch_size="auto"
)
print(results['results'])

HellaSwag - Commonsense Reasoning

HellaSwag, published in 2019, is a commonsense reasoning benchmark where the model selects the most natural sentence continuation from four choices.

from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

def evaluate_hellaswag(model_name="microsoft/deberta-v2-xxlarge"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)

    dataset = load_dataset("hellaswag", split="validation")

    correct = 0
    total = min(500, len(dataset))

    for item in dataset.select(range(total)):
        context = item['ctx']
        endings = item['endings']
        label = int(item['label'])

        # Encode (context, ending) sentence pairs for the 4 candidate endings
        encoding = tokenizer(
            [context] * 4,
            endings,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=256
        )

        with torch.no_grad():
            outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
            logits = outputs.logits
            predicted = logits.argmax(dim=-1).item()

        if predicted == label:
            correct += 1

    accuracy = correct / total
    print(f"HellaSwag Accuracy: {accuracy:.4f}")
    return accuracy

ARC (AI2 Reasoning Challenge)

ARC, published by AI2 (Allen Institute for AI), is an elementary-to-high-school level science question benchmark.

  • ARC-Easy: Relatively straightforward questions (5,197)
  • ARC-Challenge: Difficult questions that even retrieval-based models get wrong (1,172)
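ARC is distributed via Hugging Face `datasets` (`allenai/ai2_arc` with `ARC-Easy` / `ARC-Challenge` configs), with each item storing its choices as parallel `label`/`text` lists. A prompt formatter sketched under that assumed schema:

```python
def arc_prompt(doc):
    """Format an ARC item as a multiple-choice prompt.
    `choices` holds parallel `label`/`text` lists; `answerKey` is the gold letter."""
    lines = [f"Question: {doc['question']}"]
    for label, text in zip(doc['choices']['label'], doc['choices']['text']):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)

# Loading the real data (network required):
# from datasets import load_dataset
# data = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
# print(arc_prompt(data[0]))
```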

TruthfulQA - Factuality Evaluation

TruthfulQA evaluates how accurately a model responds to questions about widely held misconceptions, myths, and biases.

from datasets import load_dataset
from transformers import pipeline

def evaluate_truthfulqa(model_name="gpt2-xl"):
    """TruthfulQA MC1 (single correct answer) evaluation."""
    dataset = load_dataset("truthful_qa", "multiple_choice")
    val_data = dataset["validation"]

    generator = pipeline("text-generation", model=model_name)
    correct = 0
    total = min(100, len(val_data))

    for item in val_data.select(range(total)):
        question = item['question']
        choices = item['mc1_targets']['choices']
        labels = item['mc1_targets']['labels']
        correct_idx = labels.index(1)

        prompt = f"Q: {question}\nOptions:\n"
        for i, choice in enumerate(choices):
            letter = chr(65 + i)
            prompt += f"{letter}. {choice}\n"
        prompt += "Answer:"

        response = generator(prompt, max_new_tokens=5, do_sample=False)
        generated = response[0]['generated_text'][len(prompt):].strip()
        pred_letter = generated[0] if generated else 'A'
        pred_idx = ord(pred_letter) - 65

        if pred_idx == correct_idx:
            correct += 1

    accuracy = correct / total
    print(f"TruthfulQA MC1 Accuracy: {accuracy:.4f}")
    return accuracy

GSM8K - Grade School Math

GSM8K (Grade School Math 8K), published by OpenAI in 2021, consists of 8,500 grade school math word problems evaluating step-by-step mathematical reasoning.

from datasets import load_dataset
import re

def extract_number(text):
    """Extract the final numeric answer from text."""
    numbers = re.findall(r'-?\d+\.?\d*', text)
    return numbers[-1] if numbers else None

def evaluate_gsm8k_chain_of_thought(model_fn):
    """Evaluate GSM8K with Chain-of-Thought prompting.

    Uses a 2-shot prompt below for brevity; the original paper used 8 exemplars.
    """
    dataset = load_dataset("gsm8k", "main")
    test_data = dataset['test']

    few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans times 3 balls = 6 balls. 5 + 6 = 11 balls. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A: They started with 23. Used 20: 23 - 20 = 3. Then bought 6: 3 + 6 = 9. The answer is 9.

"""

    correct = 0
    total = min(200, len(test_data))

    for item in test_data.select(range(total)):
        question = item['question']
        gold_answer = item['answer'].split('####')[-1].strip()

        prompt = few_shot_prompt + f"Q: {question}\nA:"
        response = model_fn(prompt, max_tokens=256)

        pred = extract_number(response)
        gold = extract_number(gold_answer)

        if pred and gold and abs(float(pred) - float(gold)) < 0.01:
            correct += 1

    accuracy = correct / total
    print(f"GSM8K Accuracy (Chain-of-Thought): {accuracy:.4f}")
    return accuracy

HumanEval - Code Generation

HumanEval, published by OpenAI in 2021, consists of 164 Python function signatures with docstrings where the model must write the complete function.

Metric: pass@k

The probability that at least one of k generated samples passes all of the task's unit tests.
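In practice, more than k samples are generated per task and pass@k is computed with the unbiased combinatorial estimator from the Codex paper (Chen et al., 2021), rather than by averaging raw pass rates. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: samples generated for the task, c: samples that passed all tests,
    k: evaluation budget. This is the probability that at least one of k
    samples drawn (without replacement) from the n generations passes.
    """
    if n - c < k:
        # fewer than k failing samples exist, so any k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k=1 this reduces to c/n, the per-task pass rate.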

from datasets import load_dataset
import subprocess
import tempfile
import os

def evaluate_humaneval(model_fn, n=10, temperature=0.8):
    """HumanEval pass@1 evaluation over n samples per task.

    WARNING: this executes model-generated code directly; run it only in a sandbox.
    """
    dataset = load_dataset("openai_humaneval")
    test_data = dataset['test']

    task_results = {}

    for item in test_data.select(range(20)):
        task_id = item['task_id']
        prompt = item['prompt']
        tests = item['test']
        entry_point = item['entry_point']

        passes = 0

        for attempt in range(n):
            code = model_fn(prompt, temperature=temperature)

            full_code = prompt + code + "\n" + tests + f"\ncheck({entry_point})"

            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                f.write(full_code)
                tmp_path = f.name

            try:
                result = subprocess.run(
                    ['python', tmp_path],
                    timeout=10,
                    capture_output=True,
                    text=True
                )
                if result.returncode == 0:
                    passes += 1
            except subprocess.TimeoutExpired:
                pass
            finally:
                os.unlink(tmp_path)

        task_results[task_id] = passes / n

    pass_at_1 = sum(task_results.values()) / len(task_results)
    print(f"pass@1: {pass_at_1:.4f}")
    return pass_at_1

# HumanEval performance of major models (2025)
humaneval_scores = {
    "GPT-3 (175B)": 0.0,
    "Codex (12B)": 0.288,
    "GPT-4": 0.870,
    "Claude 3.5 Sonnet": 0.900,
    "DeepSeek-Coder-33B": 0.823,
    "Llama 3.1 70B": 0.803,
}

MBPP - Python Programming

MBPP (Mostly Basic Python Problems), published by Google, consists of 974 entry-level Python programming problems, each paired with a text description and test cases.

from datasets import load_dataset

def explore_mbpp():
    """Explore the MBPP dataset."""
    dataset = load_dataset("mbpp")
    test_data = dataset['test']

    print("Sample MBPP problems:")
    for item in test_data.select(range(3)):
        print(f"\nTask ID: {item['task_id']}")
        print(f"Problem: {item['text']}")
        print(f"Test cases: {item['test_list'][:2]}")
        print(f"Reference code:\n{item['code']}")
        print("-" * 50)

5. Comprehensive LLM Evaluation

MT-Bench - Multi-Turn Dialogue Evaluation

MT-Bench, developed by the LMSYS team at UC Berkeley, is a multi-turn dialogue evaluation benchmark that uses GPT-4 as a judge, scoring responses on a 1-10 scale.

8 categories, 10 questions each:

  • Writing
  • Roleplay
  • Reasoning
  • Math
  • Coding
  • Extraction
  • STEM
  • Humanities

from openai import OpenAI

def mt_bench_judge(question, answer, reference_answer=None):
    """Evaluate MT-Bench response using GPT-4."""
    client = OpenAI()

    system_prompt = """You are a helpful assistant that evaluates AI responses.
Rate the response on a scale of 1-10 based on: accuracy, relevance, completeness, and clarity.
Output format: Score: X/10\nRationale: [brief explanation]"""

    user_prompt = f"""Question: {question}

AI Response: {answer}

{f'Reference Answer: {reference_answer}' if reference_answer else ''}

Please evaluate this response."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.0
    )

    judge_response = response.choices[0].message.content
    print(f"Evaluation:\n{judge_response}")
    return judge_response

# Official MT-Bench usage with FastChat:
# git clone https://github.com/lm-sys/FastChat
# python -m fastchat.llm_judge.gen_model_answer --model-path your-model
# python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4
# python -m fastchat.llm_judge.show_result
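The judge returns free text, so the numeric score still has to be parsed out. A small helper for the `Score: X/10` convention requested in the system prompt above (LLM judges occasionally drift from the format, hence the `None` fallback):

```python
import re

def parse_judge_score(judge_response: str):
    """Extract a numeric score from a judge reply like 'Score: 8/10'.

    Returns None when no score is found, so callers can retry or skip.
    """
    match = re.search(r'Score:\s*(\d+(?:\.\d+)?)\s*/\s*10', judge_response)
    return float(match.group(1)) if match else None
```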

LMSYS Chatbot Arena

Chatbot Arena has real users compare responses from two anonymous models and vote for the better one. Using the ELO rating system, it reflects genuine human preferences.

Top Models by ELO (March 2025, approximate):

| Rank | Model | ELO |
|------|-------|-----|
| 1 | GPT-4.5 | ~1370 |
| 2 | Gemini 2.0 Ultra | ~1360 |
| 3 | Claude 3.7 Sonnet | ~1350 |
| 4 | GPT-4o | ~1340 |
| 5 | Llama 3.3 70B | ~1250 |
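Arena ratings were originally computed with online ELO updates after each vote (LMSYS later moved to a Bradley–Terry maximum-likelihood fit, which is order-independent). A sketch of the classic update, with an illustrative K-factor:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One ELO update after a pairwise comparison.

    score_a: 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    k: update step size (32 is an illustrative, chess-style choice).
    """
    # expected win probability of A under the logistic ELO model
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Two equally rated models start even; a single win moves the winner up by k/2 and the loser down by the same amount, so total rating is conserved.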

HELM (Holistic Evaluation of Language Models)

HELM, developed by Stanford CRFM, evaluates models on 7 metrics rather than accuracy alone:

  1. Accuracy
  2. Calibration
  3. Robustness
  4. Fairness
  5. Bias
  6. Toxicity
  7. Efficiency

# Run HELM evaluation
pip install crfm-helm

helm-run \
    --conf src/helm/benchmark/presentation/run_specs_lite.conf \
    --local \
    --max-eval-instances 1000 \
    --num-train-trials 1

# View results
helm-summarize --suite v1
helm-server

Open LLM Leaderboard (HuggingFace)

The HuggingFace Open LLM Leaderboard is a public leaderboard evaluating open-source LLMs on consistent benchmarks.

Evaluation Tasks:

  • MMLU (5-shot)
  • ARC Challenge (25-shot)
  • HellaSwag (10-shot)
  • TruthfulQA (0-shot)
  • Winogrande (5-shot)
  • GSM8K (5-shot)

from huggingface_hub import HfApi

def fetch_leaderboard_data():
    """Fetch Open LLM Leaderboard data."""
    api = HfApi()

    dataset_info = api.dataset_info("open-llm-leaderboard/results")
    print(f"Last updated: {dataset_info.lastModified}")

    files = api.list_repo_files(
        repo_id="open-llm-leaderboard/results",
        repo_type="dataset"
    )

    for f in list(files)[:5]:
        print(f"File: {f}")

6. Korean Language Benchmarks

KLUE (Korean Language Understanding Evaluation)

KLUE, released in 2021 by a consortium of Korean companies, universities, and research groups, is a Korean language understanding benchmark consisting of 8 tasks.

KLUE Tasks:

| Task | Type | Data Size | Metric |
|------|------|-----------|--------|
| TC (Topic Classification) | Document classification | 60K | Accuracy |
| STS (Semantic Textual Similarity) | Sentence similarity | 13K | Pearson |
| NLI (Natural Language Inference) | 3-way classification | 30K | Accuracy |
| NER (Named Entity Recognition) | Entity extraction | 21K | Entity F1 |
| RE (Relation Extraction) | Relation classification | 32K | micro-F1 |
| DP (Dependency Parsing) | Syntactic analysis | 23K | UAS/LAS |
| MRC (Machine Reading Comprehension) | Reading comprehension | 24K | EM/F1 |
| DST (Dialogue State Tracking) | Dialogue tracking | 10K | JGA |

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def evaluate_klue_nli(model_name="klue/roberta-large"):
    """Evaluate KLUE-NLI."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # NOTE: klue/roberta-large is a pretrained encoder only — the classification
    # head added here is randomly initialized. Fine-tune on KLUE-NLI (or load a
    # checkpoint already fine-tuned on it) before trusting the reported accuracy.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3
    )

    dataset = load_dataset("klue", "nli")
    val_data = dataset['validation']

    correct = 0
    total = min(500, len(val_data))

    model.eval()
    for item in val_data.select(range(total)):
        premise = item['premise']
        hypothesis = item['hypothesis']
        gold_label = item['label']

        inputs = tokenizer(
            premise, hypothesis,
            return_tensors='pt',
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        if pred == gold_label:
            correct += 1

    accuracy = correct / total
    print(f"KLUE-NLI Accuracy: {accuracy:.4f}")
    return accuracy

def evaluate_klue_mrc(model_name="klue/roberta-large"):
    """Evaluate KLUE-MRC (Machine Reading Comprehension)."""
    from transformers import pipeline

    qa_pipeline = pipeline(
        "question-answering",
        model=model_name,
        tokenizer=model_name
    )

    dataset = load_dataset("klue", "mrc")
    val_data = dataset['validation']

    em_scores = []
    f1_scores = []

    for item in val_data.select(range(100)):
        context = item['context']
        question = item['question']
        answers = item['answers']['text']

        result = qa_pipeline(question=question, context=context)
        predicted = result['answer'].strip()

        em = max(int(predicted == a) for a in answers)
        em_scores.append(em)

        # Character-set overlap F1 — a rough proxy for the official token-level
        # F1 (duplicates are ignored because sets are used).
        best_f1 = 0
        for gold in answers:
            pred_chars = set(predicted)
            gold_chars = set(gold)
            common = pred_chars & gold_chars
            if common:
                precision = len(common) / len(pred_chars)
                recall = len(common) / len(gold_chars)
                f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
        f1_scores.append(best_f1)

    print(f"KLUE-MRC EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
    print(f"KLUE-MRC F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
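The character-set F1 above ignores duplicate tokens; the standard SQuAD-style F1 compares token multisets instead. A sketch using whitespace tokenization (an assumption — Korean MRC evaluation often works at the character or morpheme level):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style F1 over token multisets (duplicates are counted)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # multiset intersection keeps repeated tokens, unlike set overlap
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```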

KoBEST

KoBEST (Korean Balanced Evaluation of Significant Tasks), developed by KAIST, includes 5 tasks:

  • BoolQ: Yes/no question answering
  • COPA: Cause/effect reasoning
  • WiC: Word-in-context disambiguation
  • HellaSwag: Commonsense completion
  • SentiNeg: Negation sentiment understanding

KMMLU (Korean MMLU)

KMMLU is a Korean counterpart to MMLU built from original Korean exam questions rather than translations, and includes Korea-specific subjects (Korean history, Korean law, etc.) alongside general academic topics.

from datasets import load_dataset

def evaluate_kmmlu_sample():
    """Explore the KMMLU dataset."""
    dataset = load_dataset("HAERAE-HUB/KMMLU")
    test_data = dataset['test']

    print(f"Total questions: {len(test_data)}")

    subjects = set(test_data['subject'])
    print(f"Number of subjects: {len(subjects)}")
    print(f"Sample subjects: {list(subjects)[:10]}")

    item = test_data[0]
    print(f"\nSubject: {item['subject']}")
    print(f"Question: {item['question']}")
    print(f"A: {item['A']}")
    print(f"B: {item['B']}")
    print(f"C: {item['C']}")
    print(f"D: {item['D']}")
    print(f"Answer: {item['answer']}")

7. Multimodal Benchmarks

VQA (Visual Question Answering)

VQA is the task of answering natural language questions about images.

  • VQA v2: ~1.1M (image, question, answer) triples. Each question is paired with two similar images that yield different answers, to discourage language-prior shortcuts.
  • Metric: Accuracy(ans) = min(#annotators who gave ans / 3, 1) — an answer is fully correct when at least 3 of the 10 annotators agree with it
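The consensus rule can be written directly; this sketch implements the simplified form (the official evaluator additionally averages over all leave-one-out subsets of 9 annotators and applies answer normalization):

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """Simplified VQA accuracy: min(#matching annotators / 3, 1).

    human_answers: the 10 annotator answers collected for the question.
    """
    pred = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)
```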

from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image

def evaluate_vqa_blip():
    """VQA evaluation with BLIP."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.eval()

    questions = [
        ("What color is the car?", "test_car.jpg"),
        ("How many people are in the image?", "test_crowd.jpg"),
        ("Is it raining?", "test_outdoor.jpg")
    ]

    for question, image_path in questions:
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(image, question, return_tensors="pt")

            with torch.no_grad():
                out = model.generate(**inputs, max_length=20)

            answer = processor.decode(out[0], skip_special_tokens=True)
            print(f"Q: {question}")
            print(f"A: {answer}\n")
        except FileNotFoundError:
            print(f"Image not found: {image_path}")

MMBench

MMBench, published by Shanghai AI Lab, is a multimodal LLM evaluation benchmark covering 20 capability dimensions with 3,000 multiple-choice questions.

Sample Dimensions:

  • Attribute Recognition
  • Spatial Relationship
  • Action Recognition
  • OCR
  • Commonsense Reasoning

MMMU (Massive Multidiscipline Multimodal Understanding)

MMMU evaluates university-level multimodal understanding across 6 core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering), 30 subjects, and 11,550 questions.

from datasets import load_dataset
import torch

def explore_mmmu():
    """Explore the MMMU dataset."""
    import ast

    dataset = load_dataset("MMMU/MMMU", "Accounting")
    print(f"Accounting validation: {len(dataset['validation'])} questions")

    item = dataset['validation'][0]
    # 'options' is stored as a stringified Python list
    options = ast.literal_eval(item['options'])
    print(f"\nQuestion: {item['question']}")
    for letter, option in zip("ABCDE", options):
        print(f"Option {letter}: {option}")
    print(f"Answer: {item['answer']}")

    if item['image_1']:
        print("Image included")

def evaluate_mmmu_with_llava(model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
    """Evaluate MMMU Accounting with LLaVA-NeXT (Mistral prompt format)."""
    import ast
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    processor = LlavaNextProcessor.from_pretrained(model_name)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")
    correct = 0
    total = min(50, len(dataset))

    for item in dataset.select(range(total)):
        question = item['question']
        options = ast.literal_eval(item['options'])  # stored as a stringified list
        gold = item['answer']

        option_text = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
        if item['image_1']:
            image = item['image_1']
            prompt = f"[INST] <image>\nQuestion: {question}\n{option_text}\nAnswer with only the option letter. [/INST]"
            inputs = processor(prompt, image, return_tensors='pt').to(model.device)
        else:
            prompt = f"[INST] Question: {question}\n{option_text}\nAnswer with only the option letter. [/INST]"
            inputs = processor(prompt, return_tensors='pt').to(model.device)

        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=10)
            # decode only the newly generated tokens, not the echoed prompt
            new_tokens = output[0][inputs['input_ids'].shape[1]:]
            response = processor.decode(new_tokens, skip_special_tokens=True).strip()
            pred = response[0].upper() if response else 'A'

        if pred == gold:
            correct += 1

    acc = correct / total
    print(f"MMMU-Accounting Accuracy: {acc:.3f}")
    return acc

8. Using LM-Evaluation-Harness

EleutherAI's lm-evaluation-harness is the standard tool for LLM evaluation, supporting 100+ benchmarks.

Installation and Basic Usage

# Install
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Evaluate MMLU with GPT-2
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path results/gpt2_mmlu

# Evaluate multiple tasks simultaneously
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
    --tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,gsm8k \
    --num_fewshot 5 \
    --batch_size 4 \
    --output_path results/llama3.2_3b

# Run with 4-bit quantization
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 1

Python API

import lm_eval
from lm_eval import evaluator
import json
import os

def run_comprehensive_evaluation(model_path, output_dir="./results"):
    """Run comprehensive LM-Evaluation-Harness evaluation."""
    os.makedirs(output_dir, exist_ok=True)

    task_groups = {
        "knowledge": ["mmlu", "arc_challenge", "arc_easy"],
        "reasoning": ["hellaswag", "winogrande", "piqa"],
        "truthfulness": ["truthfulqa_mc1"],
        "math": ["gsm8k"],
        "coding": ["humaneval"],
    }

    all_results = {}

    fewshot_map = {
        "mmlu": 5, "arc_challenge": 25, "arc_easy": 25,
        "hellaswag": 10, "winogrande": 5, "piqa": 0,
        "truthfulqa_mc1": 0, "gsm8k": 5, "humaneval": 0
    }

    for group, tasks in task_groups.items():
        print(f"\n=== Evaluating {group.upper()} ===")

        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            num_fewshot=fewshot_map.get(tasks[0], 0),  # NOTE: the first task's few-shot count is applied to the whole group
            batch_size="auto",
            device="cuda" if __import__("torch").cuda.is_available() else "cpu",
        )

        all_results[group] = results['results']

        for task, metrics in results['results'].items():
            if 'acc,none' in metrics:
                print(f"  {task}: {metrics['acc,none']*100:.1f}%")
            elif 'exact_match,strict-match' in metrics:
                print(f"  {task}: {metrics['exact_match,strict-match']*100:.1f}%")

    with open(f"{output_dir}/evaluation_results.json", "w", encoding="utf-8") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)

    print(f"\nResults saved to: {output_dir}/evaluation_results.json")
    return all_results


def compare_models(model_paths, tasks=None):
    """Compare multiple models on the same tasks."""
    if tasks is None:
        tasks = ["mmlu", "arc_challenge", "hellaswag", "gsm8k"]

    comparison = {}

    for model_path in model_paths:
        print(f"\nEvaluating: {model_path}")
        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_path}",
            tasks=tasks,
            num_fewshot=5,
            batch_size="auto"
        )

        model_scores = {}
        for task, metrics in results['results'].items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)) and not metric.endswith('_stderr'):
                    model_scores[f"{task}/{metric}"] = round(value * 100, 2)

        comparison[model_path.split('/')[-1]] = model_scores

    print("\n" + "="*80)
    print("Model Comparison Results:")
    print("="*80)

    all_metrics = sorted(set().union(*[s.keys() for s in comparison.values()]))
    header = f"{'Metric':<40}" + "".join(f"{m[:15]:<18}" for m in comparison.keys())
    print(header)
    print("-" * 80)

    for metric in all_metrics:
        if 'acc,none' in metric or 'exact_match' in metric:
            row = f"{metric:<40}"
            for model_name in comparison:
                score = comparison[model_name].get(metric, "N/A")
                row += f"{score:<18}"
            print(row)

    return comparison

Adding a Custom Task

# custom_task.py
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance

class CustomQATask(Task):
    """Custom Q&A task for lm-evaluation-harness."""
    VERSION = 1.0
    DATASET_PATH = "your-org/your-qa-dataset"
    DATASET_NAME = None

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        return " " + doc['answer']

    def construct_requests(self, doc, ctx):
        return [Instance(
            request_type="generate_until",
            doc=doc,
            arguments=(ctx, {"until": ["\n", "Question:"]}),
            idx=0
        )]

    def process_results(self, doc, results):
        gold = doc['answer'].lower().strip()
        pred = results[0].lower().strip()
        return {"exact_match": int(gold == pred)}

    def aggregation(self):
        return {"exact_match": "mean"}

    def higher_is_better(self):
        return {"exact_match": True}
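In lm-evaluation-harness v0.4+, custom tasks are more commonly declared as YAML configs than as Task subclasses. A sketch for the same hypothetical dataset (field names assume the v0.4 task schema; compare against the task templates shipped in the repo):

```yaml
# custom_qa.yaml — declarative task definition for lm-eval v0.4+
task: custom_qa
dataset_path: your-org/your-qa-dataset
output_type: generate_until
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "\n"
    - "Question:"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```

Loaded at evaluation time by pointing the CLI at the directory containing the YAML, e.g. `lm_eval --tasks custom_qa --include_path ./`.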

Summary

AI benchmark datasets are the compass guiding AI research and development. Key takeaways:

Computer Vision:

  • ImageNet: The gold standard for 1,000-class image classification
  • COCO: The standard for object detection and segmentation
  • ADE20K: Primary benchmark for semantic segmentation

NLP:

  • GLUE/SuperGLUE: Comprehensive language understanding evaluation
  • SQuAD: The standard benchmark for machine reading comprehension

LLM Capabilities:

  • MMLU: Knowledge evaluation across 57 disciplines (broadest scope)
  • HumanEval: Code generation capability evaluation
  • GSM8K: Mathematical reasoning evaluation

Comprehensive Evaluation:

  • HELM: Balanced evaluation across 7 dimensions
  • Chatbot Arena: Human preference-based ELO ratings
  • Open LLM Leaderboard: Comparing open-source LLMs

Korean Language:

  • KLUE: 8-task Korean language understanding evaluation
  • KMMLU: Korean knowledge evaluation

When interpreting benchmark results, always consider the possibility of data contamination, measurement bias, and the gap from real-world deployment conditions. A single benchmark cannot capture the full picture — comprehensive evaluation across multiple dimensions better reflects a model's genuine capabilities.


References