AI 벤치마크 데이터셋 완전 가이드: ImageNet, COCO, GLUE, MMLU, HumanEval
목차
- AI 벤치마크의 중요성
- 컴퓨터 비전 벤치마크
- NLP 벤치마크
- LLM 능력 벤치마크
- LLM 종합 평가
- 한국어 벤치마크
- 멀티모달 벤치마크
- LM-Evaluation-Harness 사용법
1. AI 벤치마크의 중요성
표준화된 평가의 필요성
AI 모델은 어떻게 비교해야 할까요? 두 개의 이미지 분류 모델이 있을 때, 어느 것이 더 우수한지 판단하려면 공통된 기준이 필요합니다. 벤치마크 데이터셋은 바로 이 공통된 기준을 제공합니다.
표준화된 벤치마크가 없다면 각 팀이 자신에게 유리한 데이터셋으로만 평가를 진행할 수 있으므로, 결과를 객관적으로 비교하기 어렵습니다. ImageNet, GLUE, MMLU 같은 표준 벤치마크는 AI 연구 커뮤니티가 동일한 시험지로 경쟁하도록 만들어 진보를 측정하고 방향을 설정하는 데 기여했습니다.
리더보드와 경쟁
벤치마크는 리더보드를 통해 AI 발전을 가시적으로 보여줍니다.
- ImageNet LSVRC: 2012년 AlexNet이 Top-5 오류율을 26%에서 15.3%로 낮추면서 딥러닝 혁명이 시작되었습니다.
- GLUE/SuperGLUE: BERT, RoBERTa, T5 등이 인간 수준 성능을 넘어서는 과정을 기록했습니다.
- HumanEval: GPT-4, Claude, Gemini 등 최신 LLM들이 코드 생성 능력을 경쟁하는 무대가 되었습니다.
- LMSYS Chatbot Arena: 실제 인간 사용자가 두 모델을 블라인드 테스트하여 ELO 점수를 매깁니다.
벤치마크의 한계와 편향
벤치마크는 강력한 도구이지만 한계도 명확합니다.
1. 데이터셋 오염 (Contamination)
LLM은 인터넷의 방대한 텍스트로 학습됩니다. 벤치마크 테스트 데이터가 학습 데이터에 포함되어 있다면 모델은 실제로 문제를 이해하는 것이 아니라 답을 암기한 것일 수 있습니다. GPT-4 기술 보고서에서도 이 문제를 인정했습니다.
2. 굿하트의 법칙
"측정 지표가 목표가 되면, 더 이상 좋은 측정 지표가 아니다." 연구자들이 특정 벤치마크 점수를 올리는 데만 집중하면, 실제 능력 향상 없이 점수만 높아질 수 있습니다.
3. 편향과 대표성
많은 벤치마크가 영어와 서양 문화권 데이터에 편중되어 있습니다. 한국어, 아랍어, 스와힐리어 등에서의 성능은 영어 벤치마크 점수와 크게 다를 수 있습니다.
4. 정적인 기준
벤치마크는 한번 만들어지면 변하지 않지만, AI 모델은 계속 발전합니다. 2023년에 어려웠던 벤치마크가 2025년에는 포화 상태(near-saturation)에 도달하기도 합니다.
5. 실제 성능과의 괴리
벤치마크 점수가 높다고 해서 실제 사용 환경에서 좋은 성능을 보장하지 않습니다. 사용자 경험, 창의성, 안전성 등 수치화하기 어려운 요소들도 중요합니다.
2. 컴퓨터 비전 벤치마크
ImageNet (ILSVRC)
ImageNet Large Scale Visual Recognition Challenge(ILSVRC)는 컴퓨터 비전 역사상 가장 영향력 있는 벤치마크입니다. 스탠퍼드 대학교의 Fei-Fei Li 교수가 주도한 ImageNet 프로젝트(2009)에서 시작되었으며, 2010년부터 2017년까지 연례 대회로 진행되었습니다.
데이터셋 특성:
- 1,000개 클래스 (개, 고양이, 자동차 등 일상적 사물)
- 학습 데이터: 약 120만 장
- 검증 데이터: 50,000장
- 테스트 데이터: 100,000장
- 평균 클래스당 약 1,200장
주요 평가 지표:
- Top-1 Accuracy: 모델이 예측한 1위 클래스가 실제 정답인 비율
- Top-5 Accuracy: 모델이 예측한 상위 5개 클래스 중 실제 정답이 포함된 비율
역사적 발전:
| 연도 | 모델 | Top-5 오류율 |
|---|---|---|
| 2010 | NEC-UIUC | 28.2% |
| 2012 | AlexNet | 15.3% |
| 2014 | VGG-16 | 7.3% |
| 2015 | ResNet-152 | 3.57% |
| 2017 | SENet | 2.25% |
| 2021 | CoAtNet | 0.95% |
| 2023 | ViT-22B | ~0.6% |
사람의 Top-5 오류율은 약 5.1%로 추정됩니다. ResNet(2015년)이 이미 인간 수준을 넘어선 이후 연구는 더욱 어려운 변형 벤치마크(ImageNet-A, ImageNet-R, ImageNet-C)로 확장되었습니다.
# PyTorch로 ImageNet 검증 정확도 측정
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader
def evaluate_imagenet(model, val_dir, batch_size=256):
# 표준 전처리 (ImageNet 검증 기준)
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
val_dataset = datasets.ImageFolder(val_dir, transform=val_transform)
val_loader = DataLoader(
val_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=8,
pin_memory=True
)
model.eval()
top1_correct = 0
top5_correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images = images.cuda()
labels = labels.cuda()
outputs = model(images)
_, predicted = outputs.topk(5, 1, True, True)
predicted = predicted.t()
correct = predicted.eq(labels.view(1, -1).expand_as(predicted))
top1_correct += correct[:1].reshape(-1).float().sum(0)
top5_correct += correct[:5].reshape(-1).float().sum(0)
total += labels.size(0)
top1_acc = top1_correct / total * 100
top5_acc = top5_correct / total * 100
print(f"Top-1 Accuracy: {top1_acc:.2f}%")
print(f"Top-5 Accuracy: {top5_acc:.2f}%")
return top1_acc, top5_acc
# 예시: ResNet-50 평가 (torchvision 0.13+에서는 pretrained 대신 weights 인자 사용)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).cuda()
evaluate_imagenet(model, '/path/to/imagenet/val')
COCO (Common Objects in Context)
COCO는 Microsoft가 2014년 공개한 대규모 객체 탐지, 세그멘테이션, 이미지 캡셔닝 벤치마크입니다.
데이터셋 특성:
- 80개 카테고리의 일상적 객체
- 330,000장 이상의 이미지
- 150만 개 이상의 객체 인스턴스
- 각 이미지에 5개의 캡션 (캡셔닝 태스크용)
- 세밀한 세그멘테이션 마스크 포함
주요 평가 지표:
mAP (mean Average Precision)는 COCO의 핵심 지표입니다. IoU(Intersection over Union) 임계값에 따라 다양한 지표가 있습니다.
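mAP의 바탕이 되는 IoU는 두 박스의 교집합 면적을 합집합 면적으로 나눈 값입니다. [x1, y1, x2, y2] 형식 박스를 가정한 간단한 스케치입니다.

```python
def box_iou(a, b):
    """두 박스의 IoU 계산. 박스는 [x1, y1, x2, y2] 형식을 가정."""
    # 교집합 영역의 좌표 계산
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou([0, 0, 10, 10], [0, 0, 10, 10]))  # 완전히 겹침: 1.0
print(box_iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 교집합 50 / 합집합 150
```

IoU가 임계값(예: 0.5)을 넘는 예측을 정답으로 간주하며, COCO의 기본 지표는 임계값 0.50부터 0.95까지 0.05 간격으로 평균한 AP입니다.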
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json
def evaluate_coco_detection(annotation_file, result_file):
# COCO GT 로드
coco_gt = COCO(annotation_file)
# 예측 결과 로드
coco_dt = coco_gt.loadRes(result_file)
# bbox 평가
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
# 주요 지표 출력
stats = coco_eval.stats
print(f"\n=== COCO Detection Results ===")
print(f"AP @ IoU=0.50:0.95 (COCO primary): {stats[0]:.3f}")
print(f"AP @ IoU=0.50 (PASCAL VOC style): {stats[1]:.3f}")
print(f"AP @ IoU=0.75 (strict): {stats[2]:.3f}")
print(f"AP (small objects, area < 32^2): {stats[3]:.3f}")
print(f"AP (medium objects): {stats[4]:.3f}")
print(f"AP (large objects): {stats[5]:.3f}")
print(f"AR (max=1 det/image): {stats[6]:.3f}")
print(f"AR (max=10 det/image): {stats[7]:.3f}")
print(f"AR (max=100 det/image): {stats[8]:.3f}")
return stats
# COCO 데이터셋 탐색
coco = COCO('instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person', 'car', 'dog'])
img_ids = coco.getImgIds(catIds=cat_ids[:1])
# 특정 이미지의 어노테이션 확인
img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(ann_ids)
print(f"이미지: {img['file_name']}, 어노테이션 수: {len(anns)}")
for ann in anns[:3]:
cat = coco.loadCats(ann['category_id'])[0]
print(f" 카테고리: {cat['name']}, 면적: {ann['area']:.0f}px²")
최신 COCO 성능 (2025년 기준):
| 모델 | AP (box) | AP (mask) | 파라미터 |
|---|---|---|---|
| YOLOv8x | 53.9 | - | 68M |
| DINO (Swin-L) | 63.3 | - | 218M |
| Co-DINO (Swin-L) | 64.1 | 54.0 | 218M |
| InternImage-H | 65.4 | 56.1 | 2.18B |
ADE20K - 시맨틱 세그멘테이션
ADE20K는 MIT CSAIL이 구축한 시맨틱 세그멘테이션 벤치마크로, 150개 카테고리에 걸쳐 25,000장의 이미지를 포함합니다.
주요 지표:
- mIoU (mean Intersection over Union): 예측 마스크와 실제 마스크 간의 평균 IoU
- aAcc (allAcc): 픽셀 수준 전체 정확도
- mAcc: 클래스별 평균 정확도
import numpy as np
def compute_iou(pred_mask, gt_mask, num_classes=150):
"""mIoU 계산"""
iou_list = []
for cls in range(num_classes):
pred_cls = (pred_mask == cls)
gt_cls = (gt_mask == cls)
intersection = np.logical_and(pred_cls, gt_cls).sum()
union = np.logical_or(pred_cls, gt_cls).sum()
if union == 0:
continue # 이 클래스가 이미지에 없으면 건너뜀
iou = intersection / union
iou_list.append(iou)
return np.mean(iou_list) if iou_list else 0.0
# mmsegmentation으로 ADE20K 평가
# pip install mmsegmentation (아래 API 이름은 mmseg 0.x 기준; 1.x에서는 init_model/inference_model 사용)
from mmseg.apis import inference_segmentor, init_segmentor
config_file = 'configs/segformer/segformer_mit-b5_8xb2-160k_ade20k-512x512.py'
checkpoint_file = 'segformer_mit-b5_8x2_512x512_160k_ade20k_20220617_203542-745f14da.pth'
model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'test_image.jpg')
Kinetics - 동영상 분류
Kinetics는 Google DeepMind가 제공하는 동영상 행동 인식 벤치마크입니다.
- Kinetics-400: 400개 행동 클래스, 약 30만 개 클립
- Kinetics-600: 600개 클래스, 약 50만 개 클립
- Kinetics-700: 700개 클래스
주요 지표: Top-1, Top-5 정확도 (각 클립에서의 평균)
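동영상 벤치마크에서는 보통 한 영상에서 여러 클립을 샘플링하고, 클립별 로짓(또는 소프트맥스 확률)을 평균한 뒤 Top-k를 판정합니다. 실제 평가 파이프라인이 아니라 이 집계 절차만 numpy로 표현한 개념 스케치입니다.

```python
import numpy as np

def video_topk_correct(clip_logits, label, k=5):
    """클립별 로짓을 평균한 뒤 Top-k 정답 여부를 판정.
    clip_logits: (클립 수, 클래스 수) 배열을 가정."""
    avg = clip_logits.mean(axis=0)       # 클립 차원으로 로짓 평균
    topk = np.argsort(avg)[::-1][:k]     # 평균 로짓 상위 k개 클래스
    return label in topk

# 3개 클립, 4개 클래스인 장난감 예시 (클래스 1의 평균 로짓이 최대)
logits = np.array([[0.1, 2.0, 0.3, 0.0],
                   [0.2, 1.5, 0.1, 0.4],
                   [0.0, 1.8, 0.2, 0.1]])
print(video_topk_correct(logits, label=1, k=1))  # True
```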
CIFAR-10/100
소규모 이미지 분류 벤치마크로, 빠른 프로토타이핑과 논문 검증에 자주 사용됩니다.
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# CIFAR-10 로드 및 평가
def evaluate_cifar10(model, batch_size=128):
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
testset = torchvision.datasets.CIFAR10(
root='./data', train=False, download=True, transform=transform
)
testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)
model.eval()
correct = 0
total = 0
with torch.no_grad():
for images, labels in testloader:
outputs = model(images)
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
print(f"CIFAR-10 정확도: {accuracy:.2f}%")
return accuracy
3. NLP 벤치마크
GLUE (General Language Understanding Evaluation)
GLUE는 2018년 뉴욕대(NYU), 워싱턴 대학교, DeepMind 연구진이 공동 발표한 NLP 모델 평가 벤치마크로, 9가지 서로 다른 언어 이해 태스크로 구성됩니다.
GLUE 태스크 구성:
| 태스크 | 설명 | 데이터셋 | 지표 |
|---|---|---|---|
| CoLA | 문법성 판단 | 8,551 | Matthews Corr. |
| SST-2 | 감성 분류 (긍정/부정) | 67K | 정확도 |
| MRPC | 문장 의미 동일성 | 3,700 | F1/정확도 |
| STS-B | 문장 유사도 점수 | 7K | Pearson/Spearman |
| QQP | 질문 유사성 | 400K | F1/정확도 |
| MNLI | 자연어 추론 (3분류) | 393K | 정확도 |
| QNLI | 질문-답변 추론 | 105K | 정확도 |
| RTE | 텍스트 함의 인식 | 2,500 | 정확도 |
| WNLI | Winograd NLI | 634 | 정확도 |
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score
def evaluate_glue_cola(model_name="bert-base-uncased"):
    """CoLA 태스크 평가 (문법성 판단)
    주의: bert-base-uncased처럼 파인튜닝되지 않은 체크포인트는 분류 헤드가
    무작위 초기화되므로, 의미 있는 점수를 얻으려면 CoLA로 파인튜닝된 모델을 넣어야 합니다."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )
dataset = load_dataset("glue", "cola")
val_data = dataset["validation"]
predictions = []
labels = []
model.eval()
import torch
for item in val_data:
inputs = tokenizer(
item['sentence'],
return_tensors='pt',
padding=True,
truncation=True,
max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()
predictions.append(pred)
labels.append(item['label'])
mcc = matthews_corrcoef(labels, predictions)
print(f"CoLA Matthews Correlation: {mcc:.4f}")
return mcc
# SST-2 (감성 분류)
def evaluate_glue_sst2(model_name="textattack/bert-base-uncased-SST-2"):
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
dataset = load_dataset("glue", "sst2")
val_data = dataset["validation"]
correct = 0
total = len(val_data)
model.eval()
import torch
for item in val_data:
inputs = tokenizer(
item['sentence'],
return_tensors='pt',
truncation=True,
max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()
if pred == item['label']:
correct += 1
acc = correct / total
print(f"SST-2 정확도: {acc:.4f}")
return acc
SuperGLUE
GLUE가 포화 상태(인간 수준 달성)에 가까워지자, 2019년 더 어려운 태스크로 구성된 SuperGLUE가 등장했습니다.
SuperGLUE 태스크:
- BoolQ: 예/아니오 질의응답 (9,427개)
- CB: 주장-전제 함의 (250개, 3분류)
- COPA: 원인/결과 추론 (1,000개)
- MultiRC: 멀티 문장 독해 (9,693개)
- ReCoRD: 클로즈 스타일 독해 (120K개)
- RTE: 텍스트 함의 인식 (5,749개)
- WiC: 단어 의미 중의성 해소 (9,600개)
- WSC: Winograd 스키마 도전 (554개)
인간 베이스라인: 89.8 / GPT-4 수준 모델: 90+ (인간 수준 초과)
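SuperGLUE의 종합 점수는 태스크별 점수의 단순 평균으로 계산되며, 지표가 둘인 태스크(CB, MultiRC, ReCoRD)는 두 지표를 먼저 평균합니다. 이 집계 방식만 표현한 작은 스케치이며, 아래 점수 값들은 임의의 예시입니다.

```python
def superglue_overall(task_scores):
    """태스크별 점수의 단순 평균 계산.
    task_scores: {태스크명: 점수 또는 [지표1, 지표2]} 형식을 가정."""
    per_task = []
    for name, score in task_scores.items():
        if isinstance(score, (list, tuple)):
            score = sum(score) / len(score)  # 복수 지표 태스크는 지표 평균 먼저
        per_task.append(score)
    return sum(per_task) / len(per_task)

scores = {"BoolQ": 87.1, "CB": [90.0, 95.0], "COPA": 94.8}
print(f"{superglue_overall(scores):.2f}")  # (87.1 + 92.5 + 94.8) / 3 = 91.47
```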
SQuAD 1.1 & 2.0
SQuAD(Stanford Question Answering Dataset)는 위키피디아 단락에서 질문에 대한 답을 추출하는 기계 독해 벤치마크입니다.
- SQuAD 1.1: 536개 위키피디아 문서, 107,785개 질문-답변 쌍. 모든 질문에 답이 단락 내에 존재
- SQuAD 2.0: SQuAD 1.1 + 53,775개 대답 불가능한 질문 추가
평가 지표:
- EM (Exact Match): 예측 답변이 정답과 완전히 일치하는 비율
- F1 Score: 단어 수준의 부분 일치 점수
from datasets import load_dataset
from transformers import pipeline
def evaluate_squad(model_name="deepset/roberta-base-squad2"):
"""SQuAD 2.0 평가"""
qa_pipeline = pipeline("question-answering", model=model_name)
dataset = load_dataset("squad_v2", split="validation")
em_scores = []
f1_scores = []
no_answer_correct = 0
no_answer_total = 0
for item in dataset.select(range(200)): # 빠른 평가를 위해 200개만
context = item['context']
question = item['question']
answers = item['answers']
result = qa_pipeline(question=question, context=context)
predicted = result['answer'].lower().strip()
has_answer = len(answers['text']) > 0
        if not has_answer:
            no_answer_total += 1
            if result['score'] < 0.1:  # 모델이 답 없음을 인식한 경우 (임계값 0.1은 예시용 가정)
                no_answer_correct += 1
                em_scores.append(1)  # 대답 불가 문제를 맞히면 EM/F1 = 1
                f1_scores.append(1)
            else:
                em_scores.append(0)
                f1_scores.append(0)
else:
gold_answers = [a.lower().strip() for a in answers['text']]
# EM 계산
em = max(int(predicted == gold) for gold in gold_answers)
em_scores.append(em)
# F1 계산
best_f1 = 0
for gold in gold_answers:
pred_tokens = set(predicted.split())
gold_tokens = set(gold.split())
common = pred_tokens & gold_tokens
if len(common) == 0:
f1 = 0
else:
precision = len(common) / len(pred_tokens)
recall = len(common) / len(gold_tokens)
f1 = 2 * precision * recall / (precision + recall)
best_f1 = max(best_f1, f1)
f1_scores.append(best_f1)
print(f"SQuAD 2.0 결과 (샘플 200개):")
print(f" EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
print(f" F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
if no_answer_total > 0:
print(f" 대답 불가 정확도: {no_answer_correct/no_answer_total*100:.1f}%")
WMT - 기계 번역
WMT(Workshop on Machine Translation)는 기계 번역 모델을 평가하는 연례 대회로, 여러 언어 쌍(영-독, 영-중, 영-한 등)에 대한 번역 품질을 평가합니다.
주요 평가 지표:
- BLEU (Bilingual Evaluation Understudy): n-gram 정밀도 기반 자동 평가
- COMET: 인간 평가와 높은 상관성을 보이는 신경망 기반 지표
- chrF: 문자 수준 n-gram F 점수
from datasets import load_dataset
import sacrebleu
def compute_bleu(predictions, references):
"""BLEU 점수 계산"""
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")
print(f"BP: {bleu.bp:.3f}")
print(f"Ratio: {bleu.sys_len/bleu.ref_len:.3f}")
return bleu.score
# 번역 모델 평가
from transformers import MarianMTModel, MarianTokenizer
def evaluate_translation(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-en-ko"):
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
predictions = []
for text in src_texts[:100]: # 100개 샘플
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs, max_length=512)
pred = tokenizer.decode(translated[0], skip_special_tokens=True)
predictions.append(pred)
bleu_score = compute_bleu(predictions, tgt_texts[:100])
return bleu_score
4. LLM 능력 벤치마크
MMLU (Massive Multitask Language Understanding)
MMLU는 UC Berkeley의 Dan Hendrycks 등이 2020년 발표한 벤치마크로, 57개 학문 분야에 걸쳐 고등학교 수준부터 대학·전문 자격시험 수준까지의 객관식 문제로 LLM의 지식과 추론 능력을 평가합니다.
분야별 구성:
- STEM: 수학, 물리, 화학, 컴퓨터과학, 엔지니어링
- 인문학: 역사, 철학, 법학, 윤리학
- 사회과학: 심리학, 경제학, 정치학, 사회학
- 기타: 의학, 영양학, 도덕적 시나리오, 전문 회계
각 문제는 4지선다형이며, 총 약 14,000개의 문제로 구성됩니다.
모델별 MMLU 성능:
| 모델 | MMLU 점수 | 발표 연도 |
|---|---|---|
| GPT-3 (175B) | 43.9% | 2020 |
| Gopher (280B) | 60.0% | 2021 |
| GPT-4 | 86.4% | 2023 |
| Claude 3 Opus | 86.8% | 2024 |
| Gemini Ultra | 90.0% | 2024 |
| GPT-4o | 88.7% | 2024 |
| 인간 전문가 추정 | ~90% | - |
from datasets import load_dataset
import anthropic # 또는 openai
def evaluate_mmlu(model_fn, subjects=None, num_few_shot=5):
"""MMLU 평가 함수"""
if subjects is None:
subjects = ['abstract_algebra', 'anatomy', 'astronomy', 'college_mathematics']
results = {}
for subject in subjects:
dataset = load_dataset("lukaemon/mmlu", subject)
test_data = dataset['test']
dev_data = dataset['dev'] # few-shot 예시용
correct = 0
total = 0
# Few-shot 프롬프트 구성
few_shot_examples = ""
for i, item in enumerate(dev_data.select(range(num_few_shot))):
few_shot_examples += f"Q: {item['input']}\n"
few_shot_examples += f"(A) {item['A']} (B) {item['B']} (C) {item['C']} (D) {item['D']}\n"
few_shot_examples += f"Answer: {item['target']}\n\n"
for item in test_data:
prompt = few_shot_examples
prompt += f"Q: {item['input']}\n"
prompt += f"(A) {item['A']} (B) {item['B']} (C) {item['C']} (D) {item['D']}\n"
prompt += "Answer:"
response = model_fn(prompt)
# 응답에서 A/B/C/D 추출
pred = response.strip()[0] if response.strip() else 'A'
if pred == item['target']:
correct += 1
total += 1
accuracy = correct / total
results[subject] = accuracy
print(f"{subject}: {accuracy:.3f} ({correct}/{total})")
overall = sum(results.values()) / len(results)
print(f"\n전체 평균: {overall:.3f}")
return results
BIG-Bench (Beyond the Imitation Game Benchmark)
Google이 주도한 BIG-Bench는 LLM의 경계를 탐색하는 204개의 다양한 태스크로 구성됩니다. 언어 모델이 아직 잘 수행하지 못하는 창의적 추론, 상식, 수학, 코드 등을 포함합니다.
BIG-Bench Hard: 23개의 어려운 태스크로, 체인-오브-소트(Chain-of-Thought) 프롬프팅으로 성능이 크게 향상됩니다.
from lm_eval import evaluator

# lm-evaluation-harness를 통한 BIG-Bench 평가
# (BIG-Bench 태스크 이름은 harness 버전에 따라 다를 수 있으므로 `lm_eval --tasks list`로 확인 권장)
results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
tasks=["bigbench_causal_judgment", "bigbench_date_understanding"],
num_fewshot=3,
batch_size="auto"
)
print(results['results'])
HellaSwag - 상식 추론
HellaSwag는 2019년 발표된 상식 추론 벤치마크로, 이야기의 다음 문장으로 가장 자연스러운 것을 4개 중에 고르는 형태입니다.
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice
def evaluate_hellaswag(model_name="microsoft/deberta-v2-xxlarge"):
    """주의: 다지선다(Multiple Choice) 헤드가 파인튜닝되지 않은 체크포인트는
    헤드가 무작위 초기화되므로, HellaSwag 등으로 파인튜닝된 모델을 써야 의미 있는 점수가 나옵니다."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)
dataset = load_dataset("hellaswag", split="validation")
correct = 0
total = min(500, len(dataset)) # 빠른 평가
for item in dataset.select(range(total)):
context = item['ctx']
endings = item['endings']
label = int(item['label'])
# 각 선택지와 컨텍스트 조합
choices = [context + " " + ending for ending in endings]
encoding = tokenizer(
[context] * 4,
choices,
return_tensors='pt',
padding=True,
truncation=True,
max_length=256
)
with torch.no_grad():
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
logits = outputs.logits
predicted = logits.argmax(dim=-1).item()
if predicted == label:
correct += 1
accuracy = correct / total
print(f"HellaSwag 정확도: {accuracy:.4f}")
return accuracy
ARC (AI2 Reasoning Challenge)
ARC는 AI2(Allen Institute for AI)가 발표한 초등-고등학교 수준의 과학 문제 벤치마크입니다.
- ARC-Easy: 비교적 쉬운 문제 (5,197개)
- ARC-Challenge: 검색·단어 동시출현 기반 베이스라인이 모두 틀린 어려운 문제 (2,590개, 이 중 테스트 1,172개)
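ARC 같은 객관식 벤치마크는 보통 선택지별로 LM 로그우도(또는 정규화 로그우도)를 매겨 가장 높은 쪽을 고르는 방식으로 채점합니다. score_fn 자리에 실제 모델의 채점 함수를 넣는다고 가정한 개념 스케치이며, 아래 람다는 동작 확인용 장난감 채점기입니다.

```python
def arc_predict(question, choices, score_fn):
    """각 선택지를 score_fn으로 채점해 최고 점수 선택지를 반환.
    score_fn: (질문, 선택지) -> 점수 함수를 가정 (실전에서는 LM 로그우도)."""
    scored = [(score_fn(question, c), c) for c in choices]
    return max(scored)[1]

# 장난감 채점기: 길이가 짧을수록 높은 점수 (실제로는 모델 로그우도로 대체)
pred = arc_predict(
    "Which gas do plants absorb?",
    ["carbon dioxide", "oxygen and water vapor", "liquid nitrogen compounds"],
    lambda q, c: -len(c),
)
print(pred)  # "carbon dioxide"
```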
TruthfulQA - 사실성 평가
TruthfulQA는 모델이 널리 알려진 미신, 오해, 편견에 대해 얼마나 정확하게 답변하는지를 평가합니다.
from datasets import load_dataset
from transformers import pipeline
def evaluate_truthfulqa(model_name="gpt2-xl"):
"""TruthfulQA MC1 (단일 정답 선택) 평가"""
dataset = load_dataset("truthful_qa", "multiple_choice")
val_data = dataset["validation"]
generator = pipeline("text-generation", model=model_name)
correct = 0
total = min(100, len(val_data))
for item in val_data.select(range(total)):
question = item['question']
choices = item['mc1_targets']['choices']
labels = item['mc1_targets']['labels']
correct_idx = labels.index(1)
# 프롬프트 구성
prompt = f"Q: {question}\nOptions:\n"
for i, choice in enumerate(choices):
letter = chr(65 + i) # A, B, C, ...
prompt += f"{letter}. {choice}\n"
prompt += "Answer:"
response = generator(prompt, max_new_tokens=5, do_sample=False)
generated = response[0]['generated_text'][len(prompt):].strip()
pred_letter = generated[0] if generated else 'A'
pred_idx = ord(pred_letter) - 65
if pred_idx == correct_idx:
correct += 1
accuracy = correct / total
print(f"TruthfulQA MC1 정확도: {accuracy:.4f}")
return accuracy
GSM8K - 초등 수학
GSM8K(Grade School Math 8K)는 OpenAI가 2021년 발표한 초등학교 수준의 수학 문제 8,500개로 구성된 벤치마크입니다. 각 문제는 자연어로 서술되며, 모델이 단계별로 수학적 추론을 수행하는 능력을 평가합니다.
from datasets import load_dataset
import re
def extract_number(text):
    """텍스트에서 최종 숫자 답 추출 (1,000처럼 천 단위 쉼표가 붙은 답도 처리)"""
    text = text.replace(',', '')
    numbers = re.findall(r'-?\d+\.?\d*', text)
    return numbers[-1] if numbers else None
def evaluate_gsm8k_chain_of_thought(model_fn):
    """Chain-of-Thought로 GSM8K 평가 (아래 프롬프트는 2-shot 예시; 원 논문 설정은 8-shot)"""
dataset = load_dataset("gsm8k", "main")
test_data = dataset['test']
few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A: They started with 23. Used 20: 23 - 20 = 3. Then bought 6: 3 + 6 = 9. The answer is 9.
"""
correct = 0
total = min(200, len(test_data))
for item in test_data.select(range(total)):
question = item['question']
gold_answer = item['answer'].split('####')[-1].strip()
prompt = few_shot_prompt + f"Q: {question}\nA:"
response = model_fn(prompt, max_tokens=256)
pred = extract_number(response)
gold = extract_number(gold_answer)
if pred and gold and abs(float(pred) - float(gold)) < 0.01:
correct += 1
accuracy = correct / total
print(f"GSM8K 정확도 (Chain-of-Thought): {accuracy:.4f}")
return accuracy
HumanEval - 코드 생성 평가
HumanEval은 OpenAI가 2021년 발표한 코드 생성 벤치마크로, 164개의 파이썬 함수 시그니처와 독스트링이 주어지고 모델이 완성된 함수를 작성해야 합니다.
평가 지표: pass@k
k번 시도 중 적어도 한 번 테스트를 통과할 확률입니다.
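문제당 n개 샘플을 생성해 c개가 테스트를 통과했을 때, pass@k의 불편 추정치는 Codex 논문(Chen et al., 2021)의 조합식 1 - C(n-c, k) / C(n, k)로 계산합니다.

```python
from math import comb

def pass_at_k(n, c, k):
    """pass@k 불편 추정치. n: 생성 샘플 수, c: 통과 샘플 수, k: 허용 시도 횟수"""
    if n - c < k:
        return 1.0  # 실패 샘플이 k개 미만이면 k개 뽑을 때 반드시 통과 샘플 포함
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 약 0.9167
```

k=1이면 단순 통과율 c/n과 같고, k가 커질수록 여러 번 시도할 기회가 반영되어 값이 커집니다.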
from datasets import load_dataset
import subprocess
import tempfile
import os
def evaluate_humaneval(model_fn, n=10, temperature=0.8):
    """HumanEval 평가: 문제당 n개 샘플을 생성해 통과율로 pass@1을 추정
    주의: 모델이 생성한 코드를 직접 실행하므로 반드시 샌드박스/격리 환경에서 실행할 것"""
dataset = load_dataset("openai_humaneval")
test_data = dataset['test']
task_results = {}
for item in test_data.select(range(20)): # 20개만 빠르게 평가
task_id = item['task_id']
prompt = item['prompt']
tests = item['test']
entry_point = item['entry_point']
passes = 0
for attempt in range(n):
code = model_fn(prompt, temperature=temperature)
# 코드 실행 및 테스트
full_code = prompt + code + "\n" + tests + f"\ncheck({entry_point})"
with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
f.write(full_code)
tmp_path = f.name
try:
result = subprocess.run(
['python', tmp_path],
timeout=10,
capture_output=True,
text=True
)
if result.returncode == 0:
passes += 1
except subprocess.TimeoutExpired:
pass
finally:
os.unlink(tmp_path)
task_results[task_id] = passes / n
# pass@1 계산
pass_at_1 = sum(task_results.values()) / len(task_results)
print(f"pass@1: {pass_at_1:.4f}")
return pass_at_1
# 주요 모델 HumanEval 성능 (2025년 기준)
humaneval_scores = {
"GPT-3 (175B)": 0.0, # 원본 논문 기준
"Codex (12B)": 0.288,
"GPT-4": 0.870,
"Claude 3.5 Sonnet": 0.900,
"DeepSeek-Coder-33B": 0.823,
"Llama 3.1 70B": 0.803,
}
MBPP - 파이썬 프로그래밍
MBPP(Mostly Basic Python Problems)는 Google이 발표한 974개의 파이썬 프로그래밍 문제로, HumanEval보다 더 다양한 난이도를 포함합니다.
from datasets import load_dataset
def evaluate_mbpp_sample():
"""MBPP 데이터셋 탐색"""
dataset = load_dataset("mbpp")
test_data = dataset['test']
print("MBPP 샘플 문제:")
for item in test_data.select(range(3)):
print(f"\n태스크 ID: {item['task_id']}")
print(f"문제: {item['text']}")
print(f"테스트 케이스: {item['test_list'][:2]}")
print(f"참고 코드:\n{item['code']}")
print("-" * 50)
5. LLM 종합 평가
MT-Bench - 멀티턴 대화 평가
MT-Bench는 UC Berkeley의 LMSYS 팀이 개발한 멀티턴 대화 평가 벤치마크로, GPT-4를 심판으로 사용하여 1-10점 척도로 채점합니다.
8개 카테고리, 각 10개 질문:
- Writing (글쓰기)
- Roleplay (역할극)
- Reasoning (추론)
- Math (수학)
- Coding (코딩)
- Extraction (정보 추출)
- STEM
- Humanities
import json
from openai import OpenAI
def mt_bench_judge(question, answer, reference_answer=None):
"""GPT-4로 MT-Bench 답변 평가"""
client = OpenAI()
system_prompt = """You are a helpful assistant that evaluates AI responses.
Rate the response on a scale of 1-10 based on: accuracy, relevance, completeness, and clarity.
Output format: Score: X/10\nRationale: [brief explanation]"""
user_prompt = f"""Question: {question}
AI Response: {answer}
{f'Reference Answer: {reference_answer}' if reference_answer else ''}
Please evaluate this response."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.0
)
judge_response = response.choices[0].message.content
print(f"평가 결과:\n{judge_response}")
return judge_response
# FastChat의 공식 MT-Bench 사용법
# git clone https://github.com/lm-sys/FastChat
# python -m fastchat.llm_judge.gen_model_answer --model-path your-model
# python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4
# python -m fastchat.llm_judge.show_result
LMSYS Chatbot Arena
Chatbot Arena는 실제 사용자가 두 모델의 응답을 비교하여 더 나은 쪽에 투표하는 방식입니다. ELO 레이팅 시스템을 사용하므로 인간의 실제 선호도를 반영합니다.
2025년 3월 ELO 상위 모델 (참고용):
| 순위 | 모델 | ELO |
|---|---|---|
| 1 | GPT-4.5 | ~1370 |
| 2 | Gemini 2.0 Ultra | ~1360 |
| 3 | Claude 3.7 Sonnet | ~1350 |
| 4 | GPT-4o | ~1340 |
| 5 | Llama 3.3 70B | ~1250 |
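Arena 순위의 바탕이 되는 고전적 Elo 방식은 기대 승률과 실제 결과의 차이만큼 점수를 이동시킵니다. K 계수 32는 예시 값이며, 실제 Arena는 이후 Bradley-Terry 계열의 통계적 집계로 발전했지만 직관은 동일합니다.

```python
def elo_update(r_a, r_b, result_a, k=32):
    """한 경기 후 두 플레이어의 Elo 갱신.
    result_a: A의 결과 (승 1, 무 0.5, 패 0)"""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A의 기대 승률
    delta = k * (result_a - expected_a)               # 기대와 실제의 차이만큼 이동
    return r_a + delta, r_b - delta

# 동일 레이팅에서 A가 승리: 기대 승률 0.5이므로 A +16, B -16
print(elo_update(1000, 1000, 1))  # (1016.0, 984.0)
```

레이팅 차가 클수록 강자가 이겨도 점수 변동이 작고, 약자가 이기면 크게 오릅니다.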
HELM (Holistic Evaluation of Language Models)
Stanford CRFM이 개발한 HELM은 단순 정확도를 넘어 다음 7개 측면을 종합 평가합니다:
- Accuracy (정확도)
- Calibration (확신도 보정)
- Robustness (견고성)
- Fairness (공정성)
- Bias (편향성)
- Toxicity (독성)
- Efficiency (효율성)
# HELM 평가 실행
pip install crfm-helm
# 기본 평가 (mmlu + summarization + qa)
helm-run \
--conf src/helm/benchmark/presentation/run_specs_lite.conf \
--local \
--max-eval-instances 1000 \
--num-train-trials 1
# 결과 확인
helm-summarize --suite v1
helm-server
Open LLM Leaderboard (HuggingFace)
HuggingFace의 Open LLM Leaderboard는 오픈소스 LLM을 일관된 기준으로 평가하는 공개 리더보드입니다.
평가 태스크:
- MMLU (5-shot)
- ARC Challenge (25-shot)
- HellaSwag (10-shot)
- TruthfulQA (0-shot)
- Winogrande (5-shot)
- GSM8K (5-shot)
# huggingface_hub로 리더보드 데이터 접근
from huggingface_hub import HfApi
import pandas as pd
def fetch_leaderboard_data():
"""Open LLM Leaderboard 데이터 가져오기"""
api = HfApi()
# 리더보드 데이터셋
dataset_info = api.dataset_info("open-llm-leaderboard/results")
print(f"마지막 업데이트: {dataset_info.lastModified}")
# 결과 파일 목록
files = api.list_repo_files(
repo_id="open-llm-leaderboard/results",
repo_type="dataset"
)
model_results = []
for f in list(files)[:5]: # 처음 5개만
print(f"파일: {f}")
return model_results
6. 한국어 벤치마크
KLUE (Korean Language Understanding Evaluation)
KLUE는 2021년 네이버 AI Lab, Upstage 등 국내외 산학 공동 연구진이 개발한 한국어 자연어 이해 벤치마크로, 8개 태스크로 구성됩니다.
KLUE 태스크:
| 태스크 | 유형 | 데이터 규모 | 지표 |
|---|---|---|---|
| TC (Topic Classification) | 문서 분류 | 60K | 정확도 |
| STS (Semantic Textual Similarity) | 문장 유사도 | 13K | Pearson |
| NLI (Natural Language Inference) | 자연어 추론 | 30K | 정확도 |
| NER (Named Entity Recognition) | 개체명 인식 | 21K | Entity F1 |
| RE (Relation Extraction) | 관계 추출 | 32K | micro-F1 |
| DP (Dependency Parsing) | 의존 구문 분석 | 23K | UAS/LAS |
| MRC (Machine Reading Comprehension) | 기계 독해 | 24K | EM/F1 |
| DST (Dialogue State Tracking) | 대화 상태 추적 | 10K | JGA |
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
def evaluate_klue_nli(model_name="klue/roberta-large"):
    """KLUE-NLI 평가
    주의: klue/roberta-large는 사전학습 모델이므로 분류 헤드가 무작위 초기화됩니다.
    의미 있는 점수를 얻으려면 KLUE-NLI로 파인튜닝된 체크포인트를 사용하세요."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3
    )
dataset = load_dataset("klue", "nli")
val_data = dataset['validation']
label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}
correct = 0
total = min(500, len(val_data))
model.eval()
for item in val_data.select(range(total)):
premise = item['premise']
hypothesis = item['hypothesis']
gold_label = item['label']
inputs = tokenizer(
premise, hypothesis,
return_tensors='pt',
truncation=True,
max_length=512
)
with torch.no_grad():
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()
if pred == gold_label:
correct += 1
accuracy = correct / total
print(f"KLUE-NLI 정확도: {accuracy:.4f}")
return accuracy
def evaluate_klue_mrc(model_name="klue/roberta-large"):
"""KLUE-MRC (기계 독해) 평가"""
from transformers import AutoModelForQuestionAnswering, pipeline
qa_pipeline = pipeline(
"question-answering",
model=model_name,
tokenizer=model_name
)
dataset = load_dataset("klue", "mrc")
val_data = dataset['validation']
em_scores = []
f1_scores = []
for item in val_data.select(range(100)):
context = item['context']
question = item['question']
answers = item['answers']['text']
result = qa_pipeline(question=question, context=context)
predicted = result['answer'].strip()
# EM
em = max(int(predicted == a) for a in answers)
em_scores.append(em)
# F1
        from collections import Counter
        best_f1 = 0
        for gold in answers:
            # 문자 단위 F1: set 대신 Counter를 써서 중복 문자까지 반영
            common = sum((Counter(predicted) & Counter(gold)).values())
            if common > 0:
                precision = common / len(predicted)
                recall = common / len(gold)
                f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
f1_scores.append(best_f1)
print(f"KLUE-MRC EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
print(f"KLUE-MRC F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
KoBEST
KoBEST(Korean Balanced Evaluation of Significant Tasks)는 국내 연구진이 구축한 한국어 벤치마크로, 5개 태스크를 포함합니다:
- BoolQ: 예/아니오 질의응답
- COPA: 원인/결과 추론
- WiC: 단어 의미 중의성 해소
- HellaSwag: 상식 완성
- SentiNeg: 부정 감성 이해
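KoBEST COPA처럼 선택형 태스크를 평가할 때는 각 항목을 프롬프트 문자열로 변환하는 단계가 필요합니다. 아래는 skt/kobest_v1 공개본의 필드명(premise, question, alternative_1/2)을 가정한 스케치이며, 실제 필드명은 사용하는 데이터셋 릴리스에서 확인해야 합니다.

```python
def build_copa_prompt(item):
    """KoBEST COPA 항목을 2지선다 프롬프트 문자열로 변환.
    item: premise/question/alternative_1/alternative_2 필드를 가정."""
    connective = "원인은" if item["question"] == "원인" else "결과는"
    return (
        f"전제: {item['premise']}\n"
        f"{connective}?\n"
        f"1) {item['alternative_1']}\n"
        f"2) {item['alternative_2']}\n"
        f"답:"
    )

sample = {
    "premise": "그 남자는 우산을 폈다.",
    "question": "원인",
    "alternative_1": "비가 내리기 시작했다.",
    "alternative_2": "날씨가 맑게 개었다.",
}
print(build_copa_prompt(sample))
```

평가 시에는 이렇게 만든 프롬프트 뒤에 각 선택지를 붙여 LM 로그우도를 비교하거나, 생성된 답에서 번호를 파싱합니다.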
KMMLU (한국어 MMLU)
KMMLU는 MMLU 형식을 따르는 한국어 벤치마크로, 영어 MMLU의 번역본이 아니라 한국의 각종 시험·자격시험 등 한국어 원자료에서 직접 수집한 45개 과목, 약 35,000개 문제로 구성됩니다. 한국사, 한국 법률, 한국 의학처럼 한국 특화 지식도 평가합니다.
from datasets import load_dataset
def evaluate_kmmlu_sample():
"""KMMLU 샘플 탐색"""
    # KMMLU 로드 (2024년 공개; datasets/릴리스 버전에 따라 과목별 config 인자가 필요할 수 있음)
    dataset = load_dataset("HAERAE-HUB/KMMLU")
test_data = dataset['test']
print(f"총 문제 수: {len(test_data)}")
subjects = set(test_data['subject'])
print(f"과목 수: {len(subjects)}")
print(f"샘플 과목: {list(subjects)[:10]}")
# 첫 번째 문제 출력
item = test_data[0]
print(f"\n과목: {item['subject']}")
print(f"문제: {item['question']}")
print(f"A: {item['A']}")
print(f"B: {item['B']}")
print(f"C: {item['C']}")
print(f"D: {item['D']}")
print(f"정답: {item['answer']}")
7. 멀티모달 벤치마크
VQA (Visual Question Answering)
VQA는 이미지를 보고 자연어 질문에 답하는 태스크입니다.
- VQA v2: 약 110만 개의 (이미지, 질문, 답) 쌍. 언어 편향을 줄이기 위해 같은 질문에 서로 다른 답이 나오도록 보완적 이미지 쌍으로 구성
- 평가 지표: Accuracy = min(일치한 사람 답변 수 / 3, 1). 어노테이터 10명의 답변과 비교하여 3명 이상 동의하면 만점
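위 Accuracy 공식을 함수로 쓰면 다음과 같습니다. 공식 구현은 10명 중 9명씩 뽑은 부분집합들에 대한 평균을 추가로 취하지만, 핵심 아이디어는 동일합니다.

```python
def vqa_accuracy(prediction, human_answers):
    """VQA 스타일 정확도: min(일치한 사람 답 수 / 3, 1).
    human_answers: 어노테이터 답변 10개 리스트를 가정."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3, 1.0)

answers = ["red"] * 6 + ["dark red"] * 4
print(vqa_accuracy("red", answers))       # 6명 일치 -> 1.0
print(vqa_accuracy("dark red", answers))  # 4명 일치 -> 1.0
print(vqa_accuracy("maroon", answers))    # 0명 일치 -> 0.0
```

이 설계 덕분에 "red"와 "dark red"처럼 사람들 사이에서도 갈리는 답이 모두 부분 점수 이상을 받을 수 있습니다.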
from datasets import load_dataset
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image
def evaluate_vqa_blip():
"""BLIP으로 VQA 평가"""
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()
# VQA v2 검증 데이터 (로컬 이미지 경로 필요)
questions = [
("What color is the car?", "test_car.jpg"),
("How many people are in the image?", "test_crowd.jpg"),
("Is it raining?", "test_outdoor.jpg")
]
for question, image_path in questions:
try:
image = Image.open(image_path).convert('RGB')
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
out = model.generate(**inputs, max_length=20)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"Q: {question}")
print(f"A: {answer}\n")
except FileNotFoundError:
print(f"이미지 없음: {image_path}")
MMBench
MMBench는 상하이 AI 연구소가 발표한 멀티모달 LLM 평가 벤치마크로, 20개 능력 차원에 걸쳐 3,000개의 객관식 문제를 포함합니다.
평가 차원 (예시):
- Attribute Recognition (속성 인식)
- Spatial Relationship (공간 관계)
- Action Recognition (행동 인식)
- OCR (광학 문자 인식)
- Commonsense Reasoning (상식 추론)
MMMU (Massive Multidiscipline Multimodal Understanding)
MMMU는 대학 수준의 멀티모달 이해를 평가하는 벤치마크로, 6개 핵심 분야(Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering)의 30개 과목, 11,550개 문제를 포함합니다.
from datasets import load_dataset
def explore_mmmu():
"""MMMU 데이터셋 탐색"""
dataset = load_dataset("MMMU/MMMU", "Accounting")
print(f"Accounting 태스크 검증 데이터: {len(dataset['validation'])} 문제")
item = dataset['validation'][0]
print(f"\n질문: {item['question']}")
print(f"선택지 A: {item['option_A']}")
print(f"선택지 B: {item['option_B']}")
print(f"정답: {item['answer']}")
# 이미지 포함 여부 확인
if item['image_1']:
print("이미지 포함 문제")
# 멀티모달 모델 평가 (예: LLaVA)
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
def evaluate_mmmu_with_llava(model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
processor = LlavaNextProcessor.from_pretrained(model_name)
model = LlavaNextForConditionalGeneration.from_pretrained(
model_name, torch_dtype="auto", device_map="auto"
)
dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")
correct = 0
total = min(50, len(dataset))
for item in dataset.select(range(total)):
question = item['question']
options = [item.get(f'option_{c}', '') for c in 'ABCDE' if item.get(f'option_{c}')]
gold = item['answer']
        if item['image_1']:
            image = item['image_1']
            # llava-v1.6-mistral 계열의 이미지 토큰은 <image> (모델별로 프롬프트 형식이 다름)
            prompt = f"[INST] <image>\nQuestion: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device)
        else:
            prompt = f"[INST] Question: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, return_tensors='pt').to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=10)
        response = processor.decode(output[0], skip_special_tokens=True)
        # 디코딩 결과에는 프롬프트가 포함되므로 [/INST] 뒤의 생성 부분에서 첫 글자를 추출
        answer_part = response.split("[/INST]")[-1].strip()
        pred = answer_part[0].upper() if answer_part else 'A'
if pred == gold:
correct += 1
acc = correct / total
print(f"MMMU-Accounting 정확도: {acc:.3f}")
return acc
8. LM-Evaluation-Harness 사용법
EleutherAI의 lm-evaluation-harness는 LLM 평가의 사실상 표준 도구로, 수십 개 벤치마크와 수백 개의 하위 태스크를 지원합니다.
설치 및 기본 사용법
# 설치
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
# MMLU 평가 (GPT-2)
lm_eval --model hf \
--model_args pretrained=gpt2 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/gpt2_mmlu
# 여러 태스크 동시 평가
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
--tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,gsm8k \
--num_fewshot 5 \
--batch_size 4 \
--output_path results/llama3.2_3b
# HuggingFace 모델 (4비트 양자화로 실행)
lm_eval --model hf \
--model_args pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 1
Python API 사용
from lm_eval import evaluator
def run_comprehensive_evaluation(model_path, output_dir="./results"):
"""LM-Evaluation-Harness 종합 평가"""
import os
os.makedirs(output_dir, exist_ok=True)
# Define the tasks to evaluate
task_groups = {
"knowledge": ["mmlu", "arc_challenge", "arc_easy"],
"reasoning": ["hellaswag", "winogrande", "piqa"],
"truthfulness": ["truthfulqa_mc1"],
"math": ["gsm8k"],
"coding": ["humaneval"],
}
all_results = {}
for group, tasks in task_groups.items():
print(f"\n=== Evaluating {group.upper()} ===")
results = evaluator.simple_evaluate(
model="hf",
model_args=f"pretrained={model_path}",
tasks=tasks,
# Note: one few-shot count per group, chosen from the group's first task
num_fewshot={"mmlu": 5, "arc_challenge": 25, "hellaswag": 10,
"truthfulqa_mc1": 0, "gsm8k": 5, "winogrande": 5,
"piqa": 0, "humaneval": 0, "arc_easy": 25}.get(tasks[0], 0),
batch_size="auto",
device="cuda" if __import__("torch").cuda.is_available() else "cpu",
)
all_results[group] = results['results']
# Print per-task results
for task, metrics in results['results'].items():
if 'acc,none' in metrics:
print(f" {task}: {metrics['acc,none']*100:.1f}%")
elif 'exact_match,strict-match' in metrics:
print(f" {task}: {metrics['exact_match,strict-match']*100:.1f}%")
# Save aggregated results
import json
with open(f"{output_dir}/evaluation_results.json", "w", encoding="utf-8") as f:
json.dump(all_results, f, ensure_ascii=False, indent=2)
print(f"\nResults saved: {output_dir}/evaluation_results.json")
return all_results
def compare_models(model_paths, tasks=None):
"""Compare multiple models on the same tasks."""
if tasks is None:
tasks = ["mmlu", "arc_challenge", "hellaswag", "gsm8k"]
comparison = {}
for model_path in model_paths:
print(f"\nEvaluating: {model_path}")
results = evaluator.simple_evaluate(
model="hf",
model_args=f"pretrained={model_path}",
tasks=tasks,
num_fewshot=5,
batch_size="auto"
)
model_scores = {}
for task, metrics in results['results'].items():
for metric, value in metrics.items():
if isinstance(value, (int, float)) and not metric.endswith('_stderr'):
model_scores[f"{task}/{metric}"] = round(value * 100, 2)
comparison[model_path.split('/')[-1]] = model_scores
# Print comparison table
print("\n" + "="*80)
print("Model comparison results:")
print("="*80)
all_metrics = sorted(set().union(*[s.keys() for s in comparison.values()]))
header = f"{'Metric':<40}" + "".join(f"{m[:15]:<18}" for m in comparison.keys())
print(header)
print("-" * 80)
for metric in all_metrics:
if 'acc,none' in metric or 'exact_match' in metric:
row = f"{metric:<40}"
for model_name in comparison:
score = comparison[model_name].get(metric, "N/A")
row += f"{score:<18}"
print(row)
return comparison
Adding a Custom Task
# custom_task.py
from lm_eval.api.task import Task, TaskConfig
from lm_eval.api.instance import Instance
class KoreanQATask(Task):
"""Custom Korean QA task (prompts intentionally stay in Korean)."""
VERSION = 1.0
DATASET_PATH = "your-org/korean-qa-dataset"
DATASET_NAME = None
def has_training_docs(self):
return False
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
def validation_docs(self):
return self.dataset["validation"]
def test_docs(self):
return self.dataset["test"]
def doc_to_text(self, doc):
return f"질문: {doc['question']}\n답변:"  # Korean prompt: "Question: ... / Answer:"
def doc_to_target(self, doc):
return " " + doc['answer']
def construct_requests(self, doc, ctx):
return [Instance(
request_type="generate_until",
doc=doc,
arguments=(ctx, {"until": ["\n", "질문:"]}),
idx=0
)]
def process_results(self, doc, results):
gold = doc['answer'].lower().strip()
pred = results[0].lower().strip()
return {"exact_match": int(gold == pred)}
def aggregation(self):
return {"exact_match": "mean"}
def higher_is_better(self):
return {"exact_match": True}
Conclusion
AI benchmark datasets serve as the compass of AI research and development. To recap the key points:
Computer Vision:
- ImageNet: the gold standard for 1,000-class image classification
- COCO: the standard for object detection and segmentation
- ADE20K: a key semantic segmentation benchmark
NLP:
- GLUE/SuperGLUE: comprehensive evaluation of language understanding
- SQuAD: the standard machine reading comprehension benchmark
LLM Capabilities:
- MMLU: knowledge evaluation across 57 subjects (the broadest coverage)
- HumanEval: code generation evaluation
- GSM8K: mathematical reasoning
Comprehensive Evaluation:
- HELM: balanced evaluation across 7 dimensions
- Chatbot Arena: ELO ratings from real human preferences
- Open LLM Leaderboard: open-source LLM comparison
Korean:
- KLUE: Korean language understanding across 8 tasks
- KMMLU: Korean-language knowledge evaluation
When interpreting benchmark results, always keep in mind the possibility of dataset contamination, measurement bias, and the gap between benchmarks and real-world usage. A comprehensive evaluation across multiple dimensions, rather than a single benchmark score, better reflects a model's true capabilities.
AI Benchmark Datasets Complete Guide: ImageNet, COCO, GLUE, MMLU, HumanEval
Table of Contents
- Why AI Benchmarks Matter
- Computer Vision Benchmarks
- NLP Benchmarks
- LLM Capability Benchmarks
- Comprehensive LLM Evaluation
- Korean Language Benchmarks
- Multimodal Benchmarks
- Using LM-Evaluation-Harness
1. Why AI Benchmarks Matter
The Need for Standardized Evaluation
How should AI models be compared? When two image classification models exist, a common standard is needed to determine which is better. Benchmark datasets provide exactly that common ground.
Without standardized benchmarks, each team could evaluate only on data favorable to them, making objective comparison impossible. Standard benchmarks like ImageNet, GLUE, and MMLU have enabled the AI research community to compete on the same test, measuring progress and setting direction.
Leaderboards and Competition
Benchmarks make AI progress visible through leaderboards.
- ImageNet LSVRC: AlexNet reduced Top-5 error from 26% to 15.3% in 2012, launching the deep learning revolution.
- GLUE/SuperGLUE: Documented the journey of BERT, RoBERTa, T5, and others surpassing human-level performance.
- HumanEval: Became the arena where GPT-4, Claude, Gemini, and others compete on code generation.
- LMSYS Chatbot Arena: Real human users blindly compare two models and vote, producing ELO ratings.
Limitations and Biases of Benchmarks
Benchmarks are powerful tools with clear limitations.
1. Dataset Contamination
LLMs are trained on vast internet text. If benchmark test data is present in training data, the model may be memorizing answers rather than genuinely solving problems. Even the GPT-4 technical report acknowledged this issue.
2. Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." Researchers who focus only on improving specific benchmark scores can raise scores without genuine capability improvements.
3. Bias and Representativeness
Many benchmarks are heavily weighted toward English and Western cultural data. Performance in Korean, Arabic, Swahili, and other languages can differ substantially from English benchmark scores.
4. Static Standards
Benchmarks do not change once created, but AI models continually improve. A difficult benchmark in 2023 can reach near-saturation by 2025.
5. Gap from Real-World Performance
High benchmark scores do not guarantee good performance in actual deployment. User experience, creativity, safety, and other hard-to-quantify factors matter just as much.
2. Computer Vision Benchmarks
ImageNet (ILSVRC)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is the most influential benchmark in computer vision history. Originating from the ImageNet project (2009) led by Professor Fei-Fei Li at Stanford, it ran as an annual competition from 2010 to 2017.
Dataset Characteristics:
- 1,000 classes (everyday objects: dogs, cats, cars, etc.)
- Training data: approximately 1.2 million images
- Validation data: 50,000 images
- Test data: 100,000 images
- Average of about 1,200 images per class
Key Metrics:
- Top-1 Accuracy: Fraction of predictions where the top-1 predicted class is the correct label
- Top-5 Accuracy: Fraction where the correct label appears in the top 5 predictions
Historical Progress:
| Year | Model | Top-5 Error |
|---|---|---|
| 2010 | NEC-UIUC | 28.2% |
| 2012 | AlexNet | 15.3% |
| 2014 | VGG-16 | 7.3% |
| 2015 | ResNet-152 | 3.57% |
| 2017 | SENet | 2.25% |
| 2021 | CoAtNet | 0.95% |
| 2023 | ViT-22B | ~0.6% |
Human Top-5 error is estimated at about 5.1%. After ResNet surpassed human performance in 2015, research expanded to harder variants: ImageNet-A, ImageNet-R, and ImageNet-C.
# Measuring ImageNet validation accuracy with PyTorch
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader
def evaluate_imagenet(model, val_dir, batch_size=256):
# Standard preprocessing (ImageNet validation standard)
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
val_dataset = datasets.ImageFolder(val_dir, transform=val_transform)
val_loader = DataLoader(
val_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=8,
pin_memory=True
)
model.eval()
top1_correct = 0
top5_correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images = images.cuda()
labels = labels.cuda()
outputs = model(images)
_, predicted = outputs.topk(5, 1, True, True)
predicted = predicted.t()
correct = predicted.eq(labels.view(1, -1).expand_as(predicted))
top1_correct += correct[:1].reshape(-1).float().sum(0)
top5_correct += correct[:5].reshape(-1).float().sum(0)
total += labels.size(0)
top1_acc = top1_correct / total * 100
top5_acc = top5_correct / total * 100
print(f"Top-1 Accuracy: {top1_acc:.2f}%")
print(f"Top-5 Accuracy: {top5_acc:.2f}%")
return top1_acc, top5_acc
# Example: Evaluate ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).cuda()
evaluate_imagenet(model, '/path/to/imagenet/val')
COCO (Common Objects in Context)
COCO is a large-scale object detection, segmentation, and image captioning benchmark released by Microsoft in 2014.
Dataset Characteristics:
- 80 categories of everyday objects
- 330,000+ images
- 1.5+ million object instances
- 5 captions per image (for captioning tasks)
- Detailed instance segmentation masks
Key Metrics:
mAP (mean Average Precision) is COCO's primary metric. Various metrics exist depending on IoU (Intersection over Union) thresholds.
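Before running the full COCO evaluator, it helps to see IoU itself. The sketch below computes the IoU of two axis-aligned boxes in COCO's `[x, y, width, height]` format; `box_iou` is a hypothetical helper for illustration, not part of pycocotools.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes in COCO [x, y, width, height] format."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Intersection rectangle (zero area if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x10 strip
print(box_iou([0, 0, 10, 10], [5, 0, 10, 10]))  # 50 / 150 ≈ 0.333
```

A prediction counts as a true positive at threshold 0.5 when its IoU with a ground-truth box is at least 0.5; COCO's primary AP averages over thresholds 0.50 to 0.95 in steps of 0.05.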
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json
def evaluate_coco_detection(annotation_file, result_file):
# Load COCO ground truth
coco_gt = COCO(annotation_file)
# Load predictions
coco_dt = coco_gt.loadRes(result_file)
# Bounding box evaluation
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
stats = coco_eval.stats
print(f"\n=== COCO Detection Results ===")
print(f"AP @ IoU=0.50:0.95 (COCO primary): {stats[0]:.3f}")
print(f"AP @ IoU=0.50 (PASCAL VOC style): {stats[1]:.3f}")
print(f"AP @ IoU=0.75 (strict): {stats[2]:.3f}")
print(f"AP small (area < 32^2): {stats[3]:.3f}")
print(f"AP medium: {stats[4]:.3f}")
print(f"AP large: {stats[5]:.3f}")
print(f"AR (max=1 per image): {stats[6]:.3f}")
print(f"AR (max=10 per image): {stats[7]:.3f}")
print(f"AR (max=100 per image): {stats[8]:.3f}")
return stats
# Explore COCO annotations
coco = COCO('instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person', 'car', 'dog'])
img_ids = coco.getImgIds(catIds=cat_ids[:1])
img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(ann_ids)
print(f"Image: {img['file_name']}, Annotations: {len(anns)}")
for ann in anns[:3]:
cat = coco.loadCats(ann['category_id'])[0]
print(f" Category: {cat['name']}, Area: {ann['area']:.0f}px^2")
State-of-the-Art COCO Performance (2025):
| Model | AP (box) | AP (mask) | Parameters |
|---|---|---|---|
| YOLOv8x | 53.9 | - | 68M |
| DINO (Swin-L) | 63.3 | - | 218M |
| Co-DINO (Swin-L) | 64.1 | 54.0 | 218M |
| InternImage-H | 65.4 | 56.1 | 2.18B |
ADE20K - Semantic Segmentation
ADE20K, built by MIT CSAIL, is a semantic segmentation benchmark covering 150 categories across 25,000 images.
Key Metrics:
- mIoU (mean Intersection over Union): Average IoU between predicted and ground-truth masks
- aAcc: Pixel-level overall accuracy
- mAcc: Per-class mean accuracy
import numpy as np
def compute_miou(pred_mask, gt_mask, num_classes=150):
"""Compute mIoU."""
iou_list = []
for cls in range(num_classes):
pred_cls = (pred_mask == cls)
gt_cls = (gt_mask == cls)
intersection = np.logical_and(pred_cls, gt_cls).sum()
union = np.logical_or(pred_cls, gt_cls).sum()
if union == 0:
continue # Skip if class not present in image
iou = intersection / union
iou_list.append(iou)
return np.mean(iou_list) if iou_list else 0.0
# Evaluation with mmsegmentation
from mmseg.apis import inference_segmentor, init_segmentor
config_file = 'configs/segformer/segformer_mit-b5_8xb2-160k_ade20k-512x512.py'
checkpoint_file = 'segformer_mit-b5_8x2_512x512_160k_ade20k_20220617_203542-745f14da.pth'
model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'test_image.jpg')
Kinetics - Video Classification
Kinetics, provided by Google DeepMind, is a video action recognition benchmark.
- Kinetics-400: 400 action classes, ~300,000 clips
- Kinetics-600: 600 classes, ~500,000 clips
- Kinetics-700: 700 classes
Primary metrics: Top-1 and Top-5 accuracy (averaged per clip).
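In practice, video models are evaluated by sampling several clips per video and averaging their scores before taking the top-k prediction. A minimal sketch of that clip-averaging step, using a synthetic `logits` array (the function name and data are illustrative):

```python
import numpy as np

def video_topk_accuracy(clip_logits, label, k=5):
    """clip_logits: (num_clips, num_classes) array of per-clip scores.
    Averages scores over clips, then checks if the true label is in the top-k."""
    avg = clip_logits.mean(axis=0)        # average over clips
    topk = np.argsort(avg)[::-1][:k]      # indices of the k highest scores
    return int(label in topk)

# 3 clips from one video, 400 Kinetics classes; the true class is 7
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 400))
logits[:, 7] += 5.0                       # make class 7 dominate
print(video_topk_accuracy(logits, label=7, k=1))  # 1
```

Reported Kinetics numbers are the mean of this per-video score over the validation set.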
CIFAR-10/100
Small-scale image classification benchmarks widely used for rapid prototyping and paper validation.
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
def evaluate_cifar10(model, batch_size=128):
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
testset = torchvision.datasets.CIFAR10(
root='./data', train=False, download=True, transform=transform
)
testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)
model.eval()
correct = 0
total = 0
with torch.no_grad():
for images, labels in testloader:
outputs = model(images)
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
print(f"CIFAR-10 Accuracy: {accuracy:.2f}%")
return accuracy
3. NLP Benchmarks
GLUE (General Language Understanding Evaluation)
GLUE, published in 2018 by researchers at NYU, the University of Washington, and DeepMind, is an NLP model evaluation benchmark consisting of 9 different language understanding tasks.
GLUE Task Composition:
| Task | Description | Dataset | Metric |
|---|---|---|---|
| CoLA | Grammatical acceptability | 8,551 | Matthews Corr. |
| SST-2 | Sentiment classification | 67K | Accuracy |
| MRPC | Semantic equivalence | 3,700 | F1/Accuracy |
| STS-B | Sentence similarity score | 7K | Pearson/Spearman |
| QQP | Question pair similarity | 400K | F1/Accuracy |
| MNLI | Natural language inference (3-way) | 393K | Accuracy |
| QNLI | Question-answer inference | 105K | Accuracy |
| RTE | Textual entailment | 2,500 | Accuracy |
| WNLI | Winograd NLI | 634 | Accuracy |
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
from sklearn.metrics import matthews_corrcoef
def evaluate_glue_cola(model_name="bert-base-uncased"):
"""Evaluate CoLA (grammatical acceptability)."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=2
)
dataset = load_dataset("glue", "cola")
val_data = dataset["validation"]
predictions = []
labels = []
model.eval()
import torch
for item in val_data:
inputs = tokenizer(
item['sentence'],
return_tensors='pt',
padding=True,
truncation=True,
max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()
predictions.append(pred)
labels.append(item['label'])
mcc = matthews_corrcoef(labels, predictions)
print(f"CoLA Matthews Correlation: {mcc:.4f}")
return mcc
def evaluate_glue_sst2(model_name="textattack/bert-base-uncased-SST-2"):
"""Evaluate SST-2 (sentiment classification)."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
dataset = load_dataset("glue", "sst2")
val_data = dataset["validation"]
correct = 0
total = len(val_data)
model.eval()
import torch
for item in val_data:
inputs = tokenizer(
item['sentence'],
return_tensors='pt',
truncation=True,
max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()
if pred == item['label']:
correct += 1
acc = correct / total
print(f"SST-2 Accuracy: {acc:.4f}")
return acc
SuperGLUE
When GLUE approached saturation (near-human performance), SuperGLUE was introduced in 2019 with harder tasks.
SuperGLUE Tasks:
- BoolQ: Yes/no question answering (9,427)
- CB: Commitment/entailment (250, 3-way)
- COPA: Cause/effect reasoning (1,000)
- MultiRC: Multi-sentence reading comprehension (9,693)
- ReCoRD: Cloze-style reading comprehension (120K)
- RTE: Textual entailment recognition (5,749)
- WiC: Word-in-context disambiguation (9,600)
- WSC: Winograd Schema Challenge (554)
Human baseline: 89.8 / GPT-4-class models: 90+ (surpassing humans)
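To evaluate a generative model zero-shot on BoolQ (SuperGLUE's yes/no QA task), each item is typically flattened into a passage/question prompt and the generation is mapped back to a label. A minimal sketch; the field names follow the HuggingFace `super_glue`/`boolq` schema, and the item below is a made-up example:

```python
def boolq_prompt(item):
    """Format a BoolQ item (passage, question, label) as a yes/no prompt."""
    return (f"{item['passage']}\n"
            f"Question: {item['question']}?\n"
            f"Answer (yes or no):")

def parse_yes_no(generation):
    """Map a model generation onto BoolQ labels: 1 = yes, 0 = no."""
    return 1 if generation.strip().lower().startswith("yes") else 0

item = {
    "passage": "Barq's root beer is caffeinated.",
    "question": "does barq's root beer have caffeine in it",
    "label": 1,
}
print(boolq_prompt(item))
print(parse_yes_no("Yes, it does."))  # 1
```

Accuracy is then just the fraction of items where the parsed label matches `item['label']`.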
SQuAD 1.1 & 2.0
SQuAD (Stanford Question Answering Dataset) is a machine reading comprehension benchmark where answers are extracted from Wikipedia passages.
- SQuAD 1.1: 536 Wikipedia articles, 107,785 question-answer pairs. All answers exist within the passage.
- SQuAD 2.0: SQuAD 1.1 + 53,775 unanswerable questions added.
Evaluation Metrics:
- EM (Exact Match): Fraction of predictions exactly matching the gold answer
- F1 Score: Token-level partial match score
from datasets import load_dataset
from transformers import pipeline
def evaluate_squad(model_name="deepset/roberta-base-squad2"):
"""Evaluate SQuAD 2.0."""
qa_pipeline = pipeline("question-answering", model=model_name)
dataset = load_dataset("squad_v2", split="validation")
em_scores = []
f1_scores = []
no_answer_correct = 0
no_answer_total = 0
for item in dataset.select(range(200)):
context = item['context']
question = item['question']
answers = item['answers']
result = qa_pipeline(question=question, context=context)
predicted = result['answer'].lower().strip()
has_answer = len(answers['text']) > 0
if not has_answer:
no_answer_total += 1
# Official SQuAD 2.0 scoring: correct abstention on an unanswerable
# question counts as EM = F1 = 1 (here: abstain when confidence is low)
abstained = result['score'] < 0.1
if abstained:
no_answer_correct += 1
em_scores.append(int(abstained))
f1_scores.append(int(abstained))
else:
gold_answers = [a.lower().strip() for a in answers['text']]
em = max(int(predicted == gold) for gold in gold_answers)
em_scores.append(em)
best_f1 = 0
for gold in gold_answers:
pred_tokens = set(predicted.split())
gold_tokens = set(gold.split())
common = pred_tokens & gold_tokens
if len(common) == 0:
f1 = 0
else:
precision = len(common) / len(pred_tokens)
recall = len(common) / len(gold_tokens)
f1 = 2 * precision * recall / (precision + recall)
best_f1 = max(best_f1, f1)
f1_scores.append(best_f1)
print(f"SQuAD 2.0 Results (200 samples):")
print(f" EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
print(f" F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
if no_answer_total > 0:
print(f" No-Answer Accuracy: {no_answer_correct/no_answer_total*100:.1f}%")
WMT - Machine Translation
WMT (Workshop on Machine Translation) is an annual competition evaluating machine translation models across multiple language pairs (English-German, English-Chinese, English-Korean, etc.).
Key Metrics:
- BLEU (Bilingual Evaluation Understudy): Automatic evaluation based on n-gram precision
- COMET: Neural metric with high correlation to human judgment
- chrF: Character-level n-gram F-score
import sacrebleu
def compute_bleu(predictions, references):
"""Compute BLEU score."""
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")
print(f"BP: {bleu.bp:.3f}")
print(f"Ratio: {bleu.sys_len/bleu.ref_len:.3f}")
return bleu.score
from transformers import MarianMTModel, MarianTokenizer
def evaluate_translation(src_texts, tgt_texts, model_name="Helsinki-NLP/opus-mt-en-de"):
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
predictions = []
for text in src_texts[:100]:
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs, max_length=512)
pred = tokenizer.decode(translated[0], skip_special_tokens=True)
predictions.append(pred)
bleu_score = compute_bleu(predictions, tgt_texts[:100])
return bleu_score
4. LLM Capability Benchmarks
MMLU (Massive Multitask Language Understanding)
MMLU, published by Dan Hendrycks and collaborators at UC Berkeley in 2020, evaluates LLM knowledge and reasoning with multiple-choice questions across 57 academic disciplines, ranging from elementary to professional difficulty.
Domain Breakdown:
- STEM: Mathematics, physics, chemistry, computer science, engineering
- Humanities: History, philosophy, law, ethics
- Social Sciences: Psychology, economics, political science, sociology
- Other: Medicine, nutrition, moral scenarios, professional accounting
Each question is four-choice, with approximately 14,000 questions total.
MMLU Performance by Model:
| Model | MMLU Score | Year |
|---|---|---|
| GPT-3 (175B) | 43.9% | 2020 |
| Gopher (280B) | 60.0% | 2021 |
| GPT-4 | 86.4% | 2023 |
| Claude 3 Opus | 86.8% | 2024 |
| Gemini Ultra | 90.0% | 2024 |
| GPT-4o | 88.7% | 2024 |
| Human expert estimate | ~90% | - |
from datasets import load_dataset
def evaluate_mmlu(model_fn, subjects=None, num_few_shot=5):
"""MMLU evaluation function."""
if subjects is None:
subjects = ['abstract_algebra', 'anatomy', 'astronomy', 'college_mathematics']
results = {}
for subject in subjects:
dataset = load_dataset("lukaemon/mmlu", subject)
test_data = dataset['test']
dev_data = dataset['dev']
correct = 0
total = 0
# Build few-shot prompt
few_shot_examples = ""
for i, item in enumerate(dev_data.select(range(num_few_shot))):
few_shot_examples += f"Q: {item['input']}\n"
few_shot_examples += f"(A) {item['A']} (B) {item['B']} (C) {item['C']} (D) {item['D']}\n"
few_shot_examples += f"Answer: {item['target']}\n\n"
for item in test_data:
prompt = few_shot_examples
prompt += f"Q: {item['input']}\n"
prompt += f"(A) {item['A']} (B) {item['B']} (C) {item['C']} (D) {item['D']}\n"
prompt += "Answer:"
response = model_fn(prompt)
pred = response.strip()[0] if response.strip() else 'A'
if pred == item['target']:
correct += 1
total += 1
accuracy = correct / total
results[subject] = accuracy
print(f"{subject}: {accuracy:.3f} ({correct}/{total})")
overall = sum(results.values()) / len(results)
print(f"\nOverall average: {overall:.3f}")
return results
BIG-Bench (Beyond the Imitation Game Benchmark)
BIG-Bench, led by Google, consists of 204 diverse tasks designed to probe the limits of LLMs. It includes creative reasoning, common sense, mathematics, and coding tasks that language models still struggle with.
BIG-Bench Hard: 23 difficult tasks where chain-of-thought prompting dramatically improves performance.
from lm_eval import evaluator
# BIG-Bench evaluation via lm-evaluation-harness
results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
tasks=["bigbench_causal_judgment", "bigbench_date_understanding"],
num_fewshot=3,
batch_size="auto"
)
print(results['results'])
HellaSwag - Commonsense Reasoning
HellaSwag, published in 2019, is a commonsense reasoning benchmark where the model selects the most natural sentence continuation from four choices.
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice
def evaluate_hellaswag(model_name="microsoft/deberta-v2-xxlarge"):
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)
dataset = load_dataset("hellaswag", split="validation")
correct = 0
total = min(500, len(dataset))
for item in dataset.select(range(total)):
context = item['ctx']
endings = item['endings']
label = int(item['label'])
# Encode each choice as a (context, ending) pair -- passing the full
# context+ending as the second sequence would duplicate the context
encoding = tokenizer(
[context] * 4,
endings,
return_tensors='pt',
padding=True,
truncation=True,
max_length=256
)
with torch.no_grad():
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
logits = outputs.logits
predicted = logits.argmax(dim=-1).item()
if predicted == label:
correct += 1
accuracy = correct / total
print(f"HellaSwag Accuracy: {accuracy:.4f}")
return accuracy
ARC (AI2 Reasoning Challenge)
ARC, published by AI2 (Allen Institute for AI), is an elementary-to-high-school level science question benchmark.
- ARC-Easy: Relatively straightforward questions (5,197)
- ARC-Challenge: Difficult questions that even retrieval-based models get wrong (1,172)
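ARC items pair a question with lettered choices and an `answerKey`; scoring reduces to comparing the model's chosen letter against that key. A minimal sketch using the field layout of the HuggingFace `allenai/ai2_arc` dataset (the item below is a made-up example):

```python
def arc_prompt(item):
    """Format an ARC item (question, choices, answerKey) as a prompt."""
    lines = [f"Question: {item['question']}"]
    for label, text in zip(item['choices']['label'], item['choices']['text']):
        lines.append(f"({label}) {text}")
    lines.append("Answer:")
    return "\n".join(lines)

def score_arc(predicted_letter, item):
    """1 if the predicted option letter matches the gold answerKey."""
    return int(predicted_letter.strip().upper() == item['answerKey'])

item = {
    "question": "Which gas do plants absorb during photosynthesis?",
    "choices": {"label": ["A", "B", "C", "D"],
                "text": ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"]},
    "answerKey": "B",
}
print(arc_prompt(item))
print(score_arc("b", item))  # 1
```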
TruthfulQA - Factuality Evaluation
TruthfulQA evaluates how accurately a model responds to questions about widely held misconceptions, myths, and biases.
from datasets import load_dataset
from transformers import pipeline
def evaluate_truthfulqa(model_name="gpt2-xl"):
"""TruthfulQA MC1 (single correct answer) evaluation."""
dataset = load_dataset("truthful_qa", "multiple_choice")
val_data = dataset["validation"]
generator = pipeline("text-generation", model=model_name)
correct = 0
total = min(100, len(val_data))
for item in val_data.select(range(total)):
question = item['question']
choices = item['mc1_targets']['choices']
labels = item['mc1_targets']['labels']
correct_idx = labels.index(1)
prompt = f"Q: {question}\nOptions:\n"
for i, choice in enumerate(choices):
letter = chr(65 + i)
prompt += f"{letter}. {choice}\n"
prompt += "Answer:"
response = generator(prompt, max_new_tokens=5, do_sample=False)
generated = response[0]['generated_text'][len(prompt):].strip()
pred_letter = generated[0].upper() if generated else 'A'
pred_idx = ord(pred_letter) - ord('A')
if pred_idx == correct_idx:
correct += 1
accuracy = correct / total
print(f"TruthfulQA MC1 Accuracy: {accuracy:.4f}")
return accuracy
GSM8K - Grade School Math
GSM8K (Grade School Math 8K), published by OpenAI in 2021, consists of 8,500 grade school math word problems evaluating step-by-step mathematical reasoning.
from datasets import load_dataset
import re
def extract_number(text):
"""Extract the final numeric answer from text."""
numbers = re.findall(r'-?\d+\.?\d*', text)
return numbers[-1] if numbers else None
def evaluate_gsm8k_chain_of_thought(model_fn, num_shot=8):
"""Evaluate GSM8K with Chain-of-Thought prompting."""
dataset = load_dataset("gsm8k", "main")
test_data = dataset['test']
few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans times 3 balls = 6 balls. 5 + 6 = 11 balls. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A: They started with 23. Used 20: 23 - 20 = 3. Then bought 6: 3 + 6 = 9. The answer is 9.
"""
correct = 0
total = min(200, len(test_data))
for item in test_data.select(range(total)):
question = item['question']
gold_answer = item['answer'].split('####')[-1].strip()
prompt = few_shot_prompt + f"Q: {question}\nA:"
response = model_fn(prompt, max_tokens=256)
pred = extract_number(response)
gold = extract_number(gold_answer)
if pred and gold and abs(float(pred) - float(gold)) < 0.01:
correct += 1
accuracy = correct / total
print(f"GSM8K Accuracy (Chain-of-Thought): {accuracy:.4f}")
return accuracy
HumanEval - Code Generation
HumanEval, published by OpenAI in 2021, consists of 164 Python function signatures with docstrings where the model must write the complete function.
Metric: pass@k
The probability that at least one of k generated samples passes all of the problem's unit tests.
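When n ≥ k samples are drawn per problem and c of them pass, the unbiased pass@k estimator from the Codex paper is 1 − C(n−c, k)/C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c passing, passes."""
    if n - c < k:  # fewer failures than k draws -> a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3 (simple pass rate for k=1)
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```

Sampling more generations than k and applying this formula gives a much lower-variance estimate than literally running k attempts.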
from datasets import load_dataset
import subprocess
import tempfile
import os
def evaluate_humaneval(model_fn, k=1, n=10, temperature=0.8):
"""HumanEval pass@k evaluation."""
dataset = load_dataset("openai_humaneval")
test_data = dataset['test']
task_results = {}
for item in test_data.select(range(20)):
task_id = item['task_id']
prompt = item['prompt']
tests = item['test']
entry_point = item['entry_point']
passes = 0
for attempt in range(n):
code = model_fn(prompt, temperature=temperature)
full_code = prompt + code + "\n" + tests + f"\ncheck({entry_point})"
with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
f.write(full_code)
tmp_path = f.name
try:
result = subprocess.run(
['python', tmp_path],
timeout=10,
capture_output=True,
text=True
)
if result.returncode == 0:
passes += 1
except subprocess.TimeoutExpired:
pass
finally:
os.unlink(tmp_path)
task_results[task_id] = passes / n
pass_at_1 = sum(task_results.values()) / len(task_results)
print(f"pass@1: {pass_at_1:.4f}")
return pass_at_1
# HumanEval performance of major models (2025)
humaneval_scores = {
"GPT-3 (175B)": 0.0,
"Codex (12B)": 0.288,
"GPT-4": 0.870,
"Claude 3.5 Sonnet": 0.900,
"DeepSeek-Coder-33B": 0.823,
"Llama 3.1 70B": 0.803,
}
MBPP - Python Programming
MBPP (Mostly Basic Python Problems), published by Google, consists of 974 Python programming problems covering a wider range of difficulty than HumanEval.
from datasets import load_dataset
def explore_mbpp():
"""Explore the MBPP dataset."""
dataset = load_dataset("mbpp")
test_data = dataset['test']
print("Sample MBPP problems:")
for item in test_data.select(range(3)):
print(f"\nTask ID: {item['task_id']}")
print(f"Problem: {item['text']}")
print(f"Test cases: {item['test_list'][:2]}")
print(f"Reference code:\n{item['code']}")
print("-" * 50)
5. Comprehensive LLM Evaluation
MT-Bench - Multi-Turn Dialogue Evaluation
MT-Bench, developed by the LMSYS team at UC Berkeley, is a multi-turn dialogue evaluation benchmark that uses GPT-4 as a judge, scoring responses on a 1-10 scale.
8 categories, 10 questions each:
- Writing
- Roleplay
- Reasoning
- Math
- Coding
- Extraction
- STEM
- Humanities
from openai import OpenAI
def mt_bench_judge(question, answer, reference_answer=None):
"""Evaluate MT-Bench response using GPT-4."""
client = OpenAI()
system_prompt = """You are a helpful assistant that evaluates AI responses.
Rate the response on a scale of 1-10 based on: accuracy, relevance, completeness, and clarity.
Output format: Score: X/10\nRationale: [brief explanation]"""
user_prompt = f"""Question: {question}
AI Response: {answer}
{f'Reference Answer: {reference_answer}' if reference_answer else ''}
Please evaluate this response."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.0
)
judge_response = response.choices[0].message.content
print(f"Evaluation:\n{judge_response}")
return judge_response
# Official MT-Bench usage with FastChat:
# git clone https://github.com/lm-sys/FastChat
# python -m fastchat.llm_judge.gen_model_answer --model-path your-model
# python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4
# python -m fastchat.llm_judge.show_result
LMSYS Chatbot Arena
Chatbot Arena has real users compare responses from two anonymous models and vote for the better one. Using the ELO rating system, it reflects genuine human preferences.
Top Models by ELO (March 2025, approximate):
| Rank | Model | ELO |
|---|---|---|
| 1 | GPT-4.5 | ~1370 |
| 2 | Gemini 2.0 Ultra | ~1360 |
| 3 | Claude 3.7 Sonnet | ~1350 |
| 4 | GPT-4o | ~1340 |
| 5 | Llama 3.3 70B | ~1250 |
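The ratings above are derived from pairwise battle outcomes. A minimal sketch of the classic online ELO update rule (Chatbot Arena actually fits a Bradley-Terry model over all battles, but the intuition is the same; starting ratings and the K-factor here are illustrative):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One ELO update after a battle. score_a: 1 = A wins, 0 = B wins, 0.5 = tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1000; model A wins three battles in a row.
# Each win moves fewer points than the last, since A becomes the favorite.
a, b = 1000.0, 1000.0
for _ in range(3):
    a, b = elo_update(a, b, score_a=1)
print(round(a, 1), round(b, 1))
```

Note that the update is zero-sum: points gained by one model are lost by the other, so the mean rating of the pool stays constant.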
HELM (Holistic Evaluation of Language Models)
HELM, developed by Stanford CRFM, evaluates models across 7 dimensions beyond simple accuracy:
- Accuracy
- Calibration
- Robustness
- Fairness
- Bias
- Toxicity
- Efficiency
# Run HELM evaluation
pip install crfm-helm
helm-run \
--conf src/helm/benchmark/presentation/run_specs_lite.conf \
--local \
--max-eval-instances 1000 \
--num-train-trials 1
# View results
helm-summarize --suite v1
helm-server
Open LLM Leaderboard (HuggingFace)
The HuggingFace Open LLM Leaderboard is a public leaderboard evaluating open-source LLMs on consistent benchmarks.
Evaluation Tasks:
- MMLU (5-shot)
- ARC Challenge (25-shot)
- HellaSwag (10-shot)
- TruthfulQA (0-shot)
- Winogrande (5-shot)
- GSM8K (5-shot)
from huggingface_hub import HfApi
def fetch_leaderboard_data():
"""Fetch Open LLM Leaderboard data."""
api = HfApi()
dataset_info = api.dataset_info("open-llm-leaderboard/results")
print(f"Last updated: {dataset_info.lastModified}")
files = api.list_repo_files(
repo_id="open-llm-leaderboard/results",
repo_type="dataset"
)
for f in list(files)[:5]:
print(f"File: {f}")
6. Korean Language Benchmarks
KLUE (Korean Language Understanding Evaluation)
KLUE, released in 2021 by a consortium of Korean industry and academic research groups, is a Korean language understanding benchmark consisting of 8 tasks.
KLUE Tasks:
| Task | Type | Data Size | Metric |
|---|---|---|---|
| TC (Topic Classification) | Document classification | 60K | Accuracy |
| STS (Semantic Textual Similarity) | Sentence similarity | 13K | Pearson |
| NLI (Natural Language Inference) | 3-way classification | 30K | Accuracy |
| NER (Named Entity Recognition) | Entity extraction | 21K | Entity F1 |
| RE (Relation Extraction) | Relation classification | 32K | micro-F1 |
| DP (Dependency Parsing) | Syntactic analysis | 23K | UAS/LAS |
| MRC (Machine Reading Comprehension) | Reading comprehension | 24K | EM/F1 |
| DST (Dialogue State Tracking) | Dialogue tracking | 10K | JGA |
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
def evaluate_klue_nli(model_name="klue/roberta-large"):
"""Evaluate KLUE-NLI."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=3
)
dataset = load_dataset("klue", "nli")
val_data = dataset['validation']
correct = 0
total = min(500, len(val_data))
model.eval()
for item in val_data.select(range(total)):
premise = item['premise']
hypothesis = item['hypothesis']
gold_label = item['label']
inputs = tokenizer(
premise, hypothesis,
return_tensors='pt',
truncation=True,
max_length=512
)
with torch.no_grad():
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()
if pred == gold_label:
correct += 1
accuracy = correct / total
print(f"KLUE-NLI Accuracy: {accuracy:.4f}")
return accuracy
def evaluate_klue_mrc(model_name="klue/roberta-large"):
"""Evaluate KLUE-MRC (Machine Reading Comprehension)."""
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model=model_name,
tokenizer=model_name
)
dataset = load_dataset("klue", "mrc")
val_data = dataset['validation']
em_scores = []
f1_scores = []
for item in val_data.select(range(100)):
context = item['context']
question = item['question']
answers = item['answers']['text']
result = qa_pipeline(question=question, context=context)
predicted = result['answer'].strip()
em = max(int(predicted == a) for a in answers)
em_scores.append(em)
        # Simplified character-set F1; the official KLUE-MRC metric counts token-level overlaps
        best_f1 = 0
for gold in answers:
pred_chars = set(predicted)
gold_chars = set(gold)
common = pred_chars & gold_chars
if common:
precision = len(common) / len(pred_chars)
recall = len(common) / len(gold_chars)
f1 = 2 * precision * recall / (precision + recall)
best_f1 = max(best_f1, f1)
f1_scores.append(best_f1)
print(f"KLUE-MRC EM: {sum(em_scores)/len(em_scores)*100:.1f}%")
print(f"KLUE-MRC F1: {sum(f1_scores)/len(f1_scores)*100:.1f}%")
KoBEST
KoBEST (Korean Balanced Evaluation of Significant Tasks) is a Korean benchmark focused on reasoning and language understanding, comprising 5 tasks:
- BoolQ: Yes/no question answering
- COPA: Cause/effect reasoning
- WiC: Word-in-context disambiguation
- HellaSwag: Commonsense completion
- SentiNeg: Negation sentiment understanding
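Tasks like COPA are typically scored by comparing an LM's likelihood for each candidate continuation and picking the higher one. A sketch of that decision rule, with `overlap_score` as a deliberately toy stand-in for a real log-likelihood function:

```python
def copa_predict(premise, alternative_1, alternative_2, score):
    """Pick the alternative the scorer prefers given the premise.

    `score(premise, alt)` stands in for an LM log-likelihood;
    higher means the continuation is judged more plausible.
    Returns 0 or 1, matching KoBEST's alternative labels.
    """
    s1 = score(premise, alternative_1)
    s2 = score(premise, alternative_2)
    return 0 if s1 >= s2 else 1

# Toy scorer: count words shared with the premise (illustration only)
def overlap_score(premise, alt):
    return len(set(premise.split()) & set(alt.split()))

pred = copa_predict("It rained, so the ground got wet",
                    "The sun came out",
                    "The ground got wet", overlap_score)
```

In a real evaluation, `score` would be the summed token log-probabilities of the alternative conditioned on the premise, which is exactly what harnesses like lm-evaluation-harness compute for multiple-choice tasks.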
KMMLU (Korean MMLU)
KMMLU is a Korean counterpart to MMLU built from original Korean exam and licensing-test questions rather than translations, and it covers Korea-specific subjects such as Korean history and Korean law alongside general academic topics.
from datasets import load_dataset
def evaluate_kmmlu_sample():
"""Explore the KMMLU dataset."""
dataset = load_dataset("HAERAE-HUB/KMMLU")
test_data = dataset['test']
print(f"Total questions: {len(test_data)}")
subjects = set(test_data['subject'])
print(f"Number of subjects: {len(subjects)}")
print(f"Sample subjects: {list(subjects)[:10]}")
item = test_data[0]
print(f"\nSubject: {item['subject']}")
print(f"Question: {item['question']}")
print(f"A: {item['A']}")
print(f"B: {item['B']}")
print(f"C: {item['C']}")
print(f"D: {item['D']}")
print(f"Answer: {item['answer']}")
7. Multimodal Benchmarks
VQA (Visual Question Answering)
VQA is the task of answering natural language questions about images.
- VQA v2: ~1.1M questions over COCO images. Each question is paired with two similar images that yield different answers, which prevents models from exploiting language priors alone.
- Metric: Accuracy = min(# annotators giving that answer / 3, 1), computed against 10 human answers per question
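The metric above translates directly into code. A minimal sketch of per-question VQA accuracy (the official evaluator additionally normalizes answer strings and averages over all 9-annotator subsets of the 10 answers):

```python
def vqa_accuracy(prediction, human_answers):
    """VQA accuracy for one question.

    prediction: the model's answer string
    human_answers: the 10 annotator answers for this question
    An answer earns full credit once at least 3 annotators gave it.
    """
    matches = sum(1 for a in human_answers if a == prediction)
    return min(matches / 3, 1.0)

# 4 of 10 annotators said "blue", so "blue" gets full credit
acc = vqa_accuracy("blue", ["blue"] * 4 + ["navy"] * 6)
```

The min(·, 1) cap is what makes the metric robust to annotator disagreement: a rare but valid answer given by only one or two annotators earns partial credit instead of zero.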
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image
def evaluate_vqa_blip():
"""VQA evaluation with BLIP."""
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()
questions = [
("What color is the car?", "test_car.jpg"),
("How many people are in the image?", "test_crowd.jpg"),
("Is it raining?", "test_outdoor.jpg")
]
for question, image_path in questions:
try:
image = Image.open(image_path).convert('RGB')
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
out = model.generate(**inputs, max_length=20)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"Q: {question}")
print(f"A: {answer}\n")
except FileNotFoundError:
print(f"Image not found: {image_path}")
MMBench
MMBench, published by Shanghai AI Lab, is a multimodal LLM evaluation benchmark covering 20 capability dimensions with 3,000 multiple-choice questions.
Sample Dimensions:
- Attribute Recognition
- Spatial Relationship
- Action Recognition
- OCR
- Commonsense Reasoning
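MMBench scores its multiple-choice questions with CircularEval: each question is asked once per circular rotation of the options, and credit is given only if the model picks the correct content every time, which filters out positional guessing. A sketch of the bookkeeping, with `ask_model` as a stand-in for a real VLM call:

```python
def circular_eval(options, correct_idx, ask_model):
    """Credit a question only if the model is right under every rotation.

    `ask_model(options)` stands in for a real model call: given the
    options in their presented order, it returns the chosen index.
    """
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        pred = ask_model(rotated)
        if rotated[pred] != options[correct_idx]:
            return False
    return True

# A model that truly knows the answer passes every rotation;
# one that always picks the first option does not.
opts = ["dog", "cat", "bird", "fish"]
knows = circular_eval(opts, 1, lambda o: o.index("cat"))
guesses_first = circular_eval(opts, 1, lambda o: 0)
```

Requiring all N rotations to pass makes the expected score of a random guesser roughly (1/N)^N instead of 1/N, so CircularEval accuracies run noticeably lower than naive multiple-choice accuracy.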
MMMU (Massive Multidiscipline Multimodal Understanding)
MMMU evaluates university-level multimodal understanding across 6 core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering), 30 subjects, and 11,550 questions.
from datasets import load_dataset
import torch
def explore_mmmu():
"""Explore the MMMU dataset."""
dataset = load_dataset("MMMU/MMMU", "Accounting")
print(f"Accounting validation: {len(dataset['validation'])} questions")
    import ast  # MMMU stores the choices as a stringified Python list
    item = dataset['validation'][0]
    print(f"\nQuestion: {item['question']}")
    options = ast.literal_eval(item['options'])
    for letter, opt in zip("ABCDE", options):
        print(f"Option {letter}: {opt}")
    print(f"Answer: {item['answer']}")
    if item['image_1']:
        print("Image included")
def evaluate_mmmu_with_llava(model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
processor = LlavaNextProcessor.from_pretrained(model_name)
model = LlavaNextForConditionalGeneration.from_pretrained(
model_name, torch_dtype="auto", device_map="auto"
)
dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")
correct = 0
total = min(50, len(dataset))
    import ast, re  # choices are a stringified list; re extracts the answer letter
    for item in dataset.select(range(total)):
        question = item['question']
        options = ast.literal_eval(item['options'])
        gold = item['answer']
        if item['image_1']:
            image = item['image_1']
            prompt = f"[INST] <image>\nQuestion: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device)
        else:
            prompt = f"[INST] Question: {question}\nOptions: {options}\nAnswer with only the option letter. [/INST]"
            inputs = processor(text=prompt, return_tensors='pt').to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=10)
        # Decode only the newly generated tokens, then take the first option letter
        generated = output[0][inputs['input_ids'].shape[1]:]
        response = processor.decode(generated, skip_special_tokens=True)
        match = re.search(r'[A-E]', response.upper())
        pred = match.group(0) if match else 'A'
if pred == gold:
correct += 1
acc = correct / total
print(f"MMMU-Accounting Accuracy: {acc:.3f}")
return acc
8. Using LM-Evaluation-Harness
EleutherAI's lm-evaluation-harness is the standard tool for LLM evaluation, supporting 100+ benchmarks.
Installation and Basic Usage
# Install
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
# Evaluate MMLU with GPT-2
lm_eval --model hf \
--model_args pretrained=gpt2 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/gpt2_mmlu
# Evaluate multiple tasks simultaneously
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
--tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,gsm8k \
--num_fewshot 5 \
--batch_size 4 \
--output_path results/llama3.2_3b
# Run with 4-bit quantization
lm_eval --model hf \
--model_args pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 1
Python API
import lm_eval
from lm_eval import evaluator
import json
import os
def run_comprehensive_evaluation(model_path, output_dir="./results"):
"""Run comprehensive LM-Evaluation-Harness evaluation."""
os.makedirs(output_dir, exist_ok=True)
task_groups = {
"knowledge": ["mmlu", "arc_challenge", "arc_easy"],
"reasoning": ["hellaswag", "winogrande", "piqa"],
"truthfulness": ["truthfulqa_mc1"],
"math": ["gsm8k"],
"coding": ["humaneval"],
}
all_results = {}
fewshot_map = {
"mmlu": 5, "arc_challenge": 25, "arc_easy": 25,
"hellaswag": 10, "winogrande": 5, "piqa": 0,
"truthfulqa_mc1": 0, "gsm8k": 5, "humaneval": 0
}
    import torch  # local import: used only for device selection
    for group, tasks in task_groups.items():
        print(f"\n=== Evaluating {group.upper()} ===")
        group_results = {}
        # Evaluate tasks one at a time so each uses its own few-shot setting
        for task in tasks:
            results = evaluator.simple_evaluate(
                model="hf",
                model_args=f"pretrained={model_path}",
                tasks=[task],
                num_fewshot=fewshot_map.get(task, 0),
                batch_size="auto",
                device="cuda" if torch.cuda.is_available() else "cpu",
            )
            group_results.update(results['results'])
        all_results[group] = group_results
        for task, metrics in group_results.items():
            if 'acc,none' in metrics:
                print(f"  {task}: {metrics['acc,none']*100:.1f}%")
            elif 'exact_match,strict-match' in metrics:
                print(f"  {task}: {metrics['exact_match,strict-match']*100:.1f}%")
with open(f"{output_dir}/evaluation_results.json", "w", encoding="utf-8") as f:
json.dump(all_results, f, ensure_ascii=False, indent=2)
print(f"\nResults saved to: {output_dir}/evaluation_results.json")
return all_results
def compare_models(model_paths, tasks=None):
"""Compare multiple models on the same tasks."""
if tasks is None:
tasks = ["mmlu", "arc_challenge", "hellaswag", "gsm8k"]
comparison = {}
for model_path in model_paths:
print(f"\nEvaluating: {model_path}")
results = evaluator.simple_evaluate(
model="hf",
model_args=f"pretrained={model_path}",
tasks=tasks,
num_fewshot=5,
batch_size="auto"
)
model_scores = {}
for task, metrics in results['results'].items():
for metric, value in metrics.items():
                if isinstance(value, (int, float)) and 'stderr' not in metric:
model_scores[f"{task}/{metric}"] = round(value * 100, 2)
comparison[model_path.split('/')[-1]] = model_scores
print("\n" + "="*80)
print("Model Comparison Results:")
print("="*80)
all_metrics = sorted(set().union(*[s.keys() for s in comparison.values()]))
header = f"{'Metric':<40}" + "".join(f"{m[:15]:<18}" for m in comparison.keys())
print(header)
print("-" * 80)
for metric in all_metrics:
if 'acc,none' in metric or 'exact_match' in metric:
row = f"{metric:<40}"
for model_name in comparison:
score = comparison[model_name].get(metric, "N/A")
row += f"{score:<18}"
print(row)
return comparison
Adding a Custom Task
# custom_task.py
# Note: recent versions of lm-evaluation-harness define tasks via YAML configs;
# this Python subclass follows the older Task API.
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
class CustomQATask(Task):
"""Custom Q&A task for lm-evaluation-harness."""
VERSION = 1.0
DATASET_PATH = "your-org/your-qa-dataset"
DATASET_NAME = None
def has_training_docs(self):
return False
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
def validation_docs(self):
return self.dataset["validation"]
def test_docs(self):
return self.dataset["test"]
def doc_to_text(self, doc):
return f"Question: {doc['question']}\nAnswer:"
def doc_to_target(self, doc):
return " " + doc['answer']
def construct_requests(self, doc, ctx):
return [Instance(
request_type="generate_until",
doc=doc,
arguments=(ctx, {"until": ["\n", "Question:"]}),
idx=0
)]
def process_results(self, doc, results):
gold = doc['answer'].lower().strip()
pred = results[0].lower().strip()
return {"exact_match": int(gold == pred)}
def aggregation(self):
return {"exact_match": "mean"}
def higher_is_better(self):
return {"exact_match": True}
Summary
AI benchmark datasets are the compass guiding AI research and development. Key takeaways:
Computer Vision:
- ImageNet: The gold standard for 1,000-class image classification
- COCO: The standard for object detection and segmentation
- ADE20K: Primary benchmark for semantic segmentation
NLP:
- GLUE/SuperGLUE: Comprehensive language understanding evaluation
- SQuAD: The standard benchmark for machine reading comprehension
LLM Capabilities:
- MMLU: Knowledge evaluation across 57 disciplines (broadest scope)
- HumanEval: Code generation capability evaluation
- GSM8K: Mathematical reasoning evaluation
Comprehensive Evaluation:
- HELM: Balanced evaluation across 7 dimensions
- Chatbot Arena: Human preference-based ELO ratings
- Open LLM Leaderboard: Comparing open-source LLMs
Korean Language:
- KLUE: 8-task Korean language understanding evaluation
- KMMLU: Korean knowledge evaluation
When interpreting benchmark results, always consider the possibility of data contamination, measurement bias, and the gap from real-world deployment conditions. A single benchmark cannot capture the full picture — comprehensive evaluation across multiple dimensions better reflects a model's genuine capabilities.