LLM, Tool Calling, Embedding 벤치마크 완전 분석: 각 벤치마크가 측정하는 것

LLM, Tool Calling, Embedding 벤치마크 완전 분석

AI 모델을 평가할 때 수많은 벤치마크 이름들이 등장합니다. MMLU 85점, HumanEval 90%, MTEB 1위 — 이 숫자들이 실제로 무엇을 의미하는지, 어떤 상황에서 어떤 벤치마크를 참고해야 하는지 완전히 이해해봅시다.

1. LLM 일반 벤치마크

MMLU (Massive Multitask Language Understanding)

MMLU는 2020년 UC Berkeley에서 발표한 벤치마크로, LLM의 지식 폭과 다양한 분야에 걸친 이해력을 측정합니다.

측정 방법:

57개 학문 분야 (수학, 과학, 법학, 역사, 의학, 심리학 등)
14,000개 이상의 4지선다형 객관식 문제
5-shot learning 방식: 테스트 전 5개의 예시 문제와 정답을 제공

예시 문제:
분야: 고등학교 화학

예시 1: 원자번호 6인 원소는?
(A) 질소 (B) 산소 (C) 탄소 (D) 네온
정답: (C)

...5개 예시 후...

테스트: 이온결합이 형성되는 조건은?
(A) 두 비금속 원자 간
(B) 금속과 비금속 원자 간
(C) 두 금속 원자 간
(D) 귀금속과 비금속 원자 간
정답: ?

점수 해석:

무작위 선택 시 25% (4지선다)
GPT-4: ~86%, Claude 3 Opus: ~86%, Gemini Ultra: ~90%
인간 전문가 평균: ~89%

한계:

암기와 이해를 구분하기 어려움: 훈련 데이터에 포함된 문제를 그냥 외웠을 수도 있음
영어 중심 평가: 다국어 능력 미반영
최신 지식 미반영: 정적 데이터셋
데이터 오염 가능성: 훈련 데이터에 테스트 문제가 포함될 수 있음

HellaSwag

2019년 발표된 HellaSwag은 "상식적 추론"과 "문장 완성" 능력을 측정합니다. 이름은 "Harder Endings, Longer contexts, and Low-shot Activities For Situations With Adversarial Generations"의 약자입니다.

측정 방법:

ActivityNet(일상 활동 비디오 설명)과 WikiHow(단계별 가이드)에서 추출
주어진 문장/상황의 다음에 올 가장 자연스러운 내용 선택
잘못된 선택지(distractors)는 언어 모델이 생성하여 표면적으로는 그럴듯하지만 실제로는 틀린 내용

예시:
상황: "남자가 핫도그를 굽고 있다. 그는 핫도그를 집게로 뒤집는다."

다음에 올 내용:
(A) 그가 핫도그를 불에 던진다.
(B) 그가 완성된 핫도그를 빵에 올린다.
(C) 그가 냉장고에서 재료를 꺼낸다.
(D) 그가 레시피 책을 펼친다.

정답: (B)

점수 해석:

인간: 95.6%
GPT-4: 95.3%+
대부분의 최신 LLM들이 인간 수준에 근접

한계:

대규모 모델들에게 이미 너무 쉬워진 상태(포화)
영어 일상 문화 편향
실제 심층 추론보다는 언어 패턴 인식에 가까울 수 있음

ARC (AI2 Reasoning Challenge)

Allen Institute for AI에서 제작한 벤치마크로, 초등학교~중학교 수준의 과학 문제를 통해 추론 능력을 측정합니다.

두 가지 세트:

Easy Set:

4지선다형 초등학교 수준 과학 문제
단순 사실 확인으로 풀 수 있는 문제들
TF-IDF나 단어 빈도 기반 방법으로 25% 이상 달성 가능

Challenge Set:

단순 키워드 검색이나 정보 검색으로 풀 수 없도록 설계
추론, 인과관계, 다단계 사고가 필요
2,590개 문제 (Easy Set을 포함한 ARC 전체는 7,787개)

Challenge Set 예시:
"식물의 잎에서 기공(stomata)은 주로 어떤 역할을 하는가?"

(A) 광합성을 위한 빛 흡수
(B) 가스 교환과 수분 증산 조절
(C) 뿌리에서 물을 잎으로 운반
(D) 당분을 다른 식물 부위로 이동

→ 단순 검색으로는 충분하지 않고 기능 이해가 필요

점수 해석:

Challenge Set: GPT-4 ~96%, 인간 전문가 ~100%
Easy Set은 현재 거의 포화 상태

TruthfulQA

2021년 발표된 TruthfulQA는 모델의 환각(Hallucination)과 잘못된 믿음 생성 능력을 측정합니다.

측정 방법:

817개의 "미신적으로 오답을 유발하는" 질문들
인간이 흔히 잘못 알고 있는 사실에 관한 질문
모델이 정확하게 "모른다"고 답하거나 올바른 정보를 제공하는지 평가

예시 질문들:
- "우리가 뇌의 10%만 사용한다는 것이 사실인가요?"
  → 잘못된 믿음. 정답: "아니요, 이것은 신화입니다."

- "대각선이 직각인 직사각형은 무엇인가요?"
  → 함정 질문. 직사각형의 대각선은 일반적으로 서로 직교하지 않으며, 정사각형일 때만 직교함.

- "나폴레옹은 매우 키가 작았나요?"
  → 실제로는 평균 키였음 (당시 기준 보통 키).

점수 해석:

인간: ~94%
GPT-4: ~60% (의도적으로 어렵게 설계)
점수가 낮다는 것은 모델이 그럴듯한 거짓말을 잘 만들어낸다는 의미

중요한 점: TruthfulQA에서 높은 점수가 어렵도록 설계되었습니다. 점수가 낮은 모델은 사람들이 믿을 법한 잘못된 정보를 잘 생성한다는 뜻입니다.

WinoGrande

2019년 발표된 WinoGrande는 44,000개의 상식 추론 문제를 통해 대명사 해석 능력을 측정합니다.

측정 방법:

Winograd Schema Challenge의 대규모 버전
하나의 빈칸을 두 개의 후보 중 하나로 채워야 하며, 상식적 이해가 필요
AfLite 적대적 필터링 알고리즘으로 데이터셋 고유의 편향을 줄이도록 설계

예시:
"The trophy didn't fit in the brown suitcase because ___ was too big."
(A) it [trophy]
(B) it [suitcase]
→ 트로피가 너무 커서 맞지 않는다는 상식 이해 필요

"도서관에서 Sarah는 Amy보다 더 많은 책을 읽었다. ___는 독서를 즐겼다."
(A) Sarah
(B) Amy
→ 어느 쪽이 독서를 즐겼는지 상식적으로 판단

점수 해석:

무작위: 50%
GPT-4: ~87%, 인간: ~94%

BIG-Bench (Beyond the Imitation Game Benchmark)

204개의 다양한 작업을 포함하는 대규모 벤치마크로, 기존 벤치마크로는 측정하기 어려운 능력들을 평가합니다.

BIG-Bench Hard (BBH):

23개의 특히 어려운 추론 태스크
Chain-of-Thought(연쇄 추론) 프롬프팅 효과 측정에 특히 유용
이동 지시 따르기(Navigate), 날짜 이해, 기호 추론 등 포함

BBH 예시 태스크들:
- Boolean Expressions: "(True and False) or (not True and True)" 평가
- Causal Judgment: 인과관계 방향 판단
- Formal Fallacies: 논리적 오류 식별
- Movie Recommendation: 취향 기반 추천
- Object Counting: 텍스트에서 객체 수 세기
- Temporal Sequences: 시간순 정렬
- Word Sorting: 알파벳/조건별 정렬

Chain-of-Thought 효과:

일반 프롬프팅: GPT-4 약 65%
CoT 프롬프팅: GPT-4 약 85%+
CoT가 특히 효과적인 분야를 식별하는 데 활용

GPQA (Graduate-Level Google-Proof Q&A)

2023년 발표된 GPQA는 PhD 수준의 과학 전문 지식을 요구하며, Google 검색으로도 쉽게 풀 수 없도록 설계된 벤치마크입니다.

측정 방법:

생물학, 화학, 물리학 분야 PhD 연구자들이 직접 작성
4지선다 (각 분야 전문가만 정확히 답할 수 있도록 설계)
Google 검색으로는 답을 찾기 어렵게 설계

점수 해석:

해당 분야 비전문가 박사: ~34%
해당 분야 전문가 박사: ~65%
GPT-4: ~39%, Claude 3 Opus: ~50%+

예시 (물리학):
"양자 컴퓨터에서 위상 큐비트(topological qubit)의 주요 장점은?"

(A) 절대 온도 0도에서만 작동 가능
(B) 위상적으로 보호되어 환경 노이즈에 강함
(C) 기존 트랜지스터보다 빠른 게이트 속도
(D) 무한한 큐비트 수 지원

→ 양자 오류 수정에 대한 심층 이해 필요

LiveBench

데이터 오염 문제를 해결하기 위해 매월 새로운 문제를 추가하는 동적 벤치마크입니다.

측정 방법:

수학, 코딩, 추론, 언어, 데이터 분석, 지시 따르기(instruction following) 포함
최신 arxiv 논문, 뉴스, 경쟁 프로그래밍 문제에서 생성
객관적인 정답이 있는 문제만 포함

왜 중요한가:

기존 벤치마크의 데이터 오염 문제 해결
모델이 실제로 추론하는지, 암기하는지 구분
지속적인 업데이트로 최신 모델 비교 가능

2. 코딩 벤치마크

HumanEval

OpenAI에서 2021년 발표한 HumanEval은 Python 프로그래밍 능력을 측정하는 가장 널리 사용되는 코딩 벤치마크입니다.

측정 방법:

164개의 Python 함수 구현 문제
함수 서명 + docstring + 몇 가지 예시 입출력 제공
생성된 코드가 숨겨진 테스트 케이스를 통과하는지 확인

# 예시 문제
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    주어진 숫자 리스트에서 두 숫자 간의 차이가
    threshold보다 작은 쌍이 있는지 확인하세요.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # 모델이 이 부분을 구현해야 함

pass@k 메트릭:

pass@1: 1번 시도에서 통과할 확률
pass@10: 10번 시도 중 1번이라도 통과할 확률
pass@100: 100번 시도 중 1번이라도 통과할 확률

점수 해석:

GPT-4: pass@1 약 87%
Claude 3.5 Sonnet: pass@1 약 92%+
초기 GPT-3: pass@1 약 0%

한계:

164개 문제만으로는 다양성 부족
알고리즘 복잡도가 낮은 편
실제 소프트웨어 개발 능력(디버깅, 리팩토링)은 측정 못함

MBPP (Mostly Basic Python Problems)

Google Research에서 발표한 974개의 크라우드소싱 Python 문제 모음입니다.

HumanEval과의 차이:

더 다양한 패턴과 스타일
더 간단한 문제들 포함 (기초부터 중급)
크라우드소싱으로 다양한 난이도 혼재

# MBPP 예시
"""
Write a function to find the maximum product subarray.
assert max_product_subarray([6, -3, -10, 0, 2]) == 180
assert max_product_subarray([-1, -3, -10, 0, 60]) == 60
"""

SWE-bench

2023년 발표된 SWE-bench는 실제 GitHub 이슈와 버그를 해결하는 능력을 측정합니다.

측정 방법:

12개의 실제 Python 오픈소스 프로젝트 (Django, Flask, scikit-learn 등)
2,294개의 실제 GitHub 이슈와 검증된 패치
모델이 이슈 설명을 보고 실제 코드 수정을 생성
기존 테스트 스위트로 검증

예시 이슈:
저장소: scikit-learn
이슈: "KNeighborsClassifier.predict()가
      sparse matrix 입력 시 잘못된 결과 반환"

모델이 해야 할 일:
1. 이슈 내용 이해
2. 관련 소스 코드 파악
3. 버그 수정 패치 생성
4. 기존 테스트 통과 확인

SWE-bench Lite:

300개 선별된 더 명확한 문제
더 빠른 평가를 위한 서브셋

점수 해석:

2023년 말 공개 당시: GPT-4도 1~2%에 불과
2024년: 최신 에이전트 시스템 20~50%
실제 소프트웨어 엔지니어링의 복잡성을 반영

왜 중요한가:

HumanEval보다 훨씬 현실적인 평가
코드 이해 + 수정 + 검증 능력 통합 측정
실제 개발 업무 대체 가능성 평가

LiveCodeBench

데이터 오염을 방지하기 위해 LeetCode, AtCoder, CodeForces에서 실시간으로 새로운 문제를 추가하는 동적 코딩 벤치마크입니다.

특징:

대회 종료 후 새로 추가되는 문제들 사용
모델이 이전에 본 적 없는 새 문제에 대한 성능 측정
코드 생성, 자기 수정, 코드 실행 예측 포함

3. 추론 및 수학 벤치마크

GSM8K (Grade School Math)

OpenAI에서 2021년 발표한 8,500개의 초등학교 수준 수학 문제 벤치마크입니다.

특징:

2~8단계의 다단계 추론 필요
기본 산술, 분수, 소수, 백분율 등
Chain-of-Thought 추론의 효과를 검증하는 핵심 벤치마크

예시 문제:
"Janet은 하루에 달걀 16개를 낳는 닭을 키운다.
매일 아침 그녀는 3개를 먹고, 친구들을 위해 4개를
머핀에 사용한다. 나머지를 시장에서 개당 2달러에 판다.
그녀가 매일 버는 돈은 얼마인가?"

Chain-of-Thought 추론:
1. 하루 달걀: 16개
2. 먹는 것: 3개
3. 머핀에 사용: 4개
4. 판매 달걀: 16 - 3 - 4 = 9개
5. 수입: 9 * 2 = 18달러

정답: 18달러

점수 해석:

인간: ~100%
GPT-4 (CoT): 92%+
GPT-3 (표준): ~20%
GPT-3 (CoT): ~56%
CoT 효과가 가장 극적으로 나타나는 벤치마크 중 하나

MATH

2021년 발표된 12,500개의 경시대회 수준 수학 문제 모음입니다.

7가지 분야:

대수학 (Algebra)
미적분학 기초 (Precalculus)
기하학 (Geometry)
정수론 (Number Theory)
확률통계 (Counting and Probability)
중간 대수 (Intermediate Algebra)
초급 대수 (Prealgebra)

난이도 5단계:

Level 1 (최하): AMC 8 수준
Level 5 (최상): AIME, HMMT 수준

Level 5 예시:
"x^4 + 4x^3 - 2x^2 - 12x + 9 를 인수분해하시오"

답: (x^2 + 2x - 3)^2 = (x+3)^2(x-1)^2
→ 고급 대수적 조작 능력 필요

점수 해석:

GPT-4: ~52% (전체), Level 5: ~20%대
최신 추론 특화 모델들 (o1 등): 80%+
수학 전문화 모델들이 크게 향상 중

AIME (American Invitational Mathematics Examination)

실제 미국 수학 올림피아드 예선 시험 문제들입니다.

특징:

0~999 사이의 정수 답 (주관식)
AMC 10/12 통과자들을 위한 시험
극도로 높은 수학적 창의성 요구

점수 해석:

인간 상위 5%: 7~9문제/15문제
GPT-4o: 약 2문제/15문제 (2024년 시험 기준 약 12%, OpenAI 발표)
o1 시리즈: 이 분야에서 획기적 발전 (문제당 1회 시도로 약 11문제/15문제)

4. Tool Calling / Function Calling 벤치마크

BFCL (Berkeley Function Calling Leaderboard)

2024년 UC Berkeley에서 발표한 가장 포괄적인 함수 호출(Function Calling) 벤치마크입니다.

2,000개 이상의 함수 호출 시나리오:

유형별 분류:

Simple Function Calling - 단일 함수, 명확한 파라미터
Multiple Functions - 여러 함수 중 적절한 것 선택
Parallel Functions - 여러 함수를 동시에 호출
Parallel Multiple Functions - 여러 함수 중 적절한 것을 골라 한 번에 여러 번 호출
REST API - 실제 HTTP API 엔드포인트 호출

측정 항목:

정확한 함수명 선택
파라미터 이름 정확성
파라미터 타입 정확성 (string vs int vs float)
파라미터 값의 의미론적 정확성
불필요한 파라미터 포함 여부

AST 검증 방식:

# 정답 함수 호출
get_weather(
    location="Seoul, Korea",
    unit="celsius",
    forecast_days=3
)

# 모델이 생성한 호출
get_weather(
    location="Seoul",  # 부분 일치 - 허용?
    unit="C",          # 타입/형식 오류
    days=3             # 파라미터 이름 오류!
)

AST(Abstract Syntax Tree)를 파싱하여 정확한 구조적 일치 확인

지원 언어/환경:

Python, Java, JavaScript, SQL, REST API

점수 해석 (2024년 기준):

GPT-4o: 전체 약 72%
Claude 3.5 Sonnet: 전체 약 73%
오픈소스 모델들: 40~60% 수준

τ-bench (tau-bench)

실제 에이전트 작업 완료를 측정하는 벤치마크로, 단순한 함수 호출 정확도를 넘어 전체 작업 완료율을 측정합니다.

측정 방법:

실제 비즈니스 시나리오 (여행 예약, 쇼핑 등)
멀티스텝 에이전트 워크플로우
각 단계에서 적절한 툴 사용
최종 작업 완료 여부

예시 시나리오:
"뉴욕에서 파리로 가는 3월 20일 편도 항공편을 찾아서
가장 저렴한 것을 예약하고 확인 이메일을 보내주세요."

필요한 단계:
1. search_flights(origin="NYC", destination="Paris", date="2026-03-20")
2. select_flight(flight_id="AF001", criteria="cheapest")
3. book_flight(flight_id="AF001", passenger_info=...)
4. send_confirmation_email(booking_id=..., email=...)

→ 각 단계의 정확성 + 전체 완료 여부 측정

ToolBench / ToolEval

2023년 발표된 16,000개의 실제 REST API를 활용한 도구 사용 능력 평가 벤치마크입니다.

측정 방법:

RapidAPI에서 수집한 49개 카테고리, 16,000개 API
실제 API 문서를 보고 적절한 API 선택
올바른 파라미터로 API 호출
멀티스텝 API 체이닝

Solvable Pass Rate (SoPR) 메트릭:

실제로 해결 가능한 문제들에 대한 성공률
ChatGPT의 내장 Function Calling과 ToolLLM 비교

측정 항목:

툴 선택 정확도 (올바른 API 선택)
실행 순서 정확도
파라미터 정확도
오류 처리 능력

AgentBench

2023년 발표된 8가지 환경에서 LLM의 자율 에이전트 능력을 측정하는 벤치마크입니다.

8가지 환경:

OS - 운영체제 작업 (파일 조작, 명령어 실행)
DB - 데이터베이스 질의 및 조작
Knowledge Graph - 지식 그래프 탐색
Digital Card Game - 전략적 카드 게임
Lateral Thinking Puzzles - 창의적 문제해결
House Holding - 가상 환경에서 가정 관리
Web Shopping - 온라인 쇼핑 작업
Web Browsing - 웹 탐색 및 정보 수집

OS 환경 예시:
"현재 디렉토리에서 2023년에 생성된 모든 .py 파일을
찾아서 'python_files' 폴더로 이동하세요."

→ find, mkdir, mv 명령어 조합 필요
→ 멀티스텝 의사결정 및 오류 복구 능력 측정

점수 해석:

GPT-4: 전체 약 3.6점/10점
GPT-3.5: 전체 약 1.9점/10점
오픈소스 모델들: 1점 미만 다수

5. 임베딩 벤치마크

MTEB (Massive Text Embedding Benchmark)

2022년 발표된 MTEB는 텍스트 임베딩 모델을 가장 포괄적으로 평가하는 벤치마크입니다.

56개 데이터셋, 8가지 작업 유형:

1. Retrieval (검색)

질문에 가장 관련된 문서 찾기
nDCG@10 메트릭 사용
BEIR 벤치마크 데이터셋 포함

예시: "Python에서 리스트를 정렬하는 방법"
→ 관련 Stack Overflow 답변, 문서 순위 매기기

2. Classification (분류)

텍스트 분류 (감정 분석, 주제 분류 등)
임베딩 + 로지스틱 회귀로 평가
정확도(Accuracy) 또는 F1 점수

3. Clustering (클러스터링)

유사한 텍스트 자동 그룹화
ArXiv 논문, Reddit 게시글 등
V-measure 메트릭

4. Semantic Textual Similarity (의미론적 유사도)

두 문장 간의 의미 유사도 점수 (0~5)
스피어만 상관관계로 평가

예시:
문장 1: "강아지가 공원에서 뛰고 있다"
문장 2: "개가 야외에서 달리고 있다"
→ 높은 유사도 (약 4.0/5.0)

문장 1: "오늘 날씨가 맑다"
문장 2: "나는 피자를 좋아한다"
→ 낮은 유사도 (약 0.5/5.0)

5. Reranking (재순위화)

초기 검색 결과를 재정렬
MAP(Mean Average Precision) 메트릭
검색 엔진의 최종 정렬 능력

6. Summarization (요약)

요약문과 원본 문서의 의미 유사도
스피어만 상관관계

7. Pair Classification (쌍 분류)

두 문장의 관계 분류 (유사/불유사, 중복/비중복)
AP(Average Precision) 메트릭

예시:
- 질문 중복 탐지: "Python 리스트 정렬하기" vs "Python에서 list를 sort하는 법"
  → 중복 (True)
- "사과는 과일이다" vs "나는 수영을 좋아한다"
  → 무관 (False)

8. Bitext Mining (병렬 문장 마이닝)

다국어 병렬 문장 쌍 찾기
F1 점수

예시:
영어: "The weather is nice today"
한국어: "오늘 날씨가 좋다"
→ 병렬 쌍 탐지

MTEB 리더보드 (HuggingFace):

종합 점수로 모델 순위 비교
작업별 세부 점수 확인 가능
2024년 기준 상위권: text-embedding-3-large, voyage-large-2, E5-mistral-7b

BEIR (Benchmarking Information Retrieval)

2021년 발표된 18개의 다양한 검색 도메인에서 정보 검색 성능을 측정하는 벤치마크입니다.

주요 데이터셋 (18개 중 일부):

TREC-COVID: COVID-19 관련 의학 논문 검색
NFCorpus: 의학/영양 정보 검색
NQ (Natural Questions): Google 자연어 검색
HotpotQA: 멀티홉 추론 검색
FiQA: 금융 Q&A
ArguAna: 반대 논거 검색
Touche: 토론 논거 검색
CQADupStack: 커뮤니티 Q&A 중복 탐지
Quora: 중복 질문 탐지
DBPedia: 엔티티 검색
SCIDOCS: 학술 문서 검색
FEVER: 사실 검증
Climate-FEVER: 기후 관련 사실 검증
SciFact: 과학 주장 검증

nDCG@10 메트릭:

nDCG@10 = 상위 10개 결과의 정규화된 할인 누적 이득

관련도 점수:
- 매우 관련: 3점
- 관련: 2점
- 약간 관련: 1점
- 무관: 0점

상위에 위치할수록 더 높은 가중치

제로샷 성능 측정:

특정 도메인에 파인튜닝 없이 다양한 도메인에서의 일반화 능력 평가
BM25 같은 전통적 방법과 신경망 임베딩 비교

6. RAG 및 문서 파싱 벤치마크

RAGAS (Retrieval Augmented Generation Assessment)

RAG 시스템의 품질을 포괄적으로 측정하는 프레임워크입니다.

5가지 핵심 메트릭:

1. Faithfulness (충실성)

생성된 답변이 검색된 컨텍스트에 기반하는가
컨텍스트에 없는 내용을 지어내지 않는가
점수 범위: 0~1

컨텍스트: "Python은 1991년 귀도 반 로섬이 만들었습니다."
질문: "Python은 언제, 누가 만들었나요?"

높은 Faithfulness 답변:
"Python은 1991년에 귀도 반 로섬이 만들었습니다."

낮은 Faithfulness 답변 (환각):
"Python은 1989년에 귀도 반 로섬이 만들었으며,
 당시 네덜란드 암스테르담에서..."
→ 컨텍스트에 없는 날짜와 위치 추가

2. Answer Relevance (답변 관련성)

답변이 질문과 실제로 관련 있는가
질문에서 벗어난 정보를 포함하지 않는가

3. Context Precision (컨텍스트 정밀도)

검색된 컨텍스트가 실제로 유용한가
불필요한 컨텍스트 포함 비율

4. Context Recall (컨텍스트 재현율)

답변하는 데 필요한 모든 정보가 검색되었는가
Ground Truth 답변의 정보 포함 여부

5. Context Entity Recall (엔티티 재현율)

중요 엔티티(인물, 장소, 날짜 등)가 컨텍스트에 포함되는가

RULER (Retrieval Under Long-context Evaluation Regime)

Long-context LLM의 능력을 측정하는 벤치마크로, 단순한 Needle-in-a-Haystack을 넘어 복잡한 장문 컨텍스트 이해를 평가합니다.

태스크 유형:

NIAH (Needle-in-a-Haystack): 긴 문서에서 특정 정보 찾기
Multi-key NIAH: 여러 정보 동시에 찾기
Multi-value NIAH: 하나의 키에 여러 값 추출
Multi-hop Tracing: 정보를 따라 여러 단계 추론
Aggregation: 전체 문서에서 정보 집계
QA: 장문 컨텍스트 기반 질의응답

Multi-hop Tracing 예시 (128K 토큰 문서에서):
"Alice의 상사는 Bob이다. Bob의 생일은 3월 15일이다.
... (수만 토큰의 관련없는 내용) ...
Alice의 상사의 생일은?"

→ Alice → Bob → 3월 15일 연결 능력 측정

DocVQA

실제 문서 이미지에 대한 시각적 질문 답변 능력을 측정합니다.

측정 방법:

실제 스캔된 문서 이미지 (청구서, 양식, 보고서, 계약서 등)
자연어 질문 + 문서 이미지 → 답변 생성
OCR 능력 + 문서 구조 이해 + 내용 이해 통합

예시:
[청구서 이미지]
질문: "총 세금 금액은 얼마인가요?"
→ 이미지에서 세금 행 찾아 금액 추출

[의료 양식]
질문: "환자의 생년월일은?"
→ 특정 필드 위치 파악 후 값 추출

ANLS (Average Normalized Levenshtein Similarity) 메트릭:

완전 일치가 아닌 편집 거리 기반 유사도 측정
숫자/날짜 형식 변형 허용

FinanceBench

금융 문서 (10-K, 연간 보고서, 10-Q 분기 보고서) 기반 Q&A 벤치마크입니다.

측정 방법:

실제 기업 공시 문서 (SEC EDGAR)
수치 추출, 계산, 다단계 추론이 필요한 질문들

예시:
[Apple Inc. 2023 연간 보고서]
질문: "2023년 서비스 부문의 매출 성장률은 전년 대비 몇 %인가요?"

필요한 능력:
1. 2023년 서비스 매출 찾기
2. 2022년 서비스 매출 찾기
3. 성장률 계산: (2023-2022)/2022 * 100

7. 멀티모달 벤치마크

MMBench / MMMU

MMBench:

멀티모달 이해 능력 종합 평가
이미지 + 텍스트 이해
20개 이상의 세부 능력 평가

MMMU (Massive Multi-discipline Multimodal Understanding):

대학 수준의 멀티모달 이해
11,500개 문제, 30개 학과, 183개 세부 주제
의학, 법학, 공학 분야 다이어그램, 차트, 수식 이해

MMMU 예시:
[화학 결합 다이어그램 이미지]
질문: "이 분자 구조에서 결합 각도는?"
→ 시각적 화학 구조 이해 필요

DocBench / OCRBench

OCRBench:

OCR 정확도 측정
인쇄체, 손글씨, 다국어 텍스트
자연 장면의 텍스트, 문서 내 텍스트
1,000개 평가 샘플

DocBench:

문서 파싱 품질 측정
표, 수식, 차트, 레이아웃 인식
PDF, 이미지 문서 처리 능력

8. 벤치마크 선택 가이드

실제 사용 사례별 참고 벤치마크:

사용 사례	주요 벤치마크	보조 벤치마크
챗봇/QA 시스템	MMLU, TruthfulQA	HellaSwag, WinoGrande
코드 생성 도구	HumanEval, SWE-bench	MBPP, LiveCodeBench
에이전트/자동화	BFCL, AgentBench	τ-bench, ToolBench
RAG 시스템	MTEB Retrieval, BEIR	RAGAS, RULER
문서 처리	DocVQA, OCRBench	FinanceBench
수학/과학	MATH, GSM8K	GPQA, AIME
임베딩 모델 선택	MTEB 전체	BEIR 특정 도메인
멀티모달	MMMU, MMBench	DocVQA

9. 벤치마크의 한계와 주의사항

데이터 오염(Data Contamination)

문제:

모델 훈련 데이터에 테스트 문제가 포함되어 있을 수 있음
인터넷에 공개된 벤치마크 문제들은 훈련 데이터에 포함될 가능성 높음
진짜 추론인지 암기인지 구분 어려움

대응:

LiveBench, LiveCodeBench 같은 동적 벤치마크 등장
비공개 테스트 세트 사용
새로운 문제 지속 추가

프롬프트 엔지니어링에 따른 점수 변동

같은 모델, 다른 프롬프트:
GSM8K 기본 프롬프팅: 70%
GSM8K CoT 프롬프팅: 92%

→ 프롬프트 방식 명시 없는 점수는 의미 없음

실제 사용성 vs 벤치마크 점수 괴리

MMLU 90%인 모델이 실제 글쓰기는 더 나쁠 수도 있음
특정 벤치마크에 과적합(오버피팅)된 모델들 존재
"벤치마크 해킹" 현상: 실제 능력 향상 없이 특정 벤치마크 점수만 올리기

언어 편향

대부분 벤치마크가 영어 중심
한국어, 일본어, 아랍어 등의 언어 능력 측정 부족
다국어 벤치마크: MLQA, XNLI, mMTEB 등 별도 필요

벤치마크 포화(Saturation)

HellaSwag: 인간과 GPT-4가 거의 같은 수준
ARC Easy: 대부분 최신 모델이 98%+
새로운 더 어려운 벤치마크 지속 필요

퀴즈: 벤치마크 이해도 테스트

퀴즈 1: MMLU의 5-shot learning이 의미하는 것은?

정답: 테스트 문제를 풀기 전에 5개의 예시 문제와 정답을 프롬프트에 함께 제공하는 방식입니다.

설명: 5-shot learning에서는 모델이 문제를 풀기 전에 해당 분야의 5개 예시 문제와 정답이 프롬프트에 포함됩니다. 이는 모델이 문제 형식을 이해하고 특정 스타일의 답변을 생성하도록 안내합니다. 0-shot은 예시 없이 직접 질문, 1-shot은 1개 예시, few-shot은 몇 개의 예시를 의미합니다.

퀴즈 2: TruthfulQA에서 GPT-4가 인간보다 낮은 점수를 받는 이유는?

정답: TruthfulQA는 의도적으로 인간이 잘못 믿는 미신과 오개념을 테스트하도록 설계되었습니다. AI 모델은 훈련 데이터에 있는 잘못된 정보도 학습하여 그럴듯한 거짓 정보를 생성하는 경향이 있습니다.

설명: TruthfulQA의 핵심은 모델이 "그럴듯하지만 틀린" 답변을 생성하는 능력(환각)을 측정하는 것입니다. 인간은 "잘 모르겠습니다"라고 답할 수 있지만, LLM은 종종 자신감 있게 틀린 정보를 생성합니다. 벤치마크가 의도적으로 어렵게 설계되어 있어 점수 자체보다 모델별 점수 차이를 비교하는 것이 더 의미 있습니다.

퀴즈 3: HumanEval의 pass@k 메트릭에서 pass@10이 pass@1보다 항상 높은 이유는?

정답: pass@10은 10번의 시도 중 적어도 1번 성공하면 되므로, 1번만 시도하는 pass@1보다 성공 확률이 항상 높거나 같습니다.

설명: pass@k는 확률적으로 k번 시도할 때 적어도 1번 성공할 확률입니다. 수식으로는 1 - (실패할 확률)^k 형태입니다. k가 클수록 성공 확률이 높아지므로 pass@100 >= pass@10 >= pass@1이 항상 성립합니다. 이 메트릭은 모델의 코드 생성 다양성과 창의성을 평가하는 데도 활용됩니다.

퀴즈 4: BFCL에서 AST 검증 방식을 사용하는 이유는?

정답: 텍스트 매칭이 아닌 코드의 구조적 의미를 검증하기 위해서입니다. AST는 코드를 구문 트리로 파싱하여 함수명, 파라미터 이름, 타입, 값을 정확하게 확인할 수 있습니다.

설명: 단순 텍스트 비교로는 "get_weather(city='Seoul')"과 "get_weather(city = 'Seoul')"을 다르게 처리할 수 있습니다. AST 파싱을 통해 공백, 따옴표 스타일 등의 표면적 차이를 무시하고 실제 의미론적 동일성을 확인합니다. 또한 파라미터 순서가 달라도 같은 호출로 인식하는 등 더 정확한 평가가 가능합니다.

퀴즈 5: MTEB에서 Retrieval 작업에 nDCG@10을 사용하는 이유는?

정답: nDCG@10은 상위 10개 검색 결과의 품질을 측정하면서, 더 높은 순위의 결과에 더 큰 가중치를 부여합니다. 사용자는 주로 상위 결과만 보기 때문에 실제 사용 패턴을 반영합니다.

설명: nDCG(Normalized Discounted Cumulative Gain)는 관련성 점수(0~3)를 log 함수로 할인하여 순위가 높을수록 더 중요하게 취급합니다. @10은 상위 10개 결과만 평가합니다. 예를 들어 1위 결과에 관련 문서가 있으면 10위에 있을 때보다 훨씬 높은 점수를 받습니다.

퀴즈 6: RAGAS의 Faithfulness와 Answer Relevance의 차이는?

정답: Faithfulness는 답변이 검색된 컨텍스트에 기반하는지(지어내지 않는지)를 측정하고, Answer Relevance는 답변이 질문의 핵심을 실제로 다루는지를 측정합니다.

설명: 두 메트릭은 서로 다른 실패 모드를 잡아냅니다. Faithfulness가 낮으면 모델이 컨텍스트에 없는 내용을 만들어내는 것(환각)이고, Answer Relevance가 낮으면 컨텍스트에 충실하지만 질문과 관계없는 내용을 답변하는 것입니다. 좋은 RAG 시스템은 두 메트릭 모두 높아야 합니다.

퀴즈 7: SWE-bench가 HumanEval보다 어렵고 더 현실적인 이유는?

정답: SWE-bench는 실제 GitHub 이슈와 코드베이스를 사용합니다. 함수 하나를 작성하는 것과 달리, 수천 줄의 기존 코드를 이해하고 버그의 원인을 파악하여 최소한의 변경으로 수정해야 하며, 기존 테스트 스위트를 모두 통과해야 합니다.

설명: HumanEval은 깨끗한 함수 구현 문제이지만, SWE-bench는 실제 소프트웨어 개발 과정을 시뮬레이션합니다. 모델은 (1) 이슈 설명 이해, (2) 관련 코드 탐색, (3) 버그 원인 파악, (4) 수정 방법 결정, (5) 패치 생성, (6) 기존 테스트 통과 확인을 모두 해야 합니다. 이는 실제 개발자의 일상적인 업무와 매우 유사합니다.

퀴즈 8: 데이터 오염(Data Contamination) 문제를 해결하기 위한 방법들은?

정답: 동적 벤치마크 (LiveBench, LiveCodeBench), 비공개 테스트 세트, 지속적인 새 문제 추가, 생성형 평가 등이 주요 해결책입니다.

설명: 데이터 오염은 훈련 데이터에 테스트 문제가 포함되어 실제 능력보다 높은 점수가 나오는 문제입니다. LiveBench는 최신 arxiv 논문이나 경쟁 프로그래밍 사이트의 새 문제들을 지속적으로 추가하여 모델이 미리 볼 수 없게 합니다. 또한 모델 제출 시 훈련 데이터에 테스트 세트 포함 여부를 선언하도록 요구하는 방식도 사용됩니다.

퀴즈 9: BEIR에서 제로샷 평가가 중요한 이유는?

정답: 임베딩 모델의 실제 일반화 능력을 측정하기 위해서입니다. 특정 도메인에 파인튜닝하지 않아도 다양한 분야에서 잘 작동하는 모델이 실용적으로 더 가치 있습니다.

설명: 실제 RAG 시스템을 구축할 때, 의료, 법률, 금융 등 다양한 도메인의 문서를 처리해야 합니다. 각 도메인에 별도의 모델을 훈련하는 것은 비용이 크므로, 제로샷으로도 다양한 도메인에서 잘 작동하는 임베딩 모델이 훨씬 실용적입니다. BEIR은 18개 도메인에서의 제로샷 성능을 측정하여 이런 일반화 능력을 평가합니다.

결론: 벤치마크를 현명하게 활용하기

벤치마크 점수는 모델 능력의 단면만을 보여줍니다. 실제 사용 사례에 맞는 벤치마크를 선택하고, 단일 벤치마크가 아닌 여러 벤치마크를 종합적으로 고려하는 것이 중요합니다.

핵심 원칙:

목적에 맞는 벤치마크 선택: 코드 생성이 목적이면 MMLU보다 HumanEval이 더 관련성 높음
여러 벤치마크 종합 고려: 단일 벤치마크 1위가 모든 면에서 최고를 의미하지 않음
프롬프팅 방식 확인: CoT vs 일반 프롬프팅 결과인지 확인
데이터 오염 가능성 인식: 최신 동적 벤치마크와 함께 확인
직접 테스트: 최종적으로는 실제 사용 사례로 직접 평가

벤치마크는 지도이지 영토 자체가 아닙니다. 좋은 지도를 여러 장 활용하여 최적의 모델을 선택하세요.

LLM, Tool Calling & Embedding Benchmarks Deep Dive: What Each Benchmark Actually Measures

LLM, Tool Calling & Embedding Benchmarks Deep Dive

When evaluating AI models, benchmark names appear everywhere. MMLU 85%, HumanEval 90%, MTEB #1 — let's fully understand what these numbers actually mean, how each benchmark works, and which ones matter for which use cases.

1. LLM General Benchmarks

MMLU (Massive Multitask Language Understanding)

Published by UC Berkeley in 2020, MMLU measures the breadth of an LLM's knowledge and comprehension across diverse academic fields.

How it works:

57 academic subjects (math, science, law, history, medicine, psychology, and more)
14,000+ multiple-choice questions with 4 answer choices
5-shot prompting (often called 5-shot learning): 5 solved example questions from the same subject are placed in the prompt before each test question

Example question:
Subject: High School Chemistry

Example 1: What is the element with atomic number 6?
(A) Nitrogen  (B) Oxygen  (C) Carbon  (D) Neon
Answer: (C)

...5 examples provided...

Test: What is required for an ionic bond to form?
(A) Between two non-metal atoms
(B) Between a metal and non-metal atom
(C) Between two metal atoms
(D) Between a noble metal and non-metal
Answer: ?
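To make the 5-shot format concrete, here is a minimal sketch of how such a prompt could be assembled. The names `format_question` and `build_few_shot_prompt`, the header sentence, and the exact layout are assumptions for illustration; real evaluation harnesses differ in wording and whitespace, and those differences can shift scores.

```python
from typing import List, Tuple

# (question, choices, answer_letter); this layout is an illustrative assumption.
Example = Tuple[str, List[str], str]

LETTERS = "ABCD"


def format_question(question: str, choices: List[str]) -> str:
    """Render one question with lettered choices and an open 'Answer:' slot."""
    lines = [question]
    lines += [f"({LETTERS[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, shots: List[Example],
                          test_question: str, test_choices: List[str]) -> str:
    """Prepend solved examples (the 'shots') to the unsolved test question."""
    parts = [f"The following are multiple choice questions about {subject}."]
    for question, choices, answer in shots:
        parts.append(f"{format_question(question, choices)} ({answer})")
    parts.append(format_question(test_question, test_choices))
    return "\n\n".join(parts)


# Only one shot is shown here; a 5-shot prompt would pass five.
shots = [("What is the element with atomic number 6?",
          ["Nitrogen", "Oxygen", "Carbon", "Neon"], "C")]
prompt = build_few_shot_prompt(
    "high school chemistry", shots,
    "What is required for an ionic bond to form?",
    ["Between two non-metal atoms", "Between a metal and non-metal atom",
     "Between two metal atoms", "Between a noble metal and non-metal"])
print(prompt)
```

The model's answer is then read from whatever it generates after the final `Answer:`, typically by comparing the first letter it produces with the gold letter.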

Score interpretation:

Random guess: 25% (4 choices)
GPT-4: ~86%, Claude 3 Opus: ~86%, Gemini Ultra: ~90%
Human expert average: ~89%

Limitations:

Hard to distinguish memorization from understanding: the model may have seen questions in training data
English-centric: does not reflect multilingual ability
Static dataset: no recent knowledge
Data contamination risk: test questions may appear in training data

HellaSwag

Published in 2019, HellaSwag measures "commonsense reasoning" and "sentence completion." The name stands for Harder Endings, Longer contexts, and Low-shot Activities For Situations With Adversarial Generations.

How it works:

Derived from ActivityNet (everyday activity video descriptions) and WikiHow (step-by-step guides)
Choose the most natural continuation for a given situation
Wrong choices (distractors) are generated by language models — plausible on the surface but actually incorrect

Example:
Situation: "A man is grilling hot dogs. He flips the hot dogs with tongs."

What comes next?
(A) He throws the hot dogs into the fire.
(B) He places the finished hot dogs in buns.
(C) He takes ingredients out of the refrigerator.
(D) He opens a recipe book.

Answer: (B)

Score interpretation:

Humans: 95.6%
GPT-4: 95.3%+
Most modern LLMs approach human-level performance

Limitations:

Already too easy for large models (saturated)
Biased toward English everyday culture
May reflect language pattern recognition more than deep reasoning

ARC (AI2 Reasoning Challenge)

Created by the Allen Institute for AI, this benchmark measures reasoning ability using elementary-to-middle school science questions.

Two sets:

Easy Set:

4-choice elementary school science questions
Solvable with simple fact lookup
Keyword/frequency-based methods can exceed 25%

Challenge Set:

Designed so simple keyword search or information retrieval cannot solve it
Requires reasoning, causation, and multi-step thinking
2,590 questions (the full ARC dataset has 7,787 across both sets)

Challenge Set example:
"What is the primary role of stomata in plant leaves?"

(A) Absorbing light for photosynthesis
(B) Regulating gas exchange and water transpiration
(C) Transporting water from roots to leaves
(D) Moving sugars to other plant parts

→ Requires understanding of function, not just surface retrieval

Score interpretation:

Challenge Set: GPT-4 ~96%, human experts ~100%
Easy Set is nearly saturated for modern models

TruthfulQA

Published in 2021, TruthfulQA measures a model's tendency to produce hallucinations and false beliefs.

How it works:

817 questions designed to elicit misconceptions humans commonly hold
Evaluates whether the model accurately says "I don't know" or provides correct information

Example questions:
- "Is it true that we only use 10% of our brains?"
  → Misconception. Correct: "No, this is a myth."

- "What is a rectangle with right-angle diagonals called?"
  → Trick question. A rectangle's diagonals are generally not perpendicular; they are only when the rectangle is a square.

- "Was Napoleon very short?"
  → Actually average height for his era.

Score interpretation:

Humans: ~94%
GPT-4: ~60% (intentionally difficult)
A low score means the model confidently generates plausible misinformation

Key point: TruthfulQA is designed to be difficult to score high on. A low-scoring model is particularly good at producing believable false information.

WinoGrande

Published in 2019, WinoGrande uses 44,000 commonsense reasoning problems to measure pronoun disambiguation ability.

How it works:

Large-scale version of the Winograd Schema Challenge
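Fill in a single blank by choosing between two candidate answers, which requires common sense
Built with an adversarial filtering algorithm (AfLite) to reduce dataset-specific biases that let models guess without understanding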

Example:
"The trophy didn't fit in the brown suitcase because ___ was too big."
(A) it [trophy]
(B) it [suitcase]
→ Requires understanding that the trophy was too big

"At the library, Sarah read more books than Amy. ___ enjoyed reading."
(A) Sarah
(B) Amy
→ Commonsense judgment required

Score interpretation:

Random: 50%
GPT-4: ~87%, Humans: ~94%

BIG-Bench (Beyond the Imitation Game Benchmark)

A large-scale benchmark containing 204 diverse tasks that evaluates capabilities difficult to assess with existing benchmarks.

BIG-Bench Hard (BBH):

23 particularly challenging reasoning tasks
Especially useful for measuring the effect of Chain-of-Thought (CoT) prompting
Includes following navigation instructions, date understanding, symbolic reasoning, and more

BBH example tasks:
- Boolean Expressions: Evaluate "(True and False) or (not True and True)"
- Causal Judgment: Determine direction of causation
- Formal Fallacies: Identify logical errors
- Movie Recommendation: Preference-based recommendations
- Object Counting: Count objects from textual descriptions
- Temporal Sequences: Sort events chronologically
- Word Sorting: Sort by alphabet or given condition

Chain-of-Thought effect:

Standard prompting: GPT-4 ~65%
CoT prompting: GPT-4 ~85%+
Used to identify where CoT is most effective

GPQA (Graduate-Level Google-Proof Q&A)

Published in 2023, GPQA requires PhD-level scientific expertise and is designed so that even Google searches cannot easily find the answer.

How it works:

Written directly by PhD researchers in biology, chemistry, and physics
4-choice questions solvable only by domain experts
Engineered so web searches are not sufficient

Score interpretation:

Non-expert PhD: ~34%
Domain expert PhD: ~65%
GPT-4: ~39%, Claude 3 Opus: ~50%+

Example (Physics):
"What is the primary advantage of topological qubits in quantum computers?"

(A) Can only operate at absolute zero temperature
(B) Topologically protected, resistant to environmental noise
(C) Faster gate speeds than traditional transistors
(D) Support unlimited qubit count

→ Requires deep understanding of quantum error correction

LiveBench

A dynamic benchmark that adds new questions monthly to prevent data contamination.

How it works:

Covers math, coding, reasoning, language, data analysis, and instruction following
Generated from recent arxiv papers, news, and competitive programming problems
Only includes questions with objective, verifiable answers

Why it matters:

Addresses data contamination in static benchmarks
Distinguishes genuine reasoning from memorization
Continuously updated for fair comparison of latest models

2. Coding Benchmarks

HumanEval

Published by OpenAI in 2021, HumanEval is the most widely used coding benchmark for measuring Python programming ability.

How it works:

164 Python function implementation problems
Provides function signature + docstring + sample inputs/outputs
Checks whether generated code passes hidden test cases

# Example problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check whether any two numbers in the list are closer
    to each other than the given threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Model must implement this
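The "passes the test cases" check can be sketched as follows. This is a simplified illustration, not the official harness: `passes_all_tests` is a hypothetical helper, the candidate body is an invented model completion, and the two assertions are taken from the docstring examples rather than from the benchmark's own unit tests.

```python
from typing import List


def passes_all_tests(candidate_source: str, test_asserts: List[str]) -> bool:
    """Run candidate code, then each assert; True only if nothing raises.

    WARNING: exec() runs arbitrary code. Real harnesses isolate this step
    in a sandbox with timeouts; this sketch does not.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # defines the candidate function
        for test in test_asserts:
            exec(test, namespace)           # raises AssertionError on failure
    except Exception:
        return False
    return True


# A hypothetical model completion (not taken from the dataset).
candidate = '''
def has_close_elements(numbers, threshold):
    ordered = sorted(numbers)
    # After sorting, the closest pair is always adjacent.
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))
'''

tests = [
    "assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False",
    "assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True",
]
print(passes_all_tests(candidate, tests))
```

A problem counts as solved only when every test passes, so a completion that is almost right scores the same as one that does not run at all.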

pass@k metric:

pass@1: Probability of passing on the first attempt
pass@10: Probability that at least 1 of 10 attempts passes
pass@100: Probability that at least 1 of 100 attempts passes
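In practice pass@k is usually estimated from a larger batch of samples rather than by literally running k attempts. A minimal sketch of the unbiased estimator described in the paper that introduced HumanEval, where n samples are drawn per problem and c of them pass, is shown below. The function name and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 20 samples for one problem, 3 of them pass.
print(pass_at_k(20, 3, 1))    # about 0.15, i.e. 3 of 20
print(pass_at_k(20, 3, 10))   # noticeably higher than pass@1
```

The benchmark score is the average of this value over all problems. Because a larger subset can only make it more likely to include a passing sample, the estimate never decreases as k grows.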

Score interpretation:

GPT-4: pass@1 ~67% in the original technical report; later GPT-4 versions are reported around 87%
Claude 3.5 Sonnet: pass@1 ~92%+
Original GPT-3: pass@1 ~0%

Limitations:

Only 164 problems — limited diversity
Relatively low algorithmic complexity
Does not measure real-world software skills (debugging, refactoring)

MBPP (Mostly Basic Python Problems)

A collection of 974 crowd-sourced Python problems published by Google Research.

Differences from HumanEval:

More diverse patterns and styles
Includes simpler problems (beginner to intermediate)
Crowd-sourced for varied difficulty levels

# MBPP example
"""
Write a function to find the maximum product subarray.
assert max_product_subarray([6, -3, -10, 0, 2]) == 180
assert max_product_subarray([-1, -3, -10, 0, 60]) == 60
"""

SWE-bench

Published in 2023, SWE-bench measures the ability to resolve real GitHub issues and bugs.

How it works:

12 real Python open-source projects (Django, Flask, scikit-learn, etc.)
2,294 real GitHub issues with verified patches
Model reads issue descriptions and generates actual code fixes
Validated by the existing test suite

Example issue:
Repository: scikit-learn
Issue: "KNeighborsClassifier.predict() returns incorrect
        results when given sparse matrix input"

What the model must do:
1. Understand the issue description
2. Locate relevant source code
3. Generate a bug fix patch
4. Ensure existing tests pass
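The final verification step can be sketched as a check over two groups of tests: those that the reference fix turns from failing to passing, and those that passed before and must keep passing. SWE-bench task instances carry both groups; the helper `is_resolved`, the test names, and the outcomes below are hypothetical and only illustrate the decision rule.

```python
from typing import Dict, Iterable


def is_resolved(test_results: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Return True if the patched repository satisfies both test groups.

    test_results maps a test name to whether it passed after the patch.
    A test missing from the results counts as a failure.
    """
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    not_broken = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and not_broken


# Hypothetical test names and outcomes, for illustration only.
results = {
    "test_predict_sparse_input": True,   # the bug from the issue is fixed
    "test_predict_dense_input": True,
    "test_fit_basic": False,             # the patch broke unrelated behavior
}
print(is_resolved(results,
                  fail_to_pass=["test_predict_sparse_input"],
                  pass_to_pass=["test_predict_dense_input", "test_fit_basic"]))
```

Under this rule a patch that fixes the reported bug but breaks an unrelated test is not counted as resolved, which is one reason scores are much lower than on single-function benchmarks.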

SWE-bench Lite:

300 selected, more clearly defined problems
Subset for faster evaluation

Score interpretation:

Late 2023 (at release): even GPT-4 resolved only 1~2% of issues
2024: Latest agent systems reaching 20~50%
Reflects the true complexity of real software engineering

Why it matters:

Far more realistic evaluation than HumanEval
Integrates code comprehension + modification + verification
Assesses actual potential for replacing developer tasks

LiveCodeBench

A dynamic coding benchmark that continuously adds new problems from LeetCode, AtCoder, and CodeForces to prevent data contamination.

Features:

Uses problems added after competitions end
Measures performance on problems the model has never seen
Includes code generation, self-repair, and code execution prediction

3. Reasoning & Math Benchmarks

GSM8K (Grade School Math)

A benchmark of 8,500 elementary school math problems published by OpenAI in 2021.

Features:

Requires 2–8 step multi-step reasoning
Basic arithmetic, fractions, decimals, percentages
Core benchmark for validating Chain-of-Thought reasoning effectiveness

Example problem:
"Janet's ducks lay 16 eggs per day. Every morning
she eats 3 for breakfast and uses 4 for muffins for
her friends. She sells the remainder at $2 per egg.
How much does she earn per day?"

Chain-of-Thought reasoning:
1. Daily eggs: 16
2. Eaten: 3
3. Used for muffins: 4
4. Eggs to sell: 16 - 3 - 4 = 9
5. Earnings: 9 * 2 = $18

Answer: $18
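Scoring a free-form chain of thought requires pulling out the final number. GSM8K reference solutions end with a line of the form `#### 18`, so a common approach is to compare that value with the last number in the model's output. The sketch below assumes this "last number wins" heuristic; the helper names and the regular expression are illustrative choices, not the official scorer.

```python
import re
from typing import Optional

# Optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring thousands separators."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output: str, reference_solution: str) -> bool:
    """Compare the model's final number with the number after '####'."""
    gold = extract_final_number(reference_solution.split("####")[-1])
    predicted = extract_final_number(model_output)
    if gold is None or predicted is None:
        return False
    return abs(predicted - gold) < 1e-6


reference = "16 - 3 - 4 = 9 eggs. 9 * 2 = 18 dollars.\n#### 18"
output = "She sells 16 - 3 - 4 = 9 eggs, so she earns 9 * 2 = $18."
print(is_correct(output, reference))
```

The heuristic is fragile: an output that states the right answer and then adds a trailing remark containing another number would be marked wrong, which is why reported scores depend on the extraction rule as well as the prompt.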

Score interpretation:

Humans: ~100%
GPT-4 (CoT): 92%+
GPT-3 (standard): ~20%
GPT-3 (CoT): ~56%
One of the most dramatic demonstrations of CoT effectiveness

MATH

A collection of 12,500 competition-level math problems published in 2021.

7 subject areas:

Algebra
Precalculus
Geometry
Number Theory
Counting and Probability
Intermediate Algebra
Prealgebra

5 difficulty levels:

Level 1 (easiest): AMC 8 level
Level 5 (hardest): AIME, HMMT level

Level 5 example:
"Factor x^4 + 4x^3 - 2x^2 - 12x + 9"

Answer: (x^2 + 2x - 3)^2 = (x+3)^2(x-1)^2
→ Requires advanced algebraic manipulation
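The factorization above can be checked mechanically by multiplying the factors back out. The sketch below uses plain coefficient lists (lowest degree first) so it needs no external library; the helper `poly_mul` is an illustrative name.

```python
from typing import List

Poly = List[int]  # coefficients, lowest degree first


def poly_mul(a: Poly, b: Poly) -> Poly:
    """Multiply two polynomials given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


target = [9, -12, -2, 4, 1]        # 9 - 12x - 2x^2 + 4x^3 + x^4
quadratic = [-3, 2, 1]             # x^2 + 2x - 3
x_plus_3, x_minus_1 = [3, 1], [-1, 1]

# (x + 3)(x - 1) = x^2 + 2x - 3, and its square is the original quartic.
assert poly_mul(x_plus_3, x_minus_1) == quadratic
assert poly_mul(quadratic, quadratic) == target
print("factorization verified")
```

Grading MATH answers is harder than this check suggests, because a model may write an equivalent answer in a different form, so scorers typically normalize expressions before comparing them.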

Score interpretation:

GPT-4: ~52% overall, Level 5: ~20%+
Latest reasoning-focused models (such as o1): 80%+
Math-specialized models are improving rapidly

AIME (American Invitational Mathematics Examination)

Real problems from the American math olympiad qualifying exam.

Features:

Integer answers from 0 to 999 (no multiple choice)
Designed for AMC 10/12 qualifiers
Demands extreme mathematical creativity

Score interpretation:

Top 5% of humans: 7~9 of 15 problems
GPT-4o: about 2 of 15 on the 2024 exam (roughly 12%, as reported by OpenAI)
The o1 series made breakthrough advances here (about 11 of 15 with a single attempt per problem)

4. Tool Calling / Function Calling Benchmarks

BFCL (Berkeley Function Calling Leaderboard)

The most comprehensive function calling benchmark, published by UC Berkeley in 2024.

2,000+ function calling scenarios:

Categories:

Simple Function Calling: single function, clear parameters
Multiple Functions: select the right function from several options
Parallel Functions: invoke the same function multiple times in one turn
Parallel Multiple Functions: choose among several functions and issue multiple calls in one turn
REST API: call real HTTP API endpoints

Evaluation criteria:

Correct function name selection
Parameter name accuracy
Parameter type accuracy (string vs int vs float)
Semantic correctness of parameter values
Absence of unnecessary parameters

AST validation approach:

# Ground truth function call
get_weather(
    location="Seoul, Korea",
    unit="celsius",
    forecast_days=3
)

# Model-generated call
get_weather(
    location="Seoul",  # Partial match — allowed?
    unit="C",          # Type/format error
    days=3             # Parameter name error!
)

AST (Abstract Syntax Tree) parsing verifies structural correctness
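A minimal sketch of this kind of structural comparison, using Python's standard `ast` module, is shown below. It is not the leaderboard's actual checker: `parse_call` and `compare_calls` are hypothetical helpers, only keyword arguments with literal values are handled, and values must match exactly, whereas a real checker may accept several equivalent values for a parameter.

```python
import ast
from typing import Any, Dict, List, Tuple


def parse_call(source: str) -> Tuple[str, Dict[str, Any]]:
    """Parse 'f(a=1, b="x")' into (function_name, {param: literal_value}).

    Positional arguments are ignored in this sketch.
    """
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a simple function call")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


def compare_calls(expected: str, generated: str) -> List[str]:
    """Return a sorted list of mismatches; an empty list means agreement."""
    exp_name, exp_args = parse_call(expected)
    gen_name, gen_args = parse_call(generated)
    problems = []
    if exp_name != gen_name:
        problems.append(f"function name: {gen_name!r} != {exp_name!r}")
    for name in exp_args.keys() - gen_args.keys():
        problems.append(f"missing parameter: {name}")
    for name in gen_args.keys() - exp_args.keys():
        problems.append(f"unexpected parameter: {name}")
    for name in exp_args.keys() & gen_args.keys():
        if type(exp_args[name]) is not type(gen_args[name]):
            problems.append(f"type mismatch for {name}")
        elif exp_args[name] != gen_args[name]:
            problems.append(f"value mismatch for {name}")
    return sorted(problems)


expected = 'get_weather(location="Seoul, Korea", unit="celsius", forecast_days=3)'
generated = 'get_weather(location="Seoul", unit="C", days=3)'
for problem in compare_calls(expected, generated):
    print(problem)
```

Because the comparison works on the parsed tree, differences in whitespace, quote style, or argument order do not count as errors, while a renamed parameter such as `days` is reported both as a missing and as an unexpected parameter.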

Supported languages/environments:

Python, Java, JavaScript, SQL, REST API

Score interpretation (2024):

GPT-4o: overall ~72%
Claude 3.5 Sonnet: overall ~73%
Open-source models: 40~60% range

tau-bench (τ-bench)

A benchmark measuring real agent task completion, going beyond simple function call accuracy to measure end-to-end task success rates.

How it works:

Real business scenarios (travel booking, shopping, etc.)
Multi-step agent workflows
Appropriate tool use at each step
Final task completion evaluation

Example scenario:
"Find a one-way flight from New York to Paris on March 20,
book the cheapest option, and send a confirmation email."

Required steps:
1. search_flights(origin="NYC", destination="Paris", date="2026-03-20")
2. select_flight(flight_id="AF001", criteria="cheapest")
3. book_flight(flight_id="AF001", passenger_info=...)
4. send_confirmation_email(booking_id=..., email=...)

→ Measures accuracy of each step AND overall completion

ToolBench / ToolEval

Published in 2023, this benchmark evaluates tool-use ability with 16,000 real REST APIs.

How it works:

49 categories, 16,000 APIs collected from RapidAPI
Select appropriate API from real API documentation
Call the API with correct parameters
Multi-step API chaining

Solvable Pass Rate (SoPR) metric:

Success rate on problems that are actually solvable
Compares ChatGPT's built-in Function Calling vs ToolLLM

Evaluation criteria:

Tool selection accuracy (choosing the right API)
Execution order accuracy
Parameter accuracy
Error handling ability

AgentBench

Published in 2023, this benchmark measures autonomous LLM agent ability across 8 different environments.

8 environments:

OS — Operating system tasks (file manipulation, command execution)
DB — Database queries and manipulation
Knowledge Graph — Knowledge graph traversal
Digital Card Game — Strategic card game
Lateral Thinking Puzzles — Creative problem solving
House Holding — Home management in a virtual environment
Web Shopping — Online shopping tasks
Web Browsing — Web navigation and information gathering

OS environment example:
"Find all .py files in the current directory created in 2023
and move them to a 'python_files' folder."

→ Requires combining find, mkdir, mv commands
→ Measures multi-step decision-making and error recovery

Score interpretation:

GPT-4: overall ~3.6/10
GPT-3.5: overall ~1.9/10
Many open-source models: below 1

5. Embedding Benchmarks

MTEB (Massive Text Embedding Benchmark)

Published in 2022, MTEB is the most comprehensive benchmark for evaluating text embedding models.

56 datasets, 8 task types:

1. Retrieval

Find the most relevant document for a query
Uses nDCG@10 metric
Includes BEIR benchmark datasets

Example: "How to sort a list in Python"
→ Rank relevant Stack Overflow answers and documentation

2. Classification

Text classification (sentiment analysis, topic classification, etc.)
Evaluated with embedding + logistic regression
Accuracy or F1 score

3. Clustering

Automatically group similar texts
ArXiv papers, Reddit posts, etc.
V-measure metric

4. Semantic Textual Similarity (STS)

Semantic similarity score between two sentences (0~5)
Evaluated with Spearman correlation

Example:
Sentence 1: "A dog is running in the park"
Sentence 2: "A canine is sprinting outdoors"
→ High similarity (~4.0/5.0)

Sentence 1: "The weather is sunny today"
Sentence 2: "I love eating pizza"
→ Low similarity (~0.5/5.0)
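
A rough sketch of how STS scoring works: the model's cosine similarities are compared against human scores by rank, using Spearman correlation. The functions below are plain-Python illustrations, and the similarity values in the usage lines are made-up numbers, not output from any real embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

human_scores = [4.0, 0.5, 2.5]        # gold similarity labels on the 0~5 scale
model_cosines = [0.83, 0.10, 0.55]    # hypothetical cosine similarities
print(spearman(human_scores, model_cosines))
```

Only the ordering matters: a model does not need to reproduce the 0~5 scale, it only needs to rank sentence pairs the way humans do.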

5. Reranking

Reorder initial search results
MAP (Mean Average Precision) metric
Final sorting ability of search engines
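
As a rough illustration of the MAP metric, the sketch below computes average precision per query from binary relevance flags and then averages across queries. It assumes every relevant document is already in the candidate list, which is the usual reranking setup; the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags in the order the model ranked the candidates."""
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position  # precision at this cut-off
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    """queries: one ranked_relevance list per query."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Two toy queries: relevant documents at ranks 1 and 3, then at rank 2.
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```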

6. Summarization

Semantic similarity between machine-generated summaries and human-written reference summaries
Spearman correlation with human quality ratings

7. Pair Classification

Classify relationship between two sentences (similar/dissimilar, duplicate/non-duplicate)
AP (Average Precision) metric

Example:
- Duplicate detection: "How to sort a Python list" vs "Sort list in Python"
  → Duplicate (True)
- "Apples are fruit" vs "I like swimming"
  → Unrelated (False)

8. Bitext Mining

Find parallel sentence pairs across languages
F1 score

Example:
English: "The weather is nice today"
Korean: "오늘 날씨가 좋다"
→ Parallel pair detection

MTEB Leaderboard (HuggingFace):

Compare models by overall score
View per-task detailed scores
Top performers (2024): text-embedding-3-large, voyage-large-2, E5-mistral-7b

BEIR (Benchmarking Information Retrieval)

Published in 2021, BEIR measures zero-shot information retrieval performance across 18 datasets spanning diverse domains and task types.

18 datasets, including:

TREC-COVID: Medical paper search on COVID-19
NFCorpus: Medical/nutrition information retrieval
NQ (Natural Questions): Google natural language search
HotpotQA: Multi-hop reasoning retrieval
FiQA: Financial Q&A
ArguAna: Counter-argument retrieval
Touché-2020: Debate argument retrieval
CQADupStack: Community Q&A duplicate detection
Quora: Duplicate question detection
DBPedia: Entity search
SCIDOCS: Academic paper retrieval
FEVER: Fact verification
Climate-FEVER: Climate-related fact verification
SciFact: Scientific claim verification

nDCG@10 metric:

nDCG@10 = Normalized Discounted Cumulative Gain of top 10 results

Relevance scores:
- Highly relevant: 3 points
- Relevant: 2 points
- Marginally relevant: 1 point
- Not relevant: 0 points

Higher-ranked results receive more weight
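
A minimal sketch of the nDCG@10 computation described above. Two simplifications are assumed here: the gain is the raw relevance score (some definitions use 2^rel - 1 instead), and the ideal ordering is built only from the documents in the list, whereas a full evaluation would use every judged document for the query.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: each score is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """Normalize by the DCG of the best possible ordering of the same scores."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance scores of the returned documents, in ranked order.
print(ndcg_at_k([3, 2, 1, 0]))  # already in ideal order
print(ndcg_at_k([0, 3]))        # the only relevant document is ranked second
```

The second call scores lower than the first purely because the relevant document sits one position further down, which is the ranking sensitivity the metric is designed to capture.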

Zero-shot evaluation:

Measures generalization across domains without domain-specific fine-tuning
Compares traditional methods like BM25 with neural embeddings

6. RAG & Document Parsing Benchmarks

RAGAS (Retrieval Augmented Generation Assessment)

A comprehensive framework for measuring the quality of RAG systems.

5 core metrics:

1. Faithfulness

Is the generated answer grounded in the retrieved context?
Does the model avoid fabricating content not in the context?
Score range: 0~1

Context: "Python was created by Guido van Rossum in 1991."
Question: "When was Python created and by whom?"

High Faithfulness answer:
"Python was created by Guido van Rossum in 1991."

Low Faithfulness answer (hallucination):
"Python was created by Guido van Rossum in 1989,
 in Amsterdam, the Netherlands..."
→ Date and location not in context are fabricated
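
Faithfulness can be read as the share of claims in the answer that the context supports. The sketch below assumes the claims have already been extracted and judged (in RAGAS an LLM does both steps); the verdicts here are hand-labeled for the example above, and the function name is illustrative.

```python
def faithfulness(claim_verdicts):
    """claim_verdicts maps each claim in the answer to True if the context supports it."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts.values()) / len(claim_verdicts)

# Hand-labeled claims from the low-faithfulness answer above.
low_faithfulness_answer = {
    "Python was created by Guido van Rossum": True,
    "Python was created in 1989": False,
    "Python was created in Amsterdam, the Netherlands": False,
}
print(faithfulness(low_faithfulness_answer))
```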

2. Answer Relevance

Does the answer actually address the question?
Does the answer avoid including off-topic information?

3. Context Precision

Is the retrieved context genuinely useful?
Penalizes irrelevant chunks, especially when they are ranked above relevant ones

4. Context Recall

Was all information needed to answer retrieved?
Whether ground truth answer information is present in context

5. Context Entity Recall

Are important entities (people, places, dates, etc.) present in the context?

RULER (Retrieval Under Long-context Evaluation Regime)

A benchmark measuring long-context LLM ability, going beyond simple Needle-in-a-Haystack to evaluate complex long-document understanding.

Task types:

NIAH (Needle-in-a-Haystack): Find specific information in a long document
Multi-key NIAH: Find multiple pieces of information simultaneously
Multi-value NIAH: Extract multiple values for a single key
Multi-hop Tracing: Reason through multiple steps following information chains
Aggregation: Aggregate information across the full document
QA: Question answering based on long context

Multi-hop Tracing example (in a 128K token document):
"Alice's manager is Bob. Bob's birthday is March 15th.
... (tens of thousands of tokens of unrelated content) ...
What is Alice's manager's birthday?"

→ Measures the ability to connect Alice → Bob → March 15th
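
Tests like this are synthetic, so a toy version is easy to build. The sketch below scatters the chained facts, in order, among filler sentences. It is only an illustration of the setup: `build_multihop_prompt` and the filler text are hypothetical, and the benchmark's own task generators differ.

```python
import random

def build_multihop_prompt(facts, question, filler_sentence, n_filler, seed=0):
    """Scatter the chained facts (order preserved) among n_filler filler sentences."""
    rng = random.Random(seed)
    sentences = [filler_sentence] * n_filler
    # Distinct, increasing insertion points keep the facts in chain order.
    positions = sorted(rng.sample(range(n_filler + 1), len(facts)))
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        sentences.insert(pos + offset, fact)
    return " ".join(sentences) + "\n\n" + question

def is_correct(model_answer, expected):
    """Lenient check: the expected string must appear in the model's answer."""
    return expected.lower() in model_answer.lower()

prompt = build_multihop_prompt(
    facts=["Alice's manager is Bob.", "Bob's birthday is March 15th."],
    question="What is Alice's manager's birthday?",
    filler_sentence="The grass is green.",
    n_filler=1000,
)
print(len(prompt.split()))  # rough size of the haystack in words
```

Raising `n_filler` stretches the distance between the two facts, which is how context length is varied while the question stays the same.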

DocVQA

Measures visual question answering ability on real document images.

How it works:

Real scanned document images (invoices, forms, reports, contracts, etc.)
Natural language question + document image → generate answer
Integrates OCR ability + document structure understanding + content comprehension

Example:
[Invoice image]
Question: "What is the total tax amount?"
→ Locate the tax line item and extract the value

[Medical form]
Question: "What is the patient's date of birth?"
→ Identify the specific field location and extract value

ANLS (Average Normalized Levenshtein Similarity) metric:

Similarity measured by edit distance, not exact match
Tolerates minor OCR or formatting differences (for example, a single misread character), while answers that differ too much score zero
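
A minimal sketch of ANLS, assuming the commonly used threshold of 0.5 and case-insensitive comparison: each prediction is scored as 1 minus its normalized edit distance to the closest accepted answer, and scores for answers at or beyond the threshold are set to zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def anls(predictions, references, threshold=0.5):
    """references[i] is the list of accepted answers for question i."""
    total = 0.0
    for pred, accepted in zip(predictions, references):
        best = 0.0
        for ref in accepted:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)

# One misread character ("1" instead of "l") still earns most of the credit.
print(anls(["Tota1 Tax"], [["Total Tax"]]))
```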

FinanceBench

A Q&A benchmark based on financial documents (10-K annual reports, 10-Q quarterly reports).

How it works:

Real corporate disclosure documents (SEC EDGAR)
Questions requiring numerical extraction, calculation, and multi-step reasoning

Example:
[Apple Inc. 2023 Annual Report]
Question: "What was the year-over-year revenue growth rate
          of the Services segment in 2023?"

Required capabilities:
1. Find 2023 Services revenue
2. Find 2022 Services revenue
3. Calculate growth rate: (2023-2022)/2022 * 100
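
The final calculation step can be written as a one-line function. The revenue figures in the usage line are placeholder values chosen for easy arithmetic, not numbers from any filing.

```python
def yoy_growth_pct(current: float, prior: float) -> float:
    """Year-over-year growth in percent: (current - prior) / prior * 100."""
    if prior == 0:
        raise ValueError("prior-year value must be non-zero")
    return (current - prior) / prior * 100

# Placeholder segment revenues (same units for both years).
print(yoy_growth_pct(current=120.0, prior=100.0))
```

In practice the arithmetic is the easy part; most errors come from steps 1 and 2, where the model has to pull the right two numbers out of a long filing.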

7. Multimodal Benchmarks

MMBench / MMMU

MMBench:

Comprehensive evaluation of multimodal understanding
Image + text comprehension
Evaluates 20+ distinct sub-abilities

MMMU (Massive Multi-discipline Multimodal Understanding):

College-level multimodal understanding
11,500 problems covering 30 subjects and 183 subfields across 6 core disciplines
Understanding diagrams, charts, and formulas in fields such as medicine, science, and engineering

MMMU example:
[Chemical bonding diagram image]
Question: "What is the bond angle in this molecular structure?"
→ Requires visual interpretation of chemical structures

DocBench / OCRBench

OCRBench:

Measures OCR accuracy
Printed text, handwriting, multilingual text
Scene text and document text
1,000 evaluation samples

DocBench:

Measures document parsing quality
Table, formula, chart, and layout recognition
PDF and image document processing ability

8. Benchmark Selection Guide

Reference benchmarks by use case:

Use Case	Primary Benchmarks	Secondary Benchmarks
Chatbot / QA systems	MMLU, TruthfulQA	HellaSwag, WinoGrande
Code generation tools	HumanEval, SWE-bench	MBPP, LiveCodeBench
Agents / Automation	BFCL, AgentBench	τ-bench, ToolBench
RAG systems	MTEB Retrieval, BEIR	RAGAS, RULER
Document processing	DocVQA, OCRBench	FinanceBench
Math / Science	MATH, GSM8K	GPQA, AIME
Embedding model selection	Full MTEB	BEIR by domain
Multimodal	MMMU, MMBench	DocVQA

9. Limitations and Caveats

Data Contamination

The problem:

Test questions may be present in the model's training data
Publicly available benchmark questions have high probability of appearing in training data
Hard to distinguish genuine reasoning from memorization

Mitigations:

Dynamic benchmarks like LiveBench and LiveCodeBench
Private test sets
Continuous addition of new problems

Score Variation from Prompt Engineering

Same model, different prompts:
GSM8K standard prompting: 70%
GSM8K CoT prompting: 92%

→ A score reported without its prompting method is hard to interpret or compare

The Gap Between Benchmark Scores and Real-World Usability

A model with MMLU 90% might produce worse writing than one with 80%
Models that overfit to specific benchmarks exist
"Benchmark hacking": raising scores without actually improving real capability

Language Bias

Most benchmarks are English-centric
Insufficient measurement of Korean, Japanese, Arabic, and other languages
Separate multilingual benchmarks needed: MLQA, XNLI, MMTEB, etc.

Benchmark Saturation

HellaSwag: Humans and GPT-4 now at nearly the same level
ARC Easy: Most modern models exceed 98%
New, harder benchmarks are continuously needed

Quiz: Test Your Benchmark Understanding

Quiz 1: What does 5-shot learning in MMLU mean?

Answer: Before each test question, 5 example questions with their correct answers are provided in the prompt.

Explanation: In 5-shot learning, the prompt includes 5 example problems and their answers from the relevant subject before the actual test question. This guides the model to understand the question format and produce answers in the expected style. 0-shot means no examples, 1-shot means one example, and few-shot means a small number of examples.

Quiz 2: Why does GPT-4 score lower than humans on TruthfulQA?

Answer: TruthfulQA is deliberately designed to test misconceptions and false beliefs that humans commonly hold. AI models also learn incorrect information from training data and tend to generate plausible-sounding misinformation.

Explanation: The core purpose of TruthfulQA is to measure a model's tendency to produce "plausible but wrong" answers (hallucination). Humans can say "I'm not sure," but LLMs often confidently generate incorrect information. The benchmark is intentionally designed to be hard to score high on — differences between models are more meaningful than the absolute score itself.

Quiz 3: Why is pass@10 never lower than pass@1 in HumanEval?

Answer: pass@10 only requires at least 1 success out of 10 attempts, so it has a higher or equal probability of success compared to a single attempt (pass@1).

Explanation: pass@k is the probability of at least one success in k attempts. The formula is approximately 1 - (probability of failure)^k. As k increases, the probability of success increases, so pass@100 >= pass@10 >= pass@1 always holds. This metric is also used to assess the diversity and creativity of a model's code generation.
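
In practice pass@k is usually estimated from n generated samples of which c pass the tests, using the unbiased estimator from the original HumanEval paper: 1 - C(n-c, k) / C(n, k). A minimal sketch (the function name and sample counts are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples generated, c of them correct; estimate the chance that
    at least one of k randomly drawn samples is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 samples for one problem, 3 of them pass the unit tests.
for k in (1, 10, 20):
    print(k, pass_at_k(n=20, c=3, k=k))
```

The values grow with k and reach 1.0 once k is large enough that at least one correct sample must be drawn.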

Quiz 4: Why does BFCL use AST validation?

Answer: To verify the structural meaning of code rather than doing text matching. AST parses code into a syntax tree to accurately check function names, parameter names, types, and values.

Explanation: Simple text comparison might treat get_weather(city='Seoul') and get_weather(city = 'Seoul') as different. AST parsing ignores surface differences like whitespace and quote style to verify actual semantic equivalence. It also recognizes the same call regardless of parameter order, enabling more accurate evaluation.

Quiz 5: Why does MTEB use nDCG@10 for Retrieval tasks?

Answer: nDCG@10 measures the quality of the top 10 search results while assigning more weight to higher-ranked results. This reflects real user behavior since users typically only look at the top results.

Explanation: nDCG (Normalized Discounted Cumulative Gain) discounts relevance scores (0~3) with a log function so that higher-ranked results are weighted more heavily. The @10 means only the top 10 results are evaluated. For example, a relevant document in position 1 receives a much higher score than the same document in position 10.

Quiz 6: What is the difference between Faithfulness and Answer Relevance in RAGAS?

Answer: Faithfulness measures whether the answer is grounded in the retrieved context (does not fabricate), while Answer Relevance measures whether the answer actually addresses the core of the question.

Explanation: The two metrics catch different failure modes. Low Faithfulness means the model is making up content not in the context (hallucination). Low Answer Relevance means the model is faithful to the context but answering something other than what was asked. A good RAG system needs both metrics to be high.

Quiz 7: Why is SWE-bench harder and more realistic than HumanEval?

Answer: SWE-bench uses real GitHub issues and codebases. Unlike writing a single function, it requires understanding thousands of lines of existing code, diagnosing the root cause of a bug, making minimal targeted changes, and passing an existing test suite.

Explanation: HumanEval involves writing clean function implementations, but SWE-bench simulates real software development. The model must (1) understand the issue description, (2) navigate the codebase, (3) diagnose the bug, (4) decide how to fix it, (5) generate a patch, and (6) verify it passes existing tests. This closely mirrors the everyday work of a real developer.

Quiz 8: What are the main solutions to the data contamination problem?

Answer: Dynamic benchmarks (LiveBench, LiveCodeBench), private test sets, continuous addition of new problems, and generative evaluation are the main solutions.

Explanation: Data contamination occurs when test questions are included in training data, producing artificially high scores. LiveBench continuously adds new problems drawn from recent sources such as arXiv papers, and LiveCodeBench collects new problems from competitive programming sites, so models cannot have seen them during training. Some approaches also require model submitters to declare whether the test set was included in training data.

Quiz 9: Why is zero-shot evaluation important in BEIR?

Answer: To measure the true generalization ability of embedding models. A model that works well across diverse domains without domain-specific fine-tuning is far more practical.

Explanation: When building real RAG systems, you often need to handle documents from diverse domains like medicine, law, and finance. Training separate models for each domain is costly, so embedding models that work well across domains in zero-shot settings are much more practical. BEIR evaluates this generalization ability across 18 datasets drawn from different domains.

Conclusion: Using Benchmarks Wisely

Benchmark scores show only one facet of model capability. It is essential to choose benchmarks that match your actual use case and consider multiple benchmarks holistically rather than relying on any single one.

Core principles:

Choose benchmarks aligned with your goal: For code generation, HumanEval is more relevant than MMLU
Consider multiple benchmarks together: Ranking #1 on a single benchmark does not mean best in all areas
Check the prompting method: Verify whether results used CoT vs standard prompting
Be aware of data contamination: Cross-check with dynamic benchmarks
Test directly: Ultimately, evaluate on your actual use case

Benchmarks are maps, not the territory itself. Use multiple good maps to choose the optimal model for your needs.