LLM, Tool Calling & Embedding Benchmarks Deep Dive: What Each Benchmark Actually Measures
Author: Youngju Kim (@fjvbn20031)
When evaluating AI models, benchmark names appear everywhere. MMLU 85%, HumanEval 90%, MTEB #1 — let's fully understand what these numbers actually mean, how each benchmark works, and which ones matter for which use cases.
1. LLM General Benchmarks
MMLU (Massive Multitask Language Understanding)
Published by UC Berkeley in 2020, MMLU measures the breadth of an LLM's knowledge and comprehension across diverse academic fields.
How it works:
- 57 academic subjects (math, science, law, history, medicine, psychology, and more)
- 14,000+ multiple-choice questions with 4 answer choices
- 5-shot learning: 5 example questions with answers are provided before each test
Example question:
Subject: High School Chemistry
Example 1: What is the element with atomic number 6?
(A) Nitrogen (B) Oxygen (C) Carbon (D) Neon
Answer: (C)
...5 examples provided...
Test: What is required for an ionic bond to form?
(A) Between two non-metal atoms
(B) Between a metal and non-metal atom
(C) Between two metal atoms
(D) Between a noble metal and non-metal
Answer: ?
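The 5-shot setup above amounts to concatenating solved exemplars in front of the test question. A minimal sketch of such a prompt builder (the formatting is simplified and the exemplar list is truncated to one item for brevity; this is not the official MMLU harness):

```python
# Toy sketch of a 5-shot MMLU-style prompt builder.
def build_few_shot_prompt(exemplars, test_question, subject):
    """Prepend k solved exemplars before the unanswered test question."""
    lines = [f"The following are multiple choice questions about {subject}.", ""]
    for question, choices, answer in exemplars:
        lines.append(question)
        lines.extend(f"({label}) {text}" for label, text in choices)
        lines.append(f"Answer: ({answer})")
        lines.append("")
    question, choices = test_question
    lines.append(question)
    lines.extend(f"({label}) {text}" for label, text in choices)
    lines.append("Answer:")  # the model completes from here
    return "\n".join(lines)

# One exemplar shown; the real benchmark uses 5 per subject.
exemplars = [(
    "What is the element with atomic number 6?",
    [("A", "Nitrogen"), ("B", "Oxygen"), ("C", "Carbon"), ("D", "Neon")],
    "C",
)]
test_q = (
    "What is required for an ionic bond to form?",
    [("A", "Two non-metal atoms"), ("B", "A metal and a non-metal atom"),
     ("C", "Two metal atoms"), ("D", "A noble metal and a non-metal")],
)
prompt = build_few_shot_prompt(exemplars, test_q, "high school chemistry")
```

The model's accuracy is then just the fraction of test questions where its completion matches the gold letter.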
Score interpretation:
- Random guess: 25% (4 choices)
- GPT-4: ~86%, Claude 3 Opus: ~86%, Gemini Ultra: ~90%
- Human expert average: ~89%
Limitations:
- Hard to distinguish memorization from understanding: the model may have seen questions in training data
- English-centric: does not reflect multilingual ability
- Static dataset: no recent knowledge
- Data contamination risk: test questions may appear in training data
HellaSwag
Published in 2019, HellaSwag measures "commonsense reasoning" and "sentence completion." The name stands for Harder Endings, Longer contexts, and Low-shot Activities For Situations With Adversarial Generations.
How it works:
- Derived from ActivityNet (everyday activity video descriptions) and WikiHow (step-by-step guides)
- Choose the most natural continuation for a given situation
- Wrong choices (distractors) are generated by language models — plausible on the surface but actually incorrect
Example:
Situation: "A man is grilling hot dogs. He flips the hot dogs with tongs."
What comes next?
(A) He throws the hot dogs into the fire.
(B) He places the finished hot dogs in buns.
(C) He takes ingredients out of the refrigerator.
(D) He opens a recipe book.
Answer: (B)
Score interpretation:
- Humans: 95.6%
- GPT-4: 95.3%+
- Most modern LLMs approach human-level performance
Limitations:
- Already too easy for large models (saturated)
- Biased toward English everyday culture
- May reflect language pattern recognition more than deep reasoning
ARC (AI2 Reasoning Challenge)
Created by the Allen Institute for AI, this benchmark measures reasoning ability using elementary-to-middle school science questions.
Two sets:
Easy Set:
- 4-choice elementary school science questions
- Solvable with simple fact lookup
- Keyword/frequency-based methods can exceed 25%
Challenge Set:
- Designed so simple keyword search or information retrieval cannot solve it
- Requires reasoning, causation, and multi-step thinking
- 7,787 questions
Challenge Set example:
"What is the primary role of stomata in plant leaves?"
(A) Absorbing light for photosynthesis
(B) Regulating gas exchange and water transpiration
(C) Transporting water from roots to leaves
(D) Moving sugars to other plant parts
→ Requires understanding of function, not just surface retrieval
Score interpretation:
- Challenge Set: GPT-4 ~96%, human experts ~100%
- Easy Set is nearly saturated for modern models
TruthfulQA
Published in 2021, TruthfulQA measures a model's tendency to produce hallucinations and false beliefs.
How it works:
- 817 questions designed to elicit misconceptions humans commonly hold
- Evaluates whether the model accurately says "I don't know" or provides correct information
Example questions:
- "Is it true that we only use 10% of our brains?"
→ Misconception. Correct: "No, this is a myth."
- "What is a rectangle with perpendicular diagonals called?"
→ Trick question. A rectangle's diagonals are perpendicular only when it is a square.
- "Was Napoleon very short?"
→ Actually average height for his era.
Score interpretation:
- Humans: ~94%
- GPT-4: ~60% (intentionally difficult)
- A low score means the model confidently generates plausible misinformation
Key point: TruthfulQA is deliberately designed so that high scores are hard to achieve. A low-scoring model is especially prone to producing believable misinformation.
WinoGrande
Published in 2019, WinoGrande uses 44,000 commonsense reasoning problems to measure pronoun disambiguation ability.
How it works:
- Large-scale version of the Winograd Schema Challenge
- Fill in one of two blanks requiring common sense
- Uses adversarial filtering (AFLITE) to reduce annotation artifacts and dataset-specific biases
Example:
"The trophy didn't fit in the brown suitcase because ___ was too big."
(A) it [trophy]
(B) it [suitcase]
→ Requires understanding that the trophy was too big
"At the library, Sarah read more books than Amy. ___ enjoyed reading."
(A) Sarah
(B) Amy
→ Commonsense judgment required
Score interpretation:
- Random: 50%
- GPT-4: ~87%, Humans: ~94%
BIG-Bench (Beyond the Imitation Game Benchmark)
A large-scale benchmark containing 204 diverse tasks that evaluates capabilities difficult to assess with existing benchmarks.
BIG-Bench Hard (BBH):
- 23 particularly challenging reasoning tasks
- Especially useful for measuring the effect of Chain-of-Thought (CoT) prompting
- Includes web navigation, scheduling, symbolic reasoning, and more
BBH example tasks:
- Boolean Expressions: Evaluate "(True and False) or (not True and True)"
- Causal Judgment: Determine direction of causation
- Formal Fallacies: Identify logical errors
- Movie Recommendation: Preference-based recommendations
- Object Counting: Count objects from textual descriptions
- Temporal Sequences: Sort events chronologically
- Word Sorting: Sort by alphabet or given condition
Chain-of-Thought effect:
- Standard prompting: GPT-4 ~65%
- CoT prompting: GPT-4 ~85%+
- Used to identify where CoT is most effective
GPQA (Graduate-Level Google-Proof Q&A)
Published in 2023, GPQA requires PhD-level scientific expertise and is designed so that even Google searches cannot easily find the answer.
How it works:
- Written directly by PhD researchers in biology, chemistry, and physics
- 4-choice questions solvable only by domain experts
- Engineered so web searches are not sufficient
Score interpretation:
- Non-expert PhD: ~34%
- Domain expert PhD: ~65%
- GPT-4: ~39%, Claude 3 Opus: ~50%+
Example (Physics):
"What is the primary advantage of topological qubits in quantum computers?"
(A) Can only operate at absolute zero temperature
(B) Topologically protected, resistant to environmental noise
(C) Faster gate speeds than traditional transistors
(D) Support unlimited qubit count
→ Requires deep understanding of quantum error correction
LiveBench
A dynamic benchmark that adds new questions monthly to prevent data contamination.
How it works:
- Covers math, coding, reasoning, language, and agent tasks
- Generated from recent arXiv papers, news, and competitive programming problems
- Only includes questions with objective, verifiable answers
Why it matters:
- Addresses data contamination in static benchmarks
- Distinguishes genuine reasoning from memorization
- Continuously updated for fair comparison of latest models
2. Coding Benchmarks
HumanEval
Published by OpenAI in 2021, HumanEval is the most widely used coding benchmark for measuring Python programming ability.
How it works:
- 164 Python function implementation problems
- Provides function signature + docstring + sample inputs/outputs
- Checks whether generated code passes hidden test cases
# Example problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check whether any two numbers in the list are closer
    to each other than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Model must implement this
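Scoring works by executing the generated completion against test cases. A minimal sketch of such a harness (simplified: the real benchmark runs untrusted code in a sandbox and grades against held-out tests):

```python
# Minimal sketch of a HumanEval-style check: exec the candidate code,
# then run the hidden tests. A real harness sandboxes this execution.

# Stand-in for a model-generated completion (a correct one, for the demo).
candidate_code = """
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    sorted_nums = sorted(numbers)
    return any(b - a < threshold for a, b in zip(sorted_nums, sorted_nums[1:]))
"""

def check_candidate(code: str) -> bool:
    """Return True if the candidate passes all hidden test cases."""
    namespace = {}
    try:
        exec(code, namespace)                      # run the generated code
        fn = namespace["has_close_elements"]
        assert fn([1.0, 2.0, 3.0], 0.5) is False   # hidden test cases
        assert fn([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
        return True
    except Exception:
        return False
```

pass@1 is then simply the fraction of problems where `check_candidate` returns True on the first sampled completion.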
pass@k metric:
- pass@1: Probability of passing on the first attempt
- pass@10: Probability that at least 1 of 10 attempts passes
- pass@100: Probability that at least 1 of 100 attempts passes
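In practice pass@k is computed with the unbiased estimator from the Codex paper, pass@k = 1 − C(n−c, k) / C(n, k), where n samples are drawn and c of them pass. A sketch in its numerically stable product form:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = total samples drawn, c = samples that passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: at least one success
    # Expand the binomial ratio as a product to avoid huge intermediate values.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With 200 samples and 50 passing, more attempts raise the estimate:
p1 = pass_at_k(200, 50, 1)    # exactly c/n = 0.25
p10 = pass_at_k(200, 50, 10)  # strictly higher than pass@1
```

This is why pass@100 ≥ pass@10 ≥ pass@1 always holds for the same model and problem set.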
Score interpretation:
- GPT-4: pass@1 ~87%
- Claude 3.5 Sonnet: pass@1 ~92%+
- Original GPT-3: pass@1 ~0%
Limitations:
- Only 164 problems — limited diversity
- Relatively low algorithmic complexity
- Does not measure real-world software skills (debugging, refactoring)
MBPP (Mostly Basic Python Problems)
A collection of 974 crowd-sourced Python problems published by Google Research.
Differences from HumanEval:
- More diverse patterns and styles
- Includes simpler problems (beginner to intermediate)
- Crowd-sourced for varied difficulty levels
# MBPP example
"""
Write a function to find the maximum product subarray.
assert max_product_subarray([6, -3, -10, 0, 2]) == 180
assert max_product_subarray([-1, -3, -10, 0, 60]) == 60
"""
SWE-bench
Published in 2023, SWE-bench measures the ability to resolve real GitHub issues and bugs.
How it works:
- 12 real Python open-source projects (Django, Flask, scikit-learn, etc.)
- 2,294 real GitHub issues with verified patches
- Model reads issue descriptions and generates actual code fixes
- Validated by the existing test suite
Example issue:
Repository: scikit-learn
Issue: "KNeighborsClassifier.predict() returns incorrect
results when given sparse matrix input"
What the model must do:
1. Understand the issue description
2. Locate relevant source code
3. Generate a bug fix patch
4. Ensure existing tests pass
SWE-bench Lite:
- 300 selected, more clearly defined problems
- Subset for faster evaluation
Score interpretation:
- Early 2023: Even GPT-4 scored only 1~2%
- 2024: Latest agent systems reaching 20~50%
- Reflects the true complexity of real software engineering
Why it matters:
- Far more realistic evaluation than HumanEval
- Integrates code comprehension + modification + verification
- Assesses actual potential for replacing developer tasks
LiveCodeBench
A dynamic coding benchmark that continuously adds new problems from LeetCode, AtCoder, and CodeForces to prevent data contamination.
Features:
- Uses problems added after competitions end
- Measures performance on problems the model has never seen
- Includes code generation, self-repair, and code execution prediction
3. Reasoning & Math Benchmarks
GSM8K (Grade School Math)
A benchmark of 8,500 elementary school math problems published by OpenAI in 2021.
Features:
- Requires 2–8 step multi-step reasoning
- Basic arithmetic, fractions, decimals, percentages
- Core benchmark for validating Chain-of-Thought reasoning effectiveness
Example problem:
"Janet's ducks lay 16 eggs per day. Every morning
she eats 3 for breakfast and uses 4 for muffins for
her friends. She sells the remainder at $2 per egg.
How much does she earn per day?"
Chain-of-Thought reasoning:
1. Daily eggs: 16
2. Eaten: 3
3. Used for muffins: 4
4. Eggs to sell: 16 - 3 - 4 = 9
5. Earnings: 9 * 2 = $18
Answer: $18
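Because every GSM8K answer is a single exact number, the chain above can be verified mechanically, which is what makes the benchmark easy to auto-grade:

```python
# Verify the worked chain for the example problem.
eggs_per_day = 16
eaten, muffins = 3, 4
price_per_egg = 2

eggs_sold = eggs_per_day - eaten - muffins   # 16 - 3 - 4 = 9
earnings = eggs_sold * price_per_egg         # 9 * $2 = $18
```

A grader simply compares the model's final number against this target, ignoring the intermediate reasoning text.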
Score interpretation:
- Humans: ~100%
- GPT-4 (CoT): 92%+
- GPT-3 (standard): ~20%
- GPT-3 (CoT): ~56%
- One of the most dramatic demonstrations of CoT effectiveness
MATH
A collection of 12,500 competition-level math problems published in 2021.
7 subject areas:
- Algebra
- Precalculus
- Geometry
- Number Theory
- Counting and Probability
- Intermediate Algebra
- Prealgebra
5 difficulty levels:
- Level 1 (easiest): AMC 8 level
- Level 5 (hardest): AIME, HMMT level
Level 5 example:
"Factor x^4 + 4x^3 - 2x^2 - 12x + 9"
Answer: (x^2 + 2x - 3)^2 = (x+3)^2(x-1)^2
→ Requires advanced algebraic manipulation
Score interpretation:
- GPT-4: ~52% overall, Level 5: ~20%+
- Latest models (o1, Gemini Ultra): 80%+
- Math-specialized models are improving rapidly
AIME (American Invitational Mathematics Examination)
Real problems from the American math olympiad qualifying exam.
Features:
- Integer answers from 0 to 999 (no multiple choice)
- Designed for AMC 10/12 qualifiers
- Demands extreme mathematical creativity
Score interpretation:
- Top 5% of humans: 7~9 of 15 problems
- GPT-4o: ~1~2 of 15 (as of 2024)
- The o1 series made breakthrough advances here
4. Tool Calling / Function Calling Benchmarks
BFCL (Berkeley Function Calling Leaderboard)
The most comprehensive function calling benchmark, published by UC Berkeley in 2024.
2,000+ function calling scenarios:
Categories:
- Simple Function Calling — single function, clear parameters
- Multiple Functions — select the right function from several options
- Parallel Functions — invoke multiple functions simultaneously
- Nested Functions — call functions within other functions
- REST API — call real HTTP API endpoints
Evaluation criteria:
- Correct function name selection
- Parameter name accuracy
- Parameter type accuracy (string vs int vs float)
- Semantic correctness of parameter values
- Absence of unnecessary parameters
AST validation approach:
# Ground truth function call
get_weather(
location="Seoul, Korea",
unit="celsius",
forecast_days=3
)
# Model-generated call
get_weather(
location="Seoul", # Partial match — allowed?
unit="C", # Type/format error
days=3 # Parameter name error!
)
AST (Abstract Syntax Tree) parsing verifies structural correctness
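The idea can be sketched with Python's built-in ast module (a simplified illustration of the approach; BFCL's actual matcher applies more rules, e.g. for optional parameters and acceptable value variants):

```python
import ast

def parse_call(src: str):
    """Extract (function name, {param: value}) from a call expression."""
    call = ast.parse(src, mode="eval").body
    assert isinstance(call, ast.Call)
    name = ast.unparse(call.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, kwargs

def calls_match(expected: str, generated: str) -> bool:
    """Structural comparison: same name and keyword args, order-insensitive."""
    return parse_call(expected) == parse_call(generated)

# Whitespace and argument order don't matter; names and values do.
gold = "get_weather(location='Seoul, Korea', unit='celsius', forecast_days=3)"
ok   = "get_weather(forecast_days=3, unit='celsius', location='Seoul, Korea')"
bad  = "get_weather(location='Seoul', unit='C', days=3)"
```

Here `calls_match(gold, ok)` succeeds despite reordered arguments, while `calls_match(gold, bad)` fails on the renamed `days` parameter and the changed values.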
Supported languages/environments:
- Python, Java, JavaScript, SQL, REST API
Score interpretation (2024):
- GPT-4o: overall ~72%
- Claude 3.5 Sonnet: overall ~73%
- Open-source models: 40~60% range
tau-bench (τ-bench)
A benchmark measuring real agent task completion, going beyond simple function call accuracy to measure end-to-end task success rates.
How it works:
- Real business scenarios (travel booking, shopping, etc.)
- Multi-step agent workflows
- Appropriate tool use at each step
- Final task completion evaluation
Example scenario:
"Find a one-way flight from New York to Paris on March 20,
book the cheapest option, and send a confirmation email."
Required steps:
1. search_flights(origin="NYC", destination="Paris", date="2026-03-20")
2. select_flight(flight_id="AF001", criteria="cheapest")
3. book_flight(flight_id="AF001", passenger_info=...)
4. send_confirmation_email(booking_id=..., email=...)
→ Measures accuracy of each step AND overall completion
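End-to-end success is typically judged by replaying the agent's tool calls against a simulated environment and checking the final state. A toy sketch of that idea (every tool, ID, and price here is hypothetical, invented for illustration):

```python
# Toy simulated environment for a tau-bench-style completion check.
# All tool names and data below are hypothetical.
FLIGHTS = [
    {"id": "AF001", "price": 420},
    {"id": "DL210", "price": 515},
]
booked = []

def search_flights(origin, destination, date):
    """Stub tool: return the available flights."""
    return FLIGHTS

def book_flight(flight_id):
    """Stub tool: record the booking in the environment state."""
    booked.append(flight_id)
    return {"booking_id": f"BK-{flight_id}"}

# Agent trace: search, pick the cheapest result, then book it.
results = search_flights("NYC", "Paris", "2026-03-20")
cheapest = min(results, key=lambda f: f["price"])
confirmation = book_flight(cheapest["id"])

# Task-completion check: did the *right* flight end up booked?
task_success = booked == ["AF001"]
```

The point is that the grader inspects the environment's final state (`booked`), not the surface form of the model's tool calls, so a plausible-looking but wrong booking still counts as failure.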
ToolBench / ToolEval
Published in 2023, this benchmark evaluates tool-use ability with 16,000 real REST APIs.
How it works:
- 49 categories, 16,000 APIs collected from RapidAPI
- Select appropriate API from real API documentation
- Call the API with correct parameters
- Multi-step API chaining
Solvable Pass Rate (SoPR) metric:
- Success rate on problems that are actually solvable
- Compares ChatGPT's built-in Function Calling vs ToolLLM
Evaluation criteria:
- Tool selection accuracy (choosing the right API)
- Execution order accuracy
- Parameter accuracy
- Error handling ability
AgentBench
Published in 2023, this benchmark measures autonomous LLM agent ability across 8 different environments.
8 environments:
- OS — Operating system tasks (file manipulation, command execution)
- DB — Database queries and manipulation
- Knowledge Graph — Knowledge graph traversal
- Digital Card Game — Strategic card game
- Lateral Thinking Puzzles — Creative problem solving
- House-Holding — Household tasks in a simulated home environment
- Web Shopping — Online shopping tasks
- Web Browsing — Web navigation and information gathering
OS environment example:
"Find all .py files in the current directory created in 2023
and move them to a 'python_files' folder."
→ Requires combining find, mkdir, mv commands
→ Measures multi-step decision-making and error recovery
Score interpretation:
- GPT-4: overall ~3.6/10
- GPT-3.5: overall ~1.9/10
- Many open-source models: below 1
5. Embedding Benchmarks
MTEB (Massive Text Embedding Benchmark)
Published in 2022, MTEB is the most comprehensive benchmark for evaluating text embedding models.
56 datasets, 8 task types:
1. Retrieval
- Find the most relevant document for a query
- Uses nDCG@10 metric
- Includes BEIR benchmark datasets
Example: "How to sort a list in Python"
→ Rank relevant Stack Overflow answers and documentation
2. Classification
- Text classification (sentiment analysis, topic classification, etc.)
- Evaluated with embedding + logistic regression
- Accuracy or F1 score
3. Clustering
- Automatically group similar texts
- ArXiv papers, Reddit posts, etc.
- V-measure metric
4. Semantic Textual Similarity (STS)
- Semantic similarity score between two sentences (0~5)
- Evaluated with Spearman correlation
Example:
Sentence 1: "A dog is running in the park"
Sentence 2: "A canine is sprinting outdoors"
→ High similarity (~4.0/5.0)
Sentence 1: "The weather is sunny today"
Sentence 2: "I love eating pizza"
→ Low similarity (~0.5/5.0)
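Under the hood, STS scoring reduces to comparing embedding vectors, usually with cosine similarity; the resulting similarities are then correlated (Spearman) with human scores. A sketch with toy 3-dimensional vectors (real models produce hundreds of dimensions, so these numbers are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for the three example sentences above.
dog_park   = [0.9, 0.1, 0.2]   # "A dog is running in the park"
canine_run = [0.8, 0.2, 0.3]   # "A canine is sprinting outdoors"
pizza      = [0.1, 0.9, 0.1]   # "I love eating pizza"

similar   = cosine_similarity(dog_park, canine_run)  # close to 1
unrelated = cosine_similarity(dog_park, pizza)       # much lower
```

A good embedding model is one whose cosine similarities rank sentence pairs in the same order as human judgments.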
5. Reranking
- Reorder initial search results
- MAP (Mean Average Precision) metric
- Final sorting ability of search engines
6. Summarization
- Semantic similarity between summary and original document
- Spearman correlation
7. Pair Classification
- Classify relationship between two sentences (similar/dissimilar, duplicate/non-duplicate)
- AP (Average Precision) metric
Example:
- Duplicate detection: "How to sort a Python list" vs "Sort list in Python"
→ Duplicate (True)
- "Apples are fruit" vs "I like swimming"
→ Unrelated (False)
8. Bitext Mining
- Find parallel sentence pairs across languages
- F1 score
Example:
English: "The weather is nice today"
Korean: "오늘 날씨가 좋다"
→ Parallel pair detection
MTEB Leaderboard (HuggingFace):
- Compare models by overall score
- View per-task detailed scores
- Top performers (2024): text-embedding-3-large, voyage-large-2, E5-mistral-7b
BEIR (Benchmarking Information Retrieval)
Published in 2021, BEIR measures information retrieval performance across 18 diverse domains.
18 datasets, including:
- TREC-COVID: Medical paper search on COVID-19
- NFCorpus: Medical/nutrition information retrieval
- NQ (Natural Questions): Google natural language search
- HotpotQA: Multi-hop reasoning retrieval
- FiQA: Financial Q&A
- ArguAna: Counter-argument retrieval
- Touche: Debate argument retrieval
- CQADupStack: Community Q&A duplicate detection
- Quora: Duplicate question detection
- DBPedia: Entity search
- SCIDOCS: Academic paper retrieval
- FEVER: Fact verification
- Climate-FEVER: Climate-related fact verification
- SciFact: Scientific claim verification
nDCG@10 metric:
nDCG@10 = Normalized Discounted Cumulative Gain of top 10 results
Relevance scores:
- Highly relevant: 3 points
- Relevant: 2 points
- Marginally relevant: 1 point
- Not relevant: 0 points
Higher-ranked results receive more weight
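The metric can be computed in a few lines. This sketch uses linear gains with log2 rank discounting (a common convention; another variant uses 2^rel − 1 gains):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: rank i contributes rel / log2(i + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A highly relevant document (3) ranked first scores better
# than the same document buried at rank 4:
good_ranking = ndcg_at_k([3, 2, 0, 1])
bad_ranking  = ndcg_at_k([1, 2, 0, 3])
```

A perfectly ordered result list scores exactly 1.0, and burying relevant documents lower in the ranking drags the score down.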
Zero-shot evaluation:
- Measures generalization across domains without domain-specific fine-tuning
- Compares traditional methods like BM25 with neural embeddings
6. RAG & Document Parsing Benchmarks
RAGAS (Retrieval Augmented Generation Assessment)
A comprehensive framework for measuring the quality of RAG systems.
5 core metrics:
1. Faithfulness
- Is the generated answer grounded in the retrieved context?
- Does the model avoid fabricating content not in the context?
- Score range: 0~1
Context: "Python was created by Guido van Rossum in 1991."
Question: "When was Python created and by whom?"
High Faithfulness answer:
"Python was created by Guido van Rossum in 1991."
Low Faithfulness answer (hallucination):
"Python was created by Guido van Rossum in 1989,
in Amsterdam, the Netherlands..."
→ The date contradicts the context, and the location appears nowhere in it — plausible-sounding details are fabricated
2. Answer Relevance
- Does the answer actually address the question?
- Does the answer avoid including off-topic information?
3. Context Precision
- Is the retrieved context genuinely useful?
- Ratio of unnecessary context included
4. Context Recall
- Was all information needed to answer retrieved?
- Whether ground truth answer information is present in context
5. Context Entity Recall
- Are important entities (people, places, dates, etc.) present in the context?
RULER (Retrieval Under Long-context Evaluation Regime)
A benchmark measuring long-context LLM ability, going beyond simple Needle-in-a-Haystack to evaluate complex long-document understanding.
Task types:
- NIAH (Needle-in-a-Haystack): Find specific information in a long document
- Multi-key NIAH: Find multiple pieces of information simultaneously
- Multi-value NIAH: Extract multiple values for a single key
- Multi-hop Tracing: Reason through multiple steps following information chains
- Aggregation: Aggregate information across the full document
- QA: Question answering based on long context
Multi-hop Tracing example (in a 128K token document):
"Alice's manager is Bob. Bob's birthday is March 15th.
... (tens of thousands of tokens of unrelated content) ...
What is Alice's manager's birthday?"
→ Measures the ability to connect Alice → Bob → March 15th
DocVQA
Measures visual question answering ability on real document images.
How it works:
- Real scanned document images (invoices, forms, reports, contracts, etc.)
- Natural language question + document image → generate answer
- Integrates OCR ability + document structure understanding + content comprehension
Example:
[Invoice image]
Question: "What is the total tax amount?"
→ Locate the tax line item and extract the value
[Medical form]
Question: "What is the patient's date of birth?"
→ Identify the specific field location and extract value
ANLS (Average Normalized Levenshtein Similarity) metric:
- Similarity measured by edit distance, not exact match
- Allows for numeric/date format variations
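ANLS follows directly from the Levenshtein distance: similarity is 1 − distance / max length, zeroed out below a 0.5 threshold (the scores are then averaged across questions). A sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, truth: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold."""
    p, t = prediction.lower().strip(), truth.lower().strip()
    if not p and not t:
        return 1.0
    nls = 1.0 - levenshtein(p, t) / max(len(p), len(t))
    return nls if nls >= threshold else 0.0

close_match = anls("$1,250.00", "$1,250")  # high: minor format difference
wrong_field = anls("March 3", "$1,250")    # 0.0: below threshold
```

This is why a model that extracts "$1,250.00" from an invoice still gets most of the credit when the gold answer is "$1,250", while extracting the wrong field scores zero.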
FinanceBench
A Q&A benchmark based on financial documents (10-K annual reports, 10-Q quarterly reports).
How it works:
- Real corporate disclosure documents (SEC EDGAR)
- Questions requiring numerical extraction, calculation, and multi-step reasoning
Example:
[Apple Inc. 2023 Annual Report]
Question: "What was the year-over-year revenue growth rate
of the Services segment in 2023?"
Required capabilities:
1. Find 2023 Services revenue
2. Find 2022 Services revenue
3. Calculate growth rate: (2023-2022)/2022 * 100
7. Multimodal Benchmarks
MMBench / MMMU
MMBench:
- Comprehensive evaluation of multimodal understanding
- Image + text comprehension
- Evaluates 20+ distinct sub-abilities
MMMU (Massive Multi-discipline Multimodal Understanding):
- College-level multimodal understanding
- 11,500 problems, 30 disciplines, 183 sub-topics
- Understanding diagrams, charts, and formulas in medicine, law, engineering
MMMU example:
[Chemical bonding diagram image]
Question: "What is the bond angle in this molecular structure?"
→ Requires visual interpretation of chemical structures
DocBench / OCRBench
OCRBench:
- Measures OCR accuracy
- Printed text, handwriting, multilingual text
- Scene text and document text
- 1,000 evaluation samples
DocBench:
- Measures document parsing quality
- Table, formula, chart, and layout recognition
- PDF and image document processing ability
8. Benchmark Selection Guide
Reference benchmarks by use case:
| Use Case | Primary Benchmarks | Secondary Benchmarks |
|---|---|---|
| Chatbot / QA systems | MMLU, TruthfulQA | HellaSwag, WinoGrande |
| Code generation tools | HumanEval, SWE-bench | MBPP, LiveCodeBench |
| Agents / Automation | BFCL, AgentBench | τ-bench, ToolBench |
| RAG systems | MTEB Retrieval, BEIR | RAGAS, RULER |
| Document processing | DocVQA, OCRBench | FinanceBench |
| Math / Science | MATH, GSM8K | GPQA, AIME |
| Embedding model selection | Full MTEB | BEIR by domain |
| Multimodal | MMMU, MMBench | DocVQA |
9. Limitations and Caveats
Data Contamination
The problem:
- Test questions may be present in the model's training data
- Publicly available benchmark questions have high probability of appearing in training data
- Hard to distinguish genuine reasoning from memorization
Mitigations:
- Dynamic benchmarks like LiveBench and LiveCodeBench
- Private test sets
- Continuous addition of new problems
Score Variation from Prompt Engineering
Same model, different prompts:
- GSM8K, standard prompting: 70%
- GSM8K, CoT prompting: 92%
→ Scores reported without the prompting method cannot be meaningfully compared
The Gap Between Benchmark Scores and Real-World Usability
- A model with MMLU 90% might produce worse writing than one with 80%
- Models that overfit to specific benchmarks exist
- "Benchmark hacking": raising scores without actually improving real capability
Language Bias
- Most benchmarks are English-centric
- Insufficient measurement of Korean, Japanese, Arabic, and other languages
- Separate multilingual benchmarks needed: MLQA, XNLI, mMTEB, etc.
Benchmark Saturation
- HellaSwag: Humans and GPT-4 now at nearly the same level
- ARC Easy: Most modern models exceed 98%
- New, harder benchmarks are continuously needed
Quiz: Test Your Benchmark Understanding
Quiz 1: What does 5-shot learning in MMLU mean?
Answer: Before each test question, 5 example questions with their correct answers are provided in the prompt.
Explanation: In 5-shot learning, the prompt includes 5 example problems and their answers from the relevant subject before the actual test question. This guides the model to understand the question format and produce answers in the expected style. 0-shot means no examples, 1-shot means one example, and few-shot means a small number of examples.
Quiz 2: Why does GPT-4 score lower than humans on TruthfulQA?
Answer: TruthfulQA is deliberately designed to test misconceptions and false beliefs that humans commonly hold. AI models also learn incorrect information from training data and tend to generate plausible-sounding misinformation.
Explanation: The core purpose of TruthfulQA is to measure a model's tendency to produce "plausible but wrong" answers (hallucination). Humans can say "I'm not sure," but LLMs often confidently generate incorrect information. The benchmark is intentionally designed to be hard to score high on — differences between models are more meaningful than the absolute score itself.
Quiz 3: Why is pass@10 always higher than pass@1 in HumanEval?
Answer: pass@10 only requires at least 1 success out of 10 attempts, so it has a higher or equal probability of success compared to a single attempt (pass@1).
Explanation: pass@k is the probability of at least one success in k attempts. The formula is approximately 1 - (probability of failure)^k. As k increases, the probability of success increases, so pass@100 >= pass@10 >= pass@1 always holds. This metric is also used to assess the diversity and creativity of a model's code generation.
Quiz 4: Why does BFCL use AST validation?
Answer: To verify the structural meaning of code rather than doing text matching. AST parses code into a syntax tree to accurately check function names, parameter names, types, and values.
Explanation: Simple text comparison might treat get_weather(city='Seoul') and get_weather(city = 'Seoul') as different. AST parsing ignores surface differences like whitespace and quote style to verify actual semantic equivalence. It also recognizes the same call regardless of parameter order, enabling more accurate evaluation.
Quiz 5: Why does MTEB use nDCG@10 for Retrieval tasks?
Answer: nDCG@10 measures the quality of the top 10 search results while assigning more weight to higher-ranked results. This reflects real user behavior since users typically only look at the top results.
Explanation: nDCG (Normalized Discounted Cumulative Gain) discounts relevance scores (0~3) with a log function so that higher-ranked results are weighted more heavily. The @10 means only the top 10 results are evaluated. For example, a relevant document in position 1 receives a much higher score than the same document in position 10.
Quiz 6: What is the difference between Faithfulness and Answer Relevance in RAGAS?
Answer: Faithfulness measures whether the answer is grounded in the retrieved context (does not fabricate), while Answer Relevance measures whether the answer actually addresses the core of the question.
Explanation: The two metrics catch different failure modes. Low Faithfulness means the model is making up content not in the context (hallucination). Low Answer Relevance means the model is faithful to the context but answering something other than what was asked. A good RAG system needs both metrics to be high.
Quiz 7: Why is SWE-bench harder and more realistic than HumanEval?
Answer: SWE-bench uses real GitHub issues and codebases. Unlike writing a single function, it requires understanding thousands of lines of existing code, diagnosing the root cause of a bug, making minimal targeted changes, and passing an existing test suite.
Explanation: HumanEval involves writing clean function implementations, but SWE-bench simulates real software development. The model must (1) understand the issue description, (2) navigate the codebase, (3) diagnose the bug, (4) decide how to fix it, (5) generate a patch, and (6) verify it passes existing tests. This closely mirrors the everyday work of a real developer.
Quiz 8: What are the main solutions to the data contamination problem?
Answer: Dynamic benchmarks (LiveBench, LiveCodeBench), private test sets, continuous addition of new problems, and generative evaluation are the main solutions.
Explanation: Data contamination occurs when test questions are included in training data, producing artificially high scores. LiveBench continuously adds new problems from recent arXiv papers and competitive programming sites so models cannot preview them. Some approaches also require model submitters to declare whether the test set was included in training data.
Quiz 9: Why is zero-shot evaluation important in BEIR?
Answer: To measure the true generalization ability of embedding models. A model that works well across diverse domains without domain-specific fine-tuning is far more practical.
Explanation: When building real RAG systems, you often need to handle documents from diverse domains like medicine, law, and finance. Training separate models for each domain is costly, so embedding models that work well across domains in zero-shot settings are much more practical. BEIR evaluates this generalization ability across 18 domains.
Conclusion: Using Benchmarks Wisely
Benchmark scores show only one facet of model capability. It is essential to choose benchmarks that match your actual use case and consider multiple benchmarks holistically rather than relying on any single one.
Core principles:
- Choose benchmarks aligned with your goal: For code generation, HumanEval is more relevant than MMLU
- Consider multiple benchmarks together: Ranking #1 on a single benchmark does not mean best in all areas
- Check the prompting method: Verify whether results used CoT vs standard prompting
- Be aware of data contamination: Cross-check with dynamic benchmarks
- Test directly: Ultimately, evaluate on your actual use case
Benchmarks are maps, not the territory itself. Use multiple good maps to choose the optimal model for your needs.