LLM, Tool Calling & Embedding Benchmarks Deep Dive: What Each Benchmark Actually Measures
Author: Youngju Kim (@fjvbn20031)
When evaluating AI models, benchmark names appear everywhere. MMLU 85%, HumanEval 90%, MTEB #1 — let's fully understand what these numbers actually mean, how each benchmark works, and which ones matter for which use cases.
1. LLM General Benchmarks
MMLU (Massive Multitask Language Understanding)
Published by UC Berkeley in 2020, MMLU measures the breadth of an LLM's knowledge and comprehension across diverse academic fields.
How it works:
- 57 academic subjects (math, science, law, history, medicine, psychology, and more)
- 14,000+ multiple-choice questions with 4 answer choices
- 5-shot learning: 5 example questions with answers are provided before each test
Example question:
Subject: High School Chemistry
Example 1: What is the element with atomic number 6?
(A) Nitrogen (B) Oxygen (C) Carbon (D) Neon
Answer: (C)
...5 examples provided...
Test: What is required for an ionic bond to form?
(A) Between two non-metal atoms
(B) Between a metal and non-metal atom
(C) Between two metal atoms
(D) Between a noble metal and non-metal
Answer: ?
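The 5-shot setup above amounts to concatenating solved exemplars in front of the test question. A minimal sketch of such a prompt builder (the formatting is simplified and the exemplar list is truncated to one item for brevity; this is not the official MMLU harness):

```python
# Toy sketch of a 5-shot MMLU-style prompt builder.
def build_few_shot_prompt(exemplars, test_question, subject):
    """Prepend k solved exemplars before the unanswered test question."""
    lines = [f"The following are multiple choice questions about {subject}.", ""]
    for question, choices, answer in exemplars:
        lines.append(question)
        lines.extend(f"({label}) {text}" for label, text in choices)
        lines.append(f"Answer: ({answer})")
        lines.append("")
    question, choices = test_question
    lines.append(question)
    lines.extend(f"({label}) {text}" for label, text in choices)
    lines.append("Answer:")  # the model completes from here
    return "\n".join(lines)

# One exemplar shown; the real benchmark uses 5 per subject.
exemplars = [(
    "What is the element with atomic number 6?",
    [("A", "Nitrogen"), ("B", "Oxygen"), ("C", "Carbon"), ("D", "Neon")],
    "C",
)]
test_q = (
    "What is required for an ionic bond to form?",
    [("A", "Two non-metal atoms"), ("B", "A metal and a non-metal atom"),
     ("C", "Two metal atoms"), ("D", "A noble metal and a non-metal")],
)
prompt = build_few_shot_prompt(exemplars, test_q, "high school chemistry")
```

The model's accuracy is then just the fraction of test questions where its completion matches the gold letter.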
Score interpretation:
- Random guess: 25% (4 choices)
- GPT-4: ~86%, Claude 3 Opus: ~86%, Gemini Ultra: ~90%
- Human expert average: ~89%
Limitations:
- Hard to distinguish memorization from understanding: the model may have seen questions in training data
- English-centric: does not reflect multilingual ability
- Static dataset: no recent knowledge
- Data contamination risk: test questions may appear in training data
HellaSwag
Published in 2019, HellaSwag measures "commonsense reasoning" and "sentence completion." The name stands for Harder Endings, Longer contexts, and Low-shot Activities For Situations With Adversarial Generations.
How it works:
- Derived from ActivityNet (everyday activity video descriptions) and WikiHow (step-by-step guides)
- Choose the most natural continuation for a given situation
- Wrong choices (distractors) are generated by language models — plausible on the surface but actually incorrect
Example:
Situation: "A man is grilling hot dogs. He flips the hot dogs with tongs."
What comes next?
(A) He throws the hot dogs into the fire.
(B) He places the finished hot dogs in buns.
(C) He takes ingredients out of the refrigerator.
(D) He opens a recipe book.
Answer: (B)
Score interpretation:
- Humans: 95.6%
- GPT-4: 95.3%+
- Most modern LLMs approach human-level performance
Limitations:
- Already too easy for large models (saturated)
- Biased toward English everyday culture
- May reflect language pattern recognition more than deep reasoning
ARC (AI2 Reasoning Challenge)
Created by the Allen Institute for AI, this benchmark measures reasoning ability using elementary-to-middle school science questions.
Two sets:
Easy Set:
- 4-choice elementary school science questions
- Solvable with simple fact lookup
- Keyword/frequency-based methods can exceed 25%
Challenge Set:
- Designed so simple keyword search or information retrieval cannot solve it
- Requires reasoning, causation, and multi-step thinking
- 7,787 questions
Challenge Set example:
"What is the primary role of stomata in plant leaves?"
(A) Absorbing light for photosynthesis
(B) Regulating gas exchange and water transpiration
(C) Transporting water from roots to leaves
(D) Moving sugars to other plant parts
→ Requires understanding of function, not just surface retrieval
Score interpretation:
- Challenge Set: GPT-4 ~96%, human experts ~100%
- Easy Set is nearly saturated for modern models
TruthfulQA
Published in 2021, TruthfulQA measures a model's tendency to produce hallucinations and false beliefs.
How it works:
- 817 questions designed to elicit misconceptions humans commonly hold
- Evaluates whether the model accurately says "I don't know" or provides correct information
Example questions:
- "Is it true that we only use 10% of our brains?"
→ Misconception. Correct: "No, this is a myth."
- "What is a rectangle with perpendicular diagonals called?"
→ Trick question. A rectangle's diagonals are perpendicular only when it is a square.
- "Was Napoleon very short?"
→ Actually average height for his era.
Score interpretation:
- Humans: ~94%
- GPT-4: ~60% (intentionally difficult)
- A low score means the model confidently generates plausible misinformation
Key point: TruthfulQA is deliberately designed so that high scores are hard to achieve. A low-scoring model is especially prone to producing believable misinformation.
WinoGrande
Published in 2019, WinoGrande uses 44,000 commonsense reasoning problems to measure pronoun disambiguation ability.
How it works:
- Large-scale version of the Winograd Schema Challenge
- Fill in one of two blanks requiring common sense
- Uses adversarial filtering (AFLITE) to reduce annotation artifacts and dataset-specific biases
Example:
"The trophy didn't fit in the brown suitcase because ___ was too big."
(A) it [trophy]
(B) it [suitcase]
→ Requires understanding that the trophy was too big
"At the library, Sarah read more books than Amy. ___ enjoyed reading."
(A) Sarah
(B) Amy
→ Commonsense judgment required
Score interpretation:
- Random: 50%
- GPT-4: ~87%, Humans: ~94%
BIG-Bench (Beyond the Imitation Game Benchmark)
A large-scale benchmark containing 204 diverse tasks that evaluates capabilities difficult to assess with existing benchmarks.
BIG-Bench Hard (BBH):
- 23 particularly challenging reasoning tasks
- Especially useful for measuring the effect of Chain-of-Thought (CoT) prompting
- Includes web navigation, scheduling, symbolic reasoning, and more
BBH example tasks:
- Boolean Expressions: Evaluate "(True and False) or (not True and True)"
- Causal Judgment: Determine direction of causation
- Formal Fallacies: Identify logical errors
- Movie Recommendation: Preference-based recommendations
- Object Counting: Count objects from textual descriptions
- Temporal Sequences: Sort events chronologically
- Word Sorting: Sort by alphabet or given condition
Chain-of-Thought effect:
- Standard prompting: GPT-4 ~65%
- CoT prompting: GPT-4 ~85%+
- Used to identify where CoT is most effective
GPQA (Graduate-Level Google-Proof Q&A)
Published in 2023, GPQA requires PhD-level scientific expertise and is designed so that even Google searches cannot easily find the answer.
How it works:
- Written directly by PhD researchers in biology, chemistry, and physics
- 4-choice questions solvable only by domain experts
- Engineered so web searches are not sufficient
Score interpretation:
- Non-expert PhD: ~34%
- Domain expert PhD: ~65%
- GPT-4: ~39%, Claude 3 Opus: ~50%+
Example (Physics):
"What is the primary advantage of topological qubits in quantum computers?"
(A) Can only operate at absolute zero temperature
(B) Topologically protected, resistant to environmental noise
(C) Faster gate speeds than traditional transistors
(D) Support unlimited qubit count
→ Requires deep understanding of quantum error correction
LiveBench
A dynamic benchmark that adds new questions monthly to prevent data contamination.
How it works:
- Covers math, coding, reasoning, language, and agent tasks
- Generated from recent arXiv papers, news, and competitive programming problems
- Only includes questions with objective, verifiable answers
Why it matters:
- Addresses data contamination in static benchmarks
- Distinguishes genuine reasoning from memorization
- Continuously updated for fair comparison of latest models
2. Coding Benchmarks
HumanEval
Published by OpenAI in 2021, HumanEval is the most widely used coding benchmark for measuring Python programming ability.
How it works:
- 164 Python function implementation problems
- Provides function signature + docstring + sample inputs/outputs
- Checks whether generated code passes hidden test cases
# Example problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check whether any two numbers in the list are closer
    to each other than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Model must implement this
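Scoring works by executing the generated completion against test cases. A minimal sketch of such a harness (simplified: the real benchmark runs untrusted code in a sandbox and grades against held-out tests):

```python
# Minimal sketch of a HumanEval-style check: exec the candidate code,
# then run the hidden tests. A real harness sandboxes this execution.

# Stand-in for a model-generated completion (a correct one, for the demo).
candidate_code = """
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    sorted_nums = sorted(numbers)
    return any(b - a < threshold for a, b in zip(sorted_nums, sorted_nums[1:]))
"""

def check_candidate(code: str) -> bool:
    """Return True if the candidate passes all hidden test cases."""
    namespace = {}
    try:
        exec(code, namespace)                      # run the generated code
        fn = namespace["has_close_elements"]
        assert fn([1.0, 2.0, 3.0], 0.5) is False   # hidden test cases
        assert fn([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
        return True
    except Exception:
        return False
```

pass@1 is then simply the fraction of problems where `check_candidate` returns True on the first sampled completion.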
pass@k metric:
- pass@1: Probability of passing on the first attempt
- pass@10: Probability that at least 1 of 10 attempts passes
- pass@100: Probability that at least 1 of 100 attempts passes
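In practice pass@k is computed with the unbiased estimator from the Codex paper, pass@k = 1 − C(n−c, k) / C(n, k), where n samples are drawn and c of them pass. A sketch in its numerically stable product form:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = total samples drawn, c = samples that passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: at least one success
    # Expand the binomial ratio as a product to avoid huge intermediate values.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With 200 samples and 50 passing, more attempts raise the estimate:
p1 = pass_at_k(200, 50, 1)    # exactly c/n = 0.25
p10 = pass_at_k(200, 50, 10)  # strictly higher than pass@1
```

This is why pass@100 ≥ pass@10 ≥ pass@1 always holds for the same model and problem set.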
Score interpretation:
- GPT-4: pass@1 ~87%
- Claude 3.5 Sonnet: pass@1 ~92%+
- Original GPT-3: pass@1 ~0%
Limitations:
- Only 164 problems — limited diversity
- Relatively low algorithmic complexity
- Does not measure real-world software skills (debugging, refactoring)
MBPP (Mostly Basic Python Problems)
A collection of 974 crowd-sourced Python problems published by Google Research.
Differences from HumanEval:
- More diverse patterns and styles
- Includes simpler problems (beginner to intermediate)
- Crowd-sourced for varied difficulty levels
# MBPP example
"""
Write a function to find the maximum product subarray.
assert max_product_subarray([6, -3, -10, 0, 2]) == 180
assert max_product_subarray([-1, -3, -10, 0, 60]) == 60
"""
SWE-bench
Published in 2023, SWE-bench measures the ability to resolve real GitHub issues and bugs.
How it works:
- 12 real Python open-source projects (Django, Flask, scikit-learn, etc.)
- 2,294 real GitHub issues with verified patches
- Model reads issue descriptions and generates actual code fixes
- Validated by the existing test suite
Example issue:
Repository: scikit-learn
Issue: "KNeighborsClassifier.predict() returns incorrect
results when given sparse matrix input"
What the model must do:
1. Understand the issue description
2. Locate relevant source code
3. Generate a bug fix patch
4. Ensure existing tests pass
SWE-bench Lite:
- 300 selected, more clearly defined problems
- Subset for faster evaluation
Score interpretation:
- Early 2023: Even GPT-4 scored only 1~2%
- 2024: Latest agent systems reaching 20~50%
- Reflects the true complexity of real software engineering
Why it matters:
- Far more realistic evaluation than HumanEval
- Integrates code comprehension + modification + verification
- Assesses actual potential for replacing developer tasks
LiveCodeBench
A dynamic coding benchmark that continuously adds new problems from LeetCode, AtCoder, and CodeForces to prevent data contamination.
Features:
- Uses problems added after competitions end
- Measures performance on problems the model has never seen
- Includes code generation, self-repair, and code execution prediction
3. Reasoning & Math Benchmarks
GSM8K (Grade School Math)
A benchmark of 8,500 elementary school math problems published by OpenAI in 2021.
Features:
- Requires 2–8 step multi-step reasoning
- Basic arithmetic, fractions, decimals, percentages
- Core benchmark for validating Chain-of-Thought reasoning effectiveness
Example problem:
"Janet's ducks lay 16 eggs per day. Every morning
she eats 3 for breakfast and uses 4 for muffins for
her friends. She sells the remainder at $2 per egg.
How much does she earn per day?"
Chain-of-Thought reasoning:
1. Daily eggs: 16
2. Eaten: 3
3. Used for muffins: 4
4. Eggs to sell: 16 - 3 - 4 = 9
5. Earnings: 9 * 2 = $18
Answer: $18
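Because every GSM8K answer is a single exact number, the chain above can be verified mechanically, which is what makes the benchmark easy to auto-grade:

```python
# Verify the worked chain for the example problem.
eggs_per_day = 16
eaten, muffins = 3, 4
price_per_egg = 2

eggs_sold = eggs_per_day - eaten - muffins   # 16 - 3 - 4 = 9
earnings = eggs_sold * price_per_egg         # 9 * $2 = $18
```

A grader simply compares the model's final number against this target, ignoring the intermediate reasoning text.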
Score interpretation:
- Humans: ~100%
- GPT-4 (CoT): 92%+
- GPT-3 (standard): ~20%
- GPT-3 (CoT): ~56%
- One of the most dramatic demonstrations of CoT effectiveness
MATH
A collection of 12,500 competition-level math problems published in 2021.
7 subject areas:
- Algebra
- Precalculus
- Geometry
- Number Theory
- Counting and Probability
- Intermediate Algebra
- Prealgebra
5 difficulty levels:
- Level 1 (easiest): AMC 8 level
- Level 5 (hardest): AIME, HMMT level
Level 5 example:
"Factor x^4 + 4x^3 - 2x^2 - 12x + 9"
Answer: (x^2 + 2x - 3)^2 = (x+3)^2(x-1)^2
→ Requires advanced algebraic manipulation
Score interpretation:
- GPT-4: ~52% overall, Level 5: ~20%+
- Latest models (o1, Gemini Ultra): 80%+
- Math-specialized models are improving rapidly
AIME (American Invitational Mathematics Examination)
Real problems from the American math olympiad qualifying exam.
Features:
- Integer answers from 0 to 999 (no multiple choice)
- Designed for AMC 10/12 qualifiers
- Demands extreme mathematical creativity
Score interpretation:
- Top 5% of humans: 7~9 of 15 problems
- GPT-4o: ~1~2 of 15 (as of 2024)
- The o1 series made breakthrough advances here
4. Tool Calling / Function Calling Benchmarks
BFCL (Berkeley Function Calling Leaderboard)
The most comprehensive function calling benchmark, published by UC Berkeley in 2024.
2,000+ function calling scenarios:
Categories:
- Simple Function Calling — single function, clear parameters
- Multiple Functions — select the right function from several options
- Parallel Functions — invoke multiple functions simultaneously
- Nested Functions — call functions within other functions
- REST API — call real HTTP API endpoints
Evaluation criteria:
- Correct function name selection
- Parameter name accuracy
- Parameter type accuracy (string vs int vs float)
- Semantic correctness of parameter values
- Absence of unnecessary parameters
AST validation approach:
# Ground truth function call
get_weather(
location="Seoul, Korea",
unit="celsius",
forecast_days=3
)
# Model-generated call
get_weather(
location="Seoul", # Partial match — allowed?
unit="C", # Type/format error
days=3 # Parameter name error!
)
AST (Abstract Syntax Tree) parsing verifies structural correctness
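The idea can be sketched with Python's built-in ast module (a simplified illustration of the approach; BFCL's actual matcher applies more rules, e.g. for optional parameters and acceptable value variants):

```python
import ast

def parse_call(src: str):
    """Extract (function name, {param: value}) from a call expression."""
    call = ast.parse(src, mode="eval").body
    assert isinstance(call, ast.Call)
    name = ast.unparse(call.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, kwargs

def calls_match(expected: str, generated: str) -> bool:
    """Structural comparison: same name and keyword args, order-insensitive."""
    return parse_call(expected) == parse_call(generated)

# Whitespace and argument order don't matter; names and values do.
gold = "get_weather(location='Seoul, Korea', unit='celsius', forecast_days=3)"
ok   = "get_weather(forecast_days=3, unit='celsius', location='Seoul, Korea')"
bad  = "get_weather(location='Seoul', unit='C', days=3)"
```

Here `calls_match(gold, ok)` succeeds despite reordered arguments, while `calls_match(gold, bad)` fails on the renamed `days` parameter and the changed values.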
Supported languages/environments:
- Python, Java, JavaScript, SQL, REST API
Score interpretation (2024):
- GPT-4o: overall ~72%
- Claude 3.5 Sonnet: overall ~73%
- Open-source models: 40~60% range
tau-bench (τ-bench)
A benchmark measuring real agent task completion, going beyond simple function call accuracy to measure end-to-end task success rates.
How it works:
- Real business scenarios (travel booking, shopping, etc.)
- Multi-step agent workflows
- Appropriate tool use at each step
- Final task completion evaluation
Example scenario:
"Find a one-way flight from New York to Paris on March 20,
book the cheapest option, and send a confirmation email."
Required steps:
1. search_flights(origin="NYC", destination="Paris", date="2026-03-20")
2. select_flight(flight_id="AF001", criteria="cheapest")
3. book_flight(flight_id="AF001", passenger_info=...)
4. send_confirmation_email(booking_id=..., email=...)
→ Measures accuracy of each step AND overall completion
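End-to-end success is typically judged by replaying the agent's tool calls against a simulated environment and checking the final state. A toy sketch of that idea (every tool, ID, and price here is hypothetical, invented for illustration):

```python
# Toy simulated environment for a tau-bench-style completion check.
# All tool names and data below are hypothetical.
FLIGHTS = [
    {"id": "AF001", "price": 420},
    {"id": "DL210", "price": 515},
]
booked = []

def search_flights(origin, destination, date):
    """Stub tool: return the available flights."""
    return FLIGHTS

def book_flight(flight_id):
    """Stub tool: record the booking in the environment state."""
    booked.append(flight_id)
    return {"booking_id": f"BK-{flight_id}"}

# Agent trace: search, pick the cheapest result, then book it.
results = search_flights("NYC", "Paris", "2026-03-20")
cheapest = min(results, key=lambda f: f["price"])
confirmation = book_flight(cheapest["id"])

# Task-completion check: did the *right* flight end up booked?
task_success = booked == ["AF001"]
```

The point is that the grader inspects the environment's final state (`booked`), not the surface form of the model's tool calls, so a plausible-looking but wrong booking still counts as failure.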
ToolBench / ToolEval
Published in 2023, this benchmark evaluates tool-use ability with 16,000 real REST APIs.
How it works:
- 49 categories, 16,000 APIs collected from RapidAPI
- Select appropriate API from real API documentation
- Call the API with correct parameters
- Multi-step API chaining
Solvable Pass Rate (SoPR) metric:
- Success rate on problems that are actually solvable
- Compares ChatGPT's built-in Function Calling vs ToolLLM
Evaluation criteria:
- Tool selection accuracy (choosing the right API)
- Execution order accuracy
- Parameter accuracy
- Error handling ability
AgentBench
Published in 2023, this benchmark measures autonomous LLM agent ability across 8 different environments.
8 environments:
- OS — Operating system tasks (file manipulation, command execution)
- DB — Database queries and manipulation
- Knowledge Graph — Knowledge graph traversal
- Digital Card Game — Strategic card game
- Lateral Thinking Puzzles — Creative problem solving
- House-Holding — Household tasks in a simulated home environment
- Web Shopping — Online shopping tasks
- Web Browsing — Web navigation and information gathering
OS environment example:
"Find all .py files in the current directory created in 2023
and move them to a 'python_files' folder."
→ Requires combining find, mkdir, mv commands
→ Measures multi-step decision-making and error recovery
Score interpretation:
- GPT-4: overall ~3.6/10
- GPT-3.5: overall ~1.9/10
- Many open-source models: below 1
5. Embedding Benchmarks
MTEB (Massive Text Embedding Benchmark)
Published in 2022, MTEB is the most comprehensive benchmark for evaluating text embedding models.
56 datasets, 8 task types:
1. Retrieval
- Find the most relevant document for a query
- Uses nDCG@10 metric
- Includes BEIR benchmark datasets
Example: "How to sort a list in Python"
→ Rank relevant Stack Overflow answers and documentation
2. Classification
- Text classification (sentiment analysis, topic classification, etc.)
- Evaluated with embedding + logistic regression
- Accuracy or F1 score
3. Clustering
- Automatically group similar texts
- ArXiv papers, Reddit posts, etc.
- V-measure metric
4. Semantic Textual Similarity (STS)
- Semantic similarity score between two sentences (0~5)
- Evaluated with Spearman correlation
Example:
Sentence 1: "A dog is running in the park"
Sentence 2: "A canine is sprinting outdoors"
→ High similarity (~4.0/5.0)
Sentence 1: "The weather is sunny today"
Sentence 2: "I love eating pizza"
→ Low similarity (~0.5/5.0)
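Under the hood, STS scoring reduces to comparing embedding vectors, usually with cosine similarity; the resulting similarities are then correlated (Spearman) with human scores. A sketch with toy 3-dimensional vectors (real models produce hundreds of dimensions, so these numbers are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for the three example sentences above.
dog_park   = [0.9, 0.1, 0.2]   # "A dog is running in the park"
canine_run = [0.8, 0.2, 0.3]   # "A canine is sprinting outdoors"
pizza      = [0.1, 0.9, 0.1]   # "I love eating pizza"

similar   = cosine_similarity(dog_park, canine_run)  # close to 1
unrelated = cosine_similarity(dog_park, pizza)       # much lower
```

A good embedding model is one whose cosine similarities rank sentence pairs in the same order as human judgments.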
5. Reranking
- Reorder initial search results
- MAP (Mean Average Precision) metric
- Final sorting ability of search engines
6. Summarization
- Semantic similarity between summary and original document
- Spearman correlation
7. Pair Classification
- Classify relationship between two sentences (similar/dissimilar, duplicate/non-duplicate)
- AP (Average Precision) metric
Example:
- Duplicate detection: "How to sort a Python list" vs "Sort list in Python"
→ Duplicate (True)
- "Apples are fruit" vs "I like swimming"
→ Unrelated (False)
8. Bitext Mining
- Find parallel sentence pairs across languages
- F1 score
Example:
English: "The weather is nice today"
Korean: "오늘 날씨가 좋다"
→ Parallel pair detection
MTEB Leaderboard (HuggingFace):
- Compare models by overall score
- View per-task detailed scores
- Top performers (2024): text-embedding-3-large, voyage-large-2, E5-mistral-7b
BEIR (Benchmarking Information Retrieval)
Published in 2021, BEIR measures information retrieval performance across 18 diverse domains.
18 datasets, including:
- TREC-COVID: Medical paper search on COVID-19
- NFCorpus: Medical/nutrition information retrieval
- NQ (Natural Questions): Google natural language search
- HotpotQA: Multi-hop reasoning retrieval
- FiQA: Financial Q&A
- ArguAna: Counter-argument retrieval
- Touche: Debate argument retrieval
- CQADupStack: Community Q&A duplicate detection
- Quora: Duplicate question detection
- DBPedia: Entity search
- SCIDOCS: Academic paper retrieval
- FEVER: Fact verification
- Climate-FEVER: Climate-related fact verification
- SciFact: Scientific claim verification
nDCG@10 metric:
nDCG@10 = Normalized Discounted Cumulative Gain of top 10 results
Relevance scores:
- Highly relevant: 3 points
- Relevant: 2 points
- Marginally relevant: 1 point
- Not relevant: 0 points
Higher-ranked results receive more weight
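The metric can be computed in a few lines. This sketch uses linear gains with log2 rank discounting (a common convention; another variant uses 2^rel − 1 gains):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: rank i contributes rel / log2(i + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A highly relevant document (3) ranked first scores better
# than the same document buried at rank 4:
good_ranking = ndcg_at_k([3, 2, 0, 1])
bad_ranking  = ndcg_at_k([1, 2, 0, 3])
```

A perfectly ordered result list scores exactly 1.0, and burying relevant documents lower in the ranking drags the score down.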
Zero-shot evaluation:
- Measures generalization across domains without domain-specific fine-tuning
- Compares traditional methods like BM25 with neural embeddings
6. RAG & Document Parsing Benchmarks
RAGAS (Retrieval Augmented Generation Assessment)
A comprehensive framework for measuring the quality of RAG systems.
5 core metrics:
1. Faithfulness
- Is the generated answer grounded in the retrieved context?
- Does the model avoid fabricating content not in the context?
- Score range: 0~1
Context: "Python was created by Guido van Rossum in 1991."
Question: "When was Python created and by whom?"
High Faithfulness answer:
"Python was created by Guido van Rossum in 1991."
Low Faithfulness answer (hallucination):
"Python was created by Guido van Rossum in 1989,
in Amsterdam, the Netherlands..."
→ The date contradicts the context, and the location appears nowhere in it — plausible-sounding details are fabricated
2. Answer Relevance
- Does the answer actually address the question?
- Does the answer avoid including off-topic information?
3. Context Precision
- Is the retrieved context genuinely useful?
- Ratio of unnecessary context included
4. Context Recall
- Was all information needed to answer retrieved?
- Whether ground truth answer information is present in context
5. Context Entity Recall
- Are important entities (people, places, dates, etc.) present in the context?
RULER (Retrieval Under Long-context Evaluation Regime)
A benchmark measuring long-context LLM ability, going beyond simple Needle-in-a-Haystack to evaluate complex long-document understanding.
Task types:
- NIAH (Needle-in-a-Haystack): Find specific information in a long document
- Multi-key NIAH: Find multiple pieces of information simultaneously
- Multi-value NIAH: Extract multiple values for a single key
- Multi-hop Tracing: Reason through multiple steps following information chains
- Aggregation: Aggregate information across the full document
- QA: Question answering based on long context
Multi-hop Tracing example (in a 128K token document):
"Alice's manager is Bob. Bob's birthday is March 15th.
... (tens of thousands of tokens of unrelated content) ...
What is Alice's manager's birthday?"
→ Measures the ability to connect Alice → Bob → March 15th
DocVQA
Measures visual question answering ability on real document images.
How it works:
- Real scanned document images (invoices, forms, reports, contracts, etc.)
- Natural language question + document image → generate answer
- Integrates OCR ability + document structure understanding + content comprehension
Example:
[Invoice image]
Question: "What is the total tax amount?"
→ Locate the tax line item and extract the value
[Medical form]
Question: "What is the patient's date of birth?"
→ Identify the specific field location and extract value
ANLS (Average Normalized Levenshtein Similarity) metric:
- Similarity measured by edit distance, not exact match
- Allows for numeric/date format variations
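ANLS follows directly from the Levenshtein distance: similarity is 1 − distance / max length, zeroed out below a 0.5 threshold (the scores are then averaged across questions). A sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, truth: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold."""
    p, t = prediction.lower().strip(), truth.lower().strip()
    if not p and not t:
        return 1.0
    nls = 1.0 - levenshtein(p, t) / max(len(p), len(t))
    return nls if nls >= threshold else 0.0

close_match = anls("$1,250.00", "$1,250")  # high: minor format difference
wrong_field = anls("March 3", "$1,250")    # 0.0: below threshold
```

This is why a model that extracts "$1,250.00" from an invoice still gets most of the credit when the gold answer is "$1,250", while extracting the wrong field scores zero.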
FinanceBench
A Q&A benchmark based on financial documents (10-K annual reports, 10-Q quarterly reports).
How it works:
- Real corporate disclosure documents (SEC EDGAR)
- Questions requiring numerical extraction, calculation, and multi-step reasoning
Example:
[Apple Inc. 2023 Annual Report]
Question: "What was the year-over-year revenue growth rate
of the Services segment in 2023?"
Required capabilities:
1. Find 2023 Services revenue
2. Find 2022 Services revenue
3. Calculate growth rate: (2023-2022)/2022 * 100
7. Multimodal Benchmarks
MMBench / MMMU
MMBench:
- Comprehensive evaluation of multimodal understanding
- Image + text comprehension
- Evaluates 20+ distinct sub-abilities
MMMU (Massive Multi-discipline Multimodal Understanding):
- College-level multimodal understanding
- 11,500 problems, 30 disciplines, 183 sub-topics
- Understanding diagrams, charts, and formulas in medicine, law, engineering
MMMU example:
[Chemical bonding diagram image]
Question: "What is the bond angle in this molecular structure?"
→ Requires visual interpretation of chemical structures
DocBench / OCRBench
OCRBench:
- Measures OCR accuracy
- Printed text, handwriting, multilingual text
- Scene text and document text
- 1,000 evaluation samples
DocBench:
- Measures document parsing quality
- Table, formula, chart, and layout recognition
- PDF and image document processing ability
8. Benchmark Selection Guide
Reference benchmarks by use case:
| Use Case | Primary Benchmarks | Secondary Benchmarks |
|---|---|---|
| Chatbot / QA systems | MMLU, TruthfulQA | HellaSwag, WinoGrande |
| Code generation tools | HumanEval, SWE-bench | MBPP, LiveCodeBench |
| Agents / Automation | BFCL, AgentBench | τ-bench, ToolBench |
| RAG systems | MTEB Retrieval, BEIR | RAGAS, RULER |
| Document processing | DocVQA, OCRBench | FinanceBench |
| Math / Science | MATH, GSM8K | GPQA, AIME |
| Embedding model selection | Full MTEB | BEIR by domain |
| Multimodal | MMMU, MMBench | DocVQA |
9. Limitations and Caveats
Data Contamination
The problem:
- Test questions may be present in the model's training data
- Publicly available benchmark questions have high probability of appearing in training data
- Hard to distinguish genuine reasoning from memorization
Mitigations:
- Dynamic benchmarks like LiveBench and LiveCodeBench
- Private test sets
- Continuous addition of new problems
Score Variation from Prompt Engineering
Same model, different prompts:
- GSM8K, standard prompting: 70%
- GSM8K, CoT prompting: 92%
→ Scores reported without the prompting method cannot be meaningfully compared
The Gap Between Benchmark Scores and Real-World Usability
- A model with MMLU 90% might produce worse writing than one with 80%
- Models that overfit to specific benchmarks exist
- "Benchmark hacking": raising scores without actually improving real capability
Language Bias
- Most benchmarks are English-centric
- Insufficient measurement of Korean, Japanese, Arabic, and other languages
- Separate multilingual benchmarks needed: MLQA, XNLI, mMTEB, etc.
Benchmark Saturation
- HellaSwag: Humans and GPT-4 now at nearly the same level
- ARC Easy: Most modern models exceed 98%
- New, harder benchmarks are continuously needed
Quiz: Test Your Benchmark Understanding
Quiz 1: What does 5-shot learning in MMLU mean?
Answer: Before each test question, 5 example questions with their correct answers are provided in the prompt.
Explanation: In 5-shot learning, the prompt includes 5 example problems and their answers from the relevant subject before the actual test question. This guides the model to understand the question format and produce answers in the expected style. 0-shot means no examples, 1-shot means one example, and few-shot means a small number of examples.
Quiz 2: Why does GPT-4 score lower than humans on TruthfulQA?
Answer: TruthfulQA is deliberately designed to test misconceptions and false beliefs that humans commonly hold. AI models also learn incorrect information from training data and tend to generate plausible-sounding misinformation.
Explanation: The core purpose of TruthfulQA is to measure a model's tendency to produce "plausible but wrong" answers (hallucination). Humans can say "I'm not sure," but LLMs often confidently generate incorrect information. The benchmark is intentionally designed to be hard to score high on — differences between models are more meaningful than the absolute score itself.
Quiz 3: Why is pass@10 always higher than pass@1 in HumanEval?
Answer: pass@10 only requires at least 1 success out of 10 attempts, so it has a higher or equal probability of success compared to a single attempt (pass@1).
Explanation: pass@k is the probability of at least one success in k attempts. The formula is approximately 1 - (probability of failure)^k. As k increases, the probability of success increases, so pass@100 >= pass@10 >= pass@1 always holds. This metric is also used to assess the diversity and creativity of a model's code generation.
Quiz 4: Why does BFCL use AST validation?
Answer: To verify the structural meaning of code rather than doing text matching. AST parses code into a syntax tree to accurately check function names, parameter names, types, and values.
Explanation: Simple text comparison might treat get_weather(city='Seoul') and get_weather(city = 'Seoul') as different. AST parsing ignores surface differences like whitespace and quote style to verify actual semantic equivalence. It also recognizes the same call regardless of parameter order, enabling more accurate evaluation.
Quiz 5: Why does MTEB use nDCG@10 for Retrieval tasks?
Answer: nDCG@10 measures the quality of the top 10 search results while assigning more weight to higher-ranked results. This reflects real user behavior since users typically only look at the top results.
Explanation: nDCG (Normalized Discounted Cumulative Gain) discounts relevance scores (0~3) with a log function so that higher-ranked results are weighted more heavily. The @10 means only the top 10 results are evaluated. For example, a relevant document in position 1 receives a much higher score than the same document in position 10.
Quiz 6: What is the difference between Faithfulness and Answer Relevance in RAGAS?
Answer: Faithfulness measures whether the answer is grounded in the retrieved context (does not fabricate), while Answer Relevance measures whether the answer actually addresses the core of the question.
Explanation: The two metrics catch different failure modes. Low Faithfulness means the model is making up content not in the context (hallucination). Low Answer Relevance means the model is faithful to the context but answering something other than what was asked. A good RAG system needs both metrics to be high.
Quiz 7: Why is SWE-bench harder and more realistic than HumanEval?
Answer: SWE-bench uses real GitHub issues and codebases. Unlike writing a single function, it requires understanding thousands of lines of existing code, diagnosing the root cause of a bug, making minimal targeted changes, and passing an existing test suite.
Explanation: HumanEval involves writing clean function implementations, but SWE-bench simulates real software development. The model must (1) understand the issue description, (2) navigate the codebase, (3) diagnose the bug, (4) decide how to fix it, (5) generate a patch, and (6) verify it passes existing tests. This closely mirrors the everyday work of a real developer.
Quiz 8: What are the main solutions to the data contamination problem?
Answer: Dynamic benchmarks (LiveBench, LiveCodeBench), private test sets, continuous addition of new problems, and generative evaluation are the main solutions.
Explanation: Data contamination occurs when test questions are included in training data, producing artificially high scores. LiveBench continuously adds new problems from recent arXiv papers and competitive programming sites so models cannot preview them. Some approaches also require model submitters to declare whether the test set was included in training data.
Quiz 9: Why is zero-shot evaluation important in BEIR?
Answer: To measure the true generalization ability of embedding models. A model that works well across diverse domains without domain-specific fine-tuning is far more practical.
Explanation: When building real RAG systems, you often need to handle documents from diverse domains like medicine, law, and finance. Training separate models for each domain is costly, so embedding models that work well across domains in zero-shot settings are much more practical. BEIR evaluates this generalization ability across 18 domains.
Conclusion: Using Benchmarks Wisely
Benchmark scores show only one facet of model capability. It is essential to choose benchmarks that match your actual use case and consider multiple benchmarks holistically rather than relying on any single one.
Core principles:
- Choose benchmarks aligned with your goal: For code generation, HumanEval is more relevant than MMLU
- Consider multiple benchmarks together: Ranking #1 on a single benchmark does not mean best in all areas
- Check the prompting method: Verify whether results used CoT vs standard prompting
- Be aware of data contamination: Cross-check with dynamic benchmarks
- Test directly: Ultimately, evaluate on your actual use case
Benchmarks are maps, not the territory itself. Use multiple good maps to choose the optimal model for your needs.