
AI Agent & LLM Benchmarks 2026 — SWE-bench Verified / ARC-AGI 2 / GAIA / MMLU-Pro / GPQA / LiveCodeBench / Chatbot Arena Deep Dive

Prologue — the scores on every model launch slide

Whichever model lab is on stage in 2026, the same table shows up.

  • SWE-bench Verified in the 70s
  • MMLU-Pro in the 80s
  • GPQA Diamond in the 60s
  • LiveCodeBench Hard in the 50s
  • AIME in the 80s
  • Chatbot Arena Elo above 1400

Each number has a small asterisk with caveats like "best-of-k" or "with thinking". After watching the slide, the PM on your team asks the only question that matters — "OK, but is this new model better on our codebase?"

The honest answer is "probably". Benchmark scores are indirect signals of model capability. Real performance on your domain has to be measured by you. And yet benchmarks still matter — they create a shared coordinate system. If model A beats model B by 10 points on SWE-bench, it's likely (not guaranteed) to do better on your domain too.

This post is a one-page map of the 30+ benchmarks that actually mean something in 2026. What each one measures, how it scores, and where it gets gamed. Plus which scores to look at when picking a model for your use case.


1. A map of the 2026 benchmark landscape — 4 categories

Benchmarks fall into roughly four buckets, by what they measure.

| Category | What it measures | Representative benchmarks |
| --- | --- | --- |
| Code / SWE | Can it finish a real software task | SWE-bench Verified, LiveCodeBench, Aider polyglot, HumanEval, MBPP |
| Agent / Tools | Multi-step tool use and task completion | AgentBench, WebArena, GAIA, AppWorld, ToolBench, RE-Bench |
| Reasoning / Knowledge | Academic knowledge plus reasoning | MMLU-Pro, GPQA Diamond, BIG-bench Hard, AGIEval, AIME, MATH, GSM8K, Frontier Math |
| Holistic / Qualitative | Does it feel good to humans | Chatbot Arena, AlpacaEval, MT-Bench, Open LLM Leaderboard, HELM |

Two cross-cutting axes round it out.

  • Factuality / safety: TruthfulQA, FACTSCORE
  • Locale / multilingual: KMMLU, HAERAE-bench, JMMLU, ELYZA-tasks-100

Key insight: no single benchmark evaluates a model. Each one only sees a narrow slice. That's why lab launches bundle 4-6 numbers at once. When comparing models for your own use, cross-check at least three benchmarks.

Another: benchmarks die over time. Once models approach a perfect score, the benchmark stops separating them. MMLU (2020) saturated and MMLU-Pro replaced it. HellaSwag (2019) had the same fate. SWE-bench Verified will probably die around 2027. That's why new benchmarks keep launching.


2. SWE-bench — the most important SWE benchmark

Since 2024, no benchmark has more influence on coding-agent evaluation than SWE-bench.

Background: published by the Princeton NLP group in 2023. Core idea — take real issues from real open-source projects and check whether an agent can patch them so that the real tests pass. Not synthetic problems.

Dataset structure:

  • 12 popular Python libraries (django, flask, sympy, scikit-learn, requests, etc.)
  • 2,294 (issue, PR, test) triples
  • Each task = issue description + snapshot of that repository
  • Agent produces a patch (diff) → apply patch → run tests → pass/fail

Scoring:

  • "Resolved": the tests the merged PR was written to fix now pass, and the tests that passed before still pass
  • "Applied": the patch at least applies cleanly
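The "Resolved" check can be sketched in a few lines. This is a minimal illustration, not the official harness: `FAIL_TO_PASS` and `PASS_TO_PASS` are the names the SWE-bench dataset uses for the two test groups, while `is_resolved` and the test ids below are hypothetical.

```python
from typing import Dict, List


def is_resolved(results: Dict[str, bool],
                fail_to_pass: List[str],
                pass_to_pass: List[str]) -> bool:
    """Return True when every required test passed after the patch was applied.

    results maps a test id to whether it passed; a missing id counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test_id, False) for test_id in required)


# Hypothetical test ids, for illustration only.
FAIL_TO_PASS = ["tests/test_forms.py::test_new_validation"]
PASS_TO_PASS = ["tests/test_forms.py::test_existing_behaviour"]

good_run = {
    "tests/test_forms.py::test_new_validation": True,
    "tests/test_forms.py::test_existing_behaviour": True,
}
# The patch fixes the issue but breaks something that used to work.
regressing_run = {
    "tests/test_forms.py::test_new_validation": True,
    "tests/test_forms.py::test_existing_behaviour": False,
}

print(is_resolved(good_run, FAIL_TO_PASS, PASS_TO_PASS))        # True
print(is_resolved(regressing_run, FAIL_TO_PASS, PASS_TO_PASS))  # False
```

Requiring both groups is what keeps a patch from scoring by breaking existing behaviour.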

When the benchmark launched in late 2023, the best model resolved about 2% of the full SWE-bench. By late 2025 the leaders hit 50-70%. As of 2026 the top agents sit in the high 70s.

Why does it matter? SWE-bench isn't just a coding puzzle: it captures the whole SWE workflow. Read an issue, explore the repo, find relevant files, modify them, verify tests pass. That's coding + agent + tool use all at once.

Limitations:

  1. All Python, all open source, biased toward 12 libraries
  2. Some tasks leak the answer in the issue description
  3. Some tests are too narrow or essentially impossible
  4. Scoring infrastructure is expensive (2,294 Docker containers)

These limitations led to SWE-bench Verified.


3. SWE-bench Verified — OpenAI's curated 500 (Aug 2024)

In August 2024, OpenAI shipped a curated subset of SWE-bench. The name literally means "verified": 500 tasks that humans validated.

The process:

  1. 93 professional software developers screened a random sample of 1,699 SWE-bench tasks
  2. For each task they rated four things
    • Is the issue description clear
    • Is the test reasonable (neither too narrow nor too broad)
    • Are there hidden environment requirements beyond the unit test
    • Is the solution achievable in reasonable time
  3. Only tasks that passed all criteria made the cut → 500 tasks

The result: scores on the verified set are considered more accurate than the full SWE-bench. Starting in 2025, model lab launch slides made "SWE-bench Verified" the default.

Approximate 2026 score distribution (from launch decks):

| Model | SWE-bench Verified |
| --- | --- |
| Claude Sonnet 4.5 (with thinking) | ~70% |
| GPT-5 (verified harness) | ~65% |
| Gemini 2.5 Pro (deep think) | ~60% |
| Llama 4 405B + agent | ~45% |
| Open-source 7B + harness | ~15% |

Caveat: scores depend heavily on the harness. The same model can vary by around ±10 percentage points between OpenHands, SWE-agent, Aider, etc. So "Claude 70%" is less precise than "Claude + harness X 70%".
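Harness differences sit on top of plain sampling noise. A rough sketch of how much noise a 500-task benchmark carries, assuming independent tasks and a normal approximation (both simplifications), with hypothetical function names:

```python
import math


def score_standard_error(score: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(score * (1.0 - score) / n_tasks)


def confidence_interval_95(score: float, n_tasks: int) -> tuple:
    """Approximate 95% interval using the normal approximation."""
    half_width = 1.96 * score_standard_error(score, n_tasks)
    return (max(0.0, score - half_width), min(1.0, score + half_width))


low, high = confidence_interval_95(0.70, 500)
print(f"{low:.3f} .. {high:.3f}")  # 0.660 .. 0.740
```

On 500 tasks, a reported 70% is compatible with anything from about 66% to 74%, so gaps of a few points between two launch slides may not mean much.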

Gaming risks:

  • Some models may have SWE-bench tasks in training data
  • SWE-bench-Live (from Microsoft researchers) counters this by continuously adding tasks created after model cutoffs
  • SWE-bench Multimodal extends the benchmark in another direction

4. SWE-bench Multimodal — a new dimension

Late 2024 brought SWE-bench Multimodal. It targets JavaScript projects (front-end and visual libraries for UI components, data visualization, mapping, and similar) and adds tasks that require looking at images.

Example tasks:

  • The attached screenshot shows a misaligned button — fix the CSS
  • Match the attached UI mockup by modifying the component

Why does it matter? Front-end work is visual work. Real issues frequently come with screenshots. A text-only model can't solve these.

Dataset:

  • 17 popular JS/TS repositories
  • 617 tasks (with images)

Top models in 2026 land in the 30-40% range — well below SWE-bench Verified. Multimodal reasoning + visual perception + code in one shot is hard.


5. AgentBench / WebArena / GAIA — measuring agent capability

Code alone doesn't cover agent capability. Separate benchmarks target tool use, multi-step reasoning, and environment interaction.

AgentBench (Tsinghua, 2023)

Measures LLM agents across 8 environments. OS (shell tasks), DB (SQL), KG (knowledge graphs), DCG (digital card games), Lateral Thinking Puzzles, House Holding (simulated home), Web Shopping, Web Browsing. Tests how generally an LLM adapts across very different environments.

WebArena (CMU, 2023)

Dedicated to web-browsing agents. Agents execute natural-language tasks on self-hosted, realistic websites: a shopping site, a Reddit-style forum, a GitLab instance, and a content-management admin panel, plus utility tools such as a map. Tasks look like "Find product Y on site X, add to cart, change shipping to Z". Scoring is programmatic: does the final state or answer match the target.

The point: it measures the ability to automate human web work. The original GPT-4 baseline scored around 14%; top agents reach 40-50% in 2026.

GAIA (Meta AI, 2023)

General AI Assistant benchmark. 466 human-written tasks across 3 difficulty levels.

  • Level 1: under 5 steps, simple tools
  • Level 2: 5-10 steps
  • Level 3: very complex multi-step + multimodal

Example: "From this PDF, give the names of authors of the cited papers published after year Y who were graduates of university Z." The answer is exactly one string so scoring is trivial — but reaching it needs search, PDF parsing, computation, and logical reasoning.
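Single-string answers make scoring cheap. A minimal sketch of exact-match scoring with light normalization follows; the official GAIA scorer has its own normalization rules, so `normalize_answer` here is a simplified stand-in and the sample answers are invented.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return collapsed.rstrip(".!?").strip()


def exact_match(predicted: str, expected: str) -> bool:
    return normalize_answer(predicted) == normalize_answer(expected)


def accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


# Invented answers, for illustration only.
predictions = ["  Marie Curie. ", "42", "Lisbon"]
references = ["marie curie", "41", "lisbon"]
print(accuracy(predictions, references))  # 2 of 3 match
```

The scoring is trivial by design. All of the difficulty sits in the search, parsing, and reasoning needed to reach the string.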

In 2026 the top agents average around 60% (with full tool access). Humans average above 90%. The gap is shrinking but humans still win.


6. ARC-AGI 2 (Chollet) — the 1 million dollar prize

François Chollet introduced ARC (Abstraction and Reasoning Corpus) in 2019 — visual pattern-reasoning tasks. Grids of colored squares where you infer the transformation rule. Easy for humans, hard for models.

ARC Prize launched in 2024 with a prize pool of over 1M USD, and the harder ARC-AGI 2 followed in 2025. The grand-prize condition: an open-source solution that reaches 85% on the private evaluation set.

Why ARC is hard:

  • Each task has a unique abstraction rule
  • Few-shot examples are the only training signal
  • Models haven't seen these patterns
  • Simple pattern matching doesn't cut it

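In late 2024 OpenAI's o1 / o3 made big jumps on ARC. A preview of o3 hit around 75% on ARC-AGI 1 in its lower-compute setting and scored higher still with far more compute, at staggering cost (tens to hundreds of dollars per task or more), so practicality is uncertain.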

In 2026:

  • ARC-AGI 1 (original): top models around 80%
  • ARC-AGI 2 (newer, harder): around 50%
  • No cost-efficient solution yet

Chollet's stance has stayed consistent: solving ARC would mean we're closer to AGI, but the way current models solve it is closer to brute force than real reasoning. That's why he keeps emphasizing efficiency, not just accuracy.


7. RE-Bench (METR) — research-engineering capability

METR (Model Evaluation and Threat Research) published RE-Bench in 2024. The core question: "How well does AI do AI research-engineer work?"

This is a self-referential question. If AI is good at building AI, capability could compound. METR measures this precisely from a safety-research angle.

Example RE-Bench tasks:

  • Improve the throughput of a given PyTorch model by X%
  • Write distributed-training code that scales to N GPUs
  • Build a data-preprocessing pipeline that improves a target training metric
  • Debugging — find the bugs in an intentionally broken codebase

Scoring: relative to what a human ML engineer accomplishes in 8 hours. The unit is "hours of human work the AI matched".

2025 results:

  • Claude 3.5 Sonnet: about 2 human-hours of work, in 8 hours
  • GPT-4o: under 2 human-hours
  • Claude Sonnet 4.5 + Codex 5: 4-6 human-hours of work, in 8 hours

In 2026 the gap is closing fast. METR's "AI accelerating AI capability" indicator is starting to mean something.


8. Frontier Math (Epoch AI) — the hardest math benchmark

In November 2024 Epoch AI released Frontier Math: several hundred original problems that working mathematicians need hours to days to solve.

Properties:

  • Answers are auto-checkable (numeric or canonical form)
  • Not on the internet or in training data (all newly created)
  • Written by math PhDs and reviewed by other PhDs
  • Spans number theory, algebraic geometry, analysis, topology, etc.

When it launched, SOTA models solved under 2%, which made it arguably the hardest math benchmark available.

In December 2024 OpenAI reported around 25% for o3, making news. That result took huge time + compute (often hundreds of dollars per task).

In 2026:

  • General models (GPT-5, Claude Sonnet 4.5): 10-15%
  • "Thinking" mode + multi-agent + tools: 30-40%
  • Human math PhDs: about 50% average (8-hour budget)

One area where AI still trails human PhDs. Frontier Math is the cleanest current ruler for that gap.


9. HumanEval / MBPP / LiveCodeBench / CodeBench

Narrower slices of coding capability.

HumanEval (OpenAI, 2021)

164 Python function-completion tasks. Function signature + docstring → write body → tests pass. The oldest standard coding benchmark.
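The signature + docstring → body → tests loop can be sketched as follows. This is an illustration only: the task is invented (not taken from HumanEval), `passes_tests` is a hypothetical helper, and real harnesses run untrusted completions in a sandbox with timeouts, which this sketch does not.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Execute prompt + completion, then the tests; any exception is a fail.

    Warning: exec runs arbitrary code. Do not use this on untrusted model
    output without sandboxing and a timeout.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # defines the function
        exec(test_code, namespace)            # runs the assertions
    except Exception:
        return False
    return True


# Invented task in the HumanEval style: signature and docstring only.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"

print(passes_tests(PROMPT, GOOD_COMPLETION, TESTS))  # True
print(passes_tests(PROMPT, BAD_COMPLETION, TESTS))   # False
```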

In 2026 top models hit 95%+. Effectively saturated. Almost no separation. Still useful as a fast sanity check.

MBPP (Google, 2021)

Mostly Basic Python Problems. 974 entry-to-mid Python tasks. More variety than HumanEval, slightly harder. Same fate — top models in the 90s.

LiveCodeBench (UC Berkeley, 2024)

Continuously adds new problems from LeetCode, AtCoder, and Codeforces. Each model is scored only on problems released after its training cutoff, so the tasks are very unlikely to be in its training data.
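The cutoff idea is simple enough to apply to your own problem pool. A minimal sketch, assuming each problem carries a release date; the `Problem` type, ids, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    problem_id: str
    source: str
    released: date


def problems_after_cutoff(problems: List[Problem], cutoff: date) -> List[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > cutoff]


# Hypothetical pool, for illustration only.
PROBLEMS = [
    Problem("lc-001", "LeetCode", date(2024, 3, 1)),
    Problem("ac-002", "AtCoder", date(2025, 1, 15)),
    Problem("cf-003", "Codeforces", date(2025, 6, 30)),
]

fresh = problems_after_cutoff(PROBLEMS, cutoff=date(2024, 12, 31))
print([p.problem_id for p in fresh])  # ['ac-002', 'cf-003']
```

Because the filter depends on each model's cutoff, two models may be scored on different problem sets. Compare them on the window that is fresh for both.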

Three difficulties (Easy / Medium / Hard).

2026:

  • Easy: 95%+
  • Medium: 60-70%
  • Hard: 30-40%

LiveCodeBench Hard is where you see real separation. Algorithms + data structures + math + reasoning all at once.

CodeBench (Stanford, 2024)

Another live coding benchmark. Similar philosophy to LiveCodeBench but multilingual (Python, C++, Java, JS).


10. MMLU-Pro / GPQA Diamond — academic reasoning

MMLU (2020)

Massive Multitask Language Understanding. 57 academic subjects, around 14K questions. 4-choice. The original standard LLM knowledge benchmark.

By 2024 top models passed 90%, fully saturated. No separation left.

MMLU-Pro (TIGER Lab, 2024)

Successor to MMLU. Differences:

  • 10-choice instead of 4-choice → harder to guess
  • Selected questions that require more reasoning
  • 12,032 questions

2026:

  • Top models: 75-85%
  • Still discriminating

GPQA Diamond (NYU, 2023)

Graduate-Level Google-Proof Q&A. 198 PhD-level physics, chemistry, and biology questions (Diamond subset). "Google-Proof" means questions that don't yield to a Google search — they need real reasoning.

2026:

  • Top models: 60-70%
  • Non-experts + 30 minutes of Google: 30-40%
  • Domain PhDs: 65-80%
  • Models are approaching the average PhD level

GPQA is one of the cleanest single signals of "how close is AI to expert-level reasoning?".


11. MATH / GSM8K / AIME — math benchmarks

MATH (Hendrycks, 2021)

12,500 problems from US high-school math competitions (AMC 10, AMC 12, AIME, and similar). Closed-form answers (numbers or simple expressions).

2026 top models: 95%+. Mostly saturated.

GSM8K (OpenAI, 2021)

Grade-School Math 8K. 8,500 elementary-to-middle-school math word problems.

2026: 99%+. Fully saturated. Effectively meaningless.

AIME (American Invitational Mathematics Examination)

The US high-school qualifier for the math olympiad. 15 problems, answers are integers from 0 to 999. New problems every year → low contamination risk.

Evaluating with AIME 2024, 2025, 2026 problems is standard.

2026:

  • Top models + thinking: 80-90%
  • Generic models: 50-60%
  • Top high-school math students: 70-80%

With GSM8K and MATH saturated, AIME is one of the most discriminating general math-reasoning benchmarks left.

HellaSwag (deprecated)

Commonsense reasoning benchmark from 2019. Saturated past 95% by 2023. Rarely used now.


12. Chatbot Arena (LMSYS) — blind human ranking

LMSYS (UC Berkeley) runs the most-cited human-pairwise-comparison ranking.

How it works:

  1. A user types any prompt
  2. Two models (blind) reply
  3. The user votes on which reply is better
  4. Elo-style ratings produce a model ranking

Since 2024 this has become one of the most important rankings. Why:

  • No contamination concerns (users invent prompts on the fly)
  • Coverage of many real use cases
  • Hard for labs to game
  • Hundreds of thousands of votes

Top of the board in 2026 (approximate):

| Rank | Model | Elo |
| --- | --- | --- |
| 1 | Claude Sonnet 4.5 (thinking) | 1480 |
| 2 | GPT-5 | 1465 |
| 3 | Gemini 2.5 Pro | 1455 |
| 4 | Claude Opus 4.7 | 1450 |
| 5 | DeepSeek R3 | 1430 |
| 6 | Llama 4 405B | 1410 |

A 100-point Elo gap is roughly a 64% win rate — clearly noticeable.
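The 64% figure comes straight from the standard Elo expected-score formula. A small sketch follows; the online `update_ratings` rule with `k=32` is the classic chess-style update and is included only as an assumption for illustration, not as a description of how the leaderboard fits its ratings.

```python
def expected_win_rate(rating_gap: float) -> float:
    """Expected win probability for the higher-rated side, given the gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def update_ratings(winner: float, loser: float, k: float = 32.0) -> tuple:
    """Classic online Elo update after one decided vote (illustrative only)."""
    surprise = 1.0 - expected_win_rate(winner - loser)
    delta = k * surprise
    return winner + delta, loser - delta


print(round(expected_win_rate(100), 2))  # 0.64
print(round(expected_win_rate(0), 2))    # 0.5
# Gap between rank 1 and rank 2 in the table above: 15 points.
print(expected_win_rate(1480 - 1465))
```

A 15-point gap works out to a win rate only slightly above a coin flip, so adjacent ranks near the top of the board are much closer than the ordering suggests.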

Limitations:

  • User-preference bias — long, well-formatted markdown replies tend to win
  • Short and precise answers can lose
  • A "Style Control" option partially corrects for this

Even so, it remains the most reliable single indicator of what humans actually prefer.


13. Aider polyglot / Open LLM Leaderboard — composites

Aider polyglot benchmark

Aider (a CLI coding agent) maintains a multilingual coding benchmark. 225 tasks across 6 languages (C++, Go, Java, JavaScript, Python, Rust), based on Exercism problems.

What's notable:

  • Grades the edit format as well as the code (agent practicality, not just generation)
  • Two common edit formats: whole (rewrite the whole file) vs diff (only changed lines)
  • Edit-format correctness is reported alongside the pass rate, since a malformed diff cannot be applied

2026 top models hit 60-75% in diff mode. The most useful single benchmark for Aider-style workflows.

Open LLM Leaderboard (Hugging Face)

HF's composite ranking for open-source models. v2 (refreshed in 2024) aggregates 6 benchmarks.

  • IFEval (Instruction Following)
  • BBH (BIG-bench Hard)
  • MATH lvl 5
  • GPQA
  • MuSR (Multistep Soft Reasoning)
  • MMLU-Pro

Standard starting point for comparing open-source models. Doesn't include closed models (GPT, Claude).


14. AlpacaEval / MT-Bench / AGIEval / MEGA-Bench

Smaller, faster benchmarks.

AlpacaEval (Stanford, 2023)

GPT-4 auto-scores LLM outputs (LLM-as-judge). 805 instructions.

Problem: judge bias — especially toward longer, more detailed replies. AlpacaEval 2.0 corrects with length-controlled win rate.

Now eclipsed by Chatbot Arena.

MT-Bench (LMSYS, 2023)

80 multi-turn dialogue tasks across 8 categories (coding, math, reasoning, writing, etc.). GPT-4 scores 1-10.

Fast and cheap, popular as a quick check during model development.

AGIEval (Microsoft, 2023)

Built from human academic admission and certification exams. SAT, GRE, LSAT, China's Gaokao, lawyer-qualification and civil-service exams: directly comparable to human scores.

MEGA-Bench (2024)

500+ diverse tasks bundled into one benchmark. Text, images, video, and audio. Useful for multimodal model evaluation.


15. FACTSCORE / TruthfulQA — factuality

Whether the model makes things up plausibly (hallucinates).

TruthfulQA (Oxford, 2021)

817 questions where humans commonly hold misconceptions. Tests whether the model parrots the popular misconception or knows it's wrong.

Example: "Why do humans only use 10% of their brain?" → correct: "False premise." Wrong: "Because specific regions are activated."

FACTSCORE (UW, 2023)

Breaks a long generated text (like a biography) into atomic facts and verifies each one against sources like Wikipedia. Quantifies hallucination rate.
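The scoring idea reduces to "fraction of atomic facts that are supported". In the sketch below the hard parts (splitting text into atomic facts and checking each against a source) are stubbed out with a hand-written list and a toy lookup; in the real pipeline those steps are done by models and retrieval. All names and facts here are illustrative.

```python
from typing import Callable, List


def fact_precision(atomic_facts: List[str],
                   is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts that the knowledge source supports."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Toy knowledge source standing in for retrieval over Wikipedia.
KNOWN_FACTS = {
    "ada lovelace was born in 1815",
    "ada lovelace was a mathematician",
}


def toy_verifier(fact: str) -> bool:
    return fact.strip().lower() in KNOWN_FACTS


# Atomic facts extracted (by hand here) from a generated biography.
facts = [
    "Ada Lovelace was born in 1815",
    "Ada Lovelace was a mathematician",
    "Ada Lovelace was born in Paris",  # unsupported claim
]

score = fact_precision(facts, toy_verifier)
print(f"supported: {score:.2f}, hallucinated: {1 - score:.2f}")
```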

In 2026 top models score 70-85% — meaning 15-30% hallucination rates remain. Factuality is still an open problem.


16. ToolBench / ToolLLM / AppWorld — tools and interactive

ToolBench / ToolLLM (Tsinghua, 2023)

Evaluates an LLM picking and calling tools across 16,000 APIs. Uses real APIs collected from RapidAPI.

Each task = a natural-language request + a tool catalog → a call sequence → a final answer.

Scoring: pass rate (did it finish the task within budget) + win rate (is its solution preferred over a reference solution).

AppWorld (AI2, 2024)

One of the most realistic tool-use benchmarks. Simulations of 9 realistic apps (email, calendar, shopping, food delivery, music, etc.) where an agent operates inside them.

Example: "My mom's birthday is next Thursday. Book a restaurant, send invites to relatives, and order a cake."

Scoring:

  • Compares pre- and post-interaction state
  • Validates exact state changes
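State-based scoring can be sketched as a diff between two snapshots. This is a simplified illustration using flat dictionaries; the function names and the app state are hypothetical, and the real benchmark checks database state with its own test suite.

```python
def state_diff(before: dict, after: dict) -> dict:
    """Map each changed key to its (old, new) pair; None marks a missing key."""
    changes = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


def task_succeeded(before: dict, after: dict, expected_changes: dict) -> bool:
    """Pass only if the observed changes are exactly the expected ones."""
    return state_diff(before, after) == expected_changes


# Hypothetical app state, for illustration only.
before = {"cart": [], "calendar": ["dentist"]}
expected = {"cart": ([], ["cake"])}

clean_run = {"cart": ["cake"], "calendar": ["dentist"]}
# Same goal reached, but the agent also deleted a calendar entry.
messy_run = {"cart": ["cake"], "calendar": []}

print(task_succeeded(before, clean_run, expected))  # True
print(task_succeeded(before, messy_run, expected))  # False
```

Requiring an exact match penalizes collateral damage, which is the failure that matters most when an agent acts on real accounts.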

2026 top agents land at 35-50%. Interactive multi-app tasks are still very hard.


17. Locale benchmarks — Korea and Japan

English-only benchmarks don't tell you anything about your locale.

Korea

  • KMMLU (2024): Korean version of MMLU. 45 subjects, 35K questions. Based on Korean certifications and the college entrance exam.
  • K-MMLU 2 (2025): KMMLU successor with broader domains
  • HAERAE-bench (2023): Korean-specific reasoning. Tests culture, history, language ability.
  • KoBEST: Korean NLU benchmark

In 2026 top models score in the 80s on KMMLU. GPT-5 and Claude Sonnet 4.5 effectively match their English-level performance in Korean. Smaller open-source models drop to the 60s.

Japan

  • JMMLU (2024): Japanese MMLU
  • ELYZA-tasks-100 (2023): 100 Japanese instruction-following tasks
  • JNLI (part of JGLUE): Japanese natural-language inference
  • JCommonsenseQA: Japanese commonsense reasoning

2026 top models hit 75-85% on JMMLU. ELYZA-tasks-100 uses both human and model scoring.

Key insight: without locale benchmarks you don't know your locale performance. An English SOTA model isn't guaranteed to perform identically in Korean or Japanese. The gap grows wider for smaller models.


18. BIG-bench Hard (BBH) / HELM — the big picture

BIG-bench Hard (Google, 2022)

The original BIG-bench bundled 200+ varied tasks. BBH is the curated subset of 23 where LLMs underperformed humans — the truly hard subset.

Logic puzzles, multi-step arithmetic, Dyck languages, etc. A clean slice of reasoning capability.

In 2026 top models hit 70-85%.

HELM (Stanford CRFM, 2022 onwards)

Holistic Evaluation of Language Models. Not a single number, but a matrix of 30+ scenarios × 7 axes (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency).

  • "This model is accurate but biased"
  • "This model is robust but slow"

A framework for producing comprehensive model cards. Important from a policy / safety angle.


19. Benchmark limits — overfit, contamination, gaming

Why you can't take benchmark scores at face value.

Contamination

Benchmark tasks ending up in training data. The model "memorized" rather than "solved".

Mitigations:

  • Only use problems after a cutoff (LiveCodeBench, fresh AIME)
  • Held-out test sets remain private
  • Decontamination tools attempt to strip benchmarks from training data
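One common decontamination heuristic is word n-gram overlap between a benchmark item and a training document. A minimal sketch follows; the window size `n=8`, the function names, and the sample texts are assumptions for illustration, and real pipelines differ in tokenization and thresholds.

```python
def word_ngrams(text: str, n: int) -> set:
    """All runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also appear in the training document."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    shared = item_grams & word_ngrams(training_doc, n)
    return len(shared) / len(item_grams)


# Invented texts, for illustration only.
item = "Write a function that returns the sum of two integers a and b"
leaked_doc = "Forum post: " + item + " and then print the result."
clean_doc = "This tutorial explains how to install the package and run the tests."

print(overlap_ratio(item, leaked_doc))  # 1.0
print(overlap_ratio(item, clean_doc))   # 0.0
```

Exact n-gram matching misses paraphrased or translated copies, which is one reason these checks remain imperfect.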

Still imperfect. A lab saying "no contamination" is hard to independently verify.

Overfitting

Models or agents tuned for a specific benchmark while underperforming on others.

Example: a prompt / harness optimized for SWE-bench that fails on your actual codebase tasks.

Gaming

Exploiting benchmark weaknesses to lift scores.

  • pass@k or best-of-k inflation (one correct out of k attempts counts as correct, so the score climbs with k)
  • Format-matching weaknesses in scoring
  • Carefully chosen few-shot examples
  • Prompt-injecting LLM-as-judge scorers
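To see how much "k attempts" inflates a score, the standard unbiased pass@k estimator is useful: with n samples per task of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A short sketch, with the sample counts chosen only for illustration:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# A model that gets 2 of 10 samples right on a task.
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

The same model reads as 20% at k=1 and about 78% at k=5, which is why the asterisk on a launch slide matters.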

"Saturated" benchmarks

MMLU, HumanEval, GSM8K, MATH, HellaSwag — all saturated. No useful separation.

That's why new benchmarks keep appearing. Benchmarks age faster than models.

Cost / compute ignored

Most benchmarks report "how well" without reporting "how expensive". Yet for production, cost is key.

  • OpenAI's o3 preview hit about 75% on ARC-AGI 1, at a reported cost in the hundreds of dollars per task
  • A model hitting the same score at 1 USD per task is worth more
  • Some benchmarks (ARC-AGI 2) are starting to add cost constraints
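A simple way to fold cost into a score is cost per solved task: per-task cost divided by accuracy. The figures below are illustrative round numbers in the spirit of the comparison above, not measured prices.

```python
def cost_per_solved_task(cost_per_task_usd: float, accuracy: float) -> float:
    """Average spend needed to obtain one correct answer."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task_usd / accuracy


# Illustrative figures: same accuracy, very different per-task cost.
expensive = cost_per_solved_task(300.0, 0.75)
cheap = cost_per_solved_task(1.0, 0.75)

print(f"expensive: {expensive:.2f} USD per solved task")  # 400.00
print(f"cheap:     {cheap:.2f} USD per solved task")      # 1.33
```

Reporting this number next to accuracy makes it obvious when a headline score is bought with compute rather than capability.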

20. A team's benchmark-usage guide — wrap-up

Which scores should you actually look at when picking a model? By domain:

Building a coding agent

  1. SWE-bench Verified (overall)
  2. LiveCodeBench Hard (algorithmic skill)
  3. Aider polyglot (multilingual + diff accuracy)
  4. Your own evaluation in your domain (most important)

Building a general chatbot / assistant

  1. Chatbot Arena Elo (overall preference)
  2. MMLU-Pro / GPQA (academic reasoning)
  3. Your own evaluation of your user scenarios

Building an agent

  1. GAIA (general tool use)
  2. AppWorld / WebArena (interactive)
  3. ToolBench (tool-calling correctness)
  4. SWE-bench Verified (coding)
  5. Task-completion rate in your own environment

Math / science evaluation

  1. AIME (general math reasoning)
  2. GPQA Diamond (expert-level reasoning)
  3. Frontier Math (hardest)
  4. Skip MATH and GSM8K: both are saturated and no longer separate models

Locale (Korean / Japanese) models

  1. KMMLU / JMMLU
  2. HAERAE-bench / ELYZA-tasks-100
  3. Your own Korean / Japanese eval

Factuality matters

  1. TruthfulQA
  2. FACTSCORE
  3. Domain factuality (compare against your field's fact base)

Final — the one thing that matters most

Build your own eval set for your domain. Benchmark scores are coordinates. The only measurement that actually means anything for your team is on your real tasks. 100-200 tasks is enough to start.
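A first version of such an eval set needs very little machinery. The sketch below assumes each task has a prompt and a programmatic check, and that a model is any callable from prompt to text; `EvalTask`, `run_eval`, and the toy models are hypothetical stand-ins for your own tasks and API calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True when the model output is acceptable


def run_eval(model: Callable[[str], str], tasks: List[EvalTask]) -> Dict[str, bool]:
    """Run every task once; a crash in the model or the check counts as a fail."""
    results: Dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.task_id] = bool(task.check(model(task.prompt)))
        except Exception:
            results[task.task_id] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    return sum(results.values()) / len(results) if results else 0.0


def disagreements(a: Dict[str, bool], b: Dict[str, bool]) -> Dict[str, List[str]]:
    """List the task ids that exactly one of the two models solved."""
    shared = sorted(a.keys() & b.keys())
    return {
        "only_a": [t for t in shared if a[t] and not b[t]],
        "only_b": [t for t in shared if b[t] and not a[t]],
    }


TASKS = [
    EvalTask("sum", "What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalTask("upper", "Uppercase the word: cat", lambda out: out.strip() == "CAT"),
]


def model_a(prompt: str) -> str:  # toy stand-in for a real model call
    return "4" if "2 + 2" in prompt else "cat"


def model_b(prompt: str) -> str:  # toy stand-in for a real model call
    return "4" if "2 + 2" in prompt else "CAT"


results_a = run_eval(model_a, TASKS)
results_b = run_eval(model_b, TASKS)
print(pass_rate(results_a), pass_rate(results_b))  # 0.5 1.0
print(disagreements(results_a, results_b))
```

Looking at the tasks where two models disagree is usually more informative than the difference in pass rate, especially with only 100-200 tasks.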

Benchmarks narrow the model field. Your evaluation makes the final call. You need both.


References