
  <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
      <title>Chaos and Order</title>
      <link>https://www.youngju.dev/blog</link>
      <description>Slowly and correctly. AI Researcher &amp; DevOps Engineer Youngju&#39;s tech blog. GPU/CUDA, LLM, MLOps, Kubernetes AI workloads, distributed training, and data engineering.</description>
      <language>ko</language>
      <managingEditor>fjvbn2003@gmail.com (Youngju Kim)</managingEditor>
      <webMaster>fjvbn2003@gmail.com (Youngju Kim)</webMaster>
      <lastBuildDate>Sat, 16 May 2026 00:00:00 GMT</lastBuildDate>
      <atom:link href="https://www.youngju.dev/tags/codebench/feed.xml" rel="self" type="application/rss+xml"/>
      
  <item>
    <guid>https://www.youngju.dev/blog/culture/2026-05-16-ai-agent-llm-benchmarks-2026-swe-bench-verified-arc-agi-2-gaia-mmlu-pro-gpqa-livecodebench-chatbot-arena-deep-dive.en</guid>
    <title>AI Agent &amp; LLM Benchmarks 2026 — SWE-bench Verified / ARC-AGI 2 / GAIA / MMLU-Pro / GPQA / LiveCodeBench / Chatbot Arena Deep Dive</title>
    <link>https://www.youngju.dev/blog/culture/2026-05-16-ai-agent-llm-benchmarks-2026-swe-bench-verified-arc-agi-2-gaia-mmlu-pro-gpqa-livecodebench-chatbot-arena-deep-dive.en</link>
    <description>A single-page map of the 30+ AI benchmarks that matter in 2026. From SWE-bench / SWE-bench Verified / SWE-bench Multimodal to AgentBench, WebArena, and GAIA, ARC-AGI 2 (François Chollet&#39;s $1M prize), RE-Bench (METR), FrontierMath (Epoch AI), HumanEval / MBPP / LiveCodeBench, MMLU-Pro / GPQA Diamond, MATH / GSM8K / AIME, Chatbot Arena (LMSYS), Aider polyglot, the Open LLM Leaderboard, AlpacaEval / MT-Bench / AGIEval / MEGA-Bench, FActScore / TruthfulQA, ToolBench / AppWorld, plus Korean and Japanese local benchmarks (KMMLU / HAERAE, JMMLU / ELYZA-tasks-100). What each benchmark structurally measures, where it gets gamed (contamination, overfitting, best-of-K), and which scores actually matter when you pick a model.</description>
    <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    <author>fjvbn2003@gmail.com (Youngju Kim)</author>
    <category>ai-benchmark</category><category>llm-evaluation</category><category>swe-bench</category><category>swe-bench-verified</category><category>agentbench</category><category>webarena</category><category>gaia</category><category>arc-agi-2</category><category>francois-chollet</category><category>big-bench-hard</category><category>helm</category><category>mmlu-pro</category><category>gpqa</category><category>humaneval</category><category>mbpp</category><category>livecodebench</category><category>codebench</category><category>re-bench</category><category>metr</category><category>frontier-math</category><category>epoch-ai</category><category>math</category><category>gsm8k</category><category>aime</category><category>hellaswag</category><category>chatbot-arena</category><category>lmsys</category><category>aider-polyglot</category><category>open-llm-leaderboard</category><category>alpacaeval</category><category>mt-bench</category><category>agieval</category><category>kmmlu</category><category>jmmlu</category><category>haerae</category><category>2026</category><category>deep-dive</category><category>english</category>
  </item>

  <item>
    <guid>https://www.youngju.dev/blog/culture/2026-05-16-ai-agent-llm-benchmarks-2026-swe-bench-verified-arc-agi-2-gaia-mmlu-pro-gpqa-livecodebench-chatbot-arena-deep-dive.ja</guid>
    <title>AI Agent &amp; LLM Benchmarks 2026: SWE-bench Verified / ARC-AGI 2 / GAIA / MMLU-Pro / GPQA / LiveCodeBench / Chatbot Arena Complete Guide</title>
    <link>https://www.youngju.dev/blog/culture/2026-05-16-ai-agent-llm-benchmarks-2026-swe-bench-verified-arc-agi-2-gaia-mmlu-pro-gpqa-livecodebench-chatbot-arena-deep-dive.ja</link>
    <description>A one-page roundup of the 30+ AI benchmarks that genuinely matter as of 2026. From SWE-bench / SWE-bench Verified / SWE-bench Multimodal to AgentBench, WebArena, and GAIA, ARC-AGI 2 (François Chollet&#39;s $1M prize), RE-Bench (METR), FrontierMath (Epoch AI), HumanEval / MBPP / LiveCodeBench, MMLU-Pro / GPQA Diamond, MATH / GSM8K / AIME, Chatbot Arena (LMSYS), Aider polyglot, the Open LLM Leaderboard, AlpacaEval / MT-Bench / AGIEval / MEGA-Bench, FActScore / TruthfulQA, ToolBench / AppWorld, and the Korean and Japanese local benchmarks (KMMLU / HAERAE, JMMLU / ELYZA-tasks-100). What each benchmark measures, how it is scored, where it gets gamed (contamination, overfitting, best-of-K), and which scores are really worth looking at when you choose a model.</description>
    <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    <author>fjvbn2003@gmail.com (Youngju Kim)</author>
    <category>ai-benchmark</category><category>llm-evaluation</category><category>swe-bench</category><category>swe-bench-verified</category><category>agentbench</category><category>webarena</category><category>gaia</category><category>arc-agi-2</category><category>francois-chollet</category><category>big-bench-hard</category><category>helm</category><category>mmlu-pro</category><category>gpqa</category><category>humaneval</category><category>mbpp</category><category>livecodebench</category><category>codebench</category><category>re-bench</category><category>metr</category><category>frontier-math</category><category>epoch-ai</category><category>math</category><category>gsm8k</category><category>aime</category><category>hellaswag</category><category>chatbot-arena</category><category>lmsys</category><category>aider-polyglot</category><category>open-llm-leaderboard</category><category>alpacaeval</category><category>mt-bench</category><category>agieval</category><category>kmmlu</category><category>jmmlu</category><category>haerae</category><category>2026</category><category>deep-dive</category><category>日本語</category>
  </item>

  <item>
    <guid>https://www.youngju.dev/blog/culture/2026-05-16-ai-agent-llm-benchmarks-2026-swe-bench-verified-arc-agi-2-gaia-mmlu-pro-gpqa-livecodebench-chatbot-arena-deep-dive</guid>
    <title>AI Agent &amp; LLM Benchmarks 2026: SWE-bench Verified / ARC-AGI 2 / GAIA / MMLU-Pro / GPQA / LiveCodeBench / Chatbot Arena In-Depth Guide</title>
    <link>https://www.youngju.dev/blog/culture/2026-05-16-ai-agent-llm-benchmarks-2026-swe-bench-verified-arc-agi-2-gaia-mmlu-pro-gpqa-livecodebench-chatbot-arena-deep-dive</link>
    <description>The 30+ most meaningful AI benchmarks as of 2026, organized on a single page. From SWE-bench / SWE-bench Verified / SWE-bench Multimodal to AgentBench, WebArena, and GAIA, ARC-AGI 2 (François Chollet&#39;s $1M prize), RE-Bench (METR), FrontierMath (Epoch AI), HumanEval / MBPP / LiveCodeBench, MMLU-Pro / GPQA Diamond, MATH / GSM8K / AIME, Chatbot Arena (LMSYS), Aider polyglot, the Open LLM Leaderboard, AlpacaEval / MT-Bench / AGIEval / MEGA-Bench, FActScore / TruthfulQA, ToolBench / AppWorld, and the Korean and Japanese local benchmarks (KMMLU / HAERAE, JMMLU / ELYZA-tasks-100). Each benchmark&#39;s structure, what it measures, its limits (overfitting, contamination, gaming), and which scores to look at when choosing a model.</description>
    <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    <author>fjvbn2003@gmail.com (Youngju Kim)</author>
    <category>ai-benchmark</category><category>llm-evaluation</category><category>swe-bench</category><category>swe-bench-verified</category><category>agentbench</category><category>webarena</category><category>gaia</category><category>arc-agi-2</category><category>francois-chollet</category><category>big-bench-hard</category><category>helm</category><category>mmlu-pro</category><category>gpqa</category><category>humaneval</category><category>mbpp</category><category>livecodebench</category><category>codebench</category><category>re-bench</category><category>metr</category><category>frontier-math</category><category>epoch-ai</category><category>math</category><category>gsm8k</category><category>aime</category><category>hellaswag</category><category>chatbot-arena</category><category>lmsys</category><category>aider-polyglot</category><category>open-llm-leaderboard</category><category>alpacaeval</category><category>mt-bench</category><category>agieval</category><category>kmmlu</category><category>jmmlu</category><category>haerae</category><category>2026</category><category>deep-dive</category>
  </item>

    </channel>
  </rss>
