The Complete BFCL Benchmark Guide 2025: Tool Calling Evaluation, Leaderboard Analysis, and Model Comparison

Introduction: Why Tool Calling Benchmarks Matter
1. Why Tool Calling Benchmarks Are Needed
2. BFCL Overview
3. BFCL Categories in Depth
4. BFCL Evaluation Metrics
5. Model Performance Comparison (2025)
6. Running BFCL Yourself
7. Other Tool Calling Benchmarks
8. Strategies for Improving Tool Calling Performance
9. The Gap Between the Real World and Benchmarks
10. Quiz
11. References

Introduction: Why Tool Calling Benchmarks Matter

The core capability of the AI agent era is **Tool Calling (Function Calling)**. However strong an LLM's reasoning is, you cannot build a practical agent unless the model can call external tools accurately.

There was a problem, though. MMLU measures general knowledge and HumanEval measures coding ability, but **benchmarks that systematically measure Tool Calling were lacking**. UC Berkeley's **BFCL (Berkeley Function Calling Leaderboard)** filled that gap.

This guide covers BFCL's structure, its evaluation metrics, model performance comparisons, how to run your own evaluation, and strategies for improving Tool Calling performance.

1. Why Tool Calling Benchmarks Are Needed

1.1 Tool Calling Is the Foundation of AI Agents

AI agent capability stack:

┌──────────────────────────────┐
│  Multi-agent collaboration   │  ← impossible without Tool Calling
├──────────────────────────────┤
│  Multi-step plan execution   │  ← a tool call at every step
├──────────────────────────────┤
│  ★ Tool Calling ★            │  ← core capability
├──────────────────────────────┤
│  Reasoning (CoT)             │  ← decides which tool to use
├──────────────────────────────┤
│  Text generation             │  ← foundational capability
└──────────────────────────────┘

Why Tool Calling matters:

Accurate parameter extraction: "the weather in Seoul tomorrow" → get_weather(location="Seoul", date="2025-03-26") (assuming today is 2025-03-25)
Correct tool selection: picking the right tool out of 10 similar ones
Avoiding unnecessary calls: deciding not to call anything when no tool is needed
Compound calls: combining several tools in the correct order

1.2 No Systematic Improvement Without a Benchmark

Improvement cycle:

  ┌────────────┐
  │ Evaluate   │
  │ on the     │
  │ benchmark  │
  └─────┬──────┘
        │
  ┌─────▼──────┐    ┌───────────────┐    ┌──────────────┐
  │ Find weak  │───►│ Apply a fix   │───►│ Re-evaluate  │
  │ spots      │    │ (prompting,   │    │ (benchmark)  │
  │            │    │  fine-tuning) │    │              │
  └────────────┘    └───────────────┘    └──────┬───────┘
                                                │
                            ┌───────────────────┘
                            ▼
                  Confirm improvement → repeat

1.3 The Gap BFCL Filled

Benchmark	What it measures	Tool Calling evaluation
MMLU	General knowledge	No
HumanEval	Coding ability	No
MT-Bench	Conversation quality	No
GSM8K	Mathematical reasoning	No
BFCL	Tool Calling	Dedicated benchmark

2. BFCL Overview

2.1 Project Background

BFCL is a benchmark dedicated to Tool Calling, created by the Gorilla project team at UC Berkeley. The Gorilla project studies how to make LLMs call APIs accurately, and it began with the 2023 paper "Gorilla: Large Language Model Connected with Massive APIs".

2.2 Key Figures

BFCL at a glance:
─────────────────────────────────────────
Test cases:          2,000+ (as of v3)
Categories:          7 major categories
Supported languages: Python, Java, JavaScript
Evaluation methods:  AST + Executable
Leaderboard:         gorilla.cs.berkeley.edu
Latest version:      BFCL v3 (as of 2025)
Update cadence:      quarterly
Models covered:      60+ (commercial + open source)
─────────────────────────────────────────

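2.3 Version History

Version	Released	Main changes
BFCL v1	Early 2024	Initial release. Basic Simple/Multiple/Parallel categories
BFCL v2	Mid-2024	Live test set added (user-contributed, real-world queries), stronger execution-based evaluation
BFCL v3	Late 2024	Multi-turn and multi-step scenarios, compound call chains, broader real-world scenarios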

3. BFCL Categories in Depth

3.1 Simple Function Calling

One function, one call. This category measures the basic ability to extract the correct parameters from natural language.

Test example:

# User input
"What is the weather in San Francisco today?"

# Available function
def get_weather(location: str, date: str = "today") -> dict:
    """Get weather information for a specific location and date."""
    pass

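# Expected output
get_weather(location="San Francisco", date="today")

What is evaluated:

Choosing the correct function
Extracting required parameters accurately
Handling optional parameters appropriately
Matching parameter types (string, int, float, boolean)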
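
The four checks above can be sketched as a small validator. This is a minimal illustration, not BFCL's own checker: the `GET_WEATHER` schema, the `JSON_TYPES` table, and the `check_call` helper are all hypothetical names introduced here, and the schema follows the JSON-Schema style used later in this guide.

```python
# Hypothetical schema and helper for illustration; BFCL's real checker is more elaborate.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}

GET_WEATHER = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["location"],
    },
}

def check_call(schema, name, args):
    """Return a list of problems with one generated call (empty list = pass)."""
    if name != schema["name"]:
        return [f"wrong function: {name}"]
    props = schema["parameters"]["properties"]
    problems = []
    # Every required parameter must be present.
    for key in schema["parameters"].get("required", []):
        if key not in args:
            problems.append(f"missing required parameter: {key}")
    # Every supplied parameter must exist in the schema and have the right type.
    for key, value in args.items():
        if key not in props:
            problems.append(f"unknown parameter: {key}")
            continue
        json_type = props[key]["type"]
        # bool is a subclass of int in Python, so it must not pass as a number.
        if isinstance(value, bool) and json_type != "boolean":
            problems.append(f"wrong type for {key}")
        elif not isinstance(value, JSON_TYPES[json_type]):
            problems.append(f"wrong type for {key}")
    return problems

print(check_call(GET_WEATHER, "get_weather",
                 {"location": "San Francisco", "date": "today"}))
print(check_call(GET_WEATHER, "get_weather", {"date": 20250325}))
```

Returning a list of problems rather than a single boolean makes it easier to see which of the four checks a model fails most often.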

A harder case:

# Input: "Find me flights from NYC to LA next Friday under $500"
# Available function:
def search_flights(
    origin: str,        # airport code or city name?
    destination: str,
    date: str,          # "next Friday" → convert to an actual date?
    max_price: float,   # "$500" → 500.0
    currency: str = "USD"
) -> list:
    pass

# Expected (assuming today is 2025-03-25):
# search_flights(origin="NYC", destination="LA",
#        date="2025-03-28", max_price=500.0, currency="USD")

3.2 Multiple Function Calling (Function Selection)

Measures the ability to choose the correct function from several similar ones.

# Available functions (similar but distinct)
def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Get CURRENT weather conditions for a location."""
    pass

def get_weather_forecast(location: str, days: int = 7) -> dict:
    """Get weather FORECAST for upcoming days."""
    pass

def get_historical_weather(location: str, date: str) -> dict:
    """Get HISTORICAL weather data for a past date."""
    pass

def check_severe_weather_alerts(region: str) -> list:
    """Check for severe weather ALERTS in a region."""
    pass

# Test 1: "What will the weather be like in Tokyo next week?"
# Answer: get_weather_forecast(location="Tokyo", days=7)

# Test 2: "Were there any storms in Florida last month?"
# Answer: get_historical_weather(location="Florida", date="2025-02")

# Test 3: "Is it raining in Seoul right now?"
# Answer: get_current_weather(location="Seoul")

3.3 Parallel Function Calling

Measures the ability to issue several independent calls at once from a single request.

# Input: "What's the weather in Seoul, Tokyo, and New York?"

# Expected: three independent parallel calls
[
    get_weather(location="Seoul"),
    get_weather(location="Tokyo"),
    get_weather(location="New York")
]

# A more complex case:
# "Send a greeting email to Alice and Bob, and also check tomorrow's calendar"
[
    send_email(to="alice@example.com", subject="Greeting", body="Hello Alice!"),
    send_email(to="bob@example.com", subject="Greeting", body="Hello Bob!"),
    get_calendar(date="2025-03-26")  # a different function, but it can run in parallel
]
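
Because parallel calls are independent, a checker should not care about the order in which the model lists them. A minimal sketch of such an order-insensitive comparison follows; it assumes each call is represented as a `(name, kwargs)` tuple with hashable argument values, and `normalize` and `parallel_match` are hypothetical helper names, not BFCL functions.

```python
from collections import Counter

def normalize(call):
    """Turn (name, kwargs) into a hashable key; keyword order is irrelevant."""
    name, kwargs = call
    return (name, tuple(sorted(kwargs.items())))

def parallel_match(predicted, expected):
    """True when both lists contain the same calls, ignoring list order.

    Counter (a multiset) is used instead of set so that a duplicated or
    dropped call is still detected.
    """
    return Counter(map(normalize, predicted)) == Counter(map(normalize, expected))

expected = [
    ("get_weather", {"location": "Seoul"}),
    ("get_weather", {"location": "Tokyo"}),
    ("get_weather", {"location": "New York"}),
]
predicted = list(reversed(expected))

print(parallel_match(predicted, expected))      # same calls, different order
print(parallel_match(predicted[:2], expected))  # one call is missing
```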

3.4 Nested/Composite Function Calling

Measures multi-step reasoning in which the result of one function is used as the input to another.

# Input: "Book a flight to the cheapest destination on the list"

# Step 1: look up destination prices
destinations = get_destination_prices(origin="Seoul")
# Result: [{"city": "Tokyo", "price": 300}, {"city": "Osaka", "price": 250}]

# Step 2: book the cheapest destination
cheapest = min(destinations, key=lambda x: x["price"])
book_flight(origin="Seoul", destination=cheapest["city"])

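3.5 Relevance Detection

One of the most important categories. It measures the ability to refrain from calling anything when none of the provided functions is relevant to the user's request.

# Scenario 1: only unrelated functions are available
# User: "What is the meaning of life?"
# Available: get_weather(), search_products(), book_flight()
# Expected: answer directly without calling any function

# Scenario 2: partially related but insufficient
# User: "How many calories are in a Big Mac?"
# Available: search_restaurants(cuisine, location)
# Expected: no function call (it searches restaurants, not calorie data)

# Scenario 3: tempting but a misuse
# User: "Tell me a programming joke"
# Available: search_web(query)
# Expected: no function call (the LLM can write a joke itself)

Why it matters:

Consequences of failing Relevance Detection:
─────────────────────────────────────
1. Unnecessary API costs
2. Worse user experience (slower responses)
3. Hallucinations built on irrelevant results
4. Security risk (unnecessary data access)
─────────────────────────────────────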

3.6 AST Evaluation

Evaluates the **structural correctness** of a generated function call using its Abstract Syntax Tree.

# Call under evaluation
generated_call = 'get_weather(location="Seoul", unit="celsius")'

# Parse into an AST
import ast
tree = ast.parse(generated_call)

# Checks:
# 1. Is the function name correct?
# 2. Are the parameter names correct?
# 3. Are the parameter types correct?
# 4. Are all required parameters present?
# 5. Are there any parameters that do not exist in the function definition?
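
To make the parsing step concrete, here is a minimal sketch that walks the tree returned by `ast.parse` and pulls out the function name and keyword arguments. It is an assumption-laden illustration rather than BFCL's parser: `ParsedCall` and `parse_function_call` are hypothetical names, and only plain keyword-argument calls with literal values are handled.

```python
import ast
from typing import NamedTuple

class ParsedCall(NamedTuple):
    function_name: str
    params: dict

def parse_function_call(source: str) -> ParsedCall:
    """Parse 'f(a=1, b="x")' into its name and a dict of literal arguments."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a plain call such as f(x=1)")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    params = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs unpacking is not handled in this sketch")
        # literal_eval accepts only literals, so no code is executed here.
        params[kw.arg] = ast.literal_eval(kw.value)
    return ParsedCall(node.func.id, params)

parsed = parse_function_call('get_weather(location="Seoul", unit="celsius")')
print(parsed.function_name, parsed.params)
```

Using `ast.literal_eval` for argument values keeps the parser free of side effects, which matters because the input is untrusted model output.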

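3.7 Executable Evaluation

Verifies correctness by **actually executing** the generated function call.

def evaluate_executable(generated_call, expected_result, functions=None):
    # Expose only an explicit table of allowed functions to eval(); a bare
    # eval() on model output could run arbitrary code. This is still not a
    # real sandbox, so run untrusted output in an isolated process.
    namespace = {"__builtins__": {}, **(functions or {})}
    try:
        actual_result = eval(generated_call, namespace)
        # compare_results is assumed to be defined elsewhere
        return compare_results(actual_result, expected_result)
    except TypeError as e:
        return {"status": "fail", "reason": f"Type error: {e}"}
    except Exception as e:
        return {"status": "fail", "reason": f"Execution error: {e}"}

Supported languages:

Python: the most comprehensive support
Java: includes static type checking
JavaScript: Web API scenarios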
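
Since `evaluate_executable` relies on `eval`, it is worth seeing an alternative that never evaluates model output as code. The sketch below parses the call with `ast`, looks the function up in an explicit registry, and invokes it with literal arguments only. The `get_weather` stub, `REGISTRY`, and `run_call` are hypothetical names for illustration and are not part of BFCL.

```python
import ast

def get_weather(location, unit="celsius"):
    """Stub standing in for a real API, so the example runs offline."""
    return {"location": location, "unit": unit, "temp": 15}

REGISTRY = {"get_weather": get_weather}

def run_call(source, registry):
    """Execute a generated call string without using eval()."""
    try:
        node = ast.parse(source.strip(), mode="eval").body
    except SyntaxError as e:
        return {"status": "fail", "reason": f"syntax error: {e.msg}"}
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        return {"status": "fail", "reason": "not a plain function call"}
    func = registry.get(node.func.id)
    if func is None:
        return {"status": "fail", "reason": f"unknown function: {node.func.id}"}
    try:
        # Only literal arguments are accepted; anything else raises ValueError.
        args = [ast.literal_eval(a) for a in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return {"status": "ok", "result": func(*args, **kwargs)}
    except (ValueError, TypeError) as e:
        return {"status": "fail", "reason": str(e)}

print(run_call('get_weather(location="Seoul")', REGISTRY))
print(run_call('__import__("os").system("ls")', REGISTRY))
```

The second call is rejected before anything runs, because its callee is an attribute access rather than a registered function name.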

4. BFCL Evaluation Metrics

4.1 Metric Hierarchy

BFCL metric structure:
─────────────────────────────────────────────────────

Overall Accuracy
├── AST Accuracy (syntactic correctness)
│   ├── Simple AST
│   ├── Multiple AST
│   ├── Parallel AST
│   └── Nested AST
├── Exec Accuracy (execution correctness)
│   ├── Simple Exec
│   ├── Multiple Exec
│   ├── Parallel Exec
│   └── Nested Exec
├── Relevance Accuracy
│   └── Rate at which unnecessary calls are declined
└── Live Test Accuracy (live tests)
    └── Accuracy against real APIs

─────────────────────────────────────────────────────

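4.2 Metric Details

Metric	Description	What it indicates
Overall Accuracy	Accuracy over all test cases	Overall indicator
AST Simple	Syntactic accuracy on simple calls	Basic ability
AST Multiple	Accuracy of choosing among multiple functions	Discrimination
AST Parallel	Accuracy on parallel calls	Efficiency
Exec Accuracy	Execution success rate	Practicality
Relevance	Rate at which unnecessary calls are declined	Safety
Latency	Response time	Usability
Cost per call	Cost of each call	Economy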

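4.3 How Accuracy Is Calculated

# AST accuracy
# (parse_function_call and type_match are helpers assumed to be defined elsewhere)
def calculate_ast_accuracy(predictions, ground_truth):
    correct = 0
    total = len(predictions)

    for pred, truth in zip(predictions, ground_truth):
        pred_ast = parse_function_call(pred)
        truth_ast = parse_function_call(truth)

        if (pred_ast.function_name == truth_ast.function_name and
            match_parameters(pred_ast.params, truth_ast.params)):
            correct += 1

    return correct / total if total > 0 else 0.0

# Parameter matching (order-insensitive, types must match)
def match_parameters(pred_params, truth_params):
    # Reject parameters absent from the ground truth (this sketch treats the
    # ground truth as listing every accepted parameter)
    if any(key not in truth_params for key in pred_params):
        return False
    for key in truth_params:
        if key not in pred_params:
            return False
        if not type_match(pred_params[key], truth_params[key]):
            return False
    return True

# Relevance accuracy: among queries labeled "irrelevant", the share on which
# the model correctly made no call (a rejection rate, i.e. recall on the
# "irrelevant" class, not a precision)
def calculate_relevance_accuracy(predictions, labels):
    rejected = sum(1 for p, l in zip(predictions, labels)
                   if p == "no_call" and l == "irrelevant")
    called = sum(1 for p, l in zip(predictions, labels)
                 if p != "no_call" and l == "irrelevant")

    total = rejected + called
    return rejected / total if total > 0 else 0.0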
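
`match_parameters` depends on a `type_match` helper that is not shown. One possible version is sketched below. Its rules are assumptions made for illustration, not BFCL's documented behavior: an `int` is accepted where the ground truth holds a `float` (so `500` matches `500.0`), and `bool` is never treated as a number even though Python makes it a subclass of `int`.

```python
def type_match(pred_value, truth_value):
    """Hypothetical helper: does the predicted value have an acceptable type?"""
    # Handle bool first, because isinstance(True, int) is True in Python.
    if isinstance(pred_value, bool) or isinstance(truth_value, bool):
        return isinstance(pred_value, bool) and isinstance(truth_value, bool)
    # Assumption: an integer literal is acceptable where a float is expected.
    if isinstance(truth_value, float):
        return isinstance(pred_value, (int, float))
    # Otherwise require exactly the same type (str, int, list, dict, ...).
    return type(pred_value) is type(truth_value)

print(type_match(500, 500.0))    # int where float is expected
print(type_match("500", 500.0))  # string where float is expected
print(type_match(True, 1))       # bool where int is expected
```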

5. Model Performance Comparison (2025)

5.1 Overall Leaderboard (as of March 2025)

Rank	Model	Overall	AST Simple	AST Multiple	AST Parallel	Relevance	Exec
1	Claude 3.5 Sonnet (v2)	92.4%	95.1%	91.2%	90.8%	94.5%	91.0%
2	GPT-4o (2025-01)	91.8%	94.8%	90.5%	91.2%	93.0%	90.2%
3	Gemini 2.0 Flash	90.1%	93.2%	89.8%	88.5%	92.0%	89.5%
4	Claude 3.5 Haiku	88.5%	92.0%	87.5%	86.2%	91.5%	87.0%
5	GPT-4 Turbo	87.2%	91.5%	86.0%	85.5%	90.0%	86.8%
6	Llama 3.1 405B	85.5%	90.0%	84.5%	83.0%	88.5%	84.0%
7	Qwen 2.5 72B	84.2%	89.0%	83.0%	82.5%	87.0%	83.5%
8	Mistral Large	83.0%	88.5%	82.0%	81.0%	86.0%	82.0%
9	Llama 3.1 70B	81.5%	87.0%	80.0%	79.5%	84.5%	80.5%
10	GPT-4o-mini	80.8%	86.5%	79.0%	78.5%	83.0%	79.5%

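5.2 Strengths and Weaknesses by Category

Claude 3.5 Sonnet

Strengths:
  + Best Relevance Detection score (94.5%)
  + High accuracy on complex parameter extraction
  + Stable on nested call chains

Weaknesses:
  - Converts some parallel calls into sequential calls
  - Selection accuracy drops when very many tools (20+) are provided

GPT-4o

Strengths:
  + Best Parallel score (91.2%)
  + Very high JSON schema compliance
  + Stable streaming tool calls

Weaknesses:
  - Lower Relevance score than Claude
  - Occasionally makes unnecessary tool calls

Gemini 2.0 Flash

Strengths:
  + Fast responses
  + Cost-efficient
  + Tool calls combined with multimodal input

Weaknesses:
  - Accuracy drops on complex nested calls
  - Parameter type errors in some edge cases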

5.3 Cost vs. Performance

Cost efficiency (accuracy / cost):
─────────────────────────────────────────
Model                     | Accuracy | Cost (/1M tok) | Efficiency
GPT-4o-mini               | 80.8%    | ~$0.30         | *****
Claude 3.5 Haiku          | 88.5%    | ~$2.40         | ****
Gemini 2.0 Flash          | 90.1%    | ~$0.40         | *****
Claude 3.5 Sonnet         | 92.4%    | ~$9.00         | ***
GPT-4o                    | 91.8%    | ~$7.50         | ***
Llama 3.1 70B (self)      | 81.5%    | ~$0.10*        | *****
─────────────────────────────────────────
* Estimate assuming self-hosting
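
The star ratings can be backed by simple arithmetic. The sketch below divides each accuracy figure from the table by its approximate price per million tokens and ranks the models by the result. The "accuracy points per dollar" ratio is an assumption made for illustration; the table does not state how its stars were derived.

```python
# (model, overall accuracy in %, approximate cost in USD per 1M tokens),
# copied from the table above.
ROWS = [
    ("GPT-4o-mini", 80.8, 0.30),
    ("Claude 3.5 Haiku", 88.5, 2.40),
    ("Gemini 2.0 Flash", 90.1, 0.40),
    ("Claude 3.5 Sonnet", 92.4, 9.00),
    ("GPT-4o", 91.8, 7.50),
    ("Llama 3.1 70B (self)", 81.5, 0.10),
]

def accuracy_per_dollar(accuracy_pct, cost_per_mtok):
    """Accuracy points bought per dollar spent on one million tokens."""
    return accuracy_pct / cost_per_mtok

ranked = sorted(ROWS, key=lambda r: accuracy_per_dollar(r[1], r[2]), reverse=True)

for model, acc, cost in ranked:
    print(f"{model:<22} {accuracy_per_dollar(acc, cost):8.1f} points per $")
```

A plain ratio rewards cheap models heavily, so in practice it is usually combined with a minimum accuracy threshold for the task at hand.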

6. Running BFCL Yourself

6.1 Installation

# Clone the BFCL repository
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard

# Install the package and its dependencies from the cloned source
pip install -e .

# Or install the published package from PyPI
pip install bfcl-eval

6.2 Running an Evaluation

# Basic evaluation from Python.
# Illustrative only: `evaluate` is a hypothetical convenience wrapper, not a
# documented BFCL API; the repository's documented entry point is its CLI.
from bfcl import evaluate

# Evaluate an OpenAI model
results = evaluate(
    model="gpt-4o",
    categories=["simple", "multiple", "parallel", "relevance"],
    api_key="your-openai-api-key"
)

print(f"Overall Accuracy: {results['overall']:.2%}")
print(f"Simple: {results['simple']:.2%}")
print(f"Multiple: {results['multiple']:.2%}")
print(f"Parallel: {results['parallel']:.2%}")
print(f"Relevance: {results['relevance']:.2%}")

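# Run from the CLI (installation provides a `bfcl` command).
# MODEL_NAME is a placeholder for one of the model IDs listed in the
# repository README; check the README for the flags of your BFCL version.

# Step 1: generate model responses
bfcl generate \
    --model MODEL_NAME \
    --test-category simple,multiple,parallel,irrelevance

# Step 2: score the generated responses
bfcl evaluate \
    --model MODEL_NAME \
    --test-category simple,multiple,parallel,irrelevance

# All categories (same two steps for Anthropic and other hosted models)
bfcl generate --model MODEL_NAME --test-category all
bfcl evaluate --model MODEL_NAME --test-category all

# Locally hosted model served through vLLM
bfcl generate \
    --model MODEL_NAME \
    --test-category all \
    --backend vllm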

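6.3 Evaluating a Custom Model

import json

# Treat BFCLEvaluator and the handler interface below as an illustrative
# sketch; check the repository's contributing guide for the handler base
# class it actually expects.
from bfcl import BFCLEvaluator

class MyModelHandler:
    """Custom model handler"""

    def __init__(self, model_path):
        self.model = load_my_model(model_path)

    def generate(self, prompt, tools, **kwargs):
        """
        Interface called by the evaluator.
        prompt: the user input
        tools: list of available tool definitions
        Returns: a function-call string, or "NO_CALL"
        """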
        formatted_prompt = self.format_prompt(prompt, tools)
        response = self.model.generate(formatted_prompt)
        return self.parse_tool_call(response)

    def format_prompt(self, prompt, tools):
        tool_descriptions = "\n".join([
            f"Function: {t['name']}\n"
            f"Description: {t['description']}\n"
            f"Parameters: {json.dumps(t['parameters'])}"
            for t in tools
        ])
        return f"""Available functions:
{tool_descriptions}

User query: {prompt}

Respond with a function call or "NO_CALL" if no function is relevant."""

# Run the evaluation
evaluator = BFCLEvaluator()
handler = MyModelHandler("/path/to/model")

results = evaluator.evaluate(
    handler=handler,
    categories=["simple", "multiple", "parallel", "relevance"],
    output_dir="./my_model_results"
)

evaluator.generate_report(results, "./report.html")

6.4 Adding Custom Test Cases

custom_test = {
    "id": "custom_001",
    "category": "simple",
    "prompt": "Send 'Meeting started' to the team's Slack channel",
    "available_functions": [
        {
            "name": "send_slack_message",
            "description": "Send a message to a Slack channel",
            "parameters": {
                "type": "object",
                "properties": {
                    "channel": {
                        "type": "string",
                        "description": "Slack channel name"
                    },
                    "message": {
                        "type": "string",
                        "description": "Message text"
                    }
                },
                "required": ["channel", "message"]
            }
        }
    ],
    "ground_truth": 'send_slack_message(channel="team", message="Meeting started")',
}

evaluator.evaluate_custom(
    handler=handler,
    test_cases=[custom_test],
    output_dir="./custom_results"
)

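7. Other Tool Calling Benchmarks

7.1 Benchmark Comparison

Benchmark	Authors	Size	Characteristics	Strength
BFCL	UC Berkeley	2,000+	Most comprehensive, live leaderboard	Industry standard
API-Bank	Li et al.	264	API call planning + execution	Multi-step evaluation
ToolBench	Qin et al.	16,000+ (APIs)	Large scale, based on RapidAPI	Scale and diversity
Nexus	Srinivasan	1,500	Released alongside the NexusRaven model	Focused on function calling
T-Eval	Chen et al.	553	Stage-by-stage evaluation (plan/select/execute)	Fine-grained analysis
Seal-Tools	Various	1,000+	Multilingual support	Internationalization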

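7.2 API-Bank

# API-Bank characteristic: three evaluation levels
# Level 1: API calling (single call)
# Level 2: API retrieval + calling (finding the right API)
# Level 3: API composition + planning (multi-step)

# Example (Level 3):
# "Check whether I have a meeting tomorrow morning and, if so, notify the attendees"
# -> Step 1: check_calendar(date="tomorrow", time="morning")
# -> Step 2: if meeting exists, get_attendees(meeting_id=...)
# -> Step 3: send_notification(recipients=..., message=...)

7.3 ToolBench

# ToolBench characteristic: built on 16,000+ real APIs from RapidAPI

# Categories:
# - Single Tool: uses a single API
# - Intra-Category: several APIs within one category
# - Inter-Category: APIs combined across categories

# Evaluation metrics:
# - Pass Rate: execution success rate
# - Win Rate: preference against other models (judged by GPT-4)

7.4 T-Eval

# T-Eval characteristic: evaluates each stage of tool use in detail
# Measures six sub-capabilities:

# 1. Instruct Following: understanding instructions
# 2. Plan: drawing up a task plan
# 3. Reason: reasoning about the right tool
# 4. Retrieve: retrieving a suitable tool
# 5. Understand: understanding tool documentation
# 6. Review: verifying and correcting results

7.5 Benchmark Selection Guide

Which benchmark should you use?
─────────────────────────────────────────────────
Goal                                | Recommended benchmark
Overall Tool Calling evaluation     | BFCL
Large-scale tests on real APIs      | ToolBench
Fine-grained, stage-level analysis  | T-Eval
Multi-step API planning evaluation  | API-Bank
Quick baseline check                | BFCL (Simple only)
Comparing in-house models           | BFCL + custom tests
─────────────────────────────────────────────────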

8. Strategies for Improving Tool Calling Performance

8.1 Building a Fine-Tuning Dataset

import json
from openai import OpenAI

def generate_training_data(tools, num_examples=1000):
    """Generate training data using GPT-4o"""
    client = OpenAI()
    training_data = []

    tool_descriptions = json.dumps(tools, indent=2)

    for i in range(num_examples):
        # 1. Generate a natural-language query
        query_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"""Generate a natural language query
requiring one of these tools:
{tool_descriptions}

Generate diverse, realistic queries with edge cases.
Respond with ONLY the query text."""},
                {"role": "user", "content": f"Generate query #{i+1}"}
            ]
        )
        query = query_response.choices[0].message.content

        # 2. Generate the matching function call
        call_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}],
            tools=[{"type": "function", "function": t} for t in tools],
            tool_choice="auto"
        )

        if call_response.choices[0].message.tool_calls:
            tc = call_response.choices[0].message.tool_calls[0]
            training_data.append({
                "messages": [
                    {"role": "system", "content": f"Tools: {tool_descriptions}"},
                    {"role": "user", "content": query},
                    {"role": "assistant", "content": None, "tool_calls": [
                        {
                            "type": "function",
                            "function": {
                                "name": tc.function.name,
                                "arguments": tc.function.arguments
                            }
                        }
                    ]}
                ]
            })

    return training_data

8.2 Optimizing Tool Descriptions

# A step-by-step optimization process

# Step 1: initial description
v1 = {
    "name": "search_products",
    "description": "Search for products"  # too terse
}

# Step 2: add a clear statement of purpose
v2 = {
    "name": "search_products",
    "description": "Search for products in the catalog by name, category, or keywords. Returns matching products with price and availability."
}

# Step 3: add when-to-use and when-not-to-use conditions
v3 = {
    "name": "search_products",
    "description": """Search for products in the e-commerce catalog.

USE WHEN: User wants to find, browse, or compare products.
DO NOT USE: For order status (use get_order), account info, or returns.

Returns: List of products with name, price, rating, availability."""
}

# Step 4: add examples (final version)
v4 = {
    "name": "search_products",
    "description": """Search for products in the e-commerce catalog.

USE WHEN: User wants to find, browse, or compare products.
DO NOT USE: For order status, account info, or returns.

EXAMPLES:
- "wireless headphones" -> query="wireless headphones"
- "cheap laptops under $500" -> query="laptops", max_price=500

Returns: Products with name, price, rating, availability."""
}
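
Keeping the four ingredients (summary, use conditions, examples, return value) as separate fields makes it easier to apply the same layout to every tool. The helper below is a hypothetical sketch that assembles a description in the style of `v4` from structured inputs; `build_description` is not part of any library.

```python
def build_description(summary, use_when, do_not_use, examples, returns):
    """Assemble a tool description in the v4 layout.

    examples is a list of (user_query, resulting_arguments) string pairs.
    """
    lines = [
        summary,
        "",
        f"USE WHEN: {use_when}",
        f"DO NOT USE: {do_not_use}",
        "",
        "EXAMPLES:",
    ]
    lines += [f'- "{query}" -> {call}' for query, call in examples]
    lines += ["", f"Returns: {returns}"]
    return "\n".join(lines)

search_products = {
    "name": "search_products",
    "description": build_description(
        summary="Search for products in the e-commerce catalog.",
        use_when="User wants to find, browse, or compare products.",
        do_not_use="For order status, account info, or returns.",
        examples=[
            ("wireless headphones", 'query="wireless headphones"'),
            ("cheap laptops under $500", 'query="laptops", max_price=500'),
        ],
        returns="Products with name, price, rating, availability.",
    ),
}

print(search_products["description"])
```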

8.3 Error Analysis Methodology

def analyze_errors(results):
    """Analyze error patterns in BFCL results"""

    error_categories = {
        "wrong_function": [],    # wrong function selected
        "missing_params": [],    # required parameter missing
        "wrong_param_type": [],  # parameter type error
        "extra_params": [],      # unneeded parameter added
        "unnecessary_call": [],  # function called when none was needed
        "missing_call": [],      # needed function not called
        "wrong_value": [],       # parameter value error
    }

    for failure in results["failures"]:
        error_type = classify_error(failure)
        error_categories[error_type].append(failure)

    print("Error Distribution:")
    print("=" * 50)
    total = sum(len(v) for v in error_categories.values())
    for category, errors in sorted(
        error_categories.items(),
        key=lambda x: len(x[1]),
        reverse=True
    ):
        pct = len(errors) / total * 100 if total > 0 else 0
        print(f"  {category}: {len(errors)} ({pct:.1f}%)")

    return error_categories
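
`analyze_errors` relies on a `classify_error` function that is not shown. A possible version is sketched below. The failure-record layout is an assumption made for this example: each record has `expected` and `predicted` entries that are either `None` (no call) or a dict with `name` and `args`, and only genuine failures are passed in.

```python
def classify_error(failure):
    """Map one failed test case to a key of error_categories (hypothetical)."""
    expected, predicted = failure["expected"], failure["predicted"]
    if expected is None:
        return "unnecessary_call"      # the model called although none was needed
    if predicted is None:
        return "missing_call"          # the model stayed silent
    if predicted["name"] != expected["name"]:
        return "wrong_function"
    exp_args, pred_args = expected["args"], predicted["args"]
    if any(key not in pred_args for key in exp_args):
        return "missing_params"
    if any(key not in exp_args for key in pred_args):
        return "extra_params"
    # Strict comparison: an int given for a float also counts as a type error here.
    if any(type(pred_args[key]) is not type(value) for key, value in exp_args.items()):
        return "wrong_param_type"
    return "wrong_value"

failure = {
    "expected": {"name": "search_flights", "args": {"origin": "NYC", "max_price": 500.0}},
    "predicted": {"name": "search_flights", "args": {"origin": "NYC", "max_price": "500"}},
}
print(classify_error(failure))
```

The checks run from the coarsest mistake to the finest, so each failure lands in exactly one bucket and the percentages printed by `analyze_errors` add up to 100.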

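8.4 The Iterative Improvement Cycle

Tool Calling improvement cycle:
─────────────────────────────────────────────────

Step 1: Measure current performance
  - Run every BFCL category
  - Record accuracy per category

Step 2: Identify weak spots
  - Run the error analysis
  - Find the most frequent error types
  - Analyze patterns in the failed cases

Step 3: Apply fixes
  ├─ Prompt improvements (immediate effect)
  │   - Improve tool descriptions
  │   - Optimize the system prompt
  │   - Add few-shot examples
  ├─ Tool design improvements (medium term)
  │   - Simplify schemas
  │   - Merge related tools
  │   - Make parameter names unambiguous
  └─ Fine-tuning (long term)
      - Generate training data from failed cases
      - LoRA/QLoRA fine-tuning
      - Evaluate and iterate

Step 4: Re-evaluate
  - Measure again on the same benchmark
  - Confirm the improvement
  - Identify new weak spots

→ Return to Step 2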

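9. The Gap Between the Real World and Benchmarks

9.1 What Benchmarks Do Not Cover

Limits of benchmarks:
─────────────────────────────────────────────────
1. Ambiguous user input
   Benchmark:  "the weather in Seoul" (explicit)
   Real world: "How's the weather?" (no location, no time)

2. Dependence on conversation context
   Benchmark:  mostly single-turn tests
   Real world: inferring what "there" means from earlier turns

3. Error recovery
   Benchmark:  tests only well-formed responses
   Real world: handling API outages, timeouts, and malformed responses

4. Tool count explosion
   Benchmark:  5-10 tools
   Real world: 50-100 tools offered at once

5. Real-time performance
   Benchmark:  measures accuracy only
   Real world: speed, cost, and stability all matter
─────────────────────────────────────────────────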

9.2 Building Your Own Evaluation Suite

class ProductionEvalSuite:
    def __init__(self, tools, model):
        self.tools = tools
        self.model = model
        self.test_cases = []

    def add_test_case(self, category, prompt, expected, context=None):
        self.test_cases.append({
            "category": category,
            "prompt": prompt,
            "expected": expected,
            "context": context or []
        })

    def build_standard_suite(self):
        # 1. Basic functionality test
        self.add_test_case(
            "basic", "What's the weather in Seoul?",
            "get_weather(location='Seoul')"
        )

        # 2. Ambiguous input test
        self.add_test_case(
            "ambiguous", "How's the weather?",
            "ASK_CLARIFICATION"
        )

        # 3. Multi-turn context test
        self.add_test_case(
            "multi_turn",
            "What about there tomorrow?",
            "get_weather(location='Seoul', date='tomorrow')",
            context=[
                {"role": "user", "content": "What's the weather in Seoul?"},
                {"role": "assistant", "content": "It is currently 15 degrees in Seoul."}
            ]
        )

        # 4. Relevance test
        self.add_test_case(
            "relevance", "What is the meaning of life?",
            "NO_CALL"
        )

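    def run(self):
        results = {"total": 0, "correct": 0, "by_category": {}}
        for test in self.test_cases:
            # evaluate_single is assumed to return {"correct": bool, ...}
            result = self.evaluate_single(test)
            results["total"] += 1
            bucket = results["by_category"].setdefault(
                test["category"], {"total": 0, "correct": 0})
            bucket["total"] += 1
            if result["correct"]:
                results["correct"] += 1
                bucket["correct"] += 1
        # Avoid dividing by zero when no test cases were registered
        results["accuracy"] = (
            results["correct"] / results["total"] if results["total"] else 0.0)
        return results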
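
`ProductionEvalSuite.run` calls `evaluate_single`, which is not defined in the class. One possible shape is sketched below as a standalone function. It rests on assumptions that are not in the original: the model is any callable taking `(messages, tools)` and returning either a call string or one of the sentinel strings, and two calls count as equal when the function name and keyword arguments match regardless of quoting or argument order.

```python
import ast

SENTINELS = {"NO_CALL", "ASK_CLARIFICATION"}

def canonical(output):
    """Reduce model output to a comparable form (a sentinel or a parsed call)."""
    text = output.strip()
    if text in SENTINELS:
        return text
    node = ast.parse(text, mode="eval").body
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError("not a plain function call")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return (node.func.id, tuple(sorted(kwargs.items())))

def evaluate_single(model, tools, test):
    """Run one test case; earlier turns in test['context'] come before the prompt."""
    messages = list(test["context"]) + [{"role": "user", "content": test["prompt"]}]
    output = model(messages, tools)
    try:
        correct = canonical(output) == canonical(test["expected"])
    except (SyntaxError, ValueError):
        correct = False  # unparsable output counts as a failure
    return {"category": test["category"], "correct": correct, "output": output}

def fake_model(messages, tools):
    """Stand-in for a real model so the example runs offline."""
    return 'get_weather(location="Seoul")'

test_case = {
    "category": "basic",
    "prompt": "What's the weather in Seoul?",
    "expected": "get_weather(location='Seoul')",
    "context": [],
}
print(evaluate_single(fake_model, [], test_case))
```

Inside the class, the same logic could be wired in as a method that forwards `self.model` and `self.tools`.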

9.3 Production Monitoring

class ToolCallingMonitor:
    def __init__(self):
        self.metrics = {
            "total_calls": 0,
            "successful_calls": 0,
            "failed_calls": 0,
            "unnecessary_calls": 0,
            "latency_sum": 0,
            "cost_sum": 0,
        }

    def record_call(self, tool_name, success, latency, cost,
                    was_necessary=True):
        self.metrics["total_calls"] += 1
        if success:
            self.metrics["successful_calls"] += 1
        else:
            self.metrics["failed_calls"] += 1
        if not was_necessary:
            self.metrics["unnecessary_calls"] += 1
        self.metrics["latency_sum"] += latency
        self.metrics["cost_sum"] += cost

    def get_dashboard_data(self):
        total = self.metrics["total_calls"]
        if total == 0:
            return {}
        return {
            "success_rate": self.metrics["successful_calls"] / total,
            "failure_rate": self.metrics["failed_calls"] / total,
            "unnecessary_rate": self.metrics["unnecessary_calls"] / total,
            "avg_latency": self.metrics["latency_sum"] / total,
            "total_cost": self.metrics["cost_sum"],
        }

    def alert_on_anomaly(self):
        data = self.get_dashboard_data()
        alerts = []
        if data.get("failure_rate", 0) > 0.1:
            alerts.append("HIGH: tool call failure rate above 10%")
        if data.get("unnecessary_rate", 0) > 0.2:
            alerts.append("MEDIUM: unnecessary tool calls above 20%")
        if data.get("avg_latency", 0) > 5.0:
            alerts.append("MEDIUM: average latency above 5 seconds")
        return alerts

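10. Quiz

Q1: What are BFCL's seven major evaluation categories?

Answer: Simple Function Calling, Multiple Function Calling, Parallel Function Calling, Nested/Composite Function Calling, Relevance Detection, AST Evaluation, and Executable Evaluation.

Simple covers a single call to a single function, Multiple covers choosing among several similar functions, Parallel covers independent parallel calls, Nested covers chaining results, Relevance covers declining unnecessary calls, AST covers syntactic correctness, and Executable covers execution correctness.

Q2: Why is Relevance Detection one of the most important categories in Tool Calling?

Answer: Relevance Detection measures whether the LLM refrains from calling a tool when none is needed. Without it you get: 1) unnecessary API costs, 2) slower responses, 3) hallucinations built on irrelevant results, and 4) security risk (unnecessary data access). In production a large share of user questions can be answered without any tool, so a weakness here hurts both cost and user experience.

Q3: What is the difference between AST Evaluation and Executable Evaluation?

Answer: AST Evaluation checks only the syntactic structure of the generated call (function name, parameter names, type match). Executable Evaluation actually runs the generated code and checks the result. AST lets get_weather(location="Seoull") pass because it is syntactically valid, whereas Executable marks it as a failure because the real API does not recognize "Seoull".

Q4: As of 2025, which models have the best Tool Calling performance on BFCL, and which are the most cost-efficient?

Answer: The best performers are Claude 3.5 Sonnet (Overall about 92.4%) and GPT-4o (about 91.8%). The most cost-efficient are Gemini 2.0 Flash (90.1% at low cost) and GPT-4o-mini (80.8% at the lowest API price). If you can self-host, Llama 3.1 70B is also cost-efficient.

Q5: What Tool Calling benchmarks exist besides BFCL, and what characterizes each?

Answer: 1) API-Bank: three evaluation levels (calling/retrieval/planning) with multi-step API use. 2) ToolBench: large-scale testing on 16,000+ real APIs from RapidAPI. 3) T-Eval: fine-grained evaluation of six sub-capabilities (instruction following/planning/reasoning/retrieval/understanding/review). 4) Nexus: function-calling evaluation released alongside the NexusRaven model. BFCL suits overall evaluation, ToolBench suits large-scale testing, and T-Eval suits fine-grained analysis.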

11. References

BFCL Official Website - gorilla.cs.berkeley.edu/leaderboard
Gorilla: Large Language Model Connected with Massive APIs - Patil et al., 2023
Berkeley Function-Calling Leaderboard Paper - Yan et al., 2024
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs - Li et al., 2023
ToolBench: An Open Platform for Tool-Augmented LLMs - Qin et al., 2023
T-Eval: Evaluating Tool Utilization Capability of LLMs - Chen et al., 2024
Nexus Function Calling Benchmark - Srinivasan et al., 2024
OpenAI Function Calling Best Practices - official OpenAI documentation
Anthropic Tool Use Documentation - official Anthropic documentation
Gorilla GitHub Repository - github.com/ShishirPatil/gorilla
Unsloth Fine-tuning Guide - guide to fine-tuning for Tool Calling
LangSmith Evaluation Documentation - the LangSmith evaluation framework
Seal-Tools: Multilingual Tool Calling Benchmark - multilingual benchmark
HuggingFace Open LLM Leaderboard - open-source model comparison