Author: Youngju Kim (@fjvbn20031)

Contents

- Why Tool Calling Is the Core of Agents
- OpenAI Function Calling: Full Implementation
- Anthropic Claude Tool Use
- Five Common Mistakes and How to Fix Them
- Parallel Tool Calls
- Tool Design Principles
- Production Middleware
- Wrapping Up
An LLM on its own transforms text. Add tools and it can actually do things. Tool calling is the mechanism that turns a passive text generator into an active agent.
This post covers how tool calling actually works under the hood, how to implement it correctly, and the specific mistakes that trip up every team building agent systems in production.
Why Tool Calling Is the Core of Agents
An agent's capabilities are exactly as wide as its available tools:
- Search tool → access to real-time information
- Calculator tool → guaranteed mathematical accuracy
- Code execution tool → write and run actual code
- API tool → integrate with external services
- Database tool → read and write persistent data
Without tools, an LLM can only answer from its training data. With tools, it can look up today's stock prices, send emails, and execute code.
OpenAI Function Calling: Full Implementation
OpenAI's function calling has become the de facto standard API format.
Define Your Tools
```python
import openai
import json

client = openai.OpenAI(api_key="your-api-key")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": (
                "Get current weather for a specific city. "
                "Use this when the user asks about current weather conditions. "
                "Do NOT use for weather forecasts or historical data."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name in English, e.g. 'Seoul', 'Tokyo', 'New York'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit. Default: celsius"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": (
                "Search the web for current information. "
                "Use for recent news, events, or anything that may have changed "
                "since the model's training cutoff."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query"
                    },
                    "num_results": {
                        "type": "integer",
                        "description": "Number of results to return (1-10)",
                        "default": 3
                    }
                },
                "required": ["query"]
            }
        }
    }
]
```
Implement the Actual Functions
```python
# weather_api and search_api stand in for your actual API clients
def get_weather(city: str, unit: str = "celsius") -> dict:
    response = weather_api.get(city=city, unit=unit)
    return {
        "city": city,
        "temperature": response.temp,
        "unit": unit,
        "condition": response.condition,
        "humidity": response.humidity
    }

def search_web(query: str, num_results: int = 3) -> list:
    results = search_api.search(query, count=num_results)
    return [
        {"title": r.title, "snippet": r.snippet, "url": r.url}
        for r in results
    ]

available_tools = {
    "get_weather": get_weather,
    "search_web": search_web
}
```
The Complete Tool Call Loop
```python
def run_agent(user_message: str, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",  # use a model that supports parallel tool calls
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        assistant_message = response.choices[0].message
        messages.append(assistant_message)

        # No tool calls → final answer
        if not assistant_message.tool_calls:
            return assistant_message.content

        # Execute each tool call
        for tool_call in assistant_message.tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            print(f"Calling: {function_name}({function_args})")

            if function_name in available_tools:
                try:
                    result = available_tools[function_name](**function_args)
                    tool_result = json.dumps(result, ensure_ascii=False)
                except Exception as e:
                    tool_result = f"Error: {str(e)}. Try a different approach."
            else:
                tool_result = f"Unknown tool: {function_name}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": tool_result
            })

    return "Max iterations reached"

result = run_agent("What's the weather in Seoul? Also find me today's AI news.")
print(result)
```
Anthropic Claude Tool Use
Claude uses a different format but the same concept:
```python
import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]

def run_claude_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )
        messages.append({"role": "assistant", "content": response.content})

        # Anything other than tool_use (end_turn, max_tokens, ...) ends the loop
        if response.stop_reason != "tool_use":
            return " ".join(
                block.text for block in response.content
                if hasattr(block, "text")
            )

        # Handle tool_use blocks
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                try:
                    result = available_tools[block.name](**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                    })
                except Exception as e:
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": f"Error: {str(e)}",
                        "is_error": True
                    })

        # Tool results go back to Claude as a user message
        messages.append({"role": "user", "content": tool_results})
```
Five Common Mistakes and How to Fix Them
This section is the most valuable part of this post. These are real problems you will hit in production.
Mistake 1: Vague Tool Descriptions
The LLM uses your description to decide when and how to call your tool. Vague descriptions cause incorrect or unnecessary tool calls.
```python
# Bad: ambiguous and unhelpful
{
    "name": "query",
    "description": "Query data"
}

# Good: when to use it, what it returns, what NOT to use it for
{
    "name": "query_customer_orders",
    "description": (
        "Fetch order history for a specific customer. Requires customer_id. "
        "Returns: list of orders with order ID, date, amount, and status. "
        "For customer profile info (name, email), use get_customer_info instead."
    )
}
```
Mistake 2: No Error Handling
Unhandled exceptions kill the entire agent loop.
```python
# Bad: unhandled exception crashes the agent
result = database.query(sql)

# Good: errors become tool results — the LLM can adapt its approach
try:
    result = database.query(sql)
    return json.dumps(result)
except DatabaseError as e:
    return f"Database error: {str(e)}. The query may have a syntax error."
except TimeoutError:
    return "Query timed out. Try adding more specific filters or simplifying the query."
except Exception as e:
    return f"Unexpected error: {str(e)}. Please try a different approach."
```
Mistake 3: No Loop Detection
An agent can get stuck calling the same tool with the same arguments repeatedly.
```python
MAX_ITERATIONS = 15
call_count = {}

for iteration in range(MAX_ITERATIONS):
    # ... model call omitted; tool_calls comes from the response ...
    for tool_call in tool_calls:
        name = tool_call.function.name
        args_str = tool_call.function.arguments
        call_key = f"{name}:{args_str}"
        call_count[call_key] = call_count.get(call_key, 0) + 1
        if call_count[call_key] > 3:
            return "Error: same tool call repeated too many times. Breaking loop."
```
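The same guard works well as a standalone helper, which is also easier to unit test. A minimal sketch (the `RepeatGuard` name and API are illustrative, not from any library):

```python
from collections import Counter

class RepeatGuard:
    """Flags a tool call once the same (name, args) pair repeats too often."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def allow(self, name: str, args_str: str) -> bool:
        key = f"{name}:{args_str}"
        self.counts[key] += 1
        return self.counts[key] <= self.max_repeats

guard = RepeatGuard(max_repeats=3)
calls = [guard.allow("get_weather", '{"city": "Seoul"}') for _ in range(4)]
print(calls)  # [True, True, True, False]
```

The fourth identical call is rejected; the agent loop can then return an error message or force a different strategy.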
Mistake 4: Too Many Tools Defined at Once
LLM accuracy degrades when you provide too many tools. Beyond about 10, consider dynamic tool selection.
```python
def select_relevant_tools(user_query: str, all_tools: list, max_tools: int = 8) -> list:
    """Select only tools relevant to the user's query."""
    if len(all_tools) <= max_tools:
        return all_tools

    query_words = set(user_query.lower().split())
    scored_tools = []
    for tool in all_tools:
        desc_words = set(tool["function"]["description"].lower().split())
        overlap = len(query_words & desc_words)
        scored_tools.append((overlap, tool))

    scored_tools.sort(key=lambda x: x[0], reverse=True)
    return [t for _, t in scored_tools[:max_tools]]
```
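A quick check of the selector's behavior, repeated here with three toy tool definitions so the snippet runs on its own:

```python
def select_relevant_tools(user_query: str, all_tools: list, max_tools: int = 8) -> list:
    """Keyword-overlap selector, same logic as above."""
    if len(all_tools) <= max_tools:
        return all_tools
    query_words = set(user_query.lower().split())
    scored = []
    for tool in all_tools:
        desc_words = set(tool["function"]["description"].lower().split())
        scored.append((len(query_words & desc_words), tool))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [t for _, t in scored[:max_tools]]

tools = [
    {"function": {"name": "get_weather", "description": "get current weather for a city"}},
    {"function": {"name": "send_email", "description": "send an email to a recipient"}},
    {"function": {"name": "search_web", "description": "search the web for current news"}},
]

selected = select_relevant_tools("what is the weather in Seoul", tools, max_tools=2)
print([t["function"]["name"] for t in selected])  # ['get_weather', 'search_web']
```

Keyword overlap is crude (note that "the" earns search_web a point here); embedding similarity between the query and each tool description usually ranks better in practice.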
Mistake 5: No Human Confirmation for Irreversible Actions
Deleting records, processing payments, sending emails: actions like these need human approval before execution.
```python
HIGH_RISK_TOOLS = {"delete_record", "send_email", "process_payment", "deploy_code"}

def execute_with_approval(tool_name: str, args: dict) -> str:
    if tool_name in HIGH_RISK_TOOLS:
        # In a real app: show UI modal, send Slack notification, etc.
        print(f"HIGH RISK: {tool_name}({args})")
        approval = input("Approve? (yes/no): ")
        if approval.lower() != "yes":
            return "Action cancelled by user."
    return available_tools[tool_name](**args)
```
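`input()` blocks the whole process, so in a real service you'd inject an approval callback (UI modal, Slack approve button) instead. A sketch of that shape; the callback signature is an illustrative choice:

```python
from typing import Callable

HIGH_RISK_TOOLS = {"delete_record", "send_email", "process_payment", "deploy_code"}

def execute_with_approval(tool_name: str, args: dict, tools: dict,
                          approve: Callable[[str, dict], bool]):
    """Run high-risk tools only when the approval callback says yes."""
    if tool_name in HIGH_RISK_TOOLS and not approve(tool_name, args):
        return "Action cancelled by user."
    return tools[tool_name](**args)

# Toy tool registry for the demo
tools = {"send_email": lambda to, body: f"sent to {to}"}

deny_all = lambda name, args: False
print(execute_with_approval("send_email", {"to": "a@b.com", "body": "hi"}, tools, deny_all))
# Action cancelled by user.
```

The callback can be synchronous for a CLI tool or backed by a ticketing/notification flow in production; the agent loop stays unchanged either way.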
Parallel Tool Calls
Modern LLMs can request multiple tool calls in a single response. Use this — it makes agents dramatically faster.
```python
import asyncio
import json

async def execute_tool_calls_parallel(tool_calls: list) -> list:
    """Execute multiple tool calls concurrently."""

    async def execute_single(tool_call):
        name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)
        try:
            # Run the (blocking) tool function in a worker thread
            result = await asyncio.to_thread(available_tools[name], **args)
            return {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            }
        except Exception as e:
            return {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": f"Error: {str(e)}"
            }

    tasks = [execute_single(tc) for tc in tool_calls]
    return await asyncio.gather(*tasks)
```
For example, asked "What's the weather in Seoul, Tokyo, and New York?", the LLM returns three get_weather calls in a single response. Run sequentially, three 1-second API calls take about 3 seconds; run in parallel they finish in roughly 1 second.
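You can see the speedup without any real API. In this toy benchmark, `fake_weather` is a hypothetical stand-in for a 1-second tool call:

```python
import asyncio
import time

async def fake_weather(city: str) -> dict:
    await asyncio.sleep(1)  # simulates a 1-second API call
    return {"city": city, "temperature": 20}

async def main() -> float:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_weather("Seoul"), fake_weather("Tokyo"), fake_weather("New York")
    )
    assert len(results) == 3
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"3 one-second calls finished in {elapsed:.2f}s")  # ~1s, not ~3s
```

The three sleeps overlap, so total wall time is close to the slowest single call rather than the sum.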
Tool Design Principles
The difference between well-designed and poorly designed tools shows up directly in agent performance.
Principle 1: Single Responsibility
Each tool does one thing.
```python
# Bad: one function doing everything
def manage_customer(action: str, customer_id: int, **kwargs): ...

# Good: separate functions
def get_customer(customer_id: int) -> dict: ...
def update_customer(customer_id: int, name: str = None, email: str = None) -> dict: ...
def delete_customer(customer_id: int) -> dict: ...
```
Principle 2: Separate Reads from Writes
Read tools have no side effects. Write tools change state. Keep them clearly separate.
```python
# Read tools: safe, idempotent, can be called multiple times
def get_user_balance(user_id: int) -> float: ...
def search_products(query: str) -> list: ...

# Write tools: irreversible, handle with care
def transfer_money(from_id: int, to_id: int, amount: float) -> dict: ...
def delete_order(order_id: int) -> dict: ...
```
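One way to make the separation operational is to tag each tool with its side-effect class and have the dispatcher gate writes. A minimal sketch; the registry, `dispatch` function, and `approved` flag are illustrative, not from a specific framework:

```python
# Illustrative registry: each tool is tagged "read" or "write"
TOOL_KIND = {
    "get_user_balance": "read",
    "transfer_money": "write",
}

def dispatch(tool_name: str, args: dict, tools: dict, approved: bool = False):
    """Run read tools freely; run write tools only when explicitly approved."""
    if TOOL_KIND.get(tool_name) == "write" and not approved:
        return {"error": f"'{tool_name}' changes state and needs approval"}
    return tools[tool_name](**args)

# Toy implementations for the demo
tools = {
    "get_user_balance": lambda user_id: 42.0,
    "transfer_money": lambda from_id, to_id, amount: {"status": "ok"},
}

print(dispatch("get_user_balance", {"user_id": 1}, tools))  # 42.0
print(dispatch("transfer_money", {"from_id": 1, "to_id": 2, "amount": 5.0}, tools))  # blocked without approval
```

Reads can also be retried freely on failure, while a failed write needs careful handling since a retry might double-apply the action.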
Principle 3: Return Structured Data
Return JSON, not prose. The LLM reasons over structured data far better than natural language sentences.
```python
# Bad: the LLM has to parse this
def get_user_info(user_id: int) -> str:
    return "John Doe, 30 years old, email: john@example.com, joined 2022"

# Good: structured data the LLM can reason about precisely
def get_user_info(user_id: int) -> dict:
    return {
        "id": user_id,
        "name": "John Doe",
        "age": 30,
        "email": "john@example.com",
        "joined_at": "2022-01-15"
    }
```
Principle 4: Include Error Details in Results
```python
def safe_tool_wrapper(func):
    """Wrap tool functions with consistent error handling."""
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
            return {"success": True, "data": result}
        except ValueError as e:
            return {"success": False, "error": "invalid_input", "message": str(e)}
        except PermissionError as e:
            return {"success": False, "error": "permission_denied", "message": str(e)}
        except Exception as e:
            return {"success": False, "error": "unexpected", "message": str(e)}
    return wrapper

@safe_tool_wrapper
def create_order(product_id: int, quantity: int, user_id: int) -> dict:
    ...
```
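To see the wrapper in action, here is a trimmed copy with a toy `create_order` body (the validation rule and return values are made up for the demo):

```python
def safe_tool_wrapper(func):
    """Turn exceptions into structured results (trimmed copy of the above)."""
    def wrapper(*args, **kwargs):
        try:
            return {"success": True, "data": func(*args, **kwargs)}
        except ValueError as e:
            return {"success": False, "error": "invalid_input", "message": str(e)}
        except Exception as e:
            return {"success": False, "error": "unexpected", "message": str(e)}
    return wrapper

@safe_tool_wrapper
def create_order(product_id: int, quantity: int) -> dict:
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"order_id": 1001, "product_id": product_id, "quantity": quantity}

print(create_order(42, 0))  # success=False, error='invalid_input'
print(create_order(42, 3))  # success=True, with the order data
```

Either way the LLM receives a well-formed JSON result it can reason about, instead of the loop dying on an exception.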
Production Middleware
Putting it all together — a middleware class that wraps tool execution with the safeguards you need in production:
```python
import time
import asyncio

class ToolCallMiddleware:
    def __init__(self, tools: dict, max_iterations: int = 10):
        self.tools = tools
        self.max_iterations = max_iterations
        self.call_log = []

    def _validate_args(self, tool_name: str, args: dict) -> dict:
        # Minimal check; plug in JSON Schema validation here if you
        # have the tool's parameter schema available.
        if not isinstance(args, dict):
            raise ValueError("arguments must be an object")
        return args

    async def execute(self, tool_name: str, args: dict, timeout: int = 30) -> dict:
        start_time = time.time()

        # 1. Tool existence check
        if tool_name not in self.tools:
            return {"error": f"Unknown tool: {tool_name}"}

        # 2. Input validation
        try:
            validated_args = self._validate_args(tool_name, args)
        except ValueError as e:
            return {"error": f"Invalid arguments: {str(e)}"}

        # 3. Execute with timeout
        try:
            result = await asyncio.wait_for(
                asyncio.to_thread(self.tools[tool_name], **validated_args),
                timeout=timeout
            )
        except asyncio.TimeoutError:
            result = {"error": f"Timed out after {timeout}s"}
        except Exception as e:
            result = {"error": str(e)}

        # 4. Log every call (tools may return non-dict results)
        failed = isinstance(result, dict) and "error" in result
        elapsed = time.time() - start_time
        self.call_log.append({
            "tool": tool_name,
            "args": args,
            "elapsed_ms": round(elapsed * 1000),
            "success": not failed,
            "timestamp": time.time()
        })
        return result
```
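The timeout step is worth seeing in isolation. A self-contained sketch of the same `asyncio.wait_for` + `asyncio.to_thread` pattern, where `slow_tool` is a hypothetical stand-in for a hung API call:

```python
import asyncio
import time

def slow_tool() -> dict:
    time.sleep(2)  # simulates a hung API call
    return {"status": "done"}

async def call_with_timeout(func, timeout: float):
    """Run a blocking tool in a thread, converting a timeout into an error result."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(func), timeout=timeout)
    except asyncio.TimeoutError:
        return {"error": f"Timed out after {timeout}s"}

result = asyncio.run(call_with_timeout(slow_tool, timeout=0.5))
print(result)  # {'error': 'Timed out after 0.5s'}
```

One caveat: the cancelled thread keeps running to completion in the background, so a timeout protects the agent loop but does not abort the underlying work; tools with real side effects need their own cancellation or idempotency story.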
Wrapping Up
Tool calling is what makes LLMs genuinely useful as agents. The basic implementation isn't complicated — but building tool calling that holds up in production is a game of details.
Three things matter most: clear tool descriptions, graceful error handling, and loop prevention. Get these right and you'll avoid the majority of the problems that sink tool-enabled agents.
This series covered agent design patterns (Post 5), MCP (Post 6), multi-agent systems (Post 7), and tool calling (Post 8). Now go build something — implementation teaches faster than theory.