LLM Agents & Agentic AI: The Complete Guide — ReAct, Multi-Agent, MCP and Beyond

Introduction

Between 2024 and 2026, the AI paradigm shifted from simple question-answer chatbots to autonomously acting agents. An LLM agent receives a goal, independently creates a plan, calls tools, reviews the results, and iterates until the objective is achieved.

Devin autonomously resolves GitHub issues. Claude clicks through computer screens to complete tasks. The era of truly agentic AI has arrived.

This guide covers everything from core LLM agent concepts to production-ready implementations.


1. What Is an Agent?

Traditional LLM vs. Agent

A traditional LLM follows a simple input → output pattern. An agent, by contrast:

  • Perceives: Gathers information from the environment (tool results, user input, memory)
  • Plans: Decides on a sequence of actions to achieve a goal
  • Acts: Calls tools, makes API requests, executes code
  • Reflects: Evaluates results and adjusts the next action
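
Stripped of any framework, that loop looks like this; the planner and tool below are stand-ins for an LLM and real tools:

```python
from typing import Callable

def agent_loop(goal: str, plan: Callable, tools: dict, max_steps: int = 5) -> str:
    """Minimal perceive-plan-act-reflect loop with a pluggable planner and tools."""
    observation = goal                            # perceive: start from the goal itself
    for _ in range(max_steps):
        tool_name, tool_arg = plan(observation)   # plan: pick the next action
        if tool_name == "finish":                 # reflect: planner decides we're done
            return tool_arg
        observation = tools[tool_name](tool_arg)  # act: execute the chosen tool
    return observation

# Stand-in planner: search once, then finish with the search result
def toy_planner(obs: str):
    if obs.startswith("result:"):
        return ("finish", obs.removeprefix("result:"))
    return ("search", obs)

tools = {"search": lambda q: f"result:answer for {q!r}"}
print(agent_loop("top agent frameworks", toy_planner, tools))
# → answer for 'top agent frameworks'
```

A real orchestrator replaces `toy_planner` with an LLM call and adds error handling, but the control flow is the same.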

Core components of an agent:

Component    | Description
------------ | -------------------------------------------
LLM Core     | Reasoning and decision-making engine
Tools        | Web search, code execution, APIs, etc.
Memory       | Short-term and long-term context management
Orchestrator | Agent loop control

2. ReAct Framework: Reasoning + Acting

What Is ReAct?

ReAct (Reasoning + Acting) is a framework proposed by Yao et al. in 2022. The LLM solves problems by repeating a Thought → Action → Observation cycle.

Thought:     Analyze the current situation and decide on the next action
Action:      Call a tool in the form tool_name(arguments)
Observation: Receive the tool execution result
...repeat...
Final Answer: Derive the final answer
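
A ReAct orchestrator is largely a parser for the `Action:` line plus a dispatch loop. A minimal parser for the `tool_name(arguments)` convention above (the regex and tool name are illustrative):

```python
import re

# Matches lines like "Action: web_search(some query)"
ACTION_RE = re.compile(r"Action:\s*(\w+)\((.*)\)", re.DOTALL)

def parse_action(llm_output: str):
    """Extract (tool_name, argument_string) from a ReAct-style Action line."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None  # no Action line → treat the output as a final answer
    return match.group(1), match.group(2).strip()

step = """Thought: I need current data, so I should search.
Action: web_search(AI agent frameworks 2026)"""
print(parse_action(step))  # → ('web_search', 'AI agent frameworks 2026')
```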

Why Does ReAct Reduce Hallucination?

A standard LLM generates the entire answer in one pass, which can lead to "fabricating" facts midway. ReAct addresses this by:

  1. Real-time grounding: Each Observation acts as a factual anchor
  2. Incremental verification: Intermediate results are checked, allowing early error correction
  3. External knowledge access: Actual tools are used for searching and calculation during reasoning

Python Implementation

from langchain.agents import create_react_agent
from langchain_anthropic import ChatAnthropic
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_experimental.tools import PythonREPLTool
from langchain import hub

llm = ChatAnthropic(model="claude-opus-4-5", temperature=0)
tools = [DuckDuckGoSearchRun(), PythonREPLTool()]

# Load the ReAct prompt template
prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm, tools, prompt)

from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = agent_executor.invoke({
    "input": "Search for the top 3 AI agent frameworks in 2026 and create a comparison table"
})

3. Chain-of-Thought & Tree-of-Thought

Chain-of-Thought (CoT)

CoT dramatically improves LLM reasoning ability simply by prompting it to "think step by step."

cot_prompt = """
Solve the problem step by step:

Problem: {problem}

Solution process:
1. First, organize the given information.
2. Perform the necessary calculations or reasoning.
3. Verify intermediate results.
4. Derive the final answer.
"""

Tree-of-Thought (ToT)

ToT extends CoT by exploring multiple reasoning paths in a tree structure, using BFS or DFS to select the most promising path.

from langchain_experimental.tot.base import ToTChain
from langchain_experimental.tot.checker import ToTChecker

# `checker` must be an instance of a ToTChecker subclass that labels each
# intermediate thought as valid, invalid, or a final solution.
tot_chain = ToTChain.from_llm(
    llm=llm,
    checker=checker,
    k=3,    # Maximum number of thought-generation rounds
    c=4,    # Number of child thoughts to explore at each step
    verbose=True
)
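
The search itself is simple enough to write from scratch. A toy beam search over thought strings, where `expand` and `score` stand in for LLM generation and evaluation calls:

```python
from typing import Callable

def tree_of_thought(root: str,
                    expand: Callable[[str], list[str]],
                    score: Callable[[str], float],
                    depth: int = 2, beam: int = 2) -> str:
    """Beam search over thoughts: expand each state, keep the top `beam` by score."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for state in frontier for t in expand(state)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

# Toy problem: build the largest number by appending digits one at a time
expand = lambda s: [s + d for d in "123"]
score = lambda s: int(s)
print(tree_of_thought("", expand, score, depth=3, beam=2))  # → '333'
```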

4. Memory Systems

Four Types of Agent Memory

Agent memory is designed similarly to the human memory system:

Memory Type       | Storage Location     | Characteristics
----------------- | -------------------- | ----------------------------
Sensory Memory    | Input context        | Processes current input
Short-term Memory | Context window       | Current conversation session
Long-term Memory  | Vector DB / KV store | Permanent knowledge storage
Episodic Memory   | Vector DB            | Indexes past experiences
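
Short-term memory is usually implemented as a sliding window over the message list, trimmed to the context budget. A rough sketch, using character counts as a crude stand-in for tokens:

```python
def trim_to_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose total size fits the budget.

    Uses len(content) as a crude proxy for token count; swap in a real
    tokenizer (e.g. tiktoken) in production.
    """
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = len(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order

history = [
    {"role": "user", "content": "a" * 50},
    {"role": "assistant", "content": "b" * 50},
    {"role": "user", "content": "c" * 50},
]
print(len(trim_to_window(history, budget=120)))  # → 2 (oldest message dropped)
```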

mem0: Long-term Memory Integration

mem0 is an open-source library that adds personalized long-term memory to agents.

from mem0 import Memory

# Initialize mem0 (using Qdrant as vector DB)
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "agent_memory",
            "host": "localhost",
            "port": 6333,
        }
    },
    "llm": {
        "provider": "anthropic",
        "config": {
            "model": "claude-opus-4-5",
            "temperature": 0,
        }
    }
}

memory = Memory.from_config(config)
user_id = "user_123"

# Store memory
memory.add(
    messages=[
        {"role": "user", "content": "I mainly use Python and prefer FastAPI"},
        {"role": "assistant", "content": "Got it! I'll remember your Python/FastAPI preference."}
    ],
    user_id=user_id
)

# Search and use memory
relevant_memories = memory.search(
    query="What programming language does the user prefer?",
    user_id=user_id
)

context = "\n".join([m["memory"] for m in relevant_memories])
print(f"Relevant memories: {context}")
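
Retrieved memories are then typically injected into the system prompt before each turn. A minimal sketch of that injection step:

```python
def build_system_prompt(base: str, memories: list[str]) -> str:
    """Prepend retrieved long-term memories to the agent's system prompt."""
    if not memories:
        return base
    memory_block = "\n".join(f"- {m}" for m in memories)
    return f"{base}\n\nKnown facts about this user:\n{memory_block}"

prompt = build_system_prompt(
    "You are a helpful coding assistant.",
    ["Prefers Python", "Uses FastAPI"],
)
print(prompt)
```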

Vector Store-Based Episodic Memory

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from datetime import datetime

class EpisodicMemory:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings()
        self.store = Chroma(
            collection_name="episodes",
            embedding_function=self.embeddings,
            persist_directory="./episodic_memory"
        )

    def store_episode(self, content: str, metadata: dict = None):
        """Store a conversation or task episode in memory"""
        metadata = metadata or {}
        metadata["timestamp"] = datetime.now().isoformat()
        self.store.add_texts([content], metadatas=[metadata])

    def recall(self, query: str, k: int = 3):
        """Retrieve relevant episodes"""
        docs = self.store.similarity_search(query, k=k)
        return [doc.page_content for doc in docs]

# Usage example
memory = EpisodicMemory()
memory.store_episode(
    "User asked about FastAPI project structure. Successfully answered.",
    {"task_type": "coding", "success": True}
)

5. Tool Integration

Standard Tool Categories

Key tools used by agents:

  1. Web Search: Tavily, SerpAPI, DuckDuckGo
  2. Code Execution: Python REPL, Jupyter Kernel
  3. File System: Read, write, and search files
  4. API Calls: REST, GraphQL
  5. Databases: SQL, vector DB queries
  6. Computer Control: Screen capture, clicks, keyboard input
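
Whatever framework you use, tool integration boils down to a registry that maps names to callables plus descriptions the LLM can read. A minimal hand-rolled version (names are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    func: Callable[..., str]

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool):
        self._tools[tool.name] = tool

    def describe(self) -> str:
        """Render the tool list for inclusion in the system prompt."""
        return "\n".join(f"- {t.name}: {t.description}" for t in self._tools.values())

    def dispatch(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            raise ValueError(f"Unknown tool: {name}")  # surfaces tool hallucination
        return self._tools[name].func(**kwargs)

registry = ToolRegistry()
registry.register(Tool("upper", "Uppercase a string", lambda text: text.upper()))
print(registry.dispatch("upper", text="hello"))  # → HELLO
```

Rejecting unknown names loudly, rather than silently ignoring them, makes tool-hallucination failures visible in logs.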

Custom Web Search Tool Implementation

from langchain.tools import BaseTool
from pydantic import BaseModel, Field
import httpx
from typing import Optional

class WebSearchInput(BaseModel):
    query: str = Field(description="The query to search for")
    max_results: int = Field(default=5, description="Number of results to return")

class TavilySearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Searches the web for up-to-date information. Use when real-time information is needed."
    args_schema: type[BaseModel] = WebSearchInput
    api_key: str = ""

    def _format_results(self, data: dict) -> str:
        results = []
        if data.get("answer"):
            results.append(f"Summary: {data['answer']}\n")
        for r in data.get("results", []):
            results.append(f"- {r['title']}: {r['content'][:200]}...")
        return "\n".join(results)

    def _run(self, query: str, max_results: int = 5) -> str:
        payload = {
            "api_key": self.api_key,
            "query": query,
            "max_results": max_results,
            "include_answer": True,
        }
        response = httpx.post("https://api.tavily.com/search", json=payload)
        response.raise_for_status()
        return self._format_results(response.json())

    async def _arun(self, query: str, max_results: int = 5) -> str:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.tavily.com/search",
                json={"api_key": self.api_key, "query": query,
                      "max_results": max_results, "include_answer": True}
            )
        response.raise_for_status()
        return self._format_results(response.json())

MCP (Model Context Protocol)

MCP is a standardized tool integration protocol announced by Anthropic in late 2024. While traditional Function Calling required different formats for each LLM, MCP standardizes tools through a server-client model.

Key MCP advantages:

  • Reusability: An MCP server built once works with any LLM
  • Rich context: Provides three abstractions — Resources, Prompts, and Tools
  • Dynamic discovery: Tool lists are dynamically queried at runtime

# MCP server implementation (Python SDK)
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp import types

app = Server("my-tool-server")

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="get_weather",
            description="Retrieves the current weather for a specific city",
            inputSchema={
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "get_weather":
        city = arguments["city"]
        weather_data = await fetch_weather(city)  # your own weather-API helper (not shown)
        return [types.TextContent(type="text", text=str(weather_data))]
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
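
On the wire, MCP messages are JSON-RPC 2.0. A `tools/call` request from a client to the server above looks roughly like this (the id and arguments are illustrative):

```python
import json

# JSON-RPC 2.0 request a client would send to invoke the get_weather tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"city": "Seoul", "units": "celsius"},
    },
}
print(json.dumps(request, indent=2))
```

The SDK builds and parses these envelopes for you; the point is that any client speaking this framing can call any MCP server.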

6. LangGraph: Stateful Agents

LangGraph is a graph-based agent orchestration framework built by the LangChain team. Unlike the DAG-based LangChain Expression Language (LCEL), LangGraph supports cycles, naturally expressing agent loops.

Stateful Agent Implementation

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from typing import TypedDict, Annotated, Sequence
import operator

# 1. Define state
class AgentState(TypedDict):
    messages: Annotated[Sequence, operator.add]
    tool_calls: list
    iteration_count: int

# 2. Set up LLM and tools (TavilySearchTool from section 5)
from langchain_experimental.tools import PythonREPLTool

llm = ChatAnthropic(model="claude-opus-4-5")
tools = [TavilySearchTool(), PythonREPLTool()]
llm_with_tools = llm.bind_tools(tools)

# 3. Define nodes
def call_model(state: AgentState) -> AgentState:
    """LLM call node"""
    response = llm_with_tools.invoke(state["messages"])
    return {
        "messages": [response],
        "iteration_count": state["iteration_count"] + 1
    }

def call_tools(state: AgentState) -> AgentState:
    """Tool execution node"""
    last_message = state["messages"][-1]
    tool_results = []

    for tool_call in last_message.tool_calls:
        tool = next(t for t in tools if t.name == tool_call["name"])
        result = tool.invoke(tool_call["args"])
        tool_results.append(
            ToolMessage(content=str(result), tool_call_id=tool_call["id"])
        )
    return {"messages": tool_results}

# 4. Routing function (conditional edges)
def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        if state["iteration_count"] < 10:  # Prevent infinite loops
            return "tools"
    return "end"

# 5. Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)

graph.set_entry_point("agent")
graph.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": END}
)
graph.add_edge("tools", "agent")  # Return to agent after tool execution

# 6. Add memory checkpointing
from langgraph.checkpoint.memory import MemorySaver
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# Run (manage conversation sessions with thread_id)
config = {"configurable": {"thread_id": "session_001"}}
result = app.invoke(
    {"messages": [HumanMessage(content="Search for AI agent trends in 2026 and summarize them")], "iteration_count": 0},
    config=config
)

7. Multi-Agent Systems

CrewAI: Role-Based Multi-Agent

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, FileWriterTool

search_tool = SerperDevTool()
file_writer = FileWriterTool()

researcher = Agent(
    role="AI Researcher",
    goal="Conduct in-depth research on the latest AI agent technology trends",
    backstory="""You are an expert researcher in the AI field.
    You analyze the latest papers, blogs, and GitHub repositories to extract key insights.""",
    tools=[search_tool],
    llm="claude-opus-4-5",
    verbose=True
)

writer = Agent(
    role="Technical Writer",
    goal="Write readable technical reports from research results",
    backstory="""You are a professional writer who explains complex AI concepts clearly.""",
    tools=[file_writer],
    llm="claude-opus-4-5",
    verbose=True
)

research_task = Task(
    description="Research the Top 5 LLM agent trends of 2026. Include specific examples and impacts for each trend.",
    expected_output="Detailed analysis of 5 trends (500+ words each)",
    agent=researcher
)

writing_task = Task(
    description="Write a technical blog post based on the research results.",
    expected_output="A 2000-word technical blog post in Markdown format",
    agent=writer,
    output_file="ai_trends_2026.md"
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True
)

result = crew.kickoff()

AutoGen: Conversation-Based Multi-Agent

AutoGen is Microsoft's multi-agent framework featuring collaboration through conversation between agents.

import autogen

config_list = [{"model": "claude-opus-4-5", "api_key": "YOUR_KEY", "api_type": "anthropic"}]

orchestrator = autogen.AssistantAgent(
    name="Orchestrator",
    system_message="""You are an orchestrator who coordinates the team.
    You analyze tasks and delegate them to the appropriate specialist agents.
    You integrate all results to generate the final answer.""",
    llm_config={"config_list": config_list}
)

coder = autogen.AssistantAgent(
    name="Coder",
    system_message="You are an expert at writing and executing Python code.",
    llm_config={"config_list": config_list}
)

user_proxy = autogen.UserProxyAgent(
    name="UserProxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding", "use_docker": False}
)

groupchat = autogen.GroupChat(
    agents=[orchestrator, coder, user_proxy],
    messages=[],
    max_round=12
)
manager = autogen.GroupChatManager(groupchat=groupchat)
user_proxy.initiate_chat(manager, message="Write a data visualization script")

8. Claude API Tool Use Implementation

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_stock_price",
        "description": "Retrieves the current stock price and change percentage for a specific ticker",
        "input_schema": {
            "type": "object",
            "properties": {
                "symbol": {
                    "type": "string",
                    "description": "Stock ticker symbol (e.g., AAPL, MSFT)"
                },
                "currency": {
                    "type": "string",
                    "enum": ["USD", "KRW"],
                    "description": "Display currency"
                }
            },
            "required": ["symbol"]
        }
    }
]

def process_tool_call(tool_name: str, tool_input: dict) -> str:
    # Stubbed response for illustration — replace with a real market-data API call
    if tool_name == "get_stock_price":
        return json.dumps({
            "symbol": tool_input["symbol"],
            "price": 185.92,
            "change_percent": "+2.3%",
            "currency": tool_input.get("currency", "USD")
        })

messages = [{"role": "user", "content": "What's the current Apple stock price?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )

    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "end_turn":
        final_text = next(
            block.text for block in response.content
            if hasattr(block, "text")
        )
        print(f"Final answer: {final_text}")
        break

    if response.stop_reason == "tool_use":
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = process_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        messages.append({"role": "user", "content": tool_results})

9. Agent Evaluation

Key Benchmarks

Benchmark  | Measurement Area                     | Characteristics
---------- | ------------------------------------ | ---------------------------------------
AgentBench | 8 environments (OS, DB, games, etc.) | Real-environment-based evaluation
GAIA       | General AI assistance capabilities   | Comparison with human-level performance
SWE-bench  | Software engineering                 | Solves real GitHub issues
WebArena   | Web navigation ability               | Manipulates real websites
OSWorld    | Computer usage ability               | GUI interaction

Trajectory Evaluation vs. Outcome Evaluation

There are two key perspectives in agent evaluation:

Outcome Evaluation:

  • Measures only whether the final goal was achieved
  • Pass@k, Success Rate
  • Simple but ignores the process
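
For reference, Pass@k is usually computed with the unbiased estimator from the Codex paper: with n samples of which c succeed, pass@k = 1 − C(n−c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts
    sampled from n total (c of them correct) succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than k → a sampled set must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 2))  # → 0.3
```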

Trajectory Evaluation:

  • Evaluates the entire action sequence toward the goal
  • Jointly measures efficiency, safety, and absence of side effects
  • Essential in production environments

from dataclasses import dataclass
from typing import List

@dataclass
class AgentTrajectory:
    task: str
    steps: List[dict]  # {"thought": ..., "action": ..., "observation": ...}
    final_answer: str
    success: bool
    total_tokens: int

def evaluate_trajectory(trajectory: AgentTrajectory) -> dict:
    """Trajectory-based agent evaluation.

    calculate_efficiency, check_error_recovery, and evaluate_tool_usage are
    domain-specific scorers supplied by the caller; count_redundant_steps
    is defined below.
    """
    metrics = {
        "task_success": trajectory.success,
        "efficiency": calculate_efficiency(trajectory.steps),
        "redundant_steps": count_redundant_steps(trajectory.steps),
        "error_recovery": check_error_recovery(trajectory.steps),
        "tool_usage_appropriateness": evaluate_tool_usage(trajectory.steps),
        "cost_efficiency": 1000 / max(trajectory.total_tokens, 1)  # guard div-by-zero
    }
    return metrics

def count_redundant_steps(steps: List[dict]) -> int:
    """Count unnecessary duplicate tool calls"""
    seen_actions = set()
    redundant = 0
    for step in steps:
        action_key = f"{step.get('action_type')}:{step.get('action_input')}"
        if action_key in seen_actions:
            redundant += 1
        seen_actions.add(action_key)
    return redundant

Common Agent Failure Modes

  1. Infinite loops: Incorrect goal-completion conditions causing repetition
  2. Tool hallucination: Calling non-existent tools or parameters
  3. Context drift: Forgetting the original goal in long sessions
  4. Over-planning: Unnecessary planning for simple tasks
  5. Tool overuse: Continuously calling tools that are not needed
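
Several of these failure modes can be caught at runtime with a cheap guard that aborts on repeated identical tool calls or an exceeded step budget (the thresholds here are illustrative):

```python
class LoopGuard:
    """Abort the agent loop on signs of infinite looping or tool overuse."""

    def __init__(self, max_steps: int = 15, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.call_counts: dict[str, int] = {}

    def check(self, tool_name: str, tool_args: str) -> None:
        """Call before every tool invocation; raises when a limit is hit."""
        self.steps += 1
        key = f"{tool_name}:{tool_args}"
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.steps > self.max_steps:
            raise RuntimeError("Step budget exceeded — possible infinite loop")
        if self.call_counts[key] > self.max_repeats:
            raise RuntimeError(f"Repeated identical call: {key}")

guard = LoopGuard(max_repeats=2)
guard.check("web_search", "agent trends")  # ok
guard.check("web_search", "agent trends")  # ok (second identical call)
# a third identical call would raise RuntimeError
```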

10. Computer-use Agents

Claude's Computer Use API and OpenAI's computer-use models let agents see and manipulate actual computer screens.

import anthropic
import base64
from PIL import ImageGrab

def take_screenshot() -> str:
    """Capture screen and encode as base64"""
    screenshot = ImageGrab.grab()
    screenshot.save("/tmp/screenshot.png")
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.b64encode(f.read()).decode()

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,
    tools=[
        {"type": "computer_20241022", "name": "computer", "display_width_px": 1920, "display_height_px": 1080},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"}
    ],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Open a browser and check the latest trending GitHub repositories"},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": take_screenshot()}}
        ]
    }]
)

Coding Agents: Devin and SWE-agent

Top coding agent performance on SWE-bench in 2026:

Agent       | SWE-bench Verified | Characteristics
----------- | ------------------ | --------------------------------------------
Claude Code | ~72%               | Terminal integration, codebase understanding
Devin 2.0   | ~65%               | Full development workflow
SWE-agent   | ~58%               | Open-source, research use
Aider       | ~55%               | Specialized for local codebase

11. OpenAI Assistants API

from openai import OpenAI
import time

client = OpenAI()

assistant = client.beta.assistants.create(
    name="AI Technology Analyst",
    instructions="You are an AI/ML technology expert. Analyze the latest papers and technical documents to provide insights.",
    model="gpt-4o",
    tools=[
        {"type": "file_search"},     # File-based RAG
        {"type": "code_interpreter"}  # Code execution
    ]
)

# Upload a file into a vector store, then attach the store to the assistant
vector_store = client.beta.vector_stores.create(name="AI Papers Repository")
with open("ai_papers_2026.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=[f]
    )
client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Analyze the main limitations of Agentic AI from the uploaded papers"
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)

while run.status in ["queued", "in_progress"]:
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    time.sleep(1)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)

Quiz: Core Concept Check

Q1. How does the Thought-Action-Observation cycle in the ReAct framework reduce hallucination?

Answer: Step-by-step verification through real-time external grounding

Explanation: A standard LLM generates the entire answer in one pass, leaving room to "fabricate" facts midway through. ReAct calls actual tools (search, calculation, etc.) at each reasoning step and verifies the facts through Observations. These real results act as "factual anchors," preventing subsequent reasoning from drifting without a basis. Because intermediate steps are explicitly recorded, errors can be identified and corrected at the exact point they occurred.

Q2. How does a graph with cycles in LangGraph differ from DAG-based LangChain?

Answer: Enables state-based iterative execution and dynamic routing

Explanation: LangChain's LCEL is a Directed Acyclic Graph (DAG) — once executed, it cannot loop back. LangGraph supports cycles, naturally expressing agent loops like "call tool → check result → retry." Conditional edges dynamically determine the next node based on current state, and checkpointers persist state across sessions for cross-session memory. This mirrors the human "try-error-correct" thought process in code.

Q3. Why is MCP (Model Context Protocol) more flexible than traditional Function Calling?

Answer: A standardized server-client architecture forms an LLM-independent tool ecosystem

Explanation: Traditional Function Calling uses different formats for OpenAI, Anthropic, and Google, making tools tied to specific LLMs. MCP defines a standard protocol over stdio or HTTP, so an MCP server built once can be reused in any MCP-supporting client (Claude, Cursor, VS Code, etc.). Beyond Tools (executable functions), MCP also provides Resources (files, databases, and other data) and Prompts (reusable prompt templates), offering agents much richer context.

Q4. What are the benefits of separating the orchestrator agent from executor agents in a multi-agent system?

Answer: Separation of concerns, specialization, parallel processing, and error isolation

Explanation: The orchestrator focuses exclusively on high-level planning and coordination, while executor agents specialize in specific domains (coding, search, writing, etc.). The benefits are: (1) each agent can be independently optimized; (2) multiple executor agents can work in parallel for higher throughput; (3) the failure of one agent does not halt the entire system (error isolation); (4) new specialist agents can be added easily (scalability); (5) each agent's actions can be independently audited and logged.

Q5. What is the difference between trajectory evaluation and outcome evaluation in agent assessment?

Answer: Outcome measures final success only; trajectory also evaluates process efficiency and safety

Explanation: Outcome Evaluation measures only whether the goal was achieved (0 or 1). It is simple, but allows passing even when the correct result is reached via a bad process or with side effects. Trajectory Evaluation analyzes the entire action sequence: whether there were unnecessary steps (efficiency), whether unsafe actions were taken (safety), whether errors were recovered appropriately, and whether token and API costs were reasonable. Production agents must treat "achieving the goal at excessive cost or with side effects" as a failure, making Trajectory Evaluation essential.


Conclusion

LLM agents have moved beyond the research stage and are now creating real value in production. Key trends for 2026:

  1. Computer-use agents: General-purpose agents that see and manipulate screens directly
  2. Long-term memory standardization: Widespread adoption of memory layers like mem0 and Zep
  3. MCP ecosystem expansion: Thousands of MCP servers and tools
  4. Agent safety: Agent action auditing, permission restrictions, human oversight
  5. Multimodal agents: Integrated processing of text, images, and audio

As a next step, challenge yourself to build a production agent with LangGraph or develop an MCP server!