LLM Agents & Agentic AI: The Complete Guide (ReAct, Multi-Agent, MCP and Beyond)

Introduction
1. What Is an Agent?
- Traditional LLM vs. Agent
2. ReAct Framework: Reasoning + Acting
3. Chain-of-Thought & Tree-of-Thought
- Chain-of-Thought (CoT)
- Tree-of-Thought (ToT)
4. Memory Systems
5. Tool Integration
6. LangGraph: Stateful Agents
- Stateful Agent Implementation
7. Multi-Agent Systems
- CrewAI: Role-Based Multi-Agent
- AutoGen: Conversation-Based Multi-Agent
8. Claude API Tool Use Implementation
9. Agent Evaluation
10. 2026 Trends: Computer-use & Coding Agents
- Computer-use Agents
- Coding Agents: Devin and SWE-agent
11. OpenAI Assistants API
Quiz: Core Concept Check
Conclusion

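Introduction

Between 2024 and 2026, the focus of applied AI moved from simple question-and-answer chatbots toward agents that act autonomously.

Given a goal, an LLM agent plans its own steps, calls tools, reviews the results, and iterates until the goal is met. Devin resolves GitHub issues on its own, and Claude can operate a computer by clicking through the screen.

This guide covers LLM agents from core concepts through practical implementation.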

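1. What Is an Agent?

Traditional LLM vs. Agent

A traditional LLM follows a simple input → output pattern. An agent, by contrast, runs a loop:

Perceive: Gathers information from the environment (tool results, user input, memory)
Plan: Decides on a sequence of actions to achieve the goal
Act: Calls tools, makes API requests, executes code
Reflect: Evaluates the results and adjusts the next action

Core components of an agent:

Component	Description
LLM core	Reasoning and decision-making engine
Tools	Web search, code execution, APIs, and so on
Memory	Short-term and long-term context management
Orchestrator	Controls the agent loop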

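2. ReAct Framework: Reasoning + Acting

What Is ReAct?

ReAct (Reasoning + Acting) is a framework proposed by Yao et al. in 2022 in which the LLM solves a problem by repeating a Thought → Action → Observation cycle.

Thought: Analyze the current situation and decide on the next action
Action: Call a tool in the form tool_name(arguments)
Observation: Receive the result of the tool call
... repeat ...
Final Answer: Produce the final answer

Why Does ReAct Reduce Hallucination?

A plain LLM generates the whole answer in one pass, so it tends to make up facts along the way. ReAct counters this in three ways:

Real-time grounding: Each Observation serves as a fact-based anchor
Step-by-step verification: Intermediate results are checked, so errors are corrected early
External knowledge: Real tools handle search and calculation during reasoning

Python Implementation Example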

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_anthropic import ChatAnthropic
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_experimental.tools import PythonREPLTool

llm = ChatAnthropic(model="claude-opus-4-5", temperature=0)
# PythonREPLTool executes arbitrary code; run it only in a sandboxed environment
tools = [DuckDuckGoSearchRun(), PythonREPLTool()]

# Load the ReAct prompt template from the LangChain Hub
prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = agent_executor.invoke({
    "input": "Search for the three most popular AI agent frameworks as of 2026 and build a comparison table"
})
print(result["output"])

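3. Chain-of-Thought & Tree-of-Thought

Chain-of-Thought (CoT)

CoT can substantially improve an LLM's multi-step reasoning with nothing more than a prompt such as "Let's think step by step."

cot_prompt = """
Solve the problem step by step:

Problem: {problem}

Solution process:
1. First, organize the given information.
2. Perform the necessary calculations or reasoning.
3. Verify the intermediate results.
4. Derive the final answer.
"""

Tree-of-Thought (ToT)

ToT extends CoT by exploring several reasoning paths as a tree and using BFS or DFS to select the most promising one.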

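from langchain_experimental.tot.base import ToTChain

# `llm` is the model created earlier. `checker` must be an instance of a
# ToTChecker subclass that validates each proposed thought; it is not defined in this guide.
tot_chain = ToTChain.from_llm(
    llm=llm,
    checker=checker,
    k=3,           # maximum number of rounds (thoughts explored)
    c=4,           # number of child thoughts proposed per node
    verbose=True
)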

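4. Memory Systems

Four Types of Memory

Agent memory is designed along the lines of human memory:

Memory type	Storage location	Characteristics
Sensory memory	Input context	Processes the current input
Short-term memory	Context window	Current conversation session
Long-term memory	Vector DB / KV store	Persistent knowledge storage
Episodic memory	Vector DB	Indexes past experiences

mem0: Long-Term Memory Integration

mem0 is an open-source library that adds personalized long-term memory to an agent.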

from mem0 import Memory

# Initialize mem0 (using Qdrant as the vector store)
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "agent_memory",
            "host": "localhost",
            "port": 6333,
        }
    },
    "llm": {
        "provider": "anthropic",
        "config": {
            "model": "claude-opus-4-5",
            "temperature": 0,
        }
    }
}

memory = Memory.from_config(config)
user_id = "user_123"

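# Store a memory
memory.add(
    messages=[
        {"role": "user", "content": "I mainly use Python and I like FastAPI"},
        {"role": "assistant", "content": "Got it! I'll remember your Python/FastAPI preference."}
    ],
    user_id=user_id
)

# Search memories and use them as context
relevant_memories = memory.search(
    query="Which programming language does the user prefer?",
    user_id=user_id
)

# Depending on the mem0 version, search() returns a list or a dict with a "results" key
hits = relevant_memories["results"] if isinstance(relevant_memories, dict) else relevant_memories
context = "\n".join(m["memory"] for m in hits)
print(f"Relevant memories: {context}")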

Vector Store-Based Episodic Memory

from datetime import datetime
from typing import Optional

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

class EpisodicMemory:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings()
        self.store = Chroma(
            collection_name="episodes",
            embedding_function=self.embeddings,
            persist_directory="./episodic_memory"
        )

    def store_episode(self, content: str, metadata: Optional[dict] = None):
        """Store a conversation or task episode in memory."""
        metadata = dict(metadata or {})  # copy so the caller's dict is not mutated
        metadata["timestamp"] = datetime.now().isoformat()
        self.store.add_texts([content], metadatas=[metadata])

    def recall(self, query: str, k: int = 3):
        """Retrieve the k episodes most similar to the query."""
        docs = self.store.similarity_search(query, k=k)
        return [doc.page_content for doc in docs]

# Usage example (named episodic_memory so it does not shadow the mem0 object above)
episodic_memory = EpisodicMemory()
episodic_memory.store_episode(
    "User asked about FastAPI project structure. Successfully answered.",
    {"task_type": "coding", "success": True}
)

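5. Tool Integration

Standard Tool Categories

The main kinds of tools an agent uses:

Web search: Tavily, SerpAPI, DuckDuckGo
Code execution: Python REPL, Jupyter kernel
File system: Read, write, and search files
API calls: REST, GraphQL
Databases: SQL and vector DB queries
Computer control: Screen capture, clicks, keyboard input

Custom Web Search Tool Implementation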

from typing import Type

import httpx
from langchain.tools import BaseTool
from pydantic import BaseModel, Field

SEARCH_URL = "https://api.tavily.com/search"

class WebSearchInput(BaseModel):
    query: str = Field(description="The query to search for")
    max_results: int = Field(default=5, description="Number of results to return")

class TavilySearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Searches the web for up-to-date information. Use when real-time information is needed."
    args_schema: Type[BaseModel] = WebSearchInput
    api_key: str = ""

    def _payload(self, query: str, max_results: int) -> dict:
        """Request body shared by the sync and async paths."""
        return {
            "api_key": self.api_key,
            "query": query,
            "max_results": max_results,
            "include_answer": True,
        }

    @staticmethod
    def _process_response(data: dict) -> str:
        """Format the API response as a short text summary."""
        results = []
        if data.get("answer"):
            results.append(f"Summary: {data['answer']}\n")
        for r in data.get("results", []):
            results.append(f"- {r.get('title', '')}: {r.get('content', '')[:200]}...")
        return "\n".join(results)

    def _run(self, query: str, max_results: int = 5) -> str:
        response = httpx.post(SEARCH_URL, json=self._payload(query, max_results), timeout=30)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        return self._process_response(response.json())

    async def _arun(self, query: str, max_results: int = 5) -> str:
        # Async variant: same payload and formatting as _run
        async with httpx.AsyncClient(timeout=30) as client:
            response = await client.post(SEARCH_URL, json=self._payload(query, max_results))
        response.raise_for_status()
        return self._process_response(response.json())

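MCP (Model Context Protocol)

MCP is a standardized tool integration protocol that Anthropic announced in late 2024. Where function calling uses a different format for each LLM provider, MCP standardizes tools through a client-server model.

Key advantages of MCP:

Reusability: An MCP server built once can be connected to any MCP-compatible client, whichever LLM sits behind it
Rich context: Provides three abstractions: Resources, Prompts, and Tools
Dynamic discovery: The list of available tools is queried at runtime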

# MCP server implementation example (Python SDK)
import asyncio

from mcp import types
from mcp.server import Server
from mcp.server.stdio import stdio_server

app = Server("my-tool-server")

async def fetch_weather(city: str, units: str = "celsius") -> dict:
    # Placeholder: replace with a call to a real weather API
    return {"city": city, "units": units, "temperature": None}

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="get_weather",
            description="Retrieves the current weather for a specific city",
            inputSchema={
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "get_weather":
        weather_data = await fetch_weather(arguments["city"], arguments.get("units", "celsius"))
        return [types.TextContent(type="text", text=str(weather_data))]
    # Fail loudly instead of silently returning None for unknown tools
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())

6. LangGraph: Stateful Agents

LangGraph is a graph-based agent orchestration framework built by the LangChain team. Unlike the DAG-style LangChain Expression Language (LCEL), it supports cycles, so an agent loop can be expressed directly.

Stateful Agent Implementation

import operator
from typing import Annotated, Sequence, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import BaseMessage, HumanMessage, ToolMessage
from langchain_experimental.tools import PythonREPLTool
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END

# 1. Define the state
class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]  # new messages are appended
    iteration_count: int

# 2. Set up the LLM and tools
llm = ChatAnthropic(model="claude-opus-4-5")
# TavilySearchTool is the custom tool from section 5; supply your own API key
tools = [TavilySearchTool(api_key="YOUR_TAVILY_KEY"), PythonREPLTool()]
llm_with_tools = llm.bind_tools(tools)

# 3. Define the nodes
def call_model(state: AgentState) -> dict:
    """LLM call node"""
    response = llm_with_tools.invoke(state["messages"])
    return {
        "messages": [response],
        "iteration_count": state["iteration_count"] + 1
    }

def call_tools(state: AgentState) -> dict:
    """Tool execution node"""
    last_message = state["messages"][-1]
    tool_results = []

    for tool_call in last_message.tool_calls:
        tool = next(t for t in tools if t.name == tool_call["name"])
        result = tool.invoke(tool_call["args"])
        tool_results.append(
            ToolMessage(content=str(result), tool_call_id=tool_call["id"])
        )
    return {"messages": tool_results}

# 4. Routing function (conditional edge)
def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    # Continue if the model requested a tool call; otherwise finish
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        if state["iteration_count"] < 10:  # guard against infinite loops
            return "tools"
    return "end"

# 5. Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)

graph.set_entry_point("agent")
graph.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": END}
)
graph.add_edge("tools", "agent")  # return to the agent after tool execution

# 6. Add a memory checkpointer
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# Run (thread_id identifies the conversation session)
config = {"configurable": {"thread_id": "session_001"}}
result = app.invoke(
    {"messages": [HumanMessage(content="Search for AI agent trends in 2026 and summarize them")], "iteration_count": 0},
    config=config
)
print(result["messages"][-1].content)

7. Multi-Agent Systems

CrewAI: Role-Based Multi-Agent

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, FileWriterTool

# Tool setup
search_tool = SerperDevTool()
file_writer = FileWriterTool()

# Agent definitions
# CrewAI resolves model strings through LiteLLM, which expects a provider prefix
researcher = Agent(
    role="AI Researcher",
    goal="Investigate the latest AI agent technology trends in depth",
    backstory="""You are a researcher specializing in AI.
    You analyze recent papers, blogs, and GitHub repositories to extract key insights.""",
    tools=[search_tool],
    llm="anthropic/claude-opus-4-5",
    verbose=True
)

writer = Agent(
    role="Technical Writer",
    goal="Turn research findings into a readable technical report",
    backstory="""You are a professional writer who explains complex AI concepts clearly.""",
    tools=[file_writer],
    llm="anthropic/claude-opus-4-5",
    verbose=True
)

# Task definitions
research_task = Task(
    description="Research the top 5 LLM agent trends of 2026. Include concrete examples and impact for each trend.",
    expected_output="Detailed analysis of 5 trends (at least 500 characters each)",
    agent=researcher
)

writing_task = Task(
    description="Write a technical blog post based on the research findings.",
    expected_output="A 2,000-character technical blog post in Markdown",
    agent=writer,
    output_file="ai_trends_2026.md"
)

# Run the crew (sequential process: tasks execute in list order)
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True
)

result = crew.kickoff()

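AutoGen: Conversation-Based Multi-Agent

AutoGen is a multi-agent framework from Microsoft in which agents collaborate by exchanging messages.

import autogen

# api_type tells AutoGen to route requests to the Anthropic client
config_list = [{"model": "claude-opus-4-5", "api_key": "YOUR_KEY", "api_type": "anthropic"}]

# Orchestrator agent
orchestrator = autogen.AssistantAgent(
    name="Orchestrator",
    system_message="""You are the orchestrator who coordinates the team.
    Analyze the task and delegate it to the appropriate specialist agent.
    Combine all results into the final answer.""",
    llm_config={"config_list": config_list}
)

# Coding agent (function schemas are elided in the original)
coder = autogen.AssistantAgent(
    name="Coder",
    system_message="You are an expert at writing and running Python code.",
    llm_config={"config_list": config_list, "functions": [...]}
)

# User proxy (executes the generated code)
# use_docker=False runs code directly on the host; prefer Docker outside of experiments
user_proxy = autogen.UserProxyAgent(
    name="UserProxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding", "use_docker": False}
)

# Run the group chat
groupchat = autogen.GroupChat(
    agents=[orchestrator, coder, user_proxy],
    messages=[],
    max_round=12
)
# The manager needs its own llm_config to choose the next speaker
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})
user_proxy.initiate_chat(manager, message="Write data visualization code")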

8. Claude API Tool Use Implementation

import anthropic
import json

client = anthropic.Anthropic()

# Tool definitions
tools = [
    {
        "name": "get_stock_price",
        "description": "Retrieves the current price and percent change for a stock",
        "input_schema": {
            "type": "object",
            "properties": {
                "symbol": {
                    "type": "string",
                    "description": "Stock ticker symbol (e.g. AAPL, MSFT)"
                },
                "currency": {
                    "type": "string",
                    "enum": ["USD", "KRW"],
                    "description": "Display currency"
                }
            },
            "required": ["symbol"]
        }
    }
]

def process_tool_call(tool_name: str, tool_input: dict) -> str:
    """Tool execution logic"""
    if tool_name == "get_stock_price":
        # Stand-in for a real API call (hard-coded example values)
        return json.dumps({
            "symbol": tool_input["symbol"],
            "price": 185.92,
            "change_percent": "+2.3%",
            "currency": tool_input.get("currency", "USD")
        })
    # Report unknown tools to the model instead of returning None
    return json.dumps({"error": f"Unknown tool: {tool_name}"})

# Agent loop
messages = [{"role": "user", "content": "Tell me Apple's stock price"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )

    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "tool_use":
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = process_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        messages.append({"role": "user", "content": tool_results})
        continue

    # end_turn, max_tokens, or any other stop reason: print what we have and
    # stop, so the loop cannot spin forever on an unexpected stop reason
    final_text = next(
        (block.text for block in response.content if hasattr(block, "text")),
        ""
    )
    print(f"Final answer ({response.stop_reason}): {final_text}")
    break

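9. Agent Evaluation

Major Benchmarks

Benchmark	What it measures	Characteristics
AgentBench	8 environments (OS, DB, games, etc.)	Evaluation in realistic environments
GAIA	General AI assistant ability	Compared against human performance
SWE-bench	Software engineering	Resolving real GitHub issues
WebArena	Web navigation	Operating realistic websites
OSWorld	Computer use	GUI interaction

Trajectory Evaluation vs Outcome Evaluation

There are two main perspectives on agent evaluation:

Outcome Evaluation:

Measures only whether the final goal was achieved
Pass@k, success rate
Simple, but ignores the process

Trajectory Evaluation:

Evaluates the process (the action sequence) that led to the goal
Measures efficiency, safety, and absence of side effects together
Essential in production environments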
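
The two outcome metrics named above can be computed in a few lines. This is a minimal sketch: success_rate is a plain mean over task outcomes, and pass_at_k uses the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts were sampled and c of them passed. The function names are illustrative choices, not part of any benchmark's tooling.

```python
from math import comb
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks whose final goal was achieved."""
    return sum(outcomes) / len(outcomes)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes,
    estimated from n sampled attempts of which c passed."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(success_rate([True, False, True, True]))
print(pass_at_k(n=10, c=3, k=1))
```

Neither number says anything about how the agent got there, which is the gap trajectory evaluation fills.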

from dataclasses import dataclass
from typing import List

@dataclass
class AgentTrajectory:
    task: str
    steps: List[dict]  # {"thought": ..., "action": ..., "observation": ...}
    final_answer: str
    success: bool
    total_tokens: int

def evaluate_trajectory(trajectory: AgentTrajectory) -> dict:
    """Trajectory-based agent evaluation"""
    # calculate_efficiency, check_error_recovery, and evaluate_tool_usage are
    # placeholders; define them for your own task before calling this function
    metrics = {
        "task_success": trajectory.success,
        "efficiency": calculate_efficiency(trajectory.steps),
        "redundant_steps": count_redundant_steps(trajectory.steps),
        "error_recovery": check_error_recovery(trajectory.steps),
        "tool_usage_appropriateness": evaluate_tool_usage(trajectory.steps),
        # Inverse token cost: 1.0 means the run used 1,000 tokens
        "cost_efficiency": 1000 / max(trajectory.total_tokens, 1)
    }
    return metrics

def count_redundant_steps(steps: List[dict]) -> int:
    """Count tool calls that repeat an earlier action verbatim"""
    seen_actions = set()
    redundant = 0
    for step in steps:
        # Uses the "action" key from the step schema documented above
        action_key = str(step.get("action"))
        if action_key in seen_actions:
            redundant += 1
        seen_actions.add(action_key)
    return redundant
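
evaluate_trajectory above calls helpers that this guide does not define. The sketch below shows one possible way to fill in two of them. Both definitions are assumptions made for illustration: efficiency is taken to be the share of distinct actions, and an observation counts as an error when its text starts with "error".

```python
from typing import List


def calculate_efficiency(steps: List[dict]) -> float:
    """Share of steps with a distinct action (1.0 means no repeats)."""
    if not steps:
        return 1.0
    distinct = {str(step.get("action")) for step in steps}
    return len(distinct) / len(steps)


def check_error_recovery(steps: List[dict]) -> bool:
    """True when the trajectory does not end on an errored observation,
    meaning any earlier error was followed by a clean step."""
    if not steps:
        return True
    last_observation = str(steps[-1].get("observation", ""))
    return not last_observation.lower().startswith("error")


sample_steps = [
    {"action": "search('x')", "observation": "error: timeout"},
    {"action": "search('x')", "observation": "3 results"},
    {"action": "calc('1+1')", "observation": "2"},
]
print(calculate_efficiency(sample_steps), check_error_recovery(sample_steps))
```

Real evaluations usually need task-specific definitions, for example an LLM judge for tool appropriateness.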

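Common Agent Failure Modes

Infinite loops: A badly specified completion condition keeps the agent repeating itself
Tool hallucination: The agent calls tools or parameters that do not exist
Context drift: The agent forgets the original goal during a long session
Over-planning: The agent builds elaborate plans for simple tasks
Tool overuse: The agent keeps calling tools it does not need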
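
Several of these failure modes can be caught with cheap checks before a tool call is executed. The sketch below is one possible guard layer, not a feature of any framework: validate_call rejects hallucinated tools and parameters, and check_budget stops runaway loops and repeated calls. The GuardError class, the registry format, and the default limits are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


class GuardError(Exception):
    """Raised when a proposed tool call should not be executed."""


def validate_call(name: str, args: dict, registry: Dict[str, Set[str]]) -> None:
    """Reject tools or parameters that are not in the registry."""
    if name not in registry:
        raise GuardError(f"unknown tool: {name}")
    unexpected = set(args) - registry[name]
    if unexpected:
        raise GuardError(f"unknown parameters for {name}: {sorted(unexpected)}")


def check_budget(history: List[Tuple[str, str]],
                 max_steps: int = 10, max_repeats: int = 2) -> None:
    """Stop when the step limit is hit or the latest call keeps recurring."""
    if len(history) >= max_steps:
        raise GuardError("step limit reached")
    if history and history.count(history[-1]) > max_repeats:
        raise GuardError("same call repeated too often")


registry = {"web_search": {"query", "max_results"}}
validate_call("web_search", {"query": "agents"}, registry)  # passes silently
check_budget([("web_search", "agents")])                    # passes silently
```

Context drift and over-planning are harder to detect mechanically and usually need trajectory review.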

10. 2026 Trends: Computer-use & Coding Agents

Computer-use Agents

Claude's computer use tool and OpenAI's comparable computer-control capability let an agent look at a real computer screen and operate it.

import base64
from io import BytesIO

import anthropic
from PIL import ImageGrab

def take_screenshot() -> str:
    """Capture the screen and return it as a base64-encoded PNG"""
    buffer = BytesIO()
    ImageGrab.grab().save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

client = anthropic.Anthropic()

# Computer-use agent. These tools are in beta, so the request goes through
# client.beta with a beta flag. The tool versions, beta flag, and model must
# match each other; check Anthropic's documentation for the current pairing.
response = client.beta.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,
    betas=["computer-use-2024-10-22"],
    tools=[
        {"type": "computer_20241022", "name": "computer", "display_width_px": 1920, "display_height_px": 1080},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"}
    ],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Open the browser and check the latest trending repositories on GitHub"},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": take_screenshot()}}
        ]
    }]
)

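Coding Agents: Devin and SWE-agent

Approximate performance of recent coding agents on SWE-bench as of 2026:

Agent	SWE-bench Verified	Characteristics
Claude Code	~72%	Terminal integration, codebase understanding
Devin 2.0	~65%	End-to-end development workflow
SWE-agent	~58%	Open source, research-oriented
Aider	~55%	Specialized for local codebases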

11. OpenAI Assistants API

from openai import OpenAI

client = OpenAI()

# Create a vector store and upload the file
# (depending on the SDK version, this namespace may be client.vector_stores)
vector_store = client.beta.vector_stores.create(name="AI paper store")
with open("ai_papers_2026.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=[f]
    )

# Create the assistant (tools + knowledge base)
assistant = client.beta.assistants.create(
    name="AI Technology Analyst",
    instructions="You are an AI/ML expert. Analyze recent papers and technical documents and provide insights.",
    model="gpt-4o",
    tools=[
        {"type": "file_search"},      # file-based RAG
        {"type": "code_interpreter"}  # code execution
    ],
    # Attach the vector store; without this, file_search has nothing to search
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)

# Create a thread and add a message
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Analyze the main limitations of Agentic AI described in the uploaded papers"
)

# Start the run and wait until it reaches a terminal state
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Print the result (messages are listed newest first)
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
else:
    print(f"Run ended with status: {run.status}")

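Quiz: Core Concept Check

Q1. How does the Thought-Action-Observation cycle in ReAct reduce hallucination?

Answer: Step-by-step verification against real-time external evidence

Explanation: A plain LLM generates the whole answer at once, so it can make up facts along the way. ReAct calls a real tool (search, calculation, and so on) at each reasoning step and checks the evidence in the Observation. That real result acts as a factual anchor and keeps later reasoning from drifting away from the evidence. Because intermediate steps are recorded explicitly, it is also easier to find and fix the point where an error occurred.

Q2. In LangGraph, how does a graph with cycles differ from DAG-based LangChain?

Answer: It allows state-based repeated execution and dynamic routing

Explanation: LangChain's LCEL is a directed acyclic graph (DAG): once a step has run, execution cannot return to it. LangGraph supports cycles, so an agent loop such as "call tool → check result → retry" can be expressed directly. Conditional edges choose the next node from the current state, and a checkpointer persists state so that memory carries across sessions. The result mirrors the human try, fail, and revise process in code.

Q3. Why is MCP (Model Context Protocol) more flexible than conventional function calling?

Answer: Its standardized client-server architecture enables an LLM-independent tool ecosystem

Explanation: With conventional function calling, OpenAI, Anthropic, and Google each use a different format, so tools are tied to a specific LLM. MCP defines a standard protocol over stdio or HTTP, so an MCP server built once can be reused from any MCP-compatible client (Claude, Cursor, VS Code, and others). Beyond Tools (executable functions), it also provides Resources (data such as files and databases) and Prompts (reusable prompt templates), which gives the agent richer context.

Q4. What are the benefits of separating the orchestrator from the execution agents in a multi-agent system?

Answer: Separation of concerns, specialization, parallelism, and fault isolation

Explanation: The orchestrator focuses on overall planning and coordination, while each execution agent specializes in one domain (coding, search, writing, and so on). The benefits are: (1) each agent can be optimized independently, (2) several execution agents can work in parallel, which improves speed, (3) one agent's failure does not stop the whole system (fault isolation), (4) new specialist agents are easy to add (extensibility), and (5) each agent's behavior can be audited and logged independently.

Q5. What is the difference between trajectory evaluation and outcome evaluation?

Answer: Outcome evaluation checks final success only; trajectory evaluation also assesses the efficiency and safety of the process

Explanation: Outcome evaluation measures only whether the goal was achieved (0 or 1). It is simple, but it passes runs that reach the right result through a bad process or that cause side effects. Trajectory evaluation analyzes the whole action sequence: whether there were unnecessary steps (efficiency), whether any action was unsafe (safety), whether errors were recovered from properly, and whether token and API costs were reasonable. A production agent should count as failed when it achieves the goal at excessive cost or with side effects, which is why trajectory evaluation is essential.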

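Conclusion

LLM agents have moved beyond the research stage and are creating value in production. Key trends for 2026:

Computer-use agents: General-purpose agents that see the screen and operate it directly
Standardized long-term memory: Memory layers such as mem0 and Zep becoming commonplace
A growing MCP ecosystem: Thousands of MCP servers and tools
Agent safety: Auditing agent behavior, restricting permissions, human oversight
Multimodal agents: Unified handling of text, images, and audio

As a next step, try building a production agent with LangGraph or developing your own MCP server!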

LLM Agents & Agentic AI: The Complete Guide — ReAct, Multi-Agent, MCP and Beyond

Introduction
1. What Is an Agent?
- Traditional LLM vs. Agent
2. ReAct Framework: Reasoning + Acting
3. Chain-of-Thought & Tree-of-Thought
- Chain-of-Thought (CoT)
- Tree-of-Thought (ToT)
4. Memory Systems
5. Tool Integration
6. LangGraph: Stateful Agents
- Stateful Agent Implementation
7. Multi-Agent Systems
- CrewAI: Role-Based Multi-Agent
- AutoGen: Conversation-Based Multi-Agent
8. Claude API Tool Use Implementation
9. Agent Evaluation
10. 2026 Trends: Computer-use & Coding Agents
- Computer-use Agents
- Coding Agents: Devin and SWE-agent
11. OpenAI Assistants API
Quiz: Core Concept Check
Conclusion

Introduction

Between 2024 and 2026, the focus of applied AI moved from simple question-and-answer chatbots toward agents that act autonomously. Given a goal, an LLM agent plans its own steps, calls tools, reviews the results, and iterates until the goal is met.

Devin autonomously resolves GitHub issues. Claude clicks through computer screens to complete tasks. The era of truly agentic AI has arrived.

This guide covers everything from core LLM agent concepts to production-ready implementations.

1. What Is an Agent?

Traditional LLM vs. Agent

A traditional LLM follows a simple input → output pattern. An agent, by contrast:

Perceives: Gathers information from the environment (tool results, user input, memory)
Plans: Decides on a sequence of actions to achieve a goal
Acts: Calls tools, makes API requests, executes code
Reflects: Evaluates results and adjusts the next action

Core components of an agent:

Component	Description
LLM Core	Reasoning and decision-making engine
Tools	Web search, code execution, APIs, etc.
Memory	Short-term and long-term context management
Orchestrator	Agent loop control
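
The perceive, plan, act, reflect loop and the four components above can be sketched in plain Python. This is a toy illustration under loud assumptions: a hard-coded plan method stands in for the LLM core, and MiniAgent and calculator are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]


@dataclass
class MiniAgent:
    """Toy agent: a rule-based policy stands in for the LLM core."""
    tools: Dict[str, Tool]
    memory: List[str] = field(default_factory=list)  # short-term memory
    max_steps: int = 5                               # orchestrator loop limit

    def plan(self, goal: str) -> Optional[Tuple[str, str]]:
        """Plan: pick the next (tool, argument), or None when done."""
        if not self.memory:      # nothing observed yet: act once
            return ("calculator", goal)
        return None              # an observation exists: stop

    def run(self, goal: str) -> str:
        for _ in range(self.max_steps):
            step = self.plan(goal)                    # Plan
            if step is None:
                break
            tool_name, arg = step
            observation = self.tools[tool_name](arg)  # Act
            self.memory.append(observation)           # Perceive
            # Reflect: the next plan() call sees the updated memory
        return self.memory[-1] if self.memory else "no result"


def calculator(expression: str) -> str:
    """Hypothetical tool: adds whitespace-separated integers."""
    return str(sum(int(token) for token in expression.split()))


agent = MiniAgent(tools={"calculator": calculator})
answer = agent.run("2 3 5")
print(answer)
```

In a real agent, plan would be an LLM call that receives the goal and the memory contents.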

2. ReAct Framework: Reasoning + Acting

What Is ReAct?

ReAct (Reasoning + Acting) is a framework proposed by Yao et al. in 2022. The LLM solves problems by repeating a Thought → Action → Observation cycle.

Thought:     Analyze the current situation and decide on the next action
Action:      Call a tool in the form tool_name(arguments)
Observation: Receive the tool execution result
...repeat...
Final Answer: Derive the final answer

Why Does ReAct Reduce Hallucination?

A standard LLM generates the entire answer in one pass, which can lead to "fabricating" facts midway. ReAct addresses this by:

Real-time grounding: Each Observation acts as a factual anchor
Incremental verification: Intermediate results are checked, allowing early error correction
External knowledge access: Actual tools are used for searching and calculation during reasoning
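
To make the cycle concrete, here is a minimal ReAct loop without any framework. It is a sketch under stated assumptions: scripted_llm is a stand-in that returns canned text where a real model call would go, and multiply is a hypothetical tool. What the example shows is the control flow, in which the final answer is built from the Observation rather than from the model's own guess.

```python
import re
from typing import Callable, Dict

ACTION_RE = re.compile(r"Action:\s*(\w+)\((.*)\)")


def scripted_llm(transcript: str) -> str:
    """Stand-in for an LLM: decides based on what the transcript contains."""
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: multiply(6, 7)"
    last_observation = transcript.rsplit("Observation:", 1)[1].strip()
    return (f"Thought: The tool returned {last_observation}.\n"
            f"Final Answer: {last_observation}")


def multiply(raw_args: str) -> str:
    """Hypothetical tool: multiplies two comma-separated integers."""
    a, b = (int(part) for part in raw_args.split(","))
    return str(a * b)


def react_loop(question: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(transcript)                    # Thought (+ Action)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            return "stopped: no action or final answer"
        name, raw_args = match.groups()
        if name in tools:
            observation = tools[name](raw_args)    # Action
        else:
            observation = f"error: unknown tool {name}"
        transcript += f"\nObservation: {observation}"  # Observation
    return "stopped: step limit reached"


final = react_loop("What is 6 times 7?", scripted_llm, {"multiply": multiply})
print(final)
```

The step limit and the unknown-tool branch are the kind of guards a production loop needs as well.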

Python Implementation

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_anthropic import ChatAnthropic
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_experimental.tools import PythonREPLTool

llm = ChatAnthropic(model="claude-opus-4-5", temperature=0)
# PythonREPLTool executes arbitrary code; run it only in a sandboxed environment
tools = [DuckDuckGoSearchRun(), PythonREPLTool()]

# Load the ReAct prompt template from the LangChain Hub
prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = agent_executor.invoke({
    "input": "Search for the top 3 AI agent frameworks in 2026 and create a comparison table"
})

3. Chain-of-Thought & Tree-of-Thought

Chain-of-Thought (CoT)

CoT can substantially improve an LLM's multi-step reasoning with nothing more than a prompt such as "Let's think step by step."

cot_prompt = """
Solve the problem step by step:

Problem: {problem}

Solution process:
1. First, organize the given information.
2. Perform the necessary calculations or reasoning.
3. Verify intermediate results.
4. Derive the final answer.
"""

Tree-of-Thought (ToT)

ToT extends CoT by exploring multiple reasoning paths in a tree structure, using BFS or DFS to select the most promising path.

from langchain_experimental.tot.base import ToTChain

# `llm` is the model created earlier. `checker` must be an instance of a
# ToTChecker subclass that validates each proposed thought; it is not defined in this guide.
tot_chain = ToTChain.from_llm(
    llm=llm,
    checker=checker,
    k=3,    # Maximum number of rounds (thoughts explored)
    c=4,    # Number of child thoughts proposed per node
    verbose=True
)
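
The tree search itself can be illustrated without an LLM. In this sketch the "thoughts" are partial lists of steps, propose plays the role of the thought generator, and score plays the role of the evaluator. The toy task (reach a target sum using steps of 1, 3, or 4) and every function name are assumptions made for the example. It uses breadth-first search with a beam, which is one of the search strategies ToT allows.

```python
from typing import List, Tuple

State = Tuple[int, ...]


def propose(state: State) -> List[State]:
    """Thought generator: extend the partial solution by one step."""
    return [state + (step,) for step in (1, 3, 4)]


def score(state: State, target: int) -> float:
    """Evaluator: closer to the target is better; overshooting is pruned."""
    total = sum(state)
    return float("-inf") if total > target else float(-(target - total))


def tree_of_thought_bfs(target: int, beam_width: int = 2, max_depth: int = 5) -> State:
    frontier: List[State] = [()]
    for _ in range(max_depth):
        # Expand every state in the frontier, then drop pruned candidates
        candidates = [c for s in frontier for c in propose(s)]
        candidates = [c for c in candidates if score(c, target) > float("-inf")]
        if not candidates:
            break
        # Keep only the most promising branches at this level
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
        if score(frontier[0], target) == 0:
            return frontier[0]
    return frontier[0]


solution = tree_of_thought_bfs(target=10)
print(solution, sum(solution))
```

With an LLM, propose would sample candidate thoughts and score would be a model-based or rule-based check of each one.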

4. Memory Systems

Four Types of Agent Memory

Agent memory is designed similarly to the human memory system:

Memory Type	Storage Location	Characteristics
Sensory Memory	Input context	Processes current input
Short-term Memory	Context window	Current conversation session
Long-term Memory	Vector DB / KV store	Permanent knowledge storage
Episodic Memory	Vector DB	Indexes past experiences

mem0: Long-term Memory Integration

mem0 is an open-source library that adds personalized long-term memory to agents.

from mem0 import Memory

# Initialize mem0 (using Qdrant as vector DB)
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "agent_memory",
            "host": "localhost",
            "port": 6333,
        }
    },
    "llm": {
        "provider": "anthropic",
        "config": {
            "model": "claude-opus-4-5",
            "temperature": 0,
        }
    }
}

memory = Memory.from_config(config)
user_id = "user_123"

# Store memory
memory.add(
    messages=[
        {"role": "user", "content": "I mainly use Python and prefer FastAPI"},
        {"role": "assistant", "content": "Got it! I'll remember your Python/FastAPI preference."}
    ],
    user_id=user_id
)

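# Search and use memory
relevant_memories = memory.search(
    query="What programming language does the user prefer?",
    user_id=user_id
)

# Depending on the mem0 version, search() returns a list or a dict with a "results" key
hits = relevant_memories["results"] if isinstance(relevant_memories, dict) else relevant_memories
context = "\n".join(m["memory"] for m in hits)
print(f"Relevant memories: {context}")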

Vector Store-Based Episodic Memory

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from datetime import datetime

class EpisodicMemory:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings()
        self.store = Chroma(
            collection_name="episodes",
            embedding_function=self.embeddings,
            persist_directory="./episodic_memory"
        )

    def store_episode(self, content: str, metadata: dict = None):
        """Store a conversation or task episode in memory"""
        metadata = metadata or {}
        metadata["timestamp"] = datetime.now().isoformat()
        self.store.add_texts([content], metadatas=[metadata])

    def recall(self, query: str, k: int = 3):
        """Retrieve relevant episodes"""
        docs = self.store.similarity_search(query, k=k)
        return [doc.page_content for doc in docs]

# Usage example
memory = EpisodicMemory()
memory.store_episode(
    "User asked about FastAPI project structure. Successfully answered.",
    {"task_type": "coding", "success": True}
)

5. Tool Integration

Standard Tool Categories

Key tools used by agents:

Web Search: Tavily, SerpAPI, DuckDuckGo
Code Execution: Python REPL, Jupyter Kernel
File System: Read, write, and search files
API Calls: REST, GraphQL
Databases: SQL, vector DB queries
Computer Control: Screen capture, clicks, keyboard input

Custom Web Search Tool Implementation

from langchain.tools import BaseTool
from pydantic import BaseModel, Field
import httpx
from typing import Optional

class WebSearchInput(BaseModel):
    query: str = Field(description="The query to search for")
    max_results: int = Field(default=5, description="Number of results to return")

class TavilySearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Searches the web for up-to-date information. Use when real-time information is needed."
    args_schema: type[BaseModel] = WebSearchInput
    api_key: str = ""

    def _run(self, query: str, max_results: int = 5) -> str:
        url = "https://api.tavily.com/search"
        payload = {
            "api_key": self.api_key,
            "query": query,
            "max_results": max_results,
            "include_answer": True,
        }
        response = httpx.post(url, json=payload)
        data = response.json()

        results = []
        if data.get("answer"):
            results.append(f"Summary: {data['answer']}\n")
        for r in data.get("results", []):
            results.append(f"- {r['title']}: {r['content'][:200]}...")
        return "\n".join(results)

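    async def _arun(self, query: str, max_results: int = 5) -> str:
        # Async variant: same request and formatting as _run
        payload = {
            "api_key": self.api_key,
            "query": query,
            "max_results": max_results,
            "include_answer": True,
        }
        async with httpx.AsyncClient() as client:
            response = await client.post("https://api.tavily.com/search", json=payload)
        data = response.json()
        results = [f"Summary: {data['answer']}\n"] if data.get("answer") else []
        results += [f"- {r['title']}: {r['content'][:200]}..." for r in data.get("results", [])]
        return "\n".join(results)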

MCP (Model Context Protocol)

MCP is a standardized tool integration protocol that Anthropic announced in late 2024. Where function calling uses a different format for each LLM provider, MCP standardizes tools through a client-server model.

Key MCP advantages:

Reusability: An MCP server built once can be connected to any MCP-compatible client, whichever LLM sits behind it
Rich context: Provides three abstractions: Resources, Prompts, and Tools
Dynamic discovery: The list of available tools is queried at runtime
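
Under the hood, MCP messages are JSON-RPC 2.0, and tool discovery and invocation use the tools/list and tools/call methods. The sketch below mimics that exchange with an in-process dispatcher so the message shapes are visible. It is a simplified illustration, not an MCP implementation: handle_request, the TOOLS registry, and the canned weather text are hypothetical, and a real server also performs an initialization handshake.

```python
import json

# Hypothetical tool registry; a real server builds this from its handlers
TOOLS = {
    "get_weather": {
        "description": "Retrieves the current weather for a specific city",
        "inputSchema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
}


def _reply(request_id, key: str, value: dict) -> str:
    return json.dumps({"jsonrpc": "2.0", "id": request_id, key: value})


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC request string and return the response string."""
    request = json.loads(raw)
    method, request_id = request["method"], request["id"]
    if method == "tools/list":  # dynamic discovery
        tools = [{"name": name, **spec} for name, spec in TOOLS.items()]
        return _reply(request_id, "result", {"tools": tools})
    if method == "tools/call":
        params = request["params"]
        if params["name"] not in TOOLS:
            return _reply(request_id, "error", {"code": -32602, "message": "Unknown tool"})
        city = params["arguments"]["city"]
        text = f"Sunny in {city}"  # canned answer, no real weather lookup
        return _reply(request_id, "result",
                      {"content": [{"type": "text", "text": text}], "isError": False})
    return _reply(request_id, "error", {"code": -32601, "message": "Method not found"})


listing = json.loads(handle_request(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})))
call = json.loads(handle_request(json.dumps(
    {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
     "params": {"name": "get_weather", "arguments": {"city": "Seoul"}}})))
print(listing["result"]["tools"][0]["name"])
print(call["result"]["content"][0]["text"])
```

Because the client learns tool names and schemas from the tools/list response, the same client code works against any server.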

# MCP server implementation (Python SDK)
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp import types

app = Server("my-tool-server")

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="get_weather",
            description="Retrieves the current weather for a specific city",
            inputSchema={
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        )
    ]

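@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "get_weather":
        city = arguments["city"]
        # fetch_weather is a placeholder for your own async weather API client
        weather_data = await fetch_weather(city)
        return [types.TextContent(type="text", text=str(weather_data))]
    # Fail loudly instead of silently returning None for unknown tools
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())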

6. LangGraph: Stateful Agents

LangGraph is a graph-based agent orchestration framework built by the LangChain team. Unlike the DAG-based LangChain Expression Language (LCEL), LangGraph supports cycles, naturally expressing agent loops.
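
The difference between a one-way pipeline and a cyclic graph is easiest to see in plain Python. This sketch is not LangGraph code. It models two nodes, a fixed edge from the tool node back to the agent node, and a routing function acting as the conditional edge. The state fields (value, goal, attempts) and all function names are invented for the illustration.

```python
from typing import Dict

State = Dict[str, int]


def agent_node(state: State) -> State:
    """Counts how many times the agent has been consulted."""
    return {**state, "attempts": state["attempts"] + 1}


def tool_node(state: State) -> State:
    """Pretend tool: doubles the working value."""
    return {**state, "value": state["value"] * 2}


def route(state: State) -> str:
    """Conditional edge: keep looping until the goal or the attempt cap is hit."""
    if state["value"] < state["goal"] and state["attempts"] < 10:
        return "tools"
    return "end"


def run_graph(state: State) -> State:
    node = "agent"
    while node != "end":
        if node == "agent":
            state = agent_node(state)
            node = route(state)   # conditional edge
        else:
            state = tool_node(state)
            node = "agent"        # fixed edge back to the agent: the cycle
    return state


final_state = run_graph({"value": 1, "goal": 8, "attempts": 0})
print(final_state)
```

A DAG cannot express the edge from the tool node back to the agent node, and that edge is what makes retry loops possible.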

Stateful Agent Implementation

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import BaseMessage, HumanMessage, ToolMessage
from langchain_experimental.tools import PythonREPLTool
from typing import TypedDict, Annotated, Sequence
import operator

# 1. Define state
class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]  # new messages are appended
    iteration_count: int

# 2. Set up LLM and tools
llm = ChatAnthropic(model="claude-opus-4-5")
# TavilySearchTool is the custom tool from section 5; supply your own API key
tools = [TavilySearchTool(api_key="YOUR_TAVILY_KEY"), PythonREPLTool()]
llm_with_tools = llm.bind_tools(tools)

# 3. Define nodes
def call_model(state: AgentState) -> AgentState:
    """LLM call node"""
    response = llm_with_tools.invoke(state["messages"])
    return {
        "messages": [response],
        "iteration_count": state["iteration_count"] + 1
    }

def call_tools(state: AgentState) -> AgentState:
    """Tool execution node"""
    last_message = state["messages"][-1]
    tool_results = []

    for tool_call in last_message.tool_calls:
        tool = next(t for t in tools if t.name == tool_call["name"])
        result = tool.invoke(tool_call["args"])
        tool_results.append(
            ToolMessage(content=str(result), tool_call_id=tool_call["id"])
        )
    return {"messages": tool_results}

# 4. Routing function (conditional edges)
def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        if state["iteration_count"] < 10:  # Prevent infinite loops
            return "tools"
    return "end"

# 5. Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)

graph.set_entry_point("agent")
graph.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": END}
)
graph.add_edge("tools", "agent")  # Return to agent after tool execution

# 6. Add memory checkpointing
from langgraph.checkpoint.memory import MemorySaver
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# Run (manage conversation sessions with thread_id)
config = {"configurable": {"thread_id": "session_001"}}
result = app.invoke(
    {"messages": [HumanMessage(content="Search for AI agent trends in 2026 and summarize them")], "iteration_count": 0},
    config=config
)
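
One detail worth spelling out is the `operator.add` annotation on `messages`. It tells LangGraph to combine a node's returned value with the existing one (list concatenation) rather than overwrite it, while fields without a reducer are simply replaced. The following is a rough, dependency-free imitation of that merge rule. `apply_update` and `DemoState` are hypothetical names written for illustration and do not reflect LangGraph's internal implementation.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints

class DemoState(TypedDict):
    messages: Annotated[list, operator.add]  # reducer: append new items
    iteration_count: int                     # no reducer: last write wins

def apply_update(state: dict, update: dict, schema: type) -> dict:
    """Merge a node's partial update into the state, honoring reducers."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] types carry their extra arguments in __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in state:
            reducer = metadata[0]
            merged[key] = reducer(state[key], value)
        else:
            merged[key] = value
    return merged

state = {"messages": ["user: hi"], "iteration_count": 0}
state = apply_update(state, {"messages": ["ai: hello"], "iteration_count": 1}, DemoState)
state = apply_update(state, {"messages": ["tool: 42"]}, DemoState)
```

This is why the nodes above return only the new messages: the reducer takes care of appending them to the history.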

7. Multi-Agent Systems

CrewAI: Role-Based Multi-Agent

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, FileWriterTool

search_tool = SerperDevTool()
file_writer = FileWriterTool()

researcher = Agent(
    role="AI Researcher",
    goal="Conduct in-depth research on the latest AI agent technology trends",
    backstory="""You are an expert researcher in the AI field.
    You analyze the latest papers, blogs, and GitHub repositories to extract key insights.""",
    tools=[search_tool],
    llm="claude-opus-4-5",
    verbose=True
)

writer = Agent(
    role="Technical Writer",
    goal="Write readable technical reports from research results",
    backstory="""You are a professional writer who explains complex AI concepts clearly.""",
    tools=[file_writer],
    llm="claude-opus-4-5",
    verbose=True
)

research_task = Task(
    description="Research the Top 5 LLM agent trends of 2026. Include specific examples and impacts for each trend.",
    expected_output="Detailed analysis of 5 trends (500+ words each)",
    agent=researcher
)

writing_task = Task(
    description="Write a technical blog post based on the research results.",
    expected_output="A 2000-word technical blog post in Markdown format",
    agent=writer,
    output_file="ai_trends_2026.md"
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True
)

result = crew.kickoff()
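
With `Process.sequential`, tasks run in the order listed and each task's output is made available to the next one as context. The sketch below mimics only that hand-off in plain Python. `SimpleTask` and `run_sequential` are hypothetical stand-ins, and the lambdas replace real LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SimpleTask:
    name: str
    run: Callable[[str], str]  # receives the previous task's output as context

def run_sequential(tasks: List[SimpleTask], initial_context: str = "") -> List[str]:
    """Run tasks in order, feeding each output into the next task."""
    outputs: List[str] = []
    context = initial_context
    for task in tasks:
        output = task.run(context)
        outputs.append(output)
        context = output  # hand the result to the next task
    return outputs

research = SimpleTask("research", lambda ctx: "trend notes")
writing = SimpleTask("writing", lambda ctx: f"blog post based on: {ctx}")
outputs = run_sequential([research, writing])
```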

AutoGen: Conversation-Based Multi-Agent

AutoGen is Microsoft's multi-agent framework in which agents collaborate by exchanging messages in a shared conversation.

import autogen

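# "api_type": "anthropic" routes requests to Anthropic instead of the default
# OpenAI client (this assumes AutoGen is installed with its Anthropic extra)
config_list = [{"model": "claude-opus-4-5", "api_key": "YOUR_KEY", "api_type": "anthropic"}]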

orchestrator = autogen.AssistantAgent(
    name="Orchestrator",
    system_message="""You are an orchestrator who coordinates the team.
    You analyze tasks and delegate them to the appropriate specialist agents.
    You integrate all results to generate the final answer.""",
    llm_config={"config_list": config_list}
)

coder = autogen.AssistantAgent(
    name="Coder",
    system_message="You are an expert at writing and executing Python code.",
    llm_config={"config_list": config_list}
)

user_proxy = autogen.UserProxyAgent(
    name="UserProxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding", "use_docker": False}
)

groupchat = autogen.GroupChat(
    agents=[orchestrator, coder, user_proxy],
    messages=[],
    max_round=12
)
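# The manager needs its own llm_config to pick the next speaker automatically
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})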
user_proxy.initiate_chat(manager, message="Write a data visualization script")

8. Claude API Tool Use Implementation

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_stock_price",
        "description": "Retrieves the current stock price and change percentage for a specific ticker",
        "input_schema": {
            "type": "object",
            "properties": {
                "symbol": {
                    "type": "string",
                    "description": "Stock ticker symbol (e.g., AAPL, MSFT)"
                },
                "currency": {
                    "type": "string",
                    "enum": ["USD", "KRW"],
                    "description": "Display currency"
                }
            },
            "required": ["symbol"]
        }
    }
]

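def process_tool_call(tool_name: str, tool_input: dict) -> str:
    if tool_name == "get_stock_price":
        # Hard-coded stub data; replace with a real market data lookup
        return json.dumps({
            "symbol": tool_input["symbol"],
            "price": 185.92,
            "change_percent": "+2.3%",
            "currency": tool_input.get("currency", "USD")
        })
    # Return an explicit error payload instead of None for unknown tools
    return json.dumps({"error": f"Unknown tool: {tool_name}"})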

messages = [{"role": "user", "content": "What's the current Apple stock price?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )

    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "end_turn":
        final_text = next(
            block.text for block in response.content
            if hasattr(block, "text")
        )
        print(f"Final answer: {final_text}")
        break

    if response.stop_reason == "tool_use":
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = process_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        messages.append({"role": "user", "content": tool_results})
        continue

    # Any other stop reason (e.g. "max_tokens") would otherwise loop forever
    print(f"Stopped early: {response.stop_reason}")
    break
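
When a tool fails, the loop can tell the model so rather than crash: a `tool_result` block accepts an `is_error` flag, which lets the model retry or explain the failure. The sketch below shows one possible way to wrap tool execution with that flag. `TOOL_REGISTRY`, `build_tool_result`, and the stub tool are hypothetical names for this example, not part of the Anthropic SDK.

```python
import json
from typing import Callable, Dict

def get_stock_price(symbol: str, currency: str = "USD") -> dict:
    """Stub returning fixed data, in the same spirit as the example above."""
    return {"symbol": symbol, "price": 185.92, "currency": currency}

# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., dict]] = {"get_stock_price": get_stock_price}

def build_tool_result(tool_use_id: str, name: str, tool_input: dict) -> dict:
    """Run a tool and wrap the outcome as a tool_result content block."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        content, is_error = f"Unknown tool: {name}", True
    else:
        try:
            content, is_error = json.dumps(handler(**tool_input)), False
        except Exception as exc:  # bad arguments or a failure inside the tool
            content, is_error = f"Tool {name} failed: {exc}", True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": content,
        "is_error": is_error,
    }

ok_block = build_tool_result("toolu_1", "get_stock_price", {"symbol": "AAPL"})
unknown_block = build_tool_result("toolu_2", "no_such_tool", {})
bad_args_block = build_tool_result("toolu_3", "get_stock_price", {})
```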

9. Agent Evaluation

Key Benchmarks

Benchmark	Measurement Area	Characteristics
AgentBench	8 environments (OS, DB, games, etc.)	Real-environment-based evaluation
GAIA	General AI assistance capabilities	Comparison with human-level performance
SWE-bench	Software engineering	Solves real GitHub issues
WebArena	Web navigation ability	Manipulates real websites
OSWorld	Computer usage ability	GUI interaction

Trajectory Evaluation vs. Outcome Evaluation

There are two key perspectives in agent evaluation:

Outcome Evaluation:

Measures only whether the final goal was achieved
Typical metrics: Pass@k, success rate
Simple but ignores the process
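
For reference, Pass@k is commonly computed with the unbiased estimator used in code-generation benchmarks: given n sampled attempts of which c succeed, the chance that at least one of k randomly chosen attempts succeeds is 1 - C(n-c, k) / C(n, k). A minimal sketch follows. The function names are chosen for this example.

```python
from math import comb
from typing import List

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def success_rate(outcomes: List[bool]) -> float:
    """Plain fraction of tasks whose final goal was achieved."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Both numbers look only at final outcomes, which is exactly the limitation noted above.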

Trajectory Evaluation:

Evaluates the entire action sequence toward the goal
Jointly measures efficiency, safety, and absence of side effects
Essential in production environments

from dataclasses import dataclass
from typing import List

@dataclass
class AgentTrajectory:
    task: str
    steps: List[dict]  # {"thought": ..., "action": ..., "observation": ...}
    final_answer: str
    success: bool
    total_tokens: int

def evaluate_trajectory(trajectory: AgentTrajectory) -> dict:
    """Trajectory-based agent evaluation"""
    # calculate_efficiency, check_error_recovery, and evaluate_tool_usage are
    # scoring helpers you supply; they are not defined in this snippet
    metrics = {
        "task_success": trajectory.success,
        "efficiency": calculate_efficiency(trajectory.steps),
        "redundant_steps": count_redundant_steps(trajectory.steps),
        "error_recovery": check_error_recovery(trajectory.steps),
        "tool_usage_appropriateness": evaluate_tool_usage(trajectory.steps),
        # Guard against division by zero when no tokens were recorded
        "cost_efficiency": 1000 / max(trajectory.total_tokens, 1)
    }
    return metrics

def count_redundant_steps(steps: List[dict]) -> int:
    """Count unnecessary duplicate tool calls"""
    seen_actions = set()
    redundant = 0
    for step in steps:
        # Steps store the tool call under "action" (see AgentTrajectory.steps)
        action_key = str(step.get("action"))
        if action_key in seen_actions:
            redundant += 1
        seen_actions.add(action_key)
    return redundant
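
The evaluation code above relies on helpers that are not shown. As a starting point, here is one possible way to sketch two of them. Both definitions are assumptions rather than standard metrics: efficiency is taken as the share of non-repeated actions, and a failed observation is assumed, purely for illustration, to start with the string "ERROR".

```python
from typing import List

def calculate_efficiency(steps: List[dict]) -> float:
    """Share of steps whose action is not an exact repeat (1.0 means no waste)."""
    if not steps:
        return 1.0
    unique_actions = {str(step.get("action")) for step in steps}
    return len(unique_actions) / len(steps)

def check_error_recovery(steps: List[dict]) -> bool:
    """True if the trajectory does not end on a failed observation.

    Assumes (hypothetically) that failed observations start with 'ERROR'.
    """
    if not steps:
        return True
    last_observation = str(steps[-1].get("observation", ""))
    return not last_observation.startswith("ERROR")

demo_steps = [
    {"thought": "look it up", "action": "search(a)", "observation": "ERROR: timeout"},
    {"thought": "retry", "action": "search(a)", "observation": "ok"},
    {"thought": "compute", "action": "calc(1+1)", "observation": "2"},
]
demo_efficiency = calculate_efficiency(demo_steps)
demo_recovered = check_error_recovery(demo_steps)
```

Real evaluations would likely weight steps by cost or use an LLM judge. The point here is only that trajectory metrics are computed from the recorded steps, not from the final answer.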

Common Agent Failure Modes

Infinite loops: Incorrect goal-completion conditions causing repetition
Tool hallucination: Calling non-existent tools or parameters
Context drift: Forgetting the original goal in long sessions
Over-planning: Unnecessary planning for simple tasks
Tool overuse: Continuously calling tools that are not needed
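
Two of these failure modes, infinite loops and tool hallucination, can be caught with cheap runtime checks before a tool is ever executed. The guards below are a minimal sketch under simple assumptions: a loop is approximated as the same action repeated several times in a row, and a tool schema is reduced to a set of required parameter names. Both function names are hypothetical.

```python
from typing import Dict, List, Set

def is_stuck(actions: List[str], window: int = 3) -> bool:
    """True when the last `window` actions are identical (a likely loop)."""
    if len(actions) < window:
        return False
    return len(set(actions[-window:])) == 1

def validate_tool_call(
    name: str, arguments: dict, schemas: Dict[str, Set[str]]
) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid.

    `schemas` maps each known tool name to its required parameter names.
    """
    if name not in schemas:
        return [f"unknown tool: {name}"]
    missing = sorted(schemas[name] - set(arguments))
    return [f"missing parameter: {param}" for param in missing]

known_tools = {"get_weather": {"city"}}
```

An orchestrator could call `validate_tool_call` on every proposed action and feed any problems back to the model as an observation, and abort or escalate to a human when `is_stuck` fires.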

10. 2026 Trends: Computer-use & Coding Agents

Computer-use Agents

Claude's computer use tool (a beta feature of the Messages API) and GPT-4o-based computer control capabilities let agents see actual computer screens through screenshots and manipulate them with mouse and keyboard actions.

import anthropic
import base64
from PIL import ImageGrab

def take_screenshot() -> str:
    """Capture screen and encode as base64"""
    screenshot = ImageGrab.grab()
    screenshot.save("/tmp/screenshot.png")
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.b64encode(f.read()).decode()

client = anthropic.Anthropic()

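# Computer use is a beta feature: call it through client.beta.messages and pass
# the beta flag that matches the tool versions below. Tool versions are tied to
# model generations, so check the current docs for the pair your model expects.
response = client.beta.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,
    betas=["computer-use-2024-10-22"],
    tools=[
        {"type": "computer_20241022", "name": "computer", "display_width_px": 1920, "display_height_px": 1080},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"}
    ],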
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Open a browser and check the latest trending GitHub repositories"},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": take_screenshot()}}
        ]
    }]
)
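
A practical detail when driving a real screen: screenshots are often downscaled before being sent to the model, so any click coordinates the model returns have to be mapped back to the physical display. The helper below is a generic sketch of that mapping. The function name and the resolutions are illustrative assumptions, and nothing here is part of the Anthropic SDK.

```python
from typing import Tuple

def scale_to_display(
    point: Tuple[int, int],
    model_size: Tuple[int, int],
    display_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point in the model's screenshot space to real display pixels."""
    x, y = point
    model_w, model_h = model_size
    display_w, display_h = display_size
    if not (0 <= x <= model_w and 0 <= y <= model_h):
        raise ValueError(f"point {point} lies outside the screenshot {model_size}")
    # Scale each axis independently, then clamp to the last valid pixel index.
    scaled_x = min(round(x * display_w / model_w), display_w - 1)
    scaled_y = min(round(y * display_h / model_h), display_h - 1)
    return scaled_x, scaled_y

# Model saw a 1280x720 screenshot of a 1920x1080 display.
center_click = scale_to_display((640, 360), (1280, 720), (1920, 1080))
corner_click = scale_to_display((1280, 720), (1280, 720), (1920, 1080))
```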

Coding Agents: Devin and SWE-agent

Approximate SWE-bench Verified scores for leading coding agents in 2026 (results vary with the underlying model version and scaffold):

Agent	SWE-bench Verified	Characteristics
Claude Code	~72%	Terminal integration, codebase understanding
Devin 2.0	~65%	Full development workflow
SWE-agent	~58%	Open-source, research use
Aider	~55%	Specialized for local codebase

11. OpenAI Assistants API

from openai import OpenAI
import time

client = OpenAI()

assistant = client.beta.assistants.create(
    name="AI Technology Analyst",
    instructions="You are an AI/ML technology expert. Analyze the latest papers and technical documents to provide insights.",
    model="gpt-4o",
    tools=[
        {"type": "file_search"},     # File-based RAG
        {"type": "code_interpreter"}  # Code execution
    ]
)

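# Upload file and create vector store
# (in recent SDK releases this namespace may live at client.vector_stores)
vector_store = client.beta.vector_stores.create(name="AI Papers Repository")
with open("ai_papers_2026.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=[("ai_papers_2026.pdf", f)]
    )

# Attach the vector store so file_search can actually see the uploaded file
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)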

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Analyze the main limitations of Agentic AI from the uploaded papers"
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)

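while run.status in ["queued", "in_progress"]:
    time.sleep(1)  # wait before polling again
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

if run.status == "completed":
    # Messages are listed newest first, so data[0] is the assistant's reply
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
else:
    print(f"Run ended with status: {run.status}")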
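
A fixed one-second polling loop works for demos, but longer runs usually benefit from a timeout and a growing delay between checks. The sketch below separates that logic from the API call by taking a `fetch_status` callable, which in practice would wrap the run retrieval shown above. `wait_for_run` and its parameters are hypothetical names for this example.

```python
import time
from typing import Callable

PENDING_STATES = {"queued", "in_progress", "cancelling"}

def wait_for_run(
    fetch_status: Callable[[], str],
    timeout_s: float = 60.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run leaves a pending state; return the final status."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status not in PENDING_STATES:
            return status  # e.g. completed, failed, requires_action, expired
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout_s} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff with a cap

# Dry run with canned statuses and a recording sleep function.
fake_statuses = iter(["queued", "in_progress", "completed"])
recorded_delays = []
final_status = wait_for_run(lambda: next(fake_statuses), sleep=recorded_delays.append)
```

Injecting `sleep` keeps the helper easy to test and lets the caller decide how to handle non-success statuses such as `requires_action`.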

Quiz: Core Concept Check

Q1. How does the Thought-Action-Observation cycle in the ReAct framework reduce hallucination?

Answer: Step-by-step verification through real-time external grounding

Explanation: A standard LLM generates the entire answer in one pass, leaving room to "fabricate" facts midway through. ReAct calls actual tools (search, calculation, etc.) at each reasoning step and verifies the facts through Observations. These real results act as "factual anchors," preventing subsequent reasoning from drifting without a basis. Because intermediate steps are explicitly recorded, errors can be identified and corrected at the exact point they occurred.

Q2. How does a graph with cycles in LangGraph differ from DAG-based LangChain?

Answer: Enables state-based iterative execution and dynamic routing

Explanation: LangChain's LCEL composes steps into a Directed Acyclic Graph (DAG): execution flows forward and cannot loop back to an earlier step. LangGraph supports cycles, naturally expressing agent loops like "call tool → check result → retry." Conditional edges dynamically determine the next node based on current state, and checkpointers save state per thread_id so a conversation can be resumed later (surviving a process restart requires a durable checkpointer rather than the in-memory MemorySaver). This mirrors the human "try-error-correct" thought process in code.

Q3. Why is MCP (Model Context Protocol) more flexible than traditional Function Calling?

Answer: A standardized server-client architecture forms an LLM-independent tool ecosystem

Explanation: Traditional Function Calling uses different formats for OpenAI, Anthropic, and Google, making tools tied to specific LLMs. MCP defines a standard protocol over stdio or HTTP, so an MCP server built once can be reused in any MCP-supporting client (Claude, Cursor, VS Code, etc.). Beyond Tools (executable functions), MCP also provides Resources (files, databases, and other data) and Prompts (reusable prompt templates), offering agents much richer context.

Q4. What are the benefits of separating the orchestrator agent from executor agents in a multi-agent system?

Answer: Separation of concerns, specialization, parallel processing, and error isolation

Explanation: The orchestrator focuses exclusively on high-level planning and coordination, while executor agents specialize in specific domains (coding, search, writing, etc.). The benefits are: (1) each agent can be independently optimized; (2) multiple executor agents can work in parallel for higher throughput; (3) the failure of one agent does not halt the entire system (error isolation); (4) new specialist agents can be added easily (scalability); (5) each agent's actions can be independently audited and logged.
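
Points (2) and (3), parallel execution and error isolation, can be illustrated with nothing more than `asyncio`. In the sketch below the orchestrator fans a task out to two executors and collects results even though one of them fails. The agent functions are stubs standing in for real LLM-backed agents, and all names are assumptions for this example.

```python
import asyncio
from typing import Awaitable, Callable, Dict

async def search_agent(task: str) -> str:
    await asyncio.sleep(0)  # stand-in for a slow search or LLM call
    return f"search results for {task}"

async def coding_agent(task: str) -> str:
    await asyncio.sleep(0)
    raise RuntimeError("sandbox unavailable")  # simulated executor failure

async def orchestrate(
    task: str, executors: Dict[str, Callable[[str], Awaitable[str]]]
) -> Dict[str, str]:
    """Run all executors concurrently and isolate individual failures."""
    names = list(executors)
    # return_exceptions=True keeps one failure from cancelling the others.
    results = await asyncio.gather(
        *(executors[name](task) for name in names), return_exceptions=True
    )
    report: Dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report[name] = f"failed: {result}"
        else:
            report[name] = result
    return report

report = asyncio.run(
    orchestrate("plot sales data", {"search": search_agent, "coder": coding_agent})
)
```

The orchestrator still receives the search output and can decide whether to retry the coder, reassign the work, or answer with partial results.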

Q5. What is the difference between trajectory evaluation and outcome evaluation in agent assessment?

Answer: Outcome measures final success only; trajectory also evaluates process efficiency and safety

Explanation: Outcome Evaluation measures only whether the goal was achieved (0 or 1). It is simple, but allows passing even when the correct result is reached via a bad process or with side effects. Trajectory Evaluation analyzes the entire action sequence: whether there were unnecessary steps (efficiency), whether unsafe actions were taken (safety), whether errors were recovered appropriately, and whether token and API costs were reasonable. Production agents must treat "achieving the goal at excessive cost or with side effects" as a failure, making Trajectory Evaluation essential.

Conclusion

LLM agents have moved beyond the research stage and are now creating real value in production. Key trends for 2026:

Computer-use agents: General-purpose agents that see and manipulate screens directly
Long-term memory standardization: Widespread adoption of memory layers like mem0 and Zep
MCP ecosystem expansion: Thousands of MCP servers and tools
Agent safety: Agent action auditing, permission restrictions, human oversight
Multimodal agents: Integrated processing of text, images, and audio

As a next step, challenge yourself to build a production agent with LangGraph or develop an MCP server!