Split View: LLM 프롬프트 엔지니어링 고급 기법: Chain-of-Thought·Tree-of-Thought·ReAct·Few-Shot 패턴 실전 가이드
LLM 프롬프트 엔지니어링 고급 기법: Chain-of-Thought·Tree-of-Thought·ReAct·Few-Shot 패턴 실전 가이드
- 들어가며
- 프롬프팅 기법 분류 체계
- Zero-shot과 Few-shot 프롬프팅
- Chain-of-Thought (CoT) 프롬프팅
- Self-Consistency 디코딩
- Tree-of-Thought (ToT) 프레임워크
- ReAct: 추론과 행동의 결합
- 구조화된 출력 프롬프팅
- 프롬프트 체이닝
- 프롬프팅 기법 성능 비교
- 일반적인 안티패턴
- 프로덕션 최적화
- 운영 시 주의사항
- 마치며
- 참고자료

들어가며
프롬프트 엔지니어링은 LLM의 잠재 능력을 최대한 끌어내는 핵심 기술이다. 2022년 Wei 등이 발표한 Chain-of-Thought 논문은 "프롬프트에 추론 과정을 포함하면 모델의 추론 능력이 비약적으로 향상된다"는 것을 증명하며, 프롬프트 엔지니어링을 하나의 독립된 연구 분야로 확립시켰다.
이후 Self-Consistency, Tree-of-Thought, ReAct 등의 고급 기법이 연이어 등장하며, 단순한 질문-답변 패턴을 넘어 복잡한 추론, 계획, 외부 도구 활용까지 LLM의 활용 범위를 크게 넓혔다. 특히 ReAct 패턴은 현재 대부분의 AI 에이전트 프레임워크(LangChain, AutoGen 등)의 핵심 아키텍처로 자리 잡았다.
이 글에서는 각 프롬프팅 기법의 이론적 배경, 논문 핵심 발견, Python 구현 코드, 성능 비교, 안티패턴, 프로덕션 최적화 전략을 체계적으로 다룬다.
프롬프팅 기법 분류 체계
프롬프팅 기법은 다음과 같이 분류할 수 있다.
| 분류 | 기법 | 핵심 아이디어 | 논문 |
|---|---|---|---|
| 기본 | Zero-shot | 예시 없이 지시만으로 수행 | - |
| 기본 | Few-shot | 소수 예시 제공 | Brown et al. 2020 |
| 추론 강화 | Chain-of-Thought | 중간 추론 단계 생성 | Wei et al. 2022 |
| 추론 강화 | Zero-shot CoT | "단계별로 생각하자" 한 문장 추가 | Kojima et al. 2022 |
| 앙상블 | Self-Consistency | 다수 경로 샘플링 + 다수결 | Wang et al. 2022 |
| 탐색 | Tree-of-Thought | 트리 구조 추론 경로 탐색 | Yao et al. 2023 |
| 에이전트 | ReAct | 추론 + 행동 + 관찰 루프 | Yao et al. 2022 |
| 구조화 | Structured Output | JSON/XML 형식 강제 출력 | - |
| 조합 | Prompt Chaining | 작업 분해 + 순차 실행 | - |
Zero-shot과 Few-shot 프롬프팅
Zero-shot 프롬프팅
예시 없이 지시문만으로 모델에게 작업을 수행시키는 가장 기본적인 방식이다. 최근 대규모 모델(GPT-4, Claude 3.5 등)의 성능 향상으로 많은 작업에서 Zero-shot만으로도 충분한 성능을 달성할 수 있게 되었다.
from openai import OpenAI
client = OpenAI()
def zero_shot_classification(text: str) -> str:
"""Zero-shot 텍스트 분류"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"You are a text classifier. "
"Classify the given text into one of the following categories: "
"Technology, Business, Science, Sports, Entertainment. "
"Respond with only the category name."
)
},
{"role": "user", "content": text}
],
temperature=0,
max_tokens=20,
)
return response.choices[0].message.content.strip()
Few-shot 프롬프팅
Few-shot 프롬프팅은 프롬프트에 소수의 입출력 예시를 포함하여 모델이 패턴을 학습하도록 유도한다. Brown 등(2020)의 GPT-3 논문에서 체계적으로 제시되었으며, 특히 일관된 형식의 출력이 필요한 작업에서 효과적이다.
def few_shot_entity_extraction(text: str) -> str:
"""Few-shot 개체명 추출"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "Extract named entities from the given text in the specified format."
},
{
"role": "user",
"content": "Samsung Electronics announced the Galaxy S25 series at CES 2025 in Las Vegas."
},
{
"role": "assistant",
"content": (
"- Organization: Samsung Electronics\n"
"- Product: Galaxy S25\n"
"- Event: CES 2025\n"
"- Location: Las Vegas"
)
},
{
"role": "user",
"content": "Elon Musk revealed that Tesla will open a new Gigafactory in Austin, Texas in March 2026."
},
{
"role": "assistant",
"content": (
"- Person: Elon Musk\n"
"- Organization: Tesla\n"
"- Facility: Gigafactory\n"
"- Location: Austin, Texas\n"
"- Date: March 2026"
)
},
{"role": "user", "content": text}
],
temperature=0,
)
return response.choices[0].message.content
# Few-shot 예시 선택 전략
class FewShotSelector:
"""동적 Few-shot 예시 선택기"""
def __init__(self, examples, embedding_model="text-embedding-3-small"):
self.examples = examples
self.client = OpenAI()
self.embedding_model = embedding_model
self._precompute_embeddings()
def _precompute_embeddings(self):
"""모든 예시의 임베딩 사전 계산"""
texts = [ex["input"] for ex in self.examples]
response = self.client.embeddings.create(
model=self.embedding_model,
input=texts
)
self.embeddings = [r.embedding for r in response.data]
def select(self, query: str, k: int = 3) -> list:
"""쿼리와 가장 유사한 k개 예시 선택"""
query_emb = self.client.embeddings.create(
model=self.embedding_model,
input=[query]
).data[0].embedding
# 코사인 유사도 계산
import numpy as np
similarities = []
for emb in self.embeddings:
sim = np.dot(query_emb, emb) / (
np.linalg.norm(query_emb) * np.linalg.norm(emb)
)
similarities.append(sim)
# 상위 k개 선택
top_indices = np.argsort(similarities)[-k:][::-1]
return [self.examples[i] for i in top_indices]
Chain-of-Thought (CoT) 프롬프팅
핵심 원리
Wei 등(2022)의 Chain-of-Thought 논문은 프롬프트에 중간 추론 단계를 포함시키면 LLM의 복잡한 추론 능력이 비약적으로 향상된다는 것을 증명했다. 핵심 발견은 다음과 같다.
- 540B 파라미터 PaLM 모델에서 GSM8K(수학 문제) 정확도가 17.9%에서 **58.1%**로 향상
- CoT는 모델 크기가 충분히 클 때(100B+) 효과가 나타나는 창발적 능력(emergent ability)
- 8개의 CoT 예시만으로도 GPT-3(파인튜닝) + 검증기 성능을 초과
def chain_of_thought_math(problem: str) -> dict:
"""Chain-of-Thought 수학 문제 풀이"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"You are a math tutor. Solve problems step by step.\n"
"Show your reasoning clearly before giving the final answer.\n"
"Format: Step 1, Step 2, ..., Final Answer: [number]"
)
},
{
"role": "user",
"content": (
"A store has 45 apples. They sell 1/3 of them in the morning "
"and 2/5 of the remaining in the afternoon. "
"How many apples are left?"
)
},
{
"role": "assistant",
"content": (
"Step 1: Calculate apples sold in the morning.\n"
"Morning sales = 45 * (1/3) = 15 apples\n\n"
"Step 2: Calculate remaining apples after morning.\n"
"Remaining = 45 - 15 = 30 apples\n\n"
"Step 3: Calculate apples sold in the afternoon.\n"
"Afternoon sales = 30 * (2/5) = 12 apples\n\n"
"Step 4: Calculate final remaining apples.\n"
"Final remaining = 30 - 12 = 18 apples\n\n"
"Final Answer: 18"
)
},
{"role": "user", "content": problem}
],
temperature=0,
)
answer_text = response.choices[0].message.content
# 최종 답변 추출
import re
match = re.search(r"Final Answer:\s*(\d+)", answer_text)
final_answer = int(match.group(1)) if match else None
return {
"reasoning": answer_text,
"answer": final_answer,
"tokens_used": response.usage.total_tokens,
}
Zero-shot CoT
Kojima 등(2022)은 단순히 "Let's think step by step" 이라는 한 문장을 추가하는 것만으로도 CoT 효과를 얻을 수 있음을 발견했다. 이는 별도의 예시 작성이 필요 없어 실용적으로 매우 유용하다.
def zero_shot_cot(problem: str) -> str:
"""Zero-shot Chain-of-Thought"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": f"{problem}\n\nLet's think step by step."
}
],
temperature=0,
)
return response.choices[0].message.content
Self-Consistency 디코딩
Wang 등(2022)의 Self-Consistency는 CoT의 단일 그리디 디코딩 대신 여러 추론 경로를 샘플링하고 다수결로 최종 답변을 결정한다. GSM8K에서 CoT 대비 +17.9% 정확도 향상을 달성했다.
import collections
import re
def self_consistency(problem: str, num_samples: int = 5) -> dict:
"""Self-Consistency 디코딩"""
answers = []
reasoning_paths = []
for i in range(num_samples):
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"Solve the math problem step by step. "
"End with 'Final Answer: [number]'"
)
},
{"role": "user", "content": problem}
],
temperature=0.7, # 다양한 추론 경로를 위해 temperature 상향
max_tokens=500,
)
text = response.choices[0].message.content
reasoning_paths.append(text)
# 답변 추출
match = re.search(r"Final Answer:\s*(\d+)", text)
if match:
answers.append(int(match.group(1)))
# 다수결 투표
if answers:
counter = collections.Counter(answers)
majority_answer = counter.most_common(1)[0][0]
confidence = counter.most_common(1)[0][1] / len(answers)
else:
majority_answer = None
confidence = 0.0
return {
"answer": majority_answer,
"confidence": confidence,
"all_answers": answers,
"num_samples": num_samples,
"answer_distribution": dict(counter) if answers else {},
}
Tree-of-Thought (ToT) 프레임워크
핵심 아이디어
Yao 등(2023)의 **Tree-of-Thought(ToT)**는 CoT를 트리 구조로 확장하여 여러 추론 경로를 동시에 탐색한다. 핵심 발견은 다음과 같다.
- Game of 24 과제: GPT-4 + CoT가 4% 성공률 -> ToT로 74% 달성
- BFS/DFS 탐색 전략으로 추론 경로를 체계적으로 탐색
- 각 경로를 LLM 자체가 평가하여 유망한 경로만 확장
from dataclasses import dataclass
from typing import Optional
@dataclass
class ThoughtNode:
"""ToT의 사고 노드"""
content: str
score: float = 0.0
children: list = None
parent: Optional['ThoughtNode'] = None
depth: int = 0
def __post_init__(self):
if self.children is None:
self.children = []
class TreeOfThought:
"""Tree-of-Thought 프레임워크"""
def __init__(self, model="gpt-4o", max_depth=3, branching_factor=3):
self.client = OpenAI()
self.model = model
self.max_depth = max_depth
self.branching_factor = branching_factor
def generate_thoughts(self, problem: str, current_thought: str) -> list:
"""현재 상태에서 가능한 다음 사고 생성"""
prompt = (
f"Problem: {problem}\n\n"
f"Current reasoning so far:\n{current_thought}\n\n"
f"Generate {self.branching_factor} different possible next steps "
f"for solving this problem. "
f"Format each as 'Step N: [reasoning]' separated by '---'"
)
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
text = response.choices[0].message.content
thoughts = [t.strip() for t in text.split("---") if t.strip()]
return thoughts[:self.branching_factor]
def evaluate_thought(self, problem: str, thought_path: str) -> float:
"""사고 경로의 유망성을 0-1 사이로 평가"""
prompt = (
f"Problem: {problem}\n\n"
f"Reasoning path:\n{thought_path}\n\n"
f"Evaluate this reasoning path on a scale of 0.0 to 1.0:\n"
f"- 1.0: Correct and complete solution\n"
f"- 0.7-0.9: On the right track, promising\n"
f"- 0.4-0.6: Partially correct but uncertain\n"
f"- 0.0-0.3: Wrong approach or contains errors\n\n"
f"Respond with only the score (e.g., 0.8)"
)
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0,
max_tokens=10,
)
try:
score = float(response.choices[0].message.content.strip())
return min(max(score, 0.0), 1.0)
except ValueError:
return 0.5
def solve_bfs(self, problem: str) -> dict:
"""BFS 기반 ToT 탐색"""
root = ThoughtNode(content="", depth=0)
current_level = [root]
best_solution = None
best_score = 0.0
for depth in range(self.max_depth):
next_level = []
for node in current_level:
# 자식 사고 생성
thought_path = self._get_path(node)
children_thoughts = self.generate_thoughts(problem, thought_path)
for thought in children_thoughts:
full_path = f"{thought_path}\n{thought}" if thought_path else thought
score = self.evaluate_thought(problem, full_path)
child = ThoughtNode(
content=thought,
score=score,
parent=node,
depth=depth + 1
)
node.children.append(child)
next_level.append(child)
if score > best_score:
best_score = score
best_solution = full_path
# 상위 branching_factor개만 유지 (빔 서치)
next_level.sort(key=lambda n: n.score, reverse=True)
current_level = next_level[:self.branching_factor]
return {
"solution": best_solution,
"score": best_score,
"depth_explored": self.max_depth,
}
def _get_path(self, node: ThoughtNode) -> str:
"""노드까지의 전체 사고 경로 반환"""
path = []
current = node
while current and current.content:
path.append(current.content)
current = current.parent
return "\n".join(reversed(path))
ReAct: 추론과 행동의 결합
핵심 원리
Yao 등(2022)의 ReAct는 LLM이 추론(Reasoning)과 행동(Acting)을 교차로 수행하며 외부 도구를 활용하는 프레임워크이다. Thought-Action-Observation 루프를 통해 환각(hallucination)을 줄이고 검증 가능한 결과를 생성한다.
| 구성 요소 | 역할 | 예시 |
|---|---|---|
| Thought | 현재 상태 분석과 다음 행동 계획 | "사용자가 2024년 매출을 물어봤으니 DB를 조회해야겠다" |
| Action | 외부 도구 호출 | search("2024 revenue report"), calculate("150 * 1.1") |
| Observation | 도구 실행 결과 관찰 | "2024년 매출은 150억원으로 확인됨" |
import json
from typing import Callable
class ReActAgent:
"""ReAct 패턴 기반 에이전트"""
def __init__(self, model="gpt-4o"):
self.client = OpenAI()
self.model = model
self.tools = {}
self.max_iterations = 10
def register_tool(self, name: str, func: Callable, description: str):
"""외부 도구 등록"""
self.tools[name] = {
"function": func,
"description": description,
}
def _build_system_prompt(self) -> str:
"""시스템 프롬프트 구성"""
tool_descriptions = "\n".join([
f"- {name}: {info['description']}"
for name, info in self.tools.items()
])
return (
"You are a helpful assistant that solves problems step by step.\n"
"You have access to the following tools:\n"
f"{tool_descriptions}\n\n"
"For each step, respond in the following format:\n"
"Thought: [your reasoning about what to do next]\n"
"Action: [tool_name(argument)]\n\n"
"After receiving an observation, continue with another Thought.\n"
"When you have the final answer, respond with:\n"
"Thought: [final reasoning]\n"
"Final Answer: [your answer]\n\n"
"IMPORTANT: Use exactly one Action per step. "
"Wait for the Observation before proceeding."
)
def run(self, query: str) -> dict:
"""ReAct 루프 실행"""
messages = [
{"role": "system", "content": self._build_system_prompt()},
{"role": "user", "content": query},
]
steps = []
for iteration in range(self.max_iterations):
response = self.client.chat.completions.create(
model=self.model,
messages=messages,
temperature=0,
max_tokens=500,
)
assistant_msg = response.choices[0].message.content
# Final Answer 체크
if "Final Answer:" in assistant_msg:
final_answer = assistant_msg.split("Final Answer:")[-1].strip()
steps.append({
"type": "final",
"content": assistant_msg,
})
return {
"answer": final_answer,
"steps": steps,
"iterations": iteration + 1,
}
# Action 파싱 및 실행
import re
action_match = re.search(r"Action:\s*(\w+)\((.+?)\)", assistant_msg)
if action_match:
tool_name = action_match.group(1)
tool_arg = action_match.group(2).strip("'\"")
steps.append({
"type": "thought_action",
"content": assistant_msg,
"tool": tool_name,
"argument": tool_arg,
})
# 도구 실행
if tool_name in self.tools:
try:
observation = self.tools[tool_name]["function"](tool_arg)
except Exception as e:
observation = f"Error: {str(e)}"
else:
observation = f"Error: Tool '{tool_name}' not found"
steps.append({
"type": "observation",
"content": str(observation),
})
# 메시지 이력에 추가
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({
"role": "user",
"content": f"Observation: {observation}"
})
else:
# Action이 없으면 응답을 이력에 추가하고 계속
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({
"role": "user",
"content": "Please continue with an Action or provide the Final Answer."
})
return {
"answer": "Max iterations reached",
"steps": steps,
"iterations": self.max_iterations,
}
# 사용 예시
def create_research_agent():
"""리서치 에이전트 생성"""
agent = ReActAgent()
# 도구 등록
def search(query):
# 실제로는 검색 API 호출
return f"Search results for '{query}': [simulated results]"
def calculate(expression):
return str(eval(expression))
def get_current_date():
from datetime import datetime
return datetime.now().strftime("%Y-%m-%d")
agent.register_tool("search", search, "Search the web for information")
agent.register_tool("calculate", calculate, "Evaluate a math expression")
agent.register_tool("get_date", lambda _: get_current_date(), "Get current date")
return agent
구조화된 출력 프롬프팅
프로덕션 환경에서는 LLM의 출력을 프로그래밍적으로 처리할 수 있는 구조화된 형식(JSON, XML 등)으로 받아야 한다.
from pydantic import BaseModel, Field
from typing import Literal
# Pydantic 모델을 활용한 구조화된 출력
class SentimentResult(BaseModel):
"""감성 분석 결과 스키마"""
sentiment: Literal["positive", "negative", "neutral"]
confidence: float = Field(ge=0.0, le=1.0)
key_phrases: list[str]
reasoning: str
def structured_sentiment_analysis(text: str) -> SentimentResult:
"""구조화된 출력으로 감성 분석 수행"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"Analyze the sentiment of the given text. "
"Respond in JSON format with the following fields:\n"
"- sentiment: 'positive', 'negative', or 'neutral'\n"
"- confidence: float between 0.0 and 1.0\n"
"- key_phrases: list of key phrases that influenced the sentiment\n"
"- reasoning: brief explanation of the analysis"
)
},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
result = json.loads(response.choices[0].message.content)
return SentimentResult(**result)
# 함수 호출(Function Calling) 기반 구조화
def function_calling_extraction(text: str) -> dict:
"""Function Calling을 활용한 정보 추출"""
tools = [
{
"type": "function",
"function": {
"name": "extract_meeting_info",
"description": "Extract meeting information from text",
"parameters": {
"type": "object",
"properties": {
"date": {
"type": "string",
"description": "Meeting date in YYYY-MM-DD format"
},
"time": {
"type": "string",
"description": "Meeting time in HH:MM format"
},
"participants": {
"type": "array",
"items": {"type": "string"},
"description": "List of participants"
},
"agenda": {
"type": "array",
"items": {"type": "string"},
"description": "Meeting agenda items"
},
"location": {
"type": "string",
"description": "Meeting location or meeting link"
}
},
"required": ["date", "time", "participants"]
}
}
}
]
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": f"Extract meeting info: {text}"}],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_meeting_info"}},
)
tool_call = response.choices[0].message.tool_calls[0]
return json.loads(tool_call.function.arguments)
프롬프트 체이닝
복잡한 작업을 여러 단계의 프롬프트로 분해하여 순차적으로 실행하는 기법이다. 각 단계의 출력이 다음 단계의 입력이 된다.
class PromptChain:
"""프롬프트 체이닝 프레임워크"""
def __init__(self, model="gpt-4o"):
self.client = OpenAI()
self.model = model
self.steps = []
self.results = {}
def add_step(self, name: str, prompt_template: str, depends_on: list = None):
"""체인에 단계 추가"""
self.steps.append({
"name": name,
"prompt_template": prompt_template,
"depends_on": depends_on or [],
})
def run(self, initial_input: str) -> dict:
"""체인 전체 실행"""
self.results["input"] = initial_input
for step in self.steps:
# 의존 단계 결과로 프롬프트 구성
prompt = step["prompt_template"]
prompt = prompt.replace("INPUT", self.results.get("input", ""))
for dep in step["depends_on"]:
prompt = prompt.replace(
f"RESULT_{dep.upper()}",
self.results.get(dep, "")
)
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0,
)
self.results[step["name"]] = response.choices[0].message.content
return self.results
# 사용 예시: 기술 문서 요약 + 번역 + 키워드 추출
def create_document_pipeline():
"""문서 처리 파이프라인"""
chain = PromptChain()
chain.add_step(
name="summary",
prompt_template=(
"Summarize the following technical document in 3-5 bullet points:\n\n"
"INPUT"
)
)
chain.add_step(
name="translation",
prompt_template=(
"Translate the following summary to Korean:\n\n"
"RESULT_SUMMARY"
),
depends_on=["summary"]
)
chain.add_step(
name="keywords",
prompt_template=(
"Extract 5-10 technical keywords from the following summary. "
"Format as a comma-separated list:\n\n"
"RESULT_SUMMARY"
),
depends_on=["summary"]
)
return chain
프롬프팅 기법 성능 비교
벤치마크 결과
| 기법 | GSM8K (수학) | HotpotQA (QA) | Game of 24 | 토큰 비용 |
|---|---|---|---|---|
| Zero-shot | 17.9% | 28.7% | - | 1x |
| Few-shot | 33.0% | 35.2% | - | 1.5x |
| Zero-shot CoT | 40.7% | 33.8% | - | 1.5x |
| Few-shot CoT | 58.1% | 42.1% | 4% | 2x |
| Self-Consistency (k=40) | 76.0% | 47.3% | - | 40x |
| Tree-of-Thought | - | - | 74% | 10-50x |
| ReAct | - | 40.2% | - | 3-5x |
기법 선택 가이드
# 프롬프팅 기법 선택 의사결정 트리
decision_tree:
simple_classification:
recommended: 'Zero-shot 또는 Few-shot'
reason: '단순 분류는 고급 기법 불필요'
math_reasoning:
recommended: 'CoT + Self-Consistency'
reason: '수학 추론에서 가장 안정적인 성능'
multi_step_search:
recommended: 'ReAct'
reason: '외부 정보 필요 시 도구 활용 가능'
creative_problem_solving:
recommended: 'Tree-of-Thought'
reason: '탐색 공간이 넓은 창의적 문제에 적합'
production_api:
recommended: 'Few-shot + Structured Output'
reason: '일관성과 파싱 가능성이 중요'
일반적인 안티패턴
안티패턴 1: 과도한 지시문
# BAD: 너무 많은 지시문은 모델을 혼란시킴
bad_prompt = """
You are an expert data scientist with 20 years of experience.
You must always be accurate and never hallucinate.
You should think carefully before answering.
Make sure your answer is complete and comprehensive.
Consider all edge cases and potential issues.
Be concise but thorough.
Use technical language but also be accessible.
Format your response nicely.
Include examples when appropriate.
Double-check your work before responding.
Question: What is the capital of France?
"""
# GOOD: 간결하고 구체적인 지시
good_prompt = """
Answer the following geography question with just the city name.
Question: What is the capital of France?
"""
안티패턴 2: 모호한 출력 형식
# BAD: 출력 형식이 불명확
bad_format = "Analyze this data and give me insights."
# GOOD: 명확한 출력 형식 지정
good_format = """
Analyze the following sales data and provide:
1. Top 3 insights (one sentence each)
2. Trend direction: "increasing", "decreasing", or "stable"
3. Recommended actions (bulleted list, max 3 items)
Respond in JSON format with keys: insights, trend, actions.
"""
안티패턴 3: 컨텍스트 윈도우 낭비
# BAD: 불필요한 반복 컨텍스트
def bad_batch_processing(items):
"""각 요청마다 동일한 긴 시스템 프롬프트 반복"""
results = []
for item in items:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": VERY_LONG_SYSTEM_PROMPT},
{"role": "user", "content": item},
]
)
results.append(response.choices[0].message.content)
return results
# GOOD: 배치 처리로 효율화
def good_batch_processing(items):
"""여러 항목을 한 번에 처리"""
combined = "\n---\n".join([f"Item {i+1}: {item}" for i, item in enumerate(items)])
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"Process each item below and return results "
"in JSON array format."
)
},
{"role": "user", "content": combined},
],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
프로덕션 최적화
프롬프트 버전 관리
import hashlib
from datetime import datetime
class PromptRegistry:
"""프롬프트 버전 관리 시스템"""
def __init__(self):
self.prompts = {}
self.history = []
def register(self, name: str, template: str, version: str = None) -> str:
"""프롬프트 등록 및 버전 관리"""
content_hash = hashlib.md5(template.encode()).hexdigest()[:8]
version = version or f"v{len(self.history) + 1}_{content_hash}"
entry = {
"name": name,
"version": version,
"template": template,
"hash": content_hash,
"created_at": datetime.now().isoformat(),
}
self.prompts[name] = entry
self.history.append(entry)
return version
def get(self, name: str) -> str:
"""현재 활성 프롬프트 반환"""
if name not in self.prompts:
raise KeyError(f"Prompt '{name}' not registered")
return self.prompts[name]["template"]
def get_version(self, name: str) -> str:
"""현재 프롬프트 버전 반환"""
return self.prompts[name]["version"]
비용 최적화 전략
class CostOptimizer:
"""LLM API 비용 최적화"""
# 모델별 가격 (1M 토큰당, 2026년 3월 기준 근사값)
PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
"claude-3-5-haiku": {"input": 0.80, "output": 4.00},
}
@staticmethod
def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""비용 추정"""
pricing = CostOptimizer.PRICING.get(model, {})
input_cost = (input_tokens / 1_000_000) * pricing.get("input", 0)
output_cost = (output_tokens / 1_000_000) * pricing.get("output", 0)
return input_cost + output_cost
@staticmethod
def select_model(task_complexity: str) -> str:
"""작업 복잡도에 따른 모델 선택"""
model_map = {
"simple": "gpt-4o-mini", # 분류, 추출 등 단순 작업
"moderate": "gpt-4o-mini", # CoT 가 필요한 보통 작업
"complex": "gpt-4o", # 복잡한 추론, 코드 생성
"critical": "gpt-4o", # 정확도가 최우선인 작업
}
return model_map.get(task_complexity, "gpt-4o-mini")
캐싱 전략
import hashlib
import json
from functools import lru_cache
class PromptCache:
"""프롬프트 응답 캐싱"""
def __init__(self, cache_backend="memory"):
self.cache = {}
self.hits = 0
self.misses = 0
def _make_key(self, model: str, messages: list, temperature: float) -> str:
"""캐시 키 생성"""
content = json.dumps({
"model": model,
"messages": messages,
"temperature": temperature,
}, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()
def get(self, model: str, messages: list, temperature: float):
"""캐시에서 응답 조회"""
if temperature > 0:
# 비결정적 응답은 캐싱하지 않음
return None
key = self._make_key(model, messages, temperature)
result = self.cache.get(key)
if result:
self.hits += 1
else:
self.misses += 1
return result
def set(self, model: str, messages: list, temperature: float, response: str):
"""캐시에 응답 저장"""
if temperature > 0:
return
key = self._make_key(model, messages, temperature)
self.cache[key] = response
def stats(self) -> dict:
"""캐시 통계"""
total = self.hits + self.misses
return {
"hits": self.hits,
"misses": self.misses,
"hit_rate": self.hits / total if total > 0 else 0,
"cache_size": len(self.cache),
}
운영 시 주의사항
프롬프트 인젝션 방어
프로덕션 환경에서 가장 중요한 보안 이슈는 프롬프트 인젝션이다. 사용자 입력이 시스템 프롬프트를 우회하여 의도하지 않은 동작을 유도할 수 있다.
def sanitize_user_input(user_input: str) -> str:
"""사용자 입력 정화"""
# 1. 시스템 프롬프트 우회 시도 탐지
injection_patterns = [
"ignore previous instructions",
"ignore all instructions",
"disregard the above",
"forget your instructions",
"you are now",
"new instruction:",
"system prompt:",
]
lower_input = user_input.lower()
for pattern in injection_patterns:
if pattern in lower_input:
return "[BLOCKED: Potential prompt injection detected]"
# 2. 입력 길이 제한
max_length = 4000
if len(user_input) > max_length:
user_input = user_input[:max_length] + "... [truncated]"
return user_input
장애 사례와 복구
# 일반적인 장애 시나리오
failure_scenarios:
rate_limiting:
symptom: '429 Too Many Requests'
cause: 'API 호출 한도 초과'
recovery:
- '지수 백오프(exponential backoff) 적용'
- '요청 큐 구현으로 트래픽 평활화'
- '복수 API 키 로테이션'
hallucination:
symptom: '모델이 존재하지 않는 정보 생성'
cause: '충분하지 않은 컨텍스트 또는 과도한 temperature'
recovery:
- 'temperature를 0으로 낮춤'
- 'RAG 파이프라인으로 근거 자료 제공'
- '출력 검증 레이어 추가'
format_failure:
symptom: 'JSON 파싱 실패'
cause: '모델이 요청된 형식을 따르지 않음'
recovery:
- 'response_format 파라미터 사용'
- 'Few-shot 예시로 형식 강제'
- '실패 시 재시도 + 더 명확한 지시 추가'
context_overflow:
symptom: '컨텍스트 윈도우 초과 에러'
cause: '입력 토큰이 모델 한도 초과'
recovery:
- '입력 텍스트 요약 또는 청킹'
- '불필요한 Few-shot 예시 제거'
- '더 긴 컨텍스트 모델로 전환'
평가 파이프라인
class PromptEvaluator:
"""프롬프트 A/B 테스트 평가기"""
def __init__(self):
self.results = []
def evaluate(self, test_cases: list, prompt_a: str, prompt_b: str) -> dict:
"""두 프롬프트 비교 평가"""
scores_a = []
scores_b = []
for case in test_cases:
# 프롬프트 A 실행
result_a = self._run_prompt(prompt_a, case["input"])
score_a = self._score(result_a, case["expected"])
scores_a.append(score_a)
# 프롬프트 B 실행
result_b = self._run_prompt(prompt_b, case["input"])
score_b = self._score(result_b, case["expected"])
scores_b.append(score_b)
import numpy as np
return {
"prompt_a_avg": np.mean(scores_a),
"prompt_b_avg": np.mean(scores_b),
"prompt_a_std": np.std(scores_a),
"prompt_b_std": np.std(scores_b),
"winner": "A" if np.mean(scores_a) > np.mean(scores_b) else "B",
"improvement": abs(np.mean(scores_a) - np.mean(scores_b)),
"num_cases": len(test_cases),
}
def _run_prompt(self, prompt: str, input_text: str) -> str:
"""프롬프트 실행"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": input_text},
],
temperature=0,
)
return response.choices[0].message.content
def _score(self, result: str, expected: str) -> float:
"""결과 평가 (0-1)"""
# 간단한 문자열 유사도 기반 점수
result_lower = result.lower().strip()
expected_lower = expected.lower().strip()
if result_lower == expected_lower:
return 1.0
elif expected_lower in result_lower:
return 0.8
else:
return 0.0
마치며
프롬프트 엔지니어링은 단순한 텍스트 작성을 넘어 LLM의 추론 메커니즘을 이해하고 활용하는 엔지니어링 분야로 발전했다. Chain-of-Thought가 "추론 단계를 보여달라"는 간단한 아이디어에서 시작하여, Self-Consistency의 앙상블 전략, Tree-of-Thought의 체계적 탐색, ReAct의 도구 활용 패턴으로 확장되었다.
프로덕션 환경에서는 기법의 성능뿐 아니라 비용, 지연 시간, 일관성, 보안(프롬프트 인젝션 방어)을 종합적으로 고려해야 한다. 가장 중요한 것은 작업 특성에 맞는 기법을 선택하고, 체계적인 평가 파이프라인으로 지속적으로 개선하는 것이다.
앞으로 LLM의 기본 추론 능력이 향상됨에 따라 개별 프롬프팅 기법의 상대적 이점은 변할 수 있지만, "모델이 어떻게 추론하는지 이해하고 이를 가이드하는" 프롬프트 엔지니어링의 근본 원리는 변하지 않을 것이다.
참고자료
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
- Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners.
- DAIR.AI Prompt Engineering Guide
- OpenAI Prompt Engineering Best Practices
- Anthropic Prompt Engineering Documentation
Advanced LLM Prompt Engineering: Chain-of-Thought, Tree-of-Thought, ReAct, and Few-Shot Pattern Practical Guide
- Introduction
- Prompting Technique Taxonomy
- Zero-shot and Few-shot Prompting
- Chain-of-Thought (CoT) Prompting
- Self-Consistency Decoding
- Tree-of-Thought (ToT) Framework
- ReAct: Synergizing Reasoning and Acting
- Structured Output Prompting
- Prompt Chaining
- Prompting Technique Performance Comparison
- Common Anti-patterns
- Production Optimization
- Operational Considerations
- Conclusion
- References

Introduction
Prompt engineering is a core technology for maximizing the latent capabilities of LLMs. The Chain-of-Thought paper published by Wei et al. in 2022 proved that "including reasoning processes in prompts dramatically improves the model's reasoning ability," establishing prompt engineering as an independent research field.
Subsequently, advanced techniques such as Self-Consistency, Tree-of-Thought, and ReAct emerged in succession, expanding the scope of LLM applications far beyond simple question-answer patterns to complex reasoning, planning, and external tool utilization. In particular, the ReAct pattern has become the core architecture of most AI agent frameworks (LangChain, AutoGen, etc.).
This article systematically covers the theoretical background, key paper findings, Python implementation code, performance comparisons, anti-patterns, and production optimization strategies for each prompting technique.
Prompting Technique Taxonomy
Prompting techniques can be classified as follows:
| Category | Technique | Core Idea | Paper |
|---|---|---|---|
| Basic | Zero-shot | Perform with instructions only, no examples | - |
| Basic | Few-shot | Provide a few examples | Brown et al. 2020 |
| Reasoning Enhancement | Chain-of-Thought | Generate intermediate reasoning steps | Wei et al. 2022 |
| Reasoning Enhancement | Zero-shot CoT | Add a single phrase: "Let's think step by step" | Kojima et al. 2022 |
| Ensemble | Self-Consistency | Multi-path sampling + majority voting | Wang et al. 2022 |
| Search | Tree-of-Thought | Tree-structured reasoning path exploration | Yao et al. 2023 |
| Agent | ReAct | Reasoning + Acting + Observation loop | Yao et al. 2022 |
| Structured | Structured Output | Enforce JSON/XML format output | - |
| Composition | Prompt Chaining | Task decomposition + sequential execution | - |
Zero-shot and Few-shot Prompting
Zero-shot Prompting
The most basic approach where the model performs a task using only instructions without examples. With recent performance improvements in large models (GPT-4, Claude 3.5, etc.), many tasks can achieve sufficient performance with Zero-shot alone.
from openai import OpenAI
client = OpenAI()
def zero_shot_classification(text: str) -> str:
"""Zero-shot text classification"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"You are a text classifier. "
"Classify the given text into one of the following categories: "
"Technology, Business, Science, Sports, Entertainment. "
"Respond with only the category name."
)
},
{"role": "user", "content": text}
],
temperature=0,
max_tokens=20,
)
return response.choices[0].message.content.strip()
Few-shot Prompting
Few-shot prompting includes a small number of input-output examples in the prompt to help the model learn patterns. It was systematically presented in the GPT-3 paper by Brown et al. (2020) and is particularly effective for tasks requiring consistent output formats.
def few_shot_entity_extraction(text: str) -> str:
"""Few-shot named entity extraction"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "Extract named entities from the given text in the specified format."
},
{
"role": "user",
"content": "Samsung Electronics announced the Galaxy S25 series at CES 2025 in Las Vegas."
},
{
"role": "assistant",
"content": (
"- Organization: Samsung Electronics\n"
"- Product: Galaxy S25\n"
"- Event: CES 2025\n"
"- Location: Las Vegas"
)
},
{
"role": "user",
"content": "Elon Musk revealed that Tesla will open a new Gigafactory in Austin, Texas in March 2026."
},
{
"role": "assistant",
"content": (
"- Person: Elon Musk\n"
"- Organization: Tesla\n"
"- Facility: Gigafactory\n"
"- Location: Austin, Texas\n"
"- Date: March 2026"
)
},
{"role": "user", "content": text}
],
temperature=0,
)
return response.choices[0].message.content
# Few-shot example selection strategy
class FewShotSelector:
"""Dynamic few-shot example selector"""
def __init__(self, examples, embedding_model="text-embedding-3-small"):
self.examples = examples
self.client = OpenAI()
self.embedding_model = embedding_model
self._precompute_embeddings()
def _precompute_embeddings(self):
"""Precompute embeddings for all examples"""
texts = [ex["input"] for ex in self.examples]
response = self.client.embeddings.create(
model=self.embedding_model,
input=texts
)
self.embeddings = [r.embedding for r in response.data]
def select(self, query: str, k: int = 3) -> list:
"""Select k most similar examples to the query"""
query_emb = self.client.embeddings.create(
model=self.embedding_model,
input=[query]
).data[0].embedding
# Compute cosine similarity
import numpy as np
similarities = []
for emb in self.embeddings:
sim = np.dot(query_emb, emb) / (
np.linalg.norm(query_emb) * np.linalg.norm(emb)
)
similarities.append(sim)
# Select top k
top_indices = np.argsort(similarities)[-k:][::-1]
return [self.examples[i] for i in top_indices]
Chain-of-Thought (CoT) Prompting
Core Principle
The Chain-of-Thought paper by Wei et al. (2022) demonstrated that including intermediate reasoning steps in prompts dramatically improves the complex reasoning ability of LLMs. Key findings include:
- 540B parameter PaLM model improved GSM8K (math problems) accuracy from 17.9% to 58.1%
- CoT is an emergent ability that manifests only when the model is sufficiently large (100B+)
- Just 8 CoT examples surpassed GPT-3 (fine-tuned) + verifier performance
def chain_of_thought_math(problem: str) -> dict:
"""Chain-of-Thought math problem solving"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"You are a math tutor. Solve problems step by step.\n"
"Show your reasoning clearly before giving the final answer.\n"
"Format: Step 1, Step 2, ..., Final Answer: [number]"
)
},
{
"role": "user",
"content": (
"A store has 45 apples. They sell 1/3 of them in the morning "
"and 2/5 of the remaining in the afternoon. "
"How many apples are left?"
)
},
{
"role": "assistant",
"content": (
"Step 1: Calculate apples sold in the morning.\n"
"Morning sales = 45 * (1/3) = 15 apples\n\n"
"Step 2: Calculate remaining apples after morning.\n"
"Remaining = 45 - 15 = 30 apples\n\n"
"Step 3: Calculate apples sold in the afternoon.\n"
"Afternoon sales = 30 * (2/5) = 12 apples\n\n"
"Step 4: Calculate final remaining apples.\n"
"Final remaining = 30 - 12 = 18 apples\n\n"
"Final Answer: 18"
)
},
{"role": "user", "content": problem}
],
temperature=0,
)
answer_text = response.choices[0].message.content
# Extract final answer
import re
match = re.search(r"Final Answer:\s*(\d+)", answer_text)
final_answer = int(match.group(1)) if match else None
return {
"reasoning": answer_text,
"answer": final_answer,
"tokens_used": response.usage.total_tokens,
}
Zero-shot CoT
Kojima et al. (2022) discovered that simply adding the phrase "Let's think step by step" achieves CoT effects without requiring separate examples. This is extremely practical as it eliminates the need for crafting examples.
def zero_shot_cot(problem: str) -> str:
"""Zero-shot Chain-of-Thought"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": f"{problem}\n\nLet's think step by step."
}
],
temperature=0,
)
return response.choices[0].message.content
Self-Consistency Decoding
Self-Consistency by Wang et al. (2022) replaces CoT's single greedy decoding with sampling multiple reasoning paths and determining the final answer through majority voting. It achieved +17.9% accuracy improvement over CoT on GSM8K.
import collections
import re
def self_consistency(problem: str, num_samples: int = 5) -> dict:
"""Self-Consistency decoding"""
answers = []
reasoning_paths = []
for i in range(num_samples):
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"Solve the math problem step by step. "
"End with 'Final Answer: [number]'"
)
},
{"role": "user", "content": problem}
],
temperature=0.7, # Higher temperature for diverse reasoning paths
max_tokens=500,
)
text = response.choices[0].message.content
reasoning_paths.append(text)
# Extract answer
match = re.search(r"Final Answer:\s*(\d+)", text)
if match:
answers.append(int(match.group(1)))
# Majority voting
if answers:
counter = collections.Counter(answers)
majority_answer = counter.most_common(1)[0][0]
confidence = counter.most_common(1)[0][1] / len(answers)
else:
majority_answer = None
confidence = 0.0
return {
"answer": majority_answer,
"confidence": confidence,
"all_answers": answers,
"num_samples": num_samples,
"answer_distribution": dict(counter) if answers else {},
}
Tree-of-Thought (ToT) Framework
Core Idea
Tree-of-Thought (ToT) by Yao et al. (2023) extends CoT into a tree structure that simultaneously explores multiple reasoning paths. Key findings include:
- Game of 24 task: GPT-4 + CoT achieved 4% success rate -> ToT achieved 74%
- Systematic exploration of reasoning paths using BFS/DFS strategies
- The LLM itself evaluates each path, expanding only promising ones
from dataclasses import dataclass
from typing import Optional
@dataclass
class ThoughtNode:
"""ToT thought node"""
content: str
score: float = 0.0
children: list = None
parent: Optional['ThoughtNode'] = None
depth: int = 0
def __post_init__(self):
if self.children is None:
self.children = []
class TreeOfThought:
"""Tree-of-Thought Framework"""
def __init__(self, model="gpt-4o", max_depth=3, branching_factor=3):
self.client = OpenAI()
self.model = model
self.max_depth = max_depth
self.branching_factor = branching_factor
def generate_thoughts(self, problem: str, current_thought: str) -> list:
"""Generate possible next thoughts from current state"""
prompt = (
f"Problem: {problem}\n\n"
f"Current reasoning so far:\n{current_thought}\n\n"
f"Generate {self.branching_factor} different possible next steps "
f"for solving this problem. "
f"Format each as 'Step N: [reasoning]' separated by '---'"
)
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
text = response.choices[0].message.content
thoughts = [t.strip() for t in text.split("---") if t.strip()]
return thoughts[:self.branching_factor]
def evaluate_thought(self, problem: str, thought_path: str) -> float:
"""Evaluate the promise of a thought path on a 0-1 scale"""
prompt = (
f"Problem: {problem}\n\n"
f"Reasoning path:\n{thought_path}\n\n"
f"Evaluate this reasoning path on a scale of 0.0 to 1.0:\n"
f"- 1.0: Correct and complete solution\n"
f"- 0.7-0.9: On the right track, promising\n"
f"- 0.4-0.6: Partially correct but uncertain\n"
f"- 0.0-0.3: Wrong approach or contains errors\n\n"
f"Respond with only the score (e.g., 0.8)"
)
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0,
max_tokens=10,
)
try:
score = float(response.choices[0].message.content.strip())
return min(max(score, 0.0), 1.0)
except ValueError:
return 0.5
def solve_bfs(self, problem: str) -> dict:
"""BFS-based ToT search"""
root = ThoughtNode(content="", depth=0)
current_level = [root]
best_solution = None
best_score = 0.0
for depth in range(self.max_depth):
next_level = []
for node in current_level:
# Generate child thoughts
thought_path = self._get_path(node)
children_thoughts = self.generate_thoughts(problem, thought_path)
for thought in children_thoughts:
full_path = f"{thought_path}\n{thought}" if thought_path else thought
score = self.evaluate_thought(problem, full_path)
child = ThoughtNode(
content=thought,
score=score,
parent=node,
depth=depth + 1
)
node.children.append(child)
next_level.append(child)
if score > best_score:
best_score = score
best_solution = full_path
# Keep only top branching_factor nodes (beam search)
next_level.sort(key=lambda n: n.score, reverse=True)
current_level = next_level[:self.branching_factor]
return {
"solution": best_solution,
"score": best_score,
"depth_explored": self.max_depth,
}
def _get_path(self, node: ThoughtNode) -> str:
"""Return the full thought path up to the node"""
path = []
current = node
while current and current.content:
path.append(current.content)
current = current.parent
return "\n".join(reversed(path))
ReAct: Synergizing Reasoning and Acting
Core Principle
ReAct by Yao et al. (2022) is a framework where LLMs alternate between reasoning and acting to leverage external tools. Through the Thought-Action-Observation loop, it reduces hallucination and generates verifiable results.
| Component | Role | Example |
|---|---|---|
| Thought | Analyze current state and plan next action | "The user asked for 2024 revenue, so I need to query the DB" |
| Action | Call external tool | search("2024 revenue report"), calculate("150 * 1.1") |
| Observation | Observe tool execution result | "2024 revenue confirmed at 15 billion" |
import json
from typing import Callable
class ReActAgent:
"""ReAct pattern-based agent"""
def __init__(self, model="gpt-4o"):
self.client = OpenAI()
self.model = model
self.tools = {}
self.max_iterations = 10
def register_tool(self, name: str, func: Callable, description: str):
"""Register external tool"""
self.tools[name] = {
"function": func,
"description": description,
}
def _build_system_prompt(self) -> str:
"""Build system prompt"""
tool_descriptions = "\n".join([
f"- {name}: {info['description']}"
for name, info in self.tools.items()
])
return (
"You are a helpful assistant that solves problems step by step.\n"
"You have access to the following tools:\n"
f"{tool_descriptions}\n\n"
"For each step, respond in the following format:\n"
"Thought: [your reasoning about what to do next]\n"
"Action: [tool_name(argument)]\n\n"
"After receiving an observation, continue with another Thought.\n"
"When you have the final answer, respond with:\n"
"Thought: [final reasoning]\n"
"Final Answer: [your answer]\n\n"
"IMPORTANT: Use exactly one Action per step. "
"Wait for the Observation before proceeding."
)
def run(self, query: str) -> dict:
"""Execute ReAct loop"""
messages = [
{"role": "system", "content": self._build_system_prompt()},
{"role": "user", "content": query},
]
steps = []
for iteration in range(self.max_iterations):
response = self.client.chat.completions.create(
model=self.model,
messages=messages,
temperature=0,
max_tokens=500,
)
assistant_msg = response.choices[0].message.content
# Check for Final Answer
if "Final Answer:" in assistant_msg:
final_answer = assistant_msg.split("Final Answer:")[-1].strip()
steps.append({
"type": "final",
"content": assistant_msg,
})
return {
"answer": final_answer,
"steps": steps,
"iterations": iteration + 1,
}
# Parse and execute Action
import re
action_match = re.search(r"Action:\s*(\w+)\((.+?)\)", assistant_msg)
if action_match:
tool_name = action_match.group(1)
tool_arg = action_match.group(2).strip("'\"")
steps.append({
"type": "thought_action",
"content": assistant_msg,
"tool": tool_name,
"argument": tool_arg,
})
# Execute tool
if tool_name in self.tools:
try:
observation = self.tools[tool_name]["function"](tool_arg)
except Exception as e:
observation = f"Error: {str(e)}"
else:
observation = f"Error: Tool '{tool_name}' not found"
steps.append({
"type": "observation",
"content": str(observation),
})
# Add to message history
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({
"role": "user",
"content": f"Observation: {observation}"
})
else:
# If no Action, add to history and continue
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({
"role": "user",
"content": "Please continue with an Action or provide the Final Answer."
})
return {
"answer": "Max iterations reached",
"steps": steps,
"iterations": self.max_iterations,
}
# Usage example
def create_research_agent():
"""Create a research agent"""
agent = ReActAgent()
# Register tools
def search(query):
# In practice, this would call a search API
return f"Search results for '{query}': [simulated results]"
def calculate(expression):
return str(eval(expression))
def get_current_date():
from datetime import datetime
return datetime.now().strftime("%Y-%m-%d")
agent.register_tool("search", search, "Search the web for information")
agent.register_tool("calculate", calculate, "Evaluate a math expression")
agent.register_tool("get_date", lambda _: get_current_date(), "Get current date")
return agent
Structured Output Prompting
In production environments, LLM outputs must be received in structured formats (JSON, XML, etc.) that can be programmatically processed.
from pydantic import BaseModel, Field
from typing import Literal
# Structured output using Pydantic models
class SentimentResult(BaseModel):
"""Sentiment analysis result schema"""
sentiment: Literal["positive", "negative", "neutral"]
confidence: float = Field(ge=0.0, le=1.0)
key_phrases: list[str]
reasoning: str
def structured_sentiment_analysis(text: str) -> SentimentResult:
"""Perform sentiment analysis with structured output"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"Analyze the sentiment of the given text. "
"Respond in JSON format with the following fields:\n"
"- sentiment: 'positive', 'negative', or 'neutral'\n"
"- confidence: float between 0.0 and 1.0\n"
"- key_phrases: list of key phrases that influenced the sentiment\n"
"- reasoning: brief explanation of the analysis"
)
},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
result = json.loads(response.choices[0].message.content)
return SentimentResult(**result)
# Function Calling-based structuring
def function_calling_extraction(text: str) -> dict:
"""Information extraction using Function Calling"""
tools = [
{
"type": "function",
"function": {
"name": "extract_meeting_info",
"description": "Extract meeting information from text",
"parameters": {
"type": "object",
"properties": {
"date": {
"type": "string",
"description": "Meeting date in YYYY-MM-DD format"
},
"time": {
"type": "string",
"description": "Meeting time in HH:MM format"
},
"participants": {
"type": "array",
"items": {"type": "string"},
"description": "List of participants"
},
"agenda": {
"type": "array",
"items": {"type": "string"},
"description": "Meeting agenda items"
},
"location": {
"type": "string",
"description": "Meeting location or meeting link"
}
},
"required": ["date", "time", "participants"]
}
}
}
]
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": f"Extract meeting info: {text}"}],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_meeting_info"}},
)
tool_call = response.choices[0].message.tool_calls[0]
return json.loads(tool_call.function.arguments)
Prompt Chaining
A technique that decomposes complex tasks into multiple prompt stages and executes them sequentially. Each stage's output becomes the next stage's input.
class PromptChain:
"""Prompt Chaining Framework"""
def __init__(self, model="gpt-4o"):
self.client = OpenAI()
self.model = model
self.steps = []
self.results = {}
def add_step(self, name: str, prompt_template: str, depends_on: list = None):
"""Add a step to the chain"""
self.steps.append({
"name": name,
"prompt_template": prompt_template,
"depends_on": depends_on or [],
})
def run(self, initial_input: str) -> dict:
"""Execute the entire chain"""
self.results["input"] = initial_input
for step in self.steps:
# Construct prompt with dependent step results
prompt = step["prompt_template"]
prompt = prompt.replace("INPUT", self.results.get("input", ""))
for dep in step["depends_on"]:
prompt = prompt.replace(
f"RESULT_{dep.upper()}",
self.results.get(dep, "")
)
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0,
)
self.results[step["name"]] = response.choices[0].message.content
return self.results
# Usage example: Technical document summarize + translate + keyword extraction
def create_document_pipeline():
"""Document processing pipeline"""
chain = PromptChain()
chain.add_step(
name="summary",
prompt_template=(
"Summarize the following technical document in 3-5 bullet points:\n\n"
"INPUT"
)
)
chain.add_step(
name="translation",
prompt_template=(
"Translate the following summary to Korean:\n\n"
"RESULT_SUMMARY"
),
depends_on=["summary"]
)
chain.add_step(
name="keywords",
prompt_template=(
"Extract 5-10 technical keywords from the following summary. "
"Format as a comma-separated list:\n\n"
"RESULT_SUMMARY"
),
depends_on=["summary"]
)
return chain
Prompting Technique Performance Comparison
Benchmark Results
| Technique | GSM8K (Math) | HotpotQA (QA) | Game of 24 | Token Cost |
|---|---|---|---|---|
| Zero-shot | 17.9% | 28.7% | - | 1x |
| Few-shot | 33.0% | 35.2% | - | 1.5x |
| Zero-shot CoT | 40.7% | 33.8% | - | 1.5x |
| Few-shot CoT | 58.1% | 42.1% | 4% | 2x |
| Self-Consistency (k=40) | 76.0% | 47.3% | - | 40x |
| Tree-of-Thought | - | - | 74% | 10-50x |
| ReAct | - | 40.2% | - | 3-5x |
Technique Selection Guide
# Prompting technique selection decision tree
decision_tree:
simple_classification:
recommended: 'Zero-shot or Few-shot'
reason: 'Simple classification does not require advanced techniques'
math_reasoning:
recommended: 'CoT + Self-Consistency'
reason: 'Most stable performance for mathematical reasoning'
multi_step_search:
recommended: 'ReAct'
reason: 'Tool utilization possible when external information is needed'
creative_problem_solving:
recommended: 'Tree-of-Thought'
reason: 'Suitable for creative problems with large search spaces'
production_api:
recommended: 'Few-shot + Structured Output'
reason: 'Consistency and parsability are paramount'
Common Anti-patterns
Anti-pattern 1: Excessive Instructions
# BAD: Too many instructions confuse the model
bad_prompt = """
You are an expert data scientist with 20 years of experience.
You must always be accurate and never hallucinate.
You should think carefully before answering.
Make sure your answer is complete and comprehensive.
Consider all edge cases and potential issues.
Be concise but thorough.
Use technical language but also be accessible.
Format your response nicely.
Include examples when appropriate.
Double-check your work before responding.
Question: What is the capital of France?
"""
# GOOD: Concise and specific instructions
good_prompt = """
Answer the following geography question with just the city name.
Question: What is the capital of France?
"""
Anti-pattern 2: Ambiguous Output Format
# BAD: Output format is unclear
bad_format = "Analyze this data and give me insights."
# GOOD: Clear output format specification
good_format = """
Analyze the following sales data and provide:
1. Top 3 insights (one sentence each)
2. Trend direction: "increasing", "decreasing", or "stable"
3. Recommended actions (bulleted list, max 3 items)
Respond in JSON format with keys: insights, trend, actions.
"""
Anti-pattern 3: Context Window Waste
# BAD: Repeating the same long system prompt for each request
def bad_batch_processing(items):
"""Repeats identical long system prompt for every request"""
results = []
for item in items:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": VERY_LONG_SYSTEM_PROMPT},
{"role": "user", "content": item},
]
)
results.append(response.choices[0].message.content)
return results
# GOOD: Optimize with batch processing
def good_batch_processing(items):
"""Process multiple items at once"""
combined = "\n---\n".join([f"Item {i+1}: {item}" for i, item in enumerate(items)])
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"Process each item below and return results "
"in JSON array format."
)
},
{"role": "user", "content": combined},
],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
Production Optimization
Prompt Version Management
import hashlib
from datetime import datetime
class PromptRegistry:
"""Prompt version management system"""
def __init__(self):
self.prompts = {}
self.history = []
def register(self, name: str, template: str, version: str = None) -> str:
"""Register and version manage prompts"""
content_hash = hashlib.md5(template.encode()).hexdigest()[:8]
version = version or f"v{len(self.history) + 1}_{content_hash}"
entry = {
"name": name,
"version": version,
"template": template,
"hash": content_hash,
"created_at": datetime.now().isoformat(),
}
self.prompts[name] = entry
self.history.append(entry)
return version
def get(self, name: str) -> str:
"""Return the currently active prompt"""
if name not in self.prompts:
raise KeyError(f"Prompt '{name}' not registered")
return self.prompts[name]["template"]
def get_version(self, name: str) -> str:
"""Return current prompt version"""
return self.prompts[name]["version"]
Cost Optimization Strategy
class CostOptimizer:
"""LLM API cost optimization"""
# Per-model pricing (per 1M tokens, approximate as of March 2026)
PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
"claude-3-5-haiku": {"input": 0.80, "output": 4.00},
}
@staticmethod
def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""Estimate cost"""
pricing = CostOptimizer.PRICING.get(model, {})
input_cost = (input_tokens / 1_000_000) * pricing.get("input", 0)
output_cost = (output_tokens / 1_000_000) * pricing.get("output", 0)
return input_cost + output_cost
@staticmethod
def select_model(task_complexity: str) -> str:
"""Select model based on task complexity"""
model_map = {
"simple": "gpt-4o-mini", # Classification, extraction, etc.
"moderate": "gpt-4o-mini", # Tasks requiring CoT
"complex": "gpt-4o", # Complex reasoning, code generation
"critical": "gpt-4o", # Tasks where accuracy is top priority
}
return model_map.get(task_complexity, "gpt-4o-mini")
Caching Strategy
import hashlib
import json
from functools import lru_cache
class PromptCache:
"""Prompt response caching"""
def __init__(self, cache_backend="memory"):
self.cache = {}
self.hits = 0
self.misses = 0
def _make_key(self, model: str, messages: list, temperature: float) -> str:
"""Generate cache key"""
content = json.dumps({
"model": model,
"messages": messages,
"temperature": temperature,
}, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()
def get(self, model: str, messages: list, temperature: float):
"""Look up response in cache"""
if temperature > 0:
# Do not cache non-deterministic responses
return None
key = self._make_key(model, messages, temperature)
result = self.cache.get(key)
if result:
self.hits += 1
else:
self.misses += 1
return result
def set(self, model: str, messages: list, temperature: float, response: str):
"""Store response in cache"""
if temperature > 0:
return
key = self._make_key(model, messages, temperature)
self.cache[key] = response
def stats(self) -> dict:
"""Cache statistics"""
total = self.hits + self.misses
return {
"hits": self.hits,
"misses": self.misses,
"hit_rate": self.hits / total if total > 0 else 0,
"cache_size": len(self.cache),
}
Operational Considerations
Prompt Injection Defense
The most important security issue in production environments is prompt injection. User input can bypass system prompts to induce unintended behavior.
def sanitize_user_input(user_input: str) -> str:
"""Sanitize user input"""
# 1. Detect system prompt bypass attempts
injection_patterns = [
"ignore previous instructions",
"ignore all instructions",
"disregard the above",
"forget your instructions",
"you are now",
"new instruction:",
"system prompt:",
]
lower_input = user_input.lower()
for pattern in injection_patterns:
if pattern in lower_input:
return "[BLOCKED: Potential prompt injection detected]"
# 2. Limit input length
max_length = 4000
if len(user_input) > max_length:
user_input = user_input[:max_length] + "... [truncated]"
return user_input
Failure Cases and Recovery
# Common failure scenarios
failure_scenarios:
rate_limiting:
symptom: '429 Too Many Requests'
cause: 'API call limit exceeded'
recovery:
- 'Apply exponential backoff'
- 'Implement request queue for traffic smoothing'
- 'Rotate multiple API keys'
hallucination:
symptom: 'Model generates non-existent information'
cause: 'Insufficient context or excessive temperature'
recovery:
- 'Lower temperature to 0'
- 'Provide grounding material via RAG pipeline'
- 'Add output verification layer'
format_failure:
symptom: 'JSON parsing failure'
cause: 'Model does not follow requested format'
recovery:
- 'Use response_format parameter'
- 'Enforce format with Few-shot examples'
- 'Retry on failure with clearer instructions'
context_overflow:
symptom: 'Context window exceeded error'
cause: 'Input tokens exceed model limit'
recovery:
- 'Summarize or chunk input text'
- 'Remove unnecessary Few-shot examples'
- 'Switch to a model with longer context'
Evaluation Pipeline
class PromptEvaluator:
"""Prompt A/B test evaluator"""
def __init__(self):
self.results = []
def evaluate(self, test_cases: list, prompt_a: str, prompt_b: str) -> dict:
"""Comparative evaluation of two prompts"""
scores_a = []
scores_b = []
for case in test_cases:
# Execute Prompt A
result_a = self._run_prompt(prompt_a, case["input"])
score_a = self._score(result_a, case["expected"])
scores_a.append(score_a)
# Execute Prompt B
result_b = self._run_prompt(prompt_b, case["input"])
score_b = self._score(result_b, case["expected"])
scores_b.append(score_b)
import numpy as np
return {
"prompt_a_avg": np.mean(scores_a),
"prompt_b_avg": np.mean(scores_b),
"prompt_a_std": np.std(scores_a),
"prompt_b_std": np.std(scores_b),
"winner": "A" if np.mean(scores_a) > np.mean(scores_b) else "B",
"improvement": abs(np.mean(scores_a) - np.mean(scores_b)),
"num_cases": len(test_cases),
}
def _run_prompt(self, prompt: str, input_text: str) -> str:
"""Execute prompt"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": input_text},
],
temperature=0,
)
return response.choices[0].message.content
def _score(self, result: str, expected: str) -> float:
"""Score result (0-1)"""
# Simple string similarity-based scoring
result_lower = result.lower().strip()
expected_lower = expected.lower().strip()
if result_lower == expected_lower:
return 1.0
elif expected_lower in result_lower:
return 0.8
else:
return 0.0
Conclusion
Prompt engineering has evolved from simple text crafting into an engineering discipline that understands and leverages the reasoning mechanisms of LLMs. Starting from Chain-of-Thought's simple idea of "show me the reasoning steps," it has expanded to Self-Consistency's ensemble strategy, Tree-of-Thought's systematic search, and ReAct's tool utilization pattern.
In production environments, not only technique performance but also cost, latency, consistency, and security (prompt injection defense) must be holistically considered. The most important thing is selecting the right technique for the task characteristics and continuously improving through systematic evaluation pipelines.
As LLMs' baseline reasoning capabilities improve in the future, the relative advantages of individual prompting techniques may change, but the fundamental principle of prompt engineering -- "understanding how the model reasons and guiding it" -- will remain unchanged.
References
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
- Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners.
- DAIR.AI Prompt Engineering Guide
- OpenAI Prompt Engineering Best Practices
- Anthropic Prompt Engineering Documentation