Multi-Turn Chatbot Conversation State Management and Context Compression Strategies 2026

Overview

A multi-turn chatbot processes continuous conversations spanning multiple turns, not just single question-answer pairs. When a user references "what was said earlier" or asks follow-up questions that depend on context, the chatbot must accurately remember prior conversation content and maintain appropriate context. As of 2026, Claude Sonnet 4 offers a 200K-token context window and GPT-5 offers 400K tokens, but a longer context window is not always better.

As the context grows, self-attention computation scales as O(n^2) in the input length. Cost grows too: because the full history is re-sent on every turn, the total number of billed input tokens grows roughly quadratically with the number of turns. The more serious problem is "context rot": research has reported that model accuracy and recall degrade as input length increases. Therefore, rather than unconditionally stuffing in the entire conversation history, strategies for selecting and compressing key information are essential.

This article covers memory architectures for managing multi-turn chatbot conversation state, compression techniques for efficiently utilizing the context window, LangGraph state machine implementation, Redis-based session persistence, and token budget management, with a focus on strategies that hold up in production.

Challenges of Multi-Turn Conversations

Token Limits and Cost Issues

The first wall you hit in multi-turn conversations is the token limit. Taking a customer support chatbot as an example, it is common for sessions to exceed 50 turns. At an average of 200 tokens per turn, 10,000 tokens are consumed by conversation history alone at the 50-turn mark. Add system prompts, RAG documents, and function call results, and the token budget shrinks rapidly.
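The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes a fixed 200 tokens per turn and that the full history is re-sent with every request (both are simplifications); the function names are illustrative.

```python
def history_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    """Tokens occupied by the history once `turns` turns have accumulated."""
    return turns * tokens_per_turn


def cumulative_input_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    """Total history tokens sent over the whole session when the full history
    is re-sent each turn: the sum of k * tokens_per_turn for k = 1..turns."""
    return tokens_per_turn * turns * (turns + 1) // 2


print(history_tokens(50))           # 10000
print(cumulative_input_tokens(50))  # 255000
```

The history in a single request grows linearly, but the total input billed over the session grows quadratically, which is why long sessions get expensive faster than the history length alone suggests.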

Loss of Critical Information

If you use a sliding window that keeps only the most recent N messages, core requirements set by the user early in the conversation disappear. If the user said "my budget is under 5 million won" 10 turns ago and that message slides out of the window, the chatbot ends up making irrelevant recommendations.
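One way to soften this failure mode is to pin key constraints outside the window. The sketch below is an illustration, not a library API: `PinnedWindow` and its methods are hypothetical names, and deciding what to pin (for example via slot extraction) is left to the application.

```python
from dataclasses import dataclass, field


@dataclass
class PinnedWindow:
    """Keep the last `window` messages plus facts pinned by the application."""
    window: int = 4
    pinned: dict[str, str] = field(default_factory=dict)
    messages: list[dict] = field(default_factory=list)

    def pin(self, key: str, value: str) -> None:
        self.pinned[key] = value

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        recent = self.messages[-self.window:]
        if not self.pinned:
            return recent
        facts = "; ".join(f"{k}: {v}" for k, v in self.pinned.items())
        return [{"role": "system", "content": f"Known constraints: {facts}"}] + recent


mem = PinnedWindow(window=2)
mem.add("user", "My budget is under 5 million won")
mem.pin("budget", "under 5 million won")
for i in range(5):
    mem.add("user", f"follow-up question {i}")

ctx = mem.context()
print(ctx[0]["content"])  # the budget survives although its message left the window
```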

Complexity of State Management

Beyond simple Q&A, task-oriented conversations like bookings, orders, and troubleshooting require managing structured state such as the current step, collected information (slots), and confirmation status. This state must be tracked separately from the conversation history and must be reset or branched under specific conditions.
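As a rough illustration of state kept apart from the message history, the sketch below tracks step, slots, and confirmation status for a hypothetical booking flow. The slot names and step labels are assumptions made for the example.

```python
from dataclasses import dataclass, field

REQUIRED_SLOTS = ("date", "party_size", "name")  # hypothetical booking slots


@dataclass
class BookingState:
    step: str = "collect"  # "collect" -> "confirm" -> "done"
    slots: dict[str, str] = field(default_factory=dict)
    confirmed: bool = False

    def missing(self) -> list[str]:
        return [s for s in REQUIRED_SLOTS if s not in self.slots]

    def fill(self, slot: str, value: str) -> None:
        self.slots[slot] = value
        self.confirmed = False  # any change invalidates an earlier confirmation
        self.step = "collect" if self.missing() else "confirm"

    def confirm(self) -> None:
        if self.missing():
            raise ValueError(f"cannot confirm, missing: {self.missing()}")
        self.confirmed = True
        self.step = "done"

    def reset(self) -> None:
        self.step, self.slots, self.confirmed = "collect", {}, False


state = BookingState()
state.fill("date", "2026-03-14")
state.fill("party_size", "2")
print(state.missing())  # ['name']
state.fill("name", "Kim")
state.confirm()
```

Changing a slot after confirmation moves the flow back to the confirm step, which is the kind of conditional reset described above.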

Concurrent Session Isolation

In production environments, hundreds to thousands of users converse simultaneously. Each user's conversation state must be isolated to prevent cross-contamination, and if a user closes their browser and reopens it, the previous state must be restored.

Memory Architecture Types

LangChain and LlamaIndex provide various memory types for multi-turn conversations. Understanding the pros and cons of each and selecting the right combination for your situation is critical.

Memory Type Comparison

| Memory Type    | Storage Method                                  | Token Usage               | Information Fidelity     | Suitable Scenarios               |
| -------------- | ----------------------------------------------- | ------------------------- | ------------------------ | -------------------------------- |
| Buffer Memory  | Stores entire conversation history              | High (linear growth)      | Very high                | Short conversations, debugging   |
| Window Memory  | Keeps only the most recent K messages           | Fixed (window size)       | Medium (early info lost) | General customer support         |
| Summary Memory | Generates summary via LLM                       | Low (summary length)      | Low (detail loss)        | Long conversations, cost savings |
| Summary Buffer | Hybrid of summary + recent buffer               | Medium                    | High                     | Most production use cases        |
| Vector Memory  | Retrieves relevant conversations via embeddings | Variable (search results) | High (relevance-based)   | Long-term memory, cross-session  |
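To make the Vector Memory row concrete, here is a toy sketch of relevance-based retrieval over past turns. The bag-of-words `embed` function is a stand-in for a real embedding model, and `VectorMemory` is a hypothetical class, not a LangChain or LlamaIndex API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorMemory:
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = VectorMemory()
memory.add("Customer budget is 1.5 million won for a laptop")
memory.add("Customer prefers a lightweight laptop for travel")
memory.add("Agent explained the return policy for accessories")

top = memory.search("what budget did the customer mention", k=1)
print(top)
```

Only the retrieved turns are placed into the prompt, which is why token usage depends on the number of search results rather than on conversation length.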

ConversationBufferMemory vs ConversationSummaryBufferMemory

Buffer Memory is the simplest approach. It stores all messages as-is and feeds them into the prompt. It is easy to debug and has no information loss, but the token limit is quickly reached as conversations grow longer.

Summary Buffer Memory is the most commonly used approach in practice. It keeps recent messages in their original form while compressing older messages into summaries via LLM calls. Summarization is automatically triggered when the token count exceeds a configured threshold.

```python
# Legacy LangChain memory API (deprecated in LangChain 0.3 in favor of LangGraph persistence)
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Automatically summarizes old messages when max_token_limit is exceeded
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
    memory_key="chat_history",
    human_prefix="Customer",
    ai_prefix="Agent",
)

# Save conversations
memory.save_context(
    {"input": "I'm considering buying a laptop and my budget is 1.5 million won"},
    {"output": "I'll recommend a great laptop within your 1.5 million won budget. What will be the primary use?"},
)
memory.save_context(
    {"input": "I'll be doing programming and light video editing"},
    {"output": "For development and video editing, I recommend specs with at least 16GB RAM and 512GB SSD."},
)
memory.save_context(
    {"input": "Which is better, the MacBook Air M4 or Lenovo ThinkPad?"},
    {"output": "Both are excellent products, but there are differences depending on use case. The MacBook Air M4 is ..."},
)

# Once the token limit is exceeded, the result is an automatic summary plus the retained recent messages
loaded = memory.load_memory_variables({})
print(loaded["chat_history"])
# SystemMessage: "The customer is looking for a laptop for programming and video editing with a 1.5M won budget..."
# + recent original messages
```

LlamaIndex ChatSummaryMemoryBuffer

LlamaIndex provides a similar mechanism. `ChatSummaryMemoryBuffer` summarizes older messages once the configured token limit is exceeded, while keeping recent messages in their original form.

```python
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", temperature=0)

# Older messages are summarized once the chat history exceeds token_limit
memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=3000,
)

# Connect memory to Chat Engine
# `index` is assumed to be an existing index (e.g. a VectorStoreIndex) built elsewhere
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=3),
    memory=memory,
    llm=llm,
    system_prompt="You are a technical support specialist.",
)

response = chat_engine.chat("Tell me more about the error code I mentioned earlier")
```

Context Window Management

Sliding Window Strategy

The sliding window is the most intuitive context management method. It keeps only the most recent K messages and discards the rest. It is simple to implement and token usage is predictable, but the downside is that early conversation information is completely lost.

An improved sliding window determines window size based on token count rather than simply message count. Twenty short messages and five long messages should not be treated the same.

Token-Based Window Implementation

```python
import tiktoken


def sliding_window_by_tokens(
    messages: list[dict],
    max_tokens: int = 4000,
    model: str = "gpt-4o",
    always_keep_system: bool = True,
) -> list[dict]:
    """Token count-based sliding window.

    Always keeps system messages, fills from recent messages in reverse order.
    """
    enc = tiktoken.encoding_for_model(model)
    result = []
    current_tokens = 0

    # Reserve system messages first
    system_messages = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    if always_keep_system:
        for sm in system_messages:
            sm_tokens = len(enc.encode(sm["content"]))
            result.append(sm)
            current_tokens += sm_tokens

    # Add from most recent messages in reverse order
    selected = []
    for msg in reversed(non_system):
        msg_tokens = len(enc.encode(msg["content"]))
        if current_tokens + msg_tokens > max_tokens:
            break
        selected.append(msg)
        current_tokens += msg_tokens

    # Restore chronological order before returning
    result.extend(reversed(selected))
    return result
```

Context Window Management Method Comparison

| Management Method           | Implementation Complexity | Token Efficiency | Information Preservation | Latency                     |
| --------------------------- | ------------------------- | ---------------- | ------------------------ | --------------------------- |
| Message count-based sliding | Very low                  | Medium           | Low                      | None                        |
| Token count-based sliding   | Low                       | High             | Low                      | Very low                    |
| Summary + sliding           | Medium                    | High             | High                     | Medium (LLM call)           |
| Vector search-based         | High                      | Very high        | High                     | Medium (embedding + search) |
| Hybrid (summary + vector)   | Very high                 | Very high        | Very high                | High                        |

Conversation Summarization Strategies

Conversation summarization is a core technique for saving context window space while preserving key information. Simply saying "summarize the conversation" can cause important details to be omitted, so structured summarization prompts should be used.

Progressive Summarization

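Rather than summarizing the entire conversation at once, this approach merges new content into the existing summary at regular turn intervals. Each update processes only the previous summary plus the turns added since the last update, so the cost of a summarization call stays roughly constant no matter how long the conversation gets, while summary quality remains high.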

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

PROGRESSIVE_SUMMARY_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are a customer support conversation summarization expert.
Merge the existing summary with the new conversation to produce an updated summary.

Rules:
1. Always preserve the customer's core requirements, constraints, and preferences
2. Never omit confirmed facts (names, order numbers, dates, etc.)
3. Specify the current progress stage and next required action
4. Record resolved issues briefly, unresolved issues in detail
5. Keep within 200 characters"""),
    ("human", """Existing summary:
{existing_summary}

New conversation:
{new_messages}

Updated summary:"""),
])


class ProgressiveSummarizer:
    def __init__(self, llm, summary_interval: int = 5):
        self.llm = llm
        self.chain = PROGRESSIVE_SUMMARY_PROMPT | llm
        self.summary = ""
        self.buffer = []
        self.summary_interval = summary_interval
        self.turn_count = 0

    def add_turn(self, user_msg: str, assistant_msg: str):
        self.buffer.append(f"Customer: {user_msg}")
        self.buffer.append(f"Agent: {assistant_msg}")
        self.turn_count += 1

        # Fold the buffered turns into the summary every `summary_interval` turns
        if self.turn_count % self.summary_interval == 0:
            self._update_summary()

    def _update_summary(self):
        new_messages = "\n".join(self.buffer)
        result = self.chain.invoke({
            "existing_summary": self.summary or "(none)",
            "new_messages": new_messages,
        })
        self.summary = result.content
        self.buffer = []  # Clear buffer

    def get_context(self) -> str:
        """Return context combining summary + recent buffer"""
        parts = []
        if self.summary:
            parts.append(f"[Conversation Summary]\n{self.summary}")
        if self.buffer:
            parts.append(f"[Recent Conversation]\n" + "\n".join(self.buffer))
        return "\n\n".join(parts)
```

Structured vs Free-Form Summarization

| Summarization Method    | Pros                            | Cons                       | Recommended Scenarios     |
| ----------------------- | ------------------------------- | -------------------------- | ------------------------- |
| Free-form summarization | Simple implementation, flexible | May miss critical info     | General chatbots          |
| Slot-based structured   | Guarantees required info        | Requires prompt design     | Booking/ordering chatbots |
| Key-value extraction    | Searchable/filterable           | May lose context           | Data collection purposes  |
| Progressive merge       | Cost-efficient, high quality    | Possible cumulative errors | Long support sessions     |

Context Compression Techniques

Prompt Compression with LLMLingua

Microsoft's LLMLingua series can compress prompts up to 20x while minimizing performance degradation. The original LLMLingua removes low-information tokens based on the perplexity assigned by a small language model. LLMLingua-2 instead uses a token classifier trained on data distilled from GPT-4, enabling domain-agnostic general-purpose compression that is 3-6x faster than the original LLMLingua.

```python
from llmlingua import PromptCompressor

# Initialize LLMLingua-2
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",  # Use "cuda" for GPU
)

# Compress long conversation history
conversation_history = """
Customer: Hello, I'd like to inquire about an order I placed last week.
Agent: Hello! Please provide your order number and I'll check for you.
Customer: The order number is ORD-2026-03-1234. My delivery hasn't arrived yet.
Agent: Let me check on that. Please wait a moment.
Agent: I've confirmed order number ORD-2026-03-1234. It's currently in transit and scheduled to arrive tomorrow.
Customer: Tomorrow? It was supposed to arrive yesterday. Why is it delayed?
Agent: It was delayed by one day due to logistics center circumstances. We apologize for the inconvenience.
Customer: Can I get a shipping fee refund?
Agent: Yes, a shipping fee refund is available due to the delivery delay. Shall I proceed with the refund?
Customer: Yes, please.
Agent: The shipping fee refund of 3,000 won has been processed. It will be refunded to the original payment method within 1-3 days.
"""

compressed = compressor.compress_prompt(
    conversation_history,
    rate=0.5,  # Keep roughly 50% of the tokens
    force_tokens=["order number", "ORD-2026-03-1234", "refund"],  # Tokens to always keep
)

original = compressed["origin_tokens"]
kept = compressed["compressed_tokens"]
print(f"Original tokens: {original}")
print(f"Compressed tokens: {kept}")
# Computed from the token counts; the library's own "ratio" field is a preformatted string such as "2.0x"
print(f"Compression ratio: {original / kept:.1f}x")
print(f"Compressed result:\n{compressed['compressed_prompt']}")
```

Compression Technique Comparison

| Compression Technique | Ratio     | Performance Retention | Speed                             | Training Required   |
| --------------------- | --------- | --------------------- | --------------------------------- | ------------------- |
| LLMLingua             | Up to 20x | High                  | Medium                            | No (inference only) |
| LLMLingua-2           | Up to 20x | Very high             | Fast (3-6x faster than LLMLingua) | No                  |
| LongLLMLingua         | Up to 4x  | Very high             | Medium                            | No                  |
| LLM summarization     | Variable  | Medium                | Slow (LLM call)                   | No                  |
| Rule-based filtering  | 2-3x      | Low                   | Very fast                         | No                  |
| Selective Context     | Up to 10x | High                  | Fast                              | No                  |
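For comparison with the model-based compressors, the rule-based filtering row can be sketched in a few lines. The filler patterns below are hypothetical and deliberately blunt, which also illustrates why this technique scores low on performance retention.

```python
import re

# Hypothetical filler patterns; tune these for your own transcripts.
FILLER = re.compile(
    r"^(hello[!.]?|hi[!.]?|yes,? please\.?|thanks?( you)?[!.]?|ok(ay)?[!.]?|please wait a moment\.?)$",
    re.IGNORECASE,
)
HAS_DIGIT = re.compile(r"\d")  # order numbers, amounts, dates are always kept


def filter_turns(turns: list[str]) -> list[str]:
    """Drop turns whose text is pure filler; keep anything containing a digit."""
    kept = []
    for turn in turns:
        _, _, text = turn.partition(": ")
        if HAS_DIGIT.search(text) or not FILLER.match(text.strip()):
            kept.append(turn)
    return kept


turns = [
    "Customer: Hello!",
    "Customer: The order number is ORD-2026-03-1234.",
    "Agent: Please wait a moment.",
    "Customer: Yes, please.",
    "Agent: The refund of 3,000 won has been processed.",
]
kept = filter_turns(turns)
print(kept)
```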

LangGraph State Machine Implementation

LangGraph can model conversation flows as graph-based state machines. Unlike LangChain's traditional memory approach, it uses explicit state schemas and reducer functions to reliably manage complex multi-turn workflows. State is automatically persisted through checkpointers, making session restoration seamless.

```python
from typing import Annotated, TypedDict

from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


# 1. State schema definition
class OrderSupportState(TypedDict):
    messages: Annotated[list, add_messages]  # Accumulate messages with reducer
    order_id: str | None
    issue_type: str | None  # "shipping", "refund", "exchange", "other"
    step: str  # "greeting", "identify", "diagnose", "resolve", "close"
    collected_info: dict
    summary: str  # Previous conversation summary


# 2. Node function definitions
llm = ChatOpenAI(model="gpt-4o", temperature=0)


def greeting_node(state: OrderSupportState) -> dict:
    """Greeting and initial classification"""
    response = llm.invoke([
        SystemMessage(content="You are a customer support chatbot. Identify the customer's inquiry type."),
        *state["messages"],
    ])
    return {
        "messages": [response],
        "step": "identify",
    }


def identify_node(state: OrderSupportState) -> dict:
    """Identify order number and issue type"""
    context_parts = []
    if state.get("summary"):
        context_parts.append(f"Previous conversation summary: {state['summary']}")

    system_msg = f"""Identify the customer's order number and problem type.
Collected info: {state.get('collected_info', dict())}
{chr(10).join(context_parts)}"""

    response = llm.invoke([
        SystemMessage(content=system_msg),
        *state["messages"][-10:],  # Use only the last 10 messages
    ])

    # Extract order number from response (more sophisticated parsing needed in practice)
    return {
        "messages": [response],
        "step": "diagnose",
    }


def resolve_node(state: OrderSupportState) -> dict:
    """Propose problem resolution"""
    response = llm.invoke([
        SystemMessage(content=f"Issue type: {state.get('issue_type', 'unidentified')}. "
                              f"Order number: {state.get('order_id', 'unidentified')}. Propose a solution."),
        *state["messages"][-6:],
    ])
    return {
        "messages": [response],
        "step": "close",
    }


# 3. Routing function
def route_by_step(state: OrderSupportState) -> str:
    step = state.get("step", "greeting")
    if step == "greeting":
        return "greeting"
    elif step == "identify":
        return "identify"
    elif step in ("diagnose", "resolve"):
        return "resolve"
    else:
        return END


# 4. Graph construction
graph = StateGraph(OrderSupportState)
graph.add_node("greeting", greeting_node)
graph.add_node("identify", identify_node)
graph.add_node("resolve", resolve_node)

graph.add_conditional_edges(START, route_by_step)
graph.add_conditional_edges("greeting", route_by_step)
graph.add_conditional_edges("identify", route_by_step)
graph.add_conditional_edges("resolve", route_by_step)

# 5. Persist state with checkpointer
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# 6. Execute (sessions distinguished by thread_id)
# Note: as wired here, a single invoke walks greeting -> identify -> resolve before ending;
# a real deployment would end the run after each node to wait for the user's next message.
config = {"configurable": {"thread_id": "user-session-abc123"}}

result = app.invoke(
    {
        "messages": [HumanMessage(content="My ordered item hasn't arrived yet")],
        "step": "greeting",
        "collected_info": {},
        "summary": "",
    },
    config=config,
)
```

LangGraph's checkpointer automatically saves a snapshot of the state at every step of graph execution. `MemorySaver` is in-memory storage suitable for development/testing; in production, use a database-backed checkpointer such as `SqliteSaver`, `PostgresSaver`, or `MongoDBSaver`. Sessions are distinguished by `thread_id`, so multiple users' conversations stay isolated from one another.

Session Management and Persistence

Redis-Based Session Store

When persisting conversation state in production, Redis is the most common choice. It supports low-latency reads/writes, TTL-based automatic expiration, and real-time notifications via Pub/Sub.

```python
import json
import time

import redis
import tiktoken


class ChatSessionManager:
    """Redis-based multi-turn conversation session manager"""

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        session_ttl: int = 3600,  # 1 hour
        max_history_tokens: int = 4000,
    ):
        self.redis = redis.from_url(redis_url, decode_responses=True)
        self.session_ttl = session_ttl
        self.max_history_tokens = max_history_tokens
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def _key(self, session_id: str, suffix: str) -> str:
        return f"chat:session:{session_id}:{suffix}"

    def create_session(self, session_id: str, metadata: dict | None = None) -> dict:
        """Create a new session"""
        session_data = {
            "session_id": session_id,
            "created_at": time.time(),
            "updated_at": time.time(),
            "turn_count": 0,
            "total_tokens": 0,
            "metadata": json.dumps(metadata or {}),
            "summary": "",
        }
        self.redis.hset(self._key(session_id, "meta"), mapping=session_data)
        self.redis.expire(self._key(session_id, "meta"), self.session_ttl)
        return session_data

    def add_message(self, session_id: str, role: str, content: str) -> None:
        """Add a message and manage tokens"""
        tokens = len(self.encoder.encode(content))
        msg = json.dumps({
            "role": role,
            "content": content,
            "timestamp": time.time(),
            "tokens": tokens,
        })
        history_key = self._key(session_id, "history")
        meta_key = self._key(session_id, "meta")
        self.redis.rpush(history_key, msg)
        self.redis.expire(history_key, self.session_ttl)

        # Update metadata (turn_count is incremented once per message)
        self.redis.hincrby(meta_key, "turn_count", 1)
        self.redis.hincrby(meta_key, "total_tokens", tokens)
        self.redis.hset(meta_key, "updated_at", str(time.time()))

        # Renew TTL
        self.redis.expire(meta_key, self.session_ttl)

    def get_context_messages(self, session_id: str) -> list[dict]:
        """Return context messages within the token budget"""
        history_key = self._key(session_id, "history")
        all_messages = self.redis.lrange(history_key, 0, -1)
        if not all_messages:
            return []

        parsed = [json.loads(m) for m in all_messages]
        result = []
        token_count = 0

        # Add summary first if available
        summary = self.redis.hget(self._key(session_id, "meta"), "summary")
        if summary:
            summary_tokens = len(self.encoder.encode(summary))
            token_count += summary_tokens
            result.append({"role": "system", "content": f"Previous conversation summary: {summary}"})

        # Add recent messages in reverse order within token budget
        selected = []
        for msg in reversed(parsed):
            msg_tokens = msg.get("tokens", 0)
            if token_count + msg_tokens > self.max_history_tokens:
                break
            selected.append({"role": msg["role"], "content": msg["content"]})
            token_count += msg_tokens

        result.extend(reversed(selected))
        return result

    def update_summary(self, session_id: str, summary: str) -> None:
        """Update conversation summary"""
        self.redis.hset(self._key(session_id, "meta"), "summary", summary)

    def delete_session(self, session_id: str) -> None:
        """Delete session"""
        for suffix in ("meta", "history"):
            self.redis.delete(self._key(session_id, suffix))
```

Session Store Comparison

| Store            | Latency      | Persistence           | Scalability       | TTL Support                    | Suitable Scenarios     |
| ---------------- | ------------ | --------------------- | ----------------- | ------------------------------ | ---------------------- |
| In-memory (dict) | Nanoseconds  | None                  | Single process    | Manual implementation          | Development/testing    |
| Redis            | Milliseconds | Conditional (AOF/RDB) | Cluster support   | Built-in                       | Production real-time   |
| PostgreSQL       | A few ms     | Full                  | High              | Manual (trigger or cleanup job) | Audit logging required |
| MongoDB          | A few ms     | Full                  | Sharding support  | TTL index                      | Unstructured state     |
| DynamoDB         | A few ms     | Full                  | Managed, very high | Built-in                      | AWS-based services     |

In production, the common hybrid pattern is to use Redis as the main session store while asynchronously flushing to PostgreSQL or MongoDB. Reading current conversation state quickly from Redis and saving the complete history to a relational DB at conversation end provides both performance and persistence.
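The hybrid pattern can be sketched without external services. In the example below a dict stands in for Redis and SQLite stands in for PostgreSQL; both substitutions, the class name, and the table layout are assumptions made so the sketch runs anywhere. It flushes synchronously at session end, whereas a real deployment would usually do this from a background worker.

```python
import json
import sqlite3


class HybridSessionStore:
    """Hot state in a dict (stand-in for Redis); full history archived to
    SQLite (stand-in for PostgreSQL) when the conversation ends."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.hot: dict[str, list[dict]] = {}
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS conversations "
            "(session_id TEXT PRIMARY KEY, history TEXT)"
        )

    def append(self, session_id: str, role: str, content: str) -> None:
        self.hot.setdefault(session_id, []).append({"role": role, "content": content})

    def end_session(self, session_id: str) -> None:
        """Move the finished conversation from the hot store to the archive."""
        history = self.hot.pop(session_id, [])
        self.db.execute(
            "INSERT OR REPLACE INTO conversations VALUES (?, ?)",
            (session_id, json.dumps(history)),
        )
        self.db.commit()

    def archived(self, session_id: str) -> list[dict]:
        row = self.db.execute(
            "SELECT history FROM conversations WHERE session_id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []


store = HybridSessionStore()
store.append("s1", "user", "Where is my order?")
store.append("s1", "assistant", "It arrives tomorrow.")
store.end_session("s1")
print(store.archived("s1"))
```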

Token Budget Management

In production chatbots, token budget management is central to cost control and response quality. A strategy is needed to divide the model's context window into allocations for system prompts, conversation history, RAG documents, and response reservation.

```python
from dataclasses import dataclass

import tiktoken


@dataclass
class TokenBudget:
    """Token budget allocation calculator"""

    model: str = "gpt-4o"
    max_context: int = 128000  # gpt-4o context window
    system_prompt_tokens: int = 500
    response_reserve: int = 4000  # Reserved for response
    rag_budget: int = 3000  # For RAG documents
    tool_result_budget: int = 2000  # For tool execution results

    def __post_init__(self):
        self.encoder = tiktoken.encoding_for_model(self.model)

    @property
    def conversation_budget(self) -> int:
        """Number of tokens available for conversation history"""
        reserved = (
            self.system_prompt_tokens
            + self.response_reserve
            + self.rag_budget
            + self.tool_result_budget
        )
        return self.max_context - reserved

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def allocate(self, messages: list[dict]) -> dict:
        """Token usage status report for current messages"""
        msg_tokens = sum(
            self.count_tokens(m.get("content", "")) for m in messages
        )
        budget = self.conversation_budget

        return {
            "total_context": self.max_context,
            "system_prompt": self.system_prompt_tokens,
            "response_reserve": self.response_reserve,
            "rag_budget": self.rag_budget,
            "tool_result_budget": self.tool_result_budget,
            "conversation_budget": budget,
            "conversation_used": msg_tokens,
            "conversation_remaining": budget - msg_tokens,
            "utilization_pct": round(msg_tokens / budget * 100, 1),
            "needs_compression": msg_tokens > budget * 0.8,
        }


# Usage example
budget = TokenBudget(model="gpt-4o", max_context=128000)
print(f"Available tokens for conversation history: {budget.conversation_budget:,}")

report = budget.allocate([
    {"role": "user", "content": "Please check my previous order status"},
    {"role": "assistant", "content": "Could you provide your order number?"},
])
print(f"Utilization: {report['utilization_pct']}%")
print(f"Compression needed: {report['needs_compression']}")
```

Triggering compression automatically when 80% of the token budget is exceeded is a good practice. This threshold should be adjusted based on service characteristics. For customer support where accuracy is critical, lower it to 70%; for casual conversations, 90% is acceptable.
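A per-service threshold can be expressed as a small lookup. The thresholds below come from the paragraph above; the profile names and the function itself are illustrative.

```python
# Thresholds taken from the paragraph above; the profile names are illustrative.
COMPRESSION_THRESHOLDS = {"customer_support": 0.70, "default": 0.80, "casual": 0.90}


def needs_compression(used_tokens: int, budget_tokens: int, profile: str = "default") -> bool:
    """True when usage exceeds the profile's share of the conversation budget."""
    threshold = COMPRESSION_THRESHOLDS.get(profile, COMPRESSION_THRESHOLDS["default"])
    return used_tokens > budget_tokens * threshold


# 3,000 of 4,000 tokens used is 75% utilization
print(needs_compression(3000, 4000, "customer_support"))  # True  (above 70%)
print(needs_compression(3000, 4000, "default"))           # False (below 80%)
print(needs_compression(3000, 4000, "casual"))            # False (below 90%)
```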

Troubleshooting

Repeated Questions Due to Context Loss

Symptom: The chatbot asks again for information it has already collected.

Diagnostic order:

1. Check if the sliding window size is too small. The key information may be in messages that have slid out of the window.

2. Check if summarization is working properly. The summarization prompt may be omitting critical slot information (names, order numbers, etc.).

3. Check if `collected_info` is being properly updated in the state management logic.

Solution: Specify a "fields that must be preserved" list in the structured summarization prompt. Manage slot information in a separate state dictionary.

Conversation Disconnection Due to Redis Session Expiration

Symptom: When a user leaves briefly and returns, the chatbot starts from the beginning.

Root cause: TTL is set too short.

Solution: Renew the TTL every time a message is added, and set TTL appropriate to the business requirements. 2 hours for customer support and 24 hours for shopping assistants are typical. Sending a warning message before expiration is also recommended.

Summary Drift (Cumulative Summary Errors)

Symptom: Facts become distorted or hallucinations are included as progressive summarization is repeated.

Root cause: Information loss and distortion accumulate as summaries of summaries are repeated.

Solution: Regenerate summaries from original messages every 5-10 summarization cycles. Extract factual information like numbers, dates, and proper nouns separately from the summary and store them in state.
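Both parts of this fix can be sketched briefly: a counter that decides when to rebuild the summary from the original messages, and a regex pass that pulls hard facts out of the text so they never depend on the summary. The regex patterns and the class name are hypothetical and would need adapting to real identifiers.

```python
import re

FACT_PATTERNS = {  # hypothetical patterns; adapt to your own identifiers
    "order_id": re.compile(r"\bORD-\d{4}-\d{2}-\d{4}\b"),
    "amount_won": re.compile(r"\b\d{1,3}(?:,\d{3})+ won\b"),
}


def extract_facts(text: str) -> dict[str, list[str]]:
    """Collect literal facts to store in state, separate from the summary."""
    found = {name: pat.findall(text) for name, pat in FACT_PATTERNS.items()}
    return {name: vals for name, vals in found.items() if vals}


class DriftGuard:
    """Decide, per summarization cycle, whether to merge into the existing
    summary or rebuild it from the original messages."""

    def __init__(self, rebuild_every: int = 5) -> None:
        self.rebuild_every = rebuild_every
        self.cycles = 0

    def next_mode(self) -> str:
        self.cycles += 1
        return "rebuild" if self.cycles % self.rebuild_every == 0 else "merge"


facts = extract_facts(
    "The order number is ORD-2026-03-1234. The refund of 3,000 won has been processed."
)
guard = DriftGuard(rebuild_every=3)
modes = [guard.next_mode() for _ in range(4)]
print(facts, modes)
```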

State Conflicts from Concurrent Requests

Symptom: When a user sends messages in rapid succession, responses are mixed up or state breaks.

Root cause: Two concurrently executing requests read and write the same session state, causing race conditions.

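Solution: Use Redis `WATCH`/`MULTI`/`EXEC` transactions or distributed locks so that only one request per session mutates state at a time. With LangGraph, serialize invocations that share a `thread_id` in the same way; the checkpointer persists state per thread, but it should not be assumed to order concurrent runs for you.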
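The idea of serializing requests per session can be illustrated in-process. The sketch below uses one `threading.Lock` per session id; the class and function names are hypothetical, and a multi-process deployment would need a distributed lock instead.

```python
import threading
from collections import defaultdict


class SessionLocks:
    """One lock per session id so turns for the same session run one at a time."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)

    def for_session(self, session_id: str) -> threading.Lock:
        with self._guard:  # protect the dict of locks itself
            return self._locks[session_id]


locks = SessionLocks()
state = {"turn_count": 0}


def handle_turn(session_id: str) -> None:
    with locks.for_session(session_id):
        current = state["turn_count"]      # read
        state["turn_count"] = current + 1  # modify and write back


threads = [threading.Thread(target=handle_turn, args=("abc123",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state["turn_count"])
```

Requests for different sessions take different locks, so they still run in parallel.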

Operational Checklist

Items to verify before deploying a production multi-turn chatbot.

**Memory and Context Management:**

- Memory type selection completed (Buffer, Summary Buffer, Vector, etc.)

- Token budget allocation defined (system prompt, conversation history, RAG, response reservation)

- Context compression threshold set (trigger at 80% or above)

- Required preservation fields specified in summarization prompt

**Session Management:**

- Redis or persistent store connection confirmed

- Session TTL configured (appropriate for service type)

- Concurrent request handling strategy established (locks, queues, sequential execution)

- User notification logic implemented for session expiration

**Monitoring:**

- Average tokens per turn tracked

- Summarization call frequency and cost monitored

- Repeated question rate due to context loss measured

- Average session duration and turn count tracked

**Incident Response:**

- In-memory fallback logic for Redis failures

- Original message retention strategy for summarization LLM call failures

- State recovery procedures documented

Failure Cases

Case 1: Infinite Context Expansion

In a customer support chatbot project, Buffer Memory was used with no context management under the policy "never allow information loss." Initially there were no issues, but as the average turn count exceeded 30, API costs surged from 5 million won to 20 million won per month. Response latency also increased from an average of 2 seconds to 8 seconds.

Lesson: Keeping all messages is not always optimal. After switching to Summary Buffer Memory and setting a token budget of 4,000, costs were reduced by 70% with no significant change in customer satisfaction.

Case 2: The Pitfall of Free-Form Summarization

In a travel booking chatbot using free-form summarization, incidents repeatedly occurred where departure and arrival dates were swapped or the number of travelers was omitted during summarization. The information "2 people" from the customer was dropped from the summary, leading to bookings processed at single-person rates.

Lesson: Slot-based structured summarization must be used for task-oriented conversations. Required slots like departure, destination, dates, number of travelers, and seat class were specified, and validation logic was added to verify these values are present in the summary.
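The validation step could be as simple as the check below. It assumes the summarizer is prompted to return a JSON object with one key per slot, which is an assumption of this sketch and not something stated in the case; the slot names follow the list above.

```python
import json

REQUIRED_SLOTS = ("departure", "destination", "dates", "travelers", "seat_class")


def missing_slots(summary_json: str) -> list[str]:
    """Return required slots that are absent or empty in a JSON summary."""
    try:
        summary = json.loads(summary_json)
    except json.JSONDecodeError:
        return list(REQUIRED_SLOTS)
    if not isinstance(summary, dict):
        return list(REQUIRED_SLOTS)
    return [s for s in REQUIRED_SLOTS if summary.get(s) in (None, "", [])]


complete = json.dumps({
    "departure": "Seoul", "destination": "Tokyo",
    "dates": ["2026-05-01", "2026-05-05"], "travelers": 2, "seat_class": "economy",
})
dropped = json.dumps({
    "departure": "Seoul", "destination": "Tokyo",
    "dates": ["2026-05-01", "2026-05-05"], "seat_class": "economy",
})

print(missing_slots(complete))  # []
print(missing_slots(dropped))   # ['travelers']
```

When the list is non-empty, the summary can be rejected and regenerated, or the missing slot can be restored from the separately stored state.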

Case 3: Session Isolation Failure

In a multi-tenant SaaS chatbot, the session key was constructed using only `user_id`. When the same user attempted conversations on different topics in multiple browser tabs, the two conversations' states mixed, producing irrelevant responses.

Lesson: Session keys should be composite keys of `user_id + session_id`. A unique `session_id` was issued for each browser tab, and the user dashboard was updated to manage the list of active sessions.
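A composite key builder along these lines might look as follows. The `tenant_id` component and the key layout are assumptions added for the multi-tenant setting; the case itself only specifies `user_id + session_id`.

```python
import uuid


def new_session_id() -> str:
    """Issue one id per browser tab / conversation."""
    return uuid.uuid4().hex


def session_key(tenant_id: str, user_id: str, session_id: str, suffix: str) -> str:
    """Build a key that is unique per tenant, user, and conversation."""
    for part in (tenant_id, user_id, session_id):
        if not part or ":" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return f"chat:{tenant_id}:{user_id}:{session_id}:{suffix}"


tab_a = new_session_id()
tab_b = new_session_id()
key_a = session_key("acme", "user-42", tab_a, "history")
key_b = session_key("acme", "user-42", tab_b, "history")
print(key_a != key_b)  # True: two tabs of the same user no longer share state
```

Rejecting `:` inside components keeps one user's ids from colliding with another key's structure.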

References

- [LangGraph Memory Official Docs - Memory overview](https://docs.langchain.com/oss/python/langgraph/memory)

- [LlamaIndex Chat Engine Context Mode](https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_context/)

- [LLMLingua - Prompt Compression for Accelerated Inference](https://llmlingua.com/)

- [Microsoft LLMLingua GitHub - LLMLingua-2](https://github.com/microsoft/LLMLingua)

- [Redis for GenAI Apps - Session Management](https://redis.io/docs/latest/develop/get-started/redis-in-ai/)

- [Context Window Management Strategies for Long-Context AI Agents and Chatbots](https://www.getmaxim.ai/articles/context-window-management-strategies-for-long-context-ai-agents-and-chatbots/)

- [LangGraph Checkpointing Best Practices 2025](https://sparkco.ai/blog/mastering-langgraph-checkpointing-best-practices-for-2025/)

- [Prompt Compression Survey - NAACL 2025](https://github.com/ZongqianLi/Prompt-Compression-Survey)
