Multi-Turn Chatbot Conversation State Management and Context Compression Strategies 2026


Overview

A multi-turn chatbot processes continuous conversations spanning multiple turns, not just single question-answer pairs. When a user references "what was said earlier" or asks follow-up questions that depend on context, the chatbot must accurately remember prior conversation content and maintain appropriate context. As of 2026, Claude 4 Sonnet offers a 200K token context window and GPT-5 offers 400K tokens, but a longer context window is not always better.

As the context window grows, attention computation cost grows quadratically (O(n^2)) and API costs rise in proportion to input length. The more serious problem is "context rot": research has confirmed that model accuracy and recall degrade as input length increases. Therefore, rather than unconditionally stuffing the entire conversation history into the prompt, strategies for selecting and compressing key information are essential.

This article covers memory architectures for managing multi-turn chatbot conversation state, compression techniques for efficiently utilizing the context window, LangGraph state machine implementation, Redis-based session persistence, and token budget management -- all with production-ready strategies.

Challenges of Multi-Turn Conversations

Token Limits and Cost Issues

The first wall you hit in multi-turn conversations is the token limit. Taking a customer support chatbot as an example, it is common for sessions to exceed 50 turns. At an average of 200 tokens per turn, 10,000 tokens are consumed by conversation history alone at the 50-turn mark. Add system prompts, RAG documents, and function call results, and the token budget shrinks rapidly.

Loss of Critical Information

If you use a sliding window that keeps only the most recent N messages, core requirements set by the user early in the conversation disappear. If the user said "my budget is under 5 million won" 10 turns ago and that message slides out of the window, the chatbot ends up making irrelevant recommendations.
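One common mitigation, sketched below under illustrative assumptions (the keyword-based pinning criterion is a stand-in; real systems classify constraint messages with rules or an LLM), is to "pin" constraint-bearing messages so they are exempt from window eviction:

```python
# Sketch: a sliding window that exempts "pinned" messages from eviction.
# PIN_KEYWORDS is illustrative; production systems typically detect
# constraints with an LLM call or a rules engine instead.
PIN_KEYWORDS = ("budget", "deadline", "allergy", "must", "under")


def is_pinned(message: dict) -> bool:
    """Treat messages stating constraints or preferences as non-evictable."""
    content = message["content"].lower()
    return any(kw in content for kw in PIN_KEYWORDS)


def window_with_pins(messages: list[dict], k: int = 6) -> list[dict]:
    """Keep the last k messages plus any older pinned messages, in order."""
    recent = messages[-k:]
    pinned_older = [m for m in messages[:-k] if is_pinned(m)]
    return pinned_older + recent


history = [
    {"role": "user", "content": "My budget is under 5 million won"},
    *[{"role": "user", "content": f"turn {i}"} for i in range(10)],
]
ctx = window_with_pins(history, k=3)
# The budget constraint survives even though it slid out of the recent window
```

This keeps the window's predictable token profile while protecting the handful of messages whose loss causes irrelevant recommendations.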

Complexity of State Management

Beyond simple Q&A, task-oriented conversations like bookings, orders, and troubleshooting require managing structured state such as the current step, collected information (slots), and confirmation status. This state must be tracked separately from the conversation history and must be reset or branched under specific conditions.

Concurrent Session Isolation

In production environments, hundreds to thousands of users converse simultaneously. Each user's conversation state must be isolated to prevent cross-contamination, and if a user closes their browser and reopens it, the previous state must be restored.

Memory Architecture Types

LangChain and LlamaIndex provide various memory types for multi-turn conversations. Understanding the pros and cons of each and selecting the right combination for your situation is critical.

Memory Type Comparison

| Memory Type | Storage Method | Token Usage | Information Fidelity | Suitable Scenarios |
|---|---|---|---|---|
| Buffer Memory | Stores entire conversation history | High (linear growth) | Very high | Short conversations, debugging |
| Window Memory | Keeps only the most recent K messages | Fixed (window size) | Medium (early info lost) | General customer support |
| Summary Memory | Generates summary via LLM | Low (summary length) | Low (detail loss) | Long conversations, cost savings |
| Summary Buffer | Hybrid of summary + recent buffer | Medium | High | Most production use cases |
| Vector Memory | Retrieves relevant conversations via embeddings | Variable (search results) | High (relevance-based) | Long-term memory, cross-session |

ConversationBufferMemory vs ConversationSummaryBufferMemory

Buffer Memory is the simplest approach. It stores all messages as-is and feeds them into the prompt. It is easy to debug and has no information loss, but the token limit is quickly reached as conversations grow longer.

Summary Buffer Memory is the most commonly used approach in practice. It keeps recent messages in their original form while compressing older messages into summaries via LLM calls. Summarization is automatically triggered when the token count exceeds a configured threshold.

from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Automatically summarizes old messages when max_token_limit is exceeded
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
    memory_key="chat_history",
    human_prefix="Customer",
    ai_prefix="Agent",
)

# Save conversations
memory.save_context(
    {"input": "I'm considering buying a laptop and my budget is 1.5 million won"},
    {"output": "I'll recommend a great laptop within your 1.5 million won budget. What will be the primary use?"},
)
memory.save_context(
    {"input": "I'll be doing programming and light video editing"},
    {"output": "For development and video editing, I recommend specs with at least 16GB RAM and 512GB SSD."},
)
memory.save_context(
    {"input": "Which is better, the MacBook Air M4 or Lenovo ThinkPad?"},
    {"output": "Both are excellent products, but there are differences depending on use case. The MacBook Air M4 is ..."},
)

# Automatic summary + recent messages retained when token limit is exceeded
loaded = memory.load_memory_variables({})
print(loaded["chat_history"])
# SystemMessage: "The customer is looking for a laptop for programming and video editing with a 1.5M won budget..."
# + recent original messages

LlamaIndex ChatSummaryMemoryBuffer

LlamaIndex provides a similar mechanism. ChatSummaryMemoryBuffer periodically summarizes older messages when the configured token limit is exceeded, while keeping recent messages in their original form.

from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", temperature=0)

memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=3000,
    # Summary trigger token ratio (triggers when exceeding 70% of total limit)
    summarize_threshold=0.7,
)

# Connect memory to Chat Engine
from llama_index.core.chat_engine import CondensePlusContextChatEngine

chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=3),  # assumes `index` is a pre-built VectorStoreIndex
    memory=memory,
    llm=llm,
    system_prompt="You are a technical support specialist.",
)

response = chat_engine.chat("Tell me more about the error code I mentioned earlier")

Context Window Management

Sliding Window Strategy

The sliding window is the most intuitive context management method. It keeps only the most recent K messages and discards the rest. It is simple to implement and token usage is predictable, but the downside is that early conversation information is completely lost.
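The message-count variant described above fits in a few lines (assuming OpenAI-style role/content message dicts):

```python
def sliding_window(messages: list[dict], k: int = 10) -> list[dict]:
    """Keep system messages plus the most recent k non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    return system + non_system[-k:]


# Build a 40-message history behind one system prompt
history = [{"role": "system", "content": "You are a support agent."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

print(len(sliding_window(history, k=10)))  # 11: system prompt + last 10 messages
```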

An improved sliding window determines window size based on token count rather than simply message count. Twenty short messages and five long messages should not be treated the same.

Token-Based Window Implementation

import tiktoken


def sliding_window_by_tokens(
    messages: list[dict],
    max_tokens: int = 4000,
    model: str = "gpt-4o",
    always_keep_system: bool = True,
) -> list[dict]:
    """Token count-based sliding window.
    Always keeps system messages, fills from recent messages in reverse order.
    """
    enc = tiktoken.encoding_for_model(model)
    result = []
    current_tokens = 0

    # Reserve system messages first
    system_messages = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    if always_keep_system:
        for sm in system_messages:
            sm_tokens = len(enc.encode(sm["content"]))
            result.append(sm)
            current_tokens += sm_tokens

    # Add from most recent messages in reverse order
    selected = []
    for msg in reversed(non_system):
        msg_tokens = len(enc.encode(msg["content"]))
        if current_tokens + msg_tokens > max_tokens:
            break
        selected.append(msg)
        current_tokens += msg_tokens

    result.extend(reversed(selected))
    return result

Context Window Management Method Comparison

| Management Method | Implementation Complexity | Token Efficiency | Information Preservation | Latency |
|---|---|---|---|---|
| Message count-based sliding | Very low | Medium | Low | None |
| Token count-based sliding | Low | High | Low | Very low |
| Summary + sliding | Medium | High | High | Medium (LLM call) |
| Vector search-based | High | Very high | High | Medium (embedding + search) |
| Hybrid (summary + vector) | Very high | Very high | Very high | High |

Conversation Summarization Strategies

Conversation summarization is a core technique for saving context window space while preserving key information. Simply saying "summarize the conversation" can cause important details to be omitted, so structured summarization prompts should be used.

Progressive Summarization

Rather than summarizing the entire conversation at once, this approach merges new content into the existing summary at regular turn intervals. This approach yields high summary quality at low cost.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

PROGRESSIVE_SUMMARY_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are a customer support conversation summarization expert.
Merge the existing summary with the new conversation to produce an updated summary.

Rules:
1. Always preserve the customer's core requirements, constraints, and preferences
2. Never omit confirmed facts (names, order numbers, dates, etc.)
3. Specify the current progress stage and next required action
4. Record resolved issues briefly, unresolved issues in detail
5. Keep within 200 characters"""),
    ("human", """Existing summary:
{existing_summary}

New conversation:
{new_messages}

Updated summary:"""),
])


class ProgressiveSummarizer:
    def __init__(self, llm, summary_interval: int = 5):
        self.llm = llm
        self.chain = PROGRESSIVE_SUMMARY_PROMPT | llm
        self.summary = ""
        self.buffer = []
        self.summary_interval = summary_interval
        self.turn_count = 0

    def add_turn(self, user_msg: str, assistant_msg: str):
        self.buffer.append(f"Customer: {user_msg}")
        self.buffer.append(f"Agent: {assistant_msg}")
        self.turn_count += 1

        if self.turn_count % self.summary_interval == 0:
            self._update_summary()

    def _update_summary(self):
        new_messages = "\n".join(self.buffer)
        result = self.chain.invoke({
            "existing_summary": self.summary or "(none)",
            "new_messages": new_messages,
        })
        self.summary = result.content
        self.buffer = []  # Clear buffer

    def get_context(self) -> str:
        """Return context combining summary + recent buffer"""
        parts = []
        if self.summary:
            parts.append(f"[Conversation Summary]\n{self.summary}")
        if self.buffer:
            parts.append(f"[Recent Conversation]\n" + "\n".join(self.buffer))
        return "\n\n".join(parts)

Structured vs Free-Form Summarization

| Summarization Method | Pros | Cons | Recommended Scenarios |
|---|---|---|---|
| Free-form summarization | Simple implementation, flexible | May miss critical info | General chatbots |
| Slot-based structured | Guarantees required info | Requires prompt design | Booking/ordering chatbots |
| Key-value extraction | Searchable/filterable | May lose context | Data collection purposes |
| Progressive merge | Cost-efficient, high quality | Possible cumulative errors | Long support sessions |
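As a concrete illustration of the slot-based structured approach, here is a minimal validation sketch: define the required slots, then check that a generated summary still carries them. The slot names and regex patterns are illustrative assumptions; in practice an LLM structured-output call would fill the slots and the check runs on its result.

```python
import re

# Illustrative slot schema for a booking chatbot; real systems would fill
# these via an LLM structured-output call, not regex alone.
REQUIRED_SLOTS = {
    "order_id": re.compile(r"ORD-\d{4}-\d{2}-\d{4}"),
    "traveler_count": re.compile(r"\d+\s*(?:people|person)"),
}


def validate_summary(summary: str) -> list[str]:
    """Return the names of required slots missing from the summary."""
    return [name for name, pat in REQUIRED_SLOTS.items() if not pat.search(summary)]


good = "Booking for 2 people, order ORD-2026-03-1234, departure confirmed."
bad = "Customer confirmed the booking details."
print(validate_summary(good))  # []
print(validate_summary(bad))   # ['order_id', 'traveler_count']
```

If validation fails, the summarization call can be retried with the missing slot names injected into the prompt, rather than silently accepting a lossy summary.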

Context Compression Techniques

Prompt Compression with LLMLingua

Microsoft's LLMLingua series can compress prompts up to 20x while minimizing performance degradation. It removes unimportant tokens based on the perplexity of a small language model. LLMLingua-2 is trained on GPT-4 distillation data, enabling domain-agnostic general-purpose compression that is 3-6x faster than the original LLMLingua.

from llmlingua import PromptCompressor

# Initialize LLMLingua-2
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",  # Use "cuda" for GPU
)

# Compress long conversation history
conversation_history = """
Customer: Hello, I'd like to inquire about an order I placed last week.
Agent: Hello! Please provide your order number and I'll check for you.
Customer: The order number is ORD-2026-03-1234. My delivery hasn't arrived yet.
Agent: Let me check on that. Please wait a moment.
Agent: I've confirmed order number ORD-2026-03-1234. It's currently in transit and scheduled to arrive tomorrow.
Customer: Tomorrow? It was supposed to arrive yesterday. Why is it delayed?
Agent: It was delayed by one day due to logistics center circumstances. We apologize for the inconvenience.
Customer: Can I get a shipping fee refund?
Agent: Yes, a shipping fee refund is available due to the delivery delay. Shall I proceed with the refund?
Customer: Yes, please.
Agent: The shipping fee refund of 3,000 won has been processed. It will be refunded to the original payment method within 1-3 days.
"""

compressed = compressor.compress_prompt(
    conversation_history,
    rate=0.5,  # 50% compression rate
    force_tokens=["order number", "ORD-2026-03-1234", "refund"],  # Tokens to always keep
)

print(f"Original tokens: {compressed['origin_tokens']}")
print(f"Compressed tokens: {compressed['compressed_tokens']}")
print(f"Compression ratio: {compressed['ratio']:.1f}x")
print(f"Compressed result:\n{compressed['compressed_prompt']}")

Compression Technique Comparison

| Compression Technique | Ratio | Performance Retention | Speed | Training Required |
|---|---|---|---|---|
| LLMLingua | Up to 20x | High | Medium | No (inference only) |
| LLMLingua-2 | Up to 20x | Very high | Fast (3-6x) | No |
| LongLLMLingua | Up to 4x | Very high | Medium | No |
| LLM summarization | Variable | Medium | Slow (LLM call) | No |
| Rule-based filtering | 2-3x | Low | Very fast | No |
| Selective Context | Up to 10x | High | Fast | No |

LangGraph State Machine Implementation

LangGraph can model conversation flows as graph-based state machines. Unlike LangChain's traditional memory approach, it uses explicit state schemas and reducer functions to reliably manage complex multi-turn workflows. State is automatically persisted through checkpointers, making session restoration seamless.

from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage


# 1. State schema definition
class OrderSupportState(TypedDict):
    messages: Annotated[list, add_messages]  # Accumulate messages with reducer
    order_id: str | None
    issue_type: str | None  # "shipping", "refund", "exchange", "other"
    step: str  # "greeting", "identify", "diagnose", "resolve", "close"
    collected_info: dict
    summary: str  # Previous conversation summary


# 2. Node function definitions
llm = ChatOpenAI(model="gpt-4o", temperature=0)


def greeting_node(state: OrderSupportState) -> dict:
    """Greeting and initial classification"""
    response = llm.invoke([
        SystemMessage(content="You are a customer support chatbot. Identify the customer's inquiry type."),
        *state["messages"],
    ])
    return {
        "messages": [response],
        "step": "identify",
    }


def identify_node(state: OrderSupportState) -> dict:
    """Identify order number and issue type"""
    context_parts = []
    if state.get("summary"):
        context_parts.append(f"Previous conversation summary: {state['summary']}")

    system_msg = f"""Identify the customer's order number and problem type.
Collected info: {state.get('collected_info', dict())}
{chr(10).join(context_parts)}"""

    response = llm.invoke([
        SystemMessage(content=system_msg),
        *state["messages"][-10:],  # Use only the last 10 messages
    ])

    # Extract order number from response (more sophisticated parsing needed in practice)
    return {
        "messages": [response],
        "step": "diagnose",
    }


def resolve_node(state: OrderSupportState) -> dict:
    """Propose problem resolution"""
    response = llm.invoke([
        SystemMessage(content=f"Issue type: {state.get('issue_type', 'unidentified')}. "
                              f"Order number: {state.get('order_id', 'unidentified')}. Propose a solution."),
        *state["messages"][-6:],
    ])
    return {
        "messages": [response],
        "step": "close",
    }


# 3. Routing function
def route_by_step(state: OrderSupportState) -> str:
    step = state.get("step", "greeting")
    if step == "greeting":
        return "greeting"
    elif step == "identify":
        return "identify"
    elif step in ("diagnose", "resolve"):
        return "resolve"
    else:
        return END


# 4. Graph construction
graph = StateGraph(OrderSupportState)
graph.add_node("greeting", greeting_node)
graph.add_node("identify", identify_node)
graph.add_node("resolve", resolve_node)

graph.add_conditional_edges(START, route_by_step)
graph.add_conditional_edges("greeting", route_by_step)
graph.add_conditional_edges("identify", route_by_step)
graph.add_conditional_edges("resolve", route_by_step)

# 5. Persist state with checkpointer
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# 6. Execute (sessions distinguished by thread_id)
config = {"configurable": {"thread_id": "user-session-abc123"}}
result = app.invoke(
    {
        "messages": [HumanMessage(content="My ordered item hasn't arrived yet")],
        "step": "greeting",
        "collected_info": {},
        "summary": "",
    },
    config=config,
)

LangGraph's checkpointer automatically saves state after each node execution. MemorySaver is in-memory storage suitable for development/testing; in production, you should use SqliteSaver, PostgresSaver, or MongoDB Store. Sessions are distinguished by thread_id, allowing multiple users' conversations to be isolated simultaneously.

Session Management and Persistence

Redis-Based Session Store

When persisting conversation state in production, Redis is the most common choice. It supports low-latency reads/writes, TTL-based automatic expiration, and real-time notifications via Pub/Sub.

import json
import time
import redis
import tiktoken


class ChatSessionManager:
    """Redis-based multi-turn conversation session manager"""

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        session_ttl: int = 3600,  # 1 hour
        max_history_tokens: int = 4000,
    ):
        self.redis = redis.from_url(redis_url, decode_responses=True)
        self.session_ttl = session_ttl
        self.max_history_tokens = max_history_tokens
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def _key(self, session_id: str, suffix: str) -> str:
        return f"chat:session:{session_id}:{suffix}"

    def create_session(self, session_id: str, metadata: dict | None = None) -> dict:
        """Create a new session"""
        session_data = {
            "session_id": session_id,
            "created_at": time.time(),
            "updated_at": time.time(),
            "turn_count": 0,
            "total_tokens": 0,
            "metadata": json.dumps(metadata or {}),
            "summary": "",
        }
        self.redis.hset(self._key(session_id, "meta"), mapping=session_data)
        self.redis.expire(self._key(session_id, "meta"), self.session_ttl)
        return session_data

    def add_message(self, session_id: str, role: str, content: str) -> None:
        """Add a message and manage tokens"""
        msg = json.dumps({
            "role": role,
            "content": content,
            "timestamp": time.time(),
            "tokens": len(self.encoder.encode(content)),
        })
        history_key = self._key(session_id, "history")
        self.redis.rpush(history_key, msg)
        self.redis.expire(history_key, self.session_ttl)

        # Update metadata
        self.redis.hincrby(self._key(session_id, "meta"), "turn_count", 1)
        self.redis.hset(
            self._key(session_id, "meta"), "updated_at", str(time.time())
        )
        # Renew TTL
        self.redis.expire(self._key(session_id, "meta"), self.session_ttl)

    def get_context_messages(self, session_id: str) -> list[dict]:
        """Return context messages within the token budget"""
        history_key = self._key(session_id, "history")
        all_messages = self.redis.lrange(history_key, 0, -1)

        if not all_messages:
            return []

        parsed = [json.loads(m) for m in all_messages]
        result = []
        token_count = 0

        # Add summary first if available
        summary = self.redis.hget(self._key(session_id, "meta"), "summary")
        if summary:
            summary_tokens = len(self.encoder.encode(summary))
            token_count += summary_tokens
            result.append({"role": "system", "content": f"Previous conversation summary: {summary}"})

        # Add recent messages in reverse order within token budget
        selected = []
        for msg in reversed(parsed):
            msg_tokens = msg.get("tokens", 0)
            if token_count + msg_tokens > self.max_history_tokens:
                break
            selected.append({"role": msg["role"], "content": msg["content"]})
            token_count += msg_tokens

        result.extend(reversed(selected))
        return result

    def update_summary(self, session_id: str, summary: str) -> None:
        """Update conversation summary"""
        self.redis.hset(self._key(session_id, "meta"), "summary", summary)

    def delete_session(self, session_id: str) -> None:
        """Delete session"""
        for suffix in ("meta", "history"):
            self.redis.delete(self._key(session_id, suffix))

Session Store Comparison

| Store | Latency | Persistence | Scalability | TTL Support | Suitable Scenarios |
|---|---|---|---|---|---|
| In-memory (dict) | Nanoseconds | None | Single process | Manual impl | Development/testing |
| Redis | Milliseconds | Conditional (AOF/RDB) | Cluster support | Built-in | Production real-time |
| PostgreSQL | Few ms | Full | High | Trigger impl | Audit logging required |
| MongoDB | Few ms | Full | Sharding support | TTL index | Unstructured state |
| DynamoDB | Few ms | Full | Unlimited | TTL built-in | AWS-based services |

In production, the common hybrid pattern is to use Redis as the main session store while asynchronously flushing to PostgreSQL or MongoDB. Reading current conversation state quickly from Redis and saving the complete history to a relational DB at conversation end provides both performance and persistence.
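The data-shaping half of that flush can be sketched as follows, assuming the JSON message format produced by the ChatSessionManager above and a hypothetical archive table with columns (session_id, seq, role, content, created_at):

```python
import json
import time


def build_archive_rows(session_id: str, raw_history: list[str]) -> list[tuple]:
    """Convert Redis list entries (JSON strings) into parameter tuples for a
    bulk INSERT into an archive table. The table layout is illustrative."""
    rows = []
    for seq, raw in enumerate(raw_history):
        msg = json.loads(raw)
        rows.append((session_id, seq, msg["role"], msg["content"], msg["timestamp"]))
    return rows


history = [
    json.dumps({"role": "user", "content": "Where is my order?", "timestamp": time.time()}),
    json.dumps({"role": "assistant", "content": "Checking now.", "timestamp": time.time()}),
]
rows = build_archive_rows("abc123", history)
print(rows[0][:4])  # ('abc123', 0, 'user', 'Where is my order?')
```

At conversation end, a background worker would LRANGE the Redis list, pass these rows to `cursor.executemany(...)` against PostgreSQL, and then delete the Redis keys.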

Token Budget Management

In production chatbots, token budget management is central to cost control and response quality. A strategy is needed to divide the model's context window into allocations for system prompts, conversation history, RAG documents, and response reservation.

import tiktoken
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Token budget allocation calculator"""
    model: str = "gpt-4o"
    max_context: int = 128000  # gpt-4o context window
    system_prompt_tokens: int = 500
    response_reserve: int = 4000  # Reserved for response
    rag_budget: int = 3000  # For RAG documents
    tool_result_budget: int = 2000  # For tool execution results

    def __post_init__(self):
        self.encoder = tiktoken.encoding_for_model(self.model)

    @property
    def conversation_budget(self) -> int:
        """Number of tokens available for conversation history"""
        reserved = (
            self.system_prompt_tokens
            + self.response_reserve
            + self.rag_budget
            + self.tool_result_budget
        )
        return self.max_context - reserved

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def allocate(self, messages: list[dict]) -> dict:
        """Token usage status report for current messages"""
        msg_tokens = sum(
            self.count_tokens(m.get("content", "")) for m in messages
        )
        budget = self.conversation_budget
        return {
            "total_context": self.max_context,
            "system_prompt": self.system_prompt_tokens,
            "response_reserve": self.response_reserve,
            "rag_budget": self.rag_budget,
            "tool_result_budget": self.tool_result_budget,
            "conversation_budget": budget,
            "conversation_used": msg_tokens,
            "conversation_remaining": budget - msg_tokens,
            "utilization_pct": round(msg_tokens / budget * 100, 1),
            "needs_compression": msg_tokens > budget * 0.8,
        }


# Usage example
budget = TokenBudget(model="gpt-4o", max_context=128000)
print(f"Available tokens for conversation history: {budget.conversation_budget:,}")

report = budget.allocate([
    {"role": "user", "content": "Please check my previous order status"},
    {"role": "assistant", "content": "Could you provide your order number?"},
])
print(f"Utilization: {report['utilization_pct']}%")
print(f"Compression needed: {report['needs_compression']}")

Triggering compression automatically when 80% of the token budget is exceeded is a good practice. This threshold should be adjusted based on service characteristics. For customer support where accuracy is critical, lower it to 70%; for casual conversations, 90% is acceptable.
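That per-service threshold policy can be encoded directly; the service-type names below are illustrative assumptions:

```python
# Thresholds matching the guidance above: 70% for accuracy-critical support,
# 80% default, 90% for casual conversation. Service names are illustrative.
THRESHOLDS = {"support": 0.7, "general": 0.8, "casual": 0.9}


def needs_compression(used_tokens: int, budget: int, service: str = "general") -> bool:
    """Trigger compression when utilization crosses the service's threshold."""
    return used_tokens / budget >= THRESHOLDS.get(service, 0.8)


print(needs_compression(2900, 4000, "support"))  # True (72.5% >= 70%)
print(needs_compression(2900, 4000, "casual"))   # False (72.5% < 90%)
```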

Troubleshooting

Repeated Questions Due to Context Loss

Symptom: The chatbot asks again for information it has already collected.

Diagnostic order:

  1. Check if the sliding window size is too small. The key information may be in messages that have slid out of the window.
  2. Check if summarization is working properly. The summarization prompt may be omitting critical slot information (names, order numbers, etc.).
  3. Check if collected_info is being properly updated in the state management logic.

Solution: Specify a "fields that must be preserved" list in the structured summarization prompt. Manage slot information in a separate state dictionary.

Conversation Disconnection Due to Redis Session Expiration

Symptom: When a user leaves briefly and returns, the chatbot starts from the beginning.

Root cause: TTL is set too short.

Solution: Renew the TTL every time a message is added, and set TTL appropriate to the business requirements. 2 hours for customer support and 24 hours for shopping assistants are typical. Sending a warning message before expiration is also recommended.

Summary Drift (Cumulative Summary Errors)

Symptom: Facts become distorted or hallucinations are included as progressive summarization is repeated.

Root cause: Information loss and distortion accumulate as summaries of summaries are repeated.

Solution: Regenerate summaries from original messages every 5-10 summarization cycles. Extract factual information like numbers, dates, and proper nouns separately from the summary and store them in state.

State Conflicts from Concurrent Requests

Symptom: When a user sends messages in rapid succession, responses are mixed up or state breaks.

Root cause: Two concurrently executing requests read and write the same session state, causing race conditions.

Solution: Use Redis WATCH/MULTI/EXEC transactions or distributed locks. With LangGraph, the checkpointer guarantees sequential execution, naturally resolving this issue.
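For single-process deployments, the sequential-execution remedy can also be applied at the application layer with one asyncio lock per session, a minimal sketch:

```python
import asyncio
from collections import defaultdict

# One lock per session_id: requests for the same session run sequentially,
# while requests for different sessions still execute concurrently.
session_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
log: list[str] = []


async def handle_message(session_id: str, text: str) -> None:
    async with session_locks[session_id]:
        log.append(f"{session_id}:start:{text}")
        await asyncio.sleep(0.01)  # simulate the LLM call + state write
        log.append(f"{session_id}:end:{text}")


async def main() -> None:
    # Two rapid-fire messages for the same session must not interleave
    await asyncio.gather(
        handle_message("s1", "a"),
        handle_message("s1", "b"),
    )


asyncio.run(main())
print(log)  # each turn's start/end pair stays adjacent; no interleaving
```

Note this only protects a single process; multi-instance deployments still need the Redis transaction or distributed-lock approach.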

Operational Checklist

Items to verify before deploying a production multi-turn chatbot.

Memory and Context Management:

  • Memory type selection completed (Buffer, Summary Buffer, Vector, etc.)
  • Token budget allocation defined (system prompt, conversation history, RAG, response reservation)
  • Context compression threshold set (trigger at 80% or above)
  • Required preservation fields specified in summarization prompt

Session Management:

  • Redis or persistent store connection confirmed
  • Session TTL configured (appropriate for service type)
  • Concurrent request handling strategy established (locks, queues, sequential execution)
  • User notification logic implemented for session expiration

Monitoring:

  • Average tokens per turn tracked
  • Summarization call frequency and cost monitored
  • Repeated question rate due to context loss measured
  • Average session duration and turn count tracked

Incident Response:

  • In-memory fallback logic for Redis failures
  • Original message retention strategy for summarization LLM call failures
  • State recovery procedures documented

Failure Cases

Case 1: Infinite Context Expansion

In a customer support chatbot project, Buffer Memory was used with no context management under the policy "never allow information loss." Initially there were no issues, but as the average turn count exceeded 30, API costs surged from 5 million won to 20 million won per month. Response latency also increased from an average of 2 seconds to 8 seconds.

Lesson: Keeping all messages is not always optimal. After switching to Summary Buffer Memory and setting a token budget of 4,000, costs were reduced by 70% with no significant change in customer satisfaction.

Case 2: The Pitfall of Free-Form Summarization

In a travel booking chatbot using free-form summarization, incidents repeatedly occurred where departure and arrival dates were swapped or the number of travelers was omitted during summarization. The information "2 people" from the customer was dropped from the summary, leading to bookings processed at single-person rates.

Lesson: Slot-based structured summarization must be used for task-oriented conversations. Required slots like departure, destination, dates, number of travelers, and seat class were specified, and validation logic was added to verify these values are present in the summary.

Case 3: Session Isolation Failure

In a multi-tenant SaaS chatbot, the session key was constructed using only user_id. When the same user attempted conversations on different topics in multiple browser tabs, the two conversations' states mixed, producing irrelevant responses.

Lesson: Session keys should be composite keys of user_id + session_id. A unique session_id was issued for each browser tab, and the user dashboard was updated to manage the list of active sessions.
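A minimal sketch of the composite-key fix (the key format mirrors the ChatSessionManager naming used earlier in this article, extended with user_id; it is illustrative, not a fixed convention):

```python
import uuid


def session_key(user_id: str, session_id: str, suffix: str) -> str:
    """Composite key: two tabs for the same user get distinct namespaces."""
    return f"chat:session:{user_id}:{session_id}:{suffix}"


# Each browser tab is issued its own session_id at connection time
tab_a = session_key("user-42", uuid.uuid4().hex, "history")
tab_b = session_key("user-42", uuid.uuid4().hex, "history")
print(tab_a != tab_b)  # True: same user, isolated sessions
```

Listing a user's active sessions then becomes a prefix scan on `chat:session:{user_id}:*`.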
