Context Engineering — How to Design Memory for AI Agents

Introduction

If you had to pick the single most frequent phrase in the AI engineering community in the first half of 2026, it would be "context engineering." Supermemory, an open-source LLM memory layer, recently trended on GeekNews, and articles about agent memory design keep hitting the Hacker News front page day after day. Now that coding agents like Claude Code and Codex routinely run autonomously for hours at a stretch, the question of "what to show the model, and how" has become far more important than polishing the wording of a prompt.

If prompt engineering was the craft of writing a single good request, context engineering is the craft of designing the entire composition of information the model sees at every moment. And at its center sits memory — the problem of designing recollection for an agent. In this post we will walk through the background of the paradigm shift, memory tier design, fact extraction pipelines, compaction strategies, evaluation methods, and pitfalls such as memory poisoning.

From Prompt Engineering to Context Engineering

What changed

LLM usage in 2023 was mostly one-shot. A user typed a question, the model answered, and when the conversation ended everything vanished. The optimization target in those days was "a single prompt," and the main concerns were whether to include few-shot examples, whether to assign a persona, and whether to elicit chain-of-thought.

Agents in 2026 are different. They call tools, read files, carry work across dozens of turns, and must reference decisions made in yesterday's session during today's session. The input delivered to the model is no longer "a single prompt" but a composite of the following elements.

  • The system prompt and tool definitions
  • Past conversation history or its summaries
  • Document fragments fetched via retrieval
  • User profiles and facts loaded from long-term memory
  • Intermediate artifacts of the current task (file contents, command outputs)

The quality of this composite is the quality of the agent. To paraphrase the Anthropic engineering blog, context engineering is the work of selecting and arranging the smallest set of high-signal tokens at each inference step, within the budget of a limited context window.

Comparing the two paradigms

| Aspect | Prompt engineering | Context engineering |
|---|---|---|
| Optimization target | Wording of a single request | Composition and flow of the whole input |
| Time horizon | One turn | Multi-turn, multi-session |
| Core techniques | Few-shot, personas, CoT elicitation | Memory tiers, retrieval, compression, isolation |
| Failure modes | Awkward answers | Context amnesia, broken consistency, cost explosion |
| Deliverables | Prompt templates | Memory schemas, pipelines, eval sets |

Prompt engineering did not disappear. The wording of the system prompt still matters. But it is now just one component of a larger design, and the bottleneck has moved to "what to remember and what to let the agent forget."

The Economics of the Context Window

Tokens are not free

Frontier model context windows have grown from 200 thousand to a million tokens, but "can fit" and "should fit" are entirely different questions. Start with cost. Assume an input price of 3 dollars per million tokens: an agent dragging a 150-thousand-token context on every turn burns roughly 0.45 dollars per turn on input alone. A 50-turn session then costs about 22.50 dollars, and a service handling a thousand sessions a day spends roughly 675 thousand dollars per 30-day month on input alone. Prompt caching can reduce this, but only when the front of the context stays stable across turns; a structure that keeps producing cache misses saves little.
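The arithmetic above can be reproduced in a few lines. This is a sketch using the figures from the paragraph (3 dollars per million input tokens, 150 thousand tokens per turn, 50 turns, a thousand sessions a day). The function name and the 30-day month are assumptions for illustration, and caching discounts are ignored.

```python
def input_cost_usd(tokens_per_turn: int, turns: int, price_per_million: float) -> float:
    """Input-token cost of one session, ignoring prompt caching."""
    return tokens_per_turn * turns * price_per_million / 1_000_000

per_turn = input_cost_usd(150_000, turns=1, price_per_million=3.0)
per_session = input_cost_usd(150_000, turns=50, price_per_million=3.0)
per_month = per_session * 1_000 * 30  # 1,000 sessions/day, assumed 30-day month

print(f"per turn:    ${per_turn:.2f}")
print(f"per session: ${per_session:.2f}")
print(f"per month:   ${per_month:,.0f}")
```

Plugging in your own price and average context size is usually the quickest way to see whether compaction or caching deserves attention first.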

Performance is not free either — lost in the middle

Worse than cost is performance degradation. The "Lost in the Middle" study by Liu et al. (2023) showed that information placed in the middle of a long context is used markedly less well than information at the beginning or end, and in the latest 2026 models this tendency has only shrunk, not vanished. There is also the phenomenon commonly called "context rot": as low-relevance information accumulates, the model misses instructions, confuses stale information with fresh information, and hallucinates more.

Retrieval accuracy (conceptual curve)

 100% |■■■
      |■■■■■
  80% |■■■■■■■           ■■■■
      |■■■■■■■■        ■■■■■■
  60% |■■■■■■■■■     ■■■■■■■■
      |■■■■■■■■■■■■■■■■■■■■■■
      +--------------------------
       Position: front  middle  end

Recall is lowest when the key information
sits in the MIDDLE of the context (lost in the middle)

Conclusion: context is a scarce resource

In summary, the context window is scarce in three senses.

  1. Monetary cost — token pricing and cache efficiency
  2. Latency — longer inputs mean longer time to first token
  3. Attention budget — the more tokens there are, the more the model's attention to each piece of information is diluted

Context engineering starts by accepting these three constraints and designing "put in selectively" instead of "put in everything." And to select, you need somewhere to keep the candidates. That is memory.

Designing the Memory Tiers

Borrowing the taxonomy of memory from cognitive science, splitting agent memory into three tiers has become the standard pattern.

+--------------------------------------------------+
|                Agent memory tiers                 |
+--------------------------------------------------+
|  Working memory                                   |
|  - The current context window itself              |
|  - Lifetime: current session, capacity: tokens    |
+--------------------------------------------------+
|  Episodic memory                                  |
|  - Event records: "what happened when"            |
|  - Session summaries, task logs, decision history |
+--------------------------------------------------+
|  Semantic memory                                  |
|  - Time-independent knowledge: "what is true"     |
|  - User profile, preferences, domain facts        |
+--------------------------------------------------+

Working memory — the context window itself

Working memory is not a separate store; it is the context the model is looking at right now. The design points are placement and priority. Put the system prompt and core instructions at the very front, the latest conversation and the immediate task at the very end, and place auxiliary material that can tolerate lower recall in the middle. For bulky items like tool execution results, define a lifetime policy and truncate them after a set number of turns.
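The placement and lifetime rules can be sketched roughly as follows. The message shape (`kind`, `turn`, `content`), the function names, and the five-turn default are all hypothetical choices for illustration, not part of any particular framework.

```python
def expire_tool_results(history: list[dict], current_turn: int,
                        max_age_turns: int = 5) -> list[dict]:
    """Replace the body of tool results older than max_age_turns with a stub."""
    out = []
    for msg in history:
        too_old = current_turn - msg["turn"] > max_age_turns
        if msg["kind"] == "tool_result" and too_old:
            # Copy instead of mutating so the full log can still be archived.
            msg = {**msg, "content": "[tool output truncated after expiry]"}
        out.append(msg)
    return out

def assemble_context(system: str, auxiliary: list[str],
                     recent: list[str], task: str) -> list[str]:
    """Instructions first, low-priority material in the middle, task last."""
    return [system, *auxiliary, *recent, task]
```

The ordering puts the two things that must not be missed (instructions and the immediate task) at the positions where recall tends to be strongest.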

Episodic memory — a record of events

Episodic memory holds event-level records such as "in last Tuesday's session we refactored the payment module and failed twice due to a transaction isolation issue." The usual implementation generates a summary with an LLM at session end (or at compaction time) and stores it with a timestamp. Injecting the most recent few episodes into the context when the next session starts makes "continue where we left off yesterday" possible.

Semantic memory — a store of facts

Semantic memory is a collection of propositions that hold regardless of when they were learned, such as "the user prefers TypeScript" or "this project uses PostgreSQL 16." The core design challenges are threefold: extraction (how to pull facts out of conversation), updating (how to overwrite when a contradicting new fact arrives), and retrieval (how to select only the facts relevant to the current query).

Comparing the tiers

| Tier | Lifetime | Storage | Retrieval | Update frequency |
|---|---|---|---|---|
| Working memory | Within a session | Context window | Always visible | Every turn |
| Episodic | Weeks to months | DB, vector store | Chronological plus similarity | Per session |
| Semantic | Semi-permanent | DB, graph | Key lookup plus similarity | When facts change |

Extracting Facts from Conversations — User Profile Patterns

This extraction step is the core capability shared by open-source memory layers: Supermemory, which trended on GeekNews, as well as mem0 and Letta (formerly MemGPT). While the conversation flows, a background pipeline extracts "facts worth remembering," compares them against existing memory, and decides whether to add, update, or discard.

Structure of the extraction pipeline

Conversation turn --> [Extractor LLM] --> candidate facts
                                              |
                                              v
                 search existing memory (similarity top-k)
                                              |
                                              v
               [Judge LLM] --> ADD / UPDATE / DELETE / NOOP
                                              |
                                              v
                          apply to memory store

Implementation example

Below is a minimal implementation separating extraction from reconciliation. In production it is common to use a small model for the extractor and a mid-size model for the judge to keep costs down.

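import json
from anthropic import Anthropic

# Anthropic() reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

EXTRACT_PROMPT = """From the following conversation, extract only facts
about the user worth remembering long term. Exclude transient states
(e.g., currently hungry). Answer with a JSON array only. Each element
has the fields fact, category, confidence.
category is one of preference, identity, project, constraint."""

def parse_json_array(text: str) -> list[dict]:
    # Models sometimes wrap the JSON in prose or a code fence, so parse only
    # the span from the first "[" to the last "]".
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    return json.loads(text[start : end + 1])

def extract_facts(conversation_text: str) -> list[dict]:
    # Extraction runs on every session, so a small model keeps costs down.
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=EXTRACT_PROMPT,
        messages=[{"role": "user", "content": conversation_text}],
    )
    return parse_json_array(resp.content[0].text)

RECONCILE_PROMPT = """Compare the new facts against the existing memory
list and decide ADD, UPDATE (with target id), or NOOP for each new fact.
Choose UPDATE when an existing memory contradicts it.
Answer with a JSON array only."""

def reconcile(new_facts: list[dict], existing: list[dict]) -> list[dict]:
    # Both lists go to the judge as one JSON payload; ensure_ascii=False
    # keeps non-English facts readable instead of escaping them.
    payload = json.dumps(
        {"new_facts": new_facts, "existing": existing},
        ensure_ascii=False,
    )
    # Judging contradictions is harder than extraction, hence the larger model.
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=RECONCILE_PROMPT,
        messages=[{"role": "user", "content": payload}],
    )
    return parse_json_array(resp.content[0].text)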

Filtering criteria at extraction time

Store everything and your memory soon becomes a junkyard. The following filter criteria have held up in practice.

  • Reusability: is it likely to be used again in future conversations?
  • Stability: is it likely to still be true next week? (Exclude transient emotions and today's weather.)
  • Source clarity: did the user state it directly, or did the model infer it? (Lower the confidence for inferences.)
  • Sensitivity: do not store sensitive information such as health or political orientation without explicit consent.
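One way to apply these criteria is a small gate in front of the memory store. The sketch below assumes the extractor annotates each candidate with hypothetical fields (`topic`, `reusable`, `stable`, `source_type`) in addition to `confidence`. The 0.8 discount and 0.7 threshold are arbitrary starting values, not recommendations from any library.

```python
SENSITIVE_TOPICS = {"health", "political_orientation"}  # illustrative, not exhaustive

def should_store(fact: dict, consented: bool = False,
                 min_confidence: float = 0.7) -> bool:
    # Sensitivity: blocked unless the user explicitly opted in.
    if fact.get("topic") in SENSITIVE_TOPICS and not consented:
        return False
    # Reusability and stability: assumed to be flagged by the extractor.
    if not fact.get("reusable", True) or not fact.get("stable", True):
        return False
    # Source clarity: discount facts the model inferred rather than heard.
    confidence = fact["confidence"]
    if fact.get("source_type") == "inferred":
        confidence *= 0.8
    return confidence >= min_confidence
```

Keeping the gate as plain code rather than another LLM call makes it cheap to run on every candidate and easy to audit later.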

RAG vs Memory — What Is the Difference

Mechanically the two look similar — both fetch text from external storage and put it into context — but their design purposes differ. Confusing them leads to choosing the wrong tool.

| Aspect | RAG | Memory |
|---|---|---|
| Data source | External document corpus | The agent's own interactions |
| Data nature | Static, large, shared | Dynamic, small, per-user |
| Update owner | Batch indexing pipeline | Agent runtime |
| Write operations | Rare (read-heavy) | Frequent (read-write) |
| Conflict handling | Unnecessary | Core challenge (fact updates) |
| Representative question | What does this document say | Who is this user |

In practice you combine the two. Bring domain knowledge in via RAG and user/task history via memory, but when injecting them into context, separate them into clearly labeled sections with their provenance — this reduces model confusion.
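A sketch of that separation, assuming retrieved chunks carry a `source` and memories carry a `session_id` (both field names, and the section labels, are hypothetical):

```python
def render_context_sections(rag_chunks: list[dict], memories: list[dict]) -> str:
    """Render RAG results and user memory as two labeled blocks with provenance."""
    lines = ["[REFERENCE DOCUMENTS: shared corpus, retrieved for this query]"]
    for chunk in rag_chunks:
        lines.append(f"- ({chunk['source']}) {chunk['text']}")
    lines.append("")
    lines.append("[USER MEMORY: learned from this user's past sessions]")
    for mem in memories:
        lines.append(f"- ({mem['session_id']}) {mem['fact']}")
    return "\n".join(lines)

block = render_context_sections(
    [{"source": "docs/payments.md", "text": "Refunds are processed in batches."}],
    [{"session_id": "sess_20260610_a", "fact": "Prefers TypeScript"}],
)
print(block)
```

Tagging every line with where it came from also makes it easier to trace a bad answer back to either the corpus or the memory store.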

Compaction and Summarization Strategies

When the context approaches its limit in a long session, something has to go. The auto-compact feature in Claude Code is a representative implementation; generalized, there are three strategies.

Strategy 1 — sliding window plus summary

Instead of dropping old turns wholesale, replace them with a summary. Explicitly listing what must be preserved is what determines summary quality.

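COMPACT_PROMPT = """Summarize the following conversation history.
You MUST preserve:
1. The user's original goal and constraints
2. Decisions made and their reasons
3. Approaches that were tried and failed (to prevent retries)
4. The exact state of work currently in progress
5. Concrete identifiers: file paths, function names, config values
Small talk and intermediate reasoning may be discarded."""

def compact(history: list[dict], keep_recent: int = 10) -> list[dict]:
    # Split by index so keep_recent=0 and short histories behave correctly
    # (history[:-0] would be empty and history[-0:] would be everything).
    split = max(len(history) - keep_recent, 0)
    old, recent = history[:split], history[split:]
    if not old:
        return history  # nothing old enough to summarize
    # summarize_with_llm is a helper not shown here: it sends COMPACT_PROMPT
    # plus the old turns to a model and returns the summary text.
    summary = summarize_with_llm(COMPACT_PROMPT, old)
    return [{"role": "user", "content": f"[Summary of earlier conversation]\n{summary}"}] + recent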

Strategy 2 — structured notes (note-taking)

Instead of summarizing, have the agent itself maintain a structured note file while working. This is especially effective with the generation of models capable of long autonomous runs. Even if the context is reset, re-reading the note file restores the state.
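A minimal sketch of such a note file, assuming a JSON file with four hypothetical sections that mirror what a summary would need to preserve. The file name and section names are illustrative; a Markdown file works just as well.

```python
import copy
import json
import tempfile
from pathlib import Path

EMPTY_NOTES = {"goal": "", "decisions": [], "failed_attempts": [], "current_state": ""}

def load_notes(path: Path) -> dict:
    """Read the note file, or start from an empty template if it does not exist."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return copy.deepcopy(EMPTY_NOTES)

def record(path: Path, section: str, entry: str) -> None:
    """Append to a list section or overwrite a scalar section, then persist."""
    notes = load_notes(path)
    if isinstance(notes[section], list):
        notes[section].append(entry)
    else:
        notes[section] = entry
    path.write_text(json.dumps(notes, indent=2, ensure_ascii=False), encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    notes_path = Path(tmp) / "agent_notes.json"
    record(notes_path, "goal", "Refactor the payment module")
    record(notes_path, "failed_attempts", "Attempt 1: transaction isolation issue")
    restored = load_notes(notes_path)  # what a fresh context would re-read
```

In an agent, `record` would typically be exposed as a tool so the model decides when a decision or failure is worth writing down.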

Strategy 3 — subagent isolation

Delegate bulky exploration work (codebase search, document research) to a subagent, with the main agent receiving only the compressed result. This cuts off context pollution at the source.

| Strategy | Pros | Cons | Best fit |
|---|---|---|---|
| Summary replacement | Simple to implement | Loses detail | General conversational agents |
| Structured notes | Minimal loss, auditable | Note management overhead | Long-running coding agents |
| Subagents | Blocks pollution | Added latency and cost | Large-scale exploration |

A Memory Schema Example

Pile memory up as free text and updating and conflict resolution become impossible. You need a schema. Below is an example memory record in JSON, usable as a practical starting point for one.

{
  "memory_id": "mem_01HXYZ",
  "subject": "user:fjvbn2003",
  "category": "preference",
  "fact": "Wants code review comments in Korean and code comments in English",
  "confidence": 0.92,
  "source": {
    "type": "explicit_statement",
    "session_id": "sess_20260610_a",
    "turn": 14
  },
  "created_at": "2026-06-10T09:32:00Z",
  "updated_at": "2026-06-10T09:32:00Z",
  "expires_at": null,
  "supersedes": null,
  "embedding_ref": "vec_8f3a",
  "access_count": 7,
  "last_accessed_at": "2026-06-12T01:10:00Z"
}

Key design points are as follows.

  • The source field distinguishes explicit statements from inferences. Essential when investigating memory poisoning incidents.
  • The supersedes field points to the previous memory id on update, preserving history. Use soft delete for deletions.
  • The access_count and last_accessed_at fields feed retrieval ranking and decay (forgetting) policies.
  • The expires_at field is for time-bound facts like "moving house next month."
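The update and expiry rules can be sketched on top of a trimmed-down version of the record. The `deleted_at` field, the in-memory dict store, and the function names are assumptions for illustration, and the PostgreSQL version values are made-up example data.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Memory:
    memory_id: str
    fact: str
    source_type: str
    created_at: datetime
    expires_at: Optional[datetime] = None
    supersedes: Optional[str] = None
    deleted_at: Optional[datetime] = None  # soft-delete marker

def supersede(store: dict, old_id: str, new_id: str, new_fact: str,
              source_type: str, now: datetime) -> Memory:
    """Soft-delete the old memory and add a new one that points back to it."""
    store[old_id] = replace(store[old_id], deleted_at=now)
    new = Memory(memory_id=new_id, fact=new_fact, source_type=source_type,
                 created_at=now, supersedes=old_id)
    store[new_id] = new
    return new

def is_active(m: Memory, now: datetime) -> bool:
    """A memory is retrievable if it is neither soft-deleted nor expired."""
    if m.deleted_at is not None:
        return False
    return m.expires_at is None or now < m.expires_at

t0 = datetime(2026, 6, 10, tzinfo=timezone.utc)
t1 = datetime(2026, 6, 12, tzinfo=timezone.utc)
store = {"mem_1": Memory("mem_1", "Project uses PostgreSQL 15", "explicit_statement", t0)}
new = supersede(store, "mem_1", "mem_2", "Project uses PostgreSQL 16",
                "explicit_statement", t1)
```

Because the old row is kept, following `supersedes` backward reconstructs how a fact changed over time.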

Persisting State Across Sessions — Implementation Pattern

The last piece of the puzzle is tying all of this together into an agent that crosses session boundaries. The core consists of two hooks: assembling context from memory at session start, and writing new memories at session end.

class MemoryAwareAgent:
    def __init__(self, store, user_id: str):
        self.store = store
        self.user_id = user_id

    def start_session(self, first_message: str) -> str:
        profile = self.store.get_semantic(self.user_id, limit=20)
        episodes = self.store.get_recent_episodes(self.user_id, limit=3)
        relevant = self.store.search(self.user_id, first_message, k=5)

        memory_block = render_memory_block(profile, episodes, relevant)
        return (
            "The following is what is remembered about this user.\n"
            "If anything is stale or contradicts the current conversation, "
            "prefer the conversation.\n\n"
            + memory_block
        )

    def end_session(self, history: list[dict]) -> None:
        episode = summarize_episode(history)
        self.store.add_episode(self.user_id, episode)

        facts = extract_facts(render_history(history))
        existing = self.store.get_semantic(self.user_id, limit=100)
        for op in reconcile(facts, existing):
            self.store.apply(self.user_id, op)

Two easily missed details deserve emphasis here.

First, when injecting the memory block you must include the instruction "prefer the current conversation when there is a contradiction." Without this one line, you get incidents where the model trusts stale memory over what the user just said.

Second, the end_session hook must run asynchronously. Fact extraction and reconciliation involve two LLM calls, so handling them synchronously makes the user wait several seconds at session end.
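One simple way to take the hook off the user's critical path is a background thread pool, sketched below. The pool size and the helper names are assumptions. A production service would more likely hand the work to a durable job queue so that writes survive a process restart.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Small dedicated pool so memory writes never compete with request handling.
_memory_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="memory-writer")

def _log_failure(fut: Future) -> None:
    # Background errors are otherwise silent, so surface them explicitly.
    exc = fut.exception()
    if exc is not None:
        print(f"memory write failed: {exc!r}")

def end_session_in_background(agent, history: list[dict]) -> Future:
    """Schedule agent.end_session(history) and return immediately."""
    # Pass a copy so later mutation of the live history cannot race the writer.
    fut = _memory_pool.submit(agent.end_session, list(history))
    fut.add_done_callback(_log_failure)
    return fut
```

The caller can ignore the returned future in the request path and only wait on it in tests or during graceful shutdown.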

How to Evaluate a Memory System

"It feels better after wiring it in" is not an evaluation. A memory system is measured along four axes.

1. Extraction quality

Measure the precision and recall of the extractor on a manually labeled conversation set. You must look at both "facts that should have been remembered but were missed" (recall) and "noise that should not have been stored but was" (precision).
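As a sketch, both numbers fall out of a set comparison once extracted and labeled facts are normalized to comparable keys. That normalization step is assumed here and is usually the hard part. The fact keys below are made-up examples.

```python
def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision and recall of extracted facts against a hand-labeled gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

extracted = {"prefers_typescript", "uses_postgres_16", "is_hungry_now"}
gold = {"prefers_typescript", "uses_postgres_16", "team_uses_pnpm"}
precision, recall = precision_recall(extracted, gold)
```

Here `is_hungry_now` is stored noise that lowers precision, and the missed `team_uses_pnpm` lowers recall.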

2. Retrieval quality

Measure whether relevant memories appear within the top-k for each query. Use the same methodology as general retrieval evaluation (recall at k, MRR), but verify time decay and update history as separate test cases.
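The two metrics named above can be computed as follows. This is a plain-Python sketch with made-up memory ids; each run pairs the ranked ids returned for one query with the set of ids a labeler marked relevant.

```python
def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant memories that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result (0 when none is found)."""
    total = 0.0
    for ranked_ids, relevant in runs:
        for rank, memory_id in enumerate(ranked_ids, start=1):
            if memory_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

runs = [
    (["m3", "m1", "m7"], {"m1"}),        # first hit at rank 2
    (["m2", "m9"], {"m2", "m4"}),        # first hit at rank 1, m4 never retrieved
]
mrr = mean_reciprocal_rank(runs)
```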

In practice it is common to rank not by similarity alone but by a score that combines recency and usage frequency. The weights themselves become an evaluation target.

import math

def memory_score(m, query_sim: float, now) -> float:
    days = (now - m.last_accessed_at).days
    recency = 0.5 ** (days / 30)         # 30-day half-life
    # Unbounded above, unlike the other two terms; cap or normalize it if
    # heavily used memories start to dominate the ranking.
    frequency = math.log(1 + m.access_count)
    return 0.6 * query_sim + 0.25 * recency + 0.15 * frequency

3. End-to-end task success rate

The most important metric. Script scenarios such as "does the agent apply a preference stated three sessions ago without asking again in a new session" and measure pass rates. Public benchmarks like LongMemEval systematically cover temporal reasoning, multi-session reasoning, and prioritizing updated facts, so they are a good reference for designing your own eval set.

4. Cost and latency

Measure the LLM call cost of the memory pipeline itself, and whether the input tokens added by memory injection are larger or smaller than the tokens saved (no re-explanation needed). Whether memory is a cost saver or a cost adder cannot be known before measuring.

Example evaluation scenario (multi-session probe)

Session 1: user mentions "our team only uses pnpm"
Session 2: (unrelated conversation)
Session 3: request "add a dependency"
  Pass: uses pnpm add
  Fail: uses npm install, or asks which one to use
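The pass/fail rule for this probe can be automated with a small grader. The sketch below is a heuristic under stated assumptions: it only inspects the agent's final reply as text, and the function name is hypothetical. A stricter harness would inspect the actual tool calls.

```python
import re

def grade_pnpm_probe(final_reply: str) -> bool:
    """Pass only if the reply uses pnpm add and never falls back to npm."""
    uses_pnpm = re.search(r"\bpnpm add\b", final_reply) is not None
    # \b before "npm" prevents a false match inside "pnpm install".
    uses_npm = re.search(r"\bnpm (install|i|add)\b", final_reply) is not None
    return uses_pnpm and not uses_npm
```

A reply that asks which package manager to use contains no `pnpm add` and therefore fails, which matches the scenario's definition.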

Pitfalls and Critical Perspectives

Memory poisoning

The most dangerous failure mode. Once a wrong fact is stored, it gets injected into every subsequent session and reproduces the error. Worse still is malicious poisoning. If the agent "remembers" instructions embedded in third-party content it reads — web pages, emails, documents — you get prompt injection with persistence. Defenses include the following.

  • Trust boundary separation: store direct user statements and information derived from external content at different trust levels
  • Write validation: scan for injection patterns before writing to memory
  • Periodic audits: run LLM audits over the memory store on a schedule (detect contradictions, anomalous instructions)
  • User visibility: a UI where users can view and delete stored memories

Privacy and regulation

Memory is, by nature, user profiling. To honor the GDPR right to erasure, complete per-user deletion must be possible from the design stage of the memory store, and embeddings and backups must be in the deletion scope. For sensitive categories (health, beliefs, sexual orientation), block them by default at the extraction stage and store them only with explicit opt-in.

The paradox of over-remembering

The assumption that injecting more memory is always better is wrong. Injecting low-relevance memories reproduces exactly the context rot we saw earlier, and from the user's perspective it becomes the unsettling experience of "dragging up something I said long ago with no context." The practical recommendation is to set the relevance threshold of the retrieval stage conservatively and cap the injection volume.
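That recommendation amounts to a threshold plus a hard cap, as in the sketch below. The default values of 0.75 and 5 are arbitrary assumptions to be tuned against your own eval set.

```python
def select_for_injection(scored: list[tuple[float, str]],
                         min_score: float = 0.75,
                         max_items: int = 5) -> list[str]:
    """Keep memories at or above the threshold, best first, up to max_items."""
    kept = [(score, text) for score, text in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:max_items]]

candidates = [
    (0.91, "Team only uses pnpm"),
    (0.40, "Mentioned a trip to Busan last year"),
    (0.78, "Prefers TypeScript"),
]
selected = select_for_injection(candidates, max_items=2)
```

Returning an empty list is a valid outcome: when nothing clears the threshold, injecting no memory is better than injecting weak matches.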

The counterargument: "just grow the context"

There is a counterargument that memory systems are unnecessary in the era of million-token contexts. It has merit, but memory remains necessary for three reasons. First, accumulated multi-session history quickly exceeds even a million tokens. Second, cost and latency scale with context length. Third, lost in the middle did not disappear as windows grew. Long context and memory are complements, not substitutes.

A Practical Adoption Guide

You do not need to build all three memory tiers from day one. A staged roadmap is recommended.

  1. Stage 1 — start with compaction: implementing only in-session summary replacement noticeably improves long-session quality.
  2. Stage 2 — episodic memory: store a summary at session end, inject the latest three at session start. The best effect-to-effort ratio.
  3. Stage 3 — semantic memory: the fact extraction and update pipeline. Evaluate existing open source like Supermemory and mem0 first, and build your own only when requirements are unusual.
  4. Stage 4 — evaluation and governance: multi-session probe eval sets, memory view/delete UI, audit pipelines.

At each stage, measure "how much did task success rate improve compared to no memory" and decide whether to proceed to the next stage — this avoids over-engineering.

Closing

If prompt engineering was the art of speaking well, context engineering is the art of designing memory and attention. And it is not a one-off trick but a domain of software engineering with schema design, pipelines, evaluation, and governance. The reason open source like Supermemory is trending is precisely that this problem has become homework shared by every agent builder.

An agent's intelligence comes from model weights, but an agent's usefulness comes in large part from memory design. When you build your next agent, sketch the memory schema before you pick the model.

References