AI Agent Memory & Long-Term Context 2026 — Mem0 / Zep / Letta / Cognee / Graphiti / Anthropic Memory Deep-Dive

Prologue — A 1M Window Did Not Solve Memory

In 2024 we believed "once context windows hit 1M tokens, the memory problem disappears." In 2026 we know that was a lie.

  • Sending 1M tokens every turn explodes latency, explodes cost, and crushes needle-in-haystack performance.
  • Context evaporates with the session. Yesterday's conversation is gone today.
  • Even if a model receives context, there is no guarantee it uses it. "Lost in the middle" is alive and well.

That is why one of the hottest infrastructure categories in 2026 AI engineering is agent memory. Mem0 came out of YC, Zep raised a Series A, and Letta (formerly MemGPT) staked out the "agent OS" position. Anthropic shipped its Memory API in preview in 2025, and OpenAI made Memory a default feature of ChatGPT.

This article walks the full map. We break down the memory hierarchy, examine each major library and API, compare the storage backends, cover the Korean and Japanese movements, and end with explicit recommendations for who should pick what.


Chapter 1 · The 2026 Agent Memory Map — Three Models: Vector / Graph / Episodic

By 2026, agent memory has largely converged on three models.

┌─────────────────────────────────────────────────────────────────┐
│                  The Three Agent Memory Models                  │
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐    │
│  │ Vector       │   │ Graph        │   │ Episodic         │    │
│  │ Memory       │   │ Memory       │   │ Memory           │    │
│  │              │   │              │   │                  │    │
│  │ Embed+search │   │ Entities+    │   │ Sequences of     │    │
│  │ Similarity   │   │ relations    │   │ events           │    │
│  │              │   │ Reasoning    │   │ Time + cause     │    │
│  │              │   │              │   │                  │    │
│  │ Examples     │   │ Examples     │   │ Examples         │    │
│  │ - Mem0       │   │ - Cognee     │   │ - Letta          │    │
│  │ - Verba      │   │ - Graphiti   │   │ - Generative     │    │
│  │ - OpenAI Mem │   │ - Zep(graph) │   │   Agents         │    │
│  │              │   │              │   │ - MemoryBank     │    │
│  └──────────────┘   └──────────────┘   └──────────────────┘    │
│                                                                 │
│        (real-world products are almost always hybrid)           │
└─────────────────────────────────────────────────────────────────┘

Each model has clear strengths and weaknesses.

| Model | Strengths | Weaknesses | Good fit |
|---|---|---|---|
| Vector | Fast retrieval, simple mental model, rich infra | Weak at relational reasoning, no "why" | Chatbot FAQ, document RAG, user preferences |
| Graph | Multi-hop reasoning, explicit facts, updatable | Extraction cost, schema-design burden | CRM, codebases, org charts |
| Episodic | Time + causality, flow of events | Heavy, complex recall algorithms | Character agents, long-running companions, simulation |

In practice, almost every shipped product mixes two or more. Zep is vector + graph + temporal, Letta wraps episodic + semantic + procedural like an operating system, and Mem0 is vector-first with an optional graph mode.


Chapter 2 · Memory Taxonomy — Short-term / Working / Long-term

Before comparing libraries, fix the vocabulary. The hierarchy almost every memory system agrees on in 2026:

┌────────────────────────────────────────────────────┐
│                                                    │
│  Short-term Memory                                 │
│   = the LLM context window itself                  │
│   = every message visible this turn                │
│   = gone when the session ends                     │
│                                                    │
├────────────────────────────────────────────────────┤
│                                                    │
│  Working Memory                                    │
│   = the agent's "scratchpad"                       │
│   = extractions/summaries/plans for current task   │
│   = context + the active pages from external store │
│                                                    │
├────────────────────────────────────────────────────┤
│                                                    │
│  Long-term Memory — 3 flavors                      │
│                                                    │
│   ┌──────────────────────────────────────────┐    │
│   │ Semantic   — "knowledge". Facts, prefs    │    │
│   │   e.g. "the user lives in Korea"         │    │
│   │   e.g. "project X is written in Rust"    │    │
│   └──────────────────────────────────────────┘    │
│                                                    │
│   ┌──────────────────────────────────────────┐    │
│   │ Episodic   — "events". Time + causality  │    │
│   │   e.g. "last Tuesday tried A and failed" │    │
│   │   e.g. "tests passed after refactor of B"│    │
│   └──────────────────────────────────────────┘    │
│                                                    │
│   ┌──────────────────────────────────────────┐    │
│   │ Procedural — "how". Procedures, skills    │    │
│   │   e.g. "how PRs open in this codebase"   │    │
│   │   e.g. "this user's debugging style"     │    │
│   └──────────────────────────────────────────┘    │
│                                                    │
└────────────────────────────────────────────────────┘

This taxonomy is borrowed from cognitive science (Tulving 1972, Squire 1992). When the 2023 Stanford Generative Agents paper applied it to LLM agents, it became the de facto industry standard.

Why the hierarchy matters:

  • Each memory type needs different storage, recall, and forgetting strategies.
  • Semantic memory naturally updates and overwrites; episodic memory only appends.
  • Procedural memory weights by usage frequency; semantic memory weights by trust.
  • Context injection priority differs — procedural near the start, semantic in the middle, episodic at the end.

Keep this hierarchy in mind and the libraries' specialties light up.
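
As a rough illustration of the bullets above, the sketch below keeps the three long-term stores separate: semantic facts overwrite by key, episodic events only append, and procedural entries count usage. The class and method names (`LongTermMemory`, `build_context`, and so on) are hypothetical and not taken from any library, and the injection order simply follows the ordering suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class LongTermMemory:
    """Toy store: one container per long-term memory flavor."""
    semantic: dict[str, str] = field(default_factory=dict)    # key -> latest fact
    episodic: list[str] = field(default_factory=list)         # append-only event log
    procedural: dict[str, int] = field(default_factory=dict)  # procedure -> use count

    def learn_fact(self, key: str, fact: str) -> None:
        # Semantic memory updates in place: the newest value wins.
        self.semantic[key] = fact

    def record_event(self, event: str) -> None:
        # Episodic memory never rewrites history.
        self.episodic.append(event)

    def use_procedure(self, name: str) -> None:
        # Procedural memory is weighted by how often it is used.
        self.procedural[name] = self.procedural.get(name, 0) + 1

    def build_context(self, max_events: int = 3) -> str:
        # Procedural first, semantic in the middle, recent episodes last.
        procs = sorted(self.procedural, key=self.procedural.get, reverse=True)
        lines = [f"[how] {p}" for p in procs]
        lines += [f"[fact] {f}" for f in self.semantic.values()]
        lines += [f"[event] {e}" for e in self.episodic[-max_events:]]
        return "\n".join(lines)


mem = LongTermMemory()
mem.learn_fact("employer", "User works at Acme")
mem.learn_fact("employer", "User works at Bravo")   # overwrites Acme
mem.record_event("Tried approach A, tests failed")
mem.record_event("Refactored B, tests passed")
mem.use_procedure("Open PRs against the develop branch")
print(mem.build_context())
```

A real system would add trust scores, decay, and persistence; the point here is only that the three flavors want different write semantics.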


Chapter 3 · Mem0

Mem0 graduated from YC in 2024 and rapidly became the de facto standard. A clean SDK, sensible defaults, and the "memory in 5 minutes" pitch landed exactly where the market lived.

Core model

Mem0's mental model is simple:

  1. Use an LLM to extract valuable facts from conversation.
  2. Embed the facts into a vector store (an LLM merges duplicates).
  3. For new messages, recall related facts via similarity search.
  4. Inject the recalled facts into context and call the model.

from mem0 import Memory

m = Memory()

# Auto-extract facts from a user message
m.add("My name is Youngju and I use Postgres", user_id="user-42")

# Recall next turn
results = m.search("What DB do I use?", user_id="user-42")
# -> [{"memory": "User uses Postgres", "score": 0.91, ...}]

Internally, Mem0's LLM cost sits on the write path, not the read path:

  • Extraction is an LLM call: it pulls facts and decides updates/deletes when they conflict with existing ones.
  • Retrieval is embedding similarity, not an LLM call.

The "extraction call" cost is both Mem0's biggest drawback and its biggest strength. It costs more, but you get a low-noise memory store.
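
The four-step loop described above can be sketched without any external service. This is a toy under stated assumptions, not Mem0's implementation: `extract_facts` stands in for the LLM extraction call with a naive clause splitter, `embed` stands in for an embedding model with a bag of words, and `ToyMemory` is a hypothetical name.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: a bag of lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_facts(message: str) -> list[str]:
    # Stand-in for the LLM extraction call: one "fact" per clause.
    return [c.strip() for c in re.split(r"\band\b|[.;]", message) if c.strip()]


class ToyMemory:
    def __init__(self, dedupe_threshold: float = 0.9):
        self.facts: list[tuple[str, Counter]] = []
        self.dedupe_threshold = dedupe_threshold

    def add(self, message: str) -> None:
        for fact in extract_facts(message):
            vec = embed(fact)
            # Skip near-duplicates instead of storing them twice.
            if any(cosine(vec, old) >= self.dedupe_threshold for _, old in self.facts):
                continue
            self.facts.append((fact, vec))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = ToyMemory()
memory.add("My name is Youngju and I use Postgres")
memory.add("I use Postgres")  # duplicate, dropped by the threshold check
top = memory.search("Which database do I use?")
# Step 4: inject the recalled facts into the prompt before calling the model.
prompt = f"Known facts: {top}\nUser: Which database do I use?"
```

The write path does all the expensive work (extraction, deduplication); the read path is a plain similarity lookup, which mirrors the cost split described above.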

Graph mode and multi-agent

In mid-2025 Mem0 shipped Graph Memory GA. It stores entities and relations alongside vectors using Neo4j as the backend. Useful when you need to reason over user — project — tool relationships.

Mem0 also supports multi-actor memory — separating memories not only by user_id but by agent_id and run_id. In a multi-agent system, each agent gets its own memory.
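
To make the scoping idea concrete, here is a minimal sketch of filtering memories by user, agent, and run. It does not call Mem0; `ScopedMemory` and `visible` are hypothetical names, and the rule that an unset `agent_id` or `run_id` means "shared" is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopedMemory:
    text: str
    user_id: str
    agent_id: Optional[str] = None   # None means "shared across agents"
    run_id: Optional[str] = None     # None means "not tied to one run"


def visible(memories, user_id, agent_id=None, run_id=None):
    """Return memory texts for this user that match every scope key supplied."""
    out = []
    for m in memories:
        if m.user_id != user_id:
            continue
        if agent_id is not None and m.agent_id not in (None, agent_id):
            continue
        if run_id is not None and m.run_id not in (None, run_id):
            continue
        out.append(m.text)
    return out


store = [
    ScopedMemory("Prefers concise answers", user_id="user-42"),
    ScopedMemory("Owes a draft of the Q3 report", user_id="user-42", agent_id="writer"),
    ScopedMemory("Flaky test in module B", user_id="user-42", agent_id="coder", run_id="run-7"),
]
writer_view = visible(store, "user-42", agent_id="writer")
coder_view = visible(store, "user-42", agent_id="coder", run_id="run-7")
```

Each agent sees the user-level memory plus its own entries, and nothing that belongs to a sibling agent.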

Where Mem0 fits

  • Best at: chatbots, assistants, preference-based recommendations
  • Worst at: complex codebase reasoning, long-running simulations, academic research
  • One-liner: "If you are adding memory for the first time in 2026, 90 percent of the time you start with Mem0."

Chapter 4 · Zep — Hybrid Graph + Vector (Series A)

Zep raised a Series A in 2024 and claimed the enterprise memory category. Its differentiator is a graph + vector + temporal hybrid.

The core component — Graphiti

Zep's engine is Graphiti, an open-source knowledge-graph framework. Graphiti's job:

  • Extract entities and relations from conversations and documents.
  • Insert the facts into a bi-temporal KG: every fact carries validity timestamps (valid_at / invalid_at in Graphiti).
  • When a contradicting fact arrives for the same entity, the new fact becomes valid and the old one is marked invalid (not deleted).
  • Recall combines graph traversal + vector similarity + temporal filtering.

from zep_python.client import Zep
from zep_python.types import Message

client = Zep(api_key="...")
client.user.add(user_id="user-42", first_name="Youngju")

# A session ties messages to a user. Method and field names below follow
# the zep_python 2.x SDK as best understood; check them against your version.
client.memory.add_session(session_id="sess-1", user_id="user-42")

# Add a message. Zep integrates it into the KG automatically.
client.memory.add(
    session_id="sess-1",
    messages=[Message(role_type="user", content="I moved from Acme to Bravo")],
)

# Recall: graph facts + related messages
mem = client.memory.get(session_id="sess-1")
# -> facts: ["User works at Bravo (previously: Acme)"]

Why temporal reasoning matters

The weakness of traditional vector memory: even when new facts arrive, the old facts survive. "User works at Acme" and "User works at Bravo" both come back from a search and confuse the model.

Zep/Graphiti solve this with a bi-temporal model. Every fact records both when the event happened and when it entered the DB. Recall naturally filters to "facts valid right now."
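
A minimal sketch of the bi-temporal idea, assuming facts arrive in chronological order. It is not Zep's or Graphiti's code: `Fact`, `TemporalFacts`, and the field names are illustrative. Each fact keeps both a world-time validity window and the time the store learned about it, and a superseded fact is closed rather than deleted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # when the fact became true in the world
    created_at: datetime                   # when the store learned about it
    invalid_at: Optional[datetime] = None  # set when a newer fact supersedes it


class TemporalFacts:
    def __init__(self):
        self.facts: list[Fact] = []

    def assert_fact(self, subject, predicate, obj, valid_at, now):
        # Close any currently valid fact for the same (subject, predicate)
        # instead of deleting it, so history stays queryable.
        for f in self.facts:
            if (f.subject, f.predicate) == (subject, predicate) and f.invalid_at is None:
                f.invalid_at = valid_at
        self.facts.append(Fact(subject, predicate, obj, valid_at, created_at=now))

    def as_of(self, subject, predicate, when):
        # Return the value that was valid at the given moment, if any.
        for f in self.facts:
            ended = f.invalid_at is not None and f.invalid_at <= when
            if (f.subject, f.predicate) == (subject, predicate) and f.valid_at <= when and not ended:
                return f.obj
        return None


kg = TemporalFacts()
kg.assert_fact("user", "works_at", "Acme", datetime(2024, 1, 1), now=datetime(2024, 1, 5))
kg.assert_fact("user", "works_at", "Bravo", datetime(2026, 5, 1), now=datetime(2026, 5, 15))

current = kg.as_of("user", "works_at", datetime(2026, 6, 1))
past = kg.as_of("user", "works_at", datetime(2025, 1, 1))
```

Because the Acme fact is closed rather than removed, the store can answer both "where does the user work now?" and "where did the user work in 2025?" without returning both employers to the model at once.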

Where Zep fits

  • Best at: CRM, sales assistants, long-running operational agents, regulated industries
  • Worst at: simple chatbots, throwaway prototypes
  • One-liner: "Pick Zep when 'consistency of facts' matters in the enterprise."

Chapter 5 · Letta (formerly MemGPT) — The Agent OS Approach

Letta (formerly MemGPT) started at UC Berkeley and rebranded as a company in 2024. It is shaped differently from the other libraries — not a memory library but a memory-centric agent runtime.

The core idea — Memory as OS

Letta's metaphor is the virtual memory of an operating system.

  • LLM context window = RAM
  • External memory (vector + KV store) = disk
  • The LLM itself pages memory in and out using tool calls.

A Letta agent works with four memory tiers. The first two are always in the context window; the last two live outside it and are reached through search:

  • Core memory: user persona + self persona (pinned, visible every turn)
  • Conversation buffer: recent messages (rolling)
  • Archival memory: external vector store (accessed by search)
  • Recall memory: every past message (accessed by search)

The agent edits its own memory through tools.

from letta_client import Letta

client = Letta(base_url="http://localhost:8283")
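agent = client.agents.create(
    name="my-agent",
    # Model and embedding handles are examples; use whatever your server has configured.
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "human", "value": "User is an ML engineer living in Korea"},
        {"label": "persona", "value": "Helpful AI colleague"},
    ],
)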

# Chat. The agent updates core memory at its own discretion.
client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "I just moved to Bravo"}],
)
# -> The agent calls core_memory_replace and updates the "human" block
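
The RAM/disk metaphor can be sketched in plain Python. This is a toy, not Letta's runtime: `PagedAgentMemory`, `archival_insert`, and `archival_search` are hypothetical names, the character limit stands in for a token budget, and substring matching stands in for vector search. Only `core_memory_replace` echoes the tool name mentioned above.

```python
class PagedAgentMemory:
    """Toy model of the RAM/disk split: a small pinned core plus a searchable archive."""

    def __init__(self, core_limit_chars: int = 200):
        self.core: dict[str, str] = {}      # always injected into the prompt
        self.archive: list[str] = []        # only reachable through search
        self.core_limit_chars = core_limit_chars

    # The methods below play the role of tools the model can call.
    def core_memory_replace(self, label: str, old: str, new: str) -> None:
        updated = self.core[label].replace(old, new)
        if len(updated) > self.core_limit_chars:
            raise ValueError("core block is full; move details to the archive")
        self.core[label] = updated

    def archival_insert(self, text: str) -> None:
        self.archive.append(text)

    def archival_search(self, query: str) -> list[str]:
        # Substring match stands in for vector search.
        return [t for t in self.archive if query.lower() in t.lower()]

    def render_prompt(self) -> str:
        # Only core memory is rendered; the archive stays "on disk".
        return "\n".join(f"<{label}>{value}</{label}>" for label, value in self.core.items())


mem = PagedAgentMemory()
mem.core["human"] = "User is an ML engineer at Acme"
mem.core_memory_replace("human", "Acme", "Bravo")   # edit the pinned block in place
mem.archival_insert("2026-05-15: user announced the move from Acme to Bravo")
hits = mem.archival_search("acme")
```

The size limit is the design pressure that matters: because the pinned block is small, the agent has to decide what stays in view and what gets paged out.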

What sets Letta apart

  • State lives on the server. A Letta agent is not a stateless call — it exists with persistent state on the Letta server.
  • Inter-agent messaging — agents can send messages to each other. Multi-agent flows feel natural.
  • Self-editing memory — agents directly edit their own persona and the user profile. For better and for worse.

Where Letta fits

  • Best at: always-on assistants, multi-agent collaboration, character and companion agents
  • Worst at: transient RAG, stateless API services
  • One-liner: "Pick Letta when you want an agent OS that treats memory as a first-class citizen."

Chapter 6 · Cognee — Automatic Knowledge Graph Generation

Cognee is an open-source project that appeared in 2024, focused on "data to KG, automatically." It is less a memory library and more a KG builder for agents.

Pipeline — ECL (Extract, Cognify, Load)

ECL mimics ETL:

  1. Extract: pull in documents, conversations, or code.
  2. Cognify: an LLM extracts entities, relations, and ontology. An abstraction called DataPoint lives here.
  3. Load: write to a graph DB (Neo4j, Kuzu, NetworkX) and a vector DB (LanceDB, Qdrant, ...).

import asyncio
import cognee

async def main():
    await cognee.add("Project X is written in Rust and depends on Y")
    await cognee.cognify()

    # Depending on the cognee version, query_type may need to be the
    # SearchType enum (SearchType.GRAPH_COMPLETION) rather than a string.
    results = await cognee.search(
        query_type="GRAPH_COMPLETION",
        query_text="What language is Project X written in?",
    )
    print(results)  # expected to mention "Rust"

asyncio.run(main())

How it differs from other libraries

  • Mem0 recalls a single fact-line.
  • Zep recalls facts + related messages.
  • Cognee recalls graph patterns — for example, "Y-typed nodes connected to node X."
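
To show what "recalling a graph pattern" means in practice, here is a small sketch with an in-memory typed graph. It does not use Cognee; `TinyGraph` and `neighbors_of_type` are hypothetical names, and the pattern is the one quoted above: nodes of a given type connected to a given node.

```python
from collections import defaultdict


class TinyGraph:
    def __init__(self):
        self.node_type: dict[str, str] = {}
        self.edges = defaultdict(list)   # source -> [(relation, target)]

    def add_node(self, name: str, type_: str) -> None:
        self.node_type[name] = type_

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges[source].append((relation, target))

    def neighbors_of_type(self, source: str, type_: str) -> list[tuple[str, str]]:
        """Pattern: (source)-[any relation]->(node of the given type)."""
        return [(rel, dst) for rel, dst in self.edges[source] if self.node_type.get(dst) == type_]


g = TinyGraph()
g.add_node("Project X", "Project")
g.add_node("Rust", "Language")
g.add_node("Y", "Library")
g.add_edge("Project X", "written_in", "Rust")
g.add_edge("Project X", "depends_on", "Y")

languages = g.neighbors_of_type("Project X", "Language")
libraries = g.neighbors_of_type("Project X", "Library")
```

A vector store would return the sentence that mentions Rust; the graph returns the typed relation itself, which is what makes multi-hop questions tractable.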

Where Cognee fits

  • Best at: document KGs, codebase KGs, domain knowledge graphs
  • Worst at: fast chat memory, user preferences
  • One-liner: "If KG is the shape of memory, pick Cognee."

Chapter 7 · Anthropic Memory API (2025 Preview)

Anthropic shipped its Memory API in 2025 preview. Its central stance: memory should be server-side state.

Model

  • Conversation state is hosted by Anthropic. The client only sends a conversation_id.
  • When context-window pressure builds, Claude performs automatic compaction (the Claude Code compact feature, now exposed through the API).
  • Beyond that, explicit memory tools are provided (memory_save, memory_recall, memory_list, and friends) so Claude can write memory at its own discretion.

import anthropic

client = anthropic.Anthropic()

# Shape only: conversation_id and memory mirror the description above.
# Check the current Messages API reference for the exact parameter names
# before relying on them.
resp = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=1024,             # required by messages.create
    conversation_id="conv-42",   # server-side state identifier
    memory={"enabled": True, "scope": "user-42"},
    messages=[{"role": "user", "content": "Tell me again about that book I mentioned yesterday"}],
)
# Anthropic references the past of conv-42 to compose the reply

Why this is a big shift

  • Until now, all memory was the client's responsibility: collecting context, compacting it, sending it back.
  • Server-side memory hands that burden to the model provider. And the model knows its own memory best — compaction aligned with the model's internal representation becomes possible.
  • Downside: vendor lock-in. If memory lives on Anthropic's servers, you cannot switch models.
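
For contrast, this is roughly what client-side compaction looks like when the application has to do it. The sketch is a generic illustration, not Anthropic's algorithm: `estimate_tokens` is a crude heuristic, `summarize` is a placeholder for a model-written summary, and all function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly four characters per token.
    return max(1, len(text) // 4)


def summarize(messages: list[dict]) -> str:
    # Placeholder for a model-written summary.
    return "Summary of earlier turns: " + " | ".join(m["content"][:40] for m in messages)


def compact(history: list[dict], budget: int, keep_recent: int = 2) -> list[dict]:
    """Fold older turns into one summary message once the budget is exceeded."""
    total = sum(estimate_tokens(m["content"]) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "user", "content": summarize(older)}] + recent


history = [
    {"role": "user", "content": "I am reading a long book about distributed systems. " * 5},
    {"role": "assistant", "content": "Noted. Which chapter are you on? " * 5},
    {"role": "user", "content": "Chapter 4, on consensus."},
    {"role": "assistant", "content": "Raft or Paxos?"},
]
compacted = compact(history, budget=50)
```

Every choice in this loop (when to trigger, what to keep verbatim, how to summarize) is something the client owns today, which is the burden server-side memory proposes to take over.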

Where Anthropic Memory fits

  • Best at: Claude-deep products, fast time-to-market, client simplification
  • Worst at: multi-model routing, self-host requirements, memory data ownership
  • One-liner: "Pick Anthropic Memory API if you are Claude-only and shipping fast."

Chapter 8 · OpenAI Memory — The Consumer ChatGPT Feature

OpenAI Memory is a different category. It is not a developer SDK but a feature inside consumer ChatGPT.

How it works

  • When users chat, GPT auto-stores valuable facts (GA in early 2024).
  • A "Memory updated" indicator surfaces so the user notices.
  • Users can review and delete entries in Settings.
  • In 2025 the "Improved Memory" update extended the model to reference all past conversations.

What it means for developers

OpenAI Memory does not exist in the API. So:

  • The ChatGPT consumer experience made memory the default expectation. That trained users to demand "my AI should remember me."
  • To replicate it in your own product, you write it yourself or use Mem0 / Zep.
  • That said, the OpenAI Assistants API offers building blocks for something similar via threads + files + vector stores.

In this category OpenAI's role is market educator. They taught the mainstream what "AI memory" is, and that fueled the growth of the next category (Mem0, Zep, Letta).


Chapter 9 · Graphiti — Zep's KG Framework (Open Source)

Graphiti is the KG framework Zep open-sourced in 2024. It powers the Zep product but is usable standalone.

Core design

Graphiti's tagline: "Temporal Knowledge Graphs for AI Agents."

  • Every fact (edge) carries validity timestamps: valid_at / invalid_at.
  • When new information arrives, an LLM judges conflicts with existing edges and invalidates them.
  • Recall is hybrid search combining time + graph + vector.

import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti

async def main():
    g = Graphiti("neo4j://localhost:7687", "neo4j", "password")
    await g.build_indices_and_constraints()

    await g.add_episode(
        name="meeting-2026-05-15",
        episode_body="Youngju moved to Bravo. Title: Staff Engineer.",
        source_description="meeting notes",
        reference_time=datetime.now(timezone.utc),
    )

    # search returns edges; each carries the fact text and its validity time
    results = await g.search("where does Youngju work?")
    for edge in results:
        print(edge.fact, edge.valid_at)
    # -> e.g. "Youngju works at Bravo", valid from 2026-05-15

    await g.close()

asyncio.run(main())
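
One common way to merge rankings from different retrievers is reciprocal rank fusion (RRF). The sketch below shows the technique in isolation, with a temporal filter applied before fusing; whether Graphiti or Zep combine results exactly this way is not stated in the text above, and the hit lists are made-up examples.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


vector_hits = ["works_at Bravo", "title Staff Engineer", "works_at Acme"]
graph_hits = ["works_at Bravo", "works_at Acme"]
# Temporal filter: drop facts that are no longer valid before fusing.
invalidated = {"works_at Acme"}
rankings = [[h for h in hits if h not in invalidated] for hits in (vector_hits, graph_hits)]

fused = reciprocal_rank_fusion(rankings)
```

RRF needs only ranks, not comparable scores, which is why it is a popular default when the inputs come from unlike systems such as a vector index and a graph traversal.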

How it differs from other KG frameworks

  • LangChain's KG memory — no time, weak conflict handling.
  • Neo4j LLM Graph Builder — does extraction but offers no recall abstraction.
  • LlamaIndex KG Index — static, hard to update or resolve conflicts.

Graphiti's differentiator is that time + conflict resolution are first-class citizens.

Where Graphiti fits

  • Best at: Zep backends, custom KG memory builds, domains where temporal reasoning is core
  • Worst at: simple RAG without a KG
  • One-liner: "Pick Graphiti if you want KG memory but not Zep Cloud."

Chapter 10 · Verba (Weaviate) / Cody Memories (Sourcegraph) / MemPress

These three are specialized memory systems for narrow domains.

Verba (Weaviate)

  • Weaviate's open-source RAG chatbot framework.
  • Vector-memory-centric, deeply integrated with Weaviate as the backend.
  • More of a full-stack "RAG + chat + memory" demo than a memory product on its own.
  • Best when: you need to stand up an internal-document chatbot on top of Weaviate quickly.

Cody Memories (Sourcegraph)

  • A memory feature inside Sourcegraph's code assistant Cody.
  • Stores codebase context + user coding preferences + project conventions.
  • Not a general memory SDK — specialized for the code domain. Facts like "this user prefers tabs and snake_case."
  • Best when: a code assistant must remember "who I am in this codebase."

MemPress

  • A 2025 library focused on agent memory compression.
  • Core idea: when memory accumulates, an LLM produces a hierarchical summary to accelerate retrieval.
  • A tree structure: raw messages to daily summaries to weekly summaries to monthly summaries.
  • Recall walks top-down, pulling more detail as it descends.
  • Best when: long-running agents whose memory grows beyond tens of thousands of entries.
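
The summary-tree idea can be sketched as follows. This is not MemPress's API (none is shown above): `SummaryNode` and `recall` are hypothetical names, the summaries are written by hand where an LLM would normally produce them, and substring matching stands in for a relevance model.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryNode:
    summary: str
    children: list["SummaryNode"] = field(default_factory=list)


def recall(node: SummaryNode, query: str, path=None) -> list[str]:
    """Walk top-down, descending only into children whose summary mentions the query."""
    path = (path or []) + [node.summary]
    matches = [c for c in node.children if query.lower() in c.summary.lower()]
    if not matches:
        return path            # deepest level that still matches
    return recall(matches[0], query, path)


raw_1 = SummaryNode("Mon 09:12 user: the Postgres migration failed on the orders table")
raw_2 = SummaryNode("Mon 15:40 user: lunch options near the office")
day = SummaryNode("Monday: Postgres migration failure; lunch chat", [raw_1, raw_2])
week = SummaryNode("Week 20: Postgres migration work", [day])
month = SummaryNode("May: database migration project (Postgres)", [week])

trail = recall(month, "postgres")
```

Recall touches one node per level instead of scanning every raw message, so lookup cost grows with tree depth rather than with total memory size.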

Chapter 11 · Generative Agents (Stanford 2023) — The Academic Inspiration

Almost every commercial memory system owes a design debt to the 2023 Stanford paper "Generative Agents: Interactive Simulacra of Human Behavior."

The experiment

  • 25 AI characters live in a small town (Smallville).
  • Each has a persona, occupation, relationships, and a schedule.
  • Without user input, the characters interact and pass a day.
  • Result: spontaneous social behavior emerges — a Valentine's party self-organizes, an election campaign for mayor unfolds.

The memory architecture

The paper proposes three components:

  1. Memory Stream — every observation is recorded in natural language (episodic).
  2. Reflection — periodically, an LLM reads the memory and produces higher-order reasoning ("I do not have many friends"). The reflection goes back into memory.
  3. Planning — a daily plan is built from memory and reflections.

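Recall ranks memories by a weighted sum of importance, recency, and relevance:

score = α·importance + β·recency + γ·relevance

Here α, β, and γ are tunable weights, and each term is normalized to the same range before summing. Variants of this formula still underlie recall in many memory systems.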
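
A minimal sketch of this scoring rule, under stated assumptions: recency is modeled as exponential decay over hours since last access, each component is min-max normalized, and all weights default to 1. The decay factor of 0.995, the `MemoryItem` fields, and the sample values are illustrative choices, not figures confirmed by the text above.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    importance: float          # e.g. a 1-10 rating assigned when the memory is written
    hours_since_access: float
    relevance: float           # e.g. cosine similarity to the current query


def min_max(values: list[float]) -> list[float]:
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank(items, w_importance=1.0, w_recency=1.0, w_relevance=1.0, decay=0.995):
    """Score = weighted sum of normalized importance, recency, and relevance."""
    recency = min_max([decay ** it.hours_since_access for it in items])
    importance = min_max([it.importance for it in items])
    relevance = min_max([it.relevance for it in items])
    scored = [
        (w_importance * i + w_recency * r + w_relevance * s, it.text)
        for it, i, r, s in zip(items, importance, recency, relevance)
    ]
    return sorted(scored, reverse=True)


items = [
    MemoryItem("Ate breakfast", importance=1, hours_since_access=2, relevance=0.10),
    MemoryItem("Isabella is planning a Valentine's party", importance=8, hours_since_access=30, relevance=0.90),
    MemoryItem("Started a new research project", importance=9, hours_since_access=400, relevance=0.20),
]
ranked = rank(items)
```

The breakfast memory is the most recent but loses on the other two axes, while the party memory is strong on all three, so it ranks first. Changing the weights is how a system trades freshness against significance.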

The legacy of Generative Agents

  • The reflection concept — Mem0's extraction, Zep's fact consistency, and Letta's self-edit all derive from it.
  • Importance scoring — the idea that not every fact is equal became standardized.
  • Episodic-first — the model that semantic memory derives from episodic memory.

A rare case of a single academic paper writing the design language of an entire product category.


Chapter 12 · Storage Backends — pgvector / Qdrant / Neo4j / Memgraph / Kuzu

Memory libraries eventually have to write data somewhere. The backends that show up most in 2026:

Vector backends

| Backend | Characteristics | Good fit |
|---|---|---|
| Postgres + pgvector | Low operational burden, full SQL, transactions | Teams already on Postgres, memory + metadata joins |
| Qdrant | Written in Rust and fast, strong filtering, self-host friendly | 100M+ vectors, complex payload filters |
| Pinecone | Managed, fast to adopt, solid SLA | When you do not want to run infra |
| Weaviate | Multimodal, GraphQL, modular | Non-text modalities, custom transform pipelines |
| LanceDB | Embedded, Arrow-based, local-friendly | Notebook and edge agents |
| Chroma | Local-friendly, simple DX | Prototypes, demos |

Graph backends

| Backend | Characteristics | Good fit |
|---|---|---|
| Neo4j | KG standard, rich Cypher | Enterprise, large KGs |
| Memgraph | Written in C++ and fast, Neo4j-compatible | Real-time KG, streaming |
| Kuzu | Embedded, OLAP columnar graph | Analytical KGs, notebooks |
| NetworkX | Pure Python, in-memory | Prototypes, small graphs |
| AWS Neptune | Managed, Gremlin | AWS ecosystem |

A 2026 practical guide

  • Start with a single Postgres + pgvector DB. Keeping memory and app data together simplifies operations.
  • When graph becomes necessary, start with Kuzu (embedded). Neo4j has a real operational cost.
  • If you need both vector and graph persistently, Postgres + pgvector + AGE (the Apache AGE graph extension) is also a single-DB candidate.
  • If recall latency becomes a problem, split out to Qdrant or Pinecone.

Chapter 13 · Korea / Japan — Upstage, NAVER HCX, Sakana, PFN

Korea

  • Upstage — integrates a RAG/memory layer over its own Solar model. Strong on Korean semantic-search quality. Targets enterprise assistant scenarios.
  • NAVER HyperCLOVA X — HCX ships with built-in memory features. Differentiator: consumer memory that connects to NAVER ecosystem data (Blog, Cafe, Shopping).
  • Kakao — the AI assistant inside KakaoTalk runs user memory. Messenger context is rich, which makes automatic memory extraction easier.
  • General trend: Korean teams often skip a vanilla Mem0/Zep and combine embedding models that handle Korean well (Upstage, Cohere multilingual) directly with a custom memory layer.

Japan

  • Sakana AI: famous for "evolutionary model merging," but from 2025 also publishing agent memory research. Interested in approaches that integrate memory into the model itself.
  • Preferred Networks (PFN): research combining long-context PLaMo models with external memory.
  • Rinna / ELYZA: teams building Japanese-specialized models. They tend to wrap Mem0 or Zep with Japanese embeddings rather than building memory from scratch.
  • General trend: Japan has a large character-and-companion-agent market, so demand for Letta-style persistent persona memory is strong. Game and entertainment integrations are more common than in the US.

Chapter 14 · Who Should Pick What

Recommendations by scenario, compressed into one table.

| Scenario | First pick | Second pick | Avoid |
|---|---|---|---|
| Quick user-preference memory (chatbot, FAQ) | Mem0 | OpenAI Assistants threads | Building a KG |
| Enterprise, fact consistency (CRM, sales) | Zep | Graphiti directly + Neo4j | Pure vector + auto-forget |
| Multi-agent, persistent persona | Letta | Mem0 multi-actor mode | Stateless API + simple client memory |
| Code-assistant memory | Cody Memories (SaaS) or DIY + Mem0 | Cognee (for codebase KG) | Generic chat memory as-is |
| Domain KG builds (pharma, legal) | Cognee | Graphiti | Trying to solve it with vectors only |
| Internal-doc RAG + memory | Verba | Mem0 + custom RAG | RAG without memory |
| Claude-centric product, fast launch | Anthropic Memory API | Mem0 + Claude | Hand-rolled context management |
| Long-running simulation, character agents | Letta + Generative Agents ideas | MemPress (compression) | Just enlarging the window |
| Research, experiments | Generative Agents codebase + custom | LlamaIndex Memory + custom KG | SaaS memory (black box) |
| Memory grew so large it is now slow | MemPress (summary tree) | Zep (built-in summaries) | Naive TTL expiry |

Decision tree

Start
  ├─ Does "why / when / what changed" matter in memory?
  │     YES -> graph/temporal memory needed -> Zep or Cognee+Graphiti
  │     NO  -> next
  ├─ Is the agent always on, with its own persona?
  │     YES -> Letta
  │     NO  -> next
  ├─ Is Claude the only model, and is simplicity paramount?
  │     YES -> Anthropic Memory API
  │     NO  -> next
  ├─ Do you just need to remember "user preferences + facts"?
  │     YES -> Mem0
  │     NO  -> go back up and redefine the requirement
  └─ Will memory grow to hundreds of thousands of entries?
        YES -> add MemPress as a compression layer
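
The same tree, written as a function so it can sit in a design doc or a checklist script. The function and parameter names are hypothetical; the questions, their order, and the outcomes are taken from the tree above, with the memory-size question treated as an add-on to whichever pick comes first.

```python
def pick_memory_stack(
    needs_temporal_or_causal: bool,
    always_on_persona: bool,
    claude_only_and_simple: bool,
    preferences_and_facts: bool,
    expects_huge_memory: bool = False,
) -> list[str]:
    """Return the suggested stack, checking the questions in the order given above."""
    if needs_temporal_or_causal:
        picks = ["Zep or Cognee+Graphiti"]
    elif always_on_persona:
        picks = ["Letta"]
    elif claude_only_and_simple:
        picks = ["Anthropic Memory API"]
    elif preferences_and_facts:
        picks = ["Mem0"]
    else:
        picks = ["redefine the requirement"]
    # The size question is independent of the others, so it only adds a layer.
    if expects_huge_memory:
        picks.append("MemPress as a compression layer")
    return picks


print(pick_memory_stack(False, False, False, True, expects_huge_memory=True))
```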

Epilogue — Memory Is the Next Decision Point in 2026 AI Infra

In 2023 and 2024 the infra debate was about vector DBs. In 2025 it was about agent frameworks. In 2026 the debate is about memory architecture.

Three big takeaways:

  1. Pure vector memory is only step one. Once time, conflicts, and relations enter the picture, you need graph or episodic memory.
  2. Memory outlives the model. Models change every six months, but user memory must last for years. Data portability is decisive — do not lock memory inside a SaaS black box.
  3. Memory is hard to evaluate. Benchmark standards are immature. Building your own recall eval set (question to expected fact) is the safest bet.
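
For the third takeaway, a recall eval set can be as small as a list of (question, expected fact) pairs. The sketch below is a minimal harness under stated assumptions: `recall_at_k` is a hypothetical helper, `fake_search` stands in for whatever memory backend is under test, and a case-insensitive substring check stands in for a proper match criterion.

```python
def recall_at_k(eval_set, search, k: int = 3) -> float:
    """Fraction of questions whose expected fact appears in the top-k results."""
    hits = 0
    for question, expected in eval_set:
        results = search(question)[:k]
        if any(expected.lower() in r.lower() for r in results):
            hits += 1
    return hits / len(eval_set)


# Stand-in for a real memory backend: canned answers keyed by question.
FAKE_INDEX = {
    "Where does the user work?": ["User works at Bravo", "User lives in Korea"],
    "What database does the user use?": ["User likes Rust"],
}


def fake_search(question: str) -> list[str]:
    return FAKE_INDEX.get(question, [])


eval_set = [
    ("Where does the user work?", "works at Bravo"),
    ("What database does the user use?", "uses Postgres"),
]
score = recall_at_k(eval_set, fake_search)
```

Swapping `fake_search` for a thin wrapper around Mem0, Zep, or a custom store gives a like-for-like number across backends, which is more than most published comparisons offer.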

"An agent's intelligence is decided by the model, but its usefulness is decided by its memory."

The same model with a different memory system becomes a completely different assistant. So treat the memory choice as an infrastructure decision — as weighty as the model choice itself.


References