TL;DR: Too much text? Skip the docs and check out our landing page for the 30-second version.

Inspiration

We didn't set out to build a memory layer; the problem found us.

While building AI-powered projects over several months, we relied heavily on models like GPT, Gemini, and Claude as collaborative partners. Early on, things were seamless: the model understood our architecture, remembered past decisions, and could reason about trade-offs we had discussed weeks prior.

Then things started breaking.

The Forgetting Problem

As conversations grew longer, older context got silently compressed or dropped to fit within the model's context window. Critical details (architectural decisions, API contracts, constraint discussions) simply vanished. Worse, the model didn't just forget; it started hallucinating about things we had already defined, confidently contradicting its own earlier responses.

The model remembered enough to sound right, but not enough to be right.

The Portability Problem

Context was trapped inside individual chat sessions. Moving from GPT → Claude, or from a chat interface → an IDE agent, meant:

  • Copy-pasting entire conversation histories (bloating token usage)
  • Losing the structure of what was discussed vs. what was decided
  • No way to transfer understanding, only raw text

Planning a feature in ChatGPT, then asking Cursor or Copilot to implement it? Nearly impossible without manually reconstructing the full context.

The Core Insight

Modern LLMs have incredible reasoning but zero persistent memory. Every session starts from scratch. Every context window is a ticking clock. There is no shared, structured memory layer across models, sessions, or tools.

That's the gap Xmem fills: a unified, persistent, and transferable memory system that gives AI agents the long-term recall they were never designed to have.


What It Does

Xmem is a unified memory infrastructure for AI agents: think of it as an external hippocampus that any AI model can plug into.

The Core Idea

Instead of cramming everything into a model's limited context window, Xmem externalizes memory: storing, structuring, and retrieving context so AI agents can recall what matters, when it matters.

Without Xmem → With Xmem:

  • Context lost after session ends → Persistent memory across sessions
  • Copy-paste to switch models → Shared memory across GPT, Claude, Gemini
  • IDE agents start from zero → Full project context available instantly
  • Long chats drifting into hallucinations → Relevant context injected on demand

Intelligent Memory Routing

Not all memories are equal. Xmem automatically classifies and routes information into structured memory types:

  • User Profiles → preferences, roles, past decisions
  • Temporal Events → what happened and when (meetings, milestones)
  • Code Knowledge → architecture, APIs, constraints, implementation details
  • Conversational Context → key discussion points, decisions, trade-offs

When a new query arrives, Xmem retrieves only the most relevant memories and injects them into the prompt, keeping token usage lean and context razor-sharp.
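As a rough sketch of the routing idea, here is a toy router: a keyword heuristic stands in for the real LLM-based classifier, and the type names and `route` helper are illustrative, not Xmem's actual API:

```python
MEMORY_TYPES = ("user_profile", "temporal_event", "code_knowledge", "conversational_context")

def classify(message: str) -> str:
    """Toy keyword heuristic standing in for the real LLM-based classifier."""
    text = message.lower()
    if any(k in text for k in ("prefer", "my role", "i like")):
        return "user_profile"
    if any(k in text for k in ("yesterday", "tuesday", "meeting", "milestone")):
        return "temporal_event"
    if any(k in text for k in ("api", "schema", "endpoint", "function")):
        return "code_knowledge"
    return "conversational_context"

def route(message: str, store: dict) -> dict:
    """File the message under its classified memory type."""
    store.setdefault(classify(message), []).append(message)
    return store

store = {}
for msg in ("I prefer dark mode in the editor",
            "The /login endpoint now returns a session cookie",
            "We agreed on the trade-offs during Tuesday's call"):
    route(msg, store)
```

In the real system the classifier is itself an LLM call, but the shape is the same: every message lands in exactly one typed pipeline, and retrieval later queries only the relevant buckets.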

Use It Anywhere

Xmem doesn't lock you into one tool — it becomes your universal AI memory:

  • IDE Agents → Cursor, Copilot, Claude Code get full project context from day one using our MCP
  • Chat Interfaces → ChatGPT, Claude, Gemini all share the same memory pool through our Extension
  • Your Own Apps → Integrate Xmem into products like Alexa, custom assistants, or internal tools
  • Repo Scanner → Ingest an entire codebase into Xmem, then query it from any AI, no cloning needed

Plan on ChatGPT. Refine on Claude. Build in your IDE. One memory. Zero context loss.


How We Built It

Xmem isn't a wrapper around a single vector DB; it's a multi-tier memory architecture built from the ground up to handle the full complexity of long-term AI memory.

Architecture Overview

The Classifier Agent

The brain of the system. Every incoming message passes through a Classifier Agent that:

  1. Analyzes intent → is this a personal detail, a temporal event, or a code discussion?
  2. Routes to the right pipeline → each memory type has its own specialized processing chain
  3. Triggers retrieval → pulls the most relevant existing memories to enrich the response

Hybrid Storage Layer

We chose three database paradigms because no single DB can handle all memory patterns:

  • Pinecone (vector DB) → semantic search and long-term summaries; finds contextually similar memories even when the wording differs
  • Neo4j (graph DB) → entity relationships and temporal events; models connections ("who said what, when, and how it relates")
  • MongoDB (document DB) → structured logs, code snippets, and interaction data; fast retrieval of structured objects with flexible schemas

Multi-LLM Registry

Xmem is model-agnostic by design. Our registry system supports:

  • OpenAI · Anthropic · Google Gemini · AWS Bedrock · OpenRouter
  • Access to 1000+ LLMs through a unified interface
  • Automatic fallback — if a provider is down, Xmem seamlessly routes to the next available model
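The fallback behavior can be sketched like this; the provider names and callables below are simulated stand-ins, not the real registry's interface:

```python
class ProviderDown(Exception):
    """Raised when a provider call fails (network error, rate limit, outage)."""

def flaky_openai(prompt: str) -> str:
    raise ProviderDown("simulated outage")      # pretend this provider is down

def healthy_anthropic(prompt: str) -> str:
    return f"claude says: {prompt}"             # pretend this one responds

# Priority-ordered registry of (name, callable) pairs
REGISTRY = [("openai", flaky_openai), ("anthropic", healthy_anthropic)]

def complete(prompt: str) -> str:
    """Walk the registry in order, falling back when a provider is down."""
    last_err = None
    for name, call in REGISTRY:
        try:
            return call(prompt)
        except ProviderDown as err:
            last_err = err                      # remember why, try the next one
    raise RuntimeError(f"all providers unavailable: {last_err}")

answer = complete("summarize the auth decision")
```

With the first provider failing, the call transparently lands on the second; the caller never sees the outage.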

Each conversation generates structured memory objects. Over hundreds of interactions, older memories are progressively summarized — reducing token cost while preserving the facts that matter.


Challenges We Ran Into

Building a memory layer for AI sounds simple until you realize you're fighting against how LLMs fundamentally work. Here are the hardest problems we tackled:

1. The Summarization Trap

Compress too much → lose critical facts. Compress too little → blow up the context window.

Most memory systems store raw chat logs or aggressively summarize them into vague paragraphs. Both approaches fail. We had to engineer a middle ground:

  • Store raw conversations → massive token usage; the context window fills up fast
  • Aggressive summarization → reasoning details lost; the model starts hallucinating
  • Our approach, structured fact extraction → extract atomic facts, discard filler, preserve relationships

The goal isn't to remember everything; it's to remember the right things in the fewest tokens possible, so that when memories are recalled, they enrich the context window without flooding it.
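A minimal sketch of the fact-extraction idea, with a simple keyword heuristic standing in for the LLM extractor the real pipeline uses (the markers and helper name are illustrative):

```python
import re

# Sentences containing these markers look like decisions or constraints
FACT_MARKERS = ("decided", "switch", "must", "will use", "agreed")

def extract_facts(transcript: str) -> list:
    """Keep decision-like sentences, drop filler.
    (Toy heuristic; the real pipeline uses an LLM extractor.)"""
    sentences = re.split(r"(?<=[.!?])\s+", transcript)
    return [s.strip() for s in sentences
            if any(marker in s.lower() for marker in FACT_MARKERS)]

transcript = (
    "Hey, thanks for the context! So, after a lot of back and forth, "
    "we decided to switch from JWT to session tokens. "
    "Anyway, the weather here is great. "
    "Also, every service must validate the session server-side."
)
facts = extract_facts(transcript)
```

Two atomic facts survive; the greetings and small talk are gone, and only the surviving facts ever cost recall tokens later.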

2. Retrieval Quality

Semantic search alone isn't enough. A query like "what did we decide about auth?" might miss a memory stored as "switched from JWT to session tokens on March 5th": the words don't overlap, but the meaning does.

We had to fuse three retrieval signals:

  • Vector similarity (Pinecone) → semantically related memories
  • Graph traversal (Neo4j) → entity relationships: "auth" → "JWT" → "session tokens"
  • Temporal ordering → recent context weighted higher than stale memories

No single signal was sufficient. The combination is what made retrieval actually reliable.
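One way to picture the fusion is a weighted blend of the three signals. The weights and decay constant below are illustrative, not Xmem's tuned values:

```python
import math

def fused_score(vec_sim: float, graph_hops: int, age_days: float,
                weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Blend vector similarity, graph proximity, and recency into one rank.
    (Illustrative weights, not the production ranking function.)"""
    graph_score = 1.0 / (1 + graph_hops)    # fewer hops = more closely related
    recency = math.exp(-age_days / 30.0)    # exponential decay, ~30-day constant
    w_vec, w_graph, w_time = weights
    return w_vec * vec_sim + w_graph * graph_score + w_time * recency

# A fresh, closely linked memory should outrank a stale, distant one,
# even when raw vector similarity is slightly lower
fresh = fused_score(vec_sim=0.82, graph_hops=1, age_days=2)
stale = fused_score(vec_sim=0.86, graph_hops=4, age_days=120)
```

The point of the blend is exactly the failure case above: a memory that a single signal would rank poorly can still win when the other two signals agree it matters.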

3. Token Economics

This was a non-obvious but critical challenge. Storing memories is pointless if recalling them eats up your entire context window with bloated text.

We optimized for minimum recall footprint:

  • Raw conversation → structured extraction → only key facts stored
  • Recalled memories are token-compressed representations, not chat transcripts
  • Result: recall injects 10-50x fewer tokens than replaying the original conversation

4. Latency vs. Intelligence Trade-off

Every query triggers a pipeline: classify → retrieve → rank → inject → respond. Each step adds latency. We went through multiple iterations to bring this under acceptable response times:

  • Parallel retrieval across all three databases
  • Pre-computed embeddings to avoid real-time vectorization bottlenecks
  • Aggressive caching for frequently accessed memory clusters
  • Parallel agent invocation, with agents designed to produce good results even on fast, low-reasoning models like gemini-2.5-flash or gpt-4.1-mini
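The parallel fan-out across the three stores can be sketched with asyncio; the query functions here are mocks simulating round-trip latency, not real database clients:

```python
import asyncio

async def query_vector(q: str) -> list:
    await asyncio.sleep(0.05)               # simulated Pinecone round trip
    return [f"vector-hit:{q}"]

async def query_graph(q: str) -> list:
    await asyncio.sleep(0.05)               # simulated Neo4j traversal
    return [f"graph-hit:{q}"]

async def query_docs(q: str) -> list:
    await asyncio.sleep(0.05)               # simulated MongoDB lookup
    return [f"doc-hit:{q}"]

async def retrieve(q: str) -> list:
    """Fan out to all three stores concurrently instead of sequentially."""
    results = await asyncio.gather(query_vector(q), query_graph(q), query_docs(q))
    return [hit for group in results for hit in group]

hits = asyncio.run(retrieve("auth"))
```

Sequentially these three calls would cost the sum of their latencies; gathered concurrently, the total is roughly the slowest single call.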

5. Broken Benchmarks

When we tried to evaluate Xmem, we found that most popular memory benchmarks don't reflect real-world usage: they test toy scenarios, not multi-session, multi-model workflows.

We validated against research-grade datasets like LongMemEval and LOCOMO, and are actively building our own evaluation dataset grounded in real developer workflows, because if the benchmark doesn't match reality, a good score means nothing.

Every challenge pushed us toward the same principle: store less, recall smarter, and never trust a benchmark that doesn't match production.


Accomplishments That We're Proud Of

Benchmark-Crushing Results

We didn't just build a memory system; we benchmarked it against every major player and came out on top.

  • Mem0 (Y Combinator '24) → +23 points higher accuracy
  • Supermemory (VC-funded) → +9 points higher accuracy
  • Zep (industry established) → +8 points higher accuracy
  • Memobase (open-source memory layer) → +7 points higher accuracy
  • Backboard (backed by Mistral AI) → beaten in multi-session & temporal retrieval

Tested on LOCOMO and LongMemEval, two of the most rigorous memory evaluation benchmarks in the field. These aren't cherry-picked metrics; these are head-to-head comparisons on standardized datasets.

Three Interfaces, One Memory Layer

In a short sprint, we shipped three fully functional products — all backed by the same unified memory engine:

  • SDK → integrate Xmem memory into any application programmatically
  • Browser Extension → persistent memory on ChatGPT, Claude, Gemini, and any web-based AI
  • MCP Server → plug directly into IDE agents like Claude Code, Cursor, and Copilot

One memory layer. Three surfaces. Zero context loss across any of them.

Hybrid Storage Architecture

While most memory startups rely on a single vector database, we integrated Pinecone + Neo4j + MongoDB into a unified retrieval layer, enabling semantic search, relationship traversal, and structured recall simultaneously.

TOON — Token as Object Notation

We invented a new output format to replace JSON in LLM-generated structured data:

  • Token efficiency → JSON wastes tokens on brackets, quotes, and spaces; TOON uses minimal syntax for maximum information density
  • Parse robustness → one hallucinated bracket fails an entire JSON parse; TOON degrades gracefully with tolerant parsing
  • LLM compatibility → models frequently malform complex JSON; TOON's simpler structure means fewer generation errors

Instead of asking the LLM to produce fragile JSON, TOON gives us structured extraction that doesn't break when the model hiccups — making the entire memory pipeline significantly more reliable.
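Since the TOON grammar itself isn't shown here, the following is only an illustration of the tolerant-parsing idea, not TOON's actual syntax: a line-based format where one damaged line degrades gracefully, while an equivalent one-character defect kills a JSON parse:

```python
import json

def parse_tolerant(text: str) -> dict:
    """Line-based key/value parsing: a malformed line is skipped, not fatal.
    (An illustration of the tolerant-parsing idea, not TOON's grammar.)"""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue                        # degrade gracefully on a bad line
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

good = "user: Alice\ndecision: session tokens\ndate: 2024-03-05"
damaged = good.replace("decision:", "decision")   # model dropped one colon

# JSON with an equivalent one-character defect (a missing comma) fails entirely
bad_json = '{"user": "Alice" "decision": "session tokens"}'
try:
    json.loads(bad_json)
    json_survived = True
except json.JSONDecodeError:
    json_survived = False
```

The tolerant parser still recovers two of three fields from the damaged input; the JSON parse recovers nothing.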


What We Learned

1. The Context Window Is a Lie

Bigger context windows ≠ better memory. Even with 128K+ token windows:

  • Cost scales linearly with context size
  • Latency increases with every additional token
  • Models still lose attention over long contexts (the "lost in the middle" problem)
  • Compression still silently drops critical details

The real solution isn't a bigger window; it's smarter memory outside the window.

2. No Single Retrieval Strategy Wins

This was one of our biggest technical takeaways:

  • "What did we decide about auth?" → vector search returns vague matches; graph traversal through related entities works
  • "What happened last Tuesday?" → semantic similarity fails (time isn't semantic); temporal indexing with structured events works
  • "Show me the DB schema discussion" → keyword search is too narrow; vector search across code memory works
  • "How does auth connect to the billing module?" → no single strategy suffices; hybrid retrieval combining all three works

We started with vector-only. We ended with a fused retrieval engine, because real-world queries don't fit one pattern.

3. New Tech We Picked Up Along the Way

Building Xmem pushed us deep into technologies none of us had used before:

  • LangGraph → orchestrating multi-step memory pipelines as stateful graphs
  • Graph RAG → retrieval-augmented generation powered by entity relationships, not just embeddings
  • Multi-Agent Orchestration → coordinating classifier, retriever, and summarizer agents in parallel
  • Token Cost Optimization → engineering every stage of the pipeline to minimize token footprint

4. The Biggest Opportunity in AI Infra

The AI ecosystem has mature tools for inference, fine-tuning, and deployment. But when it comes to long-term memory for agents? The tooling is nearly non-existent.

We didn't just learn how to build a memory system; we learned that this space is wide open, and the teams that solve memory infrastructure will shape how the next generation of AI agents actually work.


What's Next for Xmem

Going Open Source

Xmem will be fully open-sourced because memory infrastructure should be a public primitive, not a walled garden. We want developers and researchers worldwide to build on top of it, stress-test it, and push it further than we can alone.

Building the Benchmark That Doesn't Exist Yet

Existing memory benchmarks don't capture real-world complexity. We're building a new evaluation framework focused on:

  • Multi-session recall — can the system remember context across 50+ separate conversations?
  • Temporal reasoning — does it know when things happened, not just what?
  • Cross-model consistency — is memory accurate when switching between GPT, Claude, and Gemini?
  • Token efficiency — how many tokens does recall cost vs. raw replay?

Roadmap

  • Latency optimization → sub-100ms retrieval for real-time agent workflows
  • Expanded SDK support → Python, TypeScript, and Go drop-in integration in under 5 lines
  • Agent framework plugins → native support for LangChain, CrewAI, AutoGen, and LlamaIndex
  • Memory sharing & teams → shared memory pools across team members and agents
  • Memory permissions → granular control over what agents can read, write, or forget

The Vision

Today, every AI conversation starts from zero. Every agent forgets you the moment the session ends.

We're building toward a world where AI remembers your preferences, your projects, your decisions — across every model, every tool, every session.

Xmem: the default memory layer for AI agents.
