Inspiration
Every memory provider claims superior retrieval, but there's no universal way to test them. The RAG ecosystem needed what the ML world has with ImageNet: a definitive benchmark that settles debates with data, not marketing claims.
What it does
MemoryBench is a plug-and-play benchmarking platform that evaluates any memory system (APIs, vector databases, local solutions) using standardized tests. Drop a provider folder, and it's auto-discovered—no config needed. Run benchmarks like LongMemEval (2,000+ questions), get accuracy metrics, latency percentiles, task-type breakdowns, and regression detection. It handles API failures with checkpointing, resumes interrupted runs automatically, and produces interactive dashboards for comparing results.
How we built it
Core Architecture: TypeScript + Bun with a three-layer design—auto-discovery engine, orchestration runner, and provider abstraction layer.
The Discovery Engine scans directories, dynamically imports modules, validates interfaces at runtime, and registers them—zero manual config. This was the hardest part: maintaining type safety across dynamic boundaries required Zod schemas + TypeScript inference.
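The validate-then-register step could look roughly like the sketch below. It uses a hand-written type guard as a dependency-free stand-in for the Zod schema mentioned above. The field names (`name`, `kind`), the `ProviderModule` shape, and `registerProviders` are hypothetical, not MemoryBench's actual interface.

```typescript
// Hypothetical provider contract; the real MemoryBench interface may differ.
interface ProviderMeta {
  name: string;
  kind: "api" | "vector-db" | "local";
}

interface ProviderModule {
  meta: ProviderMeta;
  search: (query: string) => Promise<string[]>;
}

const KINDS: readonly string[] = ["api", "vector-db", "local"];

// Runtime type guard: narrows `unknown` to ProviderModule only after checking every field.
function isProviderModule(mod: unknown): mod is ProviderModule {
  if (typeof mod !== "object" || mod === null) return false;
  const candidate = mod as Record<string, unknown>;
  const meta = candidate.meta;
  if (typeof meta !== "object" || meta === null) return false;
  const m = meta as Record<string, unknown>;
  return (
    typeof m.name === "string" &&
    m.name.length > 0 &&
    typeof m.kind === "string" &&
    KINDS.includes(m.kind) &&
    typeof candidate.search === "function"
  );
}

// Registers every module that passes validation and reports the rest.
// In a real engine, `modules` would be filled by `await import(path)` for each provider folder.
function registerProviders(modules: Record<string, unknown>) {
  const registry = new Map<string, ProviderModule>();
  const rejected: string[] = [];
  for (const [path, mod] of Object.entries(modules)) {
    if (isProviderModule(mod)) registry.set(mod.meta.name, mod);
    else rejected.push(path);
  }
  return { registry, rejected };
}
```

Collecting the rejected paths instead of throwing on the first bad module means one malformed provider folder does not block the rest from loading.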
The Checkpoint System tracks progress at test-case granularity with atomic writes (temp file → rename). Exponential backoff with jitter handles network failures: $\text{backoff}(n) = \min(30000, 1000 \cdot 2^n + \text{rand}(0, 1000))$.
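The backoff formula translates almost directly into code. In this sketch the constants are assumed to be milliseconds and `rand(0, 1000)` is taken to be uniform. `withRetry` and its `maxAttempts` parameter are hypothetical helpers, not MemoryBench's actual retry API.

```typescript
// backoff(n) = min(30000, 1000 * 2^n + rand(0, 1000)), assumed to be in milliseconds.
// `rand` is injectable so the function can be tested deterministically.
function backoffMs(attempt: number, rand: () => number = Math.random): number {
  const jitter = rand() * 1000;
  return Math.min(30_000, 1000 * 2 ** attempt + jitter);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical retry wrapper: retries `fn` until it succeeds or attempts run out.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No point waiting after the final failed attempt.
      if (attempt < maxAttempts - 1) await wait(backoffMs(attempt));
    }
  }
  throw lastError;
}
```

Ignoring jitter, the delays are 1 s, 2 s, 4 s, 8 s, and 16 s for n = 0 through 4, and the 30 s cap applies from n = 5 onward.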
Integrated three major benchmarks: LongMemEval (reverse-engineered their format, built batching for 2K+ questions), LoCoMo (conversational memory), NoLiMa (needle-in-haystack). Used AI SDK with GPT-4/Gemini for LLM-as-judge evaluation.
The orchestrator runs benchmark × provider matrices with smart cleanup (handles stateless, stateful, and reset-capable providers differently).
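One way to model the three cleanup cases is a discriminated union, sketched below. What each case does here (no-op, a single reset call, per-item removal) is an assumption about reasonable behavior, and the `lifecycle` tag, `cleanup`, and `runMatrix` are hypothetical names rather than the orchestrator's real API.

```typescript
// Hypothetical provider shapes, distinguished by a `lifecycle` tag.
type Provider =
  | { lifecycle: "stateless" }
  | { lifecycle: "resettable"; reset: () => Promise<void> }
  | { lifecycle: "stateful"; remove: (id: string) => Promise<void> };

// Cleans up after one run and returns a label describing what was done.
async function cleanup(provider: Provider, insertedIds: string[]): Promise<string> {
  switch (provider.lifecycle) {
    case "stateless":
      return "nothing to clean";
    case "resettable":
      await provider.reset();
      return "reset";
    case "stateful":
      // No bulk reset available, so remove each inserted item individually.
      for (const id of insertedIds) await provider.remove(id);
      return `removed ${insertedIds.length} items`;
  }
}

// Runs every benchmark against every provider, cleaning up after each cell.
// `runCell` is assumed to return the IDs of the memories it inserted.
async function runMatrix(
  benchmarks: string[],
  providers: Record<string, Provider>,
  runCell: (benchmark: string, providerName: string) => Promise<string[]>,
): Promise<string[]> {
  const log: string[] = [];
  for (const benchmark of benchmarks) {
    for (const [name, provider] of Object.entries(providers)) {
      const insertedIds = await runCell(benchmark, name);
      log.push(`${benchmark} x ${name}: ${await cleanup(provider, insertedIds)}`);
    }
  }
  return log;
}
```

Because the `switch` is exhaustive over the union, adding a fourth lifecycle later produces a compile error in `cleanup` until the new case is handled.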
Challenges we ran into
1. Type Safety in Dynamic Land: How do you maintain TypeScript guarantees when dynamically importing unknown modules? Solved with runtime Zod validation + type inference: type ProviderMeta = z.infer<typeof Schema>.
2. Checkpoint Corruption: Learned the hard way that a crash during a checkpoint write corrupts state. Solution: atomic file operations. Write to a temp file, then rename it over the target (rename is atomic on POSIX when both paths are on the same filesystem).
3. LongMemEval Integration: Their format wasn't designed for external orchestration. Had to build a custom orchestrator that batches operations, manages session state across 2,000+ questions, and handles multi-hour runs without OOM.
4. Evaluation Subjectivity: Is "John lives in NYC" equal to "John's residence is New York City"? String matching fails. Integrated LLM judges, then spent days prompt-engineering for consistent semantic evaluation.
5. The Resumability Problem: Not just "skip completed tests"—needed to track partial progress within benchmark × provider combinations, estimate ETA accurately, and handle cascading failures gracefully.
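The atomic write from item 2 and the basic resume loop behind item 5 can be sketched together. This is a minimal version under stated assumptions: the `Checkpoint` shape, the function names, and the synchronous `runTest` callback are hypothetical, and the partial-progress tracking and ETA estimation described above are left out.

```typescript
import { existsSync, readFileSync, renameSync, rmSync, writeFileSync } from "node:fs";

// Hypothetical checkpoint shape: completed test-case IDs for one benchmark x provider pair.
interface Checkpoint {
  benchmark: string;
  provider: string;
  completed: string[];
}

// Write to a sibling temp file, then rename over the target.
// rename() is atomic on POSIX when both paths are on the same filesystem,
// so a reader sees either the old file or the new one, never a partial write.
function saveCheckpoint(path: string, checkpoint: Checkpoint): void {
  const tmpPath = `${path}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(checkpoint));
  renameSync(tmpPath, path);
}

function loadCheckpoint(path: string, benchmark: string, provider: string): Checkpoint {
  if (!existsSync(path)) return { benchmark, provider, completed: [] };
  return JSON.parse(readFileSync(path, "utf8")) as Checkpoint;
}

// Removes the checkpoint once a run has finished and its results are stored elsewhere.
function clearCheckpoint(path: string): void {
  rmSync(path, { force: true });
}

// Resume: run only the test cases not yet recorded, checkpointing after each one.
function runWithResume(
  path: string,
  benchmark: string,
  provider: string,
  testIds: string[],
  runTest: (id: string) => void,
): Checkpoint {
  const checkpoint = loadCheckpoint(path, benchmark, provider);
  const done = new Set(checkpoint.completed);
  for (const id of testIds) {
    if (done.has(id)) continue;
    runTest(id); // if this throws, the last saved checkpoint stays intact
    checkpoint.completed.push(id);
    saveCheckpoint(path, checkpoint);
  }
  return checkpoint;
}
```

A test case is recorded only after it finishes, so a run that crashes partway through a test repeats that test on resume instead of skipping it.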
Accomplishments that we're proud of
Zero-Config Auto-Discovery: You literally drop providers/your-system/index.ts with a meta export and it just works. No registration, no config files, no manual wiring. This is the UX I wish every dev tool had.
Production-Grade Resilience: Network fails? Checkpoint saves. API rate-limited? Exponential backoff. Crash after test 1,847/2,000? Resume from 1,848. Built for the real world where everything breaks.
Three Major Benchmarks Integrated: LongMemEval (2K+ questions), LoCoMo, NoLiMa—complete with proper evaluation logic, task-type taxonomy, and confidence intervals.
Developer Experience: Real-time progress bars with ETA, verbose mode for debugging, interactive dashboards, CI integration for regression detection. Tools should feel good to use.
What we learned
Memory Systems Have Wildly Different Failure Modes: Running identical benchmarks revealed some systems excel at exact retrieval but fail at semantic search. Others nail recent context but lose older memories. This drove the task-type breakdown feature—aggregate accuracy hides critical insights.
Infrastructure Beats Algorithms for Dev Tools: The most impactful feature wasn't the evaluation logic—it was checkpointing and progress feedback. When you're running 2-hour benchmarks, seeing "ETA: 47m" is the difference between calm and panic.
Dynamic Imports + Type Safety Require a Philosophy Shift: TypeScript trains you to think statically. Dynamic discovery forced me to think in terms of runtime validation plus type inference. Zod became my bridge between worlds.
Atomic Operations Aren't Optional: Any state mutation that can be interrupted must be atomic. Learned this debugging corrupted checkpoints at 2 AM.
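The task-type breakdown from the first lesson above comes down to grouping results before averaging. A minimal sketch, assuming each result carries a task-type label and a pass/fail verdict from the judge (the `TestResult` shape and the task-type names in the usage note are hypothetical):

```typescript
// Hypothetical result record: one judged answer, tagged with its task type.
interface TestResult {
  taskType: string;
  correct: boolean;
}

// Groups results by task type and reports per-type accuracy next to the aggregate.
function accuracyByTaskType(results: TestResult[]) {
  const tally = new Map<string, { correct: number; total: number }>();
  for (const r of results) {
    const entry = tally.get(r.taskType) ?? { correct: 0, total: 0 };
    entry.total += 1;
    if (r.correct) entry.correct += 1;
    tally.set(r.taskType, entry);
  }

  const byType: Record<string, number> = {};
  tally.forEach((entry, taskType) => {
    byType[taskType] = entry.correct / entry.total;
  });

  // An empty result set is reported as 0 instead of NaN.
  const overall =
    results.length === 0 ? 0 : results.filter((r) => r.correct).length / results.length;
  return { overall, byType };
}
```

A provider that answers every "exact-retrieval" question correctly and every "semantic-search" question incorrectly scores 50% overall on an evenly split set, which is the kind of gap the aggregate number hides.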
What's next for saransh_soni
Short-term: Make MemoryBench the standard. Add streaming evaluation, cost tracking ($/query), adversarial test cases, and P95/P99 latency breakdowns. Build a public leaderboard where memory providers compete openly.
Long-term: Work on memory systems full-time. Supermemory.ai is building how we'll interact with information in the future. I want to make it provably, measurably, quantifiably better than everything else. Not through marketing—through rigorous evaluation and rapid iteration.
The gap between "we think this works" and "we know this works with 95% confidence intervals" is where I want to live. MemoryBench is my proof that I can build the infrastructure that bridges that gap.
I want to bring this mindset to supermemory.ai—ship fast, measure rigorously, iterate relentlessly. Memory systems are too important to rely on vibes.
Built With
- anthropic
- gemini
- react
- typescript
- vercel
- vertex