Inspiration

We wanted a standardized way to objectively compare AI memory providers like SuperMemory, Mem0, and Zep; there was no unified benchmarking platform for evaluating long-term context retention and semantic retrieval.

What it does

MemoryBench evaluates and compares AI memory providers across three standardized benchmarks (NoLiMa, LongMemEval, and LoCoMo), testing semantic retrieval, long-term memory, and conversational context maintenance, and presents the results in an interactive dashboard. It is designed so that more benchmarks or agents can be added easily in the future.

How we built it

We built MemoryBench with Bun and TypeScript, using factory patterns for provider auto-discovery, template methods for consistent benchmark execution, comprehensive checkpointing, multi-model LLM support, and a Chart.js visualization dashboard. A simplified sketch of the template-method idea is shown below.
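To illustrate the template-method structure, here is a minimal sketch: the base class fixes the execution order (load cases, feed memories, query, judge), while each benchmark fills in the hooks. The interface and method names here are illustrative assumptions, not the exact ones in our codebase.

```typescript
// Sketch only: names like MemoryProvider, loadCases, and judge are assumptions for illustration.

interface MemoryProvider {
  name: string;
  addMemory(sessionId: string, text: string): Promise<void>;
  search(sessionId: string, query: string): Promise<string[]>;
}

interface BenchmarkResult {
  benchmark: string;
  provider: string;
  score: number; // fraction of cases judged correct
}

// Template method: run() defines the fixed lifecycle; subclasses override the hooks.
abstract class Benchmark {
  abstract readonly name: string;

  protected abstract loadCases(): Promise<{ context: string[]; question: string; expected: string }[]>;
  protected abstract judge(answer: string, expected: string): Promise<boolean>;

  async run(provider: MemoryProvider): Promise<BenchmarkResult> {
    const cases = await this.loadCases();
    let correct = 0;
    for (const [i, c] of cases.entries()) {
      const sessionId = `${this.name}-${i}`;
      // Feed the conversational context into the provider's memory.
      for (const turn of c.context) await provider.addMemory(sessionId, turn);
      // Query the provider and judge the retrieved answer against the reference.
      const retrieved = await provider.search(sessionId, c.question);
      if (await this.judge(retrieved.join("\n"), c.expected)) correct++;
    }
    return { benchmark: this.name, provider: provider.name, score: correct / cases.length };
  }
}
```

Because every benchmark runs through the same `run()` skeleton, results stay comparable across providers and benchmarks.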

Challenges we ran into

Managing diverse provider APIs, implementing reliable multi-level checkpointing, ensuring consistent LLM-as-judge evaluation, handling API rate limits, and accurately measuring semantic retrieval without exact string matching.
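The last point deserves a concrete example: exact string matching punishes providers for paraphrasing, so we lean on an LLM judge for semantic equivalence. The sketch below shows the general shape of that check; `LLMClient` is a placeholder interface, not a specific SDK, and the prompt wording is an assumption.

```typescript
// Sketch only: LLMClient is a generic placeholder, not our actual model client.

interface LLMClient {
  complete(prompt: string): Promise<string>;
}

// Grade a candidate answer against the reference by semantic equivalence, not exact wording.
async function judgeAnswer(
  llm: LLMClient,
  question: string,
  expected: string,
  actual: string,
): Promise<boolean> {
  const prompt = [
    "You are grading a memory-retrieval benchmark.",
    `Question: ${question}`,
    `Reference answer: ${expected}`,
    `Candidate answer: ${actual}`,
    "Reply with exactly CORRECT or INCORRECT, judging meaning rather than wording.",
  ].join("\n");
  const verdict = (await llm.complete(prompt)).trim().toUpperCase();
  return verdict.startsWith("CORRECT");
}
```

Keeping the judge prompt fixed across providers is what makes the comparison consistent, which is why prompt drift was one of our harder problems.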

Accomplishments that we're proud of

Created an extensible platform where adding a new provider requires just one file, with production-ready checkpointing, three comprehensive benchmarks, interactive visualizations, and full type safety across the codebase; a sketch of the one-file provider idea follows.
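To show what "one file per provider" means in practice, here is a simplified sketch of file-based auto-discovery: each provider module exports a factory, and the registry imports everything in a directory at startup. The directory layout, file names, and interface are illustrative assumptions, not our exact repo structure.

```typescript
// Sketch only: a hypothetical providers/ directory where each file default-exports a factory.

import { readdir } from "node:fs/promises";

interface MemoryProvider {
  name: string;
  addMemory(sessionId: string, text: string): Promise<void>;
  search(sessionId: string, query: string): Promise<string[]>;
}

type ProviderFactory = () => MemoryProvider;

// Scan the directory, import each provider module, and register its factory by name.
export async function discoverProviders(dir = "./providers"): Promise<Map<string, ProviderFactory>> {
  const registry = new Map<string, ProviderFactory>();
  for (const file of await readdir(dir)) {
    if (!file.endsWith(".ts")) continue;
    const mod = await import(`${dir}/${file}`);
    const factory: ProviderFactory = mod.default;
    registry.set(factory().name, factory);
  }
  return registry;
}
```

With this pattern, dropping a new file into the providers directory is enough for it to show up in every benchmark run.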

What we learned

Factory patterns with auto-discovery enable true extensibility, checkpointing is crucial for long-running benchmarks, and clean abstraction layers allow providers and benchmarks to evolve independently.
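The checkpointing lesson is easiest to see in code. The sketch below shows the basic idea we converged on: persist per-case results after every step so an interrupted run resumes where it stopped. The file name and checkpoint shape are assumptions for illustration.

```typescript
// Sketch only: checkpoint format and file name are assumptions, not our exact implementation.

import { existsSync, readFileSync, writeFileSync } from "node:fs";

interface Checkpoint {
  completed: Record<string, boolean>; // caseId -> judged correct?
}

function loadCheckpoint(path: string): Checkpoint {
  return existsSync(path) ? JSON.parse(readFileSync(path, "utf8")) : { completed: {} };
}

function saveCheckpoint(path: string, cp: Checkpoint): void {
  writeFileSync(path, JSON.stringify(cp, null, 2));
}

// Run all cases, skipping ones already completed and flushing progress after each one.
async function runWithCheckpoint(
  caseIds: string[],
  runCase: (id: string) => Promise<boolean>,
  path = "checkpoint.json",
): Promise<Checkpoint> {
  const cp = loadCheckpoint(path);
  for (const id of caseIds) {
    if (id in cp.completed) continue; // already finished in a previous run
    cp.completed[id] = await runCase(id);
    saveCheckpoint(path, cp);
  }
  return cp;
}
```

Writing after every case costs a little I/O but means rate-limit failures or crashes never throw away hours of benchmark progress.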

What's next for MemoryBench

Expanding to more benchmarks and providers, adding latency and cost metrics, building a public leaderboard, and creating tools for users to design custom domain-specific memory benchmarks.
