PROBLEM STATEMENT:

LLMs have weak long-term memory and inconsistent retrieval. There are dozens of benchmarks and even more provider implementations, but:
• every benchmark uses a different format
• every provider uses a different interface
• comparing systems is slow and inconsistent
• running experiments requires ad-hoc scripts
• visualizing and understanding results is painful

Teams waste time trying to evaluate memory instead of improving it.

Supermemory needs a general, extensible system to benchmark:
• retrieval
• memory rewriting
• answer correctness
• latency
• consistency across memory histories

WHAT I BUILT:

During the hackathon, I built:

A unified provider interface
• dummy
• echo
• OpenAI (via proxy)
• easy plug-in design for new providers
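To make the plug-in idea concrete, here is a minimal sketch of what such a provider interface could look like. The method names prepareProvider, addContext and searchQuery come from the runner box in the architecture diagram; the signatures, the SearchResult type, the behavior of the dummy and echo providers, and the createProvider registry are all assumptions for illustration, not the project's actual code.

```typescript
// Hypothetical shapes: only the three method names come from the writeup.
interface SearchResult {
  answer: string;
}

interface Provider {
  name: string;
  prepareProvider(): Promise<void>; // reset or initialize memory
  addContext(text: string): Promise<void>; // store one piece of context
  searchQuery(query: string): Promise<SearchResult>; // answer from memory
}

// "dummy" (assumed behavior): ignores context, always returns a fixed answer.
class DummyProvider implements Provider {
  name = "dummy";
  async prepareProvider(): Promise<void> {}
  async addContext(_text: string): Promise<void> {}
  async searchQuery(_query: string): Promise<SearchResult> {
    return { answer: "dummy answer" };
  }
}

// "echo" (assumed behavior): returns the query unchanged, which is handy
// for checking that the plumbing works end to end.
class EchoProvider implements Provider {
  name = "echo";
  async prepareProvider(): Promise<void> {}
  async addContext(_text: string): Promise<void> {}
  async searchQuery(query: string): Promise<SearchResult> {
    return { answer: query };
  }
}

// Hypothetical registry: plugging in a new provider is one new entry here.
const providers: Record<string, () => Provider> = {
  dummy: () => new DummyProvider(),
  echo: () => new EchoProvider(),
};

function createProvider(name: string): Provider {
  const factory = providers[name];
  if (!factory) throw new Error(`Unknown provider: ${name}`);
  return factory();
}
```

With a shape like this, the runner only ever talks to the Provider interface, so an OpenAI-backed provider would be one more class plus one more registry entry.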

A unified benchmark interface
• simple benchmark
• LoCoMo
• LongMemEval
• custom “myBenchmark”
• automatic data normalization
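As a rough illustration of the data normalization step, the sketch below maps two differently shaped raw inputs into one common item type. The BenchmarkItem shape and both raw formats (RawFlat, RawSessions) are made up for this example; they are not the real LoCoMo or LongMemEval schemas.

```typescript
// Assumed common shape that every benchmark gets normalized into.
interface BenchmarkItem {
  id: string;
  context: string[];
  question: string;
  expected: string;
}

// Two hypothetical raw formats standing in for differently shaped datasets.
interface RawFlat {
  question: string;
  answer: string;
  context: string[];
}

interface RawSessions {
  q: string;
  expected: string;
  sessions: string[][]; // context grouped by conversation session
}

function normalizeFlat(raw: RawFlat[], prefix: string): BenchmarkItem[] {
  return raw.map((r, i) => ({
    id: `${prefix}-${i}`,
    context: r.context,
    question: r.question.trim(),
    expected: r.answer.trim(),
  }));
}

function normalizeSessions(raw: RawSessions[], prefix: string): BenchmarkItem[] {
  return raw.map((r, i) => ({
    id: `${prefix}-${i}`,
    // Flatten the sessions into one ordered list of context strings.
    context: r.sessions.reduce<string[]>((acc, s) => acc.concat(s), []),
    question: r.q.trim(),
    expected: r.expected.trim(),
  }));
}
```

Once every benchmark produces BenchmarkItem values, the runner and the scoring code no longer need to know which dataset an item came from.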

A CLI runner
bun run run-all
Supports:
• multiple providers
• multiple benchmarks
• scoring
• latency measurement
• JSON results export
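The core loop of such a runner might look roughly like the sketch below. It assumes a Provider with the three methods named in the architecture diagram, scores answers with a simple normalized exact match (an assumption; the writeup does not say which scoring is used), and measures only the searchQuery latency. The ItemResult fields are hypothetical.

```typescript
// Minimal assumed shapes so the sketch is self-contained.
interface Provider {
  name: string;
  prepareProvider(): Promise<void>;
  addContext(text: string): Promise<void>;
  searchQuery(query: string): Promise<{ answer: string }>;
}

interface BenchmarkItem {
  id: string;
  context: string[];
  question: string;
  expected: string;
}

interface ItemResult {
  provider: string;
  benchmark: string;
  itemId: string;
  answer: string;
  expected: string;
  score: number; // 1 for a match, 0 otherwise
  latencyMs: number;
}

// Assumed scoring rule: case-insensitive exact match after trimming.
function exactMatch(answer: string, expected: string): number {
  const norm = (s: string): string => s.trim().toLowerCase();
  return norm(answer) === norm(expected) ? 1 : 0;
}

async function runBenchmark(
  provider: Provider,
  benchmark: string,
  items: BenchmarkItem[],
): Promise<ItemResult[]> {
  const results: ItemResult[] = [];
  for (const item of items) {
    await provider.prepareProvider(); // start each item from a clean state
    for (const text of item.context) {
      await provider.addContext(text);
    }
    const start = Date.now(); // millisecond resolution is enough for a sketch
    const { answer } = await provider.searchQuery(item.question);
    const latencyMs = Date.now() - start;
    results.push({
      provider: provider.name,
      benchmark,
      itemId: item.id,
      answer,
      expected: item.expected,
      score: exactMatch(answer, item.expected),
      latencyMs,
    });
  }
  return results;
}
```

Running this for every provider and benchmark pair and serializing the combined array with JSON.stringify(results, null, 2) would give a results file of the kind the dashboard loads.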

A visual dashboard

Built using HTML + JS + Chart.js:
• upload a results JSON file
• compare providers
• radar chart visualization
• benchmark breakdowns
• item-level explorer
• live “Ask a Provider” panel (supports dummy, echo, OpenAI architecture)
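One way the radar chart could be fed from an uploaded results file is sketched below: per-item results are averaged per provider and benchmark, then arranged as a Chart.js radar configuration (labels are benchmarks, one dataset per provider). The ItemResult fields and the toRadarConfig helper are assumptions; only the general config layout (type, data.labels, data.datasets) follows Chart.js.

```typescript
// Assumed subset of the results JSON that the chart needs.
interface ItemResult {
  provider: string;
  benchmark: string;
  score: number;
}

interface RadarConfig {
  type: "radar";
  data: {
    labels: string[];
    datasets: { label: string; data: number[] }[];
  };
}

// Hypothetical helper: mean score per (provider, benchmark) pair.
function toRadarConfig(results: ItemResult[]): RadarConfig {
  // Sets keep insertion order, so axes appear in first-seen order.
  const benchmarks = Array.from(new Set(results.map((r) => r.benchmark)));
  const providerNames = Array.from(new Set(results.map((r) => r.provider)));

  const datasets = providerNames.map((p) => ({
    label: p,
    data: benchmarks.map((b) => {
      const scores = results
        .filter((r) => r.provider === p && r.benchmark === b)
        .map((r) => r.score);
      // A provider with no items on a benchmark is plotted as 0.
      if (scores.length === 0) return 0;
      return scores.reduce((sum, s) => sum + s, 0) / scores.length;
    }),
  }));

  return { type: "radar", data: { labels: benchmarks, datasets } };
}
```

In the browser, the returned object could then be passed to new Chart(canvas, config) to draw one polygon per provider.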

A CORS-enabled proxy for live OpenAI requests
• secure
• unified with provider interface
• designed for live evaluation panel
• tested locally (blocked only by account quota)
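The CORS part of such a proxy can be sketched independently of the server runtime. The helper below decides which response headers to send for a given request origin and answers preflight (OPTIONS) requests. The function names, the allow-list approach and the ProxyResponse shape are assumptions for illustration; the header names are the standard CORS ones.

```typescript
// Hypothetical response shape, kept runtime-agnostic on purpose.
interface ProxyResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Returns CORS headers only for origins on the allow-list (assumed policy).
function corsHeaders(
  origin: string | null,
  allowedOrigins: string[],
): Record<string, string> {
  if (origin === null || allowedOrigins.indexOf(origin) === -1) {
    return {};
  }
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Methods": "POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    Vary: "Origin",
  };
}

// Answers a browser preflight; returns null for any non-OPTIONS request so
// the caller can go on to forward the real request upstream.
function handlePreflight(
  method: string,
  origin: string | null,
  allowedOrigins: string[],
): ProxyResponse | null {
  if (method !== "OPTIONS") return null;
  const headers = corsHeaders(origin, allowedOrigins);
  const allowed = Object.keys(headers).length > 0;
  return { status: allowed ? 204 : 403, headers, body: "" };
}
```

A proxy built this way would presumably keep the OpenAI API key on the server side and attach the same CORS headers to the forwarded response, so the dashboard never sees the key.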

ARCHITECTURE:

┌────────────────────┐    ┌────────────────────┐
│ Benchmarks         │    │ Providers          │
│   simple           │    │   dummy            │
│   LoCoMo           │    │   echo             │
│   LongMemEval      │    │   openAI           │
│   myBenchmark      │    │   (extensible)     │
└─────────┬──────────┘    └─────────┬──────────┘
          │                         │
          │ benchmark data          │ provider methods
          ▼                         ▼
┌──────────────────────────────────────┐
│ RUNNER                               │
│  • prepareProvider()                 │
│  • addContext()                      │
│  • searchQuery()                     │
│  • scoring + latency                 │
└───────────────┬──────────────────────┘
                │
                ▼
    ┌───────────────────────┐
    │ results JSON export   │
    │ (saved in /results)   │
    └─────────────┬─────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ DASHBOARD                                    │
│  • Upload JSON                               │
│  • Provider comparison charts                │
│  • Benchmark breakdown                       │
│  • Radar chart                               │
│  • Item-level explorer                       │
│  • Live provider request panel               │
└──────────────────────────────────────────────┘

TECH STACK:
• Bun runtime
• TypeScript
• Chart.js for visualization
• HTML/CSS for dashboard
• OpenAI proxy (custom CORS server)
• JSON results format (machine- & human-readable)
