## Inspiration

Coding agents are amnesiacs. Every task, they re-explore the same repository from scratch — grepping the same files, opening the same modules, rebuilding the same mental model — and they pay for it in tokens every single time. We watched a baseline agent burn 87,000 tokens and 21 tool calls just to find one file to edit in Django, then do it all over again on the next task. Redis was challenging us to use it as more than a cache, and the fit was obvious: give the agent a persistent, searchable memory of the codebase so it stops paying to relearn what it already knows.

## What it does

Stratum is a Redis-backed memory layer for coding agents, exposed over MCP. Agents write compact notes about a repo (what a file defines, where a symbol lives, which area owns a behavior) and retrieve them by semantic vector search before falling back to expensive blind exploration. On SWE-bench-style localization tasks (find the files that need editing for a given issue), the same DeepSeek agent with Stratum memory used 42% fewer total tokens while finding the same files.

We prove it with a controlled benchmark and a live dashboard:

  • Headline diff — baseline vs Stratum: total tokens, tokens-per-solved-task, file reads, search calls, cost.
  • Side-by-side trace diff — watch baseline grind through 21 searches and reads while Stratum fires one memory_search, jumps to the right file, and submits in 8 steps.
  • Live Wall & Race mode — replay every task's agents working at once, or run an agent live and stream each tool call as it happens.

Measured result: total tokens dropped 42% (618,623 → 357,968) across shared tasks, with localization accuracy held — same files found, far less spelunking.
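
The arithmetic behind the headline number, and the tokens-per-solved-task metric the dashboard reports, can be sketched as follows. The function names are ours for illustration and are not taken from the benchmark code; only the two token totals come from the measured result above.

```python
def percent_reduction(baseline: int, treated: int) -> float:
    """Percentage drop in tokens going from the baseline run to the treated run."""
    return 100.0 * (baseline - treated) / baseline


def tokens_per_solved(total_tokens: int, solved: int) -> float:
    """Tokens spent per solved task; infinite when nothing was solved."""
    return float("inf") if solved == 0 else total_tokens / solved


# Totals reported above for the shared tasks.
drop = percent_reduction(618_623, 357_968)
print(f"{drop:.1f}% fewer tokens")
```

Dividing by solved tasks rather than attempted tasks means an agent that quits early spends fewer tokens but also shrinks the denominator, so its per-success cost does not improve.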

## How we built it

  • Memory layer (memory.py) — Redis with RediSearch: an HNSW vector index over note embeddings (FT.CREATE / FT.SEARCH KNN, cosine), repo/file/symbol tags, plus a SQLite mirror for structured retrieval. A deterministic hash-embedding fallback keeps it runnable with no embedding API key.
  • MCP server (main.py, "Stratum") — post_comment, get_comment, semantic_search, comments_for_file tools that any MCP-capable agent can call.
  • Agent harness (bench/) — DeepSeek V4 Flash via OpenRouter, a baseline agent (repo tools only) and a memory agent (repo tools + Redis memory), identical prompts/limits so only memory differs. Every model call is logged for prompt/completion/reasoning tokens and real OpenRouter cost; every tool call is captured as a trace.
  • Benchmark runner — pulls real SWE-bench Django issues, scores predicted files against the gold patch, runs tasks in parallel, and emits JSONL.
  • Dashboard — FastAPI serving a compare() core shared with the CLI, plus a static SPA (Tailwind, no build step) for the diff, trace timelines, Live Wall, and SSE-streamed live runs.
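
The deterministic hash-embedding fallback mentioned for the memory layer could look roughly like the sketch below. It is not the code in memory.py: the function names, the 1024-dimension size, and the SHA-256 token hashing are all assumptions for illustration, and the brute-force cosine ranking stands in for the Redis HNSW index so the example runs without a server.

```python
import hashlib
import math
import re

DIM = 1024  # assumed size; the project's real dimensionality is not stated


def hash_embed(text: str, dim: int = DIM) -> list[float]:
    """Map text to a unit-length vector of signed, hashed token counts."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # spreads collisions
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_notes(query: str, notes: list[str], k: int = 1) -> list[str]:
    """Return the k notes most similar to the query (brute force, no index)."""
    q = hash_embed(query)
    ranked = sorted(notes, key=lambda n: cosine(q, hash_embed(n)), reverse=True)
    return ranked[:k]


notes = [
    "django/db/models/query.py defines QuerySet and its filter method",
    "django/http/response.py defines HttpResponse",
]
print(rank_notes("QuerySet filter method", notes))
```

Because the same function embeds both the stored notes and the query, the two always live in the same vector space. A hash embedding only captures token overlap, not meaning, so it is a way to keep the system runnable rather than a substitute for a real embedding model.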

## Challenges we ran into

  • The network hated git. SSH:22 was blocked and a global insteadOf rule silently rewrote every HTTPS GitHub URL back to SSH, so clones and pulls timed out. We pivoted to downloading repo snapshots as codeload tarballs.
  • Phantom "no matches." ripgrep wasn't on the Python interpreter's PATH, so the search tool silently returned nothing for every query. The agent looked broken, but the tool was at fault. We rewrote search in pure Python.
  • A reasoning model that returns None. DeepSeek V4 Flash spends tokens thinking before answering; with a low max_tokens the content came back empty. We had to account for reasoning tokens everywhere.
  • Embedding space mismatch. Seeding notes with one embedding and querying with another returns garbage. Keeping seed and query in the same space (and a consistent fallback) was subtle but critical.
  • SSE hits a wall, literally. Browsers cap HTTP/1.1 connections at about 6 per host, so a 15-task "everything live" grid stalls. We switched the Live Wall to a single bulk fetch plus client-side animation.
  • Two Python environments (miniconda vs uv) with different installed deps quietly broke imports until we pinned the server to the right one.
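
A pure-Python replacement for the ripgrep-backed search tool might look like the sketch below. It is a minimal illustration, not the harness's actual tool: the `search_repo` name, its parameters, and the `.py`-only default are assumptions.

```python
import re
import tempfile
from pathlib import Path


def search_repo(root, pattern, suffixes=(".py",), max_hits=50):
    """Return (relative_path, line_number, line) for each line matching pattern."""
    root = Path(root)
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than fail the whole search
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
                if len(hits) >= max_hits:
                    return hits  # cap output so one broad pattern cannot flood the agent
    return hits


# Tiny demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "models.py").write_text("import os\nclass QuerySet:\n    pass\n", encoding="utf-8")
    print(search_repo(tmp, r"class\s+QuerySet"))
```

With no external binary involved, an empty result means the pattern matched nothing, not that the tool was missing from PATH.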

## Accomplishments that we're proud of

  • A real, measured 42% token reduction with correctness held — not a hand-wave, a reproducible benchmark with per-task numbers and tokens-per-success so a drop can't be faked by giving up early.
  • A visualization that makes the win obvious in 10 seconds — the side-by-side trace diff and the Live Wall let anyone see memory working.
  • Clean, live-ready architecture: one trace model rendered identically whether it comes from a saved file or a live SSE stream, and one comparison core shared by the CLI and the API.
  • Using Redis as genuine AI infrastructure — vector memory, semantic retrieval, and benchmark metrics — not a key-value cache.

## What we learned

  • For agents, context is the bill — the cheapest token is the one you never spend rediscovering something. Memory beats a bigger window.
  • Redis vector search is a legitimate agent memory store, and standing it up (HNSW, tags, KNN filters) was faster than expected.
  • Benchmark methodology matters more than the model — fixing every variable except memory, and reporting tokens-per-solved-task, is what makes the claim credible.
  • A lot of "the agent is dumb" moments were actually tooling and environment bugs — instrument everything before blaming the model.

## What's next for Stratum

  • Token-budgeted context packing (memory_pack) — return the most useful memories per token under a budget, not just top-k.
  • Staleness detection — diff against git and flag memories whose line ranges changed.
  • The memory flywheel — agents writing back what they learned so later tasks get cheaper automatically, and measuring the compounding savings.
  • Beyond Django and beyond localization — more repos, real SWE-bench patch-and-test mode, and a true live-batch view of an entire run streaming at once.
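
Since memory_pack does not exist yet, the following is only one way the token-budgeted packing idea could work: a greedy pass that favors relevance per token. The `Memory` fields, the score scale, and the greedy strategy itself are assumptions, and a greedy pass is a heuristic rather than an optimal knapsack solution.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    score: float  # relevance from search, higher is better (assumed scale)
    tokens: int   # estimated token cost of including this note


def memory_pack(memories: list[Memory], budget: int) -> list[Memory]:
    """Greedily pick the notes with the best score per token that fit the budget."""
    ranked = sorted(memories, key=lambda m: m.score / max(m.tokens, 1), reverse=True)
    picked, used = [], 0
    for memory in ranked:
        if used + memory.tokens <= budget:
            picked.append(memory)
            used += memory.tokens
    return picked


# Hypothetical candidates: the highest-scoring note is too large for the budget.
candidates = [
    Memory("long walkthrough of the ORM query pipeline", score=0.9, tokens=300),
    Memory("QuerySet.filter lives in django/db/models/query.py", score=0.6, tokens=50),
    Memory("lookups are registered in django/db/models/lookups.py", score=0.5, tokens=60),
]
for memory in memory_pack(candidates, budget=120):
    print(memory.tokens, memory.text)
```

In this example plain top-k by score would return the 300-token note first, which alone exceeds a 120-token budget, while the packed result fits two shorter notes.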
