Remi — a second memory for the people in your life

Look at someone and ask "who is this?" Remi recognizes their face, pulls up your real history with them from your own texts and photos, and tells you who they are to you in a warm voice. Every fact it says links back to the message or photo it came from.

Berkeley AI Hackathon 2026. Three of us, about 18 hours, built top to bottom with Claude Code.

Inspiration

The worst part of dementia isn't forgetting facts. It's looking at your own daughter and not knowing her name. More than 55 million people live with dementia, and a notebook or a photo album can't do anything for you in the moment a face shows up and your mind goes blank.

You don't need a diagnosis to know the feeling, though. We blank on names at reunions, on how we met an old friend, on the trip we swore we'd never forget. We wanted something that could look at whoever is in front of you and quietly hand you the right memory, while keeping you the one doing the remembering.

We set one rule early: Remi is never allowed to make something up. A wrong memory isn't just useless, it can be hurtful. If it tells you "this is Abhay, you've known him ten years," that has to be true, and you have to be able to check it on the spot. The rest of the project was built around that one constraint.

What it does

  1. A camera (we use an iPhone as a Continuity Camera) sees a face, Remi recognizes it, and a 3D brain dives into its neurons and grows a new connection toward that person.
  2. You ask "who is this?" Remi picks the relevant moments out of your history, writes a short two- or three-fact reminder, checks each fact against a real source, drops anything it can't back up, and speaks it out loud.
  3. The part we're proudest of: a judge can pick any fact Remi just said, click it, and the actual iMessage thread or photo opens. A chatbot can't survive that kind of random audit. Remi can.
  4. You can keep talking to it too. "When did we first meet?" "Where did we go in San Diego?" An agent figures out the right way to search your history and answers with the sources attached.

How we built it

Data moves in a fairly straight line from your phone to a spoken memory:

iMessage + Apple Photos + Contacts  ──ingest──▶  per-person memory bundle
        │
   MiniLM embeddings (384-d)  ──▶  Redis Stack (HNSW vector index)
        │
   retrieval toolbox (semantic · date · location · keyword)
        │
   Grok Recall Agent (select → compose → ground)  ──▶  Deepgram voice
        │
   FastAPI live loop  ──WebSocket──▶  3D-brain web UI (React + three.js)

The recall brain is xAI Grok (grok-4.3) over its OpenAI-compatible API, with strict JSON schemas. We don't let it act like a summarizer. It has to tie every fact to a real source or throw the fact out.

Face recognition is InsightFace (buffalo_l), and we only use it to trigger the lookup, nothing else. Each person gets one prototype embedding $\mathbf{p} \in \mathbb{R}^{512}$, built by averaging their anchor face with a handful of their real photos and renormalizing:

$$\mathbf{p} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{e}_i,\qquad \hat{\mathbf{p}} = \frac{\mathbf{p}}{\lVert\mathbf{p}\rVert}$$

A live face $\mathbf{f}$ counts as a match when the cosine similarity is over a threshold:

$$\mathrm{sim}(\mathbf{f}, \hat{\mathbf{p}}) = \frac{\mathbf{f}\cdot\hat{\mathbf{p}}}{\lVert\mathbf{f}\rVert\,\lVert\hat{\mathbf{p}}\rVert} > \tau,\qquad \tau = 0.40$$
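A minimal sketch of the two formulas above, in plain Python with no dependencies. The function names are illustrative and not taken from the Remi codebase; the only value that comes from the write-up is the threshold $\tau = 0.40$. Real InsightFace embeddings are 512-dimensional, but the arithmetic is the same at any length.

```python
import math
from typing import List, Sequence

TAU = 0.40  # match threshold stated in the write-up


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def build_prototype(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average n embeddings component-wise, then rescale to unit length."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    length = norm(mean)
    return [x / length for x in mean]


def cosine_similarity(f: Sequence[float], p_hat: Sequence[float]) -> float:
    """Dot product divided by the product of the two lengths."""
    dot = sum(a * b for a, b in zip(f, p_hat))
    return dot / (norm(f) * norm(p_hat))


def is_match(f: Sequence[float], p_hat: Sequence[float], tau: float = TAU) -> bool:
    """A live face matches when its similarity to the prototype exceeds tau."""
    return cosine_similarity(f, p_hat) > tau
```

Because the prototype is renormalized once at enrollment, the check at recognition time reduces to one dot product and one norm per live face.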

For memory, we split each thread into sessions, embed them with all-MiniLM-L6-v2 ($\mathbb{R}^{384}$), and store them in Redis Stack behind an HNSW index so we can do per-person k-NN recall. All of it lives behind a single recall_context() function that falls back to a plain slice of the data if Redis goes down.
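The fallback behavior described above could look roughly like this. Only the name recall_context() comes from the write-up; the signature, the `vector_search` callable (standing in for the Redis k-NN query), and the `bundle` argument (the person's sessions already loaded in memory) are assumptions made for the sketch.

```python
from typing import Callable, List, Sequence

# Hypothetical type: (person_id, query, k) -> top-k session texts
VectorSearch = Callable[[str, str, int], Sequence[str]]


def recall_context(
    person_id: str,
    query: str,
    vector_search: VectorSearch,
    bundle: Sequence[str],
    k: int = 5,
) -> List[str]:
    """Return up to k sessions for a person, preferring the vector index."""
    try:
        return list(vector_search(person_id, query, k))
    except Exception:
        # Assumption: any failure of the index is treated as "Redis is down".
        # A real version would catch the client's connection errors only.
        return list(bundle[:k])
```

Callers never see which path ran, which is the point of hiding both behind one function: the rest of the pipeline gets a list of sessions either way.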

The follow-up Q&A is a bounded tool-calling loop. Grok decides which lens to use (semantic search, a date range, a location search that widens to the general area so "San Diego" also catches La Jolla, or a keyword search), we run them, and it writes an answer that can cite more than one source.
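A bounded tool-calling loop of the kind described above can be sketched as follows. This is not the Remi implementation: `decide` stands in for the Grok call, the step dictionaries are a made-up format, and the round cap of 3 is an assumed value. Returning no answer when the budget runs out is also an assumption, chosen to match the rule that Remi never makes something up.

```python
from typing import Any, Callable, Dict, List

MAX_ROUNDS = 3  # assumed cap; the write-up does not state the real number

Step = Dict[str, Any]
Tool = Callable[..., List[Dict[str, Any]]]


def answer_question(
    question: str,
    decide: Callable[[str, List[Dict[str, Any]]], Step],
    tools: Dict[str, Tool],
    max_rounds: int = MAX_ROUNDS,
) -> Dict[str, Any]:
    """Let the model pick search tools for at most max_rounds, then stop."""
    evidence: List[Dict[str, Any]] = []
    for _ in range(max_rounds):
        step = decide(question, evidence)
        if step["action"] == "answer":
            # The model is done; pass through its text and cited sources.
            return {"answer": step["text"], "sources": step["sources"]}
        # Otherwise run the requested lens and add its hits to the evidence.
        search = tools[step["tool"]]
        evidence.extend(search(**step["args"]))
    # Budget exhausted without an answer: say nothing rather than guess.
    return {"answer": None, "sources": []}
```

The cap is what makes the loop safe to run live: the worst case is a fixed number of model calls, not an open-ended search.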

Voice is Deepgram Aura-2, synthesized live, with a fast fail-over to a pre-baked clip if the call is slow. The UI is a three.js brain in React. Neurons light up, a synapse forms when it recognizes someone, the photos bloom into a montage, and the facts type in underneath.

What kept us sane was a rule we called live-primary, baked-fallback. The demo runs live, but a baked version of every reel is one keypress away if a cloud call dies on stage.
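The live-primary, baked-fallback rule can be sketched with a timeout around the live call. The helper name `live_or_baked` and the thread-pool approach are assumptions for illustration; the write-up does not say how the fail-over is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared pool so a slow live call does not block the caller.
_pool = ThreadPoolExecutor(max_workers=2)


def live_or_baked(live_call: Callable[[], T], baked: T, timeout_s: float) -> T:
    """Try the live call; hand back the pre-baked result if it is slow or fails."""
    future = _pool.submit(live_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a call that is
        # already running is left to finish in the background and is ignored.
        future.cancel()
        return baked
    except Exception:
        # The live call raised (network error, bad response, and so on).
        return baked
```

For voice, `live_call` would wrap the text-to-speech request and `baked` would be the pre-rendered clip, so the audience hears something within the timeout either way.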

Challenges we ran into

The face that wasn't. For a while the recognizer just would not identify one of us. There was no error; it just kept coming back "unknown." We dug in and the problem turned out to be the reference photo. We'd auto-picked an anchor from a group shot, and the code grabbed the biggest face in it, which happened to be the wrong person. The number gave it away:

$$\mathrm{sim}(\text{my live face}, \text{reference}) = 0.043$$

That's about what you'd get from two strangers. The reference was a completely different human. We fixed it by enrolling straight from the live camera instead of trusting an old photo, and the score jumped to roughly 0.74, well over the threshold.

Getting a model to refuse. Making an LLM talk is easy. Making it stay quiet about things it can't prove is the hard part. We built a source-resolver that drops any fact whose citation doesn't actually resolve, so the blind audit holds up live and not just on the safe pre-baked path.
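The core of a source-resolver like the one described above is a filter. In this sketch the `Fact` shape and the `known_sources` lookup are hypothetical; the real resolver presumably checks citations against the ingested messages and photos rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Optional


@dataclass(frozen=True)
class Fact:
    text: str
    source_id: Optional[str]  # id of the message or photo the fact cites


def ground(facts: Iterable[Fact], known_sources: Mapping[str, object]) -> List[Fact]:
    """Keep only facts whose citation resolves to a source we actually hold."""
    return [
        fact
        for fact in facts
        if fact.source_id is not None and fact.source_id in known_sources
    ]
```

A fact with a missing or invented citation is dropped silently, so whatever survives can be clicked through to a real message or photo.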

The ten-second window. Running recall live means a real round trip to the cloud while everyone is watching. We had to fit the tool-calling, the writing, and the voice into about $t \lesssim 10\text{s}$. That took a smaller, faster model for picking tools, a cap on how many rounds it could take, tight timeouts, and warming everything up before the demo started.

What we learned

  • Grounding mattered more than how good the answer sounded. A smooth, confident, wrong memory is the worst thing this could produce, so saying less but backing all of it up is what made it trustworthy.
  • Reliability had to be a priority, not something we bolted on at the end. Anything on stage that could behave unpredictably needed a deterministic backup behind it.
  • Real data is messy. HEIC photos, weird attributedBody blobs in chat.db, group shots, timestamps that don't agree. Most of our time went into making the real data usable, not into the model.
  • Build for the live demo from the first hour. Camera angle, mic permissions, browser security quirks, timeouts. On stage those aren't edge cases, they're the demo.

What's next

A second memory shouldn't only answer when you ask. The next step is for it to run in the background, take notes, and remind you before you think to ask ("you told Abhay you'd call Sunday"). We'd also like to move the recall onto on-device models like Gemma to keep more of it private. The form factor goes from the phone you already carry to, eventually, glasses.

Built with

xAI Grok (grok-4.3), Deepgram Aura-2, InsightFace (buffalo_l), sentence-transformers (all-MiniLM-L6-v2), Redis Stack (HNSW vector search), FastAPI, React + Vite + TypeScript + Tailwind, three.js, OpenCV, osxphotos / Contacts / iMessage, iPhone Continuity Camera, Sentry, Python 3.13 + uv, and Claude Code top to bottom.
