Vision Cortex

Inspiration

Every AI tool you use starts from zero. You explain your project to ChatGPT, then re-explain it to Claude, then again to whatever agent you spin up next week. Meanwhile your actual context (the decisions buried in Slack threads, the half-finished docs in Notion, the conversation you had walking to lunch) lives scattered across a dozen apps that don't talk to each other. And it's about to get worse: wearables like Ray-Ban Meta glasses and Omi are generating a continuous stream of personal context that nothing captures in a usable way.

Two things made now the right moment. MCP gave us a real interoperability substrate: a way for any AI to read from a shared memory layer instead of every vendor building a walled garden. And the wearables ecosystem finally opened up enough, through Meta's Device Access Toolkit and Omi's ambient audio, that we could actually pull from it. So we built the missing piece: a persistent memory layer that sits underneath all of it.

What it does

Cortex is a persistent, cross-agent memory and context layer. It ingests from your wearables, including Ray-Ban Meta glasses and Omi, and from your AI agents and productivity tools, including Claude, ChatGPT, OpenClaw, Claude Code, Codex, Slack, email, Notion, and Google Docs. It processes everything into a unified knowledge graph and exposes that context through an MCP server, so any AI you use can read your full personal context without you re-explaining yourself.

Deepgram powers the real-time voice layer for the wearable experience. We used Deepgram to transcribe audio input from the Meta glasses, turning live conversations and ambient speech into structured text that Cortex could filter, process, and store as memory. We also used Deepgram TTS for the output path, so Cortex could respond back through the Meta glasses as spoken audio instead of only returning text.

The Token Company API sits early in the ingestion pipeline as a cleanup and normalization layer. Every input source has noise: Slack has messy threads, docs have unfinished fragments, agent sessions have repeated prompts, and natural conversations from the Meta glasses have the most fluff of all. The Token Company API helped clean those raw inputs before memory extraction, reducing filler, repetition, and irrelevant conversational noise so the graph receives the actual signal instead of a transcript landfill.

The storage is deliberately dual-layer: a local markdown folder that doubles as an Obsidian vault and a git repo, alongside Redis vector embeddings for fast semantic retrieval. The local layer keeps the system human-readable, user-owned, and diffable, while Redis makes the memory layer fast enough for real agent use. The MCP server is the part that matters most: it turns a personal knowledge base into something every agent in your stack can actually use.
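A minimal sketch of the dual-layer idea, under loud assumptions: the markdown folder is treated as the source of truth and the vector index as a derived cache that can be rebuilt from it. The names `DualStore`, `toy_embed`, and `cosine` are hypothetical, the in-memory dict stands in for the Redis vector index, and the bag-of-words embedding stands in for a real embedding model. It illustrates the shape of the design, not Cortex's actual code.

```python
from __future__ import annotations

import math
from collections import Counter
from pathlib import Path


def toy_embed(text: str) -> dict[str, float]:
    """Assumption: stand-in for a real embedding model (L2-normalized bag of words)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two normalized sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class DualStore:
    """Markdown files are the human-readable layer; the index is derived from them."""

    def __init__(self, vault: Path) -> None:
        self.vault = vault
        self.vault.mkdir(parents=True, exist_ok=True)
        # Assumption: an in-memory dict standing in for the Redis vector index.
        self.index: dict[str, dict[str, float]] = {}

    def write(self, slug: str, body: str) -> Path:
        """Write the note to disk first, then update the derived index."""
        path = self.vault / f"{slug}.md"
        path.write_text(body, encoding="utf-8")
        self.index[slug] = toy_embed(body)
        return path

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the slugs of the k notes most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(self.index, key=lambda slug: -cosine(q, self.index[slug]))
        return ranked[:k]

    def rebuild_index(self) -> None:
        """Recreate the index from the markdown folder alone."""
        self.index = {
            path.stem: toy_embed(path.read_text(encoding="utf-8"))
            for path in self.vault.glob("*.md")
        }
```

Because the index can always be rebuilt from the folder, losing the fast layer costs a re-index rather than any memory, which is what keeps the markdown layer the user-owned one.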

How we built it

The architecture runs in five phases: ingest → clean → process → store → expose.

The processing pipeline is the heart of it, with seven stages: normalize each source into a common text envelope, clean the input using The Token Company API, segment and filter, extract into the dual store, resolve entities against the existing graph, reconcile against what we already know, then persist and index. The core principle we kept coming back to: aggregation has to happen at write time, with the existing graph in context. If you defer entity resolution to query time, the same person or project fragments into five different nodes and the whole graph rots.
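To make the write-time principle concrete, here is a small sketch of resolving a mention against the existing graph before anything is persisted. Everything in it is an assumption for illustration: the `Graph` and `Entity` classes, the alias-matching heuristic, and the example names are hypothetical, and a real resolver would likely use embeddings or a model call rather than string matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def normalize_name(name: str) -> str:
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


@dataclass
class Entity:
    canonical: str
    aliases: set[str] = field(default_factory=set)
    facts: list[str] = field(default_factory=list)


class Graph:
    def __init__(self) -> None:
        self.entities: list[Entity] = []

    def resolve(self, mention: str) -> Entity | None:
        """Look the mention up in the existing graph before creating anything."""
        key = normalize_name(mention)
        for entity in self.entities:
            if key in entity.aliases:
                return entity
        # Hypothetical heuristic: a one-word mention matches an entity only if
        # exactly one known entity has that word in one of its aliases.
        candidates = [
            e for e in self.entities
            if any(key in alias.split() for alias in e.aliases)
        ]
        return candidates[0] if len(candidates) == 1 else None

    def upsert(self, mention: str, fact: str) -> Entity:
        """Write path: merge into an existing node when one resolves."""
        entity = self.resolve(mention)
        if entity is None:
            entity = Entity(canonical=mention)
            self.entities.append(entity)
        entity.aliases.add(normalize_name(mention))
        if fact not in entity.facts:
            entity.facts.append(fact)
        return entity
```

With this write path, a hypothetical "Priya Shah" from a Slack thread and a bare "priya" from a transcript land on one node with two facts, instead of two nodes that a query would later have to stitch together.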

For the wearable streams, we used a three-tier model: stream → working set → graph. Raw perceptual data never enters the knowledge graph directly. Audio from the Meta glasses first goes through Deepgram speech-to-text, which gives us the live transcript stream. That transcript is then cleaned through The Token Company API to strip out the natural fluff of spoken conversation, including filler words, repeated phrases, rambling, and low-value fragments. After that, Anthropic Haiku powers the Meta glasses OpenClaw agent, helping decide what is relevant enough to keep, summarize, or respond to.

That cleaned transcript becomes one of the highest-value salience signals because spoken phrases like “this is important,” “remind me,” “we decided,” or references to “this” and “that” can tell the system which visual or conversational moments are worth keeping. On the output side, Deepgram TTS turns Cortex responses back into speech, making the wearable interaction loop feel natural instead of forcing the user to look at a screen.

Continuous video and audio get cascade-filtered with a simple rule: keep what's surprising, drop what's predictable. Audio deixis from the Deepgram transcript acts as the cheapest high-value signal for deciding whether a visual frame or moment should be promoted from raw stream into working memory. The Token Company API improves that signal by cleaning the transcript before it reaches the extraction layer, so Haiku and the memory pipeline are reasoning over meaning instead of noise. Only after filtering and consolidation does anything become structured memory.
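The stream → working set → graph cascade can be sketched as two gates of increasing strictness. This is an illustrative sketch only: the cue phrases come from the examples above, but the scoring scheme, the `WorkingSet` class, and the regex gate are assumptions standing in for the model-based relevance filter.

```python
from __future__ import annotations

import re
from collections import deque
from dataclasses import dataclass

# Explicit cues taken from the examples in the text; the scoring is hypothetical.
EXPLICIT_CUES = re.compile(
    r"\b(this is important|remind me|we decided)\b", re.IGNORECASE
)
DEIXIS = re.compile(r"\b(this|that)\b", re.IGNORECASE)


@dataclass
class Utterance:
    t: float   # seconds since the stream started
    text: str  # cleaned transcript segment


def salience(text: str) -> int:
    """Cheap gate: 2 for an explicit cue, 1 for bare deixis, 0 otherwise."""
    if EXPLICIT_CUES.search(text):
        return 2
    if DEIXIS.search(text):
        return 1
    return 0


class WorkingSet:
    """Bounded buffer between the raw stream and the graph."""

    def __init__(self, capacity: int = 50) -> None:
        self.items: deque[Utterance] = deque(maxlen=capacity)

    def offer(self, utterance: Utterance) -> bool:
        """High-recall early gate: keep anything with any salience signal."""
        if salience(utterance.text) == 0:
            return False
        self.items.append(utterance)
        return True

    def consolidate(self) -> list[Utterance]:
        """High-precision late gate: only explicit cues are promoted."""
        promoted = [u for u in self.items if salience(u.text) == 2]
        self.items.clear()
        return promoted
```

The raw stream never touches the graph here: an utterance either fails the first gate and is dropped, waits in the bounded working set, or is promoted at consolidation time.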

Hackathon pragmatics shaped a lot of the build: Anthropic Haiku as the lightweight model for the Meta glasses OpenClaw agent and relevance gate, The Token Company API for cleaning noisy raw input across sources, Deepgram for live speech input and spoken output, Voyage AI for embeddings, Redis for fast vector retrieval, and a vertical slice end-to-end before widening to more sources.

Challenges we ran into

Entity fragmentation was the big one, and the reason write-time aggregation became non-negotiable. Continuous sensor data was the second: you cannot dump a video feed or raw audio stream into a knowledge graph, so we had to design the filtering and consolidation bridges carefully, balancing high-recall early gates against high-precision late consolidation.

The wearable voice loop also had its own constraints. Real-time transcription needs to be fast enough to feel ambient, accurate enough to preserve meaning, and structured enough to become useful memory. Deepgram helped us bridge that gap by giving us low-latency speech-to-text for input and TTS for audio output through the glasses.

The next problem was input quality. Human conversation is messy. People ramble, repeat themselves, trail off, use filler words, point at things and say “that,” and change topics mid-sentence. That is fine for humans, but brutal for a memory graph. The Token Company API helped us clean that input before extraction so Cortex could preserve the decision, commitment, fact, or context without storing every bit of conversational junk around it.

Hardware availability for the demo was a real constraint, which pushed us toward pre-loading exactly the data our demo story needed rather than ingesting everything live. And underneath it all, the 24-hour clock forced us to be honest about what was a slice and what was scope creep.

We also sat with the harder, non-technical problems: the trust paradox of a startup asking to hold continuous personal data, and whether this is a product or a feature.

Accomplishments that we're proud of

A working end-to-end vertical slice: real data in one side, structured memory out the other, queryable by an external agent through the MCP server. We also built a real wearable interaction loop: Deepgram transcribes audio from the Meta glasses into usable context, The Token Company API cleans the noisy transcript, Anthropic Haiku powers the Meta glasses OpenClaw agent’s reasoning layer, Cortex processes and retrieves relevant memory, and Deepgram TTS can speak the response back through the glasses.

We're proud of getting the dual store to actually behave as one coherent layer. We're also proud of the architectural discipline we held to: separating reference knowledge, the encyclopedia layer, from active state, including commitments, open questions, and deadlines. That distinction is what makes Cortex useful for day-to-day task assistance and not just a search index over your life.
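One way to picture the reference/active split is as two separate record types, where only active items carry state such as a due date or completion flag. The sketch below is an assumption about how that could look; `ReferenceNote`, `ActiveItem`, `ActiveKind`, and `agenda` are hypothetical names, not Cortex's schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from enum import Enum


class ActiveKind(Enum):
    COMMITMENT = "commitment"
    OPEN_QUESTION = "open_question"
    DEADLINE = "deadline"


@dataclass(frozen=True)
class ReferenceNote:
    """Encyclopedia layer: stable facts with no lifecycle."""
    topic: str
    body: str


@dataclass
class ActiveItem:
    """Active state: something that can be open, due, or done."""
    kind: ActiveKind
    text: str
    due: date | None = None
    done: bool = False


def agenda(items: list[ActiveItem], today: date) -> list[ActiveItem]:
    """Open items only, dated ones first in due order, undated ones last."""
    open_items = [item for item in items if not item.done]
    return sorted(open_items, key=lambda i: (i.due is None, i.due or today))
```

A query like "what do I owe people this week" then only has to scan active items, while reference notes stay available as background without cluttering the answer.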

What we learned

MCP is the real differentiator. The storage mechanism is replaceable; the exposure layer is the value.
Voice is the natural interface for wearable memory. Deepgram gave us both sides of that loop: speech-to-text for capturing context and TTS for responding back through the glasses.
Cleaning input is not optional. The Token Company API helped turn noisy agent sessions, app data, and especially messy Meta glasses conversations into cleaner memory candidates.
Small models are enough when the pipeline is designed well. Anthropic Haiku gave us a fast, lightweight reasoning layer for the Meta glasses OpenClaw agent and relevance filtering.
Aggregate at write time, never at query time. Entity resolution with graph context during ingestion is what prevents fragmentation.
Raw perceptual data never enters the graph. Streams and structured memory have to stay architecturally separate.
Reference knowledge vs. active state is the distinction that unlocks actual task help.
Cheap gates buy you a lot. A Haiku relevance filter and a single combined extract-and-resolve call gave us most of the quality at a fraction of the cost and latency.

What's next for Cortex

Widen ingestion well beyond the demo slice, then harden the wearable integration as Meta's Device Access Toolkit reaches broader availability later in 2026. We also want to make the voice loop more proactive: Deepgram listens and transcribes when useful, The Token Company API cleans the transcript, Haiku decides whether the moment matters, Cortex updates memory, and the glasses only speak back when the response is actually valuable.

Beyond features, the real work is the trust model: credibly answering why someone should let Cortex hold continuous personal context, and finding a monetization path that doesn't undermine the "you own your data" promise the local markdown layer is built on.