Inspiration
1 in 9 people over 65 has Alzheimer's disease, and misplacing objects is the number one early symptom reported by patients and caregivers. We kept imagining what it would feel like to live in a world where, as far as you can remember, nothing stays where you left it. Existing tools are too complex, too expensive, or simply do not address the core question people with memory conditions ask dozens of times a day: "Where did I put that?" We wanted to build something simple enough for an 80-year-old to use, powerful enough to actually help, and affordable enough to run on hardware people already own.
What it does
Claudio is a real-time visual memory assistant that continuously watches a user's environment through a phone camera, builds a searchable memory of what it sees, and answers natural-language questions about their surroundings, past and present. Ask "Hey Claudio, where are my keys?" and Claudio responds: "I last saw your keys on the kitchen counter about 12 minutes ago" and shows you the frame. It runs as a transparent overlay on a live video feed with voice input and voice output, so users never need to look down at a keyboard.
How we built it
We split the pipeline into two phases to keep costs and latency down. At capture time, a phone camera sends a frame every 2 seconds via HTTP POST to a FastAPI backend. Each frame is embedded locally using CLIP (clip-ViT-B-32) into a 512-dimensional vector, deduplicated against recent frames, and stored in Supabase with pgvector. No external API is called at capture time, which keeps that loop fast and cheap. At query time, the user's question is embedded with CLIP's text encoder, matched against the stored image embeddings via cosine similarity, and the top matching frames are sent to Claude Vision, which synthesizes a natural language answer. The frontend is React, with the Web Speech API handling voice input and output.
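For illustration, the capture path might look roughly like the sketch below, using the sentence-transformers CLIP checkpoint named above; the dedup threshold and the `store_frame` helper are illustrative stand-ins, not our tuned values or real Supabase schema.

```python
# A minimal sketch of the capture path. The threshold and store_frame are
# placeholders, not the values and schema we actually shipped.
import io

import numpy as np
from fastapi import FastAPI, UploadFile
from PIL import Image
from sentence_transformers import SentenceTransformer

app = FastAPI()
clip = SentenceTransformer("clip-ViT-B-32")  # 512-dim image/text embeddings, runs locally

last_embedding = None        # embedding of the most recently stored frame
DEDUP_THRESHOLD = 0.95       # illustrative cutoff: higher means "same scene, skip it"


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def store_frame(image: Image.Image, embedding: list[float]) -> None:
    """Placeholder for the real insert: image bytes to storage, vector to pgvector."""
    ...


@app.post("/frames")
async def ingest_frame(frame: UploadFile):
    global last_embedding
    image = Image.open(io.BytesIO(await frame.read())).convert("RGB")
    embedding = clip.encode(image)  # local CLIP image encoder, no external API call

    # Drop near-duplicates of the last stored frame so a static room
    # doesn't flood the database with identical rows.
    if last_embedding is not None and cosine(embedding, last_embedding) >= DEDUP_THRESHOLD:
        return {"stored": False}

    last_embedding = embedding
    store_frame(image, embedding.tolist())
    return {"stored": True}
```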
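The query path, sketched under similar assumptions: the SQL, the `run_top_k` helper, and the Claude model name are illustrative, not our exact code.

```python
# A minimal sketch of the query path, assuming the same CLIP model, a pgvector
# cosine-distance search written as raw SQL (table and column names are
# illustrative), and the Anthropic Messages API.
import base64

import anthropic
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# pgvector's "<=>" operator is cosine distance, so ascending order returns
# the closest frames first.
TOP_K_SQL = """
    SELECT image_path, captured_at
    FROM frames
    ORDER BY embedding <=> %(query_vec)s::vector
    LIMIT 3;
"""


def run_top_k(sql: str, query_vec: list[float]) -> list[dict]:
    """Placeholder for the real Supabase/Postgres call."""
    ...


def ask_claudio(question: str) -> str:
    query_vec = clip.encode(question).tolist()   # CLIP text encoder
    frames = run_top_k(TOP_K_SQL, query_vec)

    # Interleave each frame's timestamp with its image so Claude can say
    # *when* it last saw the object, not just where.
    content = []
    for frame in frames:
        content.append({"type": "text", "text": f"Frame captured at {frame['captured_at']}:"})
        with open(frame["image_path"], "rb") as f:
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64.b64encode(f.read()).decode(),
                },
            })
    content.append({
        "type": "text",
        "text": f"{question} Answer in one short, plain-language sentence and say when you last saw it.",
    })

    response = claude.messages.create(
        model="claude-3-5-sonnet-latest",   # example model alias
        max_tokens=300,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text
```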
Challenges we ran into
Getting deduplication right was harder than expected. CLIP embeddings for near-identical frames do not always sit above a clean threshold, so we tuned the similarity cutoff to avoid both flooding the database and dropping genuinely changed scenes. Latency at query time was another challenge: pulling multiple frames and analyzing them with Claude still had to feel snappy for a user who may already be anxious. We also had to think carefully about failure modes: what does Claudio say when it has not seen the object recently, and how do we communicate uncertainty without creating more confusion?
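A minimal sketch of how that fallback could be phrased, with illustrative thresholds (not the ones we actually tuned) and a hypothetical `best_match` record carrying the top frame's similarity score and capture time:

```python
# Illustrative fallback logic: refuse to guess on weak matches and hedge on
# old sightings instead of answering with false confidence.
from datetime import datetime, timezone

MIN_SIMILARITY = 0.25     # below this we don't trust the match at all
STALE_AFTER_HOURS = 6     # older sightings get an explicit hedge


def phrase_answer(best_match: dict | None, answer: str) -> str:
    if best_match is None or best_match["similarity"] < MIN_SIMILARITY:
        return "I haven't seen that recently. I'll keep watching for it."

    age_h = (datetime.now(timezone.utc) - best_match["captured_at"]).total_seconds() / 3600
    if age_h > STALE_AFTER_HOURS:
        return f"I'm not certain, but the last time I saw it was about {int(age_h)} hours ago. {answer}"
    return answer
```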
Accomplishments that we're proud of
We built a fully working end-to-end system in 12 hours on hardware anyone already owns. The voice pipeline feels natural. The CLIP-based search works well on real household objects without any fine-tuning. Most importantly, we designed the UI from the ground up around the research: unobtrusive, high-contrast, voice-first, and transparent, because the literature on assistive tech for dementia is clear that intrusive interfaces get abandoned.
What we learned
CLIP is surprisingly powerful as a zero-shot visual search engine for everyday household objects. Separating the cheap always-on capture pipeline from the expensive on-demand query pipeline is the right architecture for cost and latency. We also learned that designing for cognitive accessibility forces clarity in ways that improve the experience for everyone.
What's next for Claudio
- People memory: recognizing faces and associating them with names and relationships.
- A caregiver dashboard with activity alerts.
- Adaptive capture rates that increase when the scene changes and slow down when it is static (sketched below).
- A daily narrative summary, so a user can ask "what happened today?" and hear a plain-language recap of their day.
- And eventually, smart glasses integration to replace the phone camera with something truly hands-free.
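A rough sketch of what the adaptive capture logic could look like, reusing the frame-to-frame CLIP similarity we already compute for deduplication; all constants are illustrative.

```python
# Illustrative adaptive capture interval: speed up while the scene is changing,
# back off when it is static.
MIN_INTERVAL_S = 1.0    # capture faster while the scene is changing
MAX_INTERVAL_S = 10.0   # back off when nothing is happening


def next_capture_interval(current_s: float, similarity_to_last_frame: float) -> float:
    if similarity_to_last_frame < 0.90:              # scene changed
        return max(MIN_INTERVAL_S, current_s / 2)
    return min(MAX_INTERVAL_S, current_s * 1.5)      # static scene
```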
Built With
- clip
- react
- vite
- fastapi
- supabase
- pgvector
- claude
- web-speech-api



