Inspiration
We've spent the last decade making the digital world feel more real: better screens, faster phones, smarter AI. Sentient asks the opposite question: what if the physical world could talk back? The idea came from thinking about objects that carry real emotional weight: a book gifted by a grandmother, a mug a dementia patient has used every morning for 30 years, a guitar I carried across the world. Those stories exist; there's just no infrastructure to surface them. I wanted to build that.
What it does
Sentient gives any physical object a persistent AI identity: a name, a personality, a unique voice, and memory that grows with every conversation. Point your camera at an object, click it, and have a real-time voice conversation with it. Each object is its own independent agent, and it remembers you across sessions: come back tomorrow and it still knows you.
How we built it
- YOLO for real-time object detection on the live camera feed
- Google Vision API for richer, more precise object labeling beyond YOLO's classes
- FastAPI Python backend handling detection, routing, and agent orchestration
- MongoDB for one canonical record per object, ensuring the same physical object always maps to the same agent
- Backboard, so each object gets its own persistent AI agent with independent memory
- ElevenLabs for a unique voice per object, with streamed TTS
- WebSockets for fully real-time conversation with no page reloads
- React + Vite for the live camera feed with canvas overlay, bounding boxes, and a conversation drawer
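To make the flow above concrete, here is a minimal sketch of how the stages could chain. All names and return values are illustrative stubs standing in for the real YOLO, Google Vision, Backboard, and ElevenLabs calls, not the actual Sentient code:

```python
import asyncio

# Stub coroutines standing in for the real services. In the real pipeline
# each would be a network or model call; here they just return fixed data.
async def detect_objects(frame: bytes) -> list[str]:
    await asyncio.sleep(0)  # YOLO inference would happen here
    return ["mug"]

async def enrich_label(label: str) -> str:
    await asyncio.sleep(0)  # Google Vision API call for a finer label
    return f"{label}:ceramic"

async def agent_reply(object_key: str, text: str) -> str:
    await asyncio.sleep(0)  # Backboard agent with per-object memory
    return f"[{object_key}] I remember you."

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0)  # ElevenLabs streamed TTS
    return text.encode()

async def handle_frame(frame: bytes, user_text: str) -> bytes:
    """One camera frame in, one audio reply out."""
    labels = await detect_objects(frame)
    enriched = await enrich_label(labels[0])
    reply = await agent_reply(enriched, user_text)
    return await synthesize(reply)

audio = asyncio.run(handle_frame(b"...", "hello"))
```

In the real app this chain runs inside a WebSocket handler so the browser gets streamed results instead of waiting for the whole pipeline.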
Challenges we ran into
- Object identity across sessions: YOLO gives you a label, not an identity. Getting the same mug to map to the same agent every time required a compound-label strategy with MongoDB as the source of truth.
- Backboard free tier: LLM chat requires paid credits; only memory/RAG is free. I had to design around this mid-build.
- Real-time latency: chaining YOLO detection → Vision API → Backboard → ElevenLabs TTS in under a few seconds required careful async orchestration and WebSocket streaming.
- Depth perception: making the system detect objects at different depths, not just the closest thing in frame, required MiDaS depth estimation on top of detection.
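The compound-label idea can be sketched as follows. This is an illustrative version, not the actual Sentient code: the YOLO class and Vision API labels are normalized into one deterministic key, and MongoDB upserts on that key so the same physical object always resolves to the same agent record:

```python
import hashlib

def compound_label(yolo_class: str, vision_labels: list[str]) -> str:
    """Build a stable identity key from detector outputs.

    Case and label order must not change the key, otherwise the same
    mug would spawn a new agent every session.
    """
    parts = [yolo_class.lower()] + sorted(l.lower() for l in vision_labels)
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

def resolve_agent(db, yolo_class: str, vision_labels: list[str]) -> dict:
    """Upsert one canonical record per object (needs pymongo + MongoDB)."""
    from pymongo import ReturnDocument  # imported here so the sketch runs without a DB
    key = compound_label(yolo_class, vision_labels)
    return db.objects.find_one_and_update(
        {"_id": key},
        {"$setOnInsert": {"yolo_class": yolo_class, "labels": vision_labels}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
```

The `$setOnInsert` operator only writes fields when the document is created, so repeat sightings of the same object read the existing record instead of overwriting it.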
Accomplishments that we're proud of
- Each object genuinely feels like its own entity: different personality, different voice, different memory.
- The pipeline from camera to voice response works end to end in real time.
- The system can auto-assign a personality to a completely unknown object on the spot: any object, anywhere, instantly.
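On-the-spot personality assignment could look like the sketch below. The real system would ask the LLM for a persona; this deterministic fallback (all names and trait lists are made up for illustration) just shows the shape of the data each newly seen object gets:

```python
import hashlib

# Illustrative trait/voice pools; the real system draws these from the LLM
# and ElevenLabs voice catalog rather than fixed lists.
VOICES = ["warm", "gravelly", "bright", "wry"]
TRAITS = ["nostalgic", "curious", "deadpan", "chatty"]

def auto_persona(label: str) -> dict:
    """Derive a stable name, trait, and voice for an unknown object.

    Hashing the label keeps the choice deterministic, so the same object
    gets the same personality on every sighting.
    """
    h = int(hashlib.sha256(label.lower().encode()).hexdigest(), 16)
    return {
        "name": label.title(),
        "trait": TRAITS[h % len(TRAITS)],
        "voice": VOICES[(h // len(TRAITS)) % len(VOICES)],
    }
```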
What we learned
- Chaining multiple AI APIs in real time requires aggressive async design from day one, not as an afterthought.
- Going in, I had no idea how Backboard or AI agents worked under the hood, so watching them come together was genuinely fun.
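The async-design lesson is easy to demonstrate: calls that don't depend on each other (say, Vision labeling and depth estimation on the same frame) should be overlapped rather than chained. A toy timing comparison, with sleeps standing in for API calls:

```python
import asyncio
import time

async def fake_api(delay: float) -> str:
    """Stand-in for a network call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return "ok"

async def sequential() -> None:
    # Awaiting one call after the other: total latency is the sum.
    await fake_api(0.1)
    await fake_api(0.1)

async def concurrent() -> None:
    # Overlapping independent calls: total latency is roughly the max.
    await asyncio.gather(fake_api(0.1), fake_api(0.1))

t0 = time.perf_counter()
asyncio.run(sequential())
seq = time.perf_counter() - t0

t0 = time.perf_counter()
asyncio.run(concurrent())
conc = time.perf_counter() - t0
```

With two 0.1 s calls, the sequential version takes about 0.2 s and the gathered one about 0.1 s; across four chained services, that difference is what keeps the whole loop under a few seconds.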
What's next for Sentient
- Room scanning: scan your entire room once with Depth Anything and every object in it gets a persistent identity automatically
- Mobile app: point your phone at anything, anywhere
- Museum & dementia care pilots: the two use cases with the clearest immediate value
- Shared object memory: two people who interact with the same object both contribute to its memory
- Object-to-object conversations: your book and your guitar have never met. What would they say to each other?
Built With
- backboard
- elevenlabs
- googlevision
- mongodb
- react
- vite
- websockets
- yolo