Inspiration
I was building a hardware project with no prior experience. My workflow was: try something, take a photo, send it to someone (a friend, a forum, ChatGPT), and ask "does this look right?" or "which pin does this wire go to?" Every question meant stopping, pulling out my phone, framing a shot, typing up context, waiting for a response, then going back to work.
This is a common problem in hardware. Getting help requires you to stop and explain what you're looking at. Documenting your build is even worse. You have to capture photos at the right moments, remember what each one was about, add context after the fact, and organize it somewhere useful. Most people don't bother, and then they can't retrace their steps when something breaks.
I wanted something that just watches while I work. It answers questions about what's on my bench without me having to photograph it, and documents everything automatically as I go. No stopping. No typing. Just build, and the record writes itself.
What it does
A small camera on your desk streams video and audio to two AI agents.
Nova Sonic talks to you in real time. It sees what you're working on and can warn you about mistakes, answer questions, or guide you through a step, all hands free.
Nova Lite watches quietly in the background. It tracks what you've done, saves milestones when you finish something, and remembers across sessions so you can pick up where you left off.
You don't interact with a screen. You just work. The AI speaks up when something matters, and the documentation builds itself.
How we built it
A compact device with a camera and microphone captures video and audio. It only streams when it detects someone at the desk, and conversation is activated with a wake word, fully hands free.
A FastAPI backend on AWS receives the streams over WebSocket and routes them to the Nova agents.
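A minimal sketch of that ingestion endpoint, assuming a JSON envelope with a `type` field and base64 payloads; the two routing helpers are illustrative stand-ins for our actual handlers:

```python
# Minimal ingestion sketch. The message envelope and the two routing helpers
# are illustrative, not the exact protocol the device speaks.
import base64
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def route_audio_to_sonic(chunk: bytes) -> None:
    ...  # forward the audio chunk into the live Nova Sonic stream

async def route_frame_to_lite(jpeg: bytes) -> None:
    ...  # queue the frame for Nova Lite scene analysis

@app.websocket("/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_json()
            payload = base64.b64decode(msg["data"])
            if msg["type"] == "audio":
                await route_audio_to_sonic(payload)
            elif msg["type"] == "frame":
                await route_frame_to_lite(payload)
    except WebSocketDisconnect:
        pass  # device left the desk or dropped the connection
```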
Nova Lite analyzes each frame and feeds visual context to Sonic, so Sonic can talk about what it sees even though it's an audio-only model.
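The per-frame call is roughly the sketch below, assuming the Bedrock Converse API; the prompt wording, region, and model ID are illustrative:

```python
# Sketch of per-frame scene analysis with Nova Lite via the Bedrock Converse API.
# Prompt wording is illustrative; some regions need the "us." inference profile
# prefix on the model ID.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def describe_frame(jpeg_bytes: bytes) -> str:
    resp = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": jpeg_bytes}}},
                {"text": "Describe what the builder is doing on the workbench "
                         "in one or two sentences, noting components and connections."},
            ],
        }],
    )
    # The description is what gets injected into Sonic's conversation as fresh context.
    return resp["output"]["message"]["content"][0]["text"]
```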
A background tracker reviews observations regularly and decides whether to save a milestone or flag a problem, no manual input needed.
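In outline the tracker is a loop like this; the cadence, helper names, and decision schema are all illustrative:

```python
# Sketch of the background tracker. Cadence, helpers, and the decision schema
# are illustrative; the real classifier is another Nova Lite call.
import asyncio

REVIEW_INTERVAL_S = 60

async def recent_observations() -> list[str]:
    return []  # read the rolling window of scene descriptions from Redis

async def classify(observations: list[str]) -> dict:
    return {"action": "none"}  # ask Nova Lite: milestone, problem, or nothing?

async def save_milestone(summary: str) -> None:
    ...  # persist a permanent record to Postgres

async def flag_problem(summary: str) -> None:
    ...  # hand the warning to Sonic so it can speak up

async def tracker_loop() -> None:
    while True:
        decision = await classify(await recent_observations())
        if decision["action"] == "milestone":
            await save_milestone(decision["summary"])
        elif decision["action"] == "problem":
            await flag_problem(decision["summary"])
        await asyncio.sleep(REVIEW_INTERVAL_S)
```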
Redis holds the live context window. Postgres stores the permanent record.
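A sketch of that split, assuming a capped Redis list for the live window and a simple milestones table in Postgres; the key name, DSN, and schema are made up for illustration:

```python
# Live window in Redis, permanent record in Postgres (via psycopg).
# Key names, DSN, and table schema are illustrative.
import redis
import psycopg

r = redis.Redis.from_url("redis://localhost:6379/0")
WINDOW_KEY = "session:live-context"
WINDOW_SIZE = 50  # keep roughly the last few minutes of observations

def push_observation(text: str) -> None:
    r.lpush(WINDOW_KEY, text)
    r.ltrim(WINDOW_KEY, 0, WINDOW_SIZE - 1)  # drop anything older than the window

def record_milestone(summary: str, frame_key: str) -> None:
    with psycopg.connect("postgresql://localhost/opwednesday") as conn:
        conn.execute(
            "INSERT INTO milestones (summary, frame_s3_key, created_at) "
            "VALUES (%s, %s, now())",
            (summary, frame_key),
        )
```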
Infrastructure is managed with Terraform: ECS Fargate, ElastiCache, and S3, all deployed with a single command.
Challenges we ran into
Making Sonic see. Sonic is a speech model. It has no vision. We built a pipeline where Nova Lite continuously analyzes frames and injects scene descriptions into Sonic's conversation, so its responses stay visually grounded.
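In outline the grounding loop looks like the sketch below; `SonicSession.send_context` is a hypothetical wrapper around our Sonic bidirectional stream, not a Bedrock API, and `describe_frame` stands in for the Nova Lite call described above:

```python
# Grounding loop sketch. SonicSession.send_context stands in for however text
# events get pushed into the live Sonic stream; describe_frame is the Nova Lite
# call sketched in "How we built it".
from typing import AsyncIterator

class SonicSession:
    async def send_context(self, text: str) -> None:
        ...  # inject a context event into the active conversation

def describe_frame(jpeg: bytes) -> str:
    ...  # Nova Lite scene description (see the earlier sketch)

async def grounding_loop(session: SonicSession, frames: AsyncIterator[bytes]) -> None:
    last_sent = ""
    async for jpeg in frames:
        description = describe_frame(jpeg)
        # A real implementation would dedupe more robustly (change detection,
        # embeddings); exact-match is just the simplest possible gate.
        if description and description != last_sent:
            await session.send_context(f"[scene] {description}")
            last_sent = description
```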
Memory limits on the device. Fitting camera buffers, a person detection model, audio I/O, and a WebSocket client into a small footprint required careful task pinning and buffer management.
Accomplishments that we're proud of
End to end from chip to cloud. Custom firmware, real hardware, backend, infrastructure, not a demo stitched from API calls.
The two-agent split works. Sonic handles the moment. Lite handles the memory. They run independently and stay in sync through shared context.
Presence-gated streaming with wake word activation. The device only streams when someone's in frame, and you trigger conversation with "Hey Wanda", no buttons, no screen, fully hands free.
What we learned
Speech models need visual grounding injected architecturally. You can't describe the scene once in a prompt. Context has to flow continuously as the scene changes.
A second inference pass on critical tool calls is a simple pattern with outsized impact. One extra check eliminated most false alarms.
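The pattern is roughly the sketch below; the helper and prompt wording are illustrative:

```python
# Sketch of the second-pass check. ask_lite_about_frame is an illustrative
# helper wrapping the same Converse call as the earlier Nova Lite sketch.
def ask_lite_about_frame(jpeg: bytes, prompt: str) -> str:
    ...  # one Nova Lite call: image + prompt in, short text answer out

def confirmed(jpeg: bytes, claim: str) -> bool:
    verdict = ask_lite_about_frame(
        jpeg,
        f"An assistant believes: '{claim}'. Looking only at this image, "
        "answer YES if the claim is clearly supported, otherwise answer NO.",
    )
    return bool(verdict) and verdict.strip().upper().startswith("YES")

# The warning tool only fires when confirmed(...) is True; that one extra
# check is what removed most of the false alarms.
```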
Two specialized agents working asynchronously outperform one model trying to do everything in a single loop.
What's next for op-wednesday
Richer memory. Semantic search over past sessions so you can ask "when did I last change the power regulator?" and get a timestamped answer with the frame.
WebRTC streaming. Replace WebSockets with WebRTC for adaptive bitrate, lower latency, and stable streaming on unreliable connections.
End-to-end security. Encrypted streams, auth rotation, role-based dashboard access, encrypted storage at rest.
Spatial context. Add a vision model that tracks component positions across frames, not just frame-by-frame descriptions.
Model and device flexibility. Let users pick any Nova model and connect from any device with a camera and mic.
Multi device. Multiple cameras on the same project, shared context across viewpoints.
Real-world pilot. Cut response latency, deploy inside a real company, test on a real production floor.
Open tool interface. The tool schemas are already MCP-compatible; the next step is exposing them so external tools can query and contribute to project state.
Built With
- esp32
- fastapi
- nova-lite
- nova-sonic
- postgresql