Inspiration

We wanted to build an assistive experience that feels like real smart glasses: a companion that understands what you see and hear, then helps you in the moment. The idea came from everyday friction points: forgetting names in social settings, missing visual context in busy situations, and struggling to recall details later.

What it does

AURA combines Face ID, Scene Insight, and visual memory into one app. It recognizes faces, infers likely names from speech context, summarizes regions in a paused frame, and stores object/context memory for later question answering. It runs as a unified Smart Specs Console with shared video input and real-time overlays.

How we built it

We built AURA in Streamlit with a unified pipeline:

- Face recognition with CavaFace + ONNX Runtime (NPU/QNN preferred)
- Whisper-based transcription with live name hint extraction
- Scene summarization using CLIP/BLIP
- Object memory ingest/query flow
- Threaded real-time streaming architecture, model preloading, and caching
- Interactive pause-and-draw UX for region-based scene understanding

Challenges we ran into

- Keeping video/audio smooth while running inference in parallel
- Preventing Streamlit rerun loops and repeated model initialization
- Handling Windows OpenCV backend issues and media rendering quirks
- Making LLM extraction reliable in background threads
- Balancing speed, stability, and UI responsiveness
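The "NPU/QNN preferred" path with a fallback can be illustrated with a small provider-selection helper. This is a minimal sketch, not AURA's actual code: the function name `pick_providers` and the preference order are assumptions made for illustration. `QNNExecutionProvider` and `CPUExecutionProvider` are the ONNX Runtime names for the Qualcomm QNN backend and the CPU backend.

```python
from typing import Iterable, List

# Assumed preference order: Qualcomm NPU via QNN first, CPU as the last resort.
PREFERRED = ("QNNExecutionProvider", "CPUExecutionProvider")


def pick_providers(
    available: Iterable[str], preferred: Iterable[str] = PREFERRED
) -> List[str]:
    """Return the preferred providers that are actually available, in order.

    `pick_providers` is a hypothetical helper, not part of ONNX Runtime.
    """
    available_set = set(available)
    chosen = [p for p in preferred if p in available_set]
    if not chosen:
        # Fail loudly instead of silently running with no backend.
        raise RuntimeError("no usable execution provider found")
    return chosen
```

With ONNX Runtime installed, the `available` list would typically come from `onnxruntime.get_available_providers()`, and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`. Keeping the selection logic in a pure function makes the fallback behavior easy to test on machines without an NPU.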

Accomplishments that we're proud of

- End-to-end unified “smart glasses” workflow in one app
- Real-time face overlays plus transcription-driven hints
- Auto memory flow from perception to later retrieval
- NPU-first acceleration path with practical fallbacks
- Robust debugging and recovery of multiple live pipeline issues

What we learned

We learned that real-time multimodal UX is mostly systems engineering: synchronization, caching, thread-safe state, and graceful fallback logic matter as much as model quality. We also learned to design around platform/runtime constraints early.
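As one illustration of the thread-safe state mentioned above, a background capture or inference thread can publish its newest result into a single locked slot that the UI thread reads. This is a hedged sketch under our own assumptions, not the code AURA ships: the class name `LatestValue` and its version counter are hypothetical.

```python
import threading
from typing import Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class LatestValue(Generic[T]):
    """Thread-safe single-slot holder: writers overwrite, readers get the newest value."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Optional[T] = None
        self._version = 0  # incremented on every write

    def set(self, value: T) -> None:
        """Replace the stored value; called from a worker thread."""
        with self._lock:
            self._value = value
            self._version += 1

    def get(self) -> Tuple[Optional[T], int]:
        """Return the newest value and its version; called from the UI thread."""
        with self._lock:
            return self._value, self._version
```

A reader can compare the returned version with the last one it rendered and skip the redraw when nothing has changed, which keeps slow inference from blocking the display loop.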

What's next for AURA

- Stronger identity confidence checks before auto-enroll
- Better timeline sync between audio, captions, and overlays
- Native mobile/AR wearable integration
- Multi-user memory profiles and privacy controls
- Faster on-device models and richer contextual reasoning
