🌌 Aura: Multimodal Home Intelligence
Next-Generation Ambient Smart Home Automation powered by Gemini Live
💡 Inspiration
Smart homes today are often fragmented and reactive, and they usually respond to text or voice alone. We wanted to build something that feels alive. Inspired by sci-fi interfaces, we set out to create Aura: a central AI pilot that not only processes your voice but also sees your environment at the same time, and translates that understanding into a gorgeous, living Ambient Dashboard that reflects your home's state in real time.
🛠️ What it does
Aura is a fully multimodal smart home operating system that runs over continuous, low-latency WebSocket streams.
- Concurrent Live Streaming: Feeds real-time audio AND webcam visuals directly into the Gemini Live API without waiting for turn-taking breaks.
- Native Contextual Vision: Aura identifies objects you hold up (cups, plant types, device error screens) directly from the streamed frames, without a separate capture module.
- Immersive Ambient UI: The dashboard adapts dynamically. Ask Aura to turn off the lights, and the interface dims into a deep, vignette-like neon dark-mode glow. Trigger an emergency, and a full-viewport strobe-alert overlay activates, driven by CSS classes that the resolved function call toggles in the DOM.
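Here is a minimal sketch of how a resolved function call could drive the ambient classes this write-up names (.lights-off, .emergency-global). Several parts are hypothetical and are not taken from Aura's code: the function names setLights, triggerEmergency and clearEmergency, the args shape, and the base "dashboard" class.

```typescript
// Hypothetical shape of a resolved function call forwarded to the frontend.
interface ResolvedCall {
  name: string;
  args: Record<string, unknown>;
}

// Visual state of the dashboard wrapper; each flag maps to one ambient CSS class.
interface AmbientState {
  lightsOff: boolean;
  emergency: boolean;
}

// Pure reducer: returns the next ambient state after one resolved call.
function reduceAmbient(state: AmbientState, call: ResolvedCall): AmbientState {
  switch (call.name) {
    case "setLights": // hypothetical name
      return { ...state, lightsOff: call.args.on === false };
    case "triggerEmergency": // hypothetical name
      return { ...state, emergency: true };
    case "clearEmergency": // hypothetical name
      return { ...state, emergency: false };
    default:
      return state; // calls with no visual effect leave the theme untouched
  }
}

// Builds the className for the layout wrapper; the CSS does the animation work.
function ambientClassName(state: AmbientState): string {
  const classes = ["dashboard"]; // hypothetical base class
  if (state.lightsOff) classes.push("lights-off");
  if (state.emergency) classes.push("emergency-global");
  return classes.join(" ");
}
```

Because reduceAmbient has the (state, action) signature, it can be passed to React's useReducer. The returned string can then be placed on the wrapper's className, so the CSS variables and keyframes attached to each class carry the visual change.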
⚙️ How I built it
We engineered a decoupled full-stack reactive Node/React pipeline:
- Backend (Node.js/Express): Holds stateful WebSocket sessions through the genAI.live.connect Node.js client and routes the model's function calls to 12+ backend home triggers (thermostat, locks, security).
- Frontend (React + Vite): Captures microphone audio continuously with the Web Audio API and extracts webcam frames through a canvas context, preserving the exact aspect ratio. JPEG frames are streamed continuously as encoded bytes, with backpressure handling so video streams concurrently without stalling the connection.
- Styling (CSS Variables + Keyframes): Global ambient overrides fire when a function call resolves. Each call toggles a discrete class on the layout wrapper, so the UI reflects the executed command immediately.
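The backend routes function calls to 12+ home triggers. Below is a minimal sketch of such a router. It assumes each tool call carries an id, a name and JSON args. The three handlers are hypothetical stubs, not Aura's real device integrations. The results use the { id, name, response } shape that, as far as we know, the @google/genai SDK's session.sendToolResponse({ functionResponses }) accepts. The live-session wiring itself is left out.

```typescript
// Assumed shape of one function call emitted by the model.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// One reply per call, matched back to the model by id.
interface ToolResult {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

type Handler = (args: Record<string, unknown>) => Promise<Record<string, unknown>>;

// Hypothetical device handlers; a real deployment would call the home backend here.
const handlers: Record<string, Handler> = {
  setLights: async (args) => ({ ok: true, on: Boolean(args.on) }),
  setThermostat: async (args) => {
    const celsius = Number(args.celsius);
    // Validate before acting so a misheard number never reaches the device.
    if (!Number.isFinite(celsius) || celsius < 10 || celsius > 30) {
      return { ok: false, error: "celsius must be between 10 and 30" };
    }
    return { ok: true, celsius };
  },
  lockDoor: async (args) => ({ ok: true, door: String(args.door ?? "front"), locked: true }),
};

// Runs a batch of calls concurrently and never rejects, so every id gets an answer.
async function dispatchToolCalls(calls: ToolCall[]): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      const handler = handlers[call.name];
      if (!handler) {
        return { id: call.id, name: call.name, response: { ok: false, error: `unknown function ${call.name}` } };
      }
      try {
        return { id: call.id, name: call.name, response: await handler(call.args) };
      } catch (err) {
        return { id: call.id, name: call.name, response: { ok: false, error: String(err) } };
      }
    }),
  );
}
```

Errors are caught per call, so one failing device cannot leave the model without answers for the other calls in the same batch.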
🚧 Challenges I ran into
- Video Aspect Ratio Pipeline Flaws: Stretching 16:9 webcam frames into the square boxes the AI processes originally distorted shapes, which led to hallucinated objects. We replaced the stretch with dynamic canvas scaling that keeps the original aspect ratio and centers the frame, which fixed those early visual bugs.
- Layout Decoupling Synchronicity: Keeping UI state in sync with spoken responses across concurrent streams in real time required strict component lifecycle handling, so the device camera stream didn't crash during layout re-renders.
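The aspect-ratio fix from the first challenge can be expressed as a letterbox ("contain") fit. You scale the frame uniformly until it fits the target box, then center it. This is a sketch under assumptions. The 768x768 target used in the example is hypothetical, not a documented model input size.

```typescript
// Destination rectangle for drawing a source frame into a fixed-size canvas.
interface DrawRect {
  dx: number;
  dy: number;
  dw: number;
  dh: number;
}

// Letterbox fit: uniform scale so the whole frame is visible, centered on both axes.
function containRect(srcW: number, srcH: number, dstW: number, dstH: number): DrawRect {
  if (srcW <= 0 || srcH <= 0 || dstW <= 0 || dstH <= 0) {
    throw new RangeError("all dimensions must be positive");
  }
  const scale = Math.min(dstW / srcW, dstH / srcH); // never stretch one axis more than the other
  const dw = Math.round(srcW * scale);
  const dh = Math.round(srcH * scale);
  // Center offsets: the leftover space is split evenly on each side.
  return { dx: Math.round((dstW - dw) / 2), dy: Math.round((dstH - dh) / 2), dw, dh };
}
```

For a 1280x720 webcam frame and a 768x768 canvas this gives { dx: 0, dy: 168, dw: 768, dh: 432 }. In the browser, the rectangle would be passed to ctx.drawImage(video, dx, dy, dw, dh) after filling the canvas with a neutral color. canvas.toBlob(callback, "image/jpeg", quality) would then produce the JPEG to stream.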
🏆 Accomplishments that I'm proud of
- True Multimodal Concurrency: Rendering the live feeds side by side while sending live frames and voice simultaneously, without slowing normal processing.
- Micro-Animation Fluidity: Building rich ambient CSS themes (.lights-off, .emergency-global, thermal gradients) that run inside the standard dashboard grid without noticeable re-render lag.
- Visual Feedback Design (Dashboard Debug overlay): Building trust between AI and user by rendering the live feed and the "Sent to Aura" frame buffer side by side on screen.
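Sending frames and voice simultaneously without blocking usually needs some form of backpressure. One common approach drops a video frame whenever the WebSocket's bufferedAmount is above a threshold, so stale frames never pile up in front of newer data. The sketch below shows that approach as an assumption, not Aura's confirmed implementation. The FrameSender name and the 256 KiB default are illustrative choices.

```typescript
// Minimal structural view of a browser WebSocket; only these members are used.
interface SocketLike {
  readonly readyState: number;
  readonly bufferedAmount: number;
  send(data: ArrayBuffer): void;
}

const WS_OPEN = 1; // numeric value of WebSocket.OPEN

// Hypothetical helper: sends JPEG frames only while the socket keeps up.
class FrameSender {
  private readonly socket: SocketLike;
  private readonly maxBufferedBytes: number;
  sent = 0;
  dropped = 0;

  constructor(socket: SocketLike, maxBufferedBytes = 256 * 1024) {
    this.socket = socket;
    this.maxBufferedBytes = maxBufferedBytes;
  }

  // Sends the frame if the socket is open and has drained; otherwise drops it and returns false.
  trySend(jpeg: ArrayBuffer): boolean {
    if (this.socket.readyState !== WS_OPEN || this.socket.bufferedAmount > this.maxBufferedBytes) {
      this.dropped++;
      return false;
    }
    this.socket.send(jpeg);
    this.sent++;
    return true;
  }
}
```

A capture loop would call trySend for each encoded frame. Dropping a late frame costs little because the next one follows shortly. Audio chunks could bypass this check, since losing speech is far more noticeable than losing a frame.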
🧠 What I learned
- Multimodal Prompting Structures Matter: The instructions must state explicitly that the model receives a continuous camera feed. Otherwise it falls back on hallucinated search-tool calls instead of relying on the imagery it already has.
- Visual Feedback is Crucial in Voice-First Apps: Users love seeing exactly what the AI sees, so they can verify it grasped the correct object before a command executes.
🚀 What's next for Aura: Multimodal Home Intelligence
- Spatial Object Depth Mapping: Upgrading layout triggers to 3D mesh representations driven by depth-sensor input.
- Cross-Component Scene Predictive Logic: Proactive triggers that adjust room temperatures or lock doors based on posture and object tracking, running accurately on edge modules.
- Reporting Enhancement: Stream all events into Google BigQuery for richer reporting.
💡 Key Live API Capabilities I leveraged:
- 🧠 𝗡𝗮𝘁𝗶𝘃𝗲 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴 & 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴: Powered by gemini-2.5-flash-native-audio-preview, Aura uses internal "thinking budgets" to reason through complex smart-home diagnostics safely before responding aloud and executing commands.
- 🎙️ 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀 𝗕𝗶𝗱𝗶𝗿𝗲𝗰𝘁𝗶𝗼𝗻𝗮𝗹 𝗔𝘀𝘆𝗻𝗰 𝘀𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴: Fully concurrent audio and vision pipelines that don't wait for turn-taking breaks.
- 📷 𝗥𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 𝗖𝗮𝗻𝘃𝗮𝘀 𝗕𝘂𝗳𝗳𝗲𝗿𝗶𝗻𝗴: Aspect-ratio-preserving canvas scaling keeps 100% of the video frame in view and fits it accurately into the model's input size.
- 🚨 𝗜𝗺𝗺𝗲𝗿𝘀𝗶𝘃𝗲 𝗔𝗺𝗯𝗶𝗲𝗻𝘁 𝗩𝗶𝗲𝘄𝗽𝗼𝗿𝘁𝘀: Full-screen overrides that adapt dynamically (e.g., turning off the lights triggers a rich dark-mode vignette as soon as the function call executes).
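Continuous audio streaming starts with a conversion step. The Web Audio API produces Float32 samples, which must become raw 16-bit little-endian PCM, the input format the Live API documents (at 16 kHz, to the best of our knowledge). Below is a sketch under that assumption. The function names are hypothetical. The linear resampler is deliberately naive (no anti-aliasing filter) and is a stand-in, not necessarily what Aura uses.

```typescript
// Naive linear-interpolation resampler, e.g. 48 kHz microphone audio down to 16 kHz.
// No low-pass filter is applied, so some aliasing is possible when downsampling.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio; // fractional read position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Converts Float32 samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range samples
    // Negative values scale to -32768, positive values to 32767.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The resulting bytes would then be base64-encoded and sent on the live session, typically labeled with the audio/pcm;rate=16000 MIME type.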
Built With
- artifactregistry
- canvas-api
- cloudrun
- css3
- express.js
- gemini-live-api
- google-cloud
- html5
- node.js
- react
- typescript
- vite
- web-audio-api
- websockets