🌌 Aura: Multimodal Home Intelligence

Next-Generation Ambient Smart Home Automation powered by Gemini Live

💡 Inspiration

Smart homes today are often fragmented, reactive, and strictly text- or voice-only. We wanted to build something that feels alive. Inspired by futuristic sci-fi interfaces, we set out to create Aura: a central AI pilot that not only hears your voice but also sees your environment at the same time, and turns that understanding into a living Ambient Dashboard that reflects the state of your home in real time.

🛠️ What it does

Aura is a fully multimodal smart home operating system that runs over continuous, low-latency WebSocket streams.

  • Concurrent Live Streaming: Feeds real-time audio AND webcam visuals directly into the Gemini Live API without waiting for turn-taking breaks.
  • Native Contextual Vision: Aura identifies objects you hold up (cups, plant types, device error screens) directly from the streamed frames, with no separate capture step.
  • Immersive Ambient UI: The dashboard adapts dynamically. Ask Aura to turn off the lights, and the interface dims into a deep, vignette-like neon dark-mode glow. Trigger an Emergency, and a continuous strobe-alert overlay covers the entire viewport, driven by class toggles on the DOM.
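
To make the "no waiting for turn-taking" idea concrete, here is a minimal sketch of how video frames could be paced next to the audio stream. It is an assumption, not our exact code: the `FrameSink` interface, the `MAX_BUFFERED_BYTES` threshold, and the JSON message shape are all hypothetical. The only real API it relies on is the standard `bufferedAmount` property and `send` method that a browser `WebSocket` exposes.

```typescript
// Minimal structural type: the only socket features this sketch needs.
// A browser WebSocket satisfies it (bufferedAmount and send are standard).
interface FrameSink {
  readonly bufferedAmount: number;
  send(data: string): void;
}

// Hypothetical threshold: skip video frames while more than this is queued.
const MAX_BUFFERED_BYTES = 256 * 1024;

// Sends one base64 JPEG frame only if the socket has drained enough.
// Returns true when the frame was sent, false when it was dropped.
function sendFrameIfDrained(
  sink: FrameSink,
  jpegBase64: string,
  maxBuffered: number = MAX_BUFFERED_BYTES,
): boolean {
  if (sink.bufferedAmount > maxBuffered) {
    return false; // Drop the frame instead of queueing behind stale ones.
  }
  // Hypothetical envelope; the real message format may differ.
  sink.send(JSON.stringify({ type: "frame", mimeType: "image/jpeg", data: jpegBase64 }));
  return true;
}
```

Dropping a frame is usually cheaper than delaying it: the next capture replaces it, and audio is never stuck behind a backlog of images.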

⚙️ How I built it

We built a decoupled, reactive Node/React stack:

  • Backend (Node.js/Express): Holds stateful WebSocket connections to the Live API through the genAI.live.connect Node.js client and routes function calls to 12+ backend home triggers (thermostat, locks, security).
  • Frontend (React + Vite): Captures audio continuously with the Web Audio API and extracts video frames through a canvas context, preserving the source aspect ratio. JPEG frames are streamed continuously over the WebSocket with backpressure, so video runs concurrently with audio.
  • Styling (CSS Variables + Keyframes): Global ambient overrides are triggered when a function call resolves. Each result toggles a discrete class on the layout wrapper, giving instant visual feedback that the command executed.
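
The routing-plus-styling flow above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our actual backend: the tool names (`set_lights`, `trigger_emergency`, `set_thermostat`), the `ToolCall` and `ToolResult` shapes, and `routeToolCall` are hypothetical. The `lights-off` and `emergency-global` class names are the ambient themes mentioned elsewhere on this page.

```typescript
// Ambient CSS class to toggle on the layout wrapper, or null for no change.
type AmbientClass = "lights-off" | "emergency-global" | null;

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ToolResult {
  ok: boolean;
  message: string;
  ambient: AmbientClass;
}

type Handler = (args: Record<string, unknown>) => ToolResult;

// Hypothetical tool names; a Map avoids accidental hits on Object.prototype keys.
const handlers = new Map<string, Handler>([
  ["set_lights", (args) => {
    const on = args.on === true;
    return { ok: true, message: `lights ${on ? "on" : "off"}`, ambient: on ? null : "lights-off" };
  }],
  ["trigger_emergency", () => (
    { ok: true, message: "emergency alert active", ambient: "emergency-global" }
  )],
  ["set_thermostat", (args) => {
    const celsius = args.celsius;
    if (typeof celsius !== "number" || !Number.isFinite(celsius)) {
      return { ok: false, message: "celsius must be a finite number", ambient: null };
    }
    return { ok: true, message: `thermostat set to ${celsius} C`, ambient: null };
  }],
]);

// Dispatches one function call and reports which ambient class the UI should apply.
function routeToolCall(call: ToolCall): ToolResult {
  const handler = handlers.get(call.name);
  if (handler === undefined) {
    return { ok: false, message: `unknown tool: ${call.name}`, ambient: null };
  }
  return handler(call.args);
}
```

On the frontend, the returned `ambient` value would simply be toggled as a class on the dashboard wrapper, which is what lets the CSS keyframes do the rest.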

🚧 Challenges I ran into

  • Video Aspect Ratio Pipeline Flaws: Drawing 16:9 webcam frames onto a square canvas for the model originally stretched the image, and the distorted shapes led to hallucinated objects. We added dynamic canvas scaling with center-weighted offsets, which fixed these early visual bugs.
  • UI and Speech Synchronization: Keeping UI state in step with spoken responses across concurrent streams in real time required strict control of component mount and unmount lifecycles, so that the device camera stream didn't crash during layout re-renders.
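
The scaling fix comes down to a small piece of geometry. The sketch below assumes a letterbox ("contain") fit that keeps the whole frame visible and centers it on the square canvas; `fitCentered` and `Placement` are hypothetical names, and the 768 px target in the usage note is only an example size.

```typescript
// Destination rectangle for drawing a source frame onto a square canvas.
interface Placement {
  dx: number; // left offset on the canvas
  dy: number; // top offset on the canvas
  dw: number; // drawn width
  dh: number; // drawn height
}

// Scales the source uniformly so it fits entirely inside a target x target
// square, then centers it. Uniform scaling is what prevents distortion.
function fitCentered(srcW: number, srcH: number, target: number): Placement {
  if (srcW <= 0 || srcH <= 0 || target <= 0) {
    throw new RangeError("dimensions must be positive");
  }
  const scale = Math.min(target / srcW, target / srcH);
  const dw = Math.round(srcW * scale);
  const dh = Math.round(srcH * scale);
  return {
    dx: Math.floor((target - dw) / 2),
    dy: Math.floor((target - dh) / 2),
    dw,
    dh,
  };
}
```

For a 1280x720 webcam frame and a 768 px square, `fitCentered(1280, 720, 768)` gives a 768x432 image offset 168 px from the top. In the browser the result can be passed straight to `ctx.drawImage(video, dx, dy, dw, dh)` after filling the canvas with a neutral background.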

🏆 Accomplishments that I'm proud of

  • True Multimodal Concurrency: Live frames and voice are sent simultaneously, with side-by-side feed rendering, without slowing down normal processing.
  • Micro-Animation Fluidity: Rich ambient CSS themes (.lights-off, .emergency-global, thermal gradients) that run inside standard dashboard grids with no noticeable re-render lag.
  • Visual Feedback Design (Dashboard Debug Overlay): Building trust between AI and user by rendering the live feed and the "Sent to Aura" frame buffer side by side on screen.

🧠 What I learned

  • Multimodal Prompting Structures Matter: The instructions need to state explicitly that the model has a continuous camera feed. Otherwise it tends to hallucinate search-tool behavior instead of relying on the imagery it already has.
  • Visual Feedback is Crucial in Voice-First Apps: Users love seeing exactly what the AI sees, so they can verify it grasped the correct object before it executes a command.
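
As an illustration of the first lesson, a system instruction along these lines makes the continuous feed explicit. The wording and the `buildSystemInstruction` helper are hypothetical, not the prompt we shipped; the point is only that the camera feed and the "answer from the imagery" rule are spelled out.

```typescript
// Hypothetical wording: it states that the camera feed is continuous and
// that visual answers must come from the frames, not from imagined tools.
function buildSystemInstruction(assistantName: string): string {
  return [
    `You are ${assistantName}, a smart home assistant.`,
    "You receive a continuous live camera feed alongside the user's audio.",
    "When asked about something visible, answer only from the frames you have received.",
    "Do not claim to search the web or use tools that were not declared to you.",
    "If the object is not clearly visible, say so and ask the user to hold it closer.",
  ].join("\n");
}
```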

🚀 What's next for Aura: Multimodal Home Intelligence

  • Spatial Object Depth Mapping: Upgrading layout triggers to 3D mesh representations driven by depth sensors.
  • Cross-Component Scene Predictive Logic: Proactive triggers that adjust room temperature or lock doors based on posture and object tracking, running on edge modules.
  • Reporting Enhancement: Feed all events into Google BigQuery for better reporting.

💡 Key Live API Capabilities I leveraged:

  • 🧠 Native Thinking & Reasoning: Powered by gemini-2.5-flash-native-audio-preview, Aura uses internal "thinking budgets" to reason through complex smart-home diagnostics before it executes a command or responds aloud.
  • 🎙️ Continuous Bidirectional Async Streaming: Fully concurrent audio and vision pipelines that do not wait for turn-taking breaks.
  • 📷 Real-time Canvas Buffering: An aspect-ratio-scaled canvas context that keeps the full video frame and fits it within the bounds the model expects.
  • 🚨 Immersive Ambient Viewports: Full-screen overrides that adapt dynamically (e.g., turning off the lights triggers a rich dark-mode vignette as soon as the command executes).
