👷‍♂️Site Supervisor
Your personal AI co-host
💡Inspiration
Launching an Airbnb property requires juggling countless tasks, from handling logistics to planning the interior design. Managing the physical space while making fast aesthetic decisions quickly became overwhelming. I realized hosts could benefit from having a knowledgeable partner right there in the room with them. That sparked the idea for Site Supervisor: a real-time, vision-enabled AI co-host that sees your space, talks you through property management, and generates visual interior design ideas on the fly.
🚀What it does
Site Supervisor is a real-time, multimodal AI agent designed to act as an on-demand, interactive co-host. Users can simply point their phone's camera at a room, and the agent processes the live video and audio. It provides real-time, interruptible voice advice on setting up the space and instantly generates tailored interior design suggestions based on the actual layout it's looking at.
🛠️How we built it
The platform is built with a decoupled architecture focusing on low-latency streaming to create a natural conversational feel.
Client: A Progressive Web App (PWA) built with React. It captures raw audio and video from the user's mobile device and transmits the media via WebSockets (WSS).
Backend: A Node.js server deployed on Google Cloud Run acts as the central orchestrator for the incoming streams.
AI Engine: The backend interfaces directly with the Gemini Multimodal Live API, which handles the real-time reasoning, visual processing, and conversational audio responses.
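The client side of this pipeline boils down to framing captured media into messages the backend can relay. The sketch below shows one plausible way to package a block of raw 16-bit PCM audio for the WebSocket; the message envelope, field names, and MIME string are illustrative assumptions, not our actual wire protocol.

```typescript
// Hypothetical framing of raw PCM audio for the client -> backend WebSocket.
// The MediaChunkMessage shape is an assumption for illustration only.

interface MediaChunkMessage {
  type: "audio" | "video";
  mimeType: string;
  data: string;      // base64-encoded payload
  timestamp: number; // ms since stream start, used to keep A/V in sync
}

// Convert a block of little-endian 16-bit PCM samples into a
// JSON-serializable message ready for ws.send(JSON.stringify(...)).
function frameAudioChunk(samples: Int16Array, timestamp: number): MediaChunkMessage {
  // View the samples as raw bytes without copying the underlying buffer.
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  return {
    type: "audio",
    mimeType: "audio/pcm;rate=16000",
    data: Buffer.from(bytes).toString("base64"), // Node Buffer used for the sketch
    timestamp,
  };
}
```

In the browser the base64 step would use something other than Node's `Buffer`, but the framing idea is the same: small, timestamped chunks so the backend can forward audio and video frames to the model with minimal buffering.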
🚧Challenges we ran into
Ensuring low-latency transmission of raw audio and video streams over WebSockets was a primary hurdle. Orchestrating the media flow from the React client, through the Node.js backend, and into the Gemini Multimodal Live API required strict synchronization. Additionally, handling conversational interruptions natively while maintaining the continuous context of the live video feed took several iterations to get smooth.
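The interruption problem reduces to one rule: the instant the user barges in, every queued chunk of model speech must be dropped so the agent falls silent immediately instead of finishing a stale sentence. A minimal sketch of that idea, with class and method names invented for illustration rather than taken from our codebase:

```typescript
// Sketch of barge-in handling: model audio chunks are queued for playback,
// and the entire queue is flushed when an interruption is detected.
// Names here (PlaybackQueue, handleInterruption) are illustrative.

class PlaybackQueue {
  private queue: Uint8Array[] = [];
  private droppedTotal = 0;

  // Audio arriving from the model is buffered until the player drains it.
  enqueue(chunk: Uint8Array): void {
    this.queue.push(chunk);
  }

  // Called when the server reports the user started speaking:
  // discard all pending model audio so playback stops at once.
  handleInterruption(): number {
    this.droppedTotal += this.queue.length;
    this.queue = [];
    return this.droppedTotal;
  }

  // The audio player pulls the next chunk, if any.
  next(): Uint8Array | undefined {
    return this.queue.shift();
  }

  get pending(): number {
    return this.queue.length;
  }
}
```

Keeping the flush on the playback side (rather than trying to stop the model mid-generation) is what makes the interruption feel instantaneous, since the buffered audio is the part the user actually hears.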
🏆Accomplishments that we're proud of
Achieving a truly fluid, interruptible conversational flow with real-time vision capabilities is a massive win. Connecting the mobile PWA stream to the Gemini Multimodal Live API without noticeable lag makes the agent feel genuinely present in the room. Seeing the AI accurately interpret a physical space and immediately offer practical design and management advice is an incredible technical milestone.
🧠What we learned
This project deeply expanded our knowledge of real-time web technologies, specifically handling WebSockets and raw media streaming in a Node.js environment. It also served as an intensive hands-on experience in generative AI app development, teaching us how to orchestrate the Gemini Multimodal Live API to build highly interactive, multimodal user experiences.
⏭️What's next for Site Supervisor
The immediate next step is to battle-test Site Supervisor in the field to assist with a real-world Airbnb launch, using the physical environment to fine-tune the interior design generation. Long-term, the goal is to expand the agent's capabilities to handle more granular hosting tasks and potentially transition the React PWA into a fully native app for deeper hardware optimization.
Built With
- gemini
- google-cloud
- react
- typescript
- vertex-ai