Inspiration
Walking into a vacant house can feel cold and uninspiring. According to real estate industry data, vacant homes sit on the market significantly longer and sell for less than furnished ones. Buyers simply struggle to visualize an empty box as their future "dream home."
While physical staging solves this, it is incredibly expensive and time-consuming—often costing upwards of $5,000 and taking weeks to coordinate. We realized that AI could bridge this imagination gap instantly and for roughly $0.05 per room. But we didn't just want to generate a flat image—we wanted to create an immersive, multimodal experience. We were inspired to build an AI agent that doesn't just decorate a room, but sells the vision of living there through interwoven visuals and voice.
What it does
Open House AI Storyteller is a next-generation real estate agent and interior designer living right in your browser.
Users upload a photo of an empty room, select an interior design style (like Modern Farmhouse, Scandinavian Minimalist, or Traditional Luxury), and can even activate our specialized ☯ Feng Shui Expert Mode.
Operating as a true "Creative Storyteller," the agent processes the visual input and simultaneously weaves together a rich, interleaved output:
High-Fidelity Visuals: It generates a photorealistically staged version of the room, perfectly matching the original perspective, lighting, and scale.
Text Narrative: It writes an enthusiastic, buyer-focused description of the design choices, including how the layout promotes good energy and flow.
Immersive Audio: It streams a professional, studio-quality voiceover of the design story, allowing buyers to listen to their new home's potential while admiring the transformed space.
How we built it
We built the platform to be fast, elegant, and deeply integrated with the Google Cloud ecosystem:
Frontend: Built with React and Vite, styled with Tailwind CSS, and brought to life with Framer Motion animations for a premium, luxury real estate feel.
Backend: A Node.js and Express server, designed to be deployed on Google Cloud Run for scalable, serverless execution.
The AI Brain (Vision & Image): We heavily utilized the new @google/genai SDK. To ensure speed and cost-effectiveness without sacrificing quality, we engineered a custom model fallback chain, prioritizing gemini-3.1-flash-image-preview and gemini-2.5-flash-image for rapid text-to-image and image-to-image generation.
The Voice (Audio): We integrated the Google Cloud Text-to-Speech (TTS) API using the ultra-realistic "Journey" voice models to give our agent a warm, persuasive voice.
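For the voice side, the Cloud TTS `text:synthesize` endpoint takes a JSON body with `input`, `voice`, and `audioConfig` fields and returns base64-encoded audio. A minimal sketch of that request shape is below; the specific voice name `en-US-Journey-F` is an assumption (substitute whichever Journey variant your project uses), and `buildTtsRequest`/`decodeAudio` are hypothetical helper names:

```typescript
// Build the JSON body for a Cloud TTS v1 `text:synthesize` call.
// NOTE: "en-US-Journey-F" is assumed here as an example Journey voice name.
interface TtsRequest {
  input: { text: string };
  voice: { languageCode: string; name: string };
  audioConfig: { audioEncoding: "MP3" | "LINEAR16" };
}

function buildTtsRequest(story: string): TtsRequest {
  return {
    input: { text: story },
    voice: { languageCode: "en-US", name: "en-US-Journey-F" },
    audioConfig: { audioEncoding: "MP3" },
  };
}

// The API responds with { audioContent: "<base64>" }; decoding it on the
// Node side is a one-liner before streaming bytes to the browser.
function decodeAudio(audioContent: string): Buffer {
  return Buffer.from(audioContent, "base64");
}
```

The synthesized `audioContent` can then be served to the frontend's audio player once the story text has finished generating.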
Challenges we ran into
Prompt Engineering for Spatial Awareness: It was challenging to get the AI to balance beautiful aesthetic staging with strict spatial rules for our Feng Shui mode (like the "Commanding Position" or ensuring unobstructed pathways). We solved this by structuring our prompts with explicit, separated requirements for the image generation versus the text generation.
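The "explicit, separated requirements" idea can be sketched as a small prompt builder. This is a hypothetical illustration (the function and option names are ours, not from the actual codebase): the spatial rules for image generation and the narrative rules for text generation live in clearly labeled sections so the model does not blend them.

```typescript
// Hypothetical sketch: keep image-generation constraints and narrative
// instructions in separate, labeled sections of one prompt.
interface StagingPromptOptions {
  style: string;     // e.g. "Scandinavian Minimalist"
  fengShui: boolean; // enables the Feng Shui spatial rules
}

function buildStagingPrompt({ style, fengShui }: StagingPromptOptions): string {
  const imageRules = [
    `Stage the room in a ${style} style.`,
    "Preserve the original camera perspective, lighting, and scale.",
  ];
  if (fengShui) {
    imageRules.push(
      "Place the main furniture in the Commanding Position (facing the door, not directly in line with it).",
      "Keep all pathways and doorways unobstructed.",
    );
  }

  const textRules = [
    "Write an enthusiastic, buyer-focused description of the design choices.",
  ];
  if (fengShui) {
    textRules.push("Explain how the layout promotes good energy and flow.");
  }

  return [
    "IMAGE REQUIREMENTS:",
    ...imageRules.map((r) => `- ${r}`),
    "",
    "TEXT REQUIREMENTS:",
    ...textRules.map((r) => `- ${r}`),
  ].join("\n");
}
```

Separating the sections this way makes it easy to tighten the spatial rules without accidentally rewriting the tone of the narrative.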
Handling Multimodal Synchronization: We initially struggled with race conditions where the audio player would try to load before the Gemini AI had finished writing the story, resulting in 400 Bad Request errors. We implemented robust React state management to wait for the AI's interleaved text output before safely fetching and decoding the Base64 audio stream.
Socket Errors with Heavy Payloads: Processing high-resolution images back and forth caused occasional socket hang-ups. We engineered a smart 3-attempt retry loop and a graceful fallback mechanism between Gemini models to ensure the user never sees a crashed screen.
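The retry-plus-fallback pattern described above can be sketched as follows. This is a simplified, SDK-agnostic illustration (the function names and the injected `generate` callback are ours): try each model in priority order, up to three attempts per model, and only surface an error after every combination has failed.

```typescript
// Hypothetical sketch of a 3-attempt retry loop with model fallback.
// `generate` stands in for whatever SDK call actually hits the model.
type GenerateFn = (model: string) => Promise<string>;

async function generateWithFallback(
  models: string[],
  generate: GenerateFn,
  attemptsPerModel = 3,
): Promise<string> {
  let lastError: unknown;
  for (const model of models) {
    for (let attempt = 1; attempt <= attemptsPerModel; attempt++) {
      try {
        return await generate(model);
      } catch (err) {
        lastError = err; // e.g. a socket hang-up: retry, then fall back
      }
    }
  }
  throw lastError; // every model and attempt failed
}
```

In practice a short delay between attempts (exponential backoff) is also worth adding so transient network errors have time to clear.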
Accomplishments that we're proud of
We are incredibly proud of breaking out of the standard "chatbot text box." We successfully built an agent that feels like a cohesive storyteller—seamlessly turning an empty photo into a magazine-quality image, paired with a professional voiceover. We are also proud of the UI/UX; the app feels like a premium enterprise tool that real estate brokerages could start using tomorrow to win more listings.
What we learned
The Power of the Google GenAI SDK: We learned how to effectively construct multimodal prompts and pass Base64 image data and MIME types into Gemini's generation endpoints.
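A minimal sketch of that part-assembly step, assuming the Gemini content-part shape (a text part plus an inline-data part carrying base64 image bytes and a MIME type); `buildImageParts` is a hypothetical helper name:

```typescript
// Sketch: pair the staging prompt with the uploaded room photo as an
// inline-data part (base64-encoded bytes + MIME type).
interface TextPart {
  text: string;
}
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}

function buildImageParts(
  prompt: string,
  imageBuffer: Buffer,
  mimeType: string,
): [TextPart, InlineDataPart] {
  return [
    { text: prompt },
    { inlineData: { data: imageBuffer.toString("base64"), mimeType } },
  ];
}
```

The resulting array is what gets passed as the request contents to the generation endpoint, so matching the MIME type (`image/jpeg`, `image/png`, etc.) to the uploaded file matters.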
Cloud Architecture: We deepened our understanding of serverless deployment concepts and managing secure Google Cloud service account authentication for APIs like TTS.
The Interleaved Paradigm: We learned that combining visual transformations with audio storytelling creates a significantly higher emotional impact than just outputting text on a screen.
What's next for Open House AI Storyteller
We are just scratching the surface of multimodal real estate! Next, we plan to:
Video Walkthroughs: Leverage Google's Veo video generation model to take the staged photos and turn them into panning, cinematic virtual tours.
Real-time Voice Interaction: Integrate the Live API so a prospective buyer looking at the staged photo can actually interrupt the agent and ask questions like, "Can you change that sofa to leather?" or "What kind of wood is that floor?"
Automated MLS Integration: Allow real estate agents to simply paste a Zillow listing link, and have the agent automatically scrape the empty photos and generate a complete marketing package.
Built With
- express.js
- framer-motion
- gemini-api
- google-cloud
- google-cloud-run
- node.js
- react
- real-estate
- tailwind-css
- text-to-speech
- typescript
- vite