Inspiration
I used to be terrible at fashion. My college roommate, a brilliant fashion designer, would literally stop me at the door to fix my tragic outfits. Even after college, I'd video call him to look at my closet and tell me what to wear. I wanted to digitize that exact experience: a fashion-obsessed bestie who keeps it real, gives you instant vocal feedback, and visually shows you how an outfit will look. I didn't want a text chatbot; I wanted a high-end, elite 3D styling session.
What it does
Zaute is a fully interactive 3D fashion companion. You speak naturally to the realistic 3D avatar, and it replies in real time with highly opinionated styling advice. You can interrupt it mid-sentence. If you ask for a specific look (e.g., "I need a flowing silk dress for a wedding"), you upload a selfie, and Zaute triggers a backend function call to Nano Banana 2, which composites the generated outfit directly onto your body.
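The function-call step above can be sketched as a small dispatcher. This is a minimal illustration, not the actual Zaute backend: the `ToolCall` shape, the `generate_try_on` tool name, and its `outfit` argument are all hypothetical, and the handler only returns a placeholder instead of calling the image model.

```typescript
// Hypothetical, simplified shapes; real Live API tool-call messages carry more fields.
type ToolArgs = Record<string, unknown>;
type ToolCall = { name: string; args: ToolArgs };
type ToolResult = { name: string; response: Record<string, unknown> };
type ToolHandler = (args: ToolArgs) => Promise<Record<string, unknown>>;

const handlers: Record<string, ToolHandler> = {
  // Stand-in for the backend function that would invoke the image model.
  generate_try_on: async (args) => {
    const outfit = typeof args.outfit === "string" ? args.outfit : "";
    if (outfit.length === 0) {
      throw new Error("outfit description is required");
    }
    return { status: "queued", outfit };
  },
};

// Route one tool call to its handler and always return a result,
// so the voice session can keep talking even when a tool fails.
async function dispatchToolCall(call: ToolCall): Promise<ToolResult> {
  const handler = handlers[call.name];
  if (!handler) {
    return { name: call.name, response: { error: `unknown tool: ${call.name}` } };
  }
  try {
    return { name: call.name, response: await handler(call.args) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name: call.name, response: { error: message } };
  }
}
```

Returning errors as ordinary results, rather than throwing, is a design choice in this sketch: it lets the model explain the failure out loud instead of the session dropping.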
How we built it
- Gemini Multimodal Live API connected via WebSockets for bidirectional, sub-second audio streaming.
- A React Three Fiber frontend with realistic .glb avatars. We route raw 16-bit PCM audio from Gemini through the browser's Web Audio API and use the wawa-lipsync library to physically drive 52 ARKit facial blendshapes in real-time.
- Server-side function calls invoke Gemini 3.1 Flash Image (Nano Banana 2) for high-fidelity virtual try-on and multi-image fusion.
- The backend is containerized and deployed on Google Cloud Run to handle the persistent WebSockets; the frontend is hosted on Vercel.
- Built using Google's Antigravity IDE to autonomously orchestrate the architecture and complex audio routing.
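To feed raw 16-bit PCM into the Web Audio API, the bytes first have to become floating-point samples. Here is a minimal sketch of that conversion, assuming little-endian mono samples; the function name `pcm16ToFloat32` is illustrative and not taken from the project.

```typescript
// Convert raw 16-bit PCM bytes (assumed little-endian, mono) into
// Float32 samples in the range [-1, 1), the format Web Audio buffers use.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  // Ignore a trailing odd byte, which can appear if a chunk is split mid-sample.
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Dividing by 32768 maps the int16 range [-32768, 32767] onto [-1, 1).
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

In the browser, the resulting array could be copied into an `AudioBuffer` with `copyToChannel` and played through an `AudioBufferSourceNode`, which is also a natural point to tap the signal for lip-sync analysis.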
Challenges we ran into
The biggest challenge came while I was mapping real-time interactions: a 3D coordinate conflict caused the avatar's rigging to snap, giving the stylist four arms and a broken neck 48 hours before the deadline! I had to recalculate every Three.js bone constraint from scratch. Additionally, moving from standard sockets to the Gemini Live API required a massive backend refactor to handle raw PCM audio buffers and keep them synchronized with the 3D facial morph targets so the lip-sync wouldn't lag. On top of that, I lost the files after my coding agent misinterpreted an instruction, which meant starting from scratch and coming back better.
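One common way to keep streamed audio chunks from drifting or gapping is to track a running playback cursor and start each chunk exactly where the previous one ends. The sketch below shows only that bookkeeping. It is an assumption about the general technique, not the project's actual code, and the `ChunkScheduler` class and its 24000 Hz example rate are hypothetical.

```typescript
// Tracks when the next audio chunk should start so chunks play back-to-back.
class ChunkScheduler {
  private nextStart = 0;
  private readonly sampleRate: number;

  constructor(sampleRate: number) {
    this.sampleRate = sampleRate;
  }

  // Returns the start time (in seconds) for a chunk of `sampleCount` samples.
  // `now` is the current audio clock time; if playback has fallen behind,
  // the chunk starts immediately instead of in the past.
  schedule(sampleCount: number, now: number): number {
    const startAt = Math.max(now, this.nextStart);
    this.nextStart = startAt + sampleCount / this.sampleRate;
    return startAt;
  }

  // Called when the user interrupts: forget queued audio so the next
  // reply starts right away rather than after the abandoned one.
  reset(): void {
    this.nextStart = 0;
  }
}

// Example: two half-second chunks at an assumed 24000 Hz rate.
const scheduler = new ChunkScheduler(24000);
const first = scheduler.schedule(12000, 1.0);  // starts at the current time
const second = scheduler.schedule(12000, 1.1); // starts when the first chunk ends
```

In a browser, `now` would come from `AudioContext.currentTime` and the returned value would be passed to `AudioBufferSourceNode.start()`. Because lip-sync is driven from the same audio graph, scheduling on the audio clock keeps the mouth movement aligned with what the user actually hears.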
Accomplishments that we're proud of
Achieving true sub-second latency where the user can interrupt the 3D avatar natively. Combining heavy 3D rendering (WebGL) with real-time generative audio and generative vision (Nano Banana 2) into one fluid, high-end "quiet luxury" user interface without crashing the browser.
What we learned
I learned the deep intricacies of 3D bone hierarchies and WebRTC audio buffering. I also learned the power (and danger) of agentic IDEs like Antigravity: after an autonomous agent accidentally deleted my 3D logic code, rebuilding it from memory forced me to write much cleaner, faster code.
What's next for Zaute
I plan to expand the Studio functionality by integrating Veo 3.1, allowing fashion entrepreneurs to animate their generated product images with a single prompt for social media content. I also plan to add shopping recommendations for fits.
Built With
- gcr
- gemini
- next.js
- react
- three.js
- typescript
- vanilla
- wawa-lipsync
