Inspiration
Creative projects often begin with a feeling rather than a specification. In interior design, clients rarely know exactly what they want; they recognize what feels right only after seeing it. This leads to long revision cycles, vague briefs, and inefficient design processes. Mood boards help with inspiration, but they do not guide decisions or synthesize preferences. Design Alignment Agent explores a different approach. Instead of asking users to describe their taste, it lets them discover it visually. The system presents design options, observes reactions, and refines the direction until a clear aesthetic emerges. The idea is to turn preference discovery into a structured journey: a story that begins with uncertainty and ends with a coherent design brief.
What it does
Design Alignment Agent is a stateful multimodal AI agent that helps users converge on a design direction through visual comparison and iterative refinement. The experience follows a guided arc:
- Style selection: the user picks an interior style from a curated visual set
- Style explanation: the agent explains the aesthetic logic of that style
- Exploration round: three variations of the same room are presented side by side
- Refinement round: based on the selection, the agent generates progressively focused alternatives
- Final convergence: a structured design brief summarizes the aesthetic direction
Throughout the process, the system maintains a persistent session that evolves with every choice. The result is a clear design direction derived from interaction rather than prompt writing.
How we built it
The backend is a FastAPI service running on Google Cloud Run. Gemini 2.5 Flash handles all reasoning, including style commentary, direction planning, and final brief generation. Imagen 3 renders photorealistic rooms sequentially, and Firestore persists session state across every call. The key architectural decision was separating reasoning from rendering: Gemini plans the full design direction first, then Imagen generates the images. This keeps quality high and cost controlled.
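The separation of reasoning from rendering can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Direction`, `run_round`, `fake_plan`, and `fake_render` are hypothetical names, and the two stand-in functions take the place of the real Gemini and Imagen calls, which are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Direction:
    """One planned design direction (hypothetical structure)."""
    label: str
    prompt: str


def run_round(
    style: str,
    plan_fn: Callable[[str], List[Direction]],
    render_fn: Callable[[str], bytes],
) -> List[Tuple[str, bytes]]:
    # Step 1: a single text-reasoning call plans every direction up front.
    directions = plan_fn(style)
    # Step 2: image calls run one at a time against the finished plan.
    results = []
    for direction in directions:
        results.append((direction.label, render_fn(direction.prompt)))
    return results


# Stand-in for the planning model call (assumption: returns three variations).
def fake_plan(style: str) -> List[Direction]:
    return [
        Direction(label=f"{style} / {v}", prompt=f"{style} living room, {v} variation")
        for v in ("warm", "neutral", "bold")
    ]


# Stand-in for the image model call (assumption: returns raw image bytes).
def fake_render(prompt: str) -> bytes:
    return prompt.encode("utf-8")


results = run_round("Japandi", fake_plan, fake_render)
```

Passing the planner and renderer in as callables keeps the orchestration testable without touching either model, and makes it easy to see that no image is requested before the plan is complete.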
Challenges we ran into
The main challenge was keeping visual comparisons fair across rounds. The solution was to fix the spatial canvas: same room, same camera, same light across every option, so users evaluate only aesthetic differences, not architectural ones. The other main constraint was managing the latency of sequential image generation while keeping the experience coherent.
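One way to picture the fixed spatial canvas is as a shared prompt fragment that every option in a round must include. The sketch below assumes this prompt-level approach; the wording of `FIXED_CANVAS` and the helper names are hypothetical, not taken from the project.

```python
from typing import List

# Hypothetical wording for the constraints shared by every option in a round.
FIXED_CANVAS = "same living room layout, same camera angle, same natural lighting"


def build_prompt(aesthetic: str, canvas: str = FIXED_CANVAS) -> str:
    # Only the aesthetic part varies; the spatial part is held constant.
    return f"Photorealistic interior. {canvas}. Aesthetic direction: {aesthetic}."


def build_round_prompts(aesthetics: List[str]) -> List[str]:
    return [build_prompt(a) for a in aesthetics]


prompts = build_round_prompts(
    ["warm oak and linen", "cool concrete and steel", "deep green and brass"]
)
```

Because the canvas text is identical across options, any difference a user sees between cards should come from the aesthetic direction alone.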
Accomplishments that we're proud of
The plan-before-render pattern works well: Gemini reasons about design direction first, then Imagen executes against that plan. The session state design also works cleanly. Round 2 automatically uses the image selected in Round 1 as its spatial anchor, creating continuity across rounds.
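The anchoring behavior described above might look something like this. It is a sketch under stated assumptions: a plain dictionary stands in for Firestore, the `Session` fields and helper names are hypothetical, and the anchor is assumed to be the most recent selection (which for Round 2 is the Round 1 choice).

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Session:
    """Hypothetical session document; field names are illustrative."""
    session_id: str
    style: Optional[str] = None
    selections: List[str] = field(default_factory=list)  # chosen image id per round

    @property
    def spatial_anchor(self) -> Optional[str]:
        # Assumption: the next round anchors on the latest selected image.
        return self.selections[-1] if self.selections else None


def save_session(store: Dict[str, dict], session: Session) -> None:
    # In-memory stand-in for writing the session document to a database.
    store[session.session_id] = asdict(session)


def load_session(store: Dict[str, dict], session_id: str) -> Session:
    return Session(**store[session_id])


def record_selection(store: Dict[str, dict], session: Session, image_id: str) -> None:
    # Persist after every choice so each call sees the up-to-date state.
    session.selections.append(image_id)
    save_session(store, session)


store: Dict[str, dict] = {}
session = Session(session_id="abc", style="Japandi")
anchor_before = session.spatial_anchor
record_selection(store, session, "img_r1_b")
loaded = load_session(store, "abc")
```

Reloading the session from the store before planning the next round means the anchor never has to be passed around by the client.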
What we learned
Stateful multimodal agents require careful separation of concerns. Text reasoning is fast and cheap; image generation is slow and expensive. Designing the system around that asymmetry improved both quality and cost. Fixing spatial constraints, paradoxically, gives users more creative clarity, not less: removing variables focuses attention on what actually matters.
What's next for Design Alignment Agent
- Voice and typed commentary between rounds as richer preference signals
- Compositional selection across cards, for example taking the sofa from one option and the palette from another
- Additional refinement rounds until the user genuinely converges, rather than a fixed limit
- Expansion beyond living rooms into branding, architecture, and visual identity
Built With
- fastapi
- gemini-2.5-flash
- google-cloud-firestore
- google-cloud-run
- google-genai-sdk
- imagen-3
- python
- vertex-ai

