Inspiration
Every morning I'd watch my girlfriend photograph individual clothing items she thought might match together, send them to ChatGPT, and ask it for an opinion. It sort of worked - but it was painfully slow. She'd have to re-photograph things, describe details the model couldn't see, and start from scratch every time because it had no memory of what she owned. And honestly, when you're trying to leave the house, watching someone have a 15-minute conversation with a chatbot about whether a jacket matches is… motivating. I figured there had to be a better way.
The core problem was clear: styling is holistic. A good outfit isn't about individual items — it's about how they work together. That means the AI can't just look at one piece at a time. It needs to see the entire closet at once, remember what you've worn recently, and understand your taste. That's what I set out to build.
What I Learned
Gemini 3's context window is incredible. Most AI apps hitting a knowledge problem reach for RAG — retrieve a handful of documents, hope they're the right ones, and generate from there. For wardrobe styling, that approach fundamentally breaks down. If the model only sees 5 out of 50 items, it's styling from an incomplete picture. Gemini's large context window let me skip RAG entirely and load the full wardrobe — every image, every metadata field — into a single prompt. The difference in output quality was night and day. It's also significantly faster — there's no retrieval step, no embedding lookup, no re-ranking. The model already has everything it needs, so it can go straight from question to answer. What would be a multi-step pipeline with RAG becomes a single inference call.
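Concretely, the whole "skip RAG" idea is a single call: serialize every item's metadata, attach its image, and ask the question. A rough sketch with the google-genai SDK — the model id and item fields are placeholders, not the app's exact schema:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment


def build_wardrobe_prompt(items: list[dict]) -> list:
    """Pack every item's metadata and image into one prompt - no retrieval step."""
    parts: list = ["You are a personal stylist. The user's full wardrobe follows."]
    for item in items:
        parts.append(
            f"Item {item['id']}: {item['category']}, {item['colour']}, "
            f"{item['material']}, season: {item['season']}"
        )
        parts.append(types.Part.from_bytes(data=item["image_bytes"], mime_type="image/jpeg"))
    return parts


def suggest_outfit(items: list[dict], question: str) -> str:
    response = client.models.generate_content(
        model="gemini-flash-latest",  # placeholder model id
        contents=build_wardrobe_prompt(items) + [question],
    )
    return response.text
```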
Tool calling makes agents actually reliable. Early prototypes used free-text parsing to extract outfit suggestions from the model's response. It was brittle and constantly broke. Switching to Gemini's structured tool calling (view_items, show_outfit, add_to_calendar, etc.) turned the agent from a demo into something genuinely usable. The model decides what to do, the code handles how.
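Declaring a tool is just a name plus a JSON schema. Roughly — only the tool names come from the app, the schemas here are illustrative:

```python
from google.genai import types

view_items = types.FunctionDeclaration(
    name="view_items",
    description="Fetch full-resolution images for specific wardrobe items.",
    parameters={
        "type": "object",
        "properties": {"item_ids": {"type": "array", "items": {"type": "integer"}}},
        "required": ["item_ids"],
    },
)

# show_outfit, add_to_calendar, save_preference are declared the same way.
stylist_tools = types.Tool(function_declarations=[view_items])
config = types.GenerateContentConfig(tools=[stylist_tools])
```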
Streaming matters more than speed. Users don't mind waiting 3-4 seconds for a response if they can see it happening — thinking indicators, text appearing word by word, items loading in. A fast response that appears all at once actually feels slower than a streamed one that takes longer.
How I Built It
The stack is Next.js 14 (TypeScript, Tailwind) on the frontend, FastAPI (Python) on the backend, PostgreSQL for persistence, and Backblaze B2 for image storage. Everything runs in Docker Compose. Most of the development was done in Antigravity, which made it significantly faster to iterate on the agent logic and frontend components. I was pleasantly surprised by Gemini 3's capability when it came to UI design, which is something I admittedly lack experience with.
The core of the app is the stylist agent — a Gemini 3 Flash model with access to tools that can view wardrobe items (with images), check the weather, save user preferences, and present outfits. When a user asks for an outfit, the agent:
- Gets an initial prompt with the user's complete wardrobe, outfit history, preferences, local weather forecast, and question
- Reasons about what works together, considering colour, season, occasion, weather, and recent wear history
- Picks a few candidate outfits, then pulls the actual images of just those items for a visual pass to check cohesion (which keeps token costs down)
- Calls show_outfit with specific item IDs, per-item styling notes, and an overall comment (the loop is sketched just below)
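Here's roughly what that loop looks like, assuming a chat session created with the tool declarations above. The handler functions are hypothetical stand-ins for the real database, storage, and weather code:

```python
from google.genai import types

# Hypothetical handlers - the real ones hit Postgres, B2, and the weather API.
HANDLERS = {
    "view_items": lambda args: fetch_item_images(args["item_ids"]),
    "show_outfit": lambda args: render_outfit_card(args),
    "add_to_calendar": lambda args: save_calendar_entry(args),
}


def run_agent_turn(chat, user_message):
    response = chat.send_message(user_message)
    while response.function_calls:
        # Execute every call the model made this turn, then return all results at once.
        result_parts = []
        for call in response.function_calls:
            result = HANDLERS[call.name](dict(call.args))
            result_parts.append(
                types.Part.from_function_response(name=call.name, response={"result": result})
            )
        response = chat.send_message(result_parts)
    return response.text
```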
The chat runs over WebSocket with real-time streaming — the frontend shows thinking status, text chunks as they arrive, and the final outfit composite.
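The WebSocket handler itself is plain FastAPI. A trimmed-down sketch, where stream_agent_reply is a hypothetical async generator wrapping the Gemini stream and the message shapes are illustrative:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/ws/chat")
async def chat_ws(websocket: WebSocket):
    await websocket.accept()
    while True:
        user_message = await websocket.receive_text()
        await websocket.send_json({"type": "status", "value": "thinking"})

        # stream_agent_reply is a hypothetical async generator that yields dicts
        # like {"type": "text_chunk", ...} or {"type": "outfit", ...}.
        async for event in stream_agent_reply(user_message):
            await websocket.send_json(event)

        await websocket.send_json({"type": "done"})
```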
Magic Polish is the upload pipeline. When you photograph a clothing item, a background worker picks up the job, sends the image through Nano Banana for a product-style enhancement (using image generation), then runs a second pass to extract metadata (category, colour, material, season, tags). The worker uses SELECT ... FOR UPDATE SKIP LOCKED for job claiming, so it scales horizontally. The user can upload a photo of a crumpled mess and get back something that looks like a catalogue listing.
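The claim query is the standard Postgres work-queue pattern; roughly (table and column names here are illustrative):

```python
from sqlalchemy import text
from sqlmodel import Session

CLAIM_JOB = text("""
    UPDATE polish_jobs
       SET status = 'processing', claimed_at = now()
     WHERE id = (
         SELECT id FROM polish_jobs
          WHERE status = 'pending'
          ORDER BY created_at
          LIMIT 1
          FOR UPDATE SKIP LOCKED
     )
    RETURNING id, item_id, image_key
""")


def claim_next_job(session: Session):
    """Atomically claim one pending job; concurrent workers skip rows already locked."""
    row = session.execute(CLAIM_JOB).first()
    session.commit()
    return row  # None when the queue is empty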
Challenges
The multi-tool-call bug. Gemini can return multiple function calls in a single response — for example, save_preference followed by view_items followed by show_outfit. My initial code used fc = part.function_call in a loop, which meant each call overwrote the previous one. Preferences were being silently dropped. The fix was simple (function_calls.append(...)) but finding it took hours of staring at "correct" prompts that somehow weren't working. The lesson: when the model seems to be ignoring you, check if your code is ignoring the model.
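Simplified, the bug and the fix look like this:

```python
# Buggy: keeps only the last function call in the response
fc = None
for part in response.candidates[0].content.parts:
    if part.function_call:
        fc = part.function_call  # each call silently overwrites the previous one

# Fixed: collect every call, then execute them all
function_calls = []
for part in response.candidates[0].content.parts:
    if part.function_call:
        function_calls.append(part.function_call)
```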
Getting the agent to actually "style". Early versions would just pick items that matched a keyword — ask for "something casual" and it'd grab any item tagged casual. It took a lot of prompt engineering and system prompt iteration to get Gemini to think like a stylist: considering colour balance across the whole outfit, layering textures, and explaining why pieces work together rather than just listing them. The tool-calling structure helped here — forcing the agent to pass per_item_notes and a stylist_comment with every outfit meant it couldn't be lazy.
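Marking those fields as required in the show_outfit declaration is what keeps the model honest. An approximate version of the schema (the exact shapes are illustrative):

```python
from google.genai import types

show_outfit = types.FunctionDeclaration(
    name="show_outfit",
    description="Present a complete outfit to the user with styling rationale.",
    parameters={
        "type": "object",
        "properties": {
            "item_ids": {"type": "array", "items": {"type": "integer"}},
            "per_item_notes": {
                "type": "array",
                "items": {"type": "string"},
                "description": "One styling note per item: why it works in this outfit.",
            },
            "stylist_comment": {
                "type": "string",
                "description": "Overall explanation of how the pieces work together.",
            },
        },
        "required": ["item_ids", "per_item_notes", "stylist_comment"],
    },
)
```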
Streaming + tool calls in the same response. Gemini streams text tokens, but tool calls arrive as structured objects mixed into the same stream. Building a frontend that gracefully handles "here's some thinking text… now here's a tool call that renders an outfit card… now here's more text commenting on it" required careful state management. The WebSocket handler needed to process text chunks, tool results, and UI state transitions all in the right order without flickering or layout shifts.
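On the backend, this boils down to inspecting each streamed part and forwarding a typed event. A sketch assuming the google-genai streaming chat API, with illustrative event names:

```python
def iter_agent_events(chat, user_message):
    """Split one streamed Gemini response into ordered, typed events.

    Text chunks and tool calls arrive interleaved in the same stream, so the
    frontend gets explicit event types instead of having to guess.
    """
    for chunk in chat.send_message_stream(user_message):
        if not chunk.candidates or not chunk.candidates[0].content:
            continue
        for part in chunk.candidates[0].content.parts or []:
            if part.text:
                yield {"type": "text_chunk", "text": part.text}
            elif part.function_call and part.function_call.name == "show_outfit":
                yield {"type": "outfit", "args": dict(part.function_call.args)}
            elif part.function_call:
                yield {"type": "tool_call", "name": part.function_call.name}
```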
Keeping the context window useful, not just big. Loading 200+ items with images into context is possible, but it's expensive and can actually produce worse results because the model drowns in noise. I had to find the right balance — what metadata to include, how to structure the item list, when to show images vs. just text descriptions — so the model could reason efficiently over the full wardrobe without degrading response quality or costing a fortune.
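What ended up working was a dense, one-line-per-item text listing, with full images deferred to view_items for shortlisted candidates. A simplified sketch of that serialization (field names are illustrative):

```python
def serialize_item(item: dict) -> str:
    """One dense line per item - cheap in tokens but enough to shortlist candidates."""
    tags = ",".join(item.get("tags", []))
    return (
        f"#{item['id']} {item['category']} | {item['colour']} | {item['material']} | "
        f"{item['season']} | tags:{tags} | last_worn:{item.get('last_worn', 'never')}"
    )


def build_wardrobe_listing(items: list[dict]) -> str:
    # Full images are only fetched later, via view_items, for shortlisted candidates.
    return "WARDROBE:\n" + "\n".join(serialize_item(i) for i in items)
```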
Built With
- Gemini 3 — Flash for the stylist agent (tool calling + vision), Nano Banana for image generation
- Antigravity — Google's AI coding IDE, used for the majority of development
- Next.js 14 — React frontend with App Router, TypeScript, Tailwind CSS, Framer Motion
- FastAPI — Python backend with WebSocket streaming and SQLModel ORM
- PostgreSQL 15 — Wardrobe items, saved outfits, calendar entries, user preferences
- Backblaze B2 — S3-compatible private image storage with signed URLs
- Docker Compose — Full stack orchestration (db, backend, frontend, worker)
What's Next
- Real authentication — Replace the mock auth with Google OAuth so multiple users can have their own closets
- Shopping suggestions — When Remi spots a gap in your wardrobe ("you don't have a neutral layering piece"), suggest specific items to buy that complement what you already own
- Social sharing — Let users share outfits with friends and get feedback before committing
- Outfit history analytics — Visualise wear patterns over time: what you reach for most, what's gathering dust, seasonal trends
- Mobile app — The current UI is mobile-first web, but a native app would unlock background uploads, push notifications for tomorrow's outfit, and better camera integration