Inspiration

Professional branding costs €6,000–€25,000 and takes 6–16 weeks. We work with e-commerce resellers across 30+ European markets — and the pattern is universal: quality products, terrible brand presence. Not because founders don't care. Because they can't afford to care professionally.

Existing AI tools make this worse, not better. Logo generators assume you already have a brand direction. Product-photo tools stop at photography. Nothing takes a single product photo and asks: "What does this brand want to be?"

Every one of these tools is a form. Fill in the blanks, click generate, get output. That's not how creative direction works. A creative director sees your product, listens to your vision, speaks their reasoning aloud, and creates assets while you're still in the conversation — adapting in real time to your reactions.

We wanted to build the thing that should already exist: a creative director you can actually talk to — one that sees, hears, speaks, and creates in a single immersive session.

What it does

BrandStorm is a live AI creative director. You upload a product photo, then have a real-time voice conversation with an agent that sees your product, listens to your preferences, speaks its creative reasoning aloud — and generates a complete brand identity as you talk.

One session produces 10 coherent brand assets across 3 modalities, interleaved into a single creative narrative:

  • 5 strategy texts — brand name with rationale, tagline, brand story, core values, tone-of-voice guide
  • 4 visual assets — logo concept, 5-color palette, hero lifestyle shot, Instagram post (4:5)
  • 1 audio asset — brand story narrated in character by a dedicated narrator voice

The agent doesn't fill in a form. It consults. It proposes creative directions, explains its reasoning in its own voice, adapts to feedback in real time, and selectively regenerates only the assets that need to change. Voice, images, and live transcription emerge as a single coherent creative stream — not three separate outputs stitched together after the fact.

The creative conversation itself is the product. There is no text box.

How we built it

Architecture: Three Layers in Concert

Layer 1 — Live API Agent (Gemini Live 2.5 Flash, Native Audio)

The agent runs as a persistent Gemini Live API session on Vertex AI. It receives the uploaded product photo as visual input, conducts the entire voice conversation with native audio output (response_modalities=[AUDIO]), and autonomously orchestrates its own workflow through function calling — deciding when to analyze, when to propose, when to generate, and when to ask for feedback. The agent has a distinct creative-director persona with a warm, opinionated voice. Everything the user sees as text in the chat is the live transcription of the agent's speech — powered by output_audio_transcription running alongside the audio stream.
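As a rough sketch of that session setup, assuming the google-genai SDK (the model name, persona text, and project values here are placeholders, not our exact production config):

```python
from google import genai
from google.genai import types

# Placeholder project/location; real values come via Application Default Credentials.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],  # the Live session speaks; it never emits images inline
    output_audio_transcription=types.AudioTranscriptionConfig(),  # live transcript of agent speech
    system_instruction="You are a warm, opinionated creative director. ...",  # persona, abridged
)

async def run_session(product_photo_bytes: bytes):
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-native-audio", config=config
    ) as session:
        # The product photo goes in as visual context at session start; the rest
        # of the session is bidirectional audio plus autonomous tool calls.
        ...
```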

Layer 2 — Function Calling → Image Generation Bridge

This is the key architectural insight that enables true multimodal interleaving. The Live API produces AUDIO only — it cannot generate images inline. Instead, we registered 7 function-calling tools that the agent invokes autonomously:

  • propose_names: Generate brand name candidates with rationale
  • set_brand_identity: Lock in brand name, tagline, story, values, tone of voice
  • set_palette: Create a 5-color palette with HEX values and roles
  • set_fonts: Select typography pairing (heading + body)
  • generate_image: Create a visual asset (logo, hero lifestyle, Instagram 4:5)
  • generate_voiceover: Narrate the brand story via a dedicated narrator voice
  • finalize_brand_kit: Package all assets into a downloadable brand kit

The backend intercepts each function call, executes it against gemini-3.1-flash-image-preview (Nano Banana Pro) for visual assets, and returns the result to the Live session — which the agent then narrates as part of the ongoing conversation.
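A minimal, duck-typed sketch of that bridge (the message shape mirrors the Live API's tool-call payloads; execute_tool and the session object stand in for our backend wiring):

```python
# Sketch of the function-calling bridge. Messages with .tool_call.function_calls
# (each carrying .id/.name/.args) mirror the Live API; execute_tool is a
# placeholder that e.g. runs Nano Banana Pro for generate_image.
async def tool_call_bridge(session, execute_tool):
    async for message in session.receive():
        tool_call = getattr(message, "tool_call", None)
        if not tool_call:
            continue  # plain audio/transcript chunk, nothing to execute
        responses = []
        for fc in tool_call.function_calls:
            result = await execute_tool(fc.name, fc.args)
            responses.append({"id": fc.id, "name": fc.name,
                              "response": {"result": result}})
        # Hand results back to the Live session; the agent narrates them
        # as part of its next speech turn.
        await session.send_tool_response(responses)
```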

The result: the agent says "I'm building the palette now — I'll use the rose-gold from your bottle as the primary accent...", the palette appears on screen, and the agent continues "...see how that warm tone carries into the hero shot I'm generating next." Voice audio, live transcription, and images emerge as a single interleaved creative stream — exactly as they would in a real creative review session.

Layer 3 — Brand Canvas as Source of Truth

Every brand element — name, tagline, palette, fonts, logo, hero, Instagram post, voiceover — lives on a BrandCanvas with element-level status: EMPTY → GENERATING → READY → STALE. The agent receives a snapshot of this canvas on every turn and autonomously decides what to create, regenerate, or skip. The backend never drives the narrative — it only executes tools and updates canvas state. This gives the agent genuine creative autonomy while keeping the execution layer deterministic and observable.
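An illustrative sketch of that canvas (the field names and invalidation rules here are assumptions, not our exact schema): every element carries its own lifecycle status, and the agent receives a compact snapshot each turn to decide what to create or regenerate.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    EMPTY = "empty"
    GENERATING = "generating"
    READY = "ready"
    STALE = "stale"  # a dependency changed; needs regeneration

@dataclass
class CanvasElement:
    status: Status = Status.EMPTY
    value: object = None

@dataclass
class BrandCanvas:
    elements: dict = field(default_factory=lambda: {
        name: CanvasElement() for name in
        ("name", "tagline", "palette", "fonts", "logo", "hero", "instagram", "voiceover")
    })

    def set_ready(self, key, value, invalidates=()):
        self.elements[key] = CanvasElement(Status.READY, value)
        for dep in invalidates:  # downstream assets go STALE, not deleted
            if self.elements[dep].status is Status.READY:
                self.elements[dep].status = Status.STALE

    def snapshot(self):
        # Compact status view injected into the agent's context every turn.
        return {k: e.status.value for k, e in self.elements.items()}
```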

Stack

  • Backend: Python 3.11, FastAPI, uvicorn, WebSocket at /ws/{session_id}
  • Frontend: React 19 (Vite), Tailwind CSS 4, Web Audio API for mic capture + agent voice playback
  • Live Agent: gemini-live-2.5-flash-native-audio via Vertex AI (voice + vision + tool use)
  • Image Gen: gemini-3.1-flash-image-preview / Nano Banana Pro (visual asset generation)
  • Voiceover: gemini-2.5-flash-preview-tts (Kore voice)
  • Cloud: Google Cloud Run (single container: API + SPA), Vertex AI, Cloud Storage, Application Default Credentials

Gemini Integration Depth

BrandStorm uses five distinct Gemini capabilities and 7 agent-controlled tools in a single session flow:

  • Live API with Native Audio — real-time bidirectional voice conversation with barge-in support
  • Live API Multimodal Input — product photo sent as visual context at session start; the agent references what it sees throughout the conversation
  • Live API Function Calling — 7 registered tools the agent invokes autonomously to orchestrate the entire brand creation workflow
  • Live API Output Transcription — agent speech is transcribed in real time and streamed to the UI, creating the chat-alongside-audio experience without a separate text modality
  • Gemini Image Generation — photorealistic lifestyle shots, Instagram posts, and logo concepts via Nano Banana Pro, triggered by the agent's generate_image tool calls

These aren't separate features bolted together. They flow through one persistent Live API session where the agent decides what to create, when, and why — narrating the entire creative journey in real time.

Challenges we ran into

The Live API does not generate images inline — and nobody told us. Our first architecture assumed the agent could output images directly within the Live session. It cannot — response_modalities only supports [AUDIO]. We discovered this only after building the entire first pipeline. The rebuild took a full day: we redesigned the image layer as a function-calling bridge — the agent invokes generate_image, the backend executes against Nano Banana Pro, and returns the result to the Live session. The outcome was cleaner than the original plan: the agent's tool calls became an explicit, auditable trace of every creative decision.

Getting the agent to actually call tools — and then stop talking. Function calling with the Live API is non-trivial. The model wants to keep narrating. Early versions would describe what it was about to generate in detail, then forget to call the tool. Or call the tool, receive the result, and immediately start speaking again unprompted — producing a second, duplicate narration turn. We solved this through a combination of: precise tool descriptions with hard character limits ("Say EXACTLY ONE short sentence, max 8 words, then call immediately. Stop talking."), a pending_tool_response watchdog in session state that detects when the agent has gone silent after a tool result, and a finalize_in_progress flag that suppresses the redundant post-tool speech turn entirely.

Vertex AI preview models return 429 constantly. gemini-3.1-flash-image-preview on Vertex AI hits RESOURCE_EXHAUSTED under any real load. We couldn't rely on it as a primary path. Our solution: a three-tier fallback chain — Vertex AI global endpoint first (1 attempt), then Google AI Developer API as the real workhorse (up to 4 retries with exponential backoff), then Vertex AI regional fallback models. In practice the Developer API does most of the work. We maintain two separate client instances (_global_client, _dev_client) and route between them dynamically at runtime. A model outage never surfaces to the user.
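The chain boils down to a small retry loop. A schematic sketch (the tier callables and backoff constants are placeholders for our two client instances and routing logic):

```python
import time

def generate_with_fallback(prompt, tiers, base=1.0):
    """tiers: list of (callable, max_attempts) in priority order,
    e.g. [(vertex_global, 1), (developer_api, 4), (vertex_regional, 2)]."""
    last_error = None
    for call, max_attempts in tiers:
        for attempt in range(max_attempts):
            try:
                return call(prompt)
            except Exception as exc:  # e.g. 429 RESOURCE_EXHAUSTED
                last_error = exc
                # Exponential backoff within a tier, capped; then fall through
                # to the next tier. The user never sees a single-tier outage.
                time.sleep(min(base * 2 ** attempt, 8 * base))
    raise RuntimeError("all image backends exhausted") from last_error
```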

The whole pipeline was too smart — and it broke. The first architecture had the backend driving the conversation: 542 lines of text_parser.py detecting structured output from the agent, a pregen.py pre-generation pipeline running assets in the background before the user even asked, and a dual-turn state machine coordinating parallel tool calls. Under real usage it produced race conditions, duplicate UI events, out-of-order assets, and deadlocks when pre-generated images arrived mid-speech. We threw all of that out. We replaced it with a BrandCanvas — a single source of truth with element-level status — and a purely reactive agent loop that executes tools and returns results, nothing else. Simpler. More robust. Easier to debug at 2am.

The agent misread "yes" and skipped ahead. When the user says "yes" or "ok", we inject a [NEXT STEP] instruction to nudge the agent toward the next pipeline step. This caused false positives: a user confirming a tagline tweak would accidentally trigger "call generate_image now". We solved it with a canvas fingerprint guard — we record the canvas state hash at the moment of injection and suppress re-injection until the canvas actually changes. It sounds simple. It took three debug sessions to get right.
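The guard itself is tiny (class and method names here are illustrative): a [NEXT STEP] nudge is injected at most once per distinct canvas state, so a bare "yes" cannot re-trigger the same step.

```python
import hashlib
import json

class NextStepGuard:
    """Suppress repeated [NEXT STEP] injection until the canvas actually changes."""

    def __init__(self):
        self._last_fingerprint = None

    @staticmethod
    def _fingerprint(canvas_snapshot: dict) -> str:
        blob = json.dumps(canvas_snapshot, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def should_inject(self, canvas_snapshot: dict) -> bool:
        fp = self._fingerprint(canvas_snapshot)
        if fp == self._last_fingerprint:
            return False  # canvas unchanged since the last nudge: suppress
        self._last_fingerprint = fp
        return True
```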

Live API sessions die silently during image generation. Image generation takes 15–40 seconds. The Live API session times out if it receives no input during that window — no error, just a silent disconnect. Solution: we send silent PCM keepalive frames (\x00 * 480 — 15ms of silence at 16kHz, well below VAD threshold) every 8 seconds during long-running tool calls. This is not documented anywhere. We found it through trial, error, and reading Live API behavior directly.
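The keepalive is a small background task (the send callback and interval are assumptions standing in for our session wiring): 480 zero bytes are 240 16-bit samples, i.e. 15 ms of silence at 16 kHz, short and quiet enough to stay under voice-activity detection.

```python
import asyncio

SILENCE_FRAME = b"\x00" * 480  # 15 ms of 16-bit mono PCM at 16 kHz

async def keepalive(send_audio, stop: asyncio.Event, interval: float = 8.0):
    """Ping the Live session with silence while a long tool call runs."""
    while not stop.is_set():
        try:
            # Wake early if the tool call finishes; otherwise send a frame.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            await send_audio(SILENCE_FRAME)  # session sees input; no silent disconnect
```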

Model deprecations at deadline. Two models we had built around were deprecated within the contest window; gemini-3-pro-preview went offline on March 9, one week before our deadline. We spent a day retesting the full image generation chain and rebuilding config defaults. The fallback chain wasn't optional engineering. It was survival infrastructure.

Accomplishments that we're proud of

True multimodal interleaving in a live session. Voice audio, live transcription, and images don't arrive as separate payloads — they emerge as a single coherent creative narrative. The agent narrates between and around generated assets, creating a sense of creative presence that no batch-mode generator can replicate. This is exactly what "Creative Storyteller" means to us: the story is the creation process.

Agent autonomy via 7 function-calling tools. The agent genuinely controls the session. It decides when to propose names (propose_names), when to lock in identity (set_brand_identity), when to build the palette (set_palette), when to generate each visual asset (generate_image), when to narrate the brand story (generate_voiceover), and when to package everything (finalize_brand_kit) — all through the same function calling mechanism that production AI agents use. The backend is a pure execution layer. No keyword detection, no hardcoded flows.

10 immediately usable assets from one conversation. Brand name. Tagline. Brand story. Core values. Tone-of-voice guide. Logo. 5-color palette. Hero lifestyle shot. Instagram post. Brand story voiceover. Zero post-processing required.

Voice-first UX that degrades gracefully. Mic unavailable? The agent still speaks, you type. No audio at all? Full text + image mode. The experience is coherent at every degradation level — because the agent adapts, not the UI.

Single-container deploy on Cloud Run. FastAPI serves both the API and the React SPA from one container. gcloud run deploy and it's live. No separate frontend hosting, no CDN config, no CORS complexity.
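A minimal sketch of that single-container wiring (the build path and route names are assumptions): the same FastAPI app exposes the WebSocket API and serves the built React SPA.

```python
from fastapi import FastAPI, WebSocket
from fastapi.staticfiles import StaticFiles

app = FastAPI()

@app.websocket("/ws/{session_id}")
async def ws_endpoint(websocket: WebSocket, session_id: str):
    await websocket.accept()
    ...  # Live session bridge runs here

# Mounted last so API routes win; html=True serves index.html for SPA paths.
app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="spa")
```

Same origin for API and SPA also means no CORS configuration at all.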

What we learned

The best AI UX is a conversation, not a form. Every time we caught ourselves adding an input field — "pick your style", "enter your target audience" — we asked: why can't the agent just ask? Almost always, it could. Removing UI controls and replacing them with agent judgment made the product both simpler and more powerful. This is what "beyond the text box" means in practice.

Function calling is the bridge between modalities. The Live API's function calling turns out to be the perfect orchestration layer for creative workflows. The agent's ability to narrate between tool calls — explaining what it's generating and why — creates the interleaved multimodal experience that makes BrandStorm feel like working with a real creative director, not using a tool.

Real-time voice changes the feedback loop. When a user can just say "make it bolder" mid-conversation and hear the agent respond "absolutely — I'll punch up the saturation on that palette..." before regenerating, the entire dynamic shifts. The user stops thinking in prompts and starts thinking in preferences. That's when the agent becomes a collaborator.

Simplicity scales. Cleverness breaks. The most important architectural lesson: the backend that tries to anticipate the agent, pre-generate assets, and parse structured output from speech will eventually fight the agent for control — and lose. The backend that just executes tools and gets out of the way lets the agent be what it is: autonomous, creative, and surprisingly reliable.

What's next for BrandStorm

  • Video assets via Veo — 6-second brand intro clips auto-generated from the hero shot and brand story narration, completing the content stack
  • Persistent brand memory — return to refine your brand over multiple sessions; the agent remembers previous creative decisions and evolves the identity over time
  • Style transfer — upload a competitor or inspiration brand and ask the agent to position yours distinctly in the same territory
  • Direct publish — one-click push to Shopify store, Instagram bio, or Canva workspace
  • Agency white-label — brand consultants run BrandStorm sessions with their own clients under their own brand

Built With

  • application-default-credentials
  • audioworklet
  • docker-python-3.11
  • fastapi
  • gemini-image-generation-(gemini-3.1-flash-image-preview)
  • gemini-live-api-(gemini-live-2.5-flash-native-audio)
  • gemini-tts-(gemini-2.5-flash-preview-tts)
  • google-cloud
  • google-cloud-run
  • google-genai-sdk
  • motion-(framer-motion)
  • pillow
  • python-3.11
  • react-19
  • tailwind-css-4
  • uvicorn
  • vertex-ai
  • vite-7
  • web-audio-api
  • websocket