Mimesis - Your Team's AI Creative Director for Commercial Ideation

μίμησις (mímēsis), imitation, representation. In Plato and Aristotle, the idea that poetry, painting, and theater imitate human actions, character, and reality.

Part 1 - The Story

Inspiration

I've worked in production studios where creating a commercial meant weeks of pre-production before a single frame was captured. Briefs, discovery sessions, treatments, storyboards, casting: each step exists for a reason. That process isn't overhead. It's where the meaning is built.

When I started experimenting with AI video generation, the results looked technically impressive. But something felt wrong. The output was clean, but it meant nothing. It had no why.

And I realized: the problem wasn't the model. It was that I'd skipped every step that gives a commercial its soul.

We've all experienced this: being stuck on a project, walking away frustrated, and finding the answer two hours later in the shower. That's not magic. That's your brain doing asynchronous work, absorbing, connecting, making sense of everything you've fed it. The pre-production process in advertising works exactly the same way. It forces reflection. It creates meaning. It ensures that when you press record, you know exactly what you're trying to say and who you're saying it to.

So I didn't want to shortcut that process. I wanted to make it more creative, with faster feedback loops, so you spend less time in the abstract and more time reacting to something real.

What Mimesis Does

Mimesis is the collaborator you brief on the spot. Through real-time voice conversation, she runs a full commercial production pipeline, from brand research to final video.

She works with any brand that has an internet presence: Chanel, Netflix, Coca‑Cola, Nike, Apple. No URL needed: give a name, and she starts working.

The interface is built like a constellation: the brand's mission, its enemy, its strategy, its campaign history, all visible at a glance. A living brief that thinks with you. Drop a product image and she reads it, connects it to the brand DNA, and pitches a direction. She assists the room; she doesn't replace it.

See It In Action

Here is the full flow, from brand name to final video:

Step 1 - Brand Research & Exploration

Brand Research

Start a session and activate the voice agent. Mimesis greets you and asks which brand you're working on. Give any brand name, and she dispatches 5 parallel workers that autonomously research the brand's visual identity, strategy, latest news, viral campaigns, cultural symbols, and philosophy. Results stream into the UI as they arrive.

Once all the research is displayed, you can interact freely:

  • Ask for details on any section: "Tell me more about their strategy."
  • Focus on a section: "Show me just the news." The UI cinematically isolates that panel.
  • Ask Mimesis to connect the dots: "What link can you make between the keywords and Chanel's strategy?"

Step 2 - Creative Briefing

Creative Briefing

When you have a clear understanding of the brand, say: "Let's move to the next step."

Mimesis takes the lead and conducts a rapid creative interview through 3–4 questions: What's the objective? What product are we showcasing? Who is the target? What emotion should the ad evoke?

💡 At any point you can upload images by dragging them onto the screen: a product photo, a color palette, a moodboard, or any visual inspiration for the scenario. Mimesis analyzes the visual mood, colors, and creative potential, and links it back to the brand DNA.

You can track everything Mimesis has collected in the Memory component (bottom-left of the UI).

Step 3 - Scenario & Sequence

Master Sequence

Once the brief is complete, Mimesis asks if you have any scenario ideas: a situation, a character, a visual concept. Share your vision or let her generate from scratch.

She builds a Master Sequence: a 6-act emotional arc (Hook → Context → Product Entry → Transformation → Climax → Resolution). The timeline appears on screen. You can approve it as-is, or request changes by voice: "Make the hook more aggressive", "Swap scenes 3 and 4." The sequence regenerates automatically.
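
To make the arc concrete, here is a minimal sketch of how such a sequence could be represented and edited in code; the `Scene` fields and the `swap_scenes` helper are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

# The six acts of the emotional arc, in order.
ACTS = ["Hook", "Context", "Product Entry", "Transformation", "Climax", "Resolution"]

@dataclass
class Scene:
    act: str          # which act of the arc this scene covers
    description: str  # what happens on screen
    emotion: str      # the feeling the scene should land

def swap_scenes(sequence: list[Scene], i: int, j: int) -> list[Scene]:
    """Apply a voice edit like 'Swap scenes 3 and 4' (1-indexed)."""
    sequence[i - 1], sequence[j - 1] = sequence[j - 1], sequence[i - 1]
    return sequence
```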

Step 4 - Art Direction (Keyframes)

Art Direction

Once the sequence is locked, Mimesis launches the Production Workshop:

  1. An Anchor Image is generated to define the visual DNA (lighting, palette, mood).
  2. Keyframes (start/end reference images) are generated for all 6 scenes using Imagen 3 (see the sketch after this list). Browse them scene by scene, give feedback, and regenerate individual scenes by voice.
  3. When all scenes are approved, the visual storyboard is finalized.
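
As a rough illustration of step 2, here is what a keyframe call could look like with the google-genai SDK; the model id, prompt wiring, and `visual_dna` argument are assumptions, not the project's actual code:

```python
from google import genai
from google.genai import types

# Assumes Vertex AI credentials are configured in the environment.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

def generate_keyframe(scene_description: str, visual_dna: str) -> bytes:
    """Render one reference image, folding the anchor image's visual DNA
    (lighting, palette, mood) into the prompt so every scene matches."""
    response = client.models.generate_images(
        model="imagen-3.0-generate-002",  # assumed model id
        prompt=f"{scene_description}. Visual style: {visual_dna}",
        config=types.GenerateImagesConfig(number_of_images=1),
    )
    return response.generated_images[0].image.image_bytes
```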

Step 5 - Final Video Generation

Ask Mimesis to generate the final commercial. Veo 3.1 produces cinematic clips for each scene, and FFmpeg stitches them into a single video. Based on the scenes created, the sequence is extended to 10 to 14 scenes by inserting product insert shots. When complete, the final ad appears on screen, ready for review.

Resolution VIDEO
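
The stitching step can be done with FFmpeg's concat demuxer. A minimal sketch, assuming each scene clip already exists locally (paths and helper name are illustrative):

```python
import subprocess
import tempfile

def stitch_clips(clip_paths: list[str], output_path: str) -> None:
    """Concatenate the per-scene clips into one video with FFmpeg's concat demuxer."""
    # Write the file list that the concat demuxer expects.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
        list_file = f.name
    # -c copy avoids re-encoding when all clips share one codec and resolution.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", output_path],
        check=True,
    )
```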

Part 2 - Under The Hood

Architecture

Mimesis is built around three interconnected layers.

Global System Architecture

The user interacts through a Next.js frontend connected to a FastAPI + Uvicorn backend deployed on Cloud Run. The Gemini Live API handles the real-time bidirectional voice session. In parallel, background workers (Gemini 2.5 Flash) fire independently, each executing a Google Search query, processing the result, and streaming it directly into the UI via WebSocket as it arrives. Final assets (images, videos) are delivered via Google Cloud Storage.

Global System Architecture
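
A minimal sketch of that fan-out, assuming FastAPI WebSockets; `run_search_grounded_gemini` is a hypothetical stand-in for the actual worker logic, and the topic split across the five workers is illustrative:

```python
import asyncio
from fastapi import WebSocket

TOPICS = ["visual identity", "strategy", "latest news",
          "viral campaigns & cultural symbols", "philosophy"]  # five workers

async def research_worker(brand: str, topic: str, ws: WebSocket) -> None:
    """One worker: run a search-grounded Gemini query, stream the result."""
    try:
        result = await run_search_grounded_gemini(brand, topic)  # hypothetical helper
        await ws.send_json({"type": "research_result", "topic": topic, "data": result})
    except Exception:
        # Signal completion even on failure so the pipeline never deadlocks.
        await ws.send_json({"type": "research_failed", "topic": topic})

async def dispatch_research(brand: str, ws: WebSocket) -> None:
    # Fire all workers at once; each streams into the UI as soon as it lands.
    await asyncio.gather(*(research_worker(brand, t, ws) for t in TOPICS))
```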

Agent Communication, Webhooks & State Store

Each background worker is a discrete unit. When a worker completes its search, it writes its result to a persistent state store and emits a notification to the Live Agent via webhook. The agent reads the authoritative data from the store, never from the notification preview, and reacts creatively, triggering a UI update via function call. This ensures the voice session and the data pipeline never block each other.

Agent Communication
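
In code, the pattern could look like the following sketch; the in-memory dict and the `notify_agent` / `update_ui` callbacks are stand-ins for the real persistent store, webhook, and function-call plumbing:

```python
# In-memory stand-in for the persistent state store (illustrative).
STATE_STORE: dict[str, dict] = {}

async def worker_completed(worker_id: str, result: dict, notify_agent) -> None:
    """Worker side: persist first, then emit a lightweight webhook notification."""
    STATE_STORE[worker_id] = result
    await notify_agent({"event": "worker_done", "worker_id": worker_id})  # no payload

async def handle_notification(event: dict, update_ui) -> None:
    """Agent side: ignore any preview, read the authoritative record from the store."""
    data = STATE_STORE[event["worker_id"]]
    await update_ui(data)  # in the real system, a function call drives the UI
```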

MCP Server, Detailed Architecture

The MCP server exposes the tools the agent uses to control the UI, query brand memory, and manage the upload zone. Each tool call from the agent translates into a targeted UI event: a panel moves, a section lights up, an image zone appears. The MCP layer is what makes the interface a direct extension of the agent's reasoning.

MCP Server Architecture
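
Here is a sketch of what such tools could look like with the MCP Python SDK's FastMCP; the tool names and the `publish_ui_event` bus are assumptions, not the project's actual tool set:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mimesis-ui")

@mcp.tool()
def focus_panel(section: str) -> str:
    """Cinematically isolate one research panel (e.g. 'news')."""
    publish_ui_event({"type": "focus_panel", "section": section})  # hypothetical event bus
    return f"Panel '{section}' is now focused."

@mcp.tool()
def show_upload_zone() -> str:
    """Reveal the drag-and-drop image zone on screen."""
    publish_ui_event({"type": "show_upload_zone"})
    return "Upload zone visible."
```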

How It Was Built

The core principle behind Mimesis is simple: every agent output is a UI event. The interface has no static state. When Mimesis speaks, the screen reacts.

Building that required assembling several layers that don't naturally want to coexist:

| Layer | Technology |
| --- | --- |
| Voice Agent | Gemini Live API (bidirectional audio streaming) via Vertex AI |
| Brand Intelligence | Gemini 2.5 Flash + Google Search (5 parallel async workers) |
| Image Generation | Imagen 3 |
| Video Generation | Veo 3.1 |
| Backend / Orchestration | FastAPI + Uvicorn + WebSockets, deployed on Cloud Run |
| Storage & Delivery | Google Cloud Storage |
| Frontend | Next.js + GSAP (cinematic animations) |

The voice session is fully bidirectional and interruptible. While Mimesis speaks, background workers fire in parallel, streaming results directly into the UI. No waiting. No batch loading. The interface is a zero-gravity system: the screen is always a reflection of where the conversation is.

But the hardest constraint wasn't technical, it was human. Mimesis is used during a creative session, which means the user is in a state of flow. Friction is the forbidden word. One dropped connection, one stuttering response, one misaligned event, and the session breaks. It's the same feeling as losing your internet connection in the middle of something important. You don't recover from it.

Challenges & System Design

Aligning voice, model execution, worker responses, and UI events, all within seconds, consistently, without a single break: that's the invisible work behind Mimesis.

| Challenge | How it was solved |
| --- | --- |
| Gemini bidi → Next.js | The reference repo wasn't built for production. SSR + React state + low-latency audio → rebuilt the WebSocket layer from scratch. |
| UI / Agent sync | Audio playback, worker responses, and animations must align to the millisecond. Every bug required debugging 3 systems at once. |
| 4 models, 1 coherent output | Live API, Flash, Imagen, and Veo each have different creative contracts. Treated as separate disciplines, not variations. |
| Session drops (1011, 500, 503) | 4-phase lifecycle + exponential backoff retries (×3). Frontend notified at each attempt so Mimesis stays in character. |
| Worker crashes | Each worker is an isolated asyncio task with try/except. Signals completion even on failure → pipeline never deadlocks. |
| Malformed LLM JSON | 4-level fallback: strict parse → regex extraction → trailing comma repair → control character cleanup (sketched below). |
| MCP session routing | MCP tools run as subprocesses (no shared session). 4 fallback strategies: direct lookup → state store → ADK tracking → queue. |
| State consistency | Every update is dual-written: frontend (WebSocket) + ADK session.state. Partial-success handling if either fails. |
| Dynamic memory | Agent writes to memory mid-conversation, one field or many at once. The system returns a checklist of what's still missing. |
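
As an illustration, the four-level JSON fallback from the table could look like this sketch (not the project's exact code):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Four-level fallback for malformed LLM JSON."""
    try:
        return json.loads(raw)                            # 1. strict parse
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", raw, re.DOTALL)          # 2. extract the JSON object
    candidate = match.group(0) if match else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    repaired = re.sub(r",\s*([}\]])", r"\1", candidate)   # 3. drop trailing commas
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass
    cleaned = re.sub(r"[\x00-\x1f]", " ", repaired)       # 4. strip control characters
    return json.loads(cleaned)                            # raise if still broken
```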

Robustness & Grounding

Mimesis never invents brand data. Here's how.

Grounding - No hallucinations by design

All brand intelligence comes from live web data, not model memory. Each of the 5 research workers calls Gemini 2.5 Flash with google_search enabled, at low temperature (0.2 for facts, 0.5 for creative interpretation). The voice agent never performs searches itself; it receives structured, verified JSON and reacts to specific data points. When the state store says primary_color: "#C70039", she reacts to that exact red. She doesn't guess.
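
A sketch of one such worker call with the google-genai SDK; the `google_search` tool and the temperatures mirror the description above, while the model id and prompt are assumptions:

```python
from google import genai
from google.genai import types

# Assumes Vertex AI credentials are configured in the environment.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

def grounded_research(brand: str, topic: str, creative: bool = False) -> str:
    """One worker call: live web data via google_search, low temperature."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"Research the {topic} of the brand {brand}. Return structured JSON.",
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
            temperature=0.5 if creative else 0.2,  # 0.2 for facts, 0.5 for interpretation
        ),
    )
    return response.text
```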

Mid-session, every question routes through get_brand_memory, the grounded state store, never the model's own recall. If data isn't ready yet: "My team is still working on that." No fabrication.

Product images are grounded twice: the raw image goes to the Live model for real-time visual reaction, while a separate Vision worker extracts structured analysis (color codes, mood, brand alignment). Both channels converge, but neither invents what the other didn't provide.

The same principle applies to the creative brief: every decision is written to the state store the moment it's expressed, verified against a checklist, and locked only when the user explicitly validates it. The brief is never reconstructed from memory.
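
A minimal sketch of that write-then-checklist loop, with illustrative field names:

```python
REQUIRED_FIELDS = ["objective", "product", "target", "emotion"]
brief: dict[str, str] = {}

def update_brief(**fields: str) -> list[str]:
    """Write one field or many at once; return what's still missing."""
    brief.update(fields)
    return [f for f in REQUIRED_FIELDS if f not in brief]

# The agent records two answers mid-conversation...
missing = update_brief(objective="launch awareness", emotion="wonder")
# ...and missing == ["product", "target"] tells Mimesis what to ask next.
```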

Resilience - The session never breaks

Anti-repetition rules in the system prompt prevent the agent from looping on topics already covered, a common failure mode in long voice sessions. The Veo pipeline uses 3 generation strategies per scene (keyframe interpolation → video extension → text-only with reference images), falling back automatically if one fails. If GCS upload fails, the system continues with a local URI so analysis still works. Every layer has a fallback, because in a creative session, one interruption kills the flow.
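
In outline, the per-scene fallback chain could look like this; the three strategy functions are hypothetical stand-ins for the actual Veo calls:

```python
def generate_scene_clip(scene: dict) -> bytes:
    """Try each generation strategy in order; fall through on failure."""
    strategies = [
        interpolate_from_keyframes,   # 1. start/end keyframe interpolation
        extend_previous_clip,         # 2. extend the preceding video
        text_only_with_references,    # 3. prompt plus reference images only
    ]
    last_error: Exception | None = None
    for strategy in strategies:       # hypothetical per-strategy helpers
        try:
            return strategy(scene)
        except Exception as err:
            last_error = err          # log and move to the next strategy
    raise RuntimeError(f"All strategies failed for scene {scene}") from last_error
```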

What I Learned

That the hardest part of building an AI product isn't the AI. It's the architecture around it, the plumbing that makes it feel effortless.

Voice changes everything. Speaking an idea generates ten times more nuance than typing it. Once I built around voice, every other decision followed: the UI had to be reactive, the workers had to be silent, the state had to be real-time. Voice was the constraint that shaped the entire system.

I also learned that debugging a voice + worker + UI system is a new discipline entirely. You can't reproduce bugs by clicking. You have to replay a conversation, watch three log streams at once, and figure out which millisecond broke the sync. That skill didn't exist before this project.

And the real metric for creative tools isn't "does it work", it's "does it keep you in the zone." That's a much harder bar to clear, but it captivated me.

What's Next

  • Camera awareness: point your phone at a product or moodboard and Mimesis reacts in real time. She could even read facial expressions: "I see you're skeptical, let me try a different direction."

  • Audience persona targeting: drop a LinkedIn profile URL of the person you're targeting, and Mimesis personalizes the commercial for that exact profile (you could become the main character of your own ad).

  • Export pipeline: DaVinci Resolve, Premiere Pro, CapCut.


Proof of deployment: https://vimeo.com/1174190118?share=copy&fl=sv&fe=ci


Thank you for reading my work, it was a long journey!

Built for the Gemini Live Agent Challenge, Creative Storyteller category. Deployed on Google Cloud Run. Demonstrated with Google Pixel as the client brand.

Lazreq.
