Inspiration
The creative industry today is at a crossroads. While AI tools for image, video, and text generation have exploded in capability, the workflow remains fundamentally fragmented. To launch a single campaign, a creator must act as an "AI operator," manually stitching together outputs from Midjourney, Runway, Luma, and ChatGPT.
Fluence AI was born from a simple, provocative question: What if the AI wasn't just a tool, but a Creative Director you could talk to in real-time?
We wanted to create a system where a single voice conversation could result in a cohesive, multi-media output stream, removing the "interface friction" from the creative process.
What it does
Fluence AI isn't just a chatbot; it's a glimpse into the future of human-AI collaboration in the creative arts. Fluence AI acts as a Live Creative Director. In a single voice conversation, it listens to your brand vision and simultaneously assembles a complete marketing campaign. It doesn't just "talk" about ideas; it uses Gemini Live and ADK to orchestrate parallel generation of hero images (Imagen 4), cinematic videos (Veo 3.1), platform-specific copy, and voiceovers, all delivered in a single, fluid, interleaved stream.
How I built it
Fluence AI is built on a high-concurrency architecture designed to handle the massive throughput of a live creative session.
Technical Stack
The core "brain" of the project is the Gemini Live API (using gemini-live-2.5-flash), which handles the bidirectional audio stream. We utilized the Google Agent Development Kit (ADK) to orchestrate a multi-agent system of specialized tools.
The backend is a FastAPI service deployed on Google Cloud Run, leveraging asynchronous workers to manage parallel tool calls. For instance, when a campaign is being "rendered," we execute multiple generation pipelines simultaneously, so the total render time is roughly:
$$ T_{total} = \max(T_{image}, T_{copy}, T_{audio}) + T_{video} $$
where $T_{video}$ is added in series because video generation cannot start until the hero image is complete and available as its reference frame.
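The timing formula above can be illustrated with a minimal asyncio sketch. This is not the project's actual code: `generate`, `render_campaign`, and the sleep durations are hypothetical stand-ins for the real generation calls, chosen only to show a concurrent first stage followed by a serial video stage.

```python
import asyncio
import time

# Hypothetical stand-in for a real generation call; the sleep only
# simulates latency and the durations below are made up.
async def generate(name: str, seconds: float, log: list[str]) -> str:
    log.append(f"start:{name}")
    await asyncio.sleep(seconds)
    log.append(f"done:{name}")
    return name

async def render_campaign(log: list[str]) -> float:
    started = time.perf_counter()
    # Stage 1: image, copy, and audio run concurrently,
    # costing max(T_image, T_copy, T_audio).
    await asyncio.gather(
        generate("image", 0.05, log),
        generate("copy", 0.02, log),
        generate("audio", 0.03, log),
    )
    # Stage 2: video starts only after stage 1, since it needs the
    # finished image as a reference frame, adding T_video in series.
    await generate("video", 0.04, log)
    return time.perf_counter() - started

if __name__ == "__main__":
    events: list[str] = []
    elapsed = asyncio.run(render_campaign(events))
    print(f"elapsed: {elapsed:.2f}s, events: {events}")
```

With these made-up durations the first stage is bounded by the slowest of its three tasks, and the video stage is then added on top, matching the shape of the formula.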
Multimodal Pipeline
Vertex AI Imagen 4: Used for core campaign visuals.
Vertex AI Veo 3.1: Generates 5-second cinematic loops from reference frames.
Cloud TTS: Synthesizes the campaign voiceover using SSML for precise emotive control.
Cloud Firestore: Acts as our "Brand Identity" grounding layer, ensuring every generation adheres to the client's guidelines.
Challenges I ran into
Building a live creative director presented significant engineering hurdles, particularly in state management and latency.
The Interruption Problem: When a user interrupts the AI mid-sentence to change the "vibe" of the campaign, we have to gracefully cancel all $n$ in-flight tool calls and reset the ADK context.
Streaming Synchronization: Maintaining a fluid narration -> generation -> reveal flow required a custom WebSocket protocol that delivers "chunks" of media as each one finishes, rather than waiting for the entire set to complete.
Concurrency Limits: Staying within the rate limits of high-end models like Veo 3.1 and Imagen 4 while the user is actively speaking required an "optimistic generation" strategy, where the agent predicts sub-tasks from the initial brief.
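As a rough illustration of the interruption problem, the sketch below tracks in-flight tasks and cancels them all at once using plain asyncio. `ToolSession`, `slow_tool`, and `demo_interrupt` are hypothetical names, not part of ADK or of this project, and resetting the ADK context is deliberately left out because it depends on the agent setup.

```python
import asyncio

class ToolSession:
    """Tracks in-flight tool tasks so an interruption can cancel them all."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def launch(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        # Drop the task from the tracking set once it finishes or is cancelled.
        task.add_done_callback(self._tasks.discard)
        return task

    async def interrupt(self) -> int:
        """Cancel every unfinished task and wait until the cancellations land."""
        pending = [t for t in self._tasks if not t.done()]
        for task in pending:
            task.cancel()
        # return_exceptions=True keeps CancelledError from propagating here.
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)

async def slow_tool(name: str, seconds: float) -> str:
    # Hypothetical tool; a real one would call a generation API.
    await asyncio.sleep(seconds)
    return name

async def demo_interrupt() -> tuple[int, list[bool]]:
    session = ToolSession()
    tasks = [session.launch(slow_tool(n, 10)) for n in ("image", "copy", "audio")]
    await asyncio.sleep(0.01)  # the user interrupts shortly after launch
    cancelled = await session.interrupt()
    return cancelled, [t.cancelled() for t in tasks]

if __name__ == "__main__":
    print(asyncio.run(demo_interrupt()))
```

Waiting on the cancelled tasks before returning means a new brief can be started without older tool calls still writing results into the session.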
Accomplishments that I'm proud of
True Interleaved Orchestration: Successfully implementing a multi-agent system where a single "Creative Director" root agent manages 7 specialized tools in parallel, providing a non-blocking, real-time experience.
Micro-Latency Bidi-Streaming: Mastering the LiveRequestQueue pattern in ADK to handle bidirectional audio while heavy GPU-bound generation tasks (like Veo 3.1) are running in the background.
Dynamic Interruption Handling: Engineering the system to instantly stop and pivot all creative tools when the user interrupts, making the AI feel like a responsive human collaborator rather than a rigid script.
Institutional Memory: Integrating Firestore as a "Brand Guard" layer that grounds every generation in existing brand guidelines, ensuring creative freedom never breaks brand safety.
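The non-blocking, interleaved delivery described above can be sketched with `asyncio.as_completed`, which yields results in completion order. This is only an assumption about how such a stream might be built: `produce`, `stream_blocks`, the message shape, and `fake_send` (a stand-in for a WebSocket send call) are all hypothetical.

```python
import asyncio
import json

async def produce(kind: str, seconds: float) -> dict:
    # Hypothetical generator; the sleep simulates model latency.
    await asyncio.sleep(seconds)
    return {"type": kind, "status": "ready"}

async def stream_blocks(send) -> None:
    """Send each media block as soon as it finishes, not after the whole set."""
    jobs = [
        produce("video", 0.06),
        produce("copy", 0.01),
        produce("image", 0.03),
    ]
    # as_completed yields awaitables in the order the jobs finish.
    for finished in asyncio.as_completed(jobs):
        block = await finished
        await send(json.dumps(block))

async def demo_stream() -> list[str]:
    sent: list[str] = []

    async def fake_send(message: str) -> None:
        # Stands in for a WebSocket send; here it just records the message.
        sent.append(message)

    await stream_blocks(fake_send)
    return [json.loads(m)["type"] for m in sent]

if __name__ == "__main__":
    print(asyncio.run(demo_stream()))
```

Because blocks are sent in completion order rather than submission order, the fast copy block reaches the client while the slower video is still rendering.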
What I learned
The Power of Interleaved Outputs: We observed that users trust the AI more when they can see the "process." By rendering the creative brief and the hero image as the AI talks about them, the agent feels less like a black box and more like a collaborator.
Agentic Orchestration: ADK's LiveRequestQueue is a game-changer for bidi-streaming. It allowed us to maintain a low-latency audio loop while running heavy GPU-bound tasks in the background.
Creative Grounding: The relationship between a brand profile in Firestore and the final output is quantifiable. We inject brand constraints directly into the ADK tool prompts and measure the result as:
$$ \text{Brand Consistency} = \left( \frac{\text{Validated Elements}}{\text{Total Generation Blocks}} \right) \times 100 $$
In our testing, this exceeded 95% across diverse brand identities.
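For concreteness, the brand consistency formula can be written as a small helper. The function name and the example counts are hypothetical and are not the project's test data.

```python
def brand_consistency(validated_elements: int, total_blocks: int) -> float:
    """Return the percentage of generation blocks that passed brand validation."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    if not 0 <= validated_elements <= total_blocks:
        raise ValueError("validated_elements must be between 0 and total_blocks")
    # Multiply before dividing to keep round percentages exact in floating point.
    return validated_elements * 100 / total_blocks

if __name__ == "__main__":
    # Illustrative numbers only: 19 validated blocks out of 20 generated.
    print(brand_consistency(19, 20))
```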
What's next for Fluence-AI
Google Ads Integration: Automated one-click deployment of the generated campaigns directly to Google Ads and social media platforms.
Multi-Shot Storyboarding: Expanding the video capabilities from single 5-second clips to full interleaved storyboards with consistent character/product anchoring.
Collaborative Mode: Allowing multiple users to join the same live session to "co-direct" a brand launch in real-time.
Advanced ROI Analytics: Providing "Predictive Performance" scores for the generated copy and visuals based on real-world marketing data.
Technologies Used
Gemini Live API: For real-time, bidirectional audio and visual reasoning.
Google Agent Development Kit (ADK): For orchestrating a multi-agent system of specialized creative tools.
Vertex AI (Imagen 4 & Veo 3.1): For generating high-fidelity campaign images and cinematic video clips.
Google Cloud Run: For hosting the scalable, asynchronous FastAPI backend.
Cloud Firestore & Cloud Storage: For stateful session management and persistent asset storage.
Cloud Text-to-Speech: For synthesizing high-quality campaign voiceovers using SSML.
React 19: For a responsive frontend shell designed for progressive rendering of streaming content blocks.