EMANATOR: Speak a dream. Watch it become cinema.

EMANATOR starts with spoken memory. It extracts the key emotional and visual anchors, reconstructs them into a storyboard, and then turns that same narrative into connected cinematic artefacts: a poster, a trailer, and a final film. These are not separate outputs; they are different expressions of the same remembered experience. That is what makes the system interleaved: each medium builds on the last to create one cohesive storytelling flow.

EMANATOR differs from a simple "text, then image, then video" dump because its outputs are chained through a shared story state. The system extracts a structured memory, asks the user to confirm anchor facts, reconstructs a storyboard from those anchors, and then uses that reconstructed sequence to generate the later artefacts such as the trailer and poster. See the interleaved storytelling implementation in the GitHub repo: docs/interleaved-storytelling.md
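To make the chaining concrete, here is a minimal sketch of a shared story state threaded through the stages. The `StoryState` fields and `build_trailer_prompt` helper are hypothetical illustrations, not EMANATOR's actual code; see docs/interleaved-storytelling.md for the real implementation.

```python
from dataclasses import dataclass, field

@dataclass
class StoryState:
    """Illustrative shared story state passed between generation stages."""
    memory_draft: str = ""                                  # structured memory extracted from speech
    anchors: list[str] = field(default_factory=list)        # user-confirmed facts
    storyboard: list[dict] = field(default_factory=list)    # reconstructed frame sequence

def build_trailer_prompt(state: StoryState) -> str:
    """Later artefacts (trailer, poster) are prompted from the same state,
    so they stay consistent with the confirmed anchors and storyboard."""
    frames = "\n".join(f"- {frame['caption']}" for frame in state.storyboard)
    anchors = "; ".join(state.anchors)
    return (
        f"Write a trailer voiceover grounded in these confirmed anchors: {anchors}.\n"
        f"Follow the storyboard beats:\n{frames}"
    )
```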

Inspiration

Everyone has stories trapped in their heads—vivid, fleeting dreams, cherished memories, or wild creative ideas. Yet, translating these internal worlds into external media usually requires an entire production crew, weeks of labor, and high-end technical expertise. We were inspired by the massive gap between imagination and execution. Whether it's therapists using guided narration to help patients externalize their inner worlds, educators bringing history lessons to life, indie screenwriters rapid-prototyping a script, or accessibility advocates seeking pure voice-first interfaces, we wanted to build a bridge. We asked ourselves: What if the process of creating a cinematic storyboard was as effortless as simply speaking out loud? This question birthed EMANATOR for the Gemini Live Agent Challenge.

What it does

EMANATOR is a voice-first creative director agent accessed directly in your browser. You simply speak a memory or dream into your microphone. In real time, the system:

  1. Transcribes and Analyzes: Extracts a structured memory draft alongside user-confirmed "anchors" (verifiable facts).
  2. Generates a Director Bible: Derives a cohesive visual style, cinematic motifs, character sheets, and a unified color palette based on your voice input.
  3. Paints Storyboards via Interleaved Output: Streams 6-10 cinematic frames with contextual narrative captions as they generate.
  4. Produces a Trailer and Poster: Drafts a narrative voiceover script, synthesizes it to speech, and lays out key art with a tagline.
  5. Persists a Memory Library: Every spoken dream is saved as a versioned artifact with complete provenance tagging, allowing memory worlds to branch, fork, and evolve dynamically.

The entire experience is fully streaming. As you speak, your thoughts progressively materialize before your eyes into a rich, Hollywood-quality storyboard.

How we built it

At the core of EMANATOR is Gemini 2.5 Flash and the Google GenAI SDK, orchestrated via a robust, production-grade, cloud-native architecture.

  1. The Real-Time API Engine: We built an asynchronous Python 3.11+ backend with FastAPI. A bespoke 8-step pipeline chain yields Server-Sent Events (SSE) continuously to a Next.js 15 React frontend. This ensures frames, audio, and descriptive metadata materialize instantly without blocking HTTP round-trips. (A minimal SSE sketch appears below, after the Challenges section.)
  2. Interleaved Modality (The Secret Sauce): Answering the call of the Creative Storyteller category, we leaned entirely into Gemini's response_modalities=["TEXT", "IMAGE"]. Rather than generating text prompts and serially waiting for a separate image model, a single Gemini call outputs both narrative context and imagery. (A sketch of the interleaved call appears below.)
  3. The Mechanic Registry: We architected 16 independent "mechanics" (e.g., Chamber Mode, Fear Signal, Emotional Coherence) as functional middleware. They intercept prompts and mutate the LLM context dynamically to explore different creative dimensions. (A registry sketch appears below.)
  4. Cloud Infrastructure: EMANATOR runs securely on Google Cloud Run, leveraging Firestore for stateful session persistence and Cloud Storage for asset storage. To make the project instantly reproducible by any judge or open-source contributor, we engineered fully automated, idempotent IaC bash scripts (gcp-setup.sh and deploy.sh).

Challenges we ran into

Taming a multimodal, highly non-deterministic LLM pipeline into a predictable, stateful streaming engine presented immense challenges.

  1. Orchestrating Asynchronous Streaming: Extracting concurrent, out-of-order interleaved text and images over a single stateful SSE connection without losing frame-to-audio synchronization required rigorous async loop management and concurrency semaphores. Handling the browser's audio context alongside React state transitions while raw Gemini byte-streams arrived was a complex balancing act.
  2. Contextual Drift & Provenance: As a visual story extends across 10 distinct frames, LLMs tend to drift stylistically. We defined "Anchors" and "Provenance Layers" to pin down narrative truth. To combat volatility, we modeled an Emotional Coherence metric, dynamically calculating narrative valence $E(t)$ over progression $t$:

     $$ \mathcal{C} = \exp\left( -\lambda \int_0^T \left| \frac{dE}{dt} \right|^2 dt \right) $$

     By keeping the generative derivative structurally smooth via constraints evaluated during the Baseline Step, we prevented jarring tonal shifts between frames. (A numeric sketch of this score appears below.)
  3. API Quotas & Volatile Topologies: Rapidly iterating over large context windows in preview models naturally triggered RESOURCE_EXHAUSTED ceilings. We engineered incredibly flexible failovers, allowing the system to fall back gracefully between Gemini's native image engine, Vertex AI Imagen, and proxy setups, alongside cascading TTS degradation strategies (from ElevenLabs down through Google Cloud TTS and edge-tts).
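For context on the Real-Time API Engine above, here is a minimal sketch of the SSE pattern: an async generator yields `text/event-stream` chunks through FastAPI's `StreamingResponse`. The route, event names, and stage list are illustrative placeholders, not EMANATOR's actual 8-step pipeline.

```python
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def pipeline_events(prompt: str):
    """Yield pipeline results as Server-Sent Events as soon as each stage finishes."""
    # Stage names are illustrative placeholders, not the real 8-step pipeline.
    for stage in ("memory_draft", "director_bible", "storyboard_frame", "trailer_audio"):
        await asyncio.sleep(0)  # stand-in for an awaited Gemini / TTS call
        payload = json.dumps({"stage": stage, "prompt": prompt})
        yield f"event: {stage}\ndata: {payload}\n\n"

@app.get("/stream")
async def stream(prompt: str):
    # text/event-stream keeps the HTTP response open so the browser receives
    # frames, audio references, and metadata incrementally.
    return StreamingResponse(pipeline_events(prompt), media_type="text/event-stream")
```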
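The interleaved call itself boils down to a single Google GenAI SDK request with `response_modalities=["TEXT", "IMAGE"]`, then walking the returned parts. A minimal sketch; the model id and prompt are assumptions, so substitute whichever image-capable Gemini model you use.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key / Vertex settings from the environment

# One call returns interleaved narrative text and image parts.
# The model id below is an assumption, not necessarily what EMANATOR uses.
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents="Storyboard frame 3: the lighthouse at dusk, wide establishing shot.",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.text:
        print("caption:", part.text)            # narrative context for the frame
    elif part.inline_data:
        image_bytes = part.inline_data.data      # raw image bytes to store or stream to the UI
        print("image:", part.inline_data.mime_type, len(image_bytes), "bytes")
```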
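A mechanic registry can be as simple as named functions that each transform the prompt context before the Gemini call. The sketch below is hypothetical (the mechanic names and context keys are made up for illustration) rather than EMANATOR's actual registry.

```python
from typing import Callable

# A "mechanic" is prompt middleware: it takes the LLM context dict and returns a mutated copy.
Mechanic = Callable[[dict], dict]
MECHANICS: dict[str, Mechanic] = {}

def register(name: str):
    def wrap(fn: Mechanic) -> Mechanic:
        MECHANICS[name] = fn
        return fn
    return wrap

@register("fear_signal")
def fear_signal(ctx: dict) -> dict:
    # Illustrative: push the visual direction toward dread.
    return {**ctx, "style_notes": ctx.get("style_notes", "") + " Heighten dread in lighting and framing."}

@register("chamber_mode")
def chamber_mode(ctx: dict) -> dict:
    # Illustrative: constrain the story to a single confined location.
    return {**ctx, "constraints": ctx.get("constraints", []) + ["single interior location"]}

def apply_mechanics(ctx: dict, enabled: list[str]) -> dict:
    """Run the enabled mechanics in order before the prompt is sent to the model."""
    for name in enabled:
        ctx = MECHANICS[name](ctx)
    return ctx
```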
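The coherence score $\mathcal{C}$ above can be evaluated numerically from per-frame valence estimates by discretising the integral. A minimal sketch; the valence values and $\lambda$ are illustrative, not EMANATOR's tuned parameters.

```python
import numpy as np

def emotional_coherence(valence: np.ndarray, dt: float = 1.0, lam: float = 0.1) -> float:
    """C = exp(-lambda * integral of |dE/dt|^2 dt), discretised over sampled frame valences.
    Values near 1 indicate a smooth emotional arc; lower values flag jarring tonal jumps."""
    dE_dt = np.diff(valence) / dt          # finite-difference estimate of dE/dt
    integral = np.sum(dE_dt ** 2) * dt     # rectangle-rule approximation of the integral
    return float(np.exp(-lam * integral))

# Illustrative frame-by-frame valence estimates (e.g. -1 = despair, +1 = joy)
smooth = np.array([0.1, 0.2, 0.35, 0.5, 0.6, 0.7])
jarring = np.array([0.1, 0.9, -0.8, 0.7, -0.6, 0.8])
print(emotional_coherence(smooth))   # ~0.99: gentle arc
print(emotional_coherence(jarring))  # ~0.39: penalised for abrupt tonal shifts
```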
Accomplishments that we're proud of

  1. True Interleaved Speed: Successfully harnessing Gemini's native text+image interleaved capabilities shaved monumental time off our pipeline's inference loop, providing an immediate "wow" factor that serial generative pipelines struggle to match.
  2. The Mechanic Abstraction Layer: Creating robust, interchangeable lenses (like the Disposability Lens or Messiness Dial) that structurally change target artistic direction without breaking strict JSON schema adherence limits.
  3. A Frictionless, Cinematic UX: The zero-click, voice-driven interface feels inherently magical. Users speak to the machine, and the machine intimately dreams it back to them.
  4. Production-Ready Deployments: Proving that sophisticated, heavily-permissioned GCP architectures (spanning 8 APIs, custom IAM, and decoupled services) can be configured and deployed automatically from bash scripts.
What we learned

  1. Simultaneous Modalities are Superior: Giving an LLM the power to reason about conceptual visual framing while writing narrative text yields far more coherent, narratively dense storyboards than running text and vision explicitly in separate steps.
  2. The Vitality of Transparency: End-users highly valued our "Designer Notes" UI, a panel explaining exactly why the AI chose a specific artistic motif. It builds immense trust, turning users from passive observers into active collaborative directors.
  3. Resilient Architectures: We learned exactly how to write defensive, fault-tolerant generator architectures capable of self-healing or fast-failing when image byte streams or JSON structures randomly malform in the wild.

What's next for EMANATOR

We see EMANATOR expanding from an individual creative scratchpad into a globally explorable, federated dreaming network.
  1. Mobile PWA Support: Capturing fleeting dreams directly upon waking through a native app interface that syncs out-of-band when connected.
  2. Collaborative Story Canon ($\mathcal{W}_i \rightarrow \mathcal{W}_{i+1}$): Permitting multiple users to securely branch "Worlds" together, forming deep recursive lineage trees of a shared cinematic universe.
  3. Playable Video Export: Automatically stitching all storyboard frames, synchronized narrative context, and synthesized audio into a continuously auto-scored MP4 reel that users can instantly post.
  4. Plugin Marketplace: Publishing our Mechanic Registry externally, enabling the community to submit custom LLM middleware to steer stylistic directions dynamically (e.g., a Noir Mod or Cyberpunk Lens).

EMANATOR is an archive for human imagination. Speak a dream. Watch it become cinema.

Please see the attached GIF and MP4 for the Google Cloud deployment.

DEMO VIDEO: https://youtu.be/6THbhWCvIw0

Built With

  • docker
  • elevenlabs-api
  • events
  • fastapi
  • gemini-2.5-flash
  • gemini-2.5-flash-image
  • google-cloud
  • google-cloud-firestore
  • google-cloud-run
  • google-cloud-text-to-speech
  • google-genai-sdk
  • next.js
  • python
  • react
  • server-sent-events
  • tailwind-css
  • typescript
  • vertex-ai