🌟 Inspiration
Stories have always been told through multiple senses — words, pictures, sound. Yet most AI tools generate these separately, breaking the creative flow. I wanted to build something that thinks and creates like a real creative director: weaving narration, visuals, and sound together in a single, fluid stream.
🔀 What it does
StoryWeaver AI transforms any story idea into a fully illustrated, narrated, cinematic experience:
- Text — Gemini 2.0 Flash generates poetic, emotionally resonant scene narration
- Images — Illustrations appear inline, exactly where Gemini interleaves them in the story
- Audio — Google Cloud TTS Neural2-F narrates the full story with a cinematic voice
- Video — A Ken Burns cinematic video stitches all scenes with slow zoom/pan effects and narration audio into a downloadable MP4
All four modalities are delivered in one cohesive, fluid output — just like a premium animated storybook.
🛠️ How we built it
The core of StoryWeaver AI is Gemini 2.0 Flash Preview Image Generation with responseModalities: ["TEXT", "IMAGE"] — a single API call that returns text and images natively interleaved in one response stream. This is what makes it truly different from apps that stitch modalities together separately.
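A minimal sketch of that single interleaved call, assuming the `google-genai` Python SDK and an API key in the environment; the helper names `generate_story` and `split_parts` are illustrative, not from the project:

```python
# Sketch: one request returns text and images interleaved in order.
# Assumes the google-genai SDK; model name is the one named above.

def generate_story(prompt: str):
    from google import genai          # deferred import: SDK not needed for parsing
    from google.genai import types
    client = genai.Client()           # reads GOOGLE_API_KEY from the environment
    return client.models.generate_content(
        model="gemini-2.0-flash-preview-image-generation",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["TEXT", "IMAGE"],  # request both in one call
        ),
    )

def split_parts(parts):
    """Walk the mixed parts of one response, preserving the interleaved
    order as (kind, payload) tuples: narration text or raw image bytes."""
    out = []
    for part in parts:
        if getattr(part, "text", None):
            out.append(("text", part.text))
        elif getattr(part, "inline_data", None):
            out.append(("image", part.inline_data.data))
    return out
```

In the app, `split_parts(response.candidates[0].content.parts)` would drive the inline rendering: text chunks become narration paragraphs, image chunks are displayed exactly where the model placed them.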
The stack:
- Gemini 2.0 Flash (`gemini-2.0-flash-preview-image-generation`) — native interleaved text + image output
- Google Cloud Vertex AI — model hosting and inference
- Google Cloud Text-to-Speech — Neural2-F voice narration
- MoviePy — Ken Burns cinematic video generation with ffmpeg
- Streamlit — interactive frontend
- Google Cloud Run — serverless deployment
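The Ken Burns step boils down to a time-varying zoom applied to each still. A minimal sketch, assuming MoviePy v1.0.3; the zoom rate, file names, and the `ken_burns_zoom` helper are illustrative, not the project's actual values:

```python
def ken_burns_zoom(t, rate=0.03):
    """Linear zoom factor for a Ken Burns effect: scale 1.0 at t=0,
    growing by `rate` per second (e.g. 1.18x after a 6s scene)."""
    return 1.0 + rate * t

# MoviePy v1.0.3 usage sketch (paths and durations are placeholders):
# from moviepy.editor import ImageClip
# clip = (ImageClip("scene.png")
#         .set_duration(6)
#         .resize(ken_burns_zoom)       # slow zoom-in over the scene
#         .set_position("center"))
```

The scene clips are then concatenated and muxed with the TTS narration track before export to MP4.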
🚧 Challenges we faced
- Gemini interleaved output was the hardest part — understanding how to correctly request `responseModalities` and parse the mixed text/image parts from a single response took significant experimentation
- Token expiry mid-session caused silent failures until we implemented per-call token refresh
- Imagen rate limits required retry logic with exponential backoff
- moviepy v2 breaking changes required pinning to v1.0.3 for stable Ken Burns video generation
- TTS gRPC vs REST — switched from the Python client library to direct REST API calls for reliability
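The rate-limit fix above can be sketched as a small retry wrapper. This is a generic exponential-backoff sketch, not the project's code; the function and parameter names are ours:

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0, retriable=(Exception,)):
    """Retry `call` on failure, sleeping base_delay * 2**attempt between
    tries (1s, 2s, 4s, ...); re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except retriable:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A production version would typically catch only the API's rate-limit error class and add jitter to the delays so that concurrent clients do not retry in lockstep.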
📚 What we learned
- How to use Gemini's native interleaved output — generating text and images in a single API call rather than separate pipelines
- How to build a truly multimodal agent that thinks like a creative director
- How to deploy a full-stack AI app on Google Cloud Run with Docker
✨ What's next
- Real-time streaming of the interleaved output as it generates
- Support for custom art styles and story genres
- Multi-language narration
- Export to ePub for e-reader format