🌟 Inspiration
Stories have always been told through multiple senses — words, pictures, sound. Yet most AI tools generate these separately, breaking the creative flow. I wanted to build something that thinks and creates like a real creative director: weaving narration, visuals, and sound together in a single, fluid stream.
🔀 What it does
StoryWeaver AI transforms any story idea into a fully illustrated, narrated, cinematic experience:
- Text — Gemini 2.0 Flash generates poetic, emotionally resonant scene narration
- Images — Illustrations appear inline, exactly where Gemini interleaves them in the story
- Audio — Google Cloud TTS Neural2-F narrates the full story with a cinematic voice
- Video — A Ken Burns cinematic video stitches all scenes with slow zoom/pan effects and narration audio into a downloadable MP4
All four modalities are delivered in one cohesive, fluid output — just like a premium animated storybook.
🛠️ How we built it
The core of StoryWeaver AI is Gemini 2.0 Flash Preview Image Generation with responseModalities: ["TEXT", "IMAGE"] — a single API call that returns text and images natively interleaved in one response stream. This is what makes it truly different from apps that stitch modalities together separately.
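A minimal sketch of that single interleaved call, assuming the `google-genai` Python SDK and an API key in the environment; the helper names `generate_story` and `split_parts` are illustrative, not from the project:

```python
# Sketch: one request returns text and images interleaved in order.
# Assumes the google-genai SDK; model name is the one named above.

def generate_story(prompt: str):
    from google import genai          # deferred import: SDK not needed for parsing
    from google.genai import types
    client = genai.Client()           # reads GOOGLE_API_KEY from the environment
    return client.models.generate_content(
        model="gemini-2.0-flash-preview-image-generation",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["TEXT", "IMAGE"],  # request both in one call
        ),
    )

def split_parts(parts):
    """Walk the mixed parts of one response, preserving the interleaved
    order as (kind, payload) tuples: narration text or raw image bytes."""
    out = []
    for part in parts:
        if getattr(part, "text", None):
            out.append(("text", part.text))
        elif getattr(part, "inline_data", None):
            out.append(("image", part.inline_data.data))
    return out
```

In the app, `split_parts(response.candidates[0].content.parts)` would drive the inline rendering: text chunks become narration paragraphs, image chunks are displayed exactly where the model placed them.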
The stack:
- Gemini 2.0 Flash (`gemini-2.0-flash-preview-image-generation`) — native interleaved text + image output
- Google Cloud Vertex AI — model hosting and inference
- Google Cloud Text-to-Speech — Neural2-F voice narration
- MoviePy — Ken Burns cinematic video generation with ffmpeg
- Streamlit — interactive frontend
- Google Cloud Run — serverless deployment
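The Ken Burns step boils down to a time-varying zoom applied to each still. A minimal sketch, assuming MoviePy v1.0.3; the zoom rate, file names, and the `ken_burns_zoom` helper are illustrative, not the project's actual values:

```python
def ken_burns_zoom(t, rate=0.03):
    """Linear zoom factor for a Ken Burns effect: scale 1.0 at t=0,
    growing by `rate` per second (e.g. 1.18x after a 6s scene)."""
    return 1.0 + rate * t

# MoviePy v1.0.3 usage sketch (paths and durations are placeholders):
# from moviepy.editor import ImageClip
# clip = (ImageClip("scene.png")
#         .set_duration(6)
#         .resize(ken_burns_zoom)       # slow zoom-in over the scene
#         .set_position("center"))
```

The scene clips are then concatenated and muxed with the TTS narration track before export to MP4.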
🚧 Challenges we faced
- Gemini interleaved output was the hardest part — understanding how to correctly request `responseModalities` and parse the mixed text/image parts from a single response took significant experimentation
- Token expiry mid-session caused silent failures until we implemented per-call token refresh
- Imagen rate limits required retry logic with exponential backoff
- moviepy v2 breaking changes required pinning to v1.0.3 for stable Ken Burns video generation
- TTS gRPC vs REST — switched from the Python client library to direct REST API calls for reliability
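The rate-limit fix above can be sketched as a small retry wrapper. This is a generic exponential-backoff sketch, not the project's code; the function and parameter names are ours:

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0, retriable=(Exception,)):
    """Retry `call` on failure, sleeping base_delay * 2**attempt between
    tries (1s, 2s, 4s, ...); re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except retriable:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A production version would typically catch only the API's rate-limit error class and add jitter to the delays so that concurrent clients do not retry in lockstep.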
📚 What we learned
- How to use Gemini's native interleaved output — generating text and images in a single API call rather than separate pipelines
- How to build a truly multimodal agent that thinks like a creative director
- How to deploy a full-stack AI app on Google Cloud Run with Docker
✨ What's next
- Real-time streaming of the interleaved output as it generates
- Support for custom art styles and story genres
- Multi-language narration
- Export to ePub for e-reader format