🌟 Inspiration

Stories have always been told through multiple senses — words, pictures, sound. Yet most AI tools generate these separately, breaking the creative flow. I wanted to build something that thinks and creates like a real creative director: weaving narration, visuals, and sound together in a single, fluid stream.

🔀 What it does

StoryWeaver AI transforms any story idea into a fully illustrated, narrated, cinematic experience:

  • Text — Gemini 2.0 Flash generates poetic, emotionally resonant scene narration
  • Images — Illustrations appear inline, exactly where Gemini interleaves them in the story
  • Audio — Google Cloud TTS Neural2-F narrates the full story with a cinematic voice
  • Video — A cinematic Ken Burns treatment stitches all scenes, with slow zoom/pan effects and the narration audio, into a downloadable MP4

All four modalities are delivered in one cohesive, fluid output — just like a premium animated storybook.
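The audio step above boils down to one REST call. A minimal sketch, assuming Google Cloud TTS's `v1/text:synthesize` endpoint, the Neural2-F voice, and an OAuth bearer token (the helper names are ours, not a library API):

```python
import base64


def tts_payload(text, voice="en-US-Neural2-F"):
    """Request body for the Google Cloud TTS text:synthesize endpoint."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": "en-US", "name": voice},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def synthesize(text, token):
    """POST the payload and return the decoded MP3 bytes."""
    import requests  # third-party; kept local so the payload helper stays import-free

    resp = requests.post(
        "https://texttospeech.googleapis.com/v1/text:synthesize",
        headers={"Authorization": f"Bearer {token}"},
        json=tts_payload(text),
        timeout=60,
    )
    resp.raise_for_status()
    return base64.b64decode(resp.json()["audioContent"])
```

The response carries the audio as base64 in `audioContent`, so decoding it yields bytes you can write straight to an `.mp3` file.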

🛠️ How we built it

The core of StoryWeaver AI is Gemini 2.0 Flash Preview Image Generation with responseModalities: ["TEXT", "IMAGE"] — a single API call that returns text and images natively interleaved in one response stream. This is what makes it truly different from apps that stitch modalities together separately.
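A hedged sketch of that call and the parsing it requires. The commented-out request assumes the google-genai SDK; `split_interleaved` is our own illustrative helper, and the part attributes follow the SDK's response shape (`part.text` for text, `part.inline_data.data` for image bytes):

```python
# Request interleaved text + image output, then walk the mixed parts in order.
#
# from google import genai
# from google.genai import types
#
# client = genai.Client()
# response = client.models.generate_content(
#     model="gemini-2.0-flash-preview-image-generation",
#     contents="Tell an illustrated story about a lighthouse keeper.",
#     config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
# )
# parts = response.candidates[0].content.parts


def split_interleaved(parts):
    """Preserve story order: return ("text", str) / ("image", bytes) tuples."""
    ordered = []
    for part in parts:
        if getattr(part, "text", None):
            ordered.append(("text", part.text))
        elif getattr(part, "inline_data", None):
            ordered.append(("image", part.inline_data.data))
    return ordered
```

Keeping the parts in response order is the whole point: the model decides where each illustration belongs, and the UI just renders the list top to bottom.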

The stack:

  • Gemini 2.0 Flash (gemini-2.0-flash-preview-image-generation) — native interleaved text + image output
  • Google Cloud Vertex AI — model hosting and inference
  • Google Cloud Text-to-Speech — Neural2-F voice narration
  • MoviePy — Ken Burns cinematic video generation with ffmpeg
  • Streamlit — interactive frontend
  • Google Cloud Run — serverless deployment
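The Ken Burns step can be sketched against MoviePy v1.0.3's `editor` API (the image path, duration, and the 4%-per-second zoom rate are illustrative assumptions):

```python
ZOOM_RATE = 0.04  # assumed: zoom in roughly 4% per second


def ken_burns_scale(t, rate=ZOOM_RATE):
    """Scale factor for a slow zoom at time t (seconds)."""
    return 1.0 + rate * t


def ken_burns_clip(image_path, duration, audio_path=None):
    """One slowly zooming scene clip; concatenate clips and write_videofile for the MP4."""
    from moviepy.editor import AudioFileClip, ImageClip  # moviepy==1.0.3 (pinned)

    clip = ImageClip(image_path).set_duration(duration).resize(ken_burns_scale)
    if audio_path:
        clip = clip.set_audio(AudioFileClip(audio_path))
    return clip
```

Passing a function of `t` to `resize` is what animates the zoom: MoviePy re-evaluates the scale for every frame while ffmpeg handles the encode.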

🚧 Challenges we faced

  • Gemini interleaved output was the hardest part — understanding how to correctly request responseModalities and parse the mixed text/image parts from a single response took significant experimentation
  • Token expiry mid-session caused silent failures until we implemented per-call token refresh
  • Imagen rate limits required retry logic with exponential backoff
  • moviepy v2 breaking changes required pinning to v1.0.3 for stable Ken Burns video generation
  • TTS gRPC vs REST — switched from the Python client library to direct REST API calls for reliability
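The rate-limit handling above amounts to a small retry wrapper. A sketch with exponential backoff and jitter (the retry counts and delays are illustrative defaults):

```python
import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Run call(); on failure, sleep base_delay * 2**attempt (+ jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Wrapping each image-generation call in `with_backoff(lambda: generate(scene))` smooths over transient 429s without hammering the API.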

📚 What we learned

  • How to use Gemini's native interleaved output — generating text and images in a single API call rather than separate pipelines
  • How to build a truly multimodal agent that thinks like a creative director
  • How to deploy a full-stack AI app on Google Cloud Run with Docker

✨ What's next

  • Real-time streaming of the interleaved output as it generates
  • Support for custom art styles and story genres
  • Multi-language narration
  • Export to ePub for e-readers

Built With

  • docker
  • gemini-2.0-flash
  • google-cloud-run
  • google-cloud-text-to-speech
  • google-cloud-vertex-ai
  • moviepy
  • python
  • streamlit