Inspiration

Storytelling is one of humanity's oldest art forms — but creating a produced story with visuals, voice, and video has always required expensive tools, creative teams, and hours of work. We asked ourselves: what if anyone could type an idea and watch it become a cinematic experience in seconds?

The rise of multimodal AI — especially Gemini 2.0 Flash and Imagen 3 — made this feel genuinely possible. We were inspired by the idea of democratizing content creation: a student, an educator, a first-time creator, or a seasoned storyteller should all be able to bring their imagination to life without any production skills.


What it does

StoryForge AI transforms any story idea into a fully produced cinematic video — streamed in real time.

  1. The user types a story idea (e.g. "A lone astronaut discovers an ancient civilization on Mars")
  2. Gemini 2.0 Flash generates a structured story bible — scenes, characters, setting, and narration scripts
  3. Imagen 3 generates a cinematic AI image for each scene
  4. Google Cloud Text-to-Speech (WaveNet) narrates each scene with professional voice audio
  5. FFmpeg + ImageMagick assemble the images and audio into a final MP4 video
  6. The finished video is delivered back to the user — all streamed live so they can watch the pipeline unfold in real time

Every step is visible to the user through a live progress tracker and scene cards that appear as they are generated.
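For illustration, the structured story bible from step 2 might look like the sketch below; the field names are assumptions, not the exact schema produced by story_bible_agent.py:

```python
# Illustrative shape of the Gemini-generated story bible.
# Field names are assumptions; the real schema may differ.
story_bible = {
    "title": "The Silent City of Mars",
    "setting": "An excavated alien metropolis beneath the Martian surface",
    "characters": [{"name": "Commander Vega", "role": "lone astronaut"}],
    "scenes": [
        {
            "id": 1,
            "image_prompt": "Astronaut at a vast cavern mouth, cinematic lighting",
            "narration": "She had trained for silence, but not for this kind.",
        }
    ],
}

# Each scene drives one Imagen 3 call (image_prompt) and one TTS call (narration).
print(len(story_bible["scenes"]))
```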


How we built it

Backend — Python 3.11 + FastAPI

  • A streaming pipeline using Server-Sent Events (SSE) pushes progress updates to the frontend as each stage completes
  • story_bible_agent.py calls Gemini 2.0 Flash (gemini-2.0-flash-001) with a structured prompt to produce a JSON story bible
  • image_service.py calls Imagen 3 (imagen-3.0-generate-002) via Vertex AI for each scene
  • tts_service.py calls Cloud Text-to-Speech to synthesize WaveNet MP3 narration per scene
  • video_service.py uses FFmpeg + ImageMagick to composite and assemble the final MP4 at 1280×720
  • storage_service.py uploads all assets to Google Cloud Storage
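The streaming loop can be sketched without the real services, using stubbed stages (the stage names here are assumptions); each yielded frame is what FastAPI's StreamingResponse would push to the browser as text/event-stream:

```python
import asyncio
import json

# Assumed stage names, matching the pipeline order described above.
STAGES = ["story_bible", "images", "narration", "video", "upload"]

def sse_event(event: str, data: dict) -> str:
    # One Server-Sent Events frame: event name, JSON payload, blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

async def run_pipeline(prompt: str):
    # Stubbed stages; the real Gemini / Imagen / TTS / FFmpeg / GCS calls
    # would run here. Each completion yields one frame, which a FastAPI
    # StreamingResponse(media_type="text/event-stream") forwards immediately.
    for stage in STAGES:
        await asyncio.sleep(0)
        yield sse_event("progress", {"stage": stage, "status": "done"})
    yield sse_event("complete", {"prompt": prompt})

async def main():
    return [frame async for frame in run_pipeline("A lone astronaut...")]

frames = asyncio.run(main())
print(frames[0])
```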

Frontend — Next.js 14 + TypeScript + TailwindCSS

  • A custom useStoryGeneration hook consumes the SSE stream from the backend
  • Real-time ProgressTracker and StudioFeed components update as each pipeline stage completes
  • Scene cards stream in with images and narration text as they are generated
  • Auto-scrolls to the video player when the final MP4 is ready

Infrastructure — Google Cloud

  • Both frontend and backend are containerized with Docker and deployed to Cloud Run (asia-south1)
  • Cloud Build + Artifact Registry form the CI/CD pipeline — a single gcloud builds submit builds, pushes, and deploys
  • All generated assets (images, audio, video) are stored in Cloud Storage
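The single-command flow above can be captured in a short cloudbuild.yaml; the image path and service name below are illustrative, not the project's actual values:

```yaml
# Illustrative cloudbuild.yaml — build, push, deploy in one submit.
steps:
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "asia-south1-docker.pkg.dev/$PROJECT_ID/storyforge/backend", "."]
  - name: gcr.io/cloud-builders/docker
    args: ["push", "asia-south1-docker.pkg.dev/$PROJECT_ID/storyforge/backend"]
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args: ["run", "deploy", "storyforge-backend",
           "--image", "asia-south1-docker.pkg.dev/$PROJECT_ID/storyforge/backend",
           "--region", "asia-south1"]
```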

Challenges we ran into

  • NEXT_PUBLIC_* build-time variables — Next.js bakes environment variables into the JS bundle at compile time. Passing --build-arg to Docker isn't enough; the ARG and ENV must be explicitly declared in the Dockerfile before npm run build. This took several failed deployments to diagnose.
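The working Dockerfile pattern looks like this (NEXT_PUBLIC_API_URL is a hypothetical example variable, not necessarily the one the project uses):

```dockerfile
# The --build-arg value only reaches the build if ARG is declared here,
# and Next.js only inlines it into the client bundle if it is exported
# as ENV before `npm run build` runs.
ARG NEXT_PUBLIC_API_URL
ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
RUN npm run build
```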

  • Cloud Run PORT injection — Cloud Run injects a PORT environment variable at runtime (defaulting to 8080), but our Dockerfile hardcoded --port 8000. The container would start and immediately crash. Fixed by changing the CMD to sh -c "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}".

  • Cold starts with heavy containers — Our backend container includes FFmpeg, ImageMagick, and font packages. Cloud Run's scale-to-zero behavior means the first request after idle can take 15–20 seconds just to spin up the container — before any AI work begins.

  • SSE streaming across Cloud Run — Ensuring chunked SSE responses weren't buffered by Cloud Run's infrastructure required careful response header configuration on the FastAPI side.

  • Video assembly performance — FFmpeg encoding at 1080p was too slow for Cloud Run's virtualized CPU. We dropped to 720p with the ultrafast preset and a 2500k bitrate, which gave a much better balance of speed and quality.
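The resulting encode settings can be sketched as a per-scene command builder; the file names and exact flag ordering are illustrative, though the flags themselves are standard FFmpeg options:

```python
# Sketch of the per-scene FFmpeg invocation after the 720p tuning.
# File names are hypothetical; the real video_service.py may differ.
def ffmpeg_cmd(image: str, audio: str, out: str) -> list[str]:
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", image,   # hold the still image for the clip
        "-i", audio,                 # WaveNet narration MP3
        "-c:v", "libx264",
        "-preset", "ultrafast",      # trade compression for Cloud Run CPU time
        "-b:v", "2500k",
        "-vf", "scale=1280:720",
        "-shortest",                 # end the clip when the narration ends
        out,
    ]

print(" ".join(ffmpeg_cmd("scene1.png", "scene1.mp3", "scene1.mp4")))
```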


Accomplishments that we're proud of

  • Full end-to-end pipeline — from a text prompt to a downloadable MP4 video, entirely AI-generated, in one seamless flow
  • Real-time streaming UX — users see the pipeline unfold live, scene by scene, rather than staring at a loading spinner
  • Production-grade cloud deployment — fully containerized, deployed on Cloud Run with a proper CI/CD pipeline via Cloud Build
  • Cinematic quality — Imagen 3 with carefully engineered prompts produces genuinely impressive, stylistically consistent scene imagery
  • Professional narration — WaveNet voices bring the narration to life with natural prosody

What we learned

  • Prompt engineering is half the product. The quality of Gemini's story output and Imagen 3's images depends enormously on how the prompts are structured — specificity of mood, lighting, style, and camera angle makes a dramatic difference.
  • SSE is a powerful pattern for AI pipelines. It gives users a sense of progress and agency, and it's far more engaging than polling or waiting for a bulk response.
  • Docker build-time vs runtime environment variables are a common but subtle pitfall in Next.js + Cloud Run deployments.
  • Cloud Run is excellent for stateless AI workloads — fully managed, auto-scaling, and easy to deploy — but cold start latency needs to be factored into the UX design for containers with heavy dependencies.

What's next for StoryForge AI

  • Custom character consistency — use Gemini's multimodal capabilities to maintain visual consistency of characters across scenes
  • AI background music — add generated ambient soundtracks that match the story's mood
  • Multi-language support — generate stories and narration in multiple languages using Cloud TTS's language library
  • Mobile-optimized experience — progressive web app with offline video playback
  • Social sharing — one-click share of generated videos to social platforms
  • Style presets — let users choose cinematic styles (noir, anime, watercolors, epic fantasy) that influence both image generation and narration tone

Built With

  • Artifact Registry
  • Cloud Infrastructure
  • Docker
  • FastAPI
  • Gemini 2.0 Flash
  • GitHub
  • Google Cloud Build
  • Google Cloud Run
  • Google Cloud Storage
  • Google Cloud Text-to-Speech
  • Imagen 3
  • Next.js
  • Python
  • React
  • Server-Sent Events (SSE)
  • TailwindCSS
  • TypeScript
  • Vertex AI