Inspiration
Storytelling is one of humanity's oldest art forms — but creating a produced story with visuals, voice, and video has always required expensive tools, creative teams, and hours of work. We asked ourselves: what if anyone could type an idea and watch it become a cinematic experience in seconds?
The rise of multimodal AI — especially Gemini 2.0 Flash and Imagen 3 — made this feel genuinely possible. We were inspired by the idea of democratizing content creation: a student, an educator, a first-time creator, or a seasoned storyteller should all be able to bring their imagination to life without any production skills.
What it does
StoryForge AI transforms any story idea into a fully produced cinematic video — streamed in real-time.
- The user types a story idea (e.g. "A lone astronaut discovers an ancient civilization on Mars")
- Gemini 2.0 Flash generates a structured story bible — scenes, characters, setting, and narration scripts
- Imagen 3 generates a cinematic AI image for each scene
- Google Cloud Text-to-Speech (WaveNet) narrates each scene with professional voice audio
- FFmpeg + ImageMagick assemble the images and audio into a final MP4 video
- The finished video is delivered back to the user — all streamed live so they can watch the pipeline unfold in real-time
Every step is visible to the user through a live progress tracker and scene cards that appear as they are generated.
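The story bible from step two is the contract every later stage consumes. The exact schema is internal to StoryForge, but a shape like the following (field names and values are illustrative, not the real schema) shows what the image, TTS, and assembly stages each need per scene:

```python
# Hypothetical story-bible shape; field names are illustrative, not StoryForge's exact schema.
story_bible = {
    "title": "Echoes of the Red Dunes",
    "setting": "An abandoned Martian city beneath a dust-red sky",
    "characters": [
        {"name": "Commander Vega", "description": "A lone astronaut, weathered and curious"},
    ],
    "scenes": [
        {
            "scene_number": 1,
            "image_prompt": "Cinematic wide shot of an astronaut overlooking ancient ruins, golden-hour light",
            "narration": "Vega had trained for silence, but not for a silence this old.",
        },
    ],
}

def validate_story_bible(bible: dict) -> bool:
    """Check that every scene carries what the image and narration stages need."""
    required = {"scene_number", "image_prompt", "narration"}
    return bool(bible.get("scenes")) and all(required <= set(s) for s in bible["scenes"])

print(validate_story_bible(story_bible))  # → True
```

Validating the structure up front means a malformed Gemini response fails fast, before any image or audio credits are spent.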
How we built it
Backend — Python 3.11 + FastAPI
- A streaming pipeline using Server-Sent Events (SSE) pushes progress updates to the frontend as each stage completes
- `story_bible_agent.py` calls Gemini 2.0 Flash (`gemini-2.0-flash-001`) with a structured prompt to produce a JSON story bible
- `image_service.py` calls Imagen 3 (`imagen-3.0-generate-002`) via Vertex AI for each scene
- `tts_service.py` calls Cloud Text-to-Speech to synthesize WaveNet MP3 narration per scene
- `video_service.py` uses FFmpeg + ImageMagick to composite and assemble the final MP4 at 1280×720
- `storage_service.py` uploads all assets to Google Cloud Storage
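The SSE side of the pipeline can be sketched framework-agnostically. The event names below (`stage`, `scene`, `done`) are illustrative; in the real service a generator like this would be wrapped in FastAPI's `StreamingResponse` with `media_type="text/event-stream"`:

```python
import json
from typing import Iterator

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: a named event, a JSON payload,
    and the blank line that terminates the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def pipeline_events() -> Iterator[str]:
    """Illustrative stand-in for the real pipeline: each stage yields a frame
    as soon as it completes, so the frontend can render progress live."""
    yield sse_event("stage", {"name": "story_bible", "status": "complete"})
    yield sse_event("scene", {"scene_number": 1, "status": "image_ready"})
    yield sse_event("done", {"status": "video_ready"})

for frame in pipeline_events():
    print(frame, end="")
```

Because each stage yields immediately, the browser's `EventSource` receives scene updates while later scenes are still being generated.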
Frontend — Next.js 14 + TypeScript + TailwindCSS
- A custom `useStoryGeneration` hook consumes the SSE stream from the backend
- Real-time `ProgressTracker` and `StudioFeed` components update as each pipeline stage completes
- Scene cards stream in with images and narration text as they are generated
- Auto-scrolls to the video player when the final MP4 is ready
Infrastructure — Google Cloud
- Both frontend and backend are containerized with Docker and deployed to Cloud Run (`asia-south1`)
- Cloud Build + Artifact Registry form the CI/CD pipeline: a single `gcloud builds submit` builds, pushes, and deploys
- All generated assets (images, audio, video) are stored in Cloud Storage
Challenges we ran into
- `NEXT_PUBLIC_*` build-time variables: Next.js bakes environment variables into the JS bundle at compile time. Passing `--build-arg` to Docker isn't enough; the `ARG` and `ENV` must be explicitly declared in the Dockerfile before `npm run build`. This took several failed deployments to diagnose.
- Cloud Run PORT injection: Cloud Run injects a `PORT` environment variable at runtime (defaulting to `8080`), but our Dockerfile hardcoded `--port 8000`. The container would start and immediately crash. Fixed by changing the CMD to `sh -c "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}"`.
- Cold starts with heavy containers: Our backend container includes FFmpeg, ImageMagick, and font packages. Cloud Run's scale-to-zero behaviour means the first request after idle can take 15–20 seconds just to spin up the container, before any AI work begins.
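The two Docker lessons reduce to a small pattern. These are fragments from two different Dockerfiles (the env-variable name is illustrative), not a complete build:

```dockerfile
# --- Frontend (Next.js) fragment ---
# NEXT_PUBLIC_* values must exist as ENV *before* `npm run build`,
# because Next.js inlines them into the JS bundle at compile time;
# a --build-arg alone never reaches the build step.
ARG NEXT_PUBLIC_API_URL
ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
RUN npm run build

# --- Backend (FastAPI) fragment ---
# Respect Cloud Run's injected PORT instead of hardcoding one; the shell
# form is required so ${PORT:-8080} is expanded when the container starts.
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}"]
```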
- SSE streaming across Cloud Run: Ensuring chunked SSE responses weren't buffered by Cloud Run's infrastructure required careful response header configuration on the FastAPI side.
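Concretely, that header configuration looks something like this. The exact set is our assumption rather than a Cloud Run requirement (`X-Accel-Buffering: no`, for instance, only matters to intermediaries that honour it):

```python
# Headers that keep SSE frames flushing instead of buffering; the exact
# combination is an assumption drawn from common SSE practice, not a
# documented Cloud Run requirement.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",       # stop caches from holding the stream
    "Connection": "keep-alive",        # keep the HTTP connection open
    "X-Accel-Buffering": "no",         # opt out of proxy response buffering
}

# With FastAPI these would be attached to the stream, roughly:
#   StreamingResponse(generator, media_type="text/event-stream", headers=SSE_HEADERS)
```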
- Video assembly performance: FFmpeg encoding at 1080p was too slow on Cloud Run's virtualised CPUs. We dropped to 720p with the `ultrafast` preset and a 2500k bitrate, which gave a much better balance of speed and quality.
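The encoder settings we landed on look roughly like this. The helper is a sketch of what `video_service.py` does for one scene clip, not its literal code:

```python
def build_ffmpeg_args(image: str, audio: str, out: str) -> list[str]:
    """Build an FFmpeg invocation for one scene clip: a still image looped
    under the narration audio, at the 720p/ultrafast/2500k trade-off above."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", image,   # loop the still for the clip's duration
        "-i", audio,
        "-c:v", "libx264",
        "-preset", "ultrafast",      # favour speed over compression on Cloud Run vCPUs
        "-b:v", "2500k",
        "-vf", "scale=1280:720",
        "-c:a", "aac",
        "-shortest",                 # end the clip when the narration ends
        out,
    ]

args = build_ffmpeg_args("scene1.png", "scene1.mp3", "scene1.mp4")
# subprocess.run(args, check=True)  # run per scene, then concatenate the clips
```

Building the argument list as a Python list (rather than a shell string) also avoids quoting bugs when prompts or titles end up in filenames.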
Accomplishments that we're proud of
- Full end-to-end pipeline — from a text prompt to a downloadable MP4 video, entirely AI-generated, in one seamless flow
- Real-time streaming UX — users see the pipeline unfold live, scene by scene, rather than staring at a loading spinner
- Production-grade cloud deployment — fully containerized, deployed on Cloud Run with a proper CI/CD pipeline via Cloud Build
- Cinematic quality — Imagen 3 with carefully engineered prompts produces genuinely impressive, stylistically consistent scene imagery
- Professional narration — WaveNet voices bring the narration to life with natural prosody
What we learned
- Prompt engineering is half the product. The quality of Gemini's story output and Imagen 3's images depends enormously on how the prompts are structured — specificity of mood, lighting, style, and camera angle makes a dramatic difference.
- SSE is a powerful pattern for AI pipelines. It gives users a sense of progress and agency, and it's far more engaging than polling or waiting for a bulk response.
- Docker build-time vs runtime environment variables are a common but subtle pitfall in Next.js + Cloud Run deployments.
- Cloud Run is excellent for stateless AI workloads — fully managed, auto-scaling, and easy to deploy — but cold start latency needs to be factored into the UX design for containers with heavy dependencies.
What's next for StoryForge AI
- Custom character consistency — use Gemini's multimodal capabilities to maintain visual consistency of characters across scenes
- AI background music — add generated ambient soundtracks that match the story's mood
- Multi-language support — generate stories and narration in multiple languages using Cloud TTS's language library
- Mobile-optimized experience — progressive web app with offline video playback
- Social sharing — one-click share of generated videos to social platforms
- Style presets — let users choose cinematic styles (noir, anime, watercolors, epic fantasy) that influence both image generation and narration tone
Built With
- artifactregistry
- cloudinfrastructure
- docker
- fastapi
- gemini2.0flash
- github
- googlecloudbuild
- googlecloudrun
- googlecloudstorage
- googlecloudtext-to-speech
- imagen3
- next.js
- python
- react
- server-sentevents(sse)
- tailwindcss
- typescript
- vertexai
