Inspiration
Storytelling is one of humanity's oldest art forms — but creating a produced story with visuals, voice, and video has always required expensive tools, creative teams, and hours of work. We asked ourselves: what if anyone could type an idea and watch it become a cinematic experience in seconds?
The rise of multimodal AI — especially Gemini 2.0 Flash and Imagen 3 — made this feel genuinely possible. We were inspired by the idea of democratizing content creation: a student, an educator, a first-time creator, or a seasoned storyteller should all be able to bring their imagination to life without any production skills.
What it does
StoryForge AI transforms any story idea into a fully produced cinematic video — streamed in real-time.
- The user types a story idea (e.g. "A lone astronaut discovers an ancient civilization on Mars")
- Gemini 2.0 Flash generates a structured story bible — scenes, characters, setting, and narration scripts
- Imagen 3 generates a cinematic AI image for each scene
- Google Cloud Text-to-Speech (WaveNet) narrates each scene with professional voice audio
- FFmpeg + ImageMagick assemble the images and audio into a final MP4 video
- The finished video is delivered back to the user — all streamed live so they can watch the pipeline unfold in real-time
Every step is visible to the user through a live progress tracker and scene cards that appear as they are generated.
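The story bible from step two is the contract every later stage consumes. The exact schema is internal to StoryForge, but a shape like the following (field names and values are illustrative, not the real schema) shows what the image, TTS, and assembly stages each need per scene:

```python
# Hypothetical story-bible shape; field names are illustrative, not StoryForge's exact schema.
story_bible = {
    "title": "Echoes of the Red Dunes",
    "setting": "An abandoned Martian city beneath a dust-red sky",
    "characters": [
        {"name": "Commander Vega", "description": "A lone astronaut, weathered and curious"},
    ],
    "scenes": [
        {
            "scene_number": 1,
            "image_prompt": "Cinematic wide shot of an astronaut overlooking ancient ruins, golden-hour light",
            "narration": "Vega had trained for silence, but not for a silence this old.",
        },
    ],
}

def validate_story_bible(bible: dict) -> bool:
    """Check that every scene carries what the image and narration stages need."""
    required = {"scene_number", "image_prompt", "narration"}
    return bool(bible.get("scenes")) and all(required <= set(s) for s in bible["scenes"])

print(validate_story_bible(story_bible))  # → True
```

Validating the structure up front means a malformed Gemini response fails fast, before any image or audio credits are spent.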
How we built it
Backend — Python 3.11 + FastAPI
- A streaming pipeline using Server-Sent Events (SSE) pushes progress updates to the frontend as each stage completes
- `story_bible_agent.py` calls Gemini 2.0 Flash (`gemini-2.0-flash-001`) with a structured prompt to produce a JSON story bible
- `image_service.py` calls Imagen 3 (`imagen-3.0-generate-002`) via Vertex AI for each scene
- `tts_service.py` calls Cloud Text-to-Speech to synthesize WaveNet MP3 narration per scene
- `video_service.py` uses FFmpeg + ImageMagick to composite and assemble the final MP4 at 1280×720
- `storage_service.py` uploads all assets to Google Cloud Storage
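The SSE side of the pipeline can be sketched framework-agnostically. The event names below (`stage`, `scene`, `done`) are illustrative; in the real service a generator like this would be wrapped in FastAPI's `StreamingResponse` with `media_type="text/event-stream"`:

```python
import json
from typing import Iterator

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: a named event, a JSON payload,
    and the blank line that terminates the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def pipeline_events() -> Iterator[str]:
    """Illustrative stand-in for the real pipeline: each stage yields a frame
    as soon as it completes, so the frontend can render progress live."""
    yield sse_event("stage", {"name": "story_bible", "status": "complete"})
    yield sse_event("scene", {"scene_number": 1, "status": "image_ready"})
    yield sse_event("done", {"status": "video_ready"})

for frame in pipeline_events():
    print(frame, end="")
```

Because each stage yields immediately, the browser's `EventSource` receives scene updates while later scenes are still being generated.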
Frontend — Next.js 14 + TypeScript + TailwindCSS
- A custom `useStoryGeneration` hook consumes the SSE stream from the backend
- Real-time `ProgressTracker` and `StudioFeed` components update as each pipeline stage completes
- Scene cards stream in with images and narration text as they are generated
- Auto-scrolls to the video player when the final MP4 is ready
Infrastructure — Google Cloud
- Both frontend and backend are containerized with Docker and deployed to Cloud Run (`asia-south1`)
- Cloud Build + Artifact Registry form the CI/CD pipeline: a single `gcloud builds submit` builds, pushes, and deploys
- All generated assets (images, audio, video) are stored in Cloud Storage
Challenges we ran into
- `NEXT_PUBLIC_*` build-time variables: Next.js bakes environment variables into the JS bundle at compile time. Passing `--build-arg` to Docker isn't enough; the `ARG` and `ENV` must be explicitly declared in the Dockerfile before `npm run build`. This took several failed deployments to diagnose.
- Cloud Run PORT injection: Cloud Run injects a `PORT` environment variable at runtime (defaulting to `8080`), but our Dockerfile hardcoded `--port 8000`. The container would start and immediately crash. Fixed by changing the CMD to `sh -c "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}"`.
- Cold starts with heavy containers: Our backend container includes FFmpeg, ImageMagick, and font packages. Cloud Run's scale-to-zero behaviour means the first request after idle can take 15–20 seconds just to spin up the container, before any AI work begins.
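The two Docker lessons reduce to a small pattern. These are fragments from two different Dockerfiles (the env-variable name is illustrative), not a complete build:

```dockerfile
# --- Frontend (Next.js) fragment ---
# NEXT_PUBLIC_* values must exist as ENV *before* `npm run build`,
# because Next.js inlines them into the JS bundle at compile time;
# a --build-arg alone never reaches the build step.
ARG NEXT_PUBLIC_API_URL
ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
RUN npm run build

# --- Backend (FastAPI) fragment ---
# Respect Cloud Run's injected PORT instead of hardcoding one; the shell
# form is required so ${PORT:-8080} is expanded when the container starts.
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}"]
```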
- SSE streaming across Cloud Run: Ensuring chunked SSE responses weren't buffered by Cloud Run's infrastructure required careful response header configuration on the FastAPI side.
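Concretely, that header configuration looks something like this. The exact set is our assumption rather than a Cloud Run requirement (`X-Accel-Buffering: no`, for instance, only matters to intermediaries that honour it):

```python
# Headers that keep SSE frames flushing instead of buffering; the exact
# combination is an assumption drawn from common SSE practice, not a
# documented Cloud Run requirement.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",       # stop caches from holding the stream
    "Connection": "keep-alive",        # keep the HTTP connection open
    "X-Accel-Buffering": "no",         # opt out of proxy response buffering
}

# With FastAPI these would be attached to the stream, roughly:
#   StreamingResponse(generator, media_type="text/event-stream", headers=SSE_HEADERS)
```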
- Video assembly performance: FFmpeg encoding at 1080p was too slow on Cloud Run's virtualised CPUs. We dropped to 720p with the `ultrafast` preset and a 2500k bitrate, which gave a much better balance of speed and quality.
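The encoder settings we landed on look roughly like this. The helper is a sketch of what `video_service.py` does for one scene clip, not its literal code:

```python
def build_ffmpeg_args(image: str, audio: str, out: str) -> list[str]:
    """Build an FFmpeg invocation for one scene clip: a still image looped
    under the narration audio, at the 720p/ultrafast/2500k trade-off above."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", image,   # loop the still for the clip's duration
        "-i", audio,
        "-c:v", "libx264",
        "-preset", "ultrafast",      # favour speed over compression on Cloud Run vCPUs
        "-b:v", "2500k",
        "-vf", "scale=1280:720",
        "-c:a", "aac",
        "-shortest",                 # end the clip when the narration ends
        out,
    ]

args = build_ffmpeg_args("scene1.png", "scene1.mp3", "scene1.mp4")
# subprocess.run(args, check=True)  # run per scene, then concatenate the clips
```

Building the argument list as a Python list (rather than a shell string) also avoids quoting bugs when prompts or titles end up in filenames.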
Accomplishments that we're proud of
- Full end-to-end pipeline — from a text prompt to a downloadable MP4 video, entirely AI-generated, in one seamless flow
- Real-time streaming UX — users see the pipeline unfold live, scene by scene, rather than staring at a loading spinner
- Production-grade cloud deployment — fully containerized, deployed on Cloud Run with a proper CI/CD pipeline via Cloud Build
- Cinematic quality — Imagen 3 with carefully engineered prompts produces genuinely impressive, stylistically consistent scene imagery
- Professional narration — WaveNet voices bring the narration to life with natural prosody
What we learned
- Prompt engineering is half the product. The quality of Gemini's story output and Imagen 3's images depends enormously on how the prompts are structured — specificity of mood, lighting, style, and camera angle makes a dramatic difference.
- SSE is a powerful pattern for AI pipelines. It gives users a sense of progress and agency, and it's far more engaging than polling or waiting for a bulk response.
- Docker build-time vs runtime environment variables are a common but subtle pitfall in Next.js + Cloud Run deployments.
- Cloud Run is excellent for stateless AI workloads — fully managed, auto-scaling, and easy to deploy — but cold start latency needs to be factored into the UX design for containers with heavy dependencies.
What's next for StoryForge AI
- Custom character consistency — use Gemini's multimodal capabilities to maintain visual consistency of characters across scenes
- AI background music — add generated ambient soundtracks that match the story's mood
- Multi-language support — generate stories and narration in multiple languages using Cloud TTS's language library
- Mobile-optimized experience — progressive web app with offline video playback
- Social sharing — one-click share of generated videos to social platforms
- Style presets — let users choose cinematic styles (noir, anime, watercolors, epic fantasy) that influence both image generation and narration tone
Built With
- artifactregistry
- cloudinfrastructure
- docker
- fastapi
- gemini2.0flash
- github
- googlecloudbuild
- googlecloudrun
- googlecloudstorage
- googlecloudtext-to-speech
- imagen3
- next.js
- python
- react
- server-sentevents(sse)
- tailwindcss
- typescript
- vertexai
