Inspiration

Content creators and story writers who post short-form video face a multi-hour manual edit per piece: scripting, voiceover, cutting, subtitles, thumbnail, metadata, then reformatting for each platform. No end-to-end tool covers that full loop. Gameplay footage is a perfect retention layer, so ClipForge pairs your story with any looping clip and produces four platform-ready videos in one pass.

What it does

Drop in background footage, pick a mode (Full AI / Topic-Guided / Manual), and the pipeline handles the rest: an AI agent loop writes and reviews the script, TTS generates the voiceover, FFmpeg cuts the footage to the target duration, Whisper produces synced burned-in subtitles, a green progress bar travels the frame perimeter for retention, a thumbnail is auto-generated with AI hook text, and per-platform titles and descriptions are written within each platform's character limits. The step order is sketched below.
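
A rough sketch of that order; the step names are illustrative, not ClipForge's actual module API:

```python
# Hypothetical step order for one job; names are illustrative.
PIPELINE = [
    ("script",    "AI agent loop writes and reviews the script"),
    ("voiceover", "TTS renders the narration"),
    ("cut",       "FFmpeg trims footage to the target duration"),
    ("subtitles", "Whisper word timestamps, burned in via FFmpeg"),
    ("progress",  "green bar frames composited around the perimeter"),
    ("thumbnail", "Pillow image with AI hook text"),
    ("metadata",  "per-platform titles/descriptions within char limits"),
]

def run_job(job_id: str) -> None:
    # The real pipeline checkpoints each step (see "How we built it").
    for step, summary in PIPELINE:
        print(f"[{job_id}] {step}: {summary}")
```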

How we built it

  • Python 3.11, FastAPI (async + WebSocket), Alpine.js + Tailwind via CDN (zero build step), SQLite with WAL mode, FFmpeg, Pillow, Typer + Questionary for the CLI.
  • Google Gemma 3 (gemma-3-27b-it) via the new google-genai SDK powers all four AI roles (Generator, Reviewer, Modifier, Tightener), differentiated only by prompt; a minimal sketch follows this list. The CLIPFORGE_MODEL env var lets users swap to gemini-pro-latest for maximum quality.
  • Microsoft edge-tts (free, 25+ English neural voices, no local model download) for the voiceover, with Google Cloud TTS supported as an alternate cloud option.
  • OpenAI Whisper small for word-level subtitle timestamps; the voiceover-to-subtitles handoff is sketched after this list.
  • WebSocket pushes a state_sync snapshot on connect (solving the race where fast steps complete before the frontend connects), then step_update events per transition; see the sketch below. The output dashboard blooms open on completion.
  • Per-step checkpoints live in the job_steps table. A startup sweep marks in_progress jobs as interrupted, and the user clicks Resume explicitly, so there are no auto-resume crash loops; a sketch follows.
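
First, the prompt-only role split. A minimal sketch assuming the google-genai client with an API key in the environment; the role prompts shown are placeholders, not ClipForge's actual prompts:

```python
import os
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment
MODEL = os.environ.get("CLIPFORGE_MODEL", "gemma-3-27b-it")

# One model, four roles; only the instruction differs.
ROLES = {
    "generator": "Write a ~70-second narration script for this story idea:",
    "reviewer":  "Critique this script for hook strength and pacing:",
    "modifier":  "Rewrite the script, applying this critique:",
    "tightener": "Trim the script to hit the target read time exactly:",
}

def run_role(role: str, payload: str) -> str:
    # Gemma takes the role instruction inline with the content.
    resp = client.models.generate_content(
        model=MODEL, contents=f"{ROLES[role]}\n\n{payload}"
    )
    return resp.text
```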
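Next, the voiceover-to-subtitles handoff, assuming edge-tts and openai-whisper as installed; the voice choice is illustrative:

```python
import asyncio

import edge_tts
import whisper

async def synth(text: str, path: str = "voiceover.mp3") -> str:
    # Any of the 25+ English neural voices works; this one is illustrative.
    await edge_tts.Communicate(text, voice="en-US-GuyNeural").save(path)
    return path

def word_timestamps(path: str) -> list[dict]:
    model = whisper.load_model("small")
    result = model.transcribe(path, word_timestamps=True)
    # Flatten per-segment word lists into one stream for subtitle rendering.
    return [w for seg in result["segments"] for w in seg["words"]]

audio = asyncio.run(synth("ClipForge turns one story into four videos."))
words = word_timestamps(audio)  # each item carries "word", "start", "end"
```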
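The snapshot-then-stream WebSocket pattern in miniature, assuming FastAPI with a hypothetical in-memory step store and per-job event queue:

```python
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()
JOB_STEPS: dict[str, list[dict]] = {}  # job_id -> step records so far
QUEUES: dict[str, asyncio.Queue] = {}  # job_id -> live event queue

@app.websocket("/ws/{job_id}")
async def job_ws(ws: WebSocket, job_id: str):
    await ws.accept()
    # 1) state_sync: a full snapshot first, so steps that completed before
    #    the client connected are never lost.
    await ws.send_json({"type": "state_sync", "steps": JOB_STEPS.get(job_id, [])})
    # 2) step_update: then stream incremental transitions as they happen.
    queue = QUEUES.setdefault(job_id, asyncio.Queue())
    while True:
        event = await queue.get()
        await ws.send_json({"type": "step_update", **event})
```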
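And the startup sweep over checkpoints; the job_steps schema here is a guess at the shape, not the actual one:

```python
import sqlite3

con = sqlite3.connect("clipforge.db")
con.execute("PRAGMA journal_mode=WAL")
con.execute("""CREATE TABLE IF NOT EXISTS job_steps (
    job_id TEXT,
    step   TEXT,
    status TEXT,  -- pending / in_progress / done / interrupted
    PRIMARY KEY (job_id, step))""")

def startup_sweep() -> None:
    # Anything still in_progress at boot was killed mid-step: mark it
    # interrupted and wait for an explicit Resume click; never auto-resume.
    con.execute(
        "UPDATE job_steps SET status = 'interrupted' WHERE status = 'in_progress'"
    )
    con.commit()
```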

Challenges

  • The planned FFmpeg geq spatial-gradient rainbow border was far too slow. We replaced it with a pre-rendered PNG frame sequence showing a green progress bar travelling the perimeter, which proved 150× faster and arguably more useful; a Pillow sketch follows this list.
  • Kokoro TTS was planned as the local fallback but proved slow on CPU and required a 1.5 GB first-run download. Swapped to edge-tts, which streams with no local model.
  • TTS output rarely matched the target duration. We added a voiceover time-stretch step using FFmpeg's atempo filter to lock tracks to exactly 70 s (full) and 54 s (short); see the second sketch after this list.
  • WebSocket race condition: fast-completing steps were landing before the frontend connection opened. Fixed with the state_sync snapshot pattern sketched under How we built it.
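
A minimal Pillow sketch of the pre-rendered approach; dimensions, color, segment length, and frame count are all illustrative:

```python
from PIL import Image, ImageDraw

W, H, THICK, SEG = 1080, 1920, 12, 400  # 9:16 frame, bar thickness, tail length

def perimeter_frame(t: float) -> Image.Image:
    """One transparent overlay frame with a green segment whose head sits at
    fraction t of the perimeter (top -> right -> bottom -> left)."""
    img = Image.new("RGBA", (W, H), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    per = 2 * (W + H)
    head = t * per
    for d in range(SEG):  # paint SEG pixels trailing behind the head
        p = (head - d) % per
        if p < W:            x, y = p, 0                      # top edge
        elif p < W + H:      x, y = W - THICK, p - W          # right edge
        elif p < 2 * W + H:  x, y = 2 * W + H - p, H - THICK  # bottom edge
        else:                x, y = 0, per - p                # left edge
        draw.rectangle([x, y, x + THICK, y + THICK], fill=(0, 255, 100, 255))
    return img

# Pre-render once, then let FFmpeg overlay the sequence on the video.
for i in range(150):
    perimeter_frame(i / 149).save(f"bar_{i:04d}.png")
```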
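And a sketch of the atempo time-stretch, assuming ffmpeg on PATH; the helper name and example durations are illustrative:

```python
import subprocess

def stretch_to(src: str, dst: str, src_dur: float, target: float) -> None:
    """Lock a voiceover to the target duration with FFmpeg's atempo filter.
    atempo is most reliable in [0.5, 2.0], so factors are chained if needed."""
    tempo = src_dur / target  # >1 speeds up, <1 slows down
    factors = []
    while tempo > 2.0:
        factors.append(2.0)
        tempo /= 2.0
    while tempo < 0.5:
        factors.append(0.5)
        tempo /= 0.5
    factors.append(tempo)
    chain = ",".join(f"atempo={f:.6f}" for f in factors)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-filter:a", chain, dst], check=True)

# e.g. squeeze a 74.3 s take onto the 70 s full-length target:
# stretch_to("voiceover.mp3", "voiceover_70s.mp3", 74.3, 70.0)
```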

What I learned

Built using a structured spec-driven development (SDD) process: scope → PRD → spec → checklist → build, each phase producing a document. The biggest lesson: the 3-4 hours spent on planning artifacts before writing code saved more time than they cost. The checklist became a script for the build, and when reality diverged (the geq filter was too slow, Kokoro was swapped for edge-tts, Gemini was swapped for Gemma) the plan adapted cleanly instead of collapsing.

What's next

  • Multi-language support (TTS voices + Whisper languages)
  • Batch generation — multiple videos from one footage dump
  • Export preset profiles
  • SaaS packaging with user accounts and cloud storage
  • Subtitle customisation (font, size, color, position, animation)

Built With

  • alpine-js
  • edge-tts
  • fastapi
  • ffmpeg
  • google-cloud-tts
  • google-genai
  • openai-whisper
  • pillow
  • python
  • questionary
  • sqlite
  • tailwindcss
  • typer