Vibe editing is finally here with Omu.

Inspiration

Modern video editing is powerful — but slow, manual, and timeline-heavy. Creators spend more time cutting filler words, syncing captions, and hunting for B-roll than actually telling stories. We asked a simple question: what if you could edit video the same way you think and speak?

Omu was born from the idea that voice should be the primary interface for creativity, not an afterthought.


What it does

Omu is a voice-first AI video editor.
Edit hands-free, add animated captions, auto-remove filler words, add music, and generate B-roll at the perfect moments — turning raw clips into polished videos in minutes. A live transcript becomes the editing surface, so creators never touch a timeline.


How we built it

Omu is a multi-agent AI system centered around Gemini 3:

  • Gemini 3 Flash (core reasoning & tool calling): Orchestrates edits, interprets intent, and coordinates agents; also drives the transcript and text-editing agent
  • Gemini Live API: Powers the real-time, voice-driven editing agent
  • Gemini 3 Pro (video understanding): Determines when B-roll should appear based on context
  • Gemini 3 Pro Image (Nano Banana Pro): Generates image-based B-roll
  • Veo 3.1 Fast: Generates video B-roll dynamically
  • Scribe v2: Provides accurate, word-level timestamps for precise edits
  • FFmpeg: Handles video composition and client-side processing in the browser
  • Remotion Lambda: Handles scalable, serverless video rendering
  • Google Antigravity & Bolt: AI-powered development agents used to build and iterate on the application
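One way the word-level timestamps and FFmpeg pieces fit together is filler-word removal: words flagged in the transcript are dropped, the remaining words are merged into contiguous keep-segments, and those segments compile to an FFmpeg trim/concat filter graph. A minimal sketch (the types, function names, and gap threshold here are illustrative, not Omu's actual code):

```typescript
// Sketch: turn word-level timestamps into FFmpeg trim segments
// that skip filler words. Names and thresholds are illustrative.
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

const FILLERS = new Set(["um", "uh", "like", "you know"]);

// Merge consecutive non-filler words into contiguous keep-segments,
// bridging small gaps so natural pauses are not cut.
function keepSegments(words: Word[], gap = 0.15): Array<[number, number]> {
  const segs: Array<[number, number]> = [];
  for (const w of words) {
    if (FILLERS.has(w.text.toLowerCase())) continue;
    const last = segs[segs.length - 1];
    if (last && w.start - last[1] <= gap) {
      last[1] = w.end; // extend the current segment
    } else {
      segs.push([w.start, w.end]);
    }
  }
  return segs;
}

// Compile keep-segments into an ffmpeg filter_complex string using
// the trim/atrim and concat filters.
function toFilter(segs: Array<[number, number]>): string {
  const parts = segs.map(
    ([s, e], i) =>
      `[0:v]trim=${s}:${e},setpts=PTS-STARTPTS[v${i}];` +
      `[0:a]atrim=${s}:${e},asetpts=PTS-STARTPTS[a${i}]`
  );
  const concat =
    segs.map((_, i) => `[v${i}][a${i}]`).join("") +
    `concat=n=${segs.length}:v=1:a=1[v][a]`;
  return parts.join(";") + ";" + concat;
}
```

The same segment list can feed either the browser-side FFmpeg pass or the caption layer, which is what lets the transcript stand in for a timeline.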

Spoken commands trigger agent-based tool calling, converting intent into structured video edits.
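The rough shape of that flow, sketched below with illustrative tool names and edit types (not Omu's actual schema): the model answers a spoken command with a function call, and a dispatcher maps it onto a structured, replayable edit record.

```typescript
// Sketch: spoken intent arrives as a model function call; a dispatcher
// turns it into a structured edit. Names and fields are illustrative.
type Edit =
  | { kind: "remove_range"; start: number; end: number }
  | { kind: "add_broll"; prompt: string; at: number; duration: number };

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function dispatch(call: ToolCall): Edit {
  switch (call.name) {
    case "remove_range":
      return {
        kind: "remove_range",
        start: Number(call.args.start),
        end: Number(call.args.end),
      };
    case "add_broll":
      return {
        kind: "add_broll",
        prompt: String(call.args.prompt),
        at: Number(call.args.at),
        duration: Number(call.args.duration ?? 3), // default length
      };
    default:
      throw new Error(`unknown tool: ${call.name}`);
  }
}

// Edits accumulate as an ordered list, which keeps the pipeline
// deterministic and makes undo a simple pop.
function apply(edits: Edit[], call: ToolCall): Edit[] {
  return [...edits, dispatch(call)];
}
```

Keeping edits as data rather than applying them eagerly is what lets several agents contribute to the same video without stepping on each other.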


Challenges we ran into

  • Timing precision: Aligning filler-word removal, captions, and B-roll at frame-level accuracy
  • Real-time responsiveness: Coordinating live voice input with multiple agents without lag
  • Context-aware B-roll: Matching visuals to narrative intent, not just keywords
  • Rendering complexity: Keeping local rendering fast while scaling serverless cloud rendering (raising Remotion Lambda concurrency) so iteration stays quick
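The timing-precision challenge mostly reduces to snapping second-based word timestamps onto frame boundaries, expanding cut ranges outward so no spoken audio is clipped. A minimal sketch, assuming a constant frame rate (function names are illustrative):

```typescript
// Sketch: align second-based timestamps to frame boundaries so cuts,
// captions, and B-roll land on exact frames. Assumes constant fps.
function toFrame(seconds: number, fps: number): number {
  return Math.round(seconds * fps);
}

// Expand a [start, end] range outward to whole frames so a kept
// segment never clips the audio at its edges.
function snapRange(start: number, end: number, fps: number): [number, number] {
  return [Math.floor(start * fps), Math.ceil(end * fps)];
}
```

For example, at 30 fps a word spanning 0.51–1.49 s snaps outward to frames 15–45.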

Accomplishments that we're proud of

  • Built a fully voice-driven editing workflow, not just voice shortcuts
  • Achieved context-aware B-roll generation using Gemini-powered video understanding
  • Designed a timeline-free editing experience grounded in transcripts and intent
  • Coordinated multiple Gemini models in a single production pipeline

What we learned

  • Voice becomes a powerful creative interface when paired with strong reasoning models
  • Tool calling and agent orchestration unlock new UX patterns
  • Video editing is fundamentally a semantic problem, not just a visual one
  • Gemini excels at understanding why an edit should happen—not just how

What's next for Omu

  • Mobile app for vibe-editing on the go
  • Color grading and automatic motion graphics based on context
  • Personalized editing styles and tone-aware B-roll generation
  • Creator workflows for social, ads, and education

Omu’s vision: make video editing feel as natural as conversation.
