Vibe editing is finally here with Omu.

Inspiration

Modern video editing is powerful — but slow, manual, and timeline-heavy. Creators spend more time cutting filler words, syncing captions, and hunting for B-roll than actually telling stories. We asked a simple question: what if you could edit video the same way you think and speak?

Omu was born from the idea that voice should be the primary interface for creativity, not an afterthought.


What it does

Omu is a voice-first AI video editor.
Edit hands-free, add animated captions, auto-remove filler words, add music, and generate B-roll at the perfect moments — turning raw clips into polished videos in minutes. A live transcript becomes the editing surface, so creators never touch a timeline.


How we built it

Omu is a multi-agent AI system centered around Gemini 3:

  • Gemini 3 Flash (core reasoning & tool calling): Orchestrates edits, interprets intent, and coordinates agents; also drives the transcript and text-editing agent
  • Gemini Live API: Powers the real-time, voice-driven editing agent
  • Gemini 3 Pro (video understanding): Determines when B-roll should appear based on context
  • Gemini 3 Pro Image (Nano Banana Pro): Generates image-based B-roll
  • Veo 3.1 Fast: Generates video B-roll dynamically
  • Scribe v2: Provides accurate, word-level timestamps for precise edits
  • FFmpeg: Handles video composition and client-side processing in the browser
  • Remotion Lambda: Handles scalable, serverless video rendering
  • Google Antigravity & Bolt: AI-powered development agents used to build and iterate on the application
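One way the word-level timestamps and FFmpeg pieces fit together is filler-word removal: words flagged in the transcript are dropped, the remaining words are merged into contiguous keep-segments, and those segments compile to an FFmpeg trim/concat filter graph. A minimal sketch (the types, function names, and gap threshold here are illustrative, not Omu's actual code):

```typescript
// Sketch: turn word-level timestamps into FFmpeg trim segments
// that skip filler words. Names and thresholds are illustrative.
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

const FILLERS = new Set(["um", "uh", "like", "you know"]);

// Merge consecutive non-filler words into contiguous keep-segments,
// bridging small gaps so natural pauses are not cut.
function keepSegments(words: Word[], gap = 0.15): Array<[number, number]> {
  const segs: Array<[number, number]> = [];
  for (const w of words) {
    if (FILLERS.has(w.text.toLowerCase())) continue;
    const last = segs[segs.length - 1];
    if (last && w.start - last[1] <= gap) {
      last[1] = w.end; // extend the current segment
    } else {
      segs.push([w.start, w.end]);
    }
  }
  return segs;
}

// Compile keep-segments into an ffmpeg filter_complex string using
// the trim/atrim and concat filters.
function toFilter(segs: Array<[number, number]>): string {
  const parts = segs.map(
    ([s, e], i) =>
      `[0:v]trim=${s}:${e},setpts=PTS-STARTPTS[v${i}];` +
      `[0:a]atrim=${s}:${e},asetpts=PTS-STARTPTS[a${i}]`
  );
  const concat =
    segs.map((_, i) => `[v${i}][a${i}]`).join("") +
    `concat=n=${segs.length}:v=1:a=1[v][a]`;
  return parts.join(";") + ";" + concat;
}
```

The same segment list can feed either the browser-side FFmpeg pass or the caption layer, which is what lets the transcript stand in for a timeline.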

Spoken commands trigger agent-based tool calling, converting intent into structured video edits.
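The rough shape of that flow, sketched below with illustrative tool names and edit types (not Omu's actual schema): the model answers a spoken command with a function call, and a dispatcher maps it onto a structured, replayable edit record.

```typescript
// Sketch: spoken intent arrives as a model function call; a dispatcher
// turns it into a structured edit. Names and fields are illustrative.
type Edit =
  | { kind: "remove_range"; start: number; end: number }
  | { kind: "add_broll"; prompt: string; at: number; duration: number };

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function dispatch(call: ToolCall): Edit {
  switch (call.name) {
    case "remove_range":
      return {
        kind: "remove_range",
        start: Number(call.args.start),
        end: Number(call.args.end),
      };
    case "add_broll":
      return {
        kind: "add_broll",
        prompt: String(call.args.prompt),
        at: Number(call.args.at),
        duration: Number(call.args.duration ?? 3), // default length
      };
    default:
      throw new Error(`unknown tool: ${call.name}`);
  }
}

// Edits accumulate as an ordered list, which keeps the pipeline
// deterministic and makes undo a simple pop.
function apply(edits: Edit[], call: ToolCall): Edit[] {
  return [...edits, dispatch(call)];
}
```

Keeping edits as data rather than applying them eagerly is what lets several agents contribute to the same video without stepping on each other.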


Challenges we ran into

  • Timing precision: Aligning filler-word removal, captions, and B-roll at frame-level accuracy
  • Real-time responsiveness: Coordinating live voice input with multiple agents without lag
  • Context-aware B-roll: Matching visuals to narrative intent, not just keywords
  • Rendering complexity: Keeping local rendering fast while scaling serverless cloud rendering (raising Remotion Lambda concurrency) so iteration stays quick
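The timing-precision challenge mostly reduces to snapping second-based word timestamps onto frame boundaries, expanding cut ranges outward so no spoken audio is clipped. A minimal sketch, assuming a constant frame rate (function names are illustrative):

```typescript
// Sketch: align second-based timestamps to frame boundaries so cuts,
// captions, and B-roll land on exact frames. Assumes constant fps.
function toFrame(seconds: number, fps: number): number {
  return Math.round(seconds * fps);
}

// Expand a [start, end] range outward to whole frames so a kept
// segment never clips the audio at its edges.
function snapRange(start: number, end: number, fps: number): [number, number] {
  return [Math.floor(start * fps), Math.ceil(end * fps)];
}
```

For example, at 30 fps a word spanning 0.51–1.49 s snaps outward to frames 15–45.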

Accomplishments that we're proud of

  • Built a fully voice-driven editing workflow, not just voice shortcuts
  • Achieved context-aware B-roll generation using Gemini-powered video understanding
  • Designed a timeline-free editing experience grounded in transcripts and intent
  • Coordinated multiple Gemini models in a single production pipeline

What we learned

  • Voice becomes a powerful creative interface when paired with strong reasoning models
  • Tool calling and agent orchestration unlock new UX patterns
  • Video editing is fundamentally a semantic problem, not just a visual one
  • Gemini excels at understanding why an edit should happen—not just how

What's next for Omu

  • Mobile app for vibe-editing on the go
  • Color grading and automatic motion graphics based on context
  • Personalized editing styles and tone-aware B-roll generation
  • Creator workflows for social, ads, and education

Omu’s vision: make video editing feel as natural as conversation.
