Inspiration
Editing long videos to find highlights is slow and tedious. Content creators often spend hours manually scrubbing timelines just to find a few useful moments. We wanted to remove this friction by replacing manual editing with a conversational AI experience.
The idea was simple: what if you could edit videos just by talking to an AI agent?
Instead of searching for clips manually, users can say commands like:
- “Extract the highlight.”
- “Cut the funny moment.”
- “Remove silent parts.”
The agent understands the command, analyzes the video, and generates clips automatically.
What it does
VoxClip AI is a real-time multimodal AI agent that lets users edit videos using natural voice commands.
The system analyzes both audio and video content to identify meaningful segments and automatically generate optimized clips.
Key capabilities:
- Voice-controlled video editing
- Automatic highlight detection
- Silence and filler removal
- Smart clip extraction
- Real-time conversational interaction
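As an illustration of the silence-removal capability, here is a minimal sketch of turning detected silent spans into the spans worth keeping. It assumes the silent intervals are already known (for example from FFmpeg's `silencedetect` filter); the function name `keep_intervals` and the `min_len` threshold are hypothetical and not taken from the VoxClip AI codebase.

```python
def keep_intervals(duration, silences, min_len=0.25):
    """Return (start, end) spans of a video that are NOT silent.

    duration: total length of the video in seconds.
    silences: iterable of (start, end) silent spans in seconds (may overlap).
    min_len:  hypothetical threshold; kept spans shorter than this are dropped.
    """
    kept = []
    cursor = 0.0  # end of the last silent span processed so far
    for s_start, s_end in sorted(silences):
        # Clamp each silent span to the bounds of the video.
        s_start = max(0.0, min(s_start, duration))
        s_end = max(0.0, min(s_end, duration))
        # Keep the audible gap between the previous silence and this one.
        if s_start - cursor >= min_len:
            kept.append((cursor, s_start))
        cursor = max(cursor, s_end)
    # Keep whatever remains after the final silent span.
    if duration - cursor >= min_len:
        kept.append((cursor, duration))
    return kept


if __name__ == "__main__":
    print(keep_intervals(10.0, [(2.0, 3.0), (6.0, 7.5)]))
```

Each kept span can then be cut out and the pieces concatenated to produce a version of the video without the silent parts.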
This transforms video editing from a timeline-based workflow into a conversational interface.
How we built it
The system combines Gemini's multimodal reasoning with an FFmpeg-based video processing pipeline.
Pipeline:
1. User uploads a video.
2. User gives a voice command.
3. Gemini processes the speech and intent.
4. Audio is extracted from the video using FFmpeg.
5. A transcript is generated.
6. AI scoring identifies highlight segments.
7. The system cuts the best clips automatically.
8. Clips are returned to the user interface.
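The two FFmpeg steps in the pipeline (audio extraction and clip cutting) could look roughly like the sketch below. The function names, file paths, and the 16 kHz mono WAV format are assumptions chosen for illustration; the write-up does not say which FFmpeg options the project actually uses.

```python
import subprocess


def extract_audio_cmd(video_path: str, wav_path: str) -> list[str]:
    """Build an FFmpeg command that writes the audio track as 16 kHz mono WAV.

    The sample rate and channel count are assumptions; many speech-to-text
    systems accept this format, but the project may use something else.
    """
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        wav_path,
    ]


def cut_clip_cmd(video_path: str, start: float, end: float, out_path: str) -> list[str]:
    """Build an FFmpeg command that copies the span [start, end) into a new file."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",       # seek to the clip start
        "-i", video_path,
        "-t", f"{end - start:.3f}",  # clip duration in seconds
        "-c", "copy",                # stream copy: fast, but cuts snap to keyframes
        out_path,
    ]


def run(cmd: list[str]) -> None:
    """Run a command and raise if FFmpeg exits with an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Printing instead of running, so the sketch works without FFmpeg installed.
    print(" ".join(extract_audio_cmd("input.mp4", "audio.wav")))
    print(" ".join(cut_clip_cmd("input.mp4", 10, 25.5, "clip_01.mp4")))
```

Stream copy (`-c copy`) avoids re-encoding, which helps keep the interaction responsive, at the cost of cut points that are only accurate to the nearest keyframe. Re-encoding instead would give frame-accurate cuts but take longer.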
The frontend provides a simple interface for uploading videos and interacting with the AI agent in real time.
Challenges we ran into
The biggest challenge was interpreting ambiguous voice commands.
Commands like “Find the interesting part” do not include timestamps or precise instructions.
To solve this, we combined multiple signals:
- transcript sentiment
- speech intensity
- keyword detection
- contextual scoring
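One simple way to combine signals like these is a weighted sum per transcript segment. The sketch below assumes each signal has already been normalized to the range 0 to 1; the `Segment` fields, the weights, and the function names are hypothetical, and the project's actual scoring may work differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    sentiment: float  # transcript sentiment strength, 0..1
    intensity: float  # speech intensity (e.g. loudness), 0..1
    keyword: float    # keyword match strength, 0..1
    context: float    # contextual relevance to the command, 0..1


# Hypothetical weights; in practice these would be tuned on real footage.
WEIGHTS = {"sentiment": 0.3, "intensity": 0.3, "keyword": 0.2, "context": 0.2}


def score(seg: Segment) -> float:
    """Weighted sum of the four signals for one segment."""
    return sum(weight * getattr(seg, name) for name, weight in WEIGHTS.items())


def top_segments(segments: list[Segment], k: int = 3) -> list[Segment]:
    """Pick the k highest-scoring segments, returned in timeline order."""
    best = sorted(segments, key=score, reverse=True)[:k]
    return sorted(best, key=lambda s: s.start)


if __name__ == "__main__":
    segments = [
        Segment(0.0, 8.0, 0.1, 0.1, 0.1, 0.1),
        Segment(8.0, 20.0, 0.9, 0.9, 0.9, 0.9),
        Segment(20.0, 31.0, 0.5, 0.5, 0.5, 0.5),
    ]
    for seg in top_segments(segments, k=2):
        print(f"{seg.start:.1f}-{seg.end:.1f}s score={score(seg):.2f}")
```

A weighted sum is easy to inspect and adjust, which is useful when a vague command such as “Find the interesting part” has to be mapped to concrete timestamps.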
Another challenge was processing videos efficiently while maintaining responsive interaction.
Accomplishments that we're proud of
We built a working prototype where:
- video editing becomes voice-driven
- clips are generated automatically
- the agent responds conversationally
The system can successfully transform long videos into short highlight clips without manual editing.
What we learned
This project demonstrated how multimodal AI agents can transform creative workflows.
By combining voice interaction, video understanding, and automated editing, we created a system that removes much of the manual work from video editing.
What's next for VoxClip AI — Real-Time Voice Video Editing Agent
Future improvements include:
- emotion detection from video frames
- automatic subtitle generation
- social media clip formatting (Shorts, Reels, TikTok)
- multi-speaker highlight detection
- smarter conversational editing workflows
