How we used Gemini
CutPilot uses Google Gemini 2.5 Flash as the planning brain that converts a user's plain-language editing request into a structured JSON edit plan the app can execute. The main feature we use is Gemini's text generation (generateContent), guided by strong prompting so the response is JSON-only and matches a fixed schema (operations, timestamps, parameters, status). CutPilot sends Gemini the full editing context (video duration, selected timestamp range, chosen effects, available meme/SFX assets, and the user's instruction) so Gemini can produce time-accurate, intent-aware suggestions; for example, "make it viral" becomes captions, tighter pacing, punch-ins, and a color boost. Gemini also interprets human time formats like "0:45" or "the first 20 seconds," while CutPilot enforces safety rules such as clamping times to the video length and capping the operation count. Finally, Gemini's output is cleaned with safe JSON extraction and validated with Zod before anything reaches the renderer, so only valid, executable operations are accepted.
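The cleanup-and-clamp steps above (stripping markdown fences, extracting the JSON object, clamping timestamps, capping the operation count) can be sketched roughly like this. The field names `start`/`end` and the `MAX_OPERATIONS` value are illustrative assumptions, not CutPilot's actual schema:

```python
import json
import re

MAX_OPERATIONS = 12  # assumed cap; the real limit is CutPilot's own


def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a model response that may be
    wrapped in markdown fences or surrounded by stray prose."""
    # Strip ```json ... ``` fences if the model added them.
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start : end + 1])


def clamp_plan(plan: dict, duration: float) -> dict:
    """Clamp each operation's timestamps into [0, duration] and cap the count."""
    ops = plan.get("operations", [])[:MAX_OPERATIONS]
    for op in ops:
        op["start"] = min(max(op.get("start", 0.0), 0.0), duration)
        op["end"] = min(max(op.get("end", duration), 0.0), duration)
    return {**plan, "operations": ops}
```

In the real app this happens before schema validation, so even a plan that parses cleanly still has to pass the Zod schema before the renderer sees it.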
Inspiration
We were inspired by the tedious nature of video editing—hours spent trimming clips, adjusting transitions, and syncing audio. We imagined a world where creators could simply describe what they wanted and let AI handle the technical execution, democratizing professional-quality video production for everyone from YouTubers to educators.
What it does
CutPilot transforms video editing through natural language prompts. Users can upload raw footage and give commands like "remove all the ums and pauses," "add upbeat background music during the introduction," or "create a 30-second highlight reel of the best moments." The AI analyzes the video, understands the intent, and executes edits automatically—cutting, trimming, adding effects, and adjusting timing without manual timeline manipulation.
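A command like the ones above would come back from the planner as a structured edit plan. The exact fields shown here (`type`, `start`, `end`, `asset`, `reason`) are a hypothetical illustration of the schema, not CutPilot's actual format:

```json
{
  "status": "ok",
  "operations": [
    { "type": "cut", "start": 12.4, "end": 13.1, "reason": "filler word" },
    { "type": "add_music", "start": 0.0, "end": 18.0, "asset": "upbeat_intro" },
    { "type": "speed", "start": 30.0, "end": 45.0, "factor": 1.5 }
  ]
}
```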
How we built it
We built CutPilot using a combination of Python-based video processing libraries (FFmpeg, MoviePy), natural language processing models to parse editing instructions, and computer vision algorithms to analyze video content. The backend uses Gemini's API to interpret complex editing prompts and translate them into precise editing commands. We created a React-based frontend that provides a simple upload interface and a real-time preview of AI-suggested edits before final rendering.
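As a rough sketch of the last step of that pipeline, a single trim operation from the plan might be lowered to an FFmpeg invocation like this (an illustrative helper of our own, not CutPilot's actual code):

```python
def ffmpeg_trim_cmd(src: str, dst: str, start: float, end: float) -> list[str]:
    """Build an FFmpeg command that trims src to [start, end].

    -ss/-to select the segment; -c copy skips re-encoding, which keeps
    the operation fast but snaps the cut to the nearest keyframe.
    """
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",
        "-to", f"{end:.3f}",
        "-i", src,
        "-c", "copy",
        dst,
    ]
```

Returning an argument list rather than a shell string keeps the command safe to pass straight to `subprocess.run` without quoting issues.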
Challenges we ran into
The biggest challenge was mapping ambiguous creative language to specific technical operations—"make this more energetic" could mean dozens of different edits. We also struggled with processing speed for longer videos and ensuring frame-accurate cuts. Understanding context across an entire video (not just individual frames) proved computationally intensive, and we had to optimize our models extensively to keep processing times reasonable.
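Frame-accurate cuts ultimately come down to snapping plan timestamps to exact frame boundaries. A minimal sketch, assuming a known constant frame rate:

```python
def snap_cut(start: float, end: float, fps: float) -> tuple[float, float]:
    """Snap a cut's endpoints to exact frame boundaries so the edit lands
    on whole frames, keeping at least one frame between start and end."""
    s = round(start * fps)
    e = max(round(end * fps), s + 1)  # never emit a zero-length cut
    return s / fps, e / fps
```

Variable-frame-rate footage needs more care than this, since there is no single `fps` to snap against.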
Accomplishments that we're proud of
We're proud of achieving genuinely useful results from conversational prompts—our beta testers were able to complete editing tasks in minutes that previously took hours. We successfully implemented scene detection that understands pacing and emotional tone, and our audio cleanup feature removes filler words with impressive accuracy while maintaining natural speech rhythm.
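Filler-word removal of the kind described above can be sketched as computing keep-segments from word-level timings (such as those a speech-to-text pass produces). The padding value and the function's shape are our illustrative assumptions:

```python
def keep_segments(words, duration, fillers=frozenset({"um", "uh", "erm"}), pad=0.05):
    """Compute the time ranges to keep from (word, start, end) timings,
    dropping filler words with a small pad so cuts don't clip speech."""
    segments = []
    cursor = 0.0
    for word, start, end in words:
        if word.lower().strip(".,!?") in fillers:
            if start - pad > cursor:
                segments.append((cursor, start - pad))  # keep speech before the filler
            cursor = min(end + pad, duration)           # resume after the filler
    if cursor < duration:
        segments.append((cursor, duration))
    return segments
```

The resulting ranges could then be cut as subclips and concatenated; the small pad is what helps preserve natural speech rhythm around each removal.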
What we learned
We learned that video editing is as much about artistic intent as technical precision, and that teaching AI to understand creative vision requires extensive training on diverse editing styles. We also discovered the importance of giving users control—full automation isn't always desired, so we built in approval steps. Performance optimization for video processing taught us valuable lessons about efficient data handling and pipeline architecture.
What's next for CutPilot
Next, we're implementing multi-modal understanding so the AI can edit based on visual content and dialogue together (e.g., "cut to closeups whenever someone laughs"). We're building a collaborative feature where multiple users can give editing prompts on the same project, and exploring real-time editing for live streams. Long-term, we envision CutPilot becoming a complete post-production suite with color grading, motion graphics, and even AI-generated B-roll suggestions.