Problem Statement: What are you building and why?

Let's be real: traditional video editing software like Premiere Pro or DaVinci Resolve is super intimidating. The learning curve is steep, and scrubbing through a timeline just to find a specific 2-second clip eats up a ridiculous amount of time. We wanted to build something where you don't have to menu-dive or memorize keyboard shortcuts just to make a cool video.

Solution Overview: How does your project work?

Clippi is an AI-powered "vibe-based" video editor. Instead of messing with confusing tracks, you just drag your clips onto a node-based canvas, connect them, and then literally talk to your editor in plain English. You can just type "blur the background" or "where is the product demo?" and our AI engine routes the request, runs the heavy-lifting segmentation, and updates the video live.

Key Features: What are the main functionalities?

  • AI Chat Editor: Describe your edits in natural language, and Mistral AI parses your intent and executes the tools.
  • Video RAG (Semantic Search): Pixtral indexes your frames, so you can ask "where is the Rubik's Cube?" and instantly jump to that exact timestamp.
  • Object-Aware Editing: We use SAM 2 to automatically segment and track objects in the frame, allowing you to easily blur backgrounds or isolate speakers.
  • Node-Based Flow: A non-destructive, visual canvas where you drag clips and connect them with edges that act as transitions (cuts, fades, etc.).
  • Pro Audio & Dubbing: Integrated with ElevenLabs for AI voiceovers, sound effects, and auto-captions.
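
To make the Video RAG feature above a bit more concrete, here is a minimal sketch of a timestamp lookup. It is not Clippi's actual implementation: it assumes the vision model has already produced one text caption per sampled frame, and it swaps in simple word overlap for whatever matching the real index uses. `IndexedFrame`, `tokenize`, and `find_timestamp` are hypothetical names.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class IndexedFrame:
    timestamp_s: float  # position of the sampled frame in the video
    caption: str        # description assumed to come from the vision model


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_timestamp(query: str, index: list[IndexedFrame]) -> float | None:
    """Return the timestamp of the frame whose caption shares the most words
    with the query, or None when no caption shares any word."""
    query_words = tokenize(query)
    best_frame, best_score = None, 0
    for frame in index:
        score = len(query_words & tokenize(frame.caption))
        if score > best_score:
            best_frame, best_score = frame, score
    return best_frame.timestamp_s if best_frame else None


# Made-up captions, purely for illustration.
index = [
    IndexedFrame(0.0, "a person sitting at a desk"),
    IndexedFrame(4.0, "a hand holding a rubik's cube"),
    IndexedFrame(8.0, "a laptop showing a product demo"),
]
print(find_timestamp("where is the rubik's cube?", index))
```

A real index would more likely compare embeddings than raw words, but the shape stays the same: score every indexed frame against the query and jump to the best timestamp.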

Technologies Used: Tools, frameworks, APIs, etc.

  • Frontend: React (Vite), Tailwind CSS, Zustand, Xyflow (React Flow, for the node canvas), and Remotion (for the video engine).
  • Backend: FastAPI (Python), FFmpeg, MoviePy, and OpenCV running on a Brev.dev GPU instance.
  • AI Models: Mistral AI (tool routing), Meta's SAM 2 (segmentation), Pixtral Vision (indexing), and ElevenLabs (audio).
  • Observability: Weights & Biases (WandB) to trace MCP tool calls and catch edits the LLM hallucinated.

Target Users: Who is this for?

Content creators, social media managers, students, and basically anyone who wants to pump out high-quality, professional-looking videos without spending 100 hours learning complex editing suites.

Inspiration

We were trying to edit some footage for a previous project and spent hours just scrubbing through a timeline and struggling to manually mask an object. We literally thought, "Why can't we just tell the software what we want?" With LLMs getting so good at function calling, we realized we could build an editor that you can actually converse with.

What it does

Clippi completely changes the editing workflow. You drop your raw videos onto a web canvas and link them up like a flowchart. Then, you just use the chatbox to describe your edits. Mistral AI translates your English into FFmpeg actions and SAM 2 segmentation masks. Want a cinematic golden hour filter and a voiceover? Just ask the AI, and it processes it live. When you're done, the backend stitches everything together into a final MP4.
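
As a rough illustration of the "English into FFmpeg actions" step, here is a minimal sketch that turns one structured edit action into an FFmpeg argument list. The action format, the `FILTERS` table, and `build_ffmpeg_command` are assumptions made up for this example, not Clippi's real schema; the FFmpeg flags and filters themselves (`-ss`, `-to`, `-vf`, `boxblur`, `hue`, `fade`) are standard.

```python
from __future__ import annotations

import shlex

# Hypothetical mapping from an edit name to an FFmpeg video filter string.
FILTERS = {
    "blur": "boxblur=10:1",
    "grayscale": "hue=s=0",
    "fade_in": "fade=t=in:st=0:d=1",
}


def build_ffmpeg_command(action: dict, src: str, dst: str) -> list[str]:
    """Turn one structured edit action into an FFmpeg argument list."""
    cmd = ["ffmpeg", "-y", "-i", src]
    if action["type"] == "trim":
        # -ss / -to after the input keep only the given range (in seconds).
        cmd += ["-ss", str(action["start"]), "-to", str(action["end"])]
    elif action["type"] == "filter":
        # Apply a video filter and copy the audio stream untouched.
        cmd += ["-vf", FILTERS[action["name"]], "-c:a", "copy"]
    else:
        raise ValueError(f"unsupported action type: {action['type']!r}")
    cmd.append(dst)
    return cmd


trim = build_ffmpeg_command({"type": "trim", "start": 2, "end": 5}, "in.mp4", "out.mp4")
print(shlex.join(trim))
```

Building the command as a list (and passing it to something like `subprocess.run`) avoids shell-quoting problems when file names contain spaces.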

How we built it

We split the architecture into a lightweight local frontend and a heavy-duty GPU backend. We built the UI with React and Xyflow for the drag-and-drop nodes, using Remotion to handle the live video previews.
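
One way the node canvas could be handed to the backend is as a flat list of clips and edges. The sketch below assumes a single linear chain of clips and a hypothetical `transition` field on each edge (the `source` and `target` keys mirror how the canvas describes edges); it is an illustration, not the project's actual export format.

```python
from __future__ import annotations


def linearize(nodes: list[str], edges: list[dict]) -> list[tuple[str, str | None]]:
    """Follow edges from the one clip with no incoming edge and return
    (clip_id, transition_into_next_clip) pairs; the last clip gets None."""
    outgoing = {edge["source"]: edge for edge in edges}
    targets = {edge["target"] for edge in edges}
    starts = [node for node in nodes if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting clip")

    timeline: list[tuple[str, str | None]] = []
    visited: set[str] = set()
    current: str | None = starts[0]
    while current is not None:
        if current in visited:
            raise ValueError("edges form a cycle")
        visited.add(current)
        edge = outgoing.get(current)
        timeline.append((current, edge["transition"] if edge else None))
        current = edge["target"] if edge else None
    return timeline


# Edges are deliberately listed out of order; the walk restores the sequence.
nodes = ["intro", "demo", "outro"]
edges = [
    {"source": "demo", "target": "outro", "transition": "fade"},
    {"source": "intro", "target": "demo", "transition": "cut"},
]
print(linearize(nodes, edges))
```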

For the backend, we used FastAPI running on a Brev.dev GPU instance to handle the massive compute required by SAM 2. Mistral AI acts as the brain, taking user prompts and triggering the right Python tools (which wrap things like FFmpeg or the ElevenLabs API). To catch hallucinated tool calls, we built a really cool closed feedback loop using Weights & Biases to log and verify every single action.
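
The routing-and-verification idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tool names, the `tool_call` dictionary shape, and the in-memory `audit_log` are all hypothetical, and a plain list stands in for the Weights & Biases traces.

```python
from __future__ import annotations

from typing import Callable


def blur_background(clip_id: str, strength: int = 10) -> str:
    """Stand-in tool; a real one would run segmentation and FFmpeg."""
    return f"blurred background of {clip_id} (strength={strength})"


def trim_clip(clip_id: str, start: float, end: float) -> str:
    """Stand-in tool; a real one would cut the clip."""
    return f"trimmed {clip_id} to {start}-{end}s"


# Only tools registered here can ever be executed.
TOOLS: dict[str, Callable[..., str]] = {
    "blur_background": blur_background,
    "trim_clip": trim_clip,
}

audit_log: list[dict] = []  # stand-in for an external tracing service


def dispatch(tool_call: dict) -> str | None:
    """Run a model-requested tool if it exists and its arguments fit;
    record every attempt, accepted or rejected, in the audit log."""
    name = tool_call.get("name")
    args = tool_call.get("arguments", {})
    tool = TOOLS.get(name)
    if tool is None:
        audit_log.append({"name": name, "status": "rejected", "reason": "unknown tool"})
        return None
    try:
        result = tool(**args)
    except TypeError as exc:  # wrong or missing arguments
        audit_log.append({"name": name, "status": "rejected", "reason": str(exc)})
        return None
    audit_log.append({"name": name, "status": "ok", "result": result})
    return result


print(dispatch({"name": "blur_background", "arguments": {"clip_id": "clip1"}}))
print(dispatch({"name": "add_explosion", "arguments": {}}))  # not a real tool
```

Because every attempt lands in the log, a reply that claims an edit happened can be checked against what actually ran.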

Challenges we ran into

Getting SAM 2 to run fast enough on the backend without timing out the frontend was a massive headache. Video processing is super compute-heavy, so we had to optimize our FFmpeg pipelines aggressively. Another huge challenge was the LLM reporting function calls that never actually ran, which is exactly why we ended up integrating W&B to strictly audit the MCP traces.

Accomplishments that we're proud of

We are incredibly proud that we managed to build a fully functional, node-based video editor right in the browser. Seeing the system actually understand a prompt like "blur the background", perfectly run SAM 2 to segment the speaker, and spit out the edited video was a mind-blowing moment for the whole team.
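
For a sense of what "blur the background" boils down to once a segmentation mask exists, here is a tiny pure-Python sketch on a grayscale grid. It assumes the mask (1 for the subject, 0 for the background) has already been produced by the segmentation model; `box_blur` and `blur_outside_mask` are hypothetical helpers, and a real pipeline would use an image library on full frames rather than nested lists.

```python
from __future__ import annotations


def box_blur(image: list[list[float]], radius: int = 1) -> list[list[float]]:
    """Replace each pixel with the mean of its square neighbourhood,
    clipping the window at the image borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def blur_outside_mask(image, mask, radius: int = 1):
    """Keep original pixels where mask is 1 (the subject) and use the
    blurred pixels where mask is 0 (the background)."""
    blurred = box_blur(image, radius)
    return [
        [image[y][x] if mask[y][x] else blurred[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]


# A single bright "subject" pixel in the middle of a dark frame.
image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
result = blur_outside_mask(image, mask)
for row in result:
    print(row)
```

The subject pixel keeps its original value while the background around it is smoothed, which is the same compositing idea applied per frame in a video.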

What we learned

We leveled up our skills massively. We learned the deep intricacies of FFmpeg, how to deploy and manage heavy machine learning models (SAM 2) on remote GPU instances, and how to string together complex function-calling with Mistral. We also got a crash course in advanced React state management using Zustand to keep the node canvas and video player in perfect sync.

What's next for Clippi

We want to add a dedicated multi-track audio interface, direct integrations to publish straight to TikTok and YouTube Shorts, and potentially a mobile app version so you can vibe-edit straight from your phone!

Built With

  • brev.dev
  • elevenlabs
  • fastapi
  • ffmpeg
  • mistral
  • moviepy
  • opencv
  • pixtral-vision
  • react
  • react-flow
  • remotion
  • sam-2
  • weights&biases
  • zustand