Inspiration

Finding a 30-second clip for TikTok means content creators spend hours sifting through 4-hour livestreams. Current "AI video clippers" attempt to address this, but they are essentially deaf and blind: they rely on lazy heuristics, such as cutting whenever someone laughs loudly or the audio spikes. The cultural background, the inside jokes, and the video's true lore are entirely lost on them.

I was motivated to create a system that watches and comprehends videos the way a human editor does, rather than merely "processing" them. The goal was an Agentic Video Director: instead of a rigid, linear script, a multi-agent swarm that can recognize celebrities, comprehend context, and even grade its own work before rendering the final cut.

What it does

Clipper-Agent-Application is a fully autonomous AI video director built on LangGraph. Give it a YouTube URL and it takes care of the rest:

  • Global Vision & Identity: Rather than using generic descriptions, it scans thumbnails and frames using Vision AI (Nemotron) to instantly identify specific celebrities (such as MrBeast or IShowSpeed) and the spatial context.

  • Lore Generation: To ensure that the AI comprehends the inside jokes of the stream, it cross-references the transcript with the visual data to create a "Style Guide" and contextual lore.

  • Parallel Worker Swarm: It creates a swarm of parallel AI workers to concurrently search for viral hooks in lengthy videos by chunking the transcript.

  • The Critic Loop: Workers don't simply output clips; they submit them to an internal AI Critic, which scores each clip out of 10 on its viral potential. If the score is low, the worker tries again.

  • Dynamic Directing: After approval, the AI gives the clip a "spatial bias" (such as face_lock or pan_left) that tells the MoviePy render engine precisely how to crop the 16:9 video into a 9:16 viral short.

  • The "Glass Box" UI: No staring at a loading spinner. Our real-time React user interface visualizes the LangGraph Directed Acyclic Graph (DAG), letting users watch the AI's neural pathways firing, parallel nodes sprouting, and the critic loop flashing red or green as it grades clips in real time.

How we built it

I built the backend with Python and FastAPI, using LangGraph as the primary orchestration engine to manage the complex state and DAG logic.
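The node-and-edge flow that LangGraph manages can be illustrated with a stdlib-only sketch. LangGraph's real StateGraph API adds reducers, conditional edges, and checkpointing; the node names and logic here are hypothetical. Note that each node returns only a partial state update, which the runner merges:

```python
from typing import Callable

def fetch_assets(state: dict) -> dict:
    # Stand-in for the yt-dlp / transcript retrieval node.
    return {"transcript": f"transcript of {state['url']}"}

def find_hooks(state: dict) -> dict:
    # Stand-in for the worker nodes that hunt for viral hooks.
    return {"hooks": [w for w in state["transcript"].split() if len(w) > 6]}

def render(state: dict) -> dict:
    # Stand-in for the MoviePy render node.
    return {"clips": [f"clip:{h}" for h in state["hooks"]]}

NODES: dict[str, Callable[[dict], dict]] = {
    "fetch_assets": fetch_assets,
    "find_hooks": find_hooks,
    "render": render,
}
# Linear edge map; the real graph has branches and a critic feedback loop.
EDGES = {"fetch_assets": "find_hooks", "find_hooks": "render", "render": None}

def run_graph(state: dict, entry: str = "fetch_assets") -> dict:
    """Walk the edge map, merging each node's partial update into the state."""
    node = entry
    while node is not None:
        state = {**state, **NODES[node](state)}  # partial-update merge
        node = EDGES[node]
    return state
```

This partial-update discipline (each node owning its own keys) is the same property that matters for the concurrent-state issues described under Challenges.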

  • Vision & NLP: I used OpenCV to extract frames and integrated yt-dlp for fast asset retrieval. Nvidia Nemotron handled the vision-language tasks (face identification and spatial framing), while StepFun/OpenAI models processed the large transcript chunks to identify narrative hooks.

  • Video Rendering: I used MoviePy for final assembly and wrote custom cropping math to handle the dynamic 9:16 crop based on the AI's "spatial bias" decisions.

  • Frontend: The frontend is built with React and Tailwind CSS. I implemented a polling system that parses terminal logs from the FastAPI backend in real time; a custom Regex parser routes these logs to specific UI elements, dynamically animating the parallel worker cards and lighting up the active nodes.

Challenges we ran into

  • Concurrent State Overwrites: In LangGraph, having multiple worker nodes write to the global state simultaneously caused catastrophic InvalidUpdateError crashes. I had to heavily refactor the subgraph to ensure nodes only returned specific, isolated key updates rather than the whole state.

  • API Rate Limits & Hallucinations: When 5 parallel workers hit the LLM API at once, I got rate-limited immediately. I engineered a staggered jitter/delay system to space out requests. Additionally, the Vision AI initially gave useless descriptions (e.g., "man in red shirt"). I had to engineer a strict "Celebrity Expert" system prompt to force it to identify named entities.

  • UI/Backend Synchronization: Syncing the exact state of a deeply nested Python graph to a React frontend was incredibly difficult. I solved this by designing a custom logging protocol that the UI could parse via Regex to isolate state changes to specific UI worker cards.
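The staggered jitter/delay fix for the rate-limit problem above can be sketched as follows (the base delay and jitter constants are illustrative, not the tuned values):

```python
import random
import time

def staggered_delay(worker_index: int, base: float = 1.0,
                    jitter: float = 0.5) -> float:
    """Per-worker start delay so parallel workers don't hit the API at once.

    Each worker waits `index * base` seconds plus a random jitter, spreading
    the burst of first requests across time.
    """
    return worker_index * base + random.uniform(0, jitter)

def launch_worker(worker_index: int) -> None:
    # Sleep before the first API call; subsequent calls inherit the offset.
    time.sleep(staggered_delay(worker_index))
    # ... call the LLM API here ...
```

The jitter term matters even with the staggering: without it, retries after a shared failure re-synchronize and hammer the API in lockstep again.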
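The custom logging protocol and Regex routing can be sketched like this. The log format and field names are assumptions, and the real parsing happens in the React frontend; it is shown in Python here for consistency with the other sketches:

```python
import re

# Hypothetical log line format, e.g.:
#   "[WORKER-3] CRITIC score=2 verdict=REJECTED"
LOG_PATTERN = re.compile(
    r"\[WORKER-(?P<worker>\d+)\]\s+CRITIC\s+"
    r"score=(?P<score>\d+)\s+verdict=(?P<verdict>\w+)"
)

def route_log(line: str) -> dict | None:
    """Parse one backend log line into an event a UI worker card can consume."""
    m = LOG_PATTERN.search(line)
    if not m:
        return None  # unrecognized lines are ignored by the UI
    return {
        "worker": int(m.group("worker")),
        "score": int(m.group("score")),
        "approved": m.group("verdict") == "APPROVED",
    }
```

The `worker` field is what lets the UI isolate a state change to one specific worker card instead of repainting the whole graph.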

Accomplishments that we're proud of

  • The Internal Critic Loop: Moving away from "one-shot" AI generation. Building a system where the AI actually evaluates its own outputs, grades them, and self-corrects is a massive leap in output quality.

  • The Real-Time Node UI: Translating a complex backend architecture into a beautiful, pulsing, interactive "Glass Box." Watching the parallel workers spawn and the critic scores pop up (e.g., "Score: 2/10 -> Rejected") makes the waiting process incredibly engaging.

  • True Spatial Awareness: The AI doesn't just crop the center of the screen; it actively decides how to frame the shot based on the context of the scene.

What we learned

I learned that the future of AI isn't just better models; it's better orchestration. Mastering LangGraph taught me how to move from imperative programming to state-driven agentic flows. I also learned advanced prompt engineering techniques, specifically how to force LLMs to adhere to strict JSON schemas and how to give Vision models "roles" to drastically improve their utility. Finally, I learned how to safely manage asynchronous, parallel video processing in Python without blowing up our memory limits.

What's next for Clipper-Agent-Application

  • Multi-Modal Critique: Currently, the Critic grades the transcript and concept. Next, I want the Critic to watch the actual rendered MP4 clip using Vision AI to grade the pacing and visual engagement.

  • Dynamic B-Roll & Transitions: Allowing the "Director Node" to automatically pull relevant images from the web (e.g., DuckDuckGo) to overlay as B-roll when the subject is just talking.

  • One-Click Publishing: Integrating the TikTok and YouTube Shorts APIs so the agent can automatically publish the top-scoring clips with the generated viral titles and hashtags.
