Inspiration
Video editing takes forever. Most of us have raw footage sitting on our drives that could become amazing content, but the barrier to entry is high—either you spend days learning editing software or months paying editors. We thought: what if you could just describe what you want, and an AI agent actually builds it for you?
What it does
VisualBox is an intent-driven, AI-powered video editor. You describe what you want to create in plain English, and the system uses multimodal understanding to analyze your media and capture the spatio-temporal structure of your footage. Based on your intent, it produces an initial, high-quality edit automatically.
VisualBox includes a dedicated studio workspace where you can interact with the same editor agent—refining the video through natural language commands or manual adjustments, with full creative control. Over time, the editor learns from your behavior, adapts to different editing strategies, and continuously acquires new skills to better match your workflow.
How we built it
Unlike traditional AI video editors that rely on transcripts and timestamps, VisualBox is built around spatio-temporal understanding of video. Instead of treating video as text with timecodes, the system analyzes it as a dynamic signal over time. Using multimodal reasoning with Google’s Gemini models, it directly understands visuals, motion, audio, and overall structure.
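As a minimal sketch of the idea (not our exact integration, which is also wired through OpenRouter; the SDK, model name, and prompt below are placeholders), a multimodal request passes raw video bytes alongside a text prompt and gets back a structural description:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Minimal sketch: ask a Gemini model to describe a clip's structure from raw
// video bytes plus a text prompt. Model name and prompt are placeholders.
async function describeClipStructure(base64Video: string, apiKey: string): Promise<string> {
  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

  const result = await model.generateContent([
    { inlineData: { mimeType: "video/mp4", data: base64Video } },
    { text: "Summarize the scenes, camera motion, and audio events with rough timestamps." },
  ]);
  return result.response.text();
}
```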
The editor is built in React with a fully interactive timeline. All uploaded media is stored locally in IndexedDB, allowing projects to persist across sessions without requiring files to be reprocessed. This design keeps the editor responsive while supporting iterative editing workflows.
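A simplified sketch of the persistence idea is below; the database and store names are illustrative, not our actual schema:

```typescript
// Illustrative only: persist an uploaded media file as a Blob in IndexedDB so
// it survives page reloads without being re-uploaded or re-processed.
function openMediaDB(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("visualbox-media", 1);
    req.onupgradeneeded = () => {
      req.result.createObjectStore("media", { keyPath: "id" });
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveMedia(id: string, file: File): Promise<void> {
  const db = await openMediaDB();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("media", "readwrite");
    tx.objectStore("media").put({ id, name: file.name, type: file.type, blob: file });
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```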
At the core of the system is an agent-oriented architecture. A single orchestrator agent coordinates multiple specialized subagents, each responsible for a narrowly scoped task. This decomposition improves reliability and avoids the brittleness common in monolithic AI editors.
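In sketch form, the wiring looks roughly like this (all names and types are illustrative, not our actual interfaces):

```typescript
// Illustrative only: the orchestrator owns the sequence, each subagent owns
// one narrowly scoped task.
interface MediaRef { id: string; kind: "video" | "audio" | "image" }
interface Analysis { summary: string }
interface EditPlan { steps: string[] }
interface Timeline { tracks: unknown[] }

interface Subagent<In, Out> {
  name: string;
  run(input: In): Promise<Out>;
}

class Orchestrator {
  constructor(
    private analyze: Subagent<MediaRef[], Analysis>,
    private plan: Subagent<{ intent: string; analysis: Analysis }, EditPlan>,
    private execute: Subagent<EditPlan, Timeline>,
  ) {}

  // One pass of the pipeline: analyze the media, plan against the user's
  // intent, then hand the plan to the editing agent for execution.
  async edit(intent: string, media: MediaRef[]): Promise<Timeline> {
    const analysis = await this.analyze.run(media);
    const editPlan = await this.plan.run({ intent, analysis });
    return this.execute.run(editPlan);
  }
}
```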
When an edit is requested, a media analysis and clip extraction agent examines the footage using adaptive depth analysis. Depending on the task, it can perform quick scans or deep semantic analysis. Every analysis result is cached, reducing redundant computation and token usage. The agent reasons across multiple resolutions, similar to zooming in and out on an editing timeline, moving from high-level structure down to precise clip boundaries.
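A simplified sketch of the caching idea follows; the function names and prompts are hypothetical:

```typescript
// Illustrative caching layer: results are keyed by media id and analysis depth
// so repeated requests don't re-spend tokens on the same footage.
type Depth = "quick" | "deep";

const analysisCache = new Map<string, Promise<string>>();

function analyzeMedia(
  mediaId: string,
  depth: Depth,
  runGemini: (prompt: string) => Promise<string>,
): Promise<string> {
  const key = `${mediaId}:${depth}`;
  if (!analysisCache.has(key)) {
    const prompt =
      depth === "quick"
        ? "List scene boundaries and a one-line summary per scene."
        : "Describe each scene in detail: subjects, motion, audio, and candidate cut points.";
    analysisCache.set(key, runGemini(prompt));
  }
  return analysisCache.get(key)!;
}
```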
If a reference video is provided, a style extraction agent analyzes it to infer pacing, transitions, and editing techniques. Rather than copying styles directly, the agent translates these observations into reusable editing principles that generalize across different content.
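Conceptually, the output is closer to a small style profile than a list of copied timestamps; the shape below is illustrative:

```typescript
// Hypothetical shape of what style extraction produces: generalized editing
// principles rather than raw timings lifted from the reference video.
interface StyleProfile {
  averageShotLengthSec: number; // pacing, e.g. 1.8 for fast cuts
  transitions: string[];        // e.g. ["hard cut", "whip pan", "J-cut audio"]
  principles: string[];         // reusable rules, e.g. "cut on motion peaks"
}
```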
The editing agent plans and executes edits in a structured, top-down manner. It first defines the rough structure of the video, then refines the timeline track by track, executing edits sequentially from left to right. The agent can invoke editing tools and other subagents, allowing it to plan at a high level while executing at a low level. When new editing patterns are discovered, it updates its internal editing knowledge for future use.
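A rough sketch of that plan-then-execute structure is shown below; the field names are hypothetical:

```typescript
// Illustrative top-down plan: a rough section outline first, then per-track
// operations executed sequentially, left to right along the timeline.
interface TimelineOp {
  track: number;     // which track the operation targets
  startSec: number;  // position on the timeline
  action: "placeClip" | "trim" | "addTransition" | "addText";
  args: Record<string, unknown>;
}

interface TopDownPlan {
  outline: string[]; // e.g. ["hook", "montage", "call to action"]
  ops: TimelineOp[]; // refined operations produced from the outline
}

async function executePlan(plan: TopDownPlan, apply: (op: TimelineOp) => Promise<void>) {
  // Execute sequentially so each operation sees the timeline state produced by
  // the previous one, ordered by track and then by start time.
  const ordered = [...plan.ops].sort((a, b) => a.track - b.track || a.startSec - b.startSec);
  for (const op of ordered) {
    await apply(op);
  }
}
```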
Every edit is reviewed by a verification agent that checks for visual artifacts, timing issues, and alignment with the user’s intent. Any detected problems are fed back to the editing agent for correction before the workflow continues, forming a closed feedback loop.
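In sketch form, that loop is bounded so it cannot run forever (all names here are illustrative):

```typescript
// A sketch of the verify-and-correct loop with a capped number of rounds.
interface Issue { kind: "artifact" | "timing" | "intent"; detail: string }

async function editWithVerification<T>(
  produceDraft: (feedback: Issue[]) => Promise<T>, // editing agent
  verify: (draft: T) => Promise<Issue[]>,          // verification agent
  maxRounds = 3,
): Promise<T> {
  let draft = await produceDraft([]);
  for (let round = 0; round < maxRounds; round++) {
    const issues = await verify(draft);
    if (issues.length === 0) break;     // nothing to fix, accept the draft
    draft = await produceDraft(issues); // feed problems back for correction
  }
  return draft;
}
```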
Once an initial draft is produced, users enter a studio screen where the same editor agent remains available. Users can refine edits using natural language commands or by manually adjusting the timeline, combining AI-driven structure with full creative control.
Challenges we ran into
One of the biggest challenges was making structured AI outputs reliable. Gemini 3 Flash Preview turned out to be far less dependable at producing structured responses than expected. We introduced an additional Gemini call just to handle the structured output safely, but that approach failed as well. As a temporary workaround, the system now operates on plain-text outputs instead of strict structured formats.
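The fallback looks roughly like this (a sketch, not our exact parser):

```typescript
// Sketch of the workaround: try to recover structured JSON from the model's
// reply, and fall back to treating it as plain text when parsing fails.
type ModelReply<T> = { kind: "structured"; value: T } | { kind: "plainText"; text: string };

function parseModelReply<T>(raw: string): ModelReply<T> {
  // Models often wrap JSON in markdown fences or surrounding prose.
  const match = raw.match(/\{[\s\S]*\}/);
  if (match) {
    try {
      return { kind: "structured", value: JSON.parse(match[0]) as T };
    } catch {
      // fall through to plain text
    }
  }
  return { kind: "plainText", text: raw };
}
```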
Another major challenge was coordinating multiple AI agents so they could work together smoothly without hallucinating, conflicting with one another, or getting stuck in execution loops.
Providing enough context for the AI to correctly infer editing intent from natural language descriptions was also difficult. Editing instructions are often ambiguous, and translating them into precise, actionable timeline operations required careful prompt and system design.
Finally, managing file storage and caching efficiently was critical to keeping the application fast. Media files needed to persist across sessions without unnecessary reprocessing, while still allowing rapid access and iteration during editing workflows.
Accomplishments that we're proud of
We successfully implemented an agent-oriented architecture that decomposes the editing process into clear stages such as media analysis, style extraction, planning, execution, and verification. This separation of responsibilities made the system more reliable and easier to evolve compared to a single monolithic AI workflow.
The editor supports a natural, conversational editing loop. Users can iteratively refine their videos by talking to the AI as they would to a human editor, while still retaining full manual control through the timeline when needed.
All of this runs entirely in the browser, with persistent local storage and no heavy backend dependencies. This significantly reduces system complexity while still enabling fast, iterative editing workflows.
What we learned
AI agents are far more effective when they are given narrow roles, clear constraints, and well-defined inputs and outputs, rather than being asked to handle the entire problem space at once. We also learned that video editing is inherently subjective: to produce useful results, the AI must understand the user's creative intent and preferences, not just technical requirements like duration or format.
What's next for VisualBox
We plan to expand beyond editing into generation by adding native video generation capabilities, so AI-generated clips can be created, edited, and merged directly into the same timeline. This enables more creative workflows where generated content and real footage coexist rather than living in separate tools.

Another focus is strengthening agent reliability and coordination: better guardrails for structured outputs, improved verification passes, and more robust recovery mechanisms when agents disagree or fail.

We also want to deepen the editor's understanding of style and intent over time, with richer style modeling, better reuse of learned editing patterns, and more consistent behavior across projects.
Built With
- gemini3
- openrouter
- react
- tailwind
- twick
- typescript
- vite