VideoGen-Agent

💡 Inspiration

We live in a world driven by video, yet creating engaging video content from static assets remains a massive bottleneck. Whether you are a student sharing a presentation, a small business promoting a product, or a creator organizing vacation photos, the process is painstakingly manual. You have to arrange images, write a script, hunt for background music, record voiceovers, and manually keyframe pan-and-zoom effects in a video editor.

We were inspired to solve this by asking a simple question: What if we could build an autonomous, AI-powered film director? We wanted to create a system that doesn't just slap a filter on your photos, but actually understands them, rearranges them to tell a logical story, and outputs a cinematic video. Thus, the Agentic Video Creator (VideoGen-Agent) was born.

⚙️ What it does

VideoGen-Agent is a six-agent autonomous pipeline that transforms an unordered folder of static images into a fully narrated, cinematic video. Its standout feature is Context-Aware Narrative Ordering: instead of playing images in the order they were uploaded, the pipeline analyzes the semantic content of each photo and reorders the set to craft a logical, emotionally resonant story with a beginning, middle, and end.

🛠️ How we built it

We engineered a true sequential multi-agent architecture divided into three layers:

The Cognitive Layer:
- Enhancement Agent: Intercepts raw images and applies professional upscaling/color correction via the Perfect Corp API (or a Pillow fallback).
- Vision Agent: Uses Gemma 3 12B (via Featherless.ai) to extract rich semantic metadata (subjects, mood, setting) from the base64-encoded images.
- Story Agent: Powered by DeepSeek-V3, this agent acts as the screenwriter. It analyzes all image metadata simultaneously, determines the logical sequence, and writes scene-by-scene voiceover scripts.
The Synthesis Layer:
- Voiceover Agent: Uses a dual-engine setup (Microsoft edge-tts primary, ElevenLabs fallback) to generate natural MP3 narration for every scene.
- Music Agent: Uses a rule-based algorithm to match the Story Agent's generated "mood" to royalty-free background tracks.
The Output Layer:
- Director Agent: Uses MoviePy and FFmpeg to programmatically assemble the video. It measures the exact duration of each scene's narration to set the scene length, ducks the background music volume dynamically under the narration, and overlays subtitles.
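
The hand-off between the Story, Music, and Director agents can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the JSON shape (`image_index`, `script`, `mood`), the `Scene` class, the track file names, and the three helper functions are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mood-to-track table; the file names are placeholders.
MOOD_TRACKS = {
    "joyful": "music/upbeat.mp3",
    "calm": "music/ambient.mp3",
    "dramatic": "music/cinematic.mp3",
}
DEFAULT_TRACK = "music/neutral.mp3"


@dataclass
class Scene:
    image: str       # path to the enhanced image
    script: str      # voiceover text written by the Story Agent
    duration: float  # narration length in seconds
    start: float = 0.0


def order_scenes(images, story, narration_seconds):
    """Reorder uploaded images using the sequence chosen by the Story Agent."""
    return [
        Scene(image=images[item["image_index"]], script=item["script"], duration=secs)
        for item, secs in zip(story["scenes"], narration_seconds)
    ]


def pick_track(mood):
    """Rule-based lookup: normalise the mood string, fall back to a default."""
    return MOOD_TRACKS.get(mood.strip().lower(), DEFAULT_TRACK)


def lay_out_timeline(scenes):
    """Start each scene when the previous narration ends; return total length."""
    clock = 0.0
    for scene in scenes:
        scene.start = clock
        clock += scene.duration
    return clock


# Example data (assumed): upload order differs from story order.
images = ["beach.jpg", "airport.jpg", "sunset.jpg"]
story = {
    "mood": "Joyful",
    "scenes": [
        {"image_index": 1, "script": "The trip begins."},
        {"image_index": 0, "script": "First day at the beach."},
        {"image_index": 2, "script": "A quiet ending."},
    ],
}
narration_seconds = [2.5, 3.0, 2.0]  # would be measured from the generated MP3s

scenes = order_scenes(images, story, narration_seconds)
total = lay_out_timeline(scenes)
track = pick_track(story["mood"])
```

Keeping the ordering, music choice, and timing as plain functions over structured data means the LLM output only decides what goes where, while the timing arithmetic stays deterministic.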

🚧 Challenges we ran into

Model Reliability and Concurrency: Initially, we tried heavier multimodal models (such as Kimi-K2) for the Vision Agent, but we repeatedly hit rate limits and 503 capacity errors that broke the pipeline. We solved this by switching to Gemma 3 12B, which costs only 1 concurrency point and stays "warm," making the Vision Agent far more reliable.

Programmatic Cinematic Effects: Applying a dynamic Ken Burns (pan and zoom) effect to images programmatically without cropping out subjects was highly complex. We had to calculate bounding box scaling dynamically. To ensure smooth scaling $S(t)$ over time $t$, we implemented a linear interpolation in which the scale factor grows from 1.0 up to a maximum zoom $Z_{max}$:

$$ S(t) = 1.0 + \left( Z_{max} - 1.0 \right) \left( \frac{t}{T_{total}} \right), \quad 0 \le t \le T_{total} $$

where $T_{total}$ is the total duration of the scene, determined by the voiceover length.
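
As a rough illustration of the formula, the sketch below computes $S(t)$ and the matching crop window in plain Python. The function names, the default `z_max` of 1.2, and the choice of a centered crop are assumptions made for the example; the project's own bounding box logic for keeping subjects in frame is not shown here.

```python
def zoom_scale(t, total, z_max=1.2):
    """Linear zoom S(t) = 1 + (z_max - 1) * t / total, with t clamped to [0, total]."""
    if total <= 0:
        raise ValueError("total must be positive")
    progress = min(max(t / total, 0.0), 1.0)
    return 1.0 + (z_max - 1.0) * progress


def crop_box(width, height, scale):
    """Centered crop that zooms by `scale` once resized back to (width, height)."""
    crop_w, crop_h = width / scale, height / scale
    left = (width - crop_w) / 2
    top = (height - crop_h) / 2
    return (left, top, left + crop_w, top + crop_h)


# Halfway through a 4-second scene with a maximum zoom of 1.5:
mid_scale = zoom_scale(2.0, 4.0, z_max=1.5)
mid_box = crop_box(1920, 1080, mid_scale)
```

Clamping `t` keeps the scale within $[1.0, Z_{max}]$ even if the renderer asks for a frame slightly past the end of the scene.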

🏆 Accomplishments that we're proud of

We are incredibly proud of the orchestration of the agents. Building a system where 6 different AI agents pass structured JSON and media files to one another without failure required rigorous error handling and fallback mechanisms. The fact that a user can drop in 5 completely random photos and get back a video that actually makes sense narratively is a huge achievement.
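
The primary-then-fallback pattern used by agents such as Enhancement and Voiceover can be sketched as a small helper. This is an illustrative sketch only: `with_fallback`, `primary_tts`, and `fallback_tts` are hypothetical names, and the two stand-in functions do not call any real TTS service.

```python
import logging

logger = logging.getLogger("pipeline")


def with_fallback(primary, fallback, *args, **kwargs):
    """Call `primary`; if it raises, log the error and call `fallback` instead."""
    try:
        return primary(*args, **kwargs)
    except Exception as exc:  # broad on purpose: any provider failure triggers the fallback
        logger.warning("%s failed (%s); using %s", primary.__name__, exc, fallback.__name__)
        return fallback(*args, **kwargs)


# Hypothetical stand-ins for two narration engines.
def primary_tts(text):
    raise ConnectionError("service unavailable")


def fallback_tts(text):
    return {"engine": "fallback", "text": text}


result = with_fallback(primary_tts, fallback_tts, "Hello")
```

Because every agent returns the same structure regardless of which engine produced it, downstream agents do not need to know whether a fallback was used.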

🧠 What we learned

- Agentic Workflows: We learned that large language models are most effective when strictly scoped. Assigning specific "roles" (Vision, Story, Director) yielded much better results than trying to make one model do everything.
- Deterministic Video Generation: We discovered the power of combining non-deterministic LLMs with deterministic video rendering tools like MoviePy, allowing us to maintain absolute control over the final output quality.

🚀 What's next for VideoGen-Agent

Our modular agent design allows us to swap in new technologies easily. Next, we plan to implement:

- True Image-to-Video Generation: Swapping the MoviePy Ken Burns effect for models like Stable Video Diffusion to turn static images into actual moving footage.
- GPU Acceleration: Optimizing our Director Agent to render video locally using NVIDIA hardware encoding for near-instant output.
- Voice Cloning Integration: Allowing users to upload a 10-second sample of their own voice so the Voiceover Agent can narrate the video in their own tone.

Built With

ai
ffmpeg
moviepy
pillow
python
streamlit
tts

Updates

shamee K.sharma started this project — Jun 04, 2026 01:59 PM EDT
