💡 Inspiration
We live in a world driven by video, yet creating engaging video content from static assets remains a massive bottleneck. Whether you are a student sharing a presentation, a small business promoting a product, or a creator organizing vacation photos, the process is painstakingly manual. You have to arrange images, write a script, hunt for background music, record voiceovers, and manually keyframe pan-and-zoom effects in a video editor.
We were inspired to solve this by asking a simple question: What if we could build an autonomous, AI-powered film director? We wanted to create a system that doesn't just slap a filter on your photos, but actually understands them, rearranges them to tell a logical story, and outputs a cinematic video. Thus, the Agentic Video Creator (VideoGen-Agent) was born.
⚙️ What it does
VideoGen-Agent is a 6-agent autonomous pipeline that transforms a chaotic folder of static images into a fully narrated, cinematic video. The system's "Wow" factor is its Context-Aware Narrative Ordering. Instead of playing images in the order they were uploaded, our AI analyzes the semantic context of the photos and reorders them to craft a logical, emotionally resonant story with a beginning, middle, and end.
🛠️ How we built it
We engineered a true sequential multi-agent architecture divided into three layers:
The Cognitive Layer:
- Enhancement Agent: Intercepts raw images and applies professional upscaling/color correction via the Perfect Corp API (or a Pillow fallback).
- Vision Agent: Uses Gemma 3 12B (via Featherless.ai) to extract rich semantic metadata (subjects, mood, setting) from the base64-encoded images.
- Story Agent: Powered by DeepSeek-V3, this agent acts as the screenwriter. It analyzes all image metadata simultaneously, determines the logical sequence, and writes scene-by-scene voiceover scripts.
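To illustrate how the Story Agent's ordering could be applied downstream, here is a minimal sketch. The JSON shape (`scenes`, `image_index`, `script`) and the function name `apply_story_order` are assumptions made for this example, not the project's actual schema. The check that every image appears exactly once guards against a model response that drops or repeats a photo.

```python
import json


def apply_story_order(image_paths, story_json):
    """Reorder image paths to follow the scene list returned by the Story Agent.

    Assumed (hypothetical) response shape:
    {"scenes": [{"image_index": 2, "script": "..."}, ...]}
    """
    scenes = json.loads(story_json)["scenes"]
    indices = [scene["image_index"] for scene in scenes]

    # Each uploaded image must be referenced exactly once, otherwise the plan is unusable.
    if sorted(indices) != list(range(len(image_paths))):
        raise ValueError("story plan must reference each image exactly once")

    # Pair every image with its voiceover script, in narrative order.
    return [(image_paths[scene["image_index"]], scene["script"]) for scene in scenes]
```

Rejecting an invalid plan early lets the pipeline retry the Story Agent instead of rendering a video with missing scenes.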
The Synthesis Layer:
- Voiceover Agent: Uses a dual-engine setup (Microsoft edge-tts primary, ElevenLabs fallback) to generate natural MP3 narration for every scene.
- Music Agent: Uses a rule-based algorithm to match the Story Agent's generated "mood" to royalty-free background tracks.
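A rule-based mood match like the Music Agent's can be sketched as a keyword lookup. The keyword groups, the track paths, and the name `pick_track` below are all hypothetical placeholders; the real rules and track library are not described here.

```python
# Hypothetical rules: each entry maps a group of mood keywords to one track file.
DEFAULT_TRACK = "music/neutral.mp3"
RULES = [
    (("happy", "joyful", "uplifting"), "music/upbeat.mp3"),
    (("calm", "peaceful", "serene"), "music/ambient.mp3"),
    (("sad", "nostalgic", "melancholy"), "music/piano.mp3"),
]


def pick_track(mood: str) -> str:
    """Return the first track whose keywords appear in the mood description."""
    words = mood.lower().replace(",", " ").split()
    for keywords, track in RULES:
        if any(word in keywords for word in words):
            return track
    # No rule matched: fall back to a neutral track instead of failing the pipeline.
    return DEFAULT_TRACK
```

Because the lookup is deterministic, the same mood string always produces the same soundtrack, which keeps this stage easy to debug.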
The Output Layer:
- Director Agent: Uses MoviePy and FFmpeg to programmatically assemble the video. It measures the exact duration of each scene's narration to set the scene length, ducks the background music volume dynamically, and applies subtitles.
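The timing logic behind the Director Agent can be sketched without any video library. This is an illustration under assumptions, not the project's MoviePy code: the names `build_timeline` and `music_gain`, the 0.5 s padding between scenes, and the 0.2 ducked gain are all made up for the example.

```python
def build_timeline(audio_durations, padding=0.5):
    """Return (scene_start, narration_end, scene_end) in seconds for each scene.

    Each scene lasts as long as its narration plus a short pause (assumed value).
    """
    timeline = []
    start = 0.0
    for duration in audio_durations:
        narration_end = start + duration
        scene_end = narration_end + padding
        timeline.append((start, narration_end, scene_end))
        start = scene_end
    return timeline


def music_gain(t, timeline, ducked=0.2, full=1.0):
    """Background music gain at time t: lowered while narration is playing."""
    for scene_start, narration_end, _scene_end in timeline:
        if scene_start <= t < narration_end:
            return ducked
    return full
```

For example, narration clips of 2.0 s and 3.0 s yield scenes spanning 0.0 to 2.5 s and 2.5 to 6.0 s, with the music at full volume only during the half-second pauses.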
🚧 Challenges we ran into
Model Reliability and Concurrency: Initially, we tried using heavier multimodal models (like Kimi-K2) for the Vision Agent, but we constantly hit rate limits and 503 capacity errors, breaking our pipeline. We solved this by switching to Gemma 3 12B, which costs only 1 concurrency point and remains "warm," providing rock-solid reliability.
Programmatic Cinematic Effects: Applying a dynamic Ken Burns (pan and zoom) effect to images programmatically without cropping out subjects was highly complex. We had to calculate bounding box scaling dynamically. To keep the scale $S(t)$ smooth over time $t$, we linearly interpolate from 1.0 up to a maximum zoom $Z_{max}$:
$$ S(t) = 1.0 + \left( Z_{max} - 1.0 \right) \left( \frac{t}{T_{total}} \right) $$
where $T_{total}$ is the total duration of the scene, set by the voiceover length.
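The scale formula translates directly into a small helper. The sketch below assumes a zoom centered on the image; the function names, the default $Z_{max}$ of 1.2, and the clamping of $t$ to the scene duration are choices made for this example rather than details of our renderer.

```python
def zoom_scale(t, total, z_max=1.2):
    """S(t) = 1.0 + (z_max - 1.0) * (t / total), with t clamped to [0, total]."""
    if total <= 0:
        raise ValueError("scene duration must be positive")
    progress = min(max(t / total, 0.0), 1.0)
    return 1.0 + (z_max - 1.0) * progress


def visible_window(width, height, scale):
    """Centered (left, top, right, bottom) region of the source image shown at a given scale."""
    win_w, win_h = width / scale, height / scale
    left = (width - win_w) / 2
    top = (height - win_h) / 2
    return (left, top, left + win_w, top + win_h)
```

Comparing a subject's bounding box against `visible_window` at the final scale is one way to check that the zoom never crops the subject out of frame.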
🏆 Accomplishments that we're proud of
We are incredibly proud of the orchestration of the agents. Building a system where 6 different AI agents pass structured JSON and media files to one another without failure required rigorous error handling and fallback mechanisms. The fact that a user can drop in 5 completely random photos and get back a video that actually makes sense narratively is a huge achievement.
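The fallback pattern used across the agents (for example a primary engine with a backup) can be sketched as a generic helper. `run_with_fallback` and the provider names in the usage note are hypothetical; this shows the idea rather than our exact error-handling code.

```python
def run_with_fallback(providers, *args):
    """Try each (name, callable) in order and return (name, result) from the first that succeeds."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(*args)
        except Exception as exc:  # broad on purpose: any provider failure triggers the next one
            errors.append(f"{name}: {exc}")
    # Every provider failed: surface all error messages so the failure is debuggable.
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A Voiceover Agent could call this as `run_with_fallback([("edge-tts", primary_tts), ("elevenlabs", backup_tts)], script)`, where `primary_tts` and `backup_tts` are placeholder functions wrapping each engine.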
🧠 What we learned
- Agentic Workflows: We learned that large language models are most effective when strictly scoped. Assigning specific "roles" (Vision, Story, Director) yielded much better results than trying to make one model do everything.
- Deterministic Video Generation: We discovered the power of combining non-deterministic LLMs with deterministic video rendering tools like MoviePy, allowing us to maintain absolute control over the final output quality.
🚀 What's next for VideoGen-Agent
Our modular agent design allows us to swap in new technologies easily. Next, we plan to implement:
- True Image-to-Video Generation: Swapping the MoviePy Ken Burns effect for models like Stable Video Diffusion to turn static images into actual moving footage.
- GPU Acceleration: Optimizing our Director Agent to render video locally using NVIDIA hardware encoding for near-instant output.
- Voice Cloning Integration: Allowing users to upload a 10-second sample of their own voice so the Voiceover Agent can narrate the video in their own tone.