Inspiration
As a video-making hobbyist, I constantly faced the time-consuming reality of content creation. While AI services offered some relief, stitching together disparate tools for script, visuals, and audio remained a fragmented, inefficient process. My initial exploration with LangChain and LlamaIndex for standalone agents proved promising but lacked true multi-agent orchestration. This challenge led me to Google ADK: its intuitive framework for building and coordinating multiple agents, coupled with its excellent documentation, presented the perfect opportunity to build the seamless video pipeline I envisioned.
What it does
Creating engaging 30-second video content traditionally demands a diverse skill set and significant time investment, encompassing:
- Scriptwriting: Crafting compelling narratives.
- Visual Design: Sourcing or generating impactful imagery.
- Audio Production: Mastering voiceovers and background scores.
- Video Editing: Seamlessly combining all elements.
This solution revolutionizes the process by automating the entire video production pipeline. From a simple text prompt, it empowers anyone to generate broadcast-ready, high-quality 30-second videos in minutes, democratizing video content creation.
How I built it
At the heart of the system lies a Director Agent, a sophisticated orchestrator built on Google ADK's SequentialAgent capabilities. This root agent coordinates the entire video generation process by calling five specialized AI agents in a precise, sequential workflow. Gemini 2.5 Pro powers the Director Agent's decision-making and workflow management.
https://drive.google.com/file/d/1bMYcUQZvNv_1y9XKfiWIezvQBZ9h09Yj/view
Each specialized agent contributes a unique skill:
Script Writer Agent - The Narrative Architect: This agent transforms raw user input into compelling, structured 30-second video scripts. Powered by Gemini 2.5 Pro, it meticulously crafts narratives with precise timing cues, single-image visual descriptions, and natural-flowing dialogue.
Image Producer Agent - The Visual Alchemist: This agent, powered by OpenAI's DALL-E 3 and coordinated by Gemini 2.5 Pro, generates high-quality, singular images that perfectly match the script's specific visual descriptions for each segment.
Dubbing Agent - The Voice Talent: Leveraging OpenAI's advanced TTS models and Gemini 2.5 Pro for coordination, this agent converts script dialogue into engaging, timed audio narration, combining all spoken parts into a single dubbing.mp3 file.
Background Score Agent - The Mood Weaver: Utilizing the Beatoven AI API and Gemini 2.5 Pro, this agent composes contextually appropriate background music. It interprets the script's emotional tone to create a 30-second track that enhances the video's impact without overpowering it.
Video Builder Agent - The Final Director: As the culmination of the pipeline, this agent expertly assembles all generated components—script, images, voiceover, and background music—into a polished, final video. It leverages MoviePy for video rendering, with Gemini 2.5 Pro coordinating the integration logic.
Challenges we ran into
Our journey wasn't without hurdles, but overcoming them significantly refined our pipeline:
- Inter-Agent Communication & Data Flow: Passing precise outputs from the Script Writer to dependent agents (Image, Dubbing, Music) proved challenging. Fortunately, Google ADK's robust output storage capabilities provided an elegant solution, allowing us to seamlessly embed complex data (like the structured script) directly within subsequent agent prompts.
- Synchronization of Visuals & Audio: Initial video stitches often resulted in out-of-sync images and dialogue. This critical timing issue was successfully addressed through intensive prompt engineering and fine-tuning, allowing agents to precisely adhere to the script's time cues.
- Dynamic Video Assembly & Tooling Limitations: Our ambition was to make the Video Builder agent entirely autonomous, dynamically generating video stitching code. However, the external dependency on the MoviePy library necessitated the use of a dedicated create_video tool, preventing a fully code-generating agent. This highlighted the current boundaries of LLM code generation for complex external libraries.
Accomplishments that we're proud of
We are immensely proud to have built a fully functional, multi-agent AI pipeline that can generate a high-quality 30-second video from a simple text prompt in just a couple of minutes and for less than 30 cents! This truly validates the power of orchestrated AI agents for complex creative tasks.
What I learned
This project offered invaluable insights into the practicalities of building with advanced AI:
- LLM Limitations vs. Developer Reality: While LLMs are powerful, expecting them to write flawless, production-ready code for bleeding-edge frameworks like Google ADK is unrealistic. It underscored the critical role of developers in reading documentation, debugging, and understanding tool specifics—a humbling reminder that AI augments, but doesn't yet replace, core engineering.
- The Evolving Developer Landscape: This experience confirms that AI agents will fundamentally reshape job portfolios rather than simply replacing developers. The focus shifts towards designing, orchestrating, and prompting intelligent systems, creating exciting new specializations.
What's next for Video Generation Pipeline Google ADK and Multi-Agent
Our vision for the future includes:
- Enhanced Temporal Alignment: Introducing sophisticated feedback loops and iterative refinement mechanisms to precisely match dubbing agent output duration with the target video segment lengths, ensuring even tighter synchronization.
- Autonomous Video Builder 2.0: Moving beyond fixed tools, we aim to enable the Video Builder agent to dynamically generate and execute MoviePy (or other video editing library) code directly based on script instructions, truly fulfilling the vision of autonomous video assembly.
- Advanced User Control: Integrating options for users to specify desired mood, style, or specific visual elements directly in the initial prompt, giving them more creative control.
- Broader Output Formats: Exploring generation of longer videos or different aspect ratios to cater to diverse platforms.