๐ก Inspiration
Creating engaging dialogue videos usually requires a disjointed stack of tools:
- A creative writer for the script
- Voice actors (or a separate TTS subscription)
- A video editor to stitch visuals and subtitles
For developers or educators who want to quickly mock up conversations, interview scenarios, or language lessons, the friction is too high. You end up juggling:
- API keys for one service
- File uploads for another
- Manual syncing of audio duration with video frames
All of this makes fast iteration painful.
๐ฌ What It Does
Gemini 3 Dialogue Director is an advanced interactive web application that leverages Google's Gemini AI and Cloud Text-to-Speech technologies to create dynamic, audiovisual dialogue videos. Users can generate scripts, customize voices, and produce synchronized "talking head" videos with real-time subtitles and visual effects.
How it works
- Users enter a topic or prompt
- Gemini generates a natural, multi-turn conversation between two speakers
- Each line is automatically:
- Assigned a digital actor
- Converted to high-quality AI speech
- Synchronized with animated subtitles and talking-head video
- Assigned a digital actor
Features
- ๐ญ Built-in or custom actors
- ๐๏ธ Global studio voices
- โฑ๏ธ Adjustable speaking speed
- ๐ Subtitle styling
- ๐ Multiple aspect ratios (16:9, 9:16)
Once generated, audio, video, and subtitles are merged in real time and exported as a single MP4 or WebM file. The result is an end-to-end, browser-based workflow for creating interviews, debates, explainers, and educational dialogue videos โ without writing code or using traditional video editing tools.
๐ ๏ธ How We Built It
Dialogue Generation
We use Gemini (3 Flash or Pro) to generate structured dialogue from a user-provided topic. Strict system instructions enforce a predictable format:( Man/Woman:) This allows the frontend to reliably parse dialogue turns and map them to speakers.
Text-to-Speech
Each dialogue line is sent to Google Cloud Text-to-Speech, where:
- Speaker-specific studio voices generate high-fidelity audio
- Exact audio duration is measured and used as the single timing source
Visual Rendering
- Talking-head videos (preloaded or user-uploaded) are rendered on an HTML5 Canvas
- The canvas engine:
- Switches active speakers
- Animates subtitles
- Maintains real-time synchronization with audio
- Switches active speakers
Video Export
- The MediaRecorder API captures the combined canvas video and audio stream
- Exports a downloadable MP4 or WebM file directly in the browser
Authentication
- Supports both:
- Vertex AI credentials
- Google AI Studio API keys
This architecture enables fast iteration, minimal setup, and a smooth end-to-end experience.
โ ๏ธ Challenges We Ran Into
AudioโVideoโSubtitle Synchronization
Talking-head videos loop independently, while AI-generated speech has variable duration.
Solution
- Pre-generate audio for each dialogue line
- Measure its exact duration
- Use that duration as the single source of truth for:
- Video switching
- Subtitle animation
- Video switching
Cross-Browser Video Export
The MediaRecorder API behaves differently across browsers and codecs.
Solution
- Runtime detection of supported MIME types
- Reliable MP4 or WebM downloads on Chrome, Edge, and other browsers
๐ Key Features
๐ฌ AI Script Generation
- Powered by Gemini: Supports multiple models including Gemini 3 (Flash/Pro), Gemini 2.5, and Gemini 2.0.
- Versatile Categories: Generate scripts for specific use cases:
- General Conversation
- Educational / Learning
- Podcast / Talk Show
- Job Interview
- Debate / Argument
- Storytelling
- Customizable Control: Adjust video length (Snippet to Deep Dive), dialogue pace (Quick, Balanced, Detailed), and target language.
- Interactive Mode: Review and edit the AI-generated script before producing the video.
๐ฃ๏ธ Advanced Audio & TTS
- Google Cloud TTS: high-quality speech synthesis using Google's Cloud Text-to-Speech API.
- Multilingual Support: Generate audio in over 10 languages including English, Spanish, French, German, Italian, Portuguese, Japanese, Chinese, Hindi, and Marathi.
- Voice Variety: Access a wide range of voice types including Studio, Neural2, and Chirp3-HD voices.
- Fine-Tuning: Adjustable speaking rate (0.5x to 1.5x) for perfect timing.
๐ฅ Video Production & Customization
- Dual-Actor System: Seamlessly synchronized "Man" and "Woman" actors.
- Asset Library: Built-in library of high-quality actor clips fetched directly from GitHub.
- Custom Uploads: Upload your own "talking head" video files for personalized characters.
- Aspect Ratios: Support for 16:9 (Landscape) and 9:16 (Portrait) formats.
- Real-Time Subtitles:
- Fully customizable font size and colors.
- Dynamic animations: Fade, Slide Up/Down/Left/Right, and Scale (Pop).
โก Performance & Usability
- Client-Side Rendering: Real-time video composition using HTML5 Canvas.
- Live Preview: Watch the video generation process in real-time.
- Export: Download the final output as
.mp4or.webm. - Screen Wake Lock: Prevents the device from sleeping during long generation tasks.
๐ Accomplishments Weโre Proud Of
- Built a complete end-to-end AI video generation pipeline that runs entirely in the browser
- Combined Gemini-powered dialogue generation, studio-quality TTS, animated subtitles, and talking-head video into one workflow
- Achieved reliable audioโvideoโsubtitle synchronization without traditional video editing software
- Designed a flexible authentication system for both enterprise and public demos
- Created a polished UI that non-technical users can use in minutes
- Delivered a production-quality tool for interviews, explainers, educational content, and demos
๐ What We Learned
- Clear system instructions matter โ strict prompt formatting improved reliability
- Audio duration is the most reliable clock for synchronization
- Developer experience affects adoption โ setup friction can block users early
- Multimodal AI works best as a coordinator, not just a text generator
- Client-side media pipelines are viable with modern browser APIs
๐ Whatโs Next: Gemini 3 Dialogue Director โ AI Video Storyteller
Planned enhancements include:
- ๐ Lip-synced avatars using phoneme-level timing from TTS
- ๐งโ๐จ Custom avatar uploads for reusable talking-head actors
- ๐ฅ Multi-speaker dialogues for panels and group discussions
- ๐๏ธ Timeline-based editing for precise control over timing and transitions
- ๐ฅ Scene and camera variations with cuts, zooms, and cinematic presets
- ๐ Template-driven workflows for interviews, lessons, debates, and language practice
Built With
- antigravity
- gcp
- gemini
- javascript
- tailwind
Log in or sign up for Devpost to join the conversation.