Gemini 3 Dialogue Director: AI Video Storyteller

Craft dynamic video conversations with Gemini AI, featuring custom actors, voices, and real-time subtitles.
Example of the Generated Video

💡 Inspiration

Creating engaging dialogue videos usually requires a disjointed stack of tools:

A creative writer for the script
Voice actors (or a separate TTS subscription)
A video editor to stitch visuals and subtitles

For developers or educators who want to quickly mock up conversations, interview scenarios, or language lessons, the friction is too high. You end up juggling:

API keys for one service
File uploads for another
Manual syncing of audio duration with video frames

All of this makes fast iteration painful.

🎬 What It Does

Gemini 3 Dialogue Director is an advanced interactive web application that leverages Google's Gemini AI and Cloud Text-to-Speech technologies to create dynamic, audiovisual dialogue videos. Users can generate scripts, customize voices, and produce synchronized "talking head" videos with real-time subtitles and visual effects.

How it works

Users enter a topic or prompt
Gemini generates a natural, multi-turn conversation between two speakers
Each line is automatically:
- Assigned a digital actor
- Converted to high-quality AI speech
- Synchronized with animated subtitles and talking-head video

Features

🎭 Built-in or custom actors
🎙️ Global studio voices
⏱️ Adjustable speaking speed
📝 Subtitle styling
📐 Multiple aspect ratios (16:9, 9:16)

Once generated, audio, video, and subtitles are merged in real time and exported as a single MP4 or WebM file. The result is an end-to-end, browser-based workflow for creating interviews, debates, explainers, and educational dialogue videos — without writing code or using traditional video editing tools.

🛠️ How We Built It

Dialogue Generation

We use Gemini (3 Flash or Pro) to generate structured dialogue from a user-provided topic. Strict system instructions enforce a predictable format:( Man/Woman:) This allows the frontend to reliably parse dialogue turns and map them to speakers.

Text-to-Speech

Each dialogue line is sent to Google Cloud Text-to-Speech, where:

Speaker-specific studio voices generate high-fidelity audio
Exact audio duration is measured and used as the single timing source

Visual Rendering

Talking-head videos (preloaded or user-uploaded) are rendered on an HTML5 Canvas
The canvas engine:
- Switches active speakers
- Animates subtitles
- Maintains real-time synchronization with audio

Video Export

The MediaRecorder API captures the combined canvas video and audio stream
Exports a downloadable MP4 or WebM file directly in the browser

Authentication

Supports both:
- Vertex AI credentials
- Google AI Studio API keys

This architecture enables fast iteration, minimal setup, and a smooth end-to-end experience.

⚠️ Challenges We Ran Into

Audio–Video–Subtitle Synchronization

Talking-head videos loop independently, while AI-generated speech has variable duration.

Solution

Pre-generate audio for each dialogue line
Measure its exact duration
Use that duration as the single source of truth for:
- Video switching
- Subtitle animation

Cross-Browser Video Export

The MediaRecorder API behaves differently across browsers and codecs.

Solution

Runtime detection of supported MIME types
Reliable MP4 or WebM downloads on Chrome, Edge, and other browsers

🚀 Key Features

🎬 AI Script Generation

Powered by Gemini: Supports multiple models including Gemini 3 (Flash/Pro), Gemini 2.5, and Gemini 2.0.
Versatile Categories: Generate scripts for specific use cases:
- General Conversation
- Educational / Learning
- Podcast / Talk Show
- Job Interview
- Debate / Argument
- Storytelling
Customizable Control: Adjust video length (Snippet to Deep Dive), dialogue pace (Quick, Balanced, Detailed), and target language.
Interactive Mode: Review and edit the AI-generated script before producing the video.

🗣️ Advanced Audio & TTS

Google Cloud TTS: high-quality speech synthesis using Google's Cloud Text-to-Speech API.
Multilingual Support: Generate audio in over 10 languages including English, Spanish, French, German, Italian, Portuguese, Japanese, Chinese, Hindi, and Marathi.
Voice Variety: Access a wide range of voice types including Studio, Neural2, and Chirp3-HD voices.
Fine-Tuning: Adjustable speaking rate (0.5x to 1.5x) for perfect timing.

🎥 Video Production & Customization

Dual-Actor System: Seamlessly synchronized "Man" and "Woman" actors.
Asset Library: Built-in library of high-quality actor clips fetched directly from GitHub.
Custom Uploads: Upload your own "talking head" video files for personalized characters.
Aspect Ratios: Support for 16:9 (Landscape) and 9:16 (Portrait) formats.
Real-Time Subtitles:
- Fully customizable font size and colors.
- Dynamic animations: Fade, Slide Up/Down/Left/Right, and Scale (Pop).

⚡ Performance & Usability

Client-Side Rendering: Real-time video composition using HTML5 Canvas.
Live Preview: Watch the video generation process in real-time.
Export: Download the final output as .mp4 or .webm.
Screen Wake Lock: Prevents the device from sleeping during long generation tasks.

🏆 Accomplishments We’re Proud Of

Built a complete end-to-end AI video generation pipeline that runs entirely in the browser
Combined Gemini-powered dialogue generation, studio-quality TTS, animated subtitles, and talking-head video into one workflow
Achieved reliable audio–video–subtitle synchronization without traditional video editing software
Designed a flexible authentication system for both enterprise and public demos
Created a polished UI that non-technical users can use in minutes
Delivered a production-quality tool for interviews, explainers, educational content, and demos

📚 What We Learned

Clear system instructions matter — strict prompt formatting improved reliability
Audio duration is the most reliable clock for synchronization
Developer experience affects adoption — setup friction can block users early
Multimodal AI works best as a coordinator, not just a text generator
Client-side media pipelines are viable with modern browser APIs

🚀 What’s Next: Gemini 3 Dialogue Director — AI Video Storyteller

Planned enhancements include:

👄 Lip-synced avatars using phoneme-level timing from TTS
🧑‍🎨 Custom avatar uploads for reusable talking-head actors
👥 Multi-speaker dialogues for panels and group discussions
🎛️ Timeline-based editing for precise control over timing and transitions
🎥 Scene and camera variations with cuts, zooms, and cinematic presets
📋 Template-driven workflows for interviews, lessons, debates, and language practice

Built With

antigravity
gcp
gemini
javascript
tailwind

Updates

Onkar Pawar started this project — Feb 08, 2026 01:00 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.