Auto Dubbing System
An end-to-end AI-powered video dubbing platform. Given a video, subtitle file, and a host voice sample, it produces a fully dubbed audio track in a target language — with proper speaker separation, voice cloning, emotion-aware synthesis, and precise audio timing.
What It Does
- Extracts and separates audio — splits vocals from background music/noise
- Identifies speakers — distinguishes the host from guests using voice embeddings
- Analyzes content — extracts emotion, tone, pacing, and terminology using multimodal AI
- Extracts a glossary — identifies proper nouns (names, places, brands) for human review
- Translates — produces phoneme-fitted translations that match original speech timing
- Synthesizes speech — generates dubbed audio via ElevenLabs with emotion tags and speaker routing
- Mixes final output — combines dubbed vocals with original background audio
Architecture
Auto_Dubbing_System/
├── FE/ # Next.js 16 frontend (project management UI)
└── BE/ # Python pipeline (AI/audio processing)
Two-Phase Pipeline
The pipeline is split deliberately at the glossary review step:
- Part 1 — runs automatically: preprocessing, speaker grouping, multimodal analysis, glossary extraction
- Human review — user edits/approves the extracted glossary via the web UI
- Part 2 — resumes after approval: translation, TTS synthesis, audio mixing
Local vs. Cloud Mode
- `NEXT_PUBLIC_LOCAL_MODE=true` → files stored locally, status polled from `status.json` every 3 seconds
- `NEXT_PUBLIC_LOCAL_MODE=false` → uses Supabase for storage and real-time updates
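In local mode the frontend's 3-second poll reduces to reading one JSON file. A minimal sketch of that read in Python terms (the `status` field name and the `CREATED` fallback are assumptions based on the status flows below):

```python
import json
from pathlib import Path


def read_status(workspace: Path) -> str:
    """Return the current pipeline status for one project workspace.

    Mirrors what the frontend does every 3 seconds in local mode:
    read status.json and fall back to CREATED before the file exists.
    """
    status_file = workspace / "status.json"
    if not status_file.exists():
        return "CREATED"
    return json.loads(status_file.read_text())["status"]
```

The same function works for any status in either pipeline phase, since both parts write the same file.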
Backend Pipeline (Python)
Part 1: Preprocessing & Analysis (run_pipeline_part1.py)
Input: input/video.mp4, input/source.srt, input/host_sample.wav
| Stage | What happens |
|---|---|
| Audio Extraction | FFmpeg extracts mono WAV (mix.wav) from video |
| Vocal Separation | BS RoFormer separates vocals (voice.wav) from background (rest.wav) |
| Keyframe Extraction | Extracts representative frames per subtitle segment for multimodal context |
| SRT Parsing | Parses subtitles, splits multi-speaker entries (dash-separated) |
| Speaker Grouping | ERes2NetV2 (ModelScope) compares audio segments to host sample; assigns HOST / GUEST_N roles |
| Audio Feature Extraction | librosa extracts duration, pitch, MFCC, RMS, tempo per segment |
| Nova Embeddings | AWS Bedrock amazon.nova-2-multimodal-embeddings-v1:0 creates multimodal vectors (transcript + audio features + keyframes) |
| Nova Lite Analysis | amazon.nova-lite-v1:0 classifies emotion, tone, speech style, pacing; suggests ElevenLabs voice settings |
| Speaker Profiling | Aggregates per-speaker metadata; generates 06_speaker_profiles.json |
| Glossary Extraction | Nova Lite extracts proper nouns (names, places, food, brands, slang) → temp/video_glossary.json |
Pauses here → user reviews glossary in the web UI.
Status flow: CREATED → PREPROCESSING → MULTIMODAL → LENGTH_ADJUST → WAITING_FOR_GLOSSARY_REVIEW
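The SRT-parsing stage above can be sketched in a few lines: split the file into blocks, convert the timestamps, and treat each dash-prefixed cue line as its own speaker segment. The segment dict shape is an assumption, not the pipeline's actual schema:

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+)")


def to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like 00:01:02,500 to seconds."""
    h, m, s, ms = map(int, TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text: str) -> list[dict]:
    """Parse SRT blocks, splitting dash-separated multi-speaker entries."""
    segments = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        start, end = (to_seconds(t.strip()) for t in lines[1].split("-->"))
        for line in lines[2:]:
            # a leading "- " conventionally marks a new speaker in one entry
            cue = line[2:] if line.startswith("- ") else line
            segments.append({"start": start, "end": end, "text": cue.strip()})
    return segments
```

Each resulting segment carries its own time slot, which is what the later phoneme-fitting and stretching stages key off.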
Part 2: Translation & Synthesis (run_pipeline_part2.py)
Triggered by: user clicking "Resume" after glossary approval.
| Stage | What happens |
|---|---|
| Translation (Phase 1) | Nova Lite translates segments in batches of 50, using glossary + surrounding context |
| Phoneme Fitting (Phase 2) | Counts phonemes in translations; shortens segments where speech would be too long for the time slot |
| TTS Generation | ElevenLabs eleven_v3 synthesizes audio per segment; HOST/GUEST roles mapped to distinct voices; emotion tags ([excited], [angry], etc.) prepended |
| Audio Stretching | librosa time-stretches each TTS segment to fit original duration without pitch change |
| Final Mix | FFmpeg mixes dubbed_vocal.wav + rest.wav → output/final_dubbed.wav |
| SRT Export | Saves translated subtitles to output/dubbed.srt |
Status flow: RESUMED → LENGTH_ADJUST → MIXING → COMPLETED
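The audio-stretching stage reduces to one ratio: `librosa.effects.time_stretch` speeds audio up when `rate > 1`, so fitting a TTS take into its subtitle slot means dividing the take's duration by the slot's. A hedged sketch (the helper names are ours; only the librosa call is as documented):

```python
def stretch_rate(tts_duration: float, slot_duration: float) -> float:
    """Rate for librosa.effects.time_stretch: values > 1 shorten the audio.

    Example: a 4.0 s TTS take squeezed into a 3.2 s subtitle slot
    needs rate = 4.0 / 3.2 = 1.25.
    """
    return tts_duration / slot_duration


def fit_segment(y, sr, slot_duration):
    """Time-stretch one TTS segment to its slot without changing pitch.

    Hypothetical helper; assumes librosa is available (ai_dubbing env).
    """
    import librosa

    rate = stretch_rate(len(y) / sr, slot_duration)
    return librosa.effects.time_stretch(y, rate=rate)
```

Because the stretch is a phase-vocoder operation, pitch is preserved, which is why this step comes after TTS rather than asking ElevenLabs to change speaking rate.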
Workspace Structure
Each project gets an isolated directory:
BE/workspace/{projectId}/
├── project.json # name, target_language, created_at
├── status.json # current pipeline status (polled by FE)
├── pipeline.log # stdout/stderr from both parts
├── input/
│ ├── video.mp4
│ ├── source.srt
│ └── host_sample.wav
├── temp/
│ ├── mix.wav, voice.wav, rest.wav
│ ├── keyframes/
│ ├── segments.json
│ ├── 02_host_audio_features.json
│ ├── 04_embeddings.json
│ ├── 05_segment_structured.json
│ ├── 06_speaker_profiles.json
│ ├── 07_tts_ready_segments.json
│ ├── video_glossary.json # editable by user
│ ├── duration_translated.json
│ └── tts_output/ # individual dubbed segments
└── output/
├── final_dubbed.wav ← main output
└── dubbed.srt ← translated subtitles
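Creating this layout on project creation is mostly `mkdir -p` plus two JSON files. A sketch of a hypothetical helper, using the field names from the tree above and the `CREATED` status from the documented lifecycle:

```python
import json
from datetime import datetime
from pathlib import Path


def create_workspace(root: Path, project_id: str,
                     name: str, target_language: str) -> Path:
    """Create the per-project directory layout shown above."""
    ws = root / project_id
    for sub in ("input", "temp/keyframes", "temp/tts_output", "output"):
        (ws / sub).mkdir(parents=True, exist_ok=True)
    (ws / "project.json").write_text(json.dumps({
        "name": name,
        "target_language": target_language,
        "created_at": datetime.now().isoformat(),
    }))
    # the status flow starts at CREATED, per the documented lifecycle
    (ws / "status.json").write_text(json.dumps({"status": "CREATED"}))
    return ws
```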
Frontend (Next.js 16)
Pages
| Page | Purpose |
|---|---|
| / | Landing page — pipeline overview, features, CTA |
| /projects | List all projects |
| /projects/new | Create project — pick video, SRT, host sample via native macOS file picker; set target language and optional guest count |
| /projects/[id] | Progress tracker — live status stepper, audio player on completion, download buttons |
| /projects/[id]/review | Glossary review — edit/delete extracted terms, save, then resume Part 2 |
Local API Routes
| Route | Method | Purpose |
|---|---|---|
| /api/local/browse | POST | Native file picker via osascript (no file upload, no OOM) |
| /api/local/projects | GET | List projects from filesystem |
| /api/local/projects | POST | Create project, copy files, spawn Part 1 detached |
| /api/local/projects/[id]/status | GET | Read status.json |
| /api/local/projects/[id]/glossary | GET/PUT | Read/write video_glossary.json |
| /api/local/projects/[id]/resume | POST | Spawn Part 2 detached |
| /api/local/projects/[id]/audio | GET | Stream final dubbed audio |
| /api/local/projects/[id]/download/[filename] | GET | Download output files |
External Services
| Service | Usage |
|---|---|
| AWS Bedrock — amazon.nova-lite-v1:0 | Glossary extraction, translation, phoneme fitting, segment analysis |
| AWS Bedrock — amazon.nova-2-multimodal-embeddings-v1:0 | Multimodal segment embeddings |
| ElevenLabs — eleven_v3 | TTS with emotion tags and speaker-specific voices |
| Supabase | Cloud mode only: storage, DB, realtime |
Key Technologies
Backend:
- librosa, soundfile, numpy — audio analysis
- torch, speechbrain, modelscope — speaker embedding models (requires the ai_dubbing conda env)
- audio-separator — BS RoFormer vocal separation
- boto3 — AWS Bedrock API
- elevenlabs — TTS API
- ffmpeg — video/audio processing (CLI)
Frontend:
- Next.js 16, React 19, TypeScript
- Tailwind CSS 4, shadcn/ui
- Native macOS file picker (osascript) — avoids large file uploads
Environment Variables
FE/.env.local:
NEXT_PUBLIC_LOCAL_MODE=true
BE_ROOT_DIR=/path/to/BE
BE_WORKSPACE_DIR=/path/to/BE/workspace
PYTHON_BIN=/opt/anaconda3/envs/ai_dubbing/bin/python # must be ai_dubbing env
NEXT_PUBLIC_SUPABASE_URL=... # cloud mode only
NEXT_PUBLIC_SUPABASE_ANON_KEY=...
BE/.env:
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_HOST_VOICE_ID=... # ElevenLabs voice ID for host
SUPABASE_URL=... # optional
SUPABASE_KEY=... # optional
PYTHON_BIN must point to the ai_dubbing conda environment — base Python lacks the speechbrain and modelscope packages required for speaker grouping.
Data Flow Summary
[User] picks video + SRT + host sample
↓
[FE /projects/new] creates workspace, copies files, spawns Part 1
↓
[BE Part 1] audio extraction → vocal separation → speaker grouping
→ multimodal analysis → glossary extraction
→ writes status.json: WAITING_FOR_GLOSSARY_REVIEW
↓
[FE /projects/[id]/review] user edits glossary → saves → resumes
↓
[BE Part 2] translation + phoneme fitting → TTS synthesis
→ audio stretching → final mix
→ writes status.json: COMPLETED
↓
[FE /projects/[id]] shows audio player + download buttons
↓
[User] downloads final_dubbed.wav + dubbed.srt
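The final-mix step in the flow above is a single FFmpeg invocation combining the dubbed vocals with the preserved background track. A sketch of the command construction (the amix filter options shown are an assumption about the pipeline's exact settings):

```python
def final_mix_cmd(vocal: str, background: str, out: str) -> list[str]:
    """Build the FFmpeg argv for the final-mix step.

    amix with duration=longest keeps whichever input runs longer,
    so a slightly short dubbed track still gets full background audio.
    """
    return [
        "ffmpeg", "-y",
        "-i", vocal,       # dubbed_vocal.wav
        "-i", background,  # rest.wav
        "-filter_complex", "amix=inputs=2:duration=longest",
        out,               # output/final_dubbed.wav
    ]
```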
Built With
- elevenlabs
- nova2lite
- python