Auto Dubbing System

An end-to-end AI-powered video dubbing platform. Given a video, subtitle file, and a host voice sample, it produces a fully dubbed audio track in a target language — with proper speaker separation, voice cloning, emotion-aware synthesis, and precise audio timing.


What It Does

  1. Extracts and separates audio — splits vocals from background music/noise
  2. Identifies speakers — detects who is the host vs. guests using voice embeddings
  3. Analyzes content — extracts emotion, tone, pacing, and terminology using multimodal AI
  4. Extracts a glossary — identifies proper nouns (names, places, brands) for human review
  5. Translates — produces phoneme-fitted translations that match original speech timing
  6. Synthesizes speech — generates dubbed audio via ElevenLabs with emotion tags and speaker routing
  7. Mixes final output — combines dubbed vocals with original background audio

Architecture

Auto_Dubbing_System/
├── FE/                  # Next.js 16 frontend (project management UI)
└── BE/                  # Python pipeline (AI/audio processing)

Two-Phase Pipeline

The pipeline is split deliberately at the glossary review step:

  • Part 1 — runs automatically: preprocessing, speaker grouping, multimodal analysis, glossary extraction
  • Human review — user edits/approves the extracted glossary via the web UI
  • Part 2 — resumes after approval: translation, TTS synthesis, audio mixing
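The pause-and-resume split above can be sketched as two entry points that hand off through `status.json`. This is a minimal sketch under assumptions: the real scripts do far more, and the exact shape of `status.json` (a single `"status"` field) is assumed here.

```python
# Sketch of the two-phase split: Part 1 ends by writing the review status,
# Part 2 refuses to run unless that status is present. The {"status": ...}
# schema is an assumption based on the status names used in this README.
import json
from pathlib import Path

def write_status(workspace: Path, status: str) -> None:
    """Persist the current pipeline status for the FE to poll."""
    (workspace / "status.json").write_text(json.dumps({"status": status}))

def run_part1(workspace: Path) -> None:
    # ... preprocessing, speaker grouping, multimodal analysis, glossary ...
    write_status(workspace, "WAITING_FOR_GLOSSARY_REVIEW")  # pause for review

def run_part2(workspace: Path) -> None:
    # Only proceed once the user has approved the glossary in the web UI.
    status = json.loads((workspace / "status.json").read_text())["status"]
    if status != "WAITING_FOR_GLOSSARY_REVIEW":
        raise RuntimeError(f"cannot resume from status {status!r}")
    # ... translation, TTS synthesis, mixing ...
    write_status(workspace, "COMPLETED")
```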

Local vs. Cloud Mode

  • NEXT_PUBLIC_LOCAL_MODE=true → files stored locally, status polled from status.json every 3 seconds
  • NEXT_PUBLIC_LOCAL_MODE=false → uses Supabase for storage and real-time updates
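In local mode the FE's 3-second poll boils down to reading `status.json` from the workspace. A minimal reader (sketched in Python for consistency with the backend; the real route is a Next.js handler), assuming a missing file simply means the project was just created:

```python
# Minimal local-mode status reader. Assumes status.json holds {"status": "..."}
# as written by the pipeline; "CREATED" as the no-file default is an assumption.
import json
from pathlib import Path

def read_status(workspace: Path) -> str:
    status_file = workspace / "status.json"
    if not status_file.exists():
        return "CREATED"
    return json.loads(status_file.read_text()).get("status", "UNKNOWN")
```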

Backend Pipeline (Python)

Part 1: Preprocessing & Analysis (run_pipeline_part1.py)

Input: input/video.mp4, input/source.srt, input/host_sample.wav

  • Audio Extraction — FFmpeg extracts a mono WAV (mix.wav) from the video
  • Vocal Separation — BS RoFormer separates vocals (voice.wav) from background (rest.wav)
  • Keyframe Extraction — extracts representative frames per subtitle segment for multimodal context
  • SRT Parsing — parses subtitles, splits multi-speaker entries (dash-separated)
  • Speaker Grouping — ERes2NetV2 (ModelScope) compares audio segments to the host sample; assigns HOST / GUEST_N roles
  • Audio Feature Extraction — librosa extracts duration, pitch, MFCC, RMS, and tempo per segment
  • Nova Embeddings — AWS Bedrock amazon.nova-2-multimodal-embeddings-v1:0 creates multimodal vectors (transcript + audio features + keyframes)
  • Nova Lite Analysis — amazon.nova-lite-v1:0 classifies emotion, tone, speech style, and pacing; suggests ElevenLabs voice settings
  • Speaker Profiling — aggregates per-speaker metadata; generates 06_speaker_profiles.json
  • Glossary Extraction — Nova Lite extracts proper nouns (names, places, food, brands, slang) → temp/video_glossary.json
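The Audio Extraction stage can be sketched as a thin FFmpeg wrapper. The flags are standard FFmpeg options; the 16 kHz sample rate is an assumption, not something this README specifies.

```python
# Sketch of the Audio Extraction stage: pull a mono WAV (mix.wav) out of
# the input video with FFmpeg. Sample rate here is assumed.
import subprocess
from pathlib import Path

def ffmpeg_extract_cmd(video: Path, out_wav: Path) -> list[str]:
    """Build the FFmpeg command line for mono audio extraction."""
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        "-vn",            # drop the video stream
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample (assumed rate)
        str(out_wav),
    ]

def extract_mix(video: Path, out_wav: Path) -> None:
    subprocess.run(ffmpeg_extract_cmd(video, out_wav), check=True)
```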

Pauses here → user reviews glossary in the web UI.

Status flow: CREATED → PREPROCESSING → MULTIMODAL → LENGTH_ADJUST → WAITING_FOR_GLOSSARY_REVIEW


Part 2: Translation & Synthesis (run_pipeline_part2.py)

Triggered by: user clicking "Resume" after glossary approval.

  • Translation (Phase 1) — Nova Lite translates segments in batches of 50, using the glossary plus surrounding context
  • Phoneme Fitting (Phase 2) — counts phonemes in translations; shortens segments where speech would run too long for the time slot
  • TTS Generation — ElevenLabs eleven_v3 synthesizes audio per segment; HOST/GUEST roles map to distinct voices; emotion tags ([excited], [angry], etc.) are prepended
  • Audio Stretching — librosa time-stretches each TTS segment to fit the original duration without changing pitch
  • Final Mix — FFmpeg mixes dubbed_vocal.wav + rest.wav → output/final_dubbed.wav
  • SRT Export — saves translated subtitles to output/dubbed.srt

Status flow: RESUMED → LENGTH_ADJUST → MIXING → COMPLETED


Workspace Structure

Each project gets an isolated directory:

BE/workspace/{projectId}/
├── project.json               # name, target_language, created_at
├── status.json                # current pipeline status (polled by FE)
├── pipeline.log               # stdout/stderr from both parts
├── input/
│   ├── video.mp4
│   ├── source.srt
│   └── host_sample.wav
├── temp/
│   ├── mix.wav, voice.wav, rest.wav
│   ├── keyframes/
│   ├── segments.json
│   ├── 02_host_audio_features.json
│   ├── 04_embeddings.json
│   ├── 05_segment_structured.json
│   ├── 06_speaker_profiles.json
│   ├── 07_tts_ready_segments.json
│   ├── video_glossary.json    # editable by user
│   ├── duration_translated.json
│   └── tts_output/            # individual dubbed segments
└── output/
    ├── final_dubbed.wav        ← main output
    └── dubbed.srt              ← translated subtitles
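Scaffolding this layout is straightforward with `pathlib`; a sketch of what project creation might look like, assuming the `project.json` fields named above (`name`, `target_language`, `created_at`) and an initial `CREATED` status:

```python
# Sketch of creating the per-project workspace skeleton shown above.
import json
from datetime import datetime, timezone
from pathlib import Path

def create_workspace(root: Path, project_id: str, name: str, target_language: str) -> Path:
    ws = root / project_id
    for sub in ("input", "temp/keyframes", "temp/tts_output", "output"):
        (ws / sub).mkdir(parents=True, exist_ok=True)
    (ws / "project.json").write_text(json.dumps({
        "name": name,
        "target_language": target_language,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }))
    (ws / "status.json").write_text(json.dumps({"status": "CREATED"}))
    return ws
```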

Frontend (Next.js 16)

Pages

  • / — landing page: pipeline overview, features, CTA
  • /projects — list all projects
  • /projects/new — create a project: pick video, SRT, and host sample via the native macOS file picker; set target language and optional guest count
  • /projects/[id] — progress tracker: live status stepper, audio player on completion, download buttons
  • /projects/[id]/review — glossary review: edit/delete extracted terms, save, then resume Part 2

Local API Routes

  • POST /api/local/browse — native file picker via osascript (no file upload, no OOM)
  • GET /api/local/projects — list projects from the filesystem
  • POST /api/local/projects — create project, copy files, spawn Part 1 detached
  • GET /api/local/projects/[id]/status — read status.json
  • GET/PUT /api/local/projects/[id]/glossary — read/write video_glossary.json
  • POST /api/local/projects/[id]/resume — spawn Part 2 detached
  • GET /api/local/projects/[id]/audio — stream the final dubbed audio
  • GET /api/local/projects/[id]/download/[filename] — download output files
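"Spawn Part 1 detached" means the pipeline process must outlive the HTTP request that started it. A sketch of that launch (in Python for consistency with the backend; the real route does this from Node, and the script's CLI arguments are an assumption):

```python
# Sketch of spawning a pipeline part detached, logging to pipeline.log.
# start_new_session=True puts the child in its own session so it survives
# the parent exiting.
import subprocess
from pathlib import Path

def spawn_part1(python_bin: str, be_root: Path, workspace: Path) -> int:
    with open(workspace / "pipeline.log", "ab") as log:
        proc = subprocess.Popen(
            [python_bin, str(be_root / "run_pipeline_part1.py"), str(workspace)],
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the calling process
            cwd=be_root,
        )
    return proc.pid
```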

External Services

  • AWS Bedrock (amazon.nova-lite-v1:0) — glossary extraction, translation, phoneme fitting, segment analysis
  • AWS Bedrock (amazon.nova-2-multimodal-embeddings-v1:0) — multimodal segment embeddings
  • ElevenLabs (eleven_v3) — TTS with emotion tags and speaker-specific voices
  • Supabase — cloud mode only: storage, DB, realtime

Key Technologies

Backend:

  • librosa, soundfile, numpy — audio analysis
  • torch, speechbrain, modelscope — speaker embedding models (requires ai_dubbing conda env)
  • audio-separator — BS RoFormer vocal separation
  • boto3 — AWS Bedrock API
  • elevenlabs — TTS API
  • ffmpeg — video/audio processing (CLI)

Frontend:

  • Next.js 16, React 19, TypeScript
  • Tailwind CSS 4, shadcn/ui
  • Native macOS file picker (osascript) — avoids large file uploads

Environment Variables

FE/.env.local:

NEXT_PUBLIC_LOCAL_MODE=true
BE_ROOT_DIR=/path/to/BE
BE_WORKSPACE_DIR=/path/to/BE/workspace
PYTHON_BIN=/opt/anaconda3/envs/ai_dubbing/bin/python   # must be ai_dubbing env
NEXT_PUBLIC_SUPABASE_URL=...    # cloud mode only
NEXT_PUBLIC_SUPABASE_ANON_KEY=...

BE/.env:

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_HOST_VOICE_ID=...    # ElevenLabs voice ID for host
SUPABASE_URL=...                # optional
SUPABASE_KEY=...                # optional

PYTHON_BIN must point to the ai_dubbing conda environment — base Python lacks speechbrain and modelscope required for speaker grouping.


Data Flow Summary

[User] picks video + SRT + host sample
        ↓
[FE /projects/new] creates workspace, copies files, spawns Part 1
        ↓
[BE Part 1] audio extraction → vocal separation → speaker grouping
            → multimodal analysis → glossary extraction
            → writes status.json: WAITING_FOR_GLOSSARY_REVIEW
        ↓
[FE /projects/[id]/review] user edits glossary → saves → resumes
        ↓
[BE Part 2] translation + phoneme fitting → TTS synthesis
            → audio stretching → final mix
            → writes status.json: COMPLETED
        ↓
[FE /projects/[id]] shows audio player + download buttons
        ↓
[User] downloads final_dubbed.wav + dubbed.srt
