Auto Dubbing System

An end-to-end AI-powered video dubbing platform. Given a video, subtitle file, and a host voice sample, it produces a fully dubbed audio track in a target language — with proper speaker separation, voice cloning, emotion-aware synthesis, and precise audio timing.


What It Does

  1. Extracts and separates audio — splits vocals from background music/noise
  2. Identifies speakers — detects who is the host vs. guests using voice embeddings
  3. Analyzes content — extracts emotion, tone, pacing, and terminology using multimodal AI
  4. Extracts a glossary — identifies proper nouns (names, places, brands) for human review
  5. Translates — produces phoneme-fitted translations that match original speech timing
  6. Synthesizes speech — generates dubbed audio via ElevenLabs with emotion tags and speaker routing
  7. Mixes final output — combines dubbed vocals with original background audio

Architecture

Auto_Dubbing_System/
├── FE/                  # Next.js 16 frontend (project management UI)
└── BE/                  # Python pipeline (AI/audio processing)

Two-Phase Pipeline

The pipeline is split deliberately at the glossary review step:

  • Part 1 — runs automatically: preprocessing, speaker grouping, multimodal analysis, glossary extraction
  • Human review — user edits/approves the extracted glossary via the web UI
  • Part 2 — resumes after approval: translation, TTS synthesis, audio mixing
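The pause-and-resume split above can be sketched as two entry points that hand off through `status.json`. This is a minimal sketch under assumptions: the real scripts do far more, and the exact shape of `status.json` (a single `"status"` field) is assumed here.

```python
# Sketch of the two-phase split: Part 1 ends by writing the review status,
# Part 2 refuses to run unless that status is present. The {"status": ...}
# schema is an assumption based on the status names used in this README.
import json
from pathlib import Path

def write_status(workspace: Path, status: str) -> None:
    """Persist the current pipeline status for the FE to poll."""
    (workspace / "status.json").write_text(json.dumps({"status": status}))

def run_part1(workspace: Path) -> None:
    # ... preprocessing, speaker grouping, multimodal analysis, glossary ...
    write_status(workspace, "WAITING_FOR_GLOSSARY_REVIEW")  # pause for review

def run_part2(workspace: Path) -> None:
    # Only proceed once the user has approved the glossary in the web UI.
    status = json.loads((workspace / "status.json").read_text())["status"]
    if status != "WAITING_FOR_GLOSSARY_REVIEW":
        raise RuntimeError(f"cannot resume from status {status!r}")
    # ... translation, TTS synthesis, mixing ...
    write_status(workspace, "COMPLETED")
```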

Local vs. Cloud Mode

  • NEXT_PUBLIC_LOCAL_MODE=true → files stored locally, status polled from status.json every 3 seconds
  • NEXT_PUBLIC_LOCAL_MODE=false → uses Supabase for storage and real-time updates
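In local mode the FE's 3-second poll boils down to reading `status.json` from the workspace. A minimal reader (sketched in Python for consistency with the backend; the real route is a Next.js handler), assuming a missing file simply means the project was just created:

```python
# Minimal local-mode status reader. Assumes status.json holds {"status": "..."}
# as written by the pipeline; "CREATED" as the no-file default is an assumption.
import json
from pathlib import Path

def read_status(workspace: Path) -> str:
    status_file = workspace / "status.json"
    if not status_file.exists():
        return "CREATED"
    return json.loads(status_file.read_text()).get("status", "UNKNOWN")
```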

Backend Pipeline (Python)

Part 1: Preprocessing & Analysis (run_pipeline_part1.py)

Input: input/video.mp4, input/source.srt, input/host_sample.wav

  • Audio Extraction — FFmpeg extracts a mono WAV (mix.wav) from the video
  • Vocal Separation — BS RoFormer separates vocals (voice.wav) from background (rest.wav)
  • Keyframe Extraction — extracts representative frames per subtitle segment for multimodal context
  • SRT Parsing — parses subtitles, splits multi-speaker entries (dash-separated)
  • Speaker Grouping — ERes2NetV2 (ModelScope) compares audio segments to the host sample; assigns HOST / GUEST_N roles
  • Audio Feature Extraction — librosa extracts duration, pitch, MFCC, RMS, and tempo per segment
  • Nova Embeddings — AWS Bedrock amazon.nova-2-multimodal-embeddings-v1:0 creates multimodal vectors (transcript + audio features + keyframes)
  • Nova Lite Analysis — amazon.nova-lite-v1:0 classifies emotion, tone, speech style, and pacing; suggests ElevenLabs voice settings
  • Speaker Profiling — aggregates per-speaker metadata; generates 06_speaker_profiles.json
  • Glossary Extraction — Nova Lite extracts proper nouns (names, places, food, brands, slang) → temp/video_glossary.json
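The Audio Extraction stage can be sketched as a thin FFmpeg wrapper. The flags are standard FFmpeg options; the 16 kHz sample rate is an assumption, not something this README specifies.

```python
# Sketch of the Audio Extraction stage: pull a mono WAV (mix.wav) out of
# the input video with FFmpeg. Sample rate here is assumed.
import subprocess
from pathlib import Path

def ffmpeg_extract_cmd(video: Path, out_wav: Path) -> list[str]:
    """Build the FFmpeg command line for mono audio extraction."""
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        "-vn",            # drop the video stream
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample (assumed rate)
        str(out_wav),
    ]

def extract_mix(video: Path, out_wav: Path) -> None:
    subprocess.run(ffmpeg_extract_cmd(video, out_wav), check=True)
```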

Pauses here → user reviews glossary in the web UI.

Status flow: CREATED → PREPROCESSING → MULTIMODAL → LENGTH_ADJUST → WAITING_FOR_GLOSSARY_REVIEW


Part 2: Translation & Synthesis (run_pipeline_part2.py)

Triggered by: user clicking "Resume" after glossary approval.

  • Translation (Phase 1) — Nova Lite translates segments in batches of 50, using the glossary plus surrounding context
  • Phoneme Fitting (Phase 2) — counts phonemes in translations; shortens segments where speech would run too long for the time slot
  • TTS Generation — ElevenLabs eleven_v3 synthesizes audio per segment; HOST/GUEST roles map to distinct voices; emotion tags ([excited], [angry], etc.) are prepended
  • Audio Stretching — librosa time-stretches each TTS segment to fit the original duration without changing pitch
  • Final Mix — FFmpeg mixes dubbed_vocal.wav + rest.wav → output/final_dubbed.wav
  • SRT Export — saves translated subtitles to output/dubbed.srt

Status flow: RESUMED → LENGTH_ADJUST → MIXING → COMPLETED


Workspace Structure

Each project gets an isolated directory:

BE/workspace/{projectId}/
├── project.json               # name, target_language, created_at
├── status.json                # current pipeline status (polled by FE)
├── pipeline.log               # stdout/stderr from both parts
├── input/
│   ├── video.mp4
│   ├── source.srt
│   └── host_sample.wav
├── temp/
│   ├── mix.wav, voice.wav, rest.wav
│   ├── keyframes/
│   ├── segments.json
│   ├── 02_host_audio_features.json
│   ├── 04_embeddings.json
│   ├── 05_segment_structured.json
│   ├── 06_speaker_profiles.json
│   ├── 07_tts_ready_segments.json
│   ├── video_glossary.json    # editable by user
│   ├── duration_translated.json
│   └── tts_output/            # individual dubbed segments
└── output/
    ├── final_dubbed.wav        ← main output
    └── dubbed.srt              ← translated subtitles
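Scaffolding this layout is straightforward with `pathlib`; a sketch of what project creation might look like, assuming the `project.json` fields named above (`name`, `target_language`, `created_at`) and an initial `CREATED` status:

```python
# Sketch of creating the per-project workspace skeleton shown above.
import json
from datetime import datetime, timezone
from pathlib import Path

def create_workspace(root: Path, project_id: str, name: str, target_language: str) -> Path:
    ws = root / project_id
    for sub in ("input", "temp/keyframes", "temp/tts_output", "output"):
        (ws / sub).mkdir(parents=True, exist_ok=True)
    (ws / "project.json").write_text(json.dumps({
        "name": name,
        "target_language": target_language,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }))
    (ws / "status.json").write_text(json.dumps({"status": "CREATED"}))
    return ws
```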

Frontend (Next.js 16)

Pages

  • / — landing page: pipeline overview, features, CTA
  • /projects — list all projects
  • /projects/new — create a project: pick video, SRT, and host sample via the native macOS file picker; set target language and optional guest count
  • /projects/[id] — progress tracker: live status stepper, audio player on completion, download buttons
  • /projects/[id]/review — glossary review: edit/delete extracted terms, save, then resume Part 2

Local API Routes

  • POST /api/local/browse — native file picker via osascript (no file upload, no OOM)
  • GET /api/local/projects — list projects from the filesystem
  • POST /api/local/projects — create project, copy files, spawn Part 1 detached
  • GET /api/local/projects/[id]/status — read status.json
  • GET/PUT /api/local/projects/[id]/glossary — read/write video_glossary.json
  • POST /api/local/projects/[id]/resume — spawn Part 2 detached
  • GET /api/local/projects/[id]/audio — stream the final dubbed audio
  • GET /api/local/projects/[id]/download/[filename] — download output files
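"Spawn Part 1 detached" means the pipeline process must outlive the HTTP request that started it. A sketch of that launch (in Python for consistency with the backend; the real route does this from Node, and the script's CLI arguments are an assumption):

```python
# Sketch of spawning a pipeline part detached, logging to pipeline.log.
# start_new_session=True puts the child in its own session so it survives
# the parent exiting.
import subprocess
from pathlib import Path

def spawn_part1(python_bin: str, be_root: Path, workspace: Path) -> int:
    with open(workspace / "pipeline.log", "ab") as log:
        proc = subprocess.Popen(
            [python_bin, str(be_root / "run_pipeline_part1.py"), str(workspace)],
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the calling process
            cwd=be_root,
        )
    return proc.pid
```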

External Services

  • AWS Bedrock (amazon.nova-lite-v1:0) — glossary extraction, translation, phoneme fitting, segment analysis
  • AWS Bedrock (amazon.nova-2-multimodal-embeddings-v1:0) — multimodal segment embeddings
  • ElevenLabs (eleven_v3) — TTS with emotion tags and speaker-specific voices
  • Supabase — cloud mode only: storage, DB, realtime

Key Technologies

Backend:

  • librosa, soundfile, numpy — audio analysis
  • torch, speechbrain, modelscope — speaker embedding models (requires ai_dubbing conda env)
  • audio-separator — BS RoFormer vocal separation
  • boto3 — AWS Bedrock API
  • elevenlabs — TTS API
  • ffmpeg — video/audio processing (CLI)

Frontend:

  • Next.js 16, React 19, TypeScript
  • Tailwind CSS 4, shadcn/ui
  • Native macOS file picker (osascript) — avoids large file uploads

Environment Variables

FE/.env.local:

NEXT_PUBLIC_LOCAL_MODE=true
BE_ROOT_DIR=/path/to/BE
BE_WORKSPACE_DIR=/path/to/BE/workspace
PYTHON_BIN=/opt/anaconda3/envs/ai_dubbing/bin/python   # must be ai_dubbing env
NEXT_PUBLIC_SUPABASE_URL=...    # cloud mode only
NEXT_PUBLIC_SUPABASE_ANON_KEY=...

BE/.env:

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_HOST_VOICE_ID=...    # ElevenLabs voice ID for host
SUPABASE_URL=...                # optional
SUPABASE_KEY=...                # optional

PYTHON_BIN must point to the ai_dubbing conda environment — base Python lacks speechbrain and modelscope required for speaker grouping.


Data Flow Summary

[User] picks video + SRT + host sample
        ↓
[FE /projects/new] creates workspace, copies files, spawns Part 1
        ↓
[BE Part 1] audio extraction → vocal separation → speaker grouping
            → multimodal analysis → glossary extraction
            → writes status.json: WAITING_FOR_GLOSSARY_REVIEW
        ↓
[FE /projects/[id]/review] user edits glossary → saves → resumes
        ↓
[BE Part 2] translation + phoneme fitting → TTS synthesis
            → audio stretching → final mix
            → writes status.json: COMPLETED
        ↓
[FE /projects/[id]] shows audio player + download buttons
        ↓
[User] downloads final_dubbed.wav + dubbed.srt
