Inspiration

Manual dubbing and publishing workflows are slow, error-prone, and hard to audit.
We wanted a practical tool that automates multilingual video localization while keeping strict publishing controls.

What it does

VoxShift is a Gemini-first dubbing pipeline that:

  • Transcribes and translates media
  • Generates dubbed audio with TTS
  • Produces output media, subtitles, and segment JSON
  • Runs YouTube intake risk checks when a source URL is provided
  • Uploads to a specific YouTube channel with metadata, dry-run validation, and audit manifests
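
The channel enforcement and dry-run steps in the last bullet can be sketched as a small decision function that runs before any upload. This is a minimal illustration, not VoxShift's actual code: the UploadRequest shape, the decideUpload name, and the idea that the authenticated channel ID comes from a prior lookup are all assumptions made for the example.

```typescript
// Hypothetical request shape; the real VoxShift types are not shown in this writeup.
interface UploadRequest {
  filePath: string;
  title: string;
  targetChannelId: string; // channel the operator intends to publish to
  dryRun: boolean;
}

type UploadDecision =
  | { action: "reject"; reason: string }
  | { action: "dry-run"; summary: string }
  | { action: "upload" };

// Decide what to do before any bytes are sent.
// `authenticatedChannelId` is assumed to come from a channel lookup made
// with the same OAuth credentials that would perform the upload.
function decideUpload(
  req: UploadRequest,
  authenticatedChannelId: string
): UploadDecision {
  if (req.targetChannelId !== authenticatedChannelId) {
    return {
      action: "reject",
      reason: `credentials belong to ${authenticatedChannelId}, expected ${req.targetChannelId}`,
    };
  }
  if (req.title.trim().length === 0) {
    return { action: "reject", reason: "title must not be empty" };
  }
  if (req.dryRun) {
    return {
      action: "dry-run",
      summary: `would upload ${req.filePath} as "${req.title}"`,
    };
  }
  return { action: "upload" };
}
```

Keeping the decision pure (no network calls) means the same checks run identically in dry-run and real mode, and the returned value can be written straight into an audit record.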

How we built it

  • Node.js + TypeScript CLI architecture
  • Gemini API for transcription/translation and TTS
  • ffmpeg/ffprobe for media processing and muxing
  • YouTube Data API for intake metadata checks
  • YouTube OAuth upload flow with channel-ID enforcement
  • CI-style checks with typecheck, build, and smoke tests
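
As a rough illustration of the muxing step, the sketch below builds an ffmpeg argument list that pairs the original video stream with a dubbed audio track. The flags are standard ffmpeg options, but the MuxInputs type, the buildMuxArgs name, and the choice of AAC with a copied video stream are assumptions; the writeup does not show the exact command VoxShift runs.

```typescript
// Hypothetical input shape for the example.
interface MuxInputs {
  videoPath: string;
  dubbedAudioPath: string;
  outputPath: string;
}

// Build ffmpeg arguments that replace the audio track of a video.
function buildMuxArgs({ videoPath, dubbedAudioPath, outputPath }: MuxInputs): string[] {
  return [
    "-y",                  // overwrite the output file if it exists
    "-i", videoPath,       // input 0: original video
    "-i", dubbedAudioPath, // input 1: dubbed audio
    "-map", "0:v:0",       // first video stream from input 0
    "-map", "1:a:0",       // first audio stream from input 1
    "-c:v", "copy",        // copy video without re-encoding
    "-c:a", "aac",         // encode the dubbed audio as AAC
    "-shortest",           // stop at the end of the shorter stream
    outputPath,
  ];
}
```

Returning an argument array rather than a shell string lets the caller pass it to something like Node's child_process.spawn("ffmpeg", args) without quoting paths by hand.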

Challenges we ran into

  • OAuth scope mismatches (the youtube.upload scope versus the scopes that channel verification needs)
  • Handling structured model output reliably across edge cases
  • Keeping upload automation flexible without weakening safety
  • Managing API auth differences between Gemini and YouTube APIs
  • Designing duplicate protection and idempotent run behavior
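
To make the structured-output challenge concrete, here is a minimal sketch of validating model-produced segment JSON before the rest of the pipeline trusts it. The Segment fields (start, end, text) and the function names are assumptions for illustration; VoxShift's real segment schema is not shown here.

```typescript
// Hypothetical segment shape; times are assumed to be in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

type ParseResult =
  | { ok: true; segments: Segment[] }
  | { ok: false; error: string };

// Type guard: checks field types and that the time range is sane.
function isSegment(value: unknown): value is Segment {
  if (typeof value !== "object" || value === null) return false;
  const { start, end, text } = value as Record<string, unknown>;
  return (
    typeof start === "number" && Number.isFinite(start) &&
    typeof end === "number" && Number.isFinite(end) &&
    typeof text === "string" &&
    start >= 0 && end > start
  );
}

// Parse raw model output, returning an error value instead of throwing.
function parseSegments(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (!Array.isArray(data)) {
    return { ok: false, error: "expected a JSON array" };
  }
  for (let i = 0; i < data.length; i++) {
    if (!isSegment(data[i])) {
      return { ok: false, error: `segment ${i} is malformed` };
    }
  }
  return { ok: true, segments: data as Segment[] };
}
```

Returning a result value makes it straightforward for a caller to retry the model request or fail the run with a precise message, rather than crashing later during audio generation.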

Accomplishments that we're proud of

  • End-to-end dubbing pipeline with production-style outputs
  • youtube:run supports both pipeline mode and upload-only mode
  • Optional --source-url with policy-based intake checks
  • Strong safety controls: target channel enforcement, dry-run upload, manifest trail
  • Real-speech fixture plus automated smoke tests, including model-variant checks

What we learned

  • Automation needs guardrails as much as speed
  • Strong schemas and validation save time in LLM-driven pipelines
  • Channel-level publishing checks are essential for real operations
  • Dry-run + manifest logging dramatically improves trust and debugging
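
One way manifest logging can double as duplicate protection is sketched below. Everything here is hypothetical: the ManifestEntry fields, the run key built from source path, language, and channel, and the in-memory array standing in for whatever storage VoxShift actually uses. A content hash of the source file would make a sturdier key than a path.

```typescript
// Hypothetical manifest record for one run.
interface ManifestEntry {
  runKey: string;
  sourcePath: string;
  targetLanguage: string;
  targetChannelId: string;
  status: "dry-run" | "uploaded";
  recordedAt: string; // ISO 8601 timestamp
}

// Derive a deterministic key so the same job always maps to the same entry.
function makeRunKey(sourcePath: string, targetLanguage: string, targetChannelId: string): string {
  return JSON.stringify([sourcePath, targetLanguage, targetChannelId]);
}

// A run counts as a duplicate only if a real upload was already recorded.
function alreadyUploaded(manifest: ManifestEntry[], runKey: string): boolean {
  return manifest.some((e) => e.runKey === runKey && e.status === "uploaded");
}

// Append an entry without mutating the existing manifest.
function recordRun(
  manifest: ManifestEntry[],
  entry: Omit<ManifestEntry, "recordedAt">,
  now: Date = new Date()
): ManifestEntry[] {
  return [...manifest, { ...entry, recordedAt: now.toISOString() }];
}
```

Because dry runs are recorded but never counted as uploads, an operator can rehearse a publish any number of times and still be stopped from uploading the same job twice.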

What's next for VoxShift

  • Add rights-aware source ingestion workflow (with explicit policy gates)
  • Improve dubbing quality (speaker consistency, pacing, prosody control)
  • Add batch job orchestration and queue-based processing
  • Build a lightweight UI on top of the CLI engine
  • Expand monitoring, retry logic, and publish-state observability
