MioSub — The AI Subtitle Editor That Actually Understands Context

Inspiration

Every AI subtitle tool today translates line by line — a character name in line 1 gets a different translation by line 50. These aren't translation errors; they're context errors. We asked: what if Gemini understood the whole video first, then translated like a human would?

What it does

MioSub is an end-to-end AI subtitle engine: paste a video link and get production-ready bilingual subtitles. A 30-minute video finishes in under 10 minutes, fully automated.

The pipeline: Gemini 3 Pro pre-scans audio to extract a glossary and speaker profiles, then each chunk flows through Whisper transcription → Gemini refinement (with raw audio) → CTC millisecond alignment → Gemini translation (with Search Grounding) → Gemini proofreading. The glossary locks in consistent terminology across the entire video.
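The per-chunk flow can be sketched as a chain of async stages. This is an illustrative sketch, not MioSub's actual API — the `Cue` type, stage names, and stub bodies are all hypothetical stand-ins for the real Whisper/Gemini/CTC calls:

```typescript
// Hypothetical sketch of the per-chunk pipeline. Each stage takes the
// chunk's cues and returns an updated copy; real stages would call
// Whisper, Gemini, or the CTC aligner instead of these stubs.

interface Cue {
  start: number;        // seconds
  end: number;          // seconds
  text: string;         // source-language line
  translation?: string; // filled in by the translation stage
}

type Stage = (cues: Cue[]) => Promise<Cue[]>;

// Compose stages left-to-right; each stage awaits the previous one's output.
const pipeline = (...stages: Stage[]): Stage =>
  async (cues) => {
    for (const stage of stages) cues = await stage(cues);
    return cues;
  };

// Stub stages standing in for Gemini refinement and Gemini translation.
const refine: Stage = async (cues) =>
  cues.map((c) => ({ ...c, text: c.text.trim() }));
const translate: Stage = async (cues) =>
  cues.map((c) => ({ ...c, translation: `[translated] ${c.text}` }));

const processChunk = pipeline(refine, translate);
```

The composition keeps each stage independently swappable, which matters when one link (e.g. the aligner) changes without touching the others.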

How we built it

Dual-platform app (Web + Electron Desktop) with React 19, TypeScript, and Vite 6. The pipeline leverages five Gemini 3 capabilities working together:

  1. Audio Understanding — Gemini 3 Pro processes raw audio directly to extract glossary terms and speaker profiles. By hearing pronunciation and tone, it catches proper nouns that text-based approaches miss.
  2. Thinking / Reasoning — Glossary and speaker extraction use thinkingLevel: 'high', letting Gemini reason through ambiguous audio and domain jargon before committing output.
  3. Structured Output — Every Gemini call enforces strict JSON schemas, eliminating parsing failures across ~200 API calls per video.
  4. Search Grounding — During translation, Gemini queries Google Search to verify proper noun translations not in the glossary, preventing hallucinations.
  5. Multi-model orchestration — Gemini 3 Pro handles high-stakes tasks (glossary, speakers, proofreading); Gemini 3 Flash handles high-throughput tasks (refinement, translation) at 5× concurrency.
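The 5× concurrency cap on Flash calls in point 5 amounts to a small async limiter. A minimal sketch (the `limiter` helper and `runFlash` name are ours, not MioSub's internals):

```typescript
// Minimal concurrency limiter: at most `max` tasks in flight at once.
// Illustrative of how high-throughput Flash calls could be capped at 5.

function limiter(max: number) {
  let active = 0;
  const queue: (() => void)[] = [];
  return async function run<T>(task: () => Promise<T>): Promise<T> {
    // If the cap is reached, park this caller until a slot frees up.
    if (active >= max) await new Promise<void>((resolve) => queue.push(resolve));
    active++;
    try {
      return await task();
    } finally {
      active--;
      queue.shift()?.(); // wake the next waiter, if any
    }
  };
}

const runFlash = limiter(5);
// Usage: await Promise.all(chunks.map((c) => runFlash(() => translateChunk(c))));
```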

The core innovation is a glossary-first architecture: extract terminology globally, then enforce it locally — so every chunk translates with full context.
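One way "enforce it locally" can be checked mechanically: if a cue's source text contains a glossary term, its translation must contain that term's locked rendering. A hypothetical post-pass sketch (function and type names are illustrative):

```typescript
// Glossary maps a source-language term to its locked translation.
type Glossary = Map<string, string>;

// Returns a list of consistency violations for one translated cue:
// every glossary term present in the source must appear in its locked
// form in the translation.
function glossaryViolations(
  source: string,
  translation: string,
  glossary: Glossary,
): string[] {
  const violations: string[] = [];
  for (const [term, locked] of glossary) {
    if (source.includes(term) && !translation.includes(locked)) {
      violations.push(`"${term}" should render as "${locked}"`);
    }
  }
  return violations;
}
```

In practice the glossary would also be injected into the translation prompt; a check like this catches the cases where the model drifts anyway.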

Challenges we ran into

The hardest problem was glossary bootstrapping — the glossary must come from the same audio being translated. We solved it with a two-phase async architecture: global extraction runs in parallel with transcription, and a dependency gate ensures no chunk translates without the glossary. Achieving timestamp precision required a three-system hybrid: Whisper for initial timing, Gemini for text refinement, and a CTC forced aligner for millisecond re-sync.
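The dependency gate described above reduces to a shared promise: glossary extraction and transcription both start immediately, and every chunk awaits the glossary before translating. A sketch under those assumptions — the stubs stand in for the real Gemini/Whisper calls:

```typescript
// Stubs standing in for the real extraction / transcription / translation calls.
const extractGlossary = async (_audio: ArrayBuffer) =>
  new Map([["みお", "Mio"]]);
const transcribeChunks = async (_audio: ArrayBuffer) =>
  ["みおです", "こんにちは"];
const translateChunk = async (text: string, glossary: Map<string, string>) =>
  [...glossary.entries()].reduce((t, [term, locked]) => t.split(term).join(locked), text);

async function processVideo(audio: ArrayBuffer): Promise<string[]> {
  // Phase 1: both kick off immediately and run concurrently.
  const glossaryPromise = extractGlossary(audio); // global pre-scan
  const chunksPromise = transcribeChunks(audio);  // per-chunk transcription

  // Phase 2: translation is gated — no chunk proceeds without the glossary.
  const chunks = await chunksPromise;
  return Promise.all(
    chunks.map(async (chunk) => {
      const glossary = await glossaryPromise; // the dependency gate
      return translateChunk(chunk, glossary);
    }),
  );
}
```

Because the glossary promise is awaited per chunk rather than up front, chunks whose transcription finishes early still queue their translation work without blocking the transcription of later chunks.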

Accomplishments that we're proud of

Real users produce real content today — 30-min Japanese radio shows with speaker labels, railway vlogs with technical terminology, anime commentary with dozens of proper nouns — all in one pass, zero manual correction. Open source, cross-platform, with a live web demo.

What we learned

Context beats speed. The glossary-first design was our most impactful decision. Gemini's audio understanding is massively underutilized — feeding raw audio instead of text produces dramatically better extraction. And structured output + search grounding together make Gemini a reliable production component, not just a demo toy.

What's next for MioSub

Gemini native video understanding for scene-aware translation, real-time streaming subtitles, community glossary sharing, and AI-powered quality scoring for human-in-the-loop workflows.
