VisionTranslate: A Gemini 3-powered Video Localization Tool

Inspiration

Global content creators often face a "language barrier". While audio dubbing is common, on-screen text (labels, captions, and graphics) remains trapped in the original language, making videos feel "foreign" to new audiences. After struggling to manually localize my own video edits from Chinese/Korean to English, I realized there was no seamless way to "translate the pixels". I built this to democratize global content distribution for every creator.

What it does

VisionTranslate is a Gemini-powered web app that automates the localization of in-video text. Users upload a video and select a target language; the app then:

Detects and OCRs hardcoded on-screen text;
Translates contextually using Google's LLMs;
Synthesizes new text overlays that match the original's position;
Allows users to adjust the text overlays on screen;
Renders a fully localized video for export.

How I built it

The core engine is powered by Google AI Studio and uses Gemini 3 Pro's large context window to analyze video frames and keep translations consistent across scenes. The frontend uses Canvas and WebGL for real-time video processing and frame-accurate text synchronization.
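
Frame-accurate synchronization comes down to deciding, for each rendered frame, which overlays are visible at the current playback time. A minimal sketch of that lookup, assuming each overlay is stored as a cue with start and end times in seconds (the `Cue` type and `activeCues` function are hypothetical names for illustration, not taken from the project):

```typescript
// Hypothetical overlay cue; field names are illustrative assumptions.
interface Cue {
  start: number; // seconds, inclusive
  end: number;   // seconds, exclusive
  text: string;
}

// Return the cues that should be visible at playback time `t` (seconds).
function activeCues(cues: Cue[], t: number): Cue[] {
  return cues.filter((c) => t >= c.start && t < c.end);
}

// Two overlapping example cues.
const exampleCues: Cue[] = [
  { start: 0.0, end: 2.5, text: "Hello" },
  { start: 2.0, end: 4.0, text: "Welcome" },
];

const visibleNow = activeCues(exampleCues, 2.2).map((c) => c.text);
```

In a browser, `t` could come from `video.currentTime`, or from the `mediaTime` field passed to a `requestVideoFrameCallback` callback where that API is available.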

Challenges I ran into

Merging captions across consecutive frames when the text and its position did not change
Exporting the video with the overlay text rendered in
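
The caption-merging problem can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: detections are assumed to arrive once per sampled frame at a fixed interval, boxes are assumed to be normalized to the 0..1 range, and all type and function names (`FrameText`, `MergedCue`, `mergeCaptions`) are hypothetical.

```typescript
interface Box { x: number; y: number; width: number; height: number; } // normalized 0..1 (assumption)
interface FrameText { time: number; text: string; box: Box; }          // one detection in one sampled frame
interface MergedCue { start: number; end: number; text: string; box: Box; }

// True when every box component differs by at most `tol`.
function sameBox(a: Box, b: Box, tol: number): boolean {
  return (
    Math.abs(a.x - b.x) <= tol &&
    Math.abs(a.y - b.y) <= tol &&
    Math.abs(a.width - b.width) <= tol &&
    Math.abs(a.height - b.height) <= tol
  );
}

// Merge detections that repeat the same text at (nearly) the same position
// in consecutive sampled frames into a single cue with a start and end time.
function mergeCaptions(frames: FrameText[], frameInterval: number, tol = 0.01): MergedCue[] {
  const sorted = [...frames].sort((a, b) => a.time - b.time);
  const cues: MergedCue[] = [];
  for (const f of sorted) {
    // A cue can be extended only if it ends where this frame begins.
    const open = cues.find(
      (c) =>
        Math.abs(c.end - f.time) < frameInterval / 2 &&
        c.text === f.text &&
        sameBox(c.box, f.box, tol),
    );
    if (open) {
      open.end = f.time + frameInterval;
    } else {
      cues.push({ start: f.time, end: f.time + frameInterval, text: f.text, box: f.box });
    }
  }
  return cues;
}

// Example: "你好" persists for three sampled frames, then "再见" appears twice with a gap.
const sampleBox: Box = { x: 0.1, y: 0.8, width: 0.3, height: 0.1 };
const merged = mergeCaptions(
  [
    { time: 0.0, text: "你好", box: sampleBox },
    { time: 0.5, text: "你好", box: sampleBox },
    { time: 1.0, text: "你好", box: sampleBox },
    { time: 1.5, text: "再见", box: sampleBox },
    { time: 3.0, text: "再见", box: sampleBox },
  ],
  0.5,
);
```

The gap between 2.0 s and 3.0 s keeps the two "再见" detections as separate cues, so text that disappears and later reappears is not stretched across the frames where it was absent.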

Accomplishments that I'm proud of

I successfully used VisionTranslate to process my own library of videos. What previously took hours of manual masking and re-typing now happens in a fraction of the time with easy text editing and adjustment.

What I learned

I discovered the limitations of standard DOM elements for video manipulation. Transitioning to a Canvas + WebGL solution allowed me to handle frame buffers and coordinate mapping, both of which are essential for high-fidelity video tools.
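
As an illustration of the coordinate mapping involved, here is a minimal sketch that converts a text box expressed in normalized video coordinates into canvas pixels. It assumes the video is scaled to fit inside the canvas with its aspect ratio preserved (letterboxed); the function names `containRect` and `toCanvasRect` are hypothetical.

```typescript
interface Size { width: number; height: number; }
interface Rect { x: number; y: number; width: number; height: number; }

// Rectangle the video occupies when scaled to fit inside the canvas
// with its aspect ratio preserved and centered.
function containRect(video: Size, canvas: Size): Rect {
  const scale = Math.min(canvas.width / video.width, canvas.height / video.height);
  const width = video.width * scale;
  const height = video.height * scale;
  return { x: (canvas.width - width) / 2, y: (canvas.height - height) / 2, width, height };
}

// Map a box in normalized video coordinates (0..1) to canvas pixels.
function toCanvasRect(norm: Rect, video: Size, canvas: Size): Rect {
  const v = containRect(video, canvas);
  return {
    x: v.x + norm.x * v.width,
    y: v.y + norm.y * v.height,
    width: norm.width * v.width,
    height: norm.height * v.height,
  };
}

// A 1920x1080 video shown in a 960x960 canvas is letterboxed to 960x540,
// offset 210 px from the top.
const mappedBox = toCanvasRect(
  { x: 0.5, y: 0.5, width: 0.25, height: 0.5 },
  { width: 1920, height: 1080 },
  { width: 960, height: 960 },
);
```

Keeping detections in normalized coordinates means the same box can be drawn correctly in a small preview canvas and in a full-resolution export canvas.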

What's next for VisionTranslate: On-Screen Text Replacement for Video

In-Painting: Using generative AI to "erase" the original text pixels before overlaying the new ones for a truly native look.
Style Transfer: Automatically detecting the font, color, and size of the original text to match the translation perfectly.
More refined and flexible video controls.

Updates

Di Zhu started this project — Jan 31, 2026 06:19 AM EST
