Inspiration
Language barriers are one of the biggest obstacles to knowledge sharing. While YouTube has subtitles, reading them distracts from the visual content, and automated TTS often sounds robotic and emotionless.
When we saw Gemini's audio capabilities, we realized we could build something better: a tool that doesn't just translate text, but re-voices the video in real-time. We wanted to create a "Universal Translator" for YouTube that feels like magic—no captions, just native-sounding audio in your language.
What it does
AnyTongue is a Chrome Extension that provides real-time, AI-powered dubbing for YouTube videos.
- Instant Dubbing: Click "Play", and the video is instantly dubbed into your target language (e.g., English to Chinese).
- Smart Playback: The extension automatically pauses the video while the AI connection is established and resumes seamlessly, ensuring you don't miss a second.
- Native Audio Reasoning: It doesn't use a Speech-to-Text -> Translate -> Text-to-Speech pipeline. Instead, it streams raw audio directly to Gemini, which understands the emotion, tone, and context, and generates translated audio directly.
- Privacy First: We support a "Bring Your Own Key" (BYOK) mode, so users can use their own Gemini API keys securely.
How we built it
We built a modern, high-performance architecture:
- Chrome Extension (React + Vite): Handles the user interface, injects content scripts to control the YouTube player, and captures tab audio using the tabCapture API.
- Offscreen Document: To keep the WebSocket connection alive and process audio in the background, we use Chrome's Offscreen API. This ensures smooth streaming even when the popup is closed.
- Gemini Live API (WebSockets): We stream raw PCM audio chunks (16 kHz) directly to Gemini's WebSocket endpoint. The model processes the audio stream and returns translated audio chunks in real time (see the capture-and-streaming sketch after this list).
- Audio Worklet: To ensure low-latency playback, we use a custom Audio Worklet to buffer and play the PCM audio returned by Gemini, keeping it as close to the video as possible (see the playback sketch after this list).
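For the curious, here is a minimal sketch of what the capture-and-streaming path can look like inside the offscreen document. The getUserMedia constraints follow Chrome's documented tabCapture pattern; the Live API message shape (`realtimeInput.mediaChunks`), the float-to-PCM conversion, and the helper names are assumptions based on public Gemini Live API examples, not our exact production code.

```typescript
// Offscreen document sketch: capture tab audio and stream 16 kHz PCM to an open WebSocket.
async function startCapture(streamId: string, ws: WebSocket): Promise<void> {
  // streamId comes from chrome.tabCapture.getMediaStreamId(...) in the service worker.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      mandatory: { chromeMediaSource: "tab", chromeMediaSourceId: streamId },
    },
    video: false,
  } as MediaStreamConstraints);

  // Assumes the browser resamples the captured stream to the context's 16 kHz rate.
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated but keeps the sketch short; a worklet is the real choice.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  // The node must be connected downstream to run; a zero-gain node keeps the
  // original tab audio silent so only the dubbed track is heard.
  const mute = ctx.createGain();
  mute.gain.value = 0;
  source.connect(processor);
  processor.connect(mute);
  mute.connect(ctx.destination);

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      // Clamp and scale Float32 [-1, 1] samples to signed 16-bit PCM.
      pcm16[i] = Math.max(-1, Math.min(1, float32[i])) * 0x7fff;
    }
    // Hypothetical Live API message shape: base64 PCM chunks tagged with their sample rate.
    ws.send(JSON.stringify({
      realtimeInput: {
        mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: toBase64(pcm16.buffer) }],
      },
    }));
  };
}

function toBase64(buf: ArrayBuffer): string {
  let binary = "";
  new Uint8Array(buf).forEach((b) => (binary += String.fromCharCode(b)));
  return btoa(binary);
}
```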
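Playback goes the other way: Gemini's PCM responses are decoded to Float32 on the main thread and posted to a worklet that drains a simple queue. The sketch below runs in the AudioWorkletGlobalScope (the ambient declarations stand in for globals TypeScript's DOM lib doesn't include); the processor name `pcm-player` and the message protocol are illustrative, not AnyTongue's exact code.

```typescript
// pcm-player.worklet.ts — loaded via audioContext.audioWorklet.addModule("pcm-player.worklet.js").
// Ambient declarations for the AudioWorkletGlobalScope.
declare class AudioWorkletProcessor {
  readonly port: MessagePort;
}
declare function registerProcessor(name: string, ctor: new () => AudioWorkletProcessor): void;

class PCMPlayerProcessor extends AudioWorkletProcessor {
  private queue: Float32Array[] = [];

  constructor() {
    super();
    // The main thread posts decoded Float32 chunks of Gemini's audio response here.
    this.port.onmessage = (e: MessageEvent) => this.queue.push(e.data as Float32Array);
  }

  // Called for every 128-frame render quantum; fills the (mono) output from the queue.
  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const out = outputs[0][0];
    let written = 0;
    while (written < out.length && this.queue.length > 0) {
      const chunk = this.queue[0];
      const n = Math.min(out.length - written, chunk.length);
      out.set(chunk.subarray(0, n), written);
      written += n;
      if (n === chunk.length) this.queue.shift();
      else this.queue[0] = chunk.subarray(n);
    }
    // Any samples not written stay zero, i.e. silence when the buffer runs dry.
    return true; // keep the processor alive
  }
}

registerProcessor("pcm-player", PCMPlayerProcessor);
```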
Challenges we ran into
- Audio Sync & Latency: Keeping the translated audio synced with the video was tough. We solved this by implementing a "Smart Pause" feature that halts the video until the AI stream is ready (a content-script sketch follows this list).
- Chrome Extension Restrictions: Manifest V3 no longer allows persistent background pages, so we had to architect a robust communication bridge between the Popup, Background Service Worker, Offscreen Document, and Content Script (see the routing sketch below).
- Real-time VAD: Initially, Voice Activity Detection (VAD) was cutting sentences off. We adjusted the Gemini config to use manual activity triggering for a more continuous, natural translation flow (see the config sketch below).
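The Smart Pause itself is small: the content script holds the YouTube `<video>` element until the offscreen document reports that the Gemini connection is live. Message names like `dub-start` and `gemini-ready` below are illustrative, not our exact constants.

```typescript
// Content script sketch: pause on start, resume once the AI stream is ready.
const video = document.querySelector<HTMLVideoElement>("video");

chrome.runtime.onMessage.addListener((msg: { type: string }) => {
  if (!video) return;
  if (msg.type === "dub-start") video.pause();        // hold the video while connecting
  if (msg.type === "gemini-ready") void video.play(); // resume once audio is streaming
});
```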
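The bridge boils down to the service worker acting as a router: it lazily creates the offscreen document, hands it a tab-capture stream id, and relays status messages back to the popup and content script. A rough sketch follows; the message types and the justification string are assumptions, and the promise-style chrome.* calls assume a recent Chrome.

```typescript
// background.ts — MV3 service worker sketch (message routing between popup, offscreen, content).
async function ensureOffscreen(): Promise<void> {
  if (await chrome.offscreen.hasDocument()) return;
  await chrome.offscreen.createDocument({
    url: "offscreen.html",
    reasons: [chrome.offscreen.Reason.USER_MEDIA], // needed for getUserMedia-based tab capture
    justification: "Keep the Gemini WebSocket and audio pipeline alive in the background",
  });
}

chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
  if (msg.type === "start-dubbing") {
    void (async () => {
      await ensureOffscreen();
      // Popup messages have no sender.tab, so the active tab id is passed in the message.
      const streamId = await chrome.tabCapture.getMediaStreamId({
        targetTabId: sender.tab?.id ?? msg.tabId,
      });
      // Forward the capture handle to the offscreen document.
      await chrome.runtime.sendMessage({ type: "capture-stream", streamId });
      sendResponse({ ok: true });
    })();
    return true; // keep the message channel open for the async response
  }
});
```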
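The VAD fix amounts to a setup-time config change: turn off automatic activity detection and bracket each stretch of speech ourselves. A minimal sketch, assuming the field names used in public Live API documentation; the model id is a placeholder, not a real identifier.

```typescript
// Live API session config sketch (field names per public docs; model id is a placeholder).
function configureSession(ws: WebSocket): void {
  ws.send(JSON.stringify({
    setup: {
      model: "models/your-live-capable-gemini-model", // placeholder
      generationConfig: { responseModalities: ["AUDIO"] },
      realtimeInputConfig: {
        // Automatic VAD was clipping sentences; disable it and signal activity manually.
        automaticActivityDetection: { disabled: true },
      },
    },
  }));
}

function translateUtterance(ws: WebSocket, base64PcmChunks: string[]): void {
  // Mark the start of speech explicitly instead of relying on automatic VAD.
  ws.send(JSON.stringify({ realtimeInput: { activityStart: {} } }));
  for (const data of base64PcmChunks) {
    ws.send(JSON.stringify({
      realtimeInput: { mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data }] },
    }));
  }
  ws.send(JSON.stringify({ realtimeInput: { activityEnd: {} } }));
}
```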
Accomplishments that we're proud of
- Seamless UX: The "Pause-on-Start, Resume-on-Connect" flow feels incredibly polished.
- Apple-inspired UI: We spent time crafting a clean, glassmorphism-inspired UI that looks right at home on modern macOS and Windows.
- True Multimodality: Successfully leveraging Gemini's ability to "hear" and "speak" directly, bypassing the traditional text bottleneck.
What's next for AnyTongue
- Context Awareness: Sending video frames (screenshots) to Gemini so it can "see" what's on screen for better translation accuracy.
- Multi-Speaker Support: Distinguishing between different speakers in a video and assigning different AI voices.
- Mobile App: Bringing this experience to mobile devices.
Built With
- chrome-extension-manifest-v3
- gemini
- google-ai-studio
- react
- typescript
- vite
- web-audio-api
- websockets