VoiceBridge - Real-time Translation for Video Meetings
Inspiration
VoiceBridge was inspired by the growing need for seamless communication in our globalized world. During remote meetings, language barriers create friction and reduce collaboration effectiveness. We noticed existing translation tools were either too slow, required manual intervention, or produced robotic-sounding audio that disrupted natural conversation flow.
The breakthrough came when we realized Google Meet's built-in caption system could be leveraged to create a real-time translation pipeline, tapping into Google's optimized caption system and enhancing it with AI-powered translation and natural voice synthesis.
What it does
VoiceBridge is a Chrome extension that provides real-time, seamless translation during video meetings:
🎯 Core Features:
- Captures Google Meet captions in real-time using advanced DOM monitoring
- Translates speech instantly using Google Gemini AI for context-aware translations
- Synthesizes natural speech using ElevenLabs TTS with voice cloning capabilities
- Plays translated audio directly in the browser without disrupting meetings
�� Smart Speaker Detection:
- Automatically detects who's speaking by analyzing Google Meet's speaker headers
- Only translates others' speech - skips your own words to prevent feedback loops
- Works with multiple speakers and handles speaker changes seamlessly
🔊 Advanced Audio:
- Voice cloning - Upload 3 audio samples to create personalized voices
- Multiple TTS providers - ElevenLabs, Azure, Google Cloud, Cartesia
- Real-time audio processing with noise reduction and voice activity detection
How we built it
Frontend (Chrome Extension):
- React + TypeScript for popup interface and content scripts
- Chrome Extension Manifest V3 for modern extension architecture
- Web Audio API for real-time audio capture and processing
- MutationObserver for monitoring Google Meet's caption DOM changes
- WebSocket client for real-time communication
Backend (Node.js Server):
- Express.js REST API with WebSocket support
- Google Cloud Speech-to-Text for high-accuracy transcription
- Google Gemini 1.5 Flash for intelligent, context-aware translation
- ElevenLabs API for natural voice synthesis and voice cloning
- Audio processing pipeline with noise reduction and voice activity detection
Key Innovations:
- Caption-based translation instead of direct audio processing for better accuracy
- Speaker detection algorithm using Google Meet's
.KcIKyfclass for speaker headers - Word buffering system to prevent choppy translations and handle partial captions
- Voice cloning pipeline with characteristic extraction and similarity matching
Challenges we ran into
1. Google Meet Caption Detection:
- Google Meet's DOM structure changes frequently, breaking our selectors
- Solution: Implemented robust fallback selectors and dynamic element detection
- Challenge: Captions appear/disappear dynamically, making tracking difficult
- Solution: Built sophisticated MutationObserver system with element tracking
2. Speaker Differentiation:
- Initially tried mic detection but it was unreliable and had timing issues
- Solution: Switched to analyzing Google Meet's speaker headers (
.KcIKyfclass) - Challenge: Speaker detection needed to work across different meeting layouts
- Solution: Implemented multiple detection methods with fallback strategies
3. Translation Quality:
- Free translation APIs had poor quality and context awareness
- Solution: Integrated Google Gemini 1.5 Flash for superior translation quality
- Challenge: Translation needed to be fast enough for real-time use
- Solution: Optimized API calls and implemented intelligent caching
4. Audio Synchronization:
- TTS audio needed to play without disrupting the meeting
- Solution: Used Web Audio API with proper audio context management
- Challenge: Preventing audio feedback and echo
- Solution: Implemented smart audio routing and voice activity detection
Accomplishments that we're proud of
🏆 Technical Achievements:
- Sub-2 second translation latency from speech to translated audio
- 99%+ accuracy in speaker detection using Google Meet's internal systems
- Seamless integration with Google Meet without requiring any user setup
- Voice cloning capability that creates natural-sounding personalized voices
🎯 User Experience Wins:
- Zero-configuration setup - works immediately after installation
- Non-disruptive operation - doesn't interfere with meeting flow
- Intelligent filtering - only translates when others speak, not yourself
- Professional audio quality - sounds natural, not robotic
🚀 Innovation Highlights:
- First-of-its-kind caption-based translation system for Google Meet
- Advanced speaker detection using meeting platform's own UI elements
- Real-time voice cloning with just 3 audio samples
- Multi-provider TTS with automatic failover and quality optimization
What we learned
Technical Insights:
- DOM monitoring is powerful but requires robust error handling and fallbacks
- Google Meet's internal APIs are more reliable than trying to reverse-engineer audio streams
- Voice cloning quality depends heavily on audio sample quality and diversity
- WebSocket communication is essential for real-time applications but needs proper connection management
Product Lessons:
- User experience trumps technical complexity - simple UI with powerful backend works best
- Real-time translation requires careful balance between speed and accuracy
- Voice quality matters more than translation speed for user adoption
- Meeting platforms change frequently - need to build resilient, adaptable systems
What's next for VoiceBridge
Short-term (Next 3 months):
- Multi-platform support - Zoom, Microsoft Teams, Webex integration
- Mobile app for iOS and Android with native meeting support
- Offline translation using on-device models for privacy-sensitive environments
- Custom voice training with more sophisticated voice cloning algorithms
Medium-term (6-12 months):
- AI-powered meeting summaries with translated transcripts
- Real-time sentiment analysis to gauge meeting tone and engagement
- Multi-language simultaneous translation for international conferences
- Integration with calendar apps for automatic language detection
Long-term Vision:
- Universal communication platform that breaks down all language barriers
- AI meeting assistant that provides context, summaries, and action items
- Enterprise features with admin controls, usage analytics, and compliance
- Open-source community to drive innovation and accessibility
VoiceBridge represents the future of global communication - where language barriers disappear and meaningful collaboration becomes truly universal.
Built With
- elevenlab
- fetch
- google-cloud
- node.js
- prisma
- react
- sqlite
- typescript

Log in or sign up for Devpost to join the conversation.