Inspiration
In an increasingly globalized world, language barriers remain one of the biggest obstacles to effective communication. I witnessed firsthand how international conference attendees struggled to follow presentations in foreign languages, missing critical information and unable to participate fully. Traditional solutions like human interpreters are expensive and do not scale, while existing translation tools don't work in real time for live conversations.
This inspired me to create MultiTrans, a platform that uses Google's Gemini API to break down language barriers in real time, making multilingual communication accessible to everyone.
What it does
MultiTrans is a real-time multilingual transcription and translation platform powered by the Gemini API. It captures audio from multiple sources (microphone, system audio, YouTube, online meetings), transcribes speech to text as it is spoken, and translates it into multiple languages simultaneously.
Core Features:
- Real-time Speech Transcription: Converts live audio into text with timestamps using Gemini's speech recognition
- AI-Powered Text Refinement: Gemini intelligently corrects errors, adds punctuation, and organizes content
- Multi-language Translation: Simultaneously translates to 10+ languages including English, Spanish, Japanese, Korean, French, German, and more
- Live Sharing: Share transcriptions via QR code links for real-time access on any device
- AI Assistant: Ask questions about meeting content and get instant answers
- Smart Notes: Add annotations directly to transcriptions
- Dual Storage: Temporary workspace + permanent folder system for organized content management
How we built it
Architecture:
- Frontend: Vanilla JavaScript, HTML5, CSS3 for a lightweight, responsive interface
- Backend: Firebase Cloud Functions (Node.js) for serverless API handling
- AI Engine: Google Gemini API for all AI-powered features
- Storage: IndexedDB for local caching + File System API for permanent storage
- Hosting: Firebase Hosting for global CDN delivery
Gemini API Integration:
- Speech Transcription: Implemented intelligent audio chunking to adapt Gemini's API for real-time streaming scenarios
- Text Refinement: Created a batch processing system that uses Gemini to polish raw transcriptions
- Translation: Leveraged Gemini's multilingual capabilities for context-aware translations
- AI Assistant: Integrated Gemini's conversational AI for meeting content Q&A
- Keyword Extraction: Used Gemini to extract keywords and automatically generate summaries and action items
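As an illustration of the speech transcription step, here is a minimal sketch of how one audio chunk might be packaged for a Gemini `generateContent` call. The body shape (`contents`, `parts`, `inline_data`) follows the public REST format as documented; the helper name `buildTranscriptionRequest`, the prompt wording, and the `languageHint` parameter are hypothetical and are not taken from the project's code.

```javascript
// Hypothetical helper: builds the JSON body for transcribing one audio chunk.
// base64Audio is the chunk encoded as base64; mimeType describes its format.
function buildTranscriptionRequest(base64Audio, mimeType, languageHint) {
  // The prompt text is an assumption, not the project's actual prompt.
  const instruction = languageHint
    ? `Transcribe this audio. The speaker is probably using ${languageHint}.`
    : "Transcribe this audio.";
  return {
    contents: [
      {
        parts: [
          { text: instruction },
          { inline_data: { mime_type: mimeType, data: base64Audio } },
        ],
      },
    ],
  };
}
```

The returned object would be serialized with `JSON.stringify` and posted from the Cloud Function, so the API key never reaches the browser.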
Key Technical Innovations:
- Smart audio segmentation algorithm for optimal transcription quality
- Real-time WebSocket-like architecture using Firebase for live updates
- Efficient state management for handling multiple translation panels simultaneously
- QR code generation for instant mobile access
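The state management for multiple translation panels could look roughly like the sketch below. It is an assumption about one workable design, not the project's actual code: `createPanelStore`, `setTranslation`, `getPanel`, and `subscribe` are hypothetical names. Translations are keyed by segment ID so that results arriving out of order still render in the right sequence.

```javascript
// Hypothetical store: one ordered panel of translated segments per language.
function createPanelStore(languages) {
  const panels = new Map(languages.map((lang) => [lang, new Map()]));
  const listeners = new Set();
  return {
    // Record a translation and notify every subscribed panel renderer.
    setTranslation(lang, segmentId, text) {
      const panel = panels.get(lang);
      if (!panel) throw new Error(`Unknown language: ${lang}`);
      panel.set(segmentId, text);
      listeners.forEach((listener) => listener(lang, segmentId, text));
    },
    // Return the panel's text sorted by numeric segment ID.
    getPanel(lang) {
      const panel = panels.get(lang);
      if (!panel) return [];
      return Array.from(panel.entries())
        .sort((a, b) => a[0] - b[0])
        .map(([, text]) => text);
    },
    // Register a listener; the returned function unsubscribes it.
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
  };
}
```

Each UI panel would subscribe once and re-render only when its own language is reported.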
Challenges we ran into
Real-time Audio Processing: Gemini's API wasn't originally designed for streaming audio. I had to develop an intelligent chunking algorithm that segments audio at natural speech pauses to maintain context while enabling real-time processing.
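A minimal sketch of pause-based chunking, assuming mono PCM samples in the range -1 to 1 (for example, from the Web Audio API). This is a generic energy-threshold approach, not necessarily the algorithm MultiTrans uses; the function name `findSplitPoints` and all default thresholds are hypothetical.

```javascript
// Hypothetical helper: returns the sample indices at which to cut the audio.
// A cut is made after a sustained quiet stretch, or when a chunk gets too long.
function findSplitPoints(samples, sampleRate, options = {}) {
  const {
    frameMs = 20,            // analysis window length
    silenceThreshold = 0.01, // RMS level below which a frame counts as silent
    minSilenceMs = 300,      // how long a pause must last to trigger a cut
    maxChunkMs = 10000,      // hard cap so latency stays bounded
  } = options;
  const frameSize = Math.max(1, Math.floor((sampleRate * frameMs) / 1000));
  const minSilentFrames = Math.ceil(minSilenceMs / frameMs);
  const maxChunkSamples = Math.floor((sampleRate * maxChunkMs) / 1000);
  const splits = [];
  let chunkStart = 0;
  let silentRun = 0;
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    let sumSquares = 0;
    for (let i = start; i < start + frameSize; i++) {
      sumSquares += samples[i] * samples[i];
    }
    const rms = Math.sqrt(sumSquares / frameSize);
    silentRun = rms < silenceThreshold ? silentRun + 1 : 0;
    const frameEnd = start + frameSize;
    // Only cut at a pause if the chunk contains something besides the pause.
    const pauseAfterSpeech =
      silentRun === minSilentFrames &&
      frameEnd - chunkStart > silentRun * frameSize;
    const tooLong = frameEnd - chunkStart >= maxChunkSamples;
    if (pauseAfterSpeech || tooLong) {
      splits.push(frameEnd);
      chunkStart = frameEnd;
      silentRun = 0;
    }
  }
  return splits;
}
```

The hard cap trades some context for bounded latency: a speaker who never pauses still produces a chunk at least every `maxChunkMs` milliseconds.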
Multi-language Synchronization: Keeping multiple translation panels synchronized while maintaining performance was challenging. I implemented an efficient state management system that batches API calls and updates UI asynchronously.
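One way to batch the API calls described above is to queue new transcript segments and, on a timer, emit a single request per target language covering everything queued so far. The sketch below assumes that design; `createTranslationBatcher`, `enqueue`, and `flush` are hypothetical names, and the actual network call is left to the caller.

```javascript
// Hypothetical batcher: collects segments, then emits one request per language.
function createTranslationBatcher(languages) {
  const pending = [];
  return {
    // Queue a transcript segment, e.g. { id: 7, text: "Hello everyone" }.
    enqueue(segment) {
      pending.push(segment);
    },
    // Number of segments waiting for the next flush.
    get size() {
      return pending.length;
    },
    // Drain the queue and describe the requests to send (one per language).
    flush() {
      if (pending.length === 0) return [];
      const segments = pending.splice(0, pending.length);
      return languages.map((targetLanguage) => ({ targetLanguage, segments }));
    },
  };
}
```

A caller might run `flush()` from `setInterval` every few hundred milliseconds, send the resulting requests in parallel, and update each panel as its response arrives, so a slow language never blocks the others.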
Cross-browser Compatibility: Different browsers handle audio capture differently. I created a unified audio capture module that works across Chrome, Edge, and Firefox.
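A unified capture module can be sketched as two small pieces: choosing a recording format the current browser supports, and opening the right kind of stream for the requested source. `getUserMedia`, `getDisplayMedia`, and `MediaRecorder.isTypeSupported` are standard browser APIs; the function names, the candidate list, and its ordering are assumptions made for illustration, and audio delivery through `getDisplayMedia` varies by browser.

```javascript
// Candidate container/codec strings in order of preference (an assumption).
const CANDIDATE_MIME_TYPES = [
  "audio/webm;codecs=opus",
  "audio/ogg;codecs=opus",
  "audio/webm",
];

// Returns the first supported type, or "" to let the browser pick its default.
function pickRecorderMimeType(isTypeSupported, candidates = CANDIDATE_MIME_TYPES) {
  return candidates.find((type) => isTypeSupported(type)) || "";
}

// source is "microphone" or "system"; mediaDevices is navigator.mediaDevices.
async function openAudioStream(source, mediaDevices) {
  if (source === "microphone") {
    return mediaDevices.getUserMedia({ audio: true });
  }
  if (typeof mediaDevices.getDisplayMedia !== "function") {
    throw new Error("System audio capture is not available in this browser");
  }
  // Screen or tab capture requires a video request; audio may be omitted by the browser.
  const stream = await mediaDevices.getDisplayMedia({ video: true, audio: true });
  if (stream.getAudioTracks().length === 0) {
    throw new Error("The browser returned no audio track for this capture");
  }
  return stream;
}
```

In a browser, the helpers would be called as `pickRecorderMimeType((type) => MediaRecorder.isTypeSupported(type))` and `openAudioStream("microphone", navigator.mediaDevices)`.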
AI Refinement Cost Optimization: The AI text refinement feature produces significantly better results when provided with more contextual information from previous transcriptions. However, this increases API costs substantially. I had to balance quality with cost-effectiveness, implementing this enhanced context feature only in testing environments while keeping the production version optimized for cost efficiency.
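The quality-versus-cost tradeoff can be made explicit with a character budget for prior context. The sketch below is one possible way to express it, with hypothetical function names and prompt wording: a budget of 0 matches the cost-optimized mode with no prior context, while a larger budget includes the most recent earlier segments that fit.

```javascript
// Pick the most recent earlier segments that fit within the character budget.
function selectContext(previousSegments, maxContextChars) {
  const selected = [];
  let used = 0;
  for (let i = previousSegments.length - 1; i >= 0; i--) {
    const text = previousSegments[i];
    if (used + text.length > maxContextChars) break;
    selected.unshift(text); // keep chronological order
    used += text.length;
  }
  return selected;
}

// Build the refinement prompt; the wording is illustrative only.
function buildRefinementPrompt(previousSegments, rawText, maxContextChars = 0) {
  const context = selectContext(previousSegments, maxContextChars);
  const lines = ["Correct recognition errors and add punctuation to the transcript below."];
  if (context.length > 0) {
    lines.push("Earlier context (for reference only, do not rewrite):", ...context);
  }
  lines.push("Transcript to refine:", rawText);
  return lines.join("\n");
}
```

Because input size drives API cost, the budget acts as a single setting that can differ between testing and production.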
Accomplishments that we're proud of
Seamless Real-time Experience: Adapted the Gemini API for real-time transcription by chunking audio, keeping latency low.
Multi-language Support: Achieved simultaneous, context-aware translation into 10+ languages.
User-Friendly Design: Created an intuitive interface that requires no technical knowledge, so anyone can start transcribing in seconds.
Sharing Innovation: The QR code sharing feature makes it easy to share live transcriptions with remote participants.
Gemini Integration Depth: Integrated multiple Gemini API capabilities (speech recognition, text refinement, translation, chat, analysis) into a cohesive workflow.
What we learned
Creative API Application: I learned how to adapt APIs to use cases beyond their original design. I adapted Gemini's request-based audio processing to real-time use through chunking, and combined several Gemini capabilities (speech recognition, text refinement, multilingual translation, conversational AI, and content analysis) into a single end-to-end workflow.
Balancing Cost and Quality: I gained a clearer view of the tradeoff between quality and cost in AI applications. Providing more context significantly improves output quality but also substantially increases API costs, so the right balance between user experience and operating cost depends on the use case.
Real-time Collaboration Complexity: I came to understand how complex scalable real-time collaboration features are. Implementing the sharing functionality showed what meeting participants actually need: not just live transcription viewing, but also synchronized multilingual translation, preserved history, and seamless cross-device access.
Social Value of Accessibility: Realized that AI-driven real-time transcription is more than just a convenience tool—it's a vital bridge for breaking down communication barriers. It enables people with hearing impairments to participate in meetings, helps language learners understand content, and allows international teams to overcome language gaps, truly achieving inclusive communication.
What's next for MultiTrans
Short-term Goals:
- Custom Vocabulary: Allow users to add domain-specific terms for improved accuracy
- Speaker Diarization: Identify and label different speakers in conversations
- Export Formats: Add support for SRT, VTT, and DOCX export formats
- RAG Integration: Implement Retrieval-Augmented Generation to enable the built-in AI to produce comprehensive meeting notes similar to NotebookLM
- Enhanced Real-time Performance: Further reduce transcription latency to deliver an even more seamless real-time experience
Long-term Vision:
- Mobile Apps: Native iOS and Android apps for on-the-go transcription
- Voice Cloning: Use Gemini to generate translated audio in the original speaker's voice
Community Goals:
- Open-source core components to help other developers build multilingual applications
- Create educational content about real-time AI integration
- Build a community of users who can share translation templates and best practices
Built With
- code
- css3
- file-system-api
- firebase-cloud-functions
- firebase-hosting
- google-gemini-api
- html5
- indexeddb
- javascript
- mediarecorder-api
- node.js
- qr
- web-audio-api