MultiTrans


Inspiration

In an increasingly globalized world, language barriers remain one of the biggest obstacles to effective communication. I witnessed firsthand how international conference attendees struggled to follow presentations in foreign languages, missing critical information and unable to participate fully. Traditional solutions like human interpreters are expensive and do not scale, while many existing translation tools do not work in real time for live conversations.

This inspired me to create MultiTrans, a platform that uses Google's Gemini API to break down language barriers in real time and make multilingual communication accessible to everyone.

What it does

MultiTrans is a real-time multilingual transcription and translation platform powered by the Gemini API. It captures audio from multiple sources (microphone, system audio, YouTube, online meetings), transcribes speech to text as it is spoken, and translates it into multiple languages simultaneously.

Core Features:

Real-time Speech Transcription: Converts live audio into text with timestamps using Gemini's speech recognition
AI-Powered Text Refinement: Gemini intelligently corrects errors, adds punctuation, and organizes content
Multi-language Translation: Simultaneously translates into 10+ languages, including English, Spanish, Japanese, Korean, French, and German
Live Sharing: Share transcriptions via QR code links for real-time access on any device
AI Assistant: Ask questions about meeting content and get instant answers
Smart Notes: Add annotations directly to transcriptions
Dual Storage: Temporary workspace + permanent folder system for organized content management

How we built it

Architecture:

Frontend: Vanilla JavaScript, HTML5, CSS3 for a lightweight, responsive interface
Backend: Firebase Cloud Functions (Node.js) for serverless API handling
AI Engine: Google Gemini API for all AI-powered features
Storage: IndexedDB for local caching + File System API for permanent storage
Hosting: Firebase Hosting for global CDN delivery

Gemini API Integration:

Speech Transcription: Implemented intelligent audio chunking to adapt Gemini's API for real-time streaming scenarios
Text Refinement: Created a batch processing system that uses Gemini to polish raw transcriptions
Translation: Leveraged Gemini's multilingual capabilities for context-aware translations
AI Assistant: Integrated Gemini's conversational AI for meeting content Q&A
Keyword Extraction: Used Gemini to extract keywords and automatically generate summaries and action items
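
As a rough illustration of the translation step, the sketch below builds one request body per target language and passes a few earlier transcript lines along as context. The function names, the prompt wording, and the decision to include earlier lines are assumptions made for this example, not MultiTrans's actual code. Only the contents/parts/text body shape is taken from the Gemini generateContent REST format.

```javascript
// Hypothetical helper: build a generateContent-style request body that asks
// for a translation of one transcript line into one target language.
function buildTranslationRequest(text, targetLanguage, recentLines = []) {
  // Earlier lines are included only so the model can resolve pronouns,
  // terminology, and topic; the prompt asks it not to translate them.
  const context = recentLines.length
    ? `Earlier lines, for context only (do not translate them):\n${recentLines.join('\n')}\n\n`
    : '';
  const prompt =
    `${context}Translate the following transcript line into ${targetLanguage}. ` +
    `Return only the translation.\n\n${text}`;
  return { contents: [{ role: 'user', parts: [{ text: prompt }] }] };
}

// Hypothetical helper: fan one transcript line out to every target language.
// Sending the requests (for example from a Cloud Function) is left out here.
function buildTranslationRequests(text, targetLanguages, recentLines = []) {
  return targetLanguages.map((language) => ({
    language,
    body: buildTranslationRequest(text, language, recentLines),
  }));
}
```

Each returned body could then be sent to the model independently, so a slow or failed translation for one language would not have to block the others.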

Key Technical Innovations:

Smart audio segmentation algorithm for optimal transcription quality
Real-time WebSocket-like architecture using Firebase for live updates
Efficient state management for handling multiple translation panels simultaneously
QR code generation for instant mobile access
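
The segmentation idea can be sketched with a simple energy-based pause detector. This is a minimal example under stated assumptions, not the algorithm MultiTrans ships: the function name segmentAtPauses, the RMS threshold, and the timing defaults are all hypothetical, and the input is assumed to be mono PCM samples in the range [-1, 1].

```javascript
// Hypothetical helper: split mono PCM samples into segments that end at pauses.
// All thresholds below are illustrative assumptions, not tuned values.
function segmentAtPauses(samples, sampleRate, options = {}) {
  const {
    frameMs = 20,         // length of each analysis frame
    silenceRms = 0.01,    // frames quieter than this count as silence
    minPauseMs = 300,     // a pause must last this long to end a segment
    maxSegmentMs = 15000, // force a cut if the speaker never pauses
  } = options;

  const frameSize = Math.max(1, Math.round((sampleRate * frameMs) / 1000));
  const minPauseFrames = Math.ceil(minPauseMs / frameMs);
  const maxSegmentFrames = Math.ceil(maxSegmentMs / frameMs);

  const segments = [];
  let segmentStart = 0;    // sample index where the current segment starts
  let silentRun = 0;       // consecutive silent frames seen so far
  let hasSpeech = false;   // whether the current segment contains speech
  let framesInSegment = 0; // frames accumulated in the current segment

  for (let offset = 0; offset + frameSize <= samples.length; offset += frameSize) {
    // Root-mean-square energy of this frame.
    let sumSquares = 0;
    for (let i = offset; i < offset + frameSize; i++) {
      sumSquares += samples[i] * samples[i];
    }
    const rms = Math.sqrt(sumSquares / frameSize);
    const frameEnd = offset + frameSize;
    framesInSegment++;

    if (rms < silenceRms) {
      silentRun++;
    } else {
      silentRun = 0;
      hasSpeech = true;
    }

    const pauseReached = hasSpeech && silentRun >= minPauseFrames;
    const tooLong = hasSpeech && framesInSegment >= maxSegmentFrames;
    if (pauseReached || tooLong) {
      segments.push({ start: segmentStart, end: frameEnd });
      segmentStart = frameEnd;
      silentRun = 0;
      hasSpeech = false;
      framesInSegment = 0;
    } else if (!hasSpeech) {
      // Skip leading silence so a segment never starts with dead air.
      segmentStart = frameEnd;
      framesInSegment = 0;
    }
  }

  // Flush speech that was still in progress when the input ended.
  if (hasSpeech) segments.push({ start: segmentStart, end: samples.length });
  return segments;
}
```

Cutting only after a sustained pause keeps whole phrases together in one chunk, while the maximum segment length bounds how long a listener waits when a speaker does not pause.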

Challenges we ran into

Real-time Audio Processing: Gemini's API wasn't originally designed for streaming audio. I had to develop an intelligent chunking algorithm that segments audio at natural speech pauses to maintain context while enabling real-time processing.
Multi-language Synchronization: Keeping multiple translation panels synchronized while maintaining performance was challenging. I implemented an efficient state management system that batches API calls and updates UI asynchronously.
Cross-browser Compatibility: Different browsers handle audio capture differently. I created a unified audio capture module that works across Chrome, Edge, and Firefox.
AI Refinement Cost Optimization: AI text refinement produces significantly better results when it receives more context from previous transcriptions, but the extra context substantially increases API costs. To balance quality against cost, I enabled the enhanced-context mode only in testing environments and kept the production version optimized for cost.
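
For the multi-panel synchronization challenge, one way to batch UI updates is to queue incoming translations and render them in a single pass. The class below is a sketch under assumptions: PanelUpdateBatcher, its method names, and the 50 ms default delay are hypothetical and only cover the UI-update side, not the batching of API calls.

```javascript
// Hypothetical helper: coalesce translation updates for several language
// panels so the UI is re-rendered once per flush instead of once per update.
class PanelUpdateBatcher {
  // render: called with { [language]: [{ segmentId, text }, ...] } on each flush.
  // schedule: how to defer the flush; injectable so it can be swapped for
  // requestAnimationFrame in a browser or a manual trigger in tests.
  constructor(render, schedule = (fn) => setTimeout(fn, 50)) {
    this.render = render;
    this.schedule = schedule;
    this.pending = new Map(); // language -> Map(segmentId -> latest text)
    this.flushScheduled = false;
  }

  queue(language, segmentId, text) {
    if (!this.pending.has(language)) this.pending.set(language, new Map());
    // A newer text for the same segment overwrites the older one.
    this.pending.get(language).set(segmentId, text);
    if (!this.flushScheduled) {
      this.flushScheduled = true;
      this.schedule(() => this.flush());
    }
  }

  flush() {
    this.flushScheduled = false;
    if (this.pending.size === 0) return;
    const batch = {};
    for (const [language, segments] of this.pending) {
      batch[language] = [...segments].map(([segmentId, text]) => ({ segmentId, text }));
    }
    this.pending.clear();
    this.render(batch);
  }
}
```

Because only the latest text per segment is kept, a panel that receives several partial results in quick succession repaints once with the final version.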

Accomplishments that we're proud of

Seamless Real-time Experience: Adapted the Gemini API for real-time transcription with low latency
Multi-language Support: Achieved simultaneous translation to 10+ languages with context-aware accuracy
User-Friendly Design: Created an intuitive interface that requires no technical knowledge — anyone can start transcribing in seconds
Sharing Innovation: The QR code sharing feature makes it incredibly easy to share live transcriptions with remote participants
Gemini Integration Depth: Integrated multiple Gemini API capabilities (speech recognition, text refinement, translation, chat, analysis) into a cohesive workflow

What we learned

Creative API Application: I learned how to adapt APIs to use cases beyond their original design. I turned Gemini's batch processing into real-time streaming through audio chunking and combined multiple Gemini capabilities (speech recognition, text refinement, multilingual translation, conversational AI, and content analysis) into a single end-to-end workflow.
Balancing Cost and Quality: I gained a much clearer view of the tradeoff between quality and cost in AI applications. Providing more contextual information significantly improves output quality, but it also substantially increases API costs. I learned to weigh user experience against operating cost and to choose a different balance for different use cases.
Real-time Collaboration Complexity: I came to understand the complexity of building scalable real-time collaborative features. While implementing the sharing functionality, I identified the core needs of meeting participants: not just viewing the transcription in real time, but also synchronized multilingual translation, preservation of historical records, and seamless cross-device access.
Social Value of Accessibility: Realized that AI-driven real-time transcription is more than just a convenience tool—it's a vital bridge for breaking down communication barriers. It enables people with hearing impairments to participate in meetings, helps language learners understand content, and allows international teams to overcome language gaps, truly achieving inclusive communication.
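
The cost and quality tradeoff described above can be pictured as a single tunable budget for how much earlier transcript is sent with each refinement request. The sketch below is an assumption-laden illustration: selectContext is a hypothetical name, and character count is used only as a rough stand-in for token cost.

```javascript
// Hypothetical helper: pick the most recent transcript lines that fit within
// a character budget. A larger budget means more context (better refinement)
// and a larger, more expensive request; a budget of 0 sends no context.
function selectContext(previousLines, maxChars) {
  const selected = [];
  let used = 0;
  // Walk backwards so the newest lines are kept first.
  for (let i = previousLines.length - 1; i >= 0; i--) {
    const cost = previousLines[i].length + 1; // +1 for the joining newline
    if (used + cost > maxChars) break;
    selected.unshift(previousLines[i]);
    used += cost;
  }
  return selected;
}
```

With a setup like this, a testing build could use a generous budget while a production build uses a small one, without changing any other code.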

What's next for MultiTrans

Short-term Goals:

Custom Vocabulary: Allow users to add domain-specific terms for improved accuracy
Speaker Diarization: Identify and label different speakers in conversations
Export Formats: Add support for SRT, VTT, and DOCX export formats
RAG Integration: Implement Retrieval-Augmented Generation to enable the built-in AI to produce comprehensive meeting notes similar to NotebookLM
Enhanced Real-time Performance: Further reduce transcription latency to deliver an even more seamless real-time experience

Long-term Vision:

Mobile Apps: Native iOS and Android apps for on-the-go transcription
Voice Cloning: Use Gemini to generate translated audio in the original speaker's voice

Community Goals:

Open-source core components to help other developers build multilingual applications
Create educational content about real-time AI integration
Build a community of users who can share translation templates and best practices

Built With

code
css3
file-system-api
firebase-cloud-functions
firebase-hosting
google-gemini-api
html5
indexeddb
javascript
mediarecorder-api
node.js
qr
web-audio-api

Updates

EcoveMu ChuanMu started this project — Feb 09, 2026 04:15 PM EST
