Inspiration While brainstorming ideas for our hackathon project, we identified a significant barrier in multilingual conversations: the limited accessibility of real-time translation tools. Although Meta’s new Ray-Ban smart glasses offer real-time translation, they carry a $300 price tag and require both users to own a pair, making them impractical for everyday use. We decided to create a more accessible solution that lets people have seamless, real-time conversations across languages using just a phone and headphones, making this technology available to everyone.
What it does SpeechSwap is a real-time language translation app designed to facilitate seamless communication across languages without the need for specialized hardware. With just a phone and headphones, User 1 can tap a button to speak, and after a brief pause, User 2 hears what User 1 said translated into their target language. The translated speech also appears on screen. This enables smooth, uninterrupted conversations between speakers of different languages, making multilingual communication more accessible than ever.
How we built it We built SpeechSwap using Python for the backend, with Flask managing the server-side processes, and a React and Node.js frontend for a smooth, user-friendly interface. The app captures audio input, uses OpenAI’s Whisper to transcribe the speech, translates it with Gemini, and then plays back the translation using Google Text-to-Speech. This stack enabled us to create a cohesive, real-time conversation experience.
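The transcribe-translate-speak pipeline described above can be sketched as a single per-turn function. This is an illustrative sketch, not SpeechSwap's actual code: the stage callables stand in for wrappers around Whisper, Gemini, and Google Text-to-Speech, and all names here (`run_turn`, `TranslationTurn`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranslationTurn:
    source_text: str       # what the speaker said, as transcribed
    translated_text: str   # the translation shown on screen
    audio: bytes           # synthesized speech played back to the listener

def run_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],       # e.g. a Whisper wrapper
    translate: Callable[[str, str], str],     # e.g. a Gemini wrapper
    synthesize: Callable[[str, str], bytes],  # e.g. a Text-to-Speech wrapper
    target_lang: str,
) -> TranslationTurn:
    """Run one conversational turn through the three-stage pipeline."""
    source = transcribe(audio_in)
    translated = translate(source, target_lang)
    return TranslationTurn(source, translated, synthesize(translated, target_lang))
```

Injecting the stages as plain callables keeps the pipeline testable with stubs and makes it easy to swap any one service without touching the others; in the real app, the Flask backend would invoke the actual model APIs behind these interfaces.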
Challenges we ran into
Data Transfer and Latency: Achieving low-latency data transfer between the frontend and backend was essential for a real-time experience. We optimized data flow and timing within Flask to reduce delays in transcription, translation, and playback.
Speech Pause Detection: Ensuring accurate detection of pauses to trigger translation playback was challenging. We fine-tuned the transcription-pause algorithm to maintain natural conversation flow without unintended delays.
UI Feedback in Real-Time: Creating a user-friendly interface with real-time feedback indicators (like "listening," "translating," and "speaking") required careful design. Implementing these indicators in a non-intrusive way allowed users to focus on the conversation without distraction.
Accomplishments that we're proud of
Real-Time Translation Workflow: We successfully developed a continuous workflow where the app detects speech pauses, triggers translation, and plays back audio smoothly, achieving a natural conversation experience.
Effective Integration of Whisper and Gemini: Integrating OpenAI's Whisper for transcription and Gemini for translation in a single pipeline was a significant technical achievement. We managed to seamlessly combine these tools to deliver fast, accurate translations.
User-Friendly Design: Creating an intuitive interface that simplifies the user experience despite the backend complexities was a highlight. Users can engage in real-time conversations with minimal effort, making the app accessible to everyone.
What we learned We gained a deeper understanding of multimodal AI technologies, particularly in integrating voice and text for real-time applications. This project expanded our experience beyond standard web applications, giving us skills in real-time audio processing, timing, and user-centered design for multilingual communication.
What's next for SpeechSwap In the short term, we plan to improve playback accuracy and introduce additional languages to expand the app’s reach. Longer-term goals include enhancing contextual translation accuracy and integrating additional voice-based features, such as automatic language detection, to make SpeechSwap even more versatile for real-time multilingual conversations.