Inspiration
Last year, Aika and I travelled through seven countries, with stops in Bangkok, Hiroshima, Fukuoka, Bali, Kathmandu, Cambodia, Chiang Mai, and Langkawi. The digital nomad thing sounds romantic until you're standing with 5% battery, trying to explain your location to a Grab driver with your tiny Thai vocabulary, repeating "khrab, khrab."
That moment stuck with me. Not because it was dramatic, but because it was so ordinary. And it felt so fixable. It's 2026, the age of AI, and people are still helpless without Google Translate.
But the real spark came from Aika's world. She grew up on a US military base in Japan and now teaches Japanese to US service members. Tens of thousands of American families rotate through bases in Okinawa, Yokosuka, Sasebo, and Misawa every 2-3 years. Most never learn enough Japanese to make a phone call.
Her friend's mom — an American woman living in Japan for three years — paid $20 to a concierge service just to have someone make a 10-minute phone call to book a local izakaya for her husband's birthday. This isn't unusual. Expat concierge services charge $3-$20 per phone call, interpreters run $80-$130/hour, and even the cheapest app-based interpreter costs $1/minute.
These aren't wealthy expats. They're military families, English teachers, and students: people just trying to live their lives in a country where they don't speak the language. We kept hearing the same stories, and every one of them felt fixable.
What it does
Fly Translate is a real-time speech-to-speech translator. No typing. No buttons. No taking turns. Two people open the app, pick their languages, and have a conversation.
I speak Hindi, Aika speaks Japanese, and we just talk. My phone speaks Japanese to her. Her phone speaks Hindi to me. Simultaneously.
It supports 70+ languages, any pair — Hindi to Japanese, Thai to Korean, Arabic to French. We instruct Gemini to be a strict, transparent translation layer. No commentary, no helpfulness, no "here's the translation." When someone says "I'd like a table for two," the other person hears exactly that. Nothing more.
The technology disappears and you're just two people talking on a call.
How we built it
Dual-session architecture: When two users connect, our server creates two independent Gemini Live API sessions — one per translation direction (e.g., Hindi→Japanese and Japanese→Hindi). This is critical: a single bidirectional session gets confused about who is speaking which language.
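Here's a condensed sketch of that idea, assuming the google-genai SDK's Live API (`client.aio.live.connect`); `build_config` and `open_room` are hypothetical names, and the real server does more wiring than shown:

```python
from contextlib import AsyncExitStack

from google import genai
from google.genai import types

MODEL = "gemini-2.5-flash-native-audio-latest"

def build_config(source: str, target: str) -> types.LiveConnectConfig:
    # Each session translates in exactly one direction, so its system
    # instruction names one source language and one target language.
    return types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction=f"Translate {source} speech into {target} speech.",
    )

async def open_room(client: genai.Client, lang_a: str, lang_b: str) -> None:
    # Two independent Live sessions, one per translation direction.
    async with AsyncExitStack() as stack:
        a_to_b = await stack.enter_async_context(
            client.aio.live.connect(model=MODEL, config=build_config(lang_a, lang_b)))
        b_to_a = await stack.enter_async_context(
            client.aio.live.connect(model=MODEL, config=build_config(lang_b, lang_a)))
        # ...pipe user A's mic into a_to_b and its output to user B,
        # and user B's mic into b_to_a and its output to user A...
```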
Streaming audio, not text: Audio streams continuously over WebSockets as raw PCM binary frames. There's no intermediate text step and no base64 encoding overhead. Speech goes in, translated speech comes out. Gemini's built-in Voice Activity Detection handles turn-taking automatically — no push-to-talk button needed.
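The two pump loops look roughly like this (a sketch under the same assumptions: FastAPI's `receive_bytes`/`send_bytes` on the browser side, the Live session's `send_realtime_input`/`receive` on the Gemini side):

```python
from fastapi import WebSocket
from google.genai import types

async def mic_to_gemini(ws: WebSocket, session) -> None:
    # Raw 16 kHz PCM frames from the browser go straight into the session.
    while True:
        chunk = await ws.receive_bytes()
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000"))

async def gemini_to_listener(session, ws: WebSocket) -> None:
    # Translated 24 kHz PCM comes back as inline bytes; forward as-is.
    while True:  # receive() ends at each turn boundary, so loop for the call
        async for message in session.receive():
            if message.data:
                await ws.send_bytes(message.data)
```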
Strict translation prompt: We instruct Gemini to be a transparent translation layer with explicit rules. No commentary, no conversational responses, no "here's the translation." This required careful prompt engineering — early versions had Gemini answering questions instead of translating them.
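To give a flavor of the final shape (an illustrative paraphrase, not our exact prompt):

```python
# Illustrative system instruction; the real prompt went through many rounds.
def translation_prompt(source_lang: str, target_lang: str) -> str:
    return f"""You are a transparent translation layer, not an assistant.
Translate everything the speaker says from {source_lang} into {target_lang}.
Rules:
- NEVER answer questions. Translate them so the other person can answer.
- NEVER add commentary, greetings, or phrases like "here's the translation".
- NEVER respond conversationally. Output ONLY the translated speech.
- Preserve tone, register, and intent."""
```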
Minimal, intentional stack:
- Model: `gemini-2.5-flash-native-audio-latest` via the Google GenAI SDK (not ADK; we're a translation bridge, not an agent with tools)
- Backend: Python/FastAPI, under 500 lines of code. WebSocket endpoint handles both binary audio frames and JSON signaling.
- Frontend: Single HTML file, vanilla JS, no framework, no build step. AudioWorklet captures mic at 16kHz, Web Audio API plays back at 24kHz. Installs as a PWA on any phone.
- Deployment: Google Cloud Run with session affinity, always-on CPU, and 3600s timeout for long calls. Secret Manager for API keys.
- Audio format: 16kHz PCM 16-bit mono input → Gemini → 24kHz PCM 16-bit mono output, streamed in 100ms chunks over binary WebSocket frames (chunk-size arithmetic sketched just below).
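For a feel of the raw numbers, here is the chunk-size arithmetic implied by those formats (a derivation from the figures above, not code from the repo):

```python
# 16-bit mono PCM, streamed in 100 ms chunks.
BYTES_PER_SAMPLE = 2   # 16-bit

mic_rate = 16_000      # Hz: browser -> server -> Gemini
out_rate = 24_000      # Hz: Gemini -> server -> browser
chunk_ms = 100

mic_chunk = mic_rate * BYTES_PER_SAMPLE * chunk_ms // 1000   # 3,200 bytes
out_chunk = out_rate * BYTES_PER_SAMPLE * chunk_ms // 1000   # 4,800 bytes
```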
We kept it deliberately lean: no database (rooms are ephemeral and live in memory), no authentication, no framework on the frontend.
Challenges we ran into
Getting Gemini to behave as a pure translator, not a chatbot. Early prompts had Gemini responding with "Sure, here's the translation" or answering questions instead of translating them. If a user said "What time do you close?", Gemini would try to answer instead of passing the question through. We iterated on the system prompt extensively, adding explicit "NEVER" rules to force strict translation-only behavior.
Audio format alignment between browser, server, and Gemini. Mismatched sample rates, endianness, or chunk sizes produce silence or noise — with no helpful error message. The Gemini Live API expects 16kHz PCM input and returns 24kHz PCM output. Getting the browser's AudioWorklet to capture at exactly 16kHz, stream it cleanly over WebSockets, and then play back 24kHz audio without gaps or pops took careful attention to byte-level details.
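One illustration of those byte-level details: a tiny guard like this (a hypothetical helper, not from our codebase) turns a format mismatch into a loud error instead of silence:

```python
import array

def pcm_chunk_ms(chunk: bytes, sample_rate: int) -> float:
    """Duration of a 16-bit little-endian mono PCM chunk, in milliseconds."""
    if len(chunk) % 2:
        raise ValueError("odd byte count: not 16-bit PCM, or a torn frame")
    samples = array.array("h")   # signed 16-bit, native (little-endian) order
    samples.frombytes(chunk)
    return 1000 * len(samples) / sample_rate

# A 3,200-byte mic chunk at 16 kHz should come out to exactly 100 ms.
assert pcm_chunk_ms(b"\x00\x00" * 1600, 16_000) == 100.0
```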
WebSocket lifecycle management. Users disconnect unexpectedly (close the tab, lose network). Gemini sessions have a 15-minute limit. Callbacks fire on stale connections. Every callback needed try/except wrapping to handle the case where we're trying to send translated audio to a user who already left. Coordinating the cleanup of rooms, Gemini sessions, and WebSocket connections without race conditions was one of the trickier engineering problems.
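The defensive pattern, sketched (names are hypothetical, and the real cleanup logic is more involved):

```python
from fastapi import WebSocket
from starlette.websockets import WebSocketState

async def relay_audio(listener: WebSocket, chunk: bytes) -> None:
    # A Gemini callback may fire after the listener already left; swallow
    # the failure here and let room cleanup tear everything down once.
    try:
        if listener.client_state == WebSocketState.CONNECTED:
            await listener.send_bytes(chunk)
    except Exception:
        pass  # stale socket: the user closed the tab or lost network
```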
Dual-session coordination. We initially considered a single Gemini session handling both directions. It didn't work — the model got confused about which language to translate into. The dual-session architecture (one dedicated session per direction) was the key insight that made everything work reliably.
Accomplishments that we're proud of
- Two people who don't share a language can have a natural, flowing conversation through their phones — no turn-taking, no buttons, no awkward pauses.
- 70+ language pairs work with zero per-language configuration. The system prompt and Gemini config are generated dynamically for any source-target pair. Adding a new language is a single line in the config (see the sketch after this list).
- The entire backend is under 500 lines of code. We deliberately avoided clutter and over-engineering — no framework on the frontend, no database, no unnecessary abstractions.
- 65 unit tests covering the backend and integration flows, all passing with mocked Gemini sessions.
- Deployed and live on Google Cloud Run. Anyone can try it right now — open two phones, pick two languages, and talk.
- The translation feels natural. Gemini's native audio model preserves tone and intent, and the VAD-based turn-taking means conversations flow like real phone calls.
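A minimal sketch of what that single line looks like (illustrative names, not our actual config):

```python
# Each supported language is one entry; the directional prompt and Live
# session config are derived from any (source, target) pair at call time.
LANGUAGES = {
    "hi": "Hindi",
    "ja": "Japanese",
    "th": "Thai",
    "ko": "Korean",   # adding a language really is one line
}
```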
What we learned
- Gemini's native audio model is genuinely fast enough for real-time conversation when you stream raw PCM and avoid text intermediaries. The sub-second latency makes it feel like a real phone call, not a translation app.
- The dual-session architecture is non-negotiable. A single Gemini session handling both translation directions collapses — it loses track of which language to translate into. One dedicated session per direction is the key architectural insight.
- Prompt engineering for translation is its own discipline. A chatbot prompt and a translator prompt are fundamentally different. The model's helpful instincts (wanting to answer questions, add context, be conversational) are exactly what you need to suppress for transparent translation.
- `AudioContext({ sampleRate: 16000 })` lets the browser handle downsampling natively. No manual resampling code needed. This one line eliminated an entire class of audio bugs.
- Binary WebSocket frames matter. Sending PCM audio as raw binary instead of base64-encoded text cuts bandwidth and CPU overhead significantly. At continuous streaming rates, this makes a real difference in latency (rough arithmetic after this list).
- Building something invisible is harder than building something impressive. The goal isn't to show off AI — it's to make users forget they're using a translator. Every UI decision was about removing friction, not adding features.
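The bandwidth claim checks out with back-of-the-envelope arithmetic (base64 encodes every 3 bytes as 4 characters):

```python
# Upstream mic audio: 16 kHz * 16-bit mono.
raw_up = 16_000 * 2            # 32,000 bytes/s as binary frames
b64_up = raw_up * 4 // 3       # ~42,666 bytes/s as base64 text (+33%)

# Downstream translated audio: 24 kHz * 16-bit mono.
raw_down = 24_000 * 2          # 48,000 bytes/s
b64_down = raw_down * 4 // 3   # 64,000 bytes/s
```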
What's next for Fly Translate
Kigaru Calls: telephony integration. Fly Translate becomes the engine behind Kigaru Calls, integrated into our product Kigaru Talks, built for the US military community in Japan. Military families won't need to pay $20 for a concierge to make a phone call. They open the app, call the restaurant directly, and Kigaru handles the translation in real time. We're working on PSTN/Twilio integration so users can dial a real phone number, with no codes or links needed.
Native iOS app. The current PWA works well, but a native app unlocks better audio handling, background audio, push notifications, and a smoother experience. Same WebSocket protocol, same backend.
Group translation. Not just 1-on-1 conversations. A tour guide speaking Japanese to a group where everyone hears it in their own language through their own earbuds. A business meeting across three languages with no human interpreter.
Single-device mode. One phone, walkie-talkie style, for face-to-face conversations. Hand the phone back and forth, or set it on the table between you. Perfect for ordering at a restaurant or asking for directions.
The technology is here. Gemini's native audio model is fast enough for real-time conversation. Cloud infrastructure is cheap enough to run this at scale. The missing piece was never the AI — it was building something that feels invisible enough that people forget they're using it. That's what we're chasing.
Built With
- gcp
- gemini-live
- python
