JustRemind — Multilingual Voice Reminders Without Language Settings
Inspiration
In real life, language is rarely simple.
Many people live and work in multilingual environments. Their phone system language might be English, while their daily conversations and thoughts happen in Chinese — often switching naturally between languages, sometimes within the same sentence.
Most voice interfaces assume a single language context. As a result, developers are forced to:
- Ask users to choose a language in advance
- Require manual language switching
- Or build fragile language-detection pipelines
These approaches introduce configuration screens, edge cases, and failure modes — and still break down in mixed-language scenarios.
I encountered this problem firsthand. My phone system language is English, but I often create reminders by speaking Chinese, or by mixing Chinese and English together. Existing voice reminder solutions either failed to recognize my speech or required awkward manual setup.
I wanted to eliminate language configuration entirely — and let users speak naturally.
What it does
JustRemind allows users to create reminders using natural voice input in complex, multilingual environments — without selecting or configuring any language.
Users can speak in any language, or mix languages freely, and JustRemind will:
- Automatically detect one or multiple spoken languages
- Accurately transcribe mixed-language speech
- Understand user intent and temporal expressions
- Convert speech into structured, reliable reminders
- Ask clarifying questions when necessary instead of failing silently
The user experience is intentionally minimal: press, speak naturally, and the reminder is created.
For example, even if the phone’s system language is English, a user can say:
- “提醒我明天下午三点 meeting 前半小时喝咖啡” (Remind me to drink coffee half an hour before tomorrow's 3 PM meeting)
- “Next Friday 帮我 remind 一下给 Mom 打电话” (Next Friday, remind me to call Mom)
- “提醒我 after work 去 pick up 干洗的衣服” (Remind me to pick up the dry cleaning after work)
JustRemind understands the intent and creates the correct reminder automatically.
How I built it
The project is built as a native iOS application with a lightweight cloud backend.
- The iOS app records short, user-initiated voice clips.
- Audio is securely sent to a Cloudflare Worker endpoint.
- The Worker forwards the audio to Gemini 3, which performs language detection, transcription, and semantic understanding.
- Gemini returns a strictly structured JSON output containing the detected languages, parsed reminder fields, and a confidence score.
- The app creates the reminder locally and synchronizes it across devices.
- If the cloud request fails, the app gracefully falls back to on-device speech recognition.
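As a rough illustration of that fallback path, the sketch below uses Apple's Speech framework with on-device recognition forced on. It assumes speech-recognition permission has already been granted and simplifies locale handling; the app's actual fallback code may differ.

```swift
import Speech

// Minimal sketch of the on-device fallback: transcribe an already-recorded
// clip locally when the cloud request fails. Assumes SFSpeechRecognizer
// authorization has been requested and granted elsewhere in the app.
func transcribeOnDevice(audioURL: URL, completion: @escaping (String?) -> Void) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale.current),
          recognizer.isAvailable else {
        completion(nil)
        return
    }

    let request = SFSpeechURLRecognitionRequest(url: audioURL)
    request.requiresOnDeviceRecognition = true   // keep the audio off the network

    _ = recognizer.recognitionTask(with: request) { result, error in
        guard let result = result, error == nil else {
            completion(nil)
            return
        }
        if result.isFinal {
            completion(result.bestTranscription.formattedString)
        }
    }
}
```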
This architecture keeps the client simple while allowing Gemini to handle the most challenging parts: multilingual understanding, reasoning, and ambiguity resolution.
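To make the contract between the Worker and the app concrete, here is a minimal sketch of how the app could model and decode Gemini's structured reply. The field names (detectedLanguages, confidence, clarificationNeeded, and so on) are illustrative assumptions, not the project's actual schema.

```swift
import Foundation

// Hypothetical shape of the structured JSON relayed back by the Worker.
// Every field name here is an assumption chosen for illustration.
struct ReminderResponse: Codable {
    struct ParsedReminder: Codable {
        let title: String        // e.g. "Drink coffee"
        let dueDate: Date        // temporal expression resolved to an absolute time
        let notes: String?       // optional extra context
    }

    let detectedLanguages: [String]   // e.g. ["zh-Hans", "en"]
    let reminder: ParsedReminder
    let confidence: Double            // 0.0 ... 1.0
    let clarificationNeeded: String?  // follow-up question when the request is ambiguous
}

// Decode the Worker's reply, expecting ISO 8601 dates and snake_case keys.
func parseResponse(_ data: Data) throws -> ReminderResponse {
    let decoder = JSONDecoder()
    decoder.dateDecodingStrategy = .iso8601
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(ReminderResponse.self, from: data)
}
```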
Challenges I faced
One major challenge was ensuring reliable, structured outputs from a generative model. Since the app depends on precise reminder data, I carefully designed prompts that enforce strict JSON responses and added validation and retry logic on the backend.
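In JustRemind this validation and retry logic lives in the Cloudflare Worker; purely as an illustration of the pattern, the Swift sketch below shows the same idea from the client's side: re-send the request when the reply fails to decode or fails basic sanity checks. It reuses the hypothetical ReminderResponse and parseResponse from the earlier sketch.

```swift
import Foundation

// Illustrative retry-until-valid loop (the real one runs in the Worker).
// Reuses the hypothetical ReminderResponse / parseResponse sketched above.
func fetchStructuredReminder(audio: Data,
                             endpoint: URL,
                             maxAttempts: Int = 3) async throws -> ReminderResponse {
    var lastError: Error = URLError(.cannotParseResponse)

    for _ in 0..<maxAttempts {
        var request = URLRequest(url: endpoint)
        request.httpMethod = "POST"
        request.setValue("audio/mp4", forHTTPHeaderField: "Content-Type") // format is an assumption
        request.httpBody = audio

        do {
            let (data, _) = try await URLSession.shared.data(for: request)
            let parsed = try parseResponse(data)

            // Minimal semantic validation: a usable reminder needs a title and
            // a due date in the future; otherwise treat the reply as invalid
            // and try again.
            guard !parsed.reminder.title.isEmpty,
                  parsed.reminder.dueDate > Date() else {
                throw URLError(.cannotParseResponse)
            }
            return parsed
        } catch {
            lastError = error
        }
    }
    throw lastError
}
```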
Another challenge was privacy. The system minimizes data transfer by only processing short, user-initiated voice clips and avoiding long-term storage of audio data.
What I learned
This project showed me that language complexity is not just a speech recognition problem — it is a reasoning problem.
By delegating multilingual understanding and intent reasoning to Gemini, an entire layer of application complexity disappeared. Instead of building language-specific logic, configuration flows, and edge-case handling, I was able to design a simpler and more robust user experience.
This reinforced how powerful multimodal models can be when applied thoughtfully to everyday tools: not by adding features, but by removing friction.