Monti | AI voice companion

Child profile setup and scenario selection — personalized by name, age, and interests
AI character calls the child — celebrates with fanfare when the goal is achieved
System Architecture — Flutter app connects to Vertex AI (Gemini Live API) and Cloud Run on Google Cloud

Inspiration

It started with my niece. She absolutely refused to brush her teeth — no amount of asking, pleading, or explaining worked. But one day, while playing pretend phone calls with her stuffed bear, she suddenly announced: "Bear says I should brush my teeth! OK!" and ran to the bathroom.

That moment stuck with me. A third party — even an imaginary one — could reach her in a way her parents couldn't. This isn't a coincidence. Research on Parasocial Relationships (Horton & Wohl, 1956) shows that children form genuine emotional bonds with characters, making them more receptive to guidance. And Self-Determination Theory (Ryan & Deci, 2000) explains why: when autonomy is preserved — when the child chooses to act rather than being told — motivation becomes intrinsic and lasting.

I thought: what if we could combine the power of a beloved character with the intelligence of a real-time AI that listens, adapts, and guides through questions? That's how Monti was born.

What It Does

Monti is a real-time AI voice companion for children aged 2–8, built for the Live Agents category. A friendly AI character literally calls the child's phone, creating a magical moment — "Monty is calling you!"

The conversation is fully voice-driven with no buttons needed. The child just talks naturally, and Monti:

Listens with Voice Activity Detection — no tap-to-talk
Responds with natural, age-adaptive speech in real time
Handles interruptions gracefully via barge-in support
Guides through Socratic questioning — never commands, always asks
Celebrates when the child decides to act — with fanfare and a "pinky promise"

Example flow for a tooth-brushing scenario:

🐻 "Heeey Yuu-chan! What yummy food did you eat today?" 👧 "Curry!" 🐻 "Ohhh! What do you think happens to curry bits on your teeth?" 👧 "They stay there?" 🐻 "Right! What could you do about that?" 👧 "Brush my teeth!" 🐻 "Yaaay! It's a pinky promise with Monty! Go brush! 🎉"

The child decided on their own. No commands. No fear. Just guided discovery.

The Science Behind It

This isn't just a chatbot with a cute voice. Every conversation design choice is backed by developmental psychology:

Principle	Application in Monti	Evidence
Montessori Education	Child discovers, AI guides	Cognitive flexibility d=0.61 (Lillard & Else-Quest, 2006)
Socratic Method	Questions, not answers	Critical thinking gains in 5-6 year-olds (Reznitskaya et al., 2009)
Self-Determination Theory	Autonomy over control	Lasting behavior change vs. temporary compliance (Ryan & Deci, 2000)
Growth Mindset	Praise effort, not ability	Process praise increases persistence 30%+ (Mueller & Dweck, 1998)
Personalization	Always calls child by name	Name recognition activates unique brain regions (Mayer, 2005)
Age-Adaptive Pacing	Slower speech for younger kids	Slower rate improves comprehension in young children (Zangl & Mills, 2007)

How We Built It — 5 Days, Start to Finish

Day 1-2: Foundation & Gemini Live API

Built the Flutter app foundation and connected to Gemini's Live API via WebSocket. Got real-time audio streaming working — 16kHz PCM in, 24kHz PCM out. The first time Monty spoke back was magical.

Day 3: The Audio Pipeline Battle

The hardest part of the entire project. Three major issues:

Echo barge-in — The speaker's audio leaked into the microphone, causing Gemini to think the child was interrupting. Solved with _isSpeaking flag + calculated drain timing based on buffered PCM bytes minus elapsed playback time
Function calling failure — The Google AI preview model never called end_conversation. Switched to Vertex AI GA model (gemini-live-2.5-flash-native-audio) which resolved it immediately
Premature goal detection — The model sometimes called end_conversation on the first turn. Added both prompt-level and code-level guards (minimum 2 completed turns)

Day 4: Cloud Run & Character Generation

Built a Cloud Run token server so the app never needs hardcoded credentials
Added AI character generation using Gemini 3.1 Flash Image Preview — parents describe any character ("a friendly purple dinosaur") and get a flat emoji-style avatar instantly
This is what makes Monti truly AI-native: both the conversation AND the characters are AI-generated

Day 5: Polish & Submission

Age-adaptive system prompts (3-4: baby-talk with stretched vowels, 5-6: moderate, 7+: natural)
detail.design UI techniques: page transitions, celebration screen with floating emojis, talking pulse animation
Character asset images, incoming call simulation, bilingual support (EN/JP)

Google Cloud Architecture

Flutter App ──WebSocket──► Vertex AI (Gemini Live API)
     │                      • Native audio processing
     │                      • VAD & barge-in
     │                      • Function calling
     │
     ├──── HTTP ──────────► Cloud Run
     │                      • Token server (service account auth)
     │                      • Character image generation
     │                      • Gemini 3.1 Flash proxy

Google Cloud services used:

Vertex AI — Gemini Live 2.5 Flash Native Audio (GA) for real-time voice conversation
Cloud Run — Backend token server + AI character generation proxy
Gemini 3.1 Flash Image Preview — On-demand character avatar generation

What We Learned

Vertex AI GA vs Google AI Preview — Function calling only worked reliably on the GA model. This single switch fixed our biggest blocker
Audio timing is everything — The gap between "server finished sending data" and "speaker finished playing" was the root cause of 80% of our UX bugs
Designing for children ≠ designing for adults — Slower pace, shorter sentences, emotional warmth, and name-calling matter far more than response accuracy
AI-generated characters change the product — Moving from template selection to "describe your character" transformed Monti from an app into a platform

What's Next

Parent dashboard with AI-generated conversation insights and progress tracking
Community scenario marketplace — parents share custom missions
Speech development analytics based on conversation patterns
Multi-language expansion — currently EN/JP, targeting 24 languages supported by Gemini
Rive character animations — replacing static images with dynamic expressions

Built With

cloud-run
dart
elevenlabs
flask
flutter
gemini-3.1-flash-image-preview
gemini-live-api
go-router
google-cloud
python
riverpod
vertex-ai
websocket

Updates

Eiji Motomura started this project — Mar 16, 2026 04:42 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.