Babel — Immersive Language Learning with Live API
A real-time immersive language learning web application
Built for the Gemini Live Agent Challenge · Creative Storyteller ✍️
🗼 Inspiration
There's a specific kind of frustration that every language learner knows.
You've done the lessons. You've kept the streak. You can conjugate the verb perfectly in a quiz, and then you stand in front of a real person, in a real place, and your mind goes completely blank. Not because you didn't study. Because studying and speaking are two entirely different things that traditional apps never bridge.
Fewer than 10% of people who start learning a language with a traditional app ever reach conversational fluency. Not because they gave up, but because the method was never designed to get them there.
We kept coming back to one question: why do people who move abroad learn a language in months, when people who study it for years never do?
The answer isn't motivation. It isn't intelligence. It's that immersion forces your brain into a completely different relationship with language. When you need a word to survive a situation (to order food, to ask for help, to not get lost), your brain stores it differently. With weight. With context. With emotion. It doesn't go in as data. It goes in as experience.
That's the gap. Every app teaches you data. Nobody was giving you the experience.
The name Babel came early, and once it arrived, it felt inevitable. The Tower of Babel is the origin story of every language barrier that has ever existed. Building something called Babel and pointing it directly at the problem of human miscommunication felt less like branding and more like a mission statement.
We wanted to build something where you forget you're learning. Where the goal is to survive the scene, not to complete the lesson. Where a word you hear in scene one comes back to save you in scene four, and in that moment, you realize you already knew it. You just needed the story to make it stick.
✨ What It Does
Babel is a live voice language immersion experience powered by Google Gemini. You don't study a language here. You use it. Under pressure. Inside a story. From the very first second.
The Experience, Step by Step
┌─────────────────────────────────────────────────────────────┐
│ 1. Sign in → Your library, progress & stories are saved │
│ 2. Pick your native language + the language you want │
│ 3. Choose your difficulty: Simple → Advanced │
│ 4. Enter a world: predefined story or build your own │
│ 5. The story begins — ambient music, live scene, AI voice │
│ 6. You speak. The world responds. The story moves. │
│ 7. Finish — get your score, feedback & a generated video │
│ 8. Your library saves everything. Words. Scenes. Stories. │
└─────────────────────────────────────────────────────────────┘
Story Worlds
Every story in Babel is a cinematic scenario set in the culture of the language you're learning. You are not a student. You are a character.
Predefined stories take you across history, mystery, and wonder:
| Story | World | Narrator |
|---|---|---|
| 🚀 Discover the Universe | Outer space — a mission where the instruments only speak your target language | Wise Space Explorer |
| 🌲 Journey Through the Enchanted Forest | A magical woodland where every creature guards a secret | Friendly Forest Guide |
| 🌊 Voyage Under the Sea | A deep-sea expedition — coral, current, and creatures that only communicate one way | Brave Deep-Sea Diver |
| 🏛️ Time Travel to Ancient Egypt | The pharaohs' court — you arrive without a guide, only your words to navigate | Mysterious Time Keeper |
| 🏰 Escape the Haunted Castle | Every door is locked. Every shadow speaks. The only way out is through the language. | Spooky Castle Ghost |
| 🏝️ Survive on a Deserted Island | No map. No rescue. Just the island, the language, and what you can say. | Adventurous Castaway |
The Custom Story Builder lets you create your own world. Write a title. Choose a genre. Define the narrator's role. Leave the rest blank — Babel's AI fills the entire arc, the vocabulary curve, the obstacles, the music, and the scene images. You write one sentence. The AI builds the world.
The Lock & Key Mechanic
Every scene has one Lock — an obstacle that stops the story cold. The user provides the Key — by speaking a word or phrase in the target language.
The lock doesn't open until you earn it. The story doesn't move until you speak. No passive reading. No clicking. Just you, the scene, and the language.
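As a rough illustration of the Lock & Key mechanic, a scene lock could be modeled along the following lines. This is only a sketch under assumptions, not Babel's actual implementation: the `SceneLock` class, the `normalize` helper, and the rule of exact matching after light normalization are all hypothetical, and the real evaluation is described as happening over live audio.

```python
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Case-fold and trim spaces and common punctuation (hypothetical leniency rule)."""
    return text.casefold().strip(" .,!?;:")


@dataclass
class SceneLock:
    """Hypothetical model: one obstacle per scene, opened by one spoken key."""
    obstacle: str
    key_phrase: str
    opened: bool = False

    def try_key(self, spoken: str) -> bool:
        # The story only advances once the learner produces the key phrase.
        if normalize(spoken) == normalize(self.key_phrase):
            self.opened = True
        return self.opened


lock = SceneLock(obstacle="The cellar door is jammed", key_phrase="ouvre")
```

A wrong attempt such as `lock.try_key("ferme")` leaves the lock closed, while `lock.try_key("Ouvre!")` opens it. Once opened, the lock stays open.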
Four Difficulty Levels
| Level | Feel | Language Used | Challenge |
|---|---|---|---|
| 🟢 Simple | A guide beside you | Native language for all instructions | One word at a time |
| 🟡 Medium | A coach who believes in you | Native + short target phrases | 2-3 word phrases |
| 🔴 Hard | You were dropped in | 100% target language | Full sentences, no hints |
| ⚫ Advanced | There is no game. Only the story. | Native speaker speed, idioms, slang | Native-level expression |
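One way to keep these levels consistent across prompts is to hold them in a single configuration table. The sketch below is an assumption about how that might look: `Level`, `LevelProfile`, `PROFILES`, and `prompt_fragment` are hypothetical names. The values are copied from the table above and from the per-scene interaction counts given later in this write-up (Simple = 1, Medium = 2, Hard = 3+, Advanced = continuous).

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    HARD = "hard"
    ADVANCED = "advanced"


@dataclass(frozen=True)
class LevelProfile:
    narrator_language: str       # the "Language Used" column
    challenge: str               # the "Challenge" column
    interactions_per_scene: str  # kept as text because "3+" and "continuous" are not numbers


PROFILES = {
    Level.SIMPLE: LevelProfile("native language for all instructions", "one word at a time", "1"),
    Level.MEDIUM: LevelProfile("native language plus short target phrases", "2-3 word phrases", "2"),
    Level.HARD: LevelProfile("100% target language", "full sentences, no hints", "3+"),
    Level.ADVANCED: LevelProfile("target language at native speed, with idioms and slang",
                                 "native-level expression", "continuous"),
}


def prompt_fragment(level: Level) -> str:
    """Render one level's rules as a line for a (hypothetical) system prompt."""
    p = PROFILES[level]
    return (
        f"Difficulty {level.value}: speak in {p.narrator_language}; "
        f"expect {p.challenge}; interactions per scene: {p.interactions_per_scene}."
    )
```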
Six Mission Types
The AI Game Master challenges you with six mission types, rotating them so no two scenes feel the same:
- 🎤 Vocal Unlock: say the word or phrase out loud to advance
- ⚡ Word Duel: rapid-fire three-round exchange, native language → target language
- ✍️ Writing Challenge: type what you just heard to reinforce through a second modality
- 🧠 Quiz Challenge: pick the correct answer from 4 multiple-choice options; tests vocabulary comprehension and reading recognition
- 🔊 Echo Challenge: hear the word first, then repeat it (for brand new vocabulary)
- 🎭 Story Choice: two paths presented in the target language; your fluency determines which one you get
The Word Vault
Every word used correctly becomes part of your Word Vault — your personal inventory. Two or three scenes later, the story creates a lock that requires a vaulted word to open.
A word learned in scene one becomes the key that saves you in scene four. This is not spaced repetition. This is memory built from consequence.
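The "two or three scenes later" rule can be sketched as a small data structure. The `WordVault` class and its method names below are hypothetical. The sketch only shows how a story engine might pick vaulted words that are old enough to serve as the key to a new lock.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WordVault:
    """Hypothetical vault: maps each mastered word to the scene where it was learned."""
    learned_in: Dict[str, int] = field(default_factory=dict)

    def add(self, word: str, scene: int) -> None:
        # Keep the first scene in which the word was used correctly.
        self.learned_in.setdefault(word, scene)

    def recall_candidates(self, current_scene: int,
                          min_gap: int = 2, max_gap: int = 3) -> List[str]:
        """Words learned two or three scenes ago, eligible to become a lock's key."""
        return sorted(
            word for word, scene in self.learned_in.items()
            if min_gap <= current_scene - scene <= max_gap
        )


vault = WordVault()
vault.add("eau", scene=1)
vault.add("porte", scene=2)
vault.add("feu", scene=3)
```

With this state, `vault.recall_candidates(current_scene=4)` returns `["eau", "porte"]`: the word from scene one is now available as the key in scene four, while "feu" is still too recent.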
Ambient Music
The moment you enter a story, its world begins before a single word is spoken. Each story has its own ambient soundtrack — the hum of a market, rain on Parisian cobblestones, the quiet of a feudal forge. Music primes emotional state. And emotional state accelerates language retention.
The Video Recap
When your story ends, Babel generates a personalized video recap, powered by Google Veo. It shows what happened in the story, the words you mastered with their translations, your score, and a closing line of encouragement. Your trophy. Your proof. Something you can share.

Your Library

Every session is saved to your personal library:
- All generated scene images from your stories
- Session summaries with your score and personalized feedback
- Your full vocabulary list per story with translations
- Your generated recap video
🏗️ How We Built It
Architecture Overview
The Five Agents
The intelligence of Babel lives in five agents that run simultaneously — each doing one job, invisibly:
🎭 Character Agent Voices the story world. Speaks in the target language at the appropriate difficulty level, plays the AI persona defined by the story, drives dramatic tension, adapts register and vocabulary. Stays in character completely — in Advanced mode, it does not know English exists.
👂 Listener Agent Processes every word the user says in real time via the Gemini Live API's bidirectional audio channel. Evaluates: which language was spoken, what word was attempted, was it correct, was it confident? Passes that signal to all other agents so the story responds to what was actually said.
📚 Pedagogy Agent The invisible teacher. Manages the Word Vault, enforces a strict vocabulary difficulty ramp across six scenes, tracks the streak counter, and engineers story moments where new words arrive through dramatic necessity — never through a pop-up or a list.
🔄 Adaptation Agent Reads struggle signals after every learner response: pauses, native-language switching, broken phrases. Silently adjusts story complexity. Never announces the adjustment. The story just gets harder or easier — and the learner feels it without being told.
🎬 Memory Agent Activates at the end of every story. Scores the session, generates three personalized feedback points, builds the recap video via Veo, and saves the full session to the library. The only moment in the entire experience where the game master steps out of character to say: here is what you actually learned today.
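To make the Adaptation Agent's behavior concrete, here is a minimal sketch of how the three struggle signals it reads (pauses, native-language switching, broken phrases) could drive a complexity setting. Everything specific here is an assumption for illustration: the `TurnSignals` fields, the 4-second pause threshold, the 1-to-5 complexity scale, and the step rule. In Babel itself this behavior is described as prompt-driven.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    pause_seconds: float         # silence before the learner answered
    used_native_language: bool   # learner fell back to their native language
    broken_phrase: bool          # answer was fragmentary or abandoned


def struggle_score(s: TurnSignals, long_pause: float = 4.0) -> int:
    """Count how many struggle signals fired on this turn (0 to 3)."""
    return int(s.pause_seconds >= long_pause) + int(s.used_native_language) + int(s.broken_phrase)


def adjust_complexity(current: int, s: TurnSignals, lo: int = 1, hi: int = 5) -> int:
    """Step down on a clearly struggling turn, up on a clean one, otherwise hold."""
    score = struggle_score(s)
    if score >= 2:
        current -= 1
    elif score == 0:
        current += 1
    # Clamp so the story never leaves the supported range.
    return max(lo, min(hi, current))
```

A single signal holds the level steady, so one long pause alone does not make the story easier. That is a design choice of this sketch, not something the source specifies.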
The Interleaved Output
The technical centerpiece of Babel's Creative Storyteller claim: voice narration and scene illustrations arrive in the same stream.
When the Character Agent begins speaking, it simultaneously calls generate_scene() —
triggering an image generation request to Gemini 2.5 Flash. The image renders while the voice plays.
Text, audio, and visuals are not sequential. They are woven.
This is Gemini's interleaved mixed-output capability — used here not as a demo feature but as the foundation of the entire sensory experience.
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Vanilla JS, Web Components (Custom Elements), Vite 7 |
| Backend | Python 3.10, FastAPI, Uvicorn |
| AI — Voice & Intelligence | Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025) |
| AI — Scene Images | Gemini Image Generation |
| AI — Ambient Music | Vertex AI Lyria 2 (lyria-002) |
| AI — Video Recap | Vertex AI Veo (veo-3.1-fast-generate-preview) |
| Authentication | Firebase Authentication (Google Sign-In) |
| Database | Cloud Firestore |
| Media Storage | Cloudinary (images, video, audio) |
| Rate Limiting | Redis (falls back to in-memory) |
| Analytics | Google BigQuery |
| Deployment | Docker → Google Cloud Run |
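The "Redis (falls back to in-memory)" row describes a common degradation pattern. The sketch below shows one way it could work, without importing a Redis client: `primary` is a hypothetical callable standing in for a Redis-backed check, and the sliding-window algorithm, class names, and limits are assumptions rather than Babel's code.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional


class InMemoryLimiter:
    """Sliding-window limiter kept in process memory."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        q = self.hits[key]
        while q and now - q[0] >= self.window_s:
            q.popleft()  # drop hits that have left the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


class FallbackLimiter:
    """Try a primary (for example Redis-backed) check; use memory if it fails."""

    def __init__(self, primary: Optional[Callable[[str], bool]],
                 fallback: InMemoryLimiter):
        self.primary, self.fallback = primary, fallback

    def allow(self, key: str) -> bool:
        if self.primary is not None:
            try:
                return self.primary(key)
            except Exception:
                pass  # primary store unreachable: degrade to in-memory
        return self.fallback.allow(key)
```

Note that an in-memory fallback only counts requests per process, so on a multi-instance deployment the effective limit is looser while the primary store is down.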
🧱 Challenges We Ran Into
1. Teaching the AI When to Be Silent
The hardest problem in Babel is not technical. It is behavioral.
A language model wants to help. It wants to explain, translate, scaffold, support. But the most powerful moment in Babel is silence — the microphone pulsing, the scene waiting, the user needing to produce the word themselves. That silence is where learning happens.
Prompt engineering the AI to stop talking and wait — and to hold that silence under pressure when a user is clearly struggling — took longer to solve than the entire cloud architecture. The answer was the 3-Beat Rule: no more than three sentences before a challenge, always.
2. Vocabulary That Arrives Too Fast
Early versions of the story arc introduced complex words in scene two. "Magnificent." "Countdown." Words that sounded impressive but broke the experience for beginners.
We rebuilt the vocabulary model from scratch with a strict difficulty ramp:
- Scene 1: simplest visible object (fire, water, door)
- Scene 2: basic action (open, run, stop)
- Scene 3: describing word (dark, big, fast)
- Scene 4: social word (please, help, wait)
- Scenes 5-6: recall only — no new vocabulary
The constraint that changed everything: a word can only be introduced if the scene visually justifies it.
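The ramp above lends itself to a mechanical check on a generated arc. The following sketch assumes a hypothetical scene format (a dict with `new_word`, `category`, and `visible_in_scene`) and hypothetical category labels. It reports violations of the per-scene word type, the recall-only rule for scenes 5 and 6, and the visual-justification constraint.

```python
from typing import Dict, List, Optional

# Category of new word allowed in each scene; None means recall only.
RAMP: Dict[int, Optional[str]] = {
    1: "object",
    2: "action",
    3: "describing",
    4: "social",
    5: None,
    6: None,
}


def check_ramp(scenes: List[dict]) -> List[str]:
    """Return human-readable violations for a six-scene arc (empty list if clean)."""
    problems = []
    if len(scenes) != len(RAMP):
        problems.append(f"expected {len(RAMP)} scenes, got {len(scenes)}")
    for number, scene in enumerate(scenes, start=1):
        allowed = RAMP.get(number)
        word = scene.get("new_word")
        if allowed is None:
            if word is not None:
                problems.append(f"scene {number}: recall only, but introduces '{word}'")
            continue
        if scene.get("category") != allowed:
            problems.append(f"scene {number}: expected a {allowed} word")
        if not scene.get("visible_in_scene", False):
            problems.append(f"scene {number}: '{word}' is not visually justified")
    return problems
```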
3. Making the Story Feel Like a Game, Not a Lesson
The original experience had one interaction per scene — the AI narrated, then asked one question. Too slow. Too passive. Too much like school.
We introduced the Ping-Pong Rule for Hard and Advanced modes: the AI speaks one or two sentences, then stops. Always. The model cannot fill silence with more narration. We also added five new mission types beyond the original vocal challenge, and scaled the number of interactions per scene with difficulty: Simple = 1, Medium = 2, Hard = 3+, Advanced = continuous.
4. Interleaved Output Timing
Getting voice and image to feel simultaneous — not voice-then-image — required careful
orchestration of the generate_scene() tool call timing relative to the audio stream.
The tool must be called first, before narration begins, so that image generation
is already in progress when the first word is spoken.
The AI's natural instinct is to narrate first and call tools after. Breaking that instinct
required explicit rule enforcement and auto-correction loops in the system prompt.
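The ordering described here (start image generation first, narrate while it runs) can be illustrated with plain `asyncio`. This is a simulation under assumptions, not the Live API integration: `generate_scene` and `narrate` below are stand-ins that only sleep, and the `events` list exists purely to make the ordering visible.

```python
import asyncio
from typing import List

events: List[str] = []  # records the order of operations for inspection


async def generate_scene(prompt: str) -> str:
    """Stand-in for the image request; the real call is not shown in the source."""
    events.append("image:start")
    await asyncio.sleep(0.05)  # simulated generation latency
    events.append("image:done")
    return f"<image for {prompt!r}>"


async def narrate(lines: List[str]) -> None:
    """Stand-in for streaming audio; each line takes a moment to 'speak'."""
    for line in lines:
        events.append(f"voice:{line}")
        await asyncio.sleep(0.02)


async def play_scene(prompt: str, lines: List[str]) -> str:
    # Kick off image generation first so it is already in flight during narration.
    image_task = asyncio.create_task(generate_scene(prompt))
    await asyncio.sleep(0)  # yield once so the task actually starts
    await narrate(lines)
    return await image_task


image = asyncio.run(play_scene("flooded basement", ["L'eau monte.", "Vite!"]))
```

After the run, the first entry in `events` is `"image:start"` and the narration entries follow it, which is the "tool first, then voice" order the section describes.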
5. Custom Story Quality at Scale
When a user writes "a detective story set on Mars" and leaves everything else blank, Babel needs to generate a complete, coherent six-scene arc with a vocabulary ramp, Lock & Key obstacles per scene, appropriate music selection, and scene image prompts — all consistent with the story world and the chosen difficulty level.
Making that auto-fill consistently good, without hallucination or genre drift mid-story, required a tightly structured arc template with explicit field-level constraints and a multi-pass generation approach.
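The source does not spell out what the multi-pass approach looks like. One plausible reading is a draft, validate, re-draft loop, sketched below. All names are hypothetical (`validate_arc`, `generate_arc`, `draft_fn`, and the required field names), and `draft_fn` stands in for whatever model call produces an arc from a premise plus the previous pass's errors.

```python
from typing import Callable, Dict, List

# Hypothetical field-level constraints: every scene needs these non-empty fields.
REQUIRED_SCENE_FIELDS = ("lock", "key_phrase", "image_prompt")


def validate_arc(arc: Dict) -> List[str]:
    """Check the arc against the template; return a list of problems."""
    errors = []
    scenes = arc.get("scenes", [])
    if len(scenes) != 6:
        errors.append(f"arc must have 6 scenes, found {len(scenes)}")
    for i, scene in enumerate(scenes, start=1):
        for name in REQUIRED_SCENE_FIELDS:
            if not scene.get(name):
                errors.append(f"scene {i}: missing {name}")
    if not arc.get("genre"):
        errors.append("missing genre")
    return errors


def generate_arc(draft_fn: Callable[[str, List[str]], Dict],
                 premise: str, max_passes: int = 3) -> Dict:
    """Draft, validate, and re-draft with the error list until the arc is clean."""
    errors: List[str] = []
    for _ in range(max_passes):
        arc = draft_fn(premise, errors)  # errors from the last pass guide the retry
        errors = validate_arc(arc)
        if not errors:
            return arc
    raise ValueError(f"arc still invalid after {max_passes} passes: {errors}")
```

Feeding the validator's messages back into the next draft keeps each pass targeted, and the pass limit prevents an endless retry loop when the generator cannot satisfy the template.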
🏆 Accomplishments That We're Proud Of
The Word Vault Payoff
A user hears "You know this word, from the café" and suddenly realizes
that EAU, the word for water they learned in scene one, is the exact word they need
right now to stop a basement from flooding in scene four.
That moment is not a feature. It is the proof of concept. The agent remembered. The story remembered. And the learner never forgot that word again.
Three Mission Types With Zero New Tools
Word Duel, Echo Challenge, and Story Choice are three completely new interaction patterns — none of them required a single new tool call or backend change. They are purely conversational patterns, entirely prompt-engineered, that made the experience feel like a video game instead of a language drill.
Advanced Mode as a Native Conversation
In Advanced mode, the AI character does not know English exists. It speaks at native speed with idioms, register-appropriate language, cultural references, and natural interruptions. When a user sounds non-native, the character reacts as a real person would: "You speak strangely. Where are you from exactly?"
No grammar correction. No lecture. The consequence is built into the story. This is the closest thing to being dropped in a foreign country that a screen can simulate.
A Complete End-to-End Learning Loop
Babel is the first application we know of that closes the full loop: immersive live voice → multimodal scene generation → writing reinforcement → vocabulary tracking → narrative consequence → personalized video recap → saved library.
Three modalities — speaking, listening, writing — in a single session, all woven into one continuous story.
The Design
Babel's visual world is built to feel like you've stepped inside the story before a word is spoken.
The interface runs on a deep space background — #050510 — with aurora animations shifting
across the screen in rotating cyan and purple gradients, alive and breathing behind every scene.
Generated story images don't live inside a box or a panel. They spread across the entire screen,
edge to edge, making every scene feel like a portal rather than a picture.
Glassmorphism layers float over the immersive imagery — semi-transparent cards with
backdrop-filter: blur() and subtle borders that let the world bleed through the UI.
Nothing feels like a traditional app screen. Everything feels like you are in the world,
with the interface barely there around you.
Typography pairs Space Grotesk for headings (geometric, confident, slightly otherworldly)
with Inter for body text, clean and readable against any background.
Aurora shift animations, spring-curve transitions, and custom ease-out-expo easing
give every interaction a physical weight that most apps never achieve.
The result: the moment a story begins, the UI disappears. There is only the world.
📖 What We Learned
Silence is a feature. The most counterintuitive lesson: the moments where Babel says nothing are its most powerful teaching moments. Every instinct in AI product design pushes toward more output, more help, more scaffolding. Babel works because we learned to resist that instinct.
Emotion accelerates memory. The research is clear — information attached to an emotional experience is retained longer and recalled more easily. This is why ambient music matters. This is why story consequences matter more than grades. The brain doesn't store facts. It stores feelings about facts.
The best correction is a consequence. When we removed correction badges and replaced them with story consequences — the guard doesn't believe you, the door stays closed, the water rises — users tried harder and retained more. Being corrected by a teacher feels like failure. Being corrected by the world feels like an adventure.
Prompt engineering is product design. The behavior of Babel's five agents is defined entirely in language — in the structure, rhythm, and precision of the system prompt. We spent as much time on prompt architecture as on code architecture. The 3-Beat Rule, the vocabulary ramp, the Ping-Pong Rule — these are design decisions expressed in words, not code. That is a new kind of craft.
The name carries weight. Naming the app Babel was not cosmetic. It gave us a myth to write against, a symbol to build toward, and a single word that tells the entire origin story of the problem we set out to solve. Every strong product has a story. The name is the story.
🔭 What's Next for Babel
Short Term
Vocabulary Constellation UI — a visual star map where every word learned is a glowing star, connected to the scenes where it was learned. Words used multiple times shine brighter. Words from your first session are still there, dimmer but present. Your entire learning history, visible.
More Story Worlds — expand the predefined library across all major world languages and cultural eras: Medieval Andalusia (Arabic), Meiji Tokyo (Japanese), Renaissance Florence (Italian), Revolutionary Paris (French), Ottoman Constantinople (Turkish)
Multiplayer Mode — two learners enter the same story world. One plays the hero. One plays an NPC. Both must communicate in the target language to advance. Language learning as collaborative theatre.
Medium Term
Cross-Session Memory — the Memory Agent currently tracks within a session. Next: it tracks across sessions. Words learned in last Tuesday's Paris story appear naturally in this Friday's Tokyo story — because the agent remembers your vocabulary.
Teacher Dashboard — classrooms can assign specific story worlds to students, track vocabulary acquisition per learner, and see which words needed the most scaffolding. Babel as a school tool, not just a personal one.
Voice Personality Customization — choose your AI character's personality, accent, and era. A gruff 1940s detective. A warm 1970s grandmother. A sharp contemporary journalist. Same story world. Completely different emotional tone.
Long Term
Babel Live — synchronous group sessions where a human teacher runs the Game Master role and Babel handles all image generation, vocabulary tracking, and session scoring automatically. AI-augmented human instruction, not AI replacing it.
The Babel Archive — a community library of user-created story worlds, rated and curated. If your custom story is exceptional, it enters the public library. Language learning content created by learners, for learners.
Built for the Gemini Live Agent Challenge · 2026
by Yahya Samet & Ghada Eladeb
Built With
- cloud-run
- cloudinary
- docker
- fastapi
- firebase
- google-bigquery
- javascript
- python
- redis
- uvicorn
- vanillajs
- websockets
