EduMind — AI Virtual Teacher with Lip-Synced 3D Avatar

Inspiration

260 million children worldwide lack access to quality education. In Bangladesh alone, private tutoring costs $500–$1,500 a year, which many families cannot afford. Existing AI tools are text-based chatbots: smart but cold. Students don't just need answers; they need a teacher they can see, hear, and interact with naturally. We asked: what if every child had a personal AI teacher that feels real?

What it does

EduMind is a 3D AI virtual teacher that speaks, emotes, and lip-syncs in real time. Students have natural voice conversations with a lifelike avatar — just like sitting in a real classroom.

Key capabilities:

Live Voice Conversation — Full-duplex voice via Gemini Live API with real-time lip-synced 3D avatar
AI Image Generation — Educational diagrams with Bengali text support via Gemini 3 Pro Image, explained by the teacher using Vision API
Smart Quiz System — Adaptive MCQs with Bayesian Knowledge Tracing (85% mastery prediction)
Deep Research — Google Search-grounded comprehensive reports on any topic
Curriculum Mode — RAG-powered lessons from NCTB/CBSE/Cambridge syllabi
Dual Avatars — Male and female teachers with 8 emotions and 8 hand gestures

All five learning modes work in both text chat and live voice conversation.
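
The Bayesian Knowledge Tracing used by the quiz system can be sketched with the textbook update rule below. This is a minimal illustration, not EduMind's actual code: the function names and the parameter values (`pTransit`, `pSlip`, `pGuess`) are assumptions chosen for the example, with `pGuess` set to 0.25 as a plausible value for four-option MCQs.

```javascript
// Textbook Bayesian Knowledge Tracing for a single skill.
// All parameter values are illustrative assumptions, not tuned values.
const DEFAULT_PARAMS = { pTransit: 0.1, pSlip: 0.1, pGuess: 0.25 };

// Probability that the student answers the next question correctly,
// given the current mastery estimate pKnown.
function predictCorrect(pKnown, params = DEFAULT_PARAMS) {
  return pKnown * (1 - params.pSlip) + (1 - pKnown) * params.pGuess;
}

// Update the mastery estimate after observing one answer.
function updateMastery(pKnown, wasCorrect, params = DEFAULT_PARAMS) {
  const { pTransit, pSlip, pGuess } = params;
  // Bayes step: how likely the skill was already known, given the answer.
  const posterior = wasCorrect
    ? (pKnown * (1 - pSlip)) / (pKnown * (1 - pSlip) + (1 - pKnown) * pGuess)
    : (pKnown * pSlip) / (pKnown * pSlip + (1 - pKnown) * (1 - pGuess));
  // Learning step: the student may have learned the skill on this attempt.
  return posterior + (1 - posterior) * pTransit;
}

// Example: track mastery over a short answer sequence from a prior of 0.3.
let mastery = 0.3;
for (const answer of [true, true, false, true]) {
  mastery = updateMastery(mastery, answer);
}
console.log("mastery estimate:", mastery.toFixed(3));
console.log("P(next correct):", predictCorrect(mastery).toFixed(3));
```

An adaptive quiz could use `predictCorrect` to pick the next question's difficulty and treat a skill as mastered once the estimate crosses a chosen threshold.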

How we built it

Multi-Model Orchestration with Gemini 3:

Gemini 3 Flash (gemini-3-flash-preview) — Powers chat, quiz generation, research, and curriculum with 1M token context
Gemini 3 Pro Image (gemini-3-pro-image-preview) — Generates educational diagrams with accurate Bengali text rendering
Gemini 2.5 Flash Native Audio (gemini-2.5-flash-native-audio-preview) — Real-time bidirectional voice via Live API with tool calling
Google Cloud TTS — Fallback text-to-speech in 70+ languages

The Core Innovation (Live API Lip Sync): We built a custom _autoLipsyncFromPCM() engine that analyzes the Gemini Live API's PCM16 audio stream in 25ms segments, calculates the RMS amplitude of each segment, and maps it to viseme mouth shapes (aa/O/E/I). The result is real-time lip sync on a Three.js 3D avatar during live voice conversations. To our knowledge, no other platform does this.
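
The amplitude-to-viseme idea can be sketched as follows. This is not the actual `_autoLipsyncFromPCM()` code: the 24 kHz sample rate, the threshold values, the assignment of louder segments to wider mouth shapes, and the `sil` (closed mouth) viseme are all assumptions made for illustration.

```javascript
const SAMPLE_RATE = 24000; // assumed mono output rate of the audio stream
const SEGMENT_MS = 25;     // segment length described in the text

// Hypothetical thresholds on normalized RMS (0..1), loudest first.
const VISEME_THRESHOLDS = [
  { min: 0.30, viseme: "aa" },
  { min: 0.15, viseme: "O" },
  { min: 0.06, viseme: "E" },
  { min: 0.02, viseme: "I" },
];

// Root-mean-square level of PCM16 samples, normalized to 0..1.
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Pick the first viseme whose threshold the level reaches.
function visemeForLevel(level) {
  for (const { min, viseme } of VISEME_THRESHOLDS) {
    if (level >= min) return viseme;
  }
  return "sil"; // assumed name for the closed-mouth shape
}

// Split an Int16Array into 25 ms segments and emit one viseme per segment.
function visemesFromPCM(pcm, sampleRate = SAMPLE_RATE) {
  const segmentLength = Math.round((sampleRate * SEGMENT_MS) / 1000);
  const timeline = [];
  for (let start = 0; start < pcm.length; start += segmentLength) {
    const level = rms(pcm.subarray(start, start + segmentLength));
    timeline.push({
      timeMs: (start * 1000) / sampleRate,
      viseme: visemeForLevel(level),
      level,
    });
  }
  return timeline;
}
```

Each timeline entry could then drive the avatar's mouth morph target at `timeMs`, using `level` as the opening strength.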

Live API Tool Calling: During voice conversations, the AI teacher can execute tools (generate images, create quizzes, run deep research) without interrupting the conversation flow.
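
One way to keep tools from interrupting the conversation is to run every requested call concurrently and turn failures into error responses instead of exceptions. The sketch below assumes a message shape of `{ id, name, args }` per call and `{ id, name, response }` per result; the handler names and return values are hypothetical placeholders, not EduMind's real tools.

```javascript
// Hypothetical tool handlers. Real ones would call the image, quiz,
// and research backends instead of returning canned objects.
const toolHandlers = {
  generate_image: async ({ prompt }) => ({ status: "ok", caption: `Diagram: ${prompt}` }),
  create_quiz: async ({ topic }) => ({ status: "ok", topic, questions: [] }),
};

// Run all requested tool calls concurrently. This function never rejects:
// unknown tools and handler errors become error responses, so the voice
// session can continue while results are sent back by call id.
async function handleToolCalls(functionCalls, handlers = toolHandlers) {
  return Promise.all(
    functionCalls.map(async ({ id, name, args }) => {
      const known = Object.prototype.hasOwnProperty.call(handlers, name);
      if (!known) {
        return { id, name, response: { error: `Unknown tool: ${name}` } };
      }
      try {
        return { id, name, response: await handlers[name](args ?? {}) };
      } catch (err) {
        return { id, name, response: { error: String(err?.message ?? err) } };
      }
    })
  );
}
```

Because audio playback and tool execution run independently here, the avatar can keep speaking while an image or quiz is being prepared.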

Tech Stack: Three.js + TalkingHead (custom fork), Vanilla JS, Firebase Auth/Firestore, Stripe, Vercel Edge Functions.

Challenges we faced

Lip sync from raw PCM — Gemini Live API returns raw audio with no viseme/phoneme data. We had to build amplitude-to-viseme mapping from scratch using RMS analysis on 25ms audio segments.
Tool calling during streaming audio — Coordinating image generation and quiz overlays while the avatar is speaking required careful state management.
Bengali text in AI images — Most image models fail at non-Latin scripts. Gemini 3 Pro Image handles Bengali accurately.
Audio buffering — Managing AudioWorklet streaming with proper buffering to prevent gaps or overlaps in avatar speech.
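
The audio buffering problem above is commonly handled with a ring buffer between the network stream and the audio callback. The following is a minimal sketch under that assumption, not the project's implementation; the class name `PcmRingBuffer` and its drop-when-full policy are illustrative choices.

```javascript
// Fixed-capacity ring buffer that accepts PCM16 chunks and serves
// Float32 samples, padding with silence when it runs dry.
class PcmRingBuffer {
  constructor(capacity) {
    this.data = new Float32Array(capacity);
    this.readIndex = 0;
    this.writeIndex = 0;
    this.size = 0; // number of samples currently buffered
  }

  // Append PCM16 samples as floats in [-1, 1). Returns how many samples
  // were stored; anything beyond the free space is dropped.
  write(pcm16) {
    const count = Math.min(pcm16.length, this.data.length - this.size);
    for (let i = 0; i < count; i++) {
      this.data[this.writeIndex] = pcm16[i] / 32768;
      this.writeIndex = (this.writeIndex + 1) % this.data.length;
    }
    this.size += count;
    return count;
  }

  // Fill `output` with buffered samples in order. On underrun the rest of
  // `output` is zeroed (silence). Returns the number of real samples read.
  read(output) {
    const count = Math.min(output.length, this.size);
    for (let i = 0; i < count; i++) {
      output[i] = this.data[this.readIndex];
      this.readIndex = (this.readIndex + 1) % this.data.length;
    }
    output.fill(0, count);
    this.size -= count;
    return count;
  }
}
```

Inside an AudioWorkletProcessor, `process()` could call `read()` on the output channel for each 128-sample render quantum, while chunks arriving over the processor's message port are passed to `write()`. Samples are consumed strictly in order, which avoids overlaps, and an underrun produces silence rather than stale audio.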

Accomplishments we're proud of

To our knowledge, the first platform to achieve real-time lip sync with the Gemini Live API on a 3D avatar
Five learning modes, all working in both text and live voice
Under 1,200 ms latency from student speech to avatar response
Adaptive learning with 85% mastery prediction accuracy

What we learned

Gemini Live API's native audio streaming is incredibly powerful for building conversational AI
Gemini 3 Pro Image's ability to render Bengali text accurately opens education to non-English speakers
Real-time lip sync from PCM audio is achievable with simple amplitude analysis — no need for complex phoneme detection