Inspiration

Every student deserves a patient, always-available tutor — but private tutoring is expensive and inaccessible for millions. We asked: what if AI could replicate the experience of sitting next to a real tutor? Not a chatbot you type at, but one that sees your notebook and talks you through it — in your own language. When we saw the Gemini Live API's native audio capabilities, we knew we could build something that feels like a FaceTime call with the world's smartest tutor.

What it does

EduNova is a real-time, multimodal AI tutor powered by Gemini. Students can:

EduNova is a real-time, multimodal AI tutor powered by Gemini. Students can:

- Speak naturally and get spoken responses in real time (no text-to-speech pipeline — native audio via the Gemini Live API)
- Point their camera at homework or upload an image — the tutor sees the problem and talks through it
- Learn in 20+ languages — Hindi, Spanish, French, and more
- Interrupt anytime — just like a real conversation
- Get structured help — practice problems, concept explanations, study plans, and step-by-step walkthroughs via ADK agent tools

It covers Math, Physics, Chemistry, Biology, CS, Language Arts, and History.

How we built it

The architecture is a bidirectional streaming bridge:

- Frontend — Vanilla HTML/CSS/JS captures mic audio (PCM @ 16 kHz) and camera frames, and sends them over a WebSocket
- Backend — a FastAPI server on Python 3.12 manages WebSocket connections and bridges them to the Gemini Live API
- Voice — the Gemini 2.5 Flash Native Audio model handles real-time audio in and out via the Live API, with interrupt support
- Vision — since the native audio model doesn't accept images directly, we use a hybrid approach: images are analyzed by Gemini 2.5 Flash (vision), and the resulting description is injected into the live audio session as context
- Agent Tools — Google ADK provides structured tools (generate practice problems, explain concepts, create study plans) that the tutor can invoke mid-conversation
- User Storage — student profiles (name, grade, language) are persisted to Google Cloud Firestore, with automatic fallback to local JSON
- Deployment — Dockerized and deployed to Cloud Run via Terraform IaC and Cloud Build CI/CD

The key insight was the hybrid architecture: voice flows through the Live API's native audio pipeline for low-latency conversation, while vision goes through a separate Gemini 2.5 Flash call. The two are stitched together so seamlessly that the student just sees one tutor that can both hear and see.
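The bridge at the heart of the backend is just two concurrent pumps, one per direction. A minimal asyncio sketch of the idea (the read/write callables here stand in for the client WebSocket and the Gemini Live session; the real SDK surface differs):

```python
import asyncio

async def pump(read, write):
    # Forward chunks from one side to the other until the source
    # signals end-of-stream by returning None.
    while (chunk := await read()) is not None:
        await write(chunk)

async def bridge(client_read, client_write, gemini_read, gemini_write):
    # Two concurrent pumps: mic audio up to Gemini, synthesized
    # audio back down to the browser.
    up = asyncio.create_task(pump(client_read, gemini_write))
    down = asyncio.create_task(pump(gemini_read, client_write))
    # When either side closes, cancel the other so the session
    # tears down cleanly instead of leaking a half-open bridge.
    _, pending = await asyncio.wait({up, down}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
```

Because both directions live in independent tasks, neither side can stall the other, which is what makes barge-in interruptions possible.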

Challenges we ran into

- Native audio model can't process images — combining audio and vision required a hybrid pipeline: one model for ears, another for eyes, fused at the session level
- Audio format wrangling — bridging the browser's captured audio (Float32, 48 kHz) to Gemini's expected format (Int16, 16 kHz) required precise resampling
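A minimal sketch of that conversion with NumPy. It assumes the browser delivers raw Float32 samples in [-1, 1]; since 48 kHz to 16 kHz is an exact factor of 3, plain decimation works here, though a production pipeline would low-pass filter first to avoid aliasing:

```python
import numpy as np

def float32_48k_to_int16_16k(samples: np.ndarray) -> bytes:
    """Convert browser-captured Float32 @ 48 kHz PCM to Int16 @ 16 kHz bytes."""
    # 48000 / 16000 == 3: keep every third sample.
    # (A real pipeline would low-pass filter before decimating.)
    downsampled = samples[::3]
    # Clamp to the valid float range, then scale to the Int16 range.
    clamped = np.clip(downsampled, -1.0, 1.0)
    return (clamped * 32767).astype(np.int16).tobytes()
```

The resulting little-endian Int16 bytes are what gets streamed to the Live API session.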

- WebSocket lifecycle management — maintaining a bidirectional bridge between the client WebSocket and the Gemini Live API session, with proper cleanup on disconnect
- Graceful interruptions — when a student speaks mid-response, the tutor must stop, acknowledge, and pivot; coordinating audio buffer flushing with session state was tricky
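One workable pattern for the interruption problem (a hypothetical helper for illustration, not our exact code): queue the tutor's outgoing audio, and drop everything unplayed the moment the student starts speaking.

```python
import asyncio

class PlaybackQueue:
    """Outgoing tutor audio, flushable when the student interrupts."""

    def __init__(self) -> None:
        self._chunks: asyncio.Queue = asyncio.Queue()

    async def push(self, chunk: bytes) -> None:
        await self._chunks.put(chunk)

    async def next_chunk(self) -> bytes:
        return await self._chunks.get()

    def flush(self) -> int:
        # Drop all unplayed audio; return how many chunks were
        # discarded so the session can log the interruption.
        dropped = 0
        while not self._chunks.empty():
            self._chunks.get_nowait()
            dropped += 1
        return dropped
```

Flushing the queue instead of the whole session means the Live API connection stays open and the tutor can pivot immediately.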

Accomplishments that we're proud of

- A truly multimodal live agent — not just text-in/text-out, but real-time voice + vision
- The hybrid vision approach works seamlessly — students don't know two models are working behind the scenes
- 20+ language support with a single model — just select your language and the tutor switches
- Full IaC deployment — one terraform apply and it's live on Cloud Run

What we learned

- The Gemini Live API with native audio is remarkably natural — latency is low enough for real conversation
- Designing around model limitations (no image input on native audio) forces creative architecture decisions
- ADK agent tools add real structure to what would otherwise be freeform chat — practice problems and study plans feel like a real tutoring product
- Firestore's serverless model is perfect for hackathons — zero database ops
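The Firestore-with-local-JSON-fallback approach from the architecture can be sketched like this (a toy version: the firestore_client interface here is a stand-in, not the real google-cloud-firestore API):

```python
import json
from pathlib import Path

class ProfileStore:
    """Persist student profiles to Firestore, falling back to local JSON.

    `firestore_client` is any object exposing `save(doc_id, data)`;
    that name and interface are assumptions for illustration.
    """

    def __init__(self, firestore_client=None, fallback_path="profiles.json"):
        self._client = firestore_client
        self._path = Path(fallback_path)

    def save(self, student_id: str, profile: dict) -> str:
        if self._client is not None:
            try:
                self._client.save(student_id, profile)
                return "firestore"
            except Exception:
                pass  # Firestore unavailable: fall through to local JSON.
        profiles = json.loads(self._path.read_text()) if self._path.exists() else {}
        profiles[student_id] = profile
        self._path.write_text(json.dumps(profiles))
        return "local_json"
```

The fallback means the demo keeps working with zero cloud credentials, which is exactly the property you want at a hackathon.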

What's next for EduNova

- Real-time whiteboard — draw and solve math problems collaboratively
- Progress tracking — track mastery across sessions with Firestore persistence
- Curriculum alignment — map to Common Core / CBSE / ICSE standards
- Google OAuth — one-click login alongside username/password
- Multi-agent collaboration — specialized sub-agents for each subject

Built With

Python 3.12 · FastAPI · Gemini Live API (2.5 Flash Native Audio) · Gemini 2.5 Flash · Google ADK · Cloud Firestore · Docker · Terraform · Cloud Build · Cloud Run · Vanilla HTML/CSS/JS
