Inspiration

Every student learns math differently, but most don't have access to a patient, always-available tutor who can adapt to their pace. Traditional math help — textbooks, videos, even chatbots — feels static and one-directional. We wanted to build something that feels like sitting next to a real tutor: you talk, they listen, you interrupt when confused, they adjust. The Gemini Live API's native audio streaming and interruptibility made this possible for the first time — a tutor that truly converses about math in real time, rather than just generating text answers.

What it does

SolveWave is a live, voice-first math tutor that lets students have a natural conversation about math — just like sitting next to a real teacher. Students can:

• Ask math questions by voice and get instant spoken explanations (Gemini Live API with Kore voice)
• Interrupt the tutor mid-explanation to ask for clarification: just start talking
• Snap a photo of a math problem and get a step-by-step walkthrough
• See the tutor's spoken words appear as live streaming text with word-level highlighting
• Switch between Explain, Quiz, and Homework modes for different learning styles
• Type questions when voice isn't convenient

The tutor explains one step at a time, checks understanding, uses real-world analogies, and celebrates progress — just like the best math teacher you ever had.

How we built it

Frontend: Next.js 14 with TypeScript, Tailwind CSS, Framer Motion for animations, and KaTeX for math rendering. The UI uses an obsidian-dark theme with emerald accents.

Backend: FastAPI (Python) handles WebSocket sessions, bridging browser audio with the Gemini Live API. Audio flows as 16 kHz PCM upstream (microphone to Gemini) and 24 kHz PCM downstream (Gemini to speaker).
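To make the two sample rates concrete, here is a small sketch of the chunk-size arithmetic. It assumes 16-bit mono PCM and a 20 ms chunk duration; both are illustrative assumptions, and the helper names are hypothetical rather than taken from the SolveWave code.

```python
# Sketch only: assumes 16-bit (2-byte) mono PCM. The 20 ms chunk length
# used in the example below is an illustrative choice, not a project setting.
UPSTREAM_RATE_HZ = 16_000    # browser microphone -> Gemini
DOWNSTREAM_RATE_HZ = 24_000  # Gemini -> browser speaker


def pcm_chunk_bytes(sample_rate_hz: int, duration_ms: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Number of bytes needed to hold duration_ms of raw PCM audio."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bytes_per_sample * channels


def pcm_duration_ms(num_bytes: int, sample_rate_hz: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Playback duration in milliseconds of a raw PCM buffer."""
    samples = num_bytes / (bytes_per_sample * channels)
    return samples * 1000 / sample_rate_hz


if __name__ == "__main__":
    print(pcm_chunk_bytes(UPSTREAM_RATE_HZ, 20))    # 640
    print(pcm_chunk_bytes(DOWNSTREAM_RATE_HZ, 20))  # 960
```

Under these assumptions, the same 20 ms of audio is 640 bytes going up and 960 bytes coming down, which is why the two directions need separately sized buffers.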

Audio Pipeline: WebRTC (via aiortc) is attempted first for low-latency audio transport, with binary WebSocket frames as the fallback on Cloud Run. The browser captures audio with echo cancellation, noise suppression, and auto gain control enabled.

AI: Gemini 2.5 Flash (native audio) powers the Live API voice conversation, using the Kore voice. Gemini 2.5 Flash (text) handles typed questions and image analysis. The system prompt creates a warm, patient teacher persona.

Echo Prevention: While the tutor speaks, mic audio is not forwarded to Gemini. Energy-based voice activity detection keeps monitoring the mic for barge-in: when the student speaks loudly enough over the tutor, it triggers an interrupt.
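The energy-based barge-in check can be sketched roughly as follows. This is a minimal illustration, not the project's actual detector: the class name, the 0.05 threshold, and the three-frame requirement are all assumptions, and the input is assumed to be little-endian 16-bit mono PCM.

```python
import math
import struct


def rms_int16(frame: bytes) -> float:
    """RMS energy of little-endian 16-bit mono PCM, normalised to 0..1."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n) / 32768.0


class BargeInDetector:
    """Hypothetical detector: fires after N consecutive loud frames."""

    def __init__(self, threshold: float = 0.05, frames_required: int = 3):
        self.threshold = threshold              # assumed value, tune per device
        self.frames_required = frames_required  # assumed value
        self._streak = 0

    def feed(self, frame: bytes) -> bool:
        """Feed one mic frame; return True when a barge-in should trigger."""
        if rms_int16(frame) > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # a quiet frame resets the run
        if self._streak >= self.frames_required:
            self._streak = 0
            return True
        return False
```

Requiring several consecutive loud frames, rather than a single one, is what keeps short bursts such as a click or a cough from interrupting the tutor.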

Interruption: After an interrupt, the client discards pending tutor audio chunks and flushes playback immediately, while Gemini's native interruption detection handles the server-side turn management.
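One way to implement the discard-and-flush behaviour is to tag each audio chunk with a turn counter, so that chunks arriving late from an interrupted turn are dropped. The sketch below is written in Python for consistency with the backend, although in SolveWave this logic runs in the browser; the `PlaybackQueue` class and the turn-tagging scheme are assumptions for illustration.

```python
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Hypothetical playback buffer that drops audio from interrupted turns."""

    def __init__(self) -> None:
        self._turn = 0
        self._chunks = deque()

    @property
    def turn(self) -> int:
        """Identifier of the tutor turn currently allowed to play."""
        return self._turn

    def enqueue(self, turn: int, chunk: bytes) -> bool:
        """Queue a chunk; reject it if it belongs to any other turn."""
        if turn != self._turn:
            return False  # stale audio from an interrupted turn
        self._chunks.append(chunk)
        return True

    def interrupt(self) -> int:
        """Flush queued audio and start a new turn; return the new turn id."""
        self._chunks.clear()
        self._turn += 1
        return self._turn

    def next_chunk(self) -> Optional[bytes]:
        """Pop the next chunk to play, or None if the queue is empty."""
        return self._chunks.popleft() if self._chunks else None
```

Without some marker like the turn id, a chunk that was already in flight when the student interrupted would be indistinguishable from the start of the tutor's next reply.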

Deployment: Google Cloud Run (us-central1) with multi-stage Docker builds for both frontend and backend services.

Challenges we ran into

  1. Echo loops: The tutor's voice, played through the speakers, gets picked up by the mic and fed back to Gemini, creating an infinite loop. We solved this by muting the mic during tutor speech and adding energy-based barge-in detection.

  2. Barge-in while the mic is muted: The competition requires natural interruption, but muting the mic kills barge-in. Our solution monitors mic energy even while muted: when the student speaks loudly (RMS > threshold for consecutive frames), it triggers an interrupt and unmutes.

  3. WebRTC on Cloud Run: Cloud Run doesn't support peer-to-peer WebRTC without TURN servers, so we implemented WebSocket binary fallback for audio transport.

  4. Thinking text leaking: Gemini's internal reasoning ("I'll break this down...") was being shown to users. We fixed this by using output_audio_transcription (the actual spoken words) instead of model_turn text parts.

  5. Response length for voice: Text-optimized responses are too long for voice. We tuned the system prompt to enforce "one step at a time, 2-3 sentences max, then check understanding."
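The fix for the thinking-text leak in item 4 can be sketched as follows. The config keys and message field names (`output_audio_transcription`, `server_content.output_transcription.text`, `server_content.model_turn`) reflect our understanding of the google-genai Python SDK and should be checked against the installed version; the helper `spoken_text` is hypothetical, and the message here is a stand-in object rather than a real SDK response.

```python
from types import SimpleNamespace

# Assumed Live API session config as a plain dict. Key names are meant to
# mirror the SDK's LiveConnectConfig; verify them before relying on this.
LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],
    "output_audio_transcription": {},  # request a transcript of spoken audio
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Kore"}}
    },
}


def spoken_text(message) -> str:
    """Return only the transcript of what was spoken aloud.

    Text parts under model_turn are ignored on purpose, since that is
    where internal reasoning showed up in our case.
    """
    content = getattr(message, "server_content", None)
    transcription = getattr(content, "output_transcription", None)
    return getattr(transcription, "text", None) or ""


# Stand-in message with both a spoken transcript and a reasoning text part.
example = SimpleNamespace(
    server_content=SimpleNamespace(
        output_transcription=SimpleNamespace(text="First, isolate x."),
        model_turn=SimpleNamespace(
            parts=[SimpleNamespace(text="I'll break this down...")]
        ),
    )
)
print(spoken_text(example))
```

Reading the transcript field alone means the on-screen text always matches the audio the student hears.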

Accomplishments that we're proud of

• Natural voice conversation with a math tutor that feels like talking to a real teacher
• Working barge-in/interruption without killing echo prevention
• Live streaming transcript with word-level highlighting synced to speech
• Camera-to-solution: snap a math problem and get spoken step-by-step help
• Sub-second voice response latency via Gemini Live API
• Clean, production-quality UI with smooth animations and state visualization
• Full deployment on Google Cloud Run with zero infrastructure management

What we learned

• Gemini Live API's native audio is incredibly powerful for building conversational agents: the voice quality and responsiveness rival commercial products
• Echo cancellation in web browsers is harder than expected: WebRTC's AEC works well with standard media elements but not with AudioContext-based playback
• Voice UX is fundamentally different from text UX: responses must be short, conversational, and interruptible
• The output_audio_transcription feature is essential for showing what the tutor says without exposing internal reasoning
• Building a real-time audio pipeline with proper interrupt handling requires careful state management across client and server

What's next for SolveWave

• Multi-subject expansion beyond math (physics, chemistry, biology)
• TURN server integration for reliable WebRTC on Cloud Run
• Session history and progress tracking across sessions
• Adaptive difficulty based on student performance
• Support for Arabic-language tutoring (bilingual mode)
• Collaborative whiteboard for drawing and annotating problems
• Mobile app with offline problem scanning
