SolveWave

[Image gallery: logo, architecture diagram (architecture.png), features, homework feature, explain feature]

Inspiration

Every student learns math differently, but most don't have access to a patient, always-available tutor who can adapt to their pace. Traditional math help — textbooks, videos, even chatbots — feels static and one-directional. We wanted to build something that feels like sitting next to a real tutor: you talk, they listen, you interrupt when confused, they adjust. The Gemini Live API's native audio streaming and interruptibility made this possible for the first time — a tutor that truly converses about math in real time, rather than just generating text answers.

What it does

SolveWave is a live, voice-first math tutor that lets students have a natural conversation about math — just like sitting next to a real teacher. Students can:

• Ask math questions by voice and get instant spoken explanations (Gemini Live API with Kore voice)
• Interrupt the tutor mid-explanation to ask for clarification: just start talking
• Snap a photo of a math problem and get a step-by-step walkthrough
• See the tutor's spoken words appear as live streaming text with word-level highlighting
• Switch between Explain, Quiz, and Homework modes for different learning styles
• Type questions when voice isn't convenient

The tutor explains one step at a time, checks understanding, uses real-world analogies, and celebrates progress — just like the best math teacher you ever had.

How we built it

Frontend: Next.js 14 with TypeScript, Tailwind CSS, Framer Motion for animations, and KaTeX for math rendering. The UI uses an obsidian-dark theme with emerald accents.

Backend: FastAPI (Python) handles WebSocket sessions, bridging browser audio with the Gemini Live API. Audio flows as 16kHz PCM upstream and 24kHz PCM downstream.
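
To make those sample rates concrete, here is a minimal sketch of the byte arithmetic behind each audio frame. The two sample rates come from the description above; the 16-bit mono sample format and the 20 ms frame length are assumptions for illustration, not confirmed details of SolveWave.

```python
UPSTREAM_RATE_HZ = 16_000    # mic audio sent toward Gemini (from the text above)
DOWNSTREAM_RATE_HZ = 24_000  # tutor audio received from Gemini (from the text above)
BYTES_PER_SAMPLE = 2         # assumption: 16-bit signed PCM, mono


def frame_bytes(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of bytes in one frame of mono 16-bit PCM."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE


def duration_ms(num_bytes: int, sample_rate_hz: int) -> float:
    """Playback duration in milliseconds of a raw mono 16-bit PCM buffer."""
    return num_bytes / BYTES_PER_SAMPLE / sample_rate_hz * 1000


if __name__ == "__main__":
    print(frame_bytes(UPSTREAM_RATE_HZ, 20))    # bytes per 20 ms mic frame
    print(frame_bytes(DOWNSTREAM_RATE_HZ, 20))  # bytes per 20 ms tutor frame
```

Under these assumptions a downstream frame is 1.5 times the size of an upstream frame of the same duration, which matters when sizing WebSocket messages and playback buffers.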

Audio Pipeline: WebRTC (via aiortc) is attempted first for low-latency audio transport, with WebSocket binary as fallback on Cloud Run. The browser captures audio with echo cancellation, noise suppression, and auto gain control enabled.
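
The "try WebRTC first, fall back to WebSocket" ordering can be sketched as a generic loop over candidate transports. This is an illustrative sketch only: `connect_with_fallback`, `fake_webrtc`, and `fake_websocket` are hypothetical names, and the fake connectors stand in for the real aiortc negotiation and WebSocket handshake.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

Connector = Callable[[], Awaitable[None]]


async def connect_with_fallback(
    transports: Sequence[Tuple[str, Connector]], timeout_s: float = 3.0
) -> str:
    """Try each transport in order; return the name of the first that connects."""
    errors: List[str] = []
    for name, connect in transports:
        try:
            await asyncio.wait_for(connect(), timeout=timeout_s)
            return name
        except (asyncio.TimeoutError, OSError) as exc:
            # ConnectionError is a subclass of OSError, so it is caught here too.
            errors.append(f"{name}: {exc!r}")
    raise ConnectionError("all transports failed: " + "; ".join(errors))


async def fake_webrtc() -> None:
    # Hypothetical stand-in for a WebRTC negotiation that cannot complete.
    raise ConnectionError("ICE failed: no TURN server")


async def fake_websocket() -> None:
    # Hypothetical stand-in for a successful WebSocket handshake.
    return None


if __name__ == "__main__":
    chosen = asyncio.run(
        connect_with_fallback([("webrtc", fake_webrtc), ("websocket", fake_websocket)])
    )
    print(chosen)
```

Bounding each attempt with a timeout keeps a stalled negotiation from delaying the fallback indefinitely.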

AI: Gemini 2.5 Flash (native audio) powers the Live API voice conversation with Kore voice. Gemini 2.5 Flash (text) handles typed questions and image analysis. The system prompt creates a warm, patient teacher persona.
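
As a rough illustration of how the voice and persona might be configured, the sketch below builds a Live API session config as a plain dict. The field names follow the google-genai Python SDK's live connect config as we understand it and should be checked against the current SDK documentation; `build_live_config` and the sample prompt are hypothetical, and the real SolveWave prompt is not shown here.

```python
def build_live_config(system_prompt: str, voice_name: str = "Kore") -> dict:
    """Build a Live API session config as a plain dict.

    Assumption: these keys mirror the google-genai SDK's live connect config.
    """
    return {
        "response_modalities": ["AUDIO"],
        "system_instruction": system_prompt,
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}}
        },
    }


if __name__ == "__main__":
    # Placeholder persona text, not the actual SolveWave system prompt.
    config = build_live_config("You are a warm, patient math teacher.")
    print(config["speech_config"])
```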

Echo Prevention: Mic audio is muted to Gemini while the tutor speaks. Energy-based voice activity detection monitors for barge-in — when the student speaks loudly enough over the tutor, it triggers an interrupt.
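
The energy-based barge-in check described above can be sketched as follows. This is a minimal illustration, assuming 16-bit PCM frames in native byte order; the threshold of 1500 and the run length of 3 frames are made-up values, and `BargeInDetector` is a hypothetical name rather than the project's actual class.

```python
import array
import math


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit PCM frame (native byte order)."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class BargeInDetector:
    """Fires once when `min_frames` consecutive frames exceed `threshold`."""

    def __init__(self, threshold: float = 1500.0, min_frames: int = 3) -> None:
        self.threshold = threshold    # assumption: tuned by hand per device
        self.min_frames = min_frames  # assumption: a short run rejects clicks
        self._loud_run = 0

    def feed(self, frame: bytes) -> bool:
        """Return True when this frame completes a long enough loud run."""
        if rms(frame) > self.threshold:
            self._loud_run += 1
        else:
            self._loud_run = 0  # any quiet frame breaks the run
        if self._loud_run >= self.min_frames:
            self._loud_run = 0  # reset so a single utterance fires only once per run
            return True
        return False
```

Requiring several consecutive loud frames, rather than a single one, is what keeps a brief spike (or a burst of the tutor's own echo) from triggering an interrupt.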

Interruption: Audio chunks are discarded client-side after interrupt, playback is flushed immediately, and Gemini's native interruption detection handles the server-side turn management.
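
One way to picture the client-side half of this is a playback buffer that flushes on interrupt and then drops any chunks that were already in flight. The sketch below is an assumption-laden illustration: `PlaybackBuffer` and its methods are hypothetical, and the real client presumably does this in the browser rather than in Python.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Queue of tutor audio chunks that can be flushed on interrupt."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()
        self._dropping = False

    def push(self, chunk: bytes) -> bool:
        """Queue a chunk for playback; return False if it was discarded."""
        if self._dropping:
            return False  # late audio from the interrupted turn
        self._queue.append(chunk)
        return True

    def pop(self) -> Optional[bytes]:
        """Next chunk to play, or None if nothing is queued."""
        return self._queue.popleft() if self._queue else None

    def interrupt(self) -> int:
        """Flush queued audio and start discarding; return the flushed count."""
        flushed = len(self._queue)
        self._queue.clear()
        self._dropping = True
        return flushed

    def start_turn(self) -> None:
        """Accept audio again once the tutor begins a new response."""
        self._dropping = False
```

The `_dropping` flag covers the gap between the local interrupt and the moment the server stops sending audio for the old turn.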

Deployment: Google Cloud Run (us-central1) with multi-stage Docker builds for both frontend and backend services.

Challenges we ran into

Echo loops: The tutor's voice played through speakers gets picked up by the mic, creating an infinite loop. We solved this with mic muting during speech + energy-based barge-in detection.
Barge-in with a muted mic: The competition requires natural interruption, but muting the mic kills barge-in. Our solution keeps monitoring mic energy even while mic audio to Gemini is muted: when the student speaks loudly (RMS above a threshold for several consecutive frames), it triggers an interrupt and unmutes.
WebRTC on Cloud Run: Cloud Run doesn't support peer-to-peer WebRTC without TURN servers, so we implemented WebSocket binary fallback for audio transport.
Thinking text leaking: Gemini's internal reasoning ("I'll break this down...") was being shown to users. We fixed this by using output_audio_transcription (the actual spoken words) instead of model_turn text parts.
Response length for voice: Text-optimized responses are too long for voice. We tuned the system prompt to enforce "one step at a time, 2-3 sentences max, then check understanding."
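
The fix for thinking text leaking can be sketched as a filter that reads only the audio transcription and ignores text parts of the model turn. To stay self-contained, this sketch works on a simplified dict-shaped message; the real SDK returns objects, and the key names here (`server_content`, `output_transcription`, `model_turn`) are assumed to mirror its fields. `display_text` is a hypothetical helper.

```python
from typing import Any, Mapping, Optional


def display_text(message: Mapping[str, Any]) -> Optional[str]:
    """Return text that is safe to show the student, or None if there is none."""
    content = message.get("server_content") or {}
    transcription = content.get("output_transcription") or {}
    # Deliberately ignore content["model_turn"]["parts"]: text parts there may
    # carry internal reasoning rather than the words actually spoken.
    return transcription.get("text") or None


if __name__ == "__main__":
    thinking = {
        "server_content": {"model_turn": {"parts": [{"text": "I'll break this down..."}]}}
    }
    spoken = {
        "server_content": {"output_transcription": {"text": "Let's start with step one."}}
    }
    print(display_text(thinking))
    print(display_text(spoken))
```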

Accomplishments that we're proud of

• Natural voice conversation with a math tutor that feels like talking to a real teacher
• Working barge-in/interruption without killing echo prevention
• Live streaming transcript with word-level highlighting synced to speech
• Camera-to-solution: snap a math problem and get spoken step-by-step help
• Sub-second voice response latency via Gemini Live API
• Clean, production-quality UI with smooth animations and state visualization
• Full deployment on Google Cloud Run with zero infrastructure management

What we learned

• Gemini Live API's native audio is incredibly powerful for building conversational agents: the voice quality and responsiveness rival commercial products
• Echo cancellation in web browsers is harder than expected: WebRTC's AEC works well with standard media elements but not with AudioContext-based playback
• Voice UX is fundamentally different from text UX: responses must be short, conversational, and interruptible
• The output_audio_transcription feature is essential for showing what the tutor says without exposing internal reasoning
• Building a real-time audio pipeline with proper interrupt handling requires careful state management across client and server

What's next for SolveWave

• Multi-subject expansion beyond math (physics, chemistry, biology)
• TURN server integration for reliable WebRTC on Cloud Run
• Session history and progress tracking across sessions
• Adaptive difficulty based on student performance
• Support for Arabic-language tutoring (bilingual mode)
• Collaborative whiteboard for drawing and annotating problems
• Mobile app with offline problem scanning

Built With

fastapi
framer-motion
gemini-2.5-flash
gemini-live-api
google-cloud-run
next.js
python
tailwind-css
typescript
webrtc
websockets

Submitted to

Gemini Live Agent Challenge

Created by

I built SolveWave end-to-end as a solo project — fullstack architecture, UI design, and deployment. On the backend, I integrated the Gemini Live API for real-time native audio streaming, implemented WebRTC audio transport via aiortc (with WebSocket binary fallback), and wired up the tutoring agent with tools for hints, answer checking, and session recap. On the frontend, I built the Next.js UI with a floating glass-morphism composer, LaTeX transcript rendering, interruptible voice sessions, and camera-based math problem input. I deployed everything to Google Cloud Run with a CI-ready build pipeline. The biggest challenge was getting full-duplex audio working reliably — managing echo cancellation, interrupt handling, and keeping transcript state in sync with Gemini's audio stream.

Mohamed Ghareeb
Chief Technologist building scalable mobile and AI platforms from whiteboard ideas to production systems used across MENA.

Updates

Mohamed Ghareeb started this project — Mar 16, 2026 07:59 PM EDT
