Inspiration
Education rarely meets students where they are: in the flow of thinking, sketching, and talking through problems. We were inspired by the way the best tutors work: not by lecturing, but by listening, watching what you draw, and responding in the moment. We wanted to build something that felt like having a brilliant, patient tutor sitting next to you at all times, one that could see your whiteboard, hear your questions, and talk back in real time.
What it does
Aris is a multimodal AI learning tutor that combines voice, vision, and an interactive whiteboard into a single seamless experience. You can:
- Speak naturally to ask questions, and Aris responds with spoken audio in real time
- Draw and diagram on an Excalidraw-powered whiteboard; Aris can see what you're sketching
- Share your screen so Aris can follow along with whatever you're working on
The result is a tutoring session that feels alive: you think out loud, sketch your understanding, and Aris responds like a real conversational partner.
How we built it
Aris is built on a two-tier architecture.

Frontend (React + Vite)

- Built with React, styled with Tailwind CSS, and powered by the Excalidraw library for the interactive whiteboard
- A custom AudioRecorder class uses the Web Audio API and AudioWorkletProcessor to capture microphone input at 16kHz and stream it as PCM16 base64 chunks
- An AudioStreamer class handles real-time playback of AI-generated audio, queuing PCM16 chunks and playing them back smoothly at 24kHz
- A MediaHandler captures screen-share frames and sends them as JPEG images
- All communication flows through a GeminiAPI WebSocket client that multiplexes audio, image, text, and control messages
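The PCM16 wire format above is simple enough to sketch. The snippet below is a Python illustration (the real capture path runs in the browser's AudioWorkletProcessor, in JavaScript); `encode_pcm16_chunk` is a hypothetical helper name, not project code:

```python
import base64
import struct

def encode_pcm16_chunk(samples):
    """Encode float samples in [-1, 1] as little-endian PCM16, then base64.

    Mirrors, for illustration only, the shape of the chunks the
    AudioWorkletProcessor sends over the WebSocket.
    """
    # Clamp and scale each float sample to a signed 16-bit integer
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    pcm = struct.pack("<%dh" % len(ints), *ints)  # 2 bytes per sample
    return base64.b64encode(pcm).decode("ascii")

# A 2048-sample chunk at 16 kHz covers 128 ms of audio (4096 raw bytes).
chunk = encode_pcm16_chunk([0.0] * 2048)
```

Base64 inflates the payload by about a third, but keeps the audio chunks safe to multiplex alongside JSON text and control messages on one WebSocket.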
Backend (Python + FastAPI)
- A FastAPI server accepts WebSocket connections and proxies them to Gemini 2.0 Flash via the Gemini Multimodal Live API (supporting both Vertex AI and the developer API)
- Bidirectional message handling uses asyncio.TaskGroup to concurrently process incoming client messages and outgoing Gemini responses
- Function calling is handled via a tool queue: Gemini can invoke cloud functions (e.g., get_weather) and the results are fed back into the session in real time
- Secrets (API keys) are managed via Google Cloud Secret Manager
Challenges we ran into
- Audio synchronization: Streaming PCM16 audio in real time required careful buffer management on both ends. Getting the AudioWorkletProcessor to chunk and flush data at the right cadence (every 2048 samples, ~8x/sec at 16kHz) without introducing latency or gaps took significant iteration.
- Interruption handling: When a user starts speaking while Aris is mid-response, the system needs to immediately stop playback, cancel the current audio stream, and signal Gemini. Coordinating this gracefully across the WebSocket boundary was tricky.
- Concurrent message streams: Gemini responses can interleave audio, text, tool calls, and interruption signals. Using asyncio.TaskGroup and a dedicated tool queue let us process these concurrently without race conditions.
- AudioContext lifecycle: Browser restrictions on AudioContext creation (requiring a user gesture) and the shared worklet registry pattern needed careful handling to avoid duplicate registrations across re-renders.
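The cadence arithmetic behind the 2048-sample chunking works out as follows (a quick sanity check, not project code):

```python
SAMPLE_RATE = 16_000   # Hz, microphone capture rate
CHUNK_SAMPLES = 2_048  # samples buffered before each flush

# Each flushed chunk covers 2048 / 16000 seconds of audio
chunk_ms = CHUNK_SAMPLES / SAMPLE_RATE * 1000  # 128.0 ms per chunk

# ...which means the worklet flushes just under 8 times per second
chunks_per_sec = SAMPLE_RATE / CHUNK_SAMPLES   # 7.8125, i.e. ~8x/sec
```

That 128 ms chunk size is the latency/overhead trade-off: smaller chunks cut perceived delay but multiply per-message WebSocket overhead.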
Accomplishments that we're proud of
- True real-time voice: The audio pipeline from mic to Gemini and back is low-latency enough to feel like a genuine conversation, not a walkie-talkie
- Multimodal context: Aris can simultaneously understand what you're saying and what you're drawing; the whiteboard and voice inputs are unified in a single session
- Clean interruption model: You can cut Aris off mid-sentence naturally, just like a real conversation, and the system recovers gracefully
What we learned
- The Gemini Multimodal Live API is remarkably expressive: handling audio, images, text, and function calls through a single session abstraction simplified what could have been a very complex integration
- Real-time audio in the browser is deceptively complex. The Web Audio API's worklet model is powerful but requires careful attention to sample rates, buffer sizes, and thread boundaries
- Designing for interruption from the start (rather than bolting it on) paid dividends: it forced a cleaner separation between playback state and session state throughout the codebase
What's next for Learning Tutor
- Persistent sessions: Save and resume tutoring sessions with full whiteboard state and conversation history
- Subject-specific modes: Specialized system prompts and tools for math (step-by-step equation solving), science (real-time simulations), and coding (live code review)
- Student progress tracking: Identify concepts a student struggles with across sessions and adapt explanations accordingly
- Collaborative mode: Multiple students in the same whiteboard session, with Aris mediating the discussion
- Voice personas: Let students choose a tutor voice and personality that resonates with their learning style