AI Tutoring Whiteboard: Building a Multimodal Learning Platform
Inspiration
The idea came from observing how students struggle with traditional online tutoring tools that separate visual work from conversation. Math students often need to show their work, draw diagrams, and receive real-time feedback—but most tutoring platforms force them to choose between screen sharing, uploading images, or just describing problems verbally.
We wanted to create a seamless experience where students could naturally write on a whiteboard, ask questions with their voice or text, and receive intelligent tutoring that actually “sees” and understands their work.
Our vision was simple: a smart whiteboard that listens, sees, and teaches.
What We Built
An AI-powered tutoring application that combines:
- 🖊️ Interactive whiteboard with full drawing capabilities
- 🎤 Voice input for asking questions naturally
- 👁️ Computer vision to analyze student work on the canvas
- 🧠 AI tutoring with customizable teaching styles (from hands-off to highly supportive)
- 🔊 Text-to-speech for natural voice responses
The app allows students to draw problems, ask questions verbally, and receive guidance from an AI tutor that can actually see and analyze their whiteboard work—just like a human tutor looking over their shoulder.
Technical Stack
Frontend: React + TypeScript (Vite) with Tailwind CSS for styling
AI Services:
- Google Gemini 2.0 Flash for natural language tutoring and image analysis
- ElevenLabs for natural-sounding text-to-speech
- Runware for AI-generated visual aids
Browser APIs:
- Web Speech API for voice input
- Canvas API for drawing
The Build Process
Phase 1: Foundation
We started with user authentication through Bolt Database, building a clean landing page and settings interface.
We implemented the core whiteboard using HTML5 Canvas with drawing tools (pencil, eraser, colors, line widths).
This part was straightforward—standard canvas event handlers for mouse interactions.
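The logic behind those handlers can be sketched as two small pieces: a transform from screen coordinates to canvas coordinates, and a stroke recorder that the mousedown/mousemove/mouseup events drive. (The names below are illustrative, not the app's actual identifiers.)

```typescript
interface Point { x: number; y: number }

// Map a pointer event's client coordinates into canvas-space coordinates,
// compensating for the element's on-screen position and CSS scaling.
function toCanvasPoint(
  clientX: number,
  clientY: number,
  rect: { left: number; top: number; width: number; height: number },
  canvasWidth: number,
  canvasHeight: number
): Point {
  return {
    x: (clientX - rect.left) * (canvasWidth / rect.width),
    y: (clientY - rect.top) * (canvasHeight / rect.height),
  };
}

// The event wiring reduces to this state machine: mousedown starts a stroke,
// mousemove extends it while drawing, mouseup/mouseleave ends it.
class StrokeRecorder {
  private drawing = false;
  readonly strokes: Point[][] = [];

  start(p: Point) { this.drawing = true; this.strokes.push([p]); }
  move(p: Point) { if (this.drawing) this.strokes[this.strokes.length - 1].push(p); }
  end() { this.drawing = false; }
}
```

In the browser, each recorded point is also passed to `ctx.lineTo()`/`ctx.stroke()` to render immediately.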
Phase 2: Voice Integration
We integrated Web Speech API for voice recognition and ElevenLabs for text-to-speech.
We also created a toggleable interface where students can switch between voice and text modes.
Main challenge: managing state—ensuring the microphone stops when processing, avoiding feedback loops, and handling browser permission requests gracefully.
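The mic-handling logic boils down to a small state machine. This is our own simplification (not the app's exact code), but it captures the key guard: the microphone must be off while the AI is responding, or it picks up the tutor's own spoken reply and loops.

```typescript
type MicState = 'idle' | 'listening' | 'processing';
type MicEvent = 'toggle' | 'result' | 'responseDone';

// Pure transition function modeling the mic lifecycle:
// idle -> listening (user taps mic), listening -> processing (speech result
// received, mic stopped), processing -> idle (AI finished speaking).
function nextMicState(state: MicState, event: MicEvent): MicState {
  switch (state) {
    case 'idle':
      return event === 'toggle' ? 'listening' : 'idle';
    case 'listening':
      if (event === 'result') return 'processing'; // stop mic before the AI replies
      if (event === 'toggle') return 'idle';       // user cancels
      return 'listening';
    case 'processing':
      // Ignore all input mid-response to avoid feedback loops.
      return event === 'responseDone' ? 'idle' : 'processing';
  }
}
```

In the browser, the transitions are driven by `SpeechRecognition`'s `onresult` callback and the text-to-speech playback finishing.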
Phase 3: AI Tutoring System
We built an AITutorService class wrapping the Google Gemini API.
We implemented a "pushiness level" system (1–5 scale) that adjusts the AI’s teaching style via system prompts:
| Level | Description |
|---|---|
| 1 | Minimal intervention, lets students struggle productively |
| 5 | Highly supportive with step-by-step guidance |
The AI maintains conversation history for context-aware responses and speaks naturally for voice output.
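The mapping from pushiness level to system prompt looks roughly like this (the prompt text is illustrative; the actual wording in AITutorService differs):

```typescript
// Illustrative pushiness-to-prompt mapping (level 1 = hands-off, 5 = supportive).
const PUSHINESS_PROMPTS: Record<number, string> = {
  1: 'Intervene minimally. Let the student struggle productively; only answer direct questions.',
  2: 'Offer occasional nudges, but let the student lead.',
  3: 'Balance hints and questions; point out errors without solving them.',
  4: 'Guide actively with leading questions and partial steps.',
  5: 'Be highly supportive: walk the student through the problem step by step.',
};

function buildSystemPrompt(pushiness: number): string {
  // Clamp to the valid 1-5 range so bad settings values cannot break the prompt.
  const level = Math.min(5, Math.max(1, Math.round(pushiness)));
  return `You are a patient tutor looking at a student's whiteboard. ` +
    `${PUSHINESS_PROMPTS[level]} Keep replies short enough to read aloud.`;
}
```

The final instruction matters for voice output: without it, the model produces answers too long for comfortable text-to-speech playback.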
Major Challenges
Challenge 1: Sending Whiteboard Data to Gemini
The Problem:
Getting the whiteboard content into a format Gemini could analyze was significantly harder than expected.
The whiteboard has multiple layers:
- A background image (optional uploaded problem)
- A transparent drawing overlay
- Various coordinate systems due to responsive sizing
The Solution:
We created a composite canvas approach in the captureScreenshot() method:
```typescript
// Create a temporary canvas matching the drawing layer's pixel size
const compositeCanvas = document.createElement('canvas');
compositeCanvas.width = canvas.width;
compositeCanvas.height = canvas.height;
const ctx = compositeCanvas.getContext('2d')!;

// Fill it with a white background (the drawing layer is transparent)
ctx.fillStyle = 'white';
ctx.fillRect(0, 0, compositeCanvas.width, compositeCanvas.height);

// Draw the background image if present
if (backgroundImg) {
  ctx.drawImage(backgroundImg, 0, 0, compositeCanvas.width, compositeCanvas.height);
}

// Draw the student's drawing overlay on top
ctx.drawImage(canvas, 0, 0, compositeCanvas.width, compositeCanvas.height);

// Convert to a base64-encoded PNG
return compositeCanvas.toDataURL('image/png');
```
The tricky part was synchronizing dimensions across device pixel ratios and ensuring the image wasn’t corrupted.
We had to account for devicePixelRatio scaling and proper canvas sizing using getBoundingClientRect().
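The sizing rule is simple but easy to get wrong: the canvas backing store is scaled by `devicePixelRatio` while its CSS size stays at the layout size from `getBoundingClientRect()`. A minimal sketch of that calculation (an assumed helper, not our exact code):

```typescript
interface CanvasSize {
  cssWidth: number;   // applied to style.width, in CSS pixels
  cssHeight: number;  // applied to style.height
  pixelWidth: number; // applied to canvas.width, in device pixels
  pixelHeight: number;
}

// Compute backing-store and CSS sizes for a crisp canvas on high-DPI screens.
function sizeForDisplay(rectWidth: number, rectHeight: number, dpr: number): CanvasSize {
  return {
    cssWidth: rectWidth,
    cssHeight: rectHeight,
    pixelWidth: Math.round(rectWidth * dpr),
    pixelHeight: Math.round(rectHeight * dpr),
  };
}
```

After resizing, the 2D context is scaled once with `ctx.scale(dpr, dpr)` so drawing code can keep working in CSS-pixel coordinates.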
Challenge 2: Orchestrating Multiple AI Services
The Problem:
The app integrates three separate AI APIs (Gemini, ElevenLabs, Runware), each with unique:
- Authentication methods
- Request/response formats
- Rate limits and timing requirements
- Error handling patterns
Key challenges:
- Service initialization timing (missing keys, invalid credentials)
- Data flow coordination (user question → Gemini → ElevenLabs → Runware)
- Error cascading (one service failure breaking the flow)
- Race conditions (user asking new questions mid-response)
The Solution:
We created dedicated service classes—GeminiService, ElevenLabsService, and RunwareService—with graceful degradation:
```typescript
// Conditional initialization with fallback warnings
if (elevenLabsKey) {
  elevenLabsRef.current = new ElevenLabsService(elevenLabsKey);
} else {
  console.warn('Voice output disabled - check API key');
}
```
We also implemented a state machine in WhiteboardPage.tsx:
- `isProcessing` prevents overlapping requests
- `statusMessage` provides real-time feedback
- `try/catch` blocks isolate service errors
- Services fail independently
Example sequence:
```typescript
const response = await aiTutorRef.current.getResponse(message, imageUrl);
setAiResponse(response); // Store for text display
if (isSpeaking && elevenLabsRef.current) {
  await elevenLabsRef.current.speak(response, voiceId);
} else {
  // Graceful degradation: show text instead
}
```
The hardest part was debugging the image analysis pipeline—ensuring the base64 image data, MIME type, and Gemini input structure were all correct.
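The fix came down to converting the canvas data URL into Gemini's `inlineData` part format. A sketch of that conversion (an assumed helper; the surrounding request assembly is omitted):

```typescript
// toDataURL() yields "data:image/png;base64,...."; Gemini's inlineData part
// wants the raw base64 payload plus an explicit MIME type, so the data URL
// prefix must be stripped first.
function toGeminiImagePart(dataUrl: string) {
  const match = dataUrl.match(/^data:(image\/\w+);base64,(.*)$/);
  if (!match) throw new Error('Expected a base64 image data URL');
  return { inlineData: { mimeType: match[1], data: match[2] } };
}

// Sent alongside the text prompt, roughly as:
// contents: [{ role: 'user', parts: [{ text: question }, toGeminiImagePart(screenshot)] }]
```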
What We Learned
- 🎨 Canvas API mastery: composite operations, pixel ratios, and efficient drawing
- 🤖 Multimodal AI integration: combining vision, language, and speech models
- ⚙️ Graceful degradation: designing resilient systems
- 🔁 Real-time state management: coordinating async workflows
- 🧩 Secure RLS policies: user data isolation in Bolt Database
- 🧠 Prompt engineering: designing natural and pedagogically sound tutor responses
Future Improvements
- 📝 Persistent canvas history — save and reload sessions
- 👥 Real-time collaboration — multiple students drawing together
- ✍️ Handwriting recognition — convert handwritten math to LaTeX
- 📊 Progress tracking — analyze performance trends with ML
- 📱 Mobile support — touch drawing and responsive layouts
Takeaway:
Building truly useful AI applications isn’t just about calling APIs—it’s about orchestrating multiple systems into one seamless, human-centered experience.
Built With
- canvas-api-for-drawing
- elevenlabs-api
- frontend:-react
- google-gemini-api
- runware-api
- tailwind
- typescript
- vite
- web-speech-api