AI Tutoring Whiteboard: Building a Multimodal Learning Platform
Inspiration
The idea came from observing how students struggle with traditional online tutoring tools that separate visual work from conversation. Math students often need to show their work, draw diagrams, and receive real-time feedback—but most tutoring platforms force them to choose between screen sharing, uploading images, or just describing problems verbally.
We wanted to create a seamless experience where students could naturally write on a whiteboard, ask questions with their voice or text, and receive intelligent tutoring that actually “sees” and understands their work.
Our vision was simple: a smart whiteboard that listens, sees, and teaches.
What We Built
An AI-powered tutoring application that combines:
- 🖊️ Interactive whiteboard with full drawing capabilities
- 🎤 Voice input for asking questions naturally
- 👁️ Computer vision to analyze student work on the canvas
- 🧠 AI tutoring with customizable teaching styles (from hands-off to highly supportive)
- 🔊 Text-to-speech for natural voice responses
The app allows students to draw problems, ask questions verbally, and receive guidance from an AI tutor that can actually see and analyze their whiteboard work—just like a human tutor looking over their shoulder.
Technical Stack
Frontend: React + TypeScript (Vite) with Tailwind CSS for styling
AI Services:
- Google Gemini 2.0 Flash for natural language tutoring and image analysis
- ElevenLabs for natural-sounding text-to-speech
- Runware for AI-generated visual aids
Browser APIs:
- Web Speech API for voice input
- Canvas API for drawing
The Build Process
Phase 1: Foundation
We started with user authentication through Bolt Database, building a clean landing page and settings interface.
We implemented the core whiteboard using HTML5 Canvas with drawing tools (pencil, eraser, colors, line widths).
This part was straightforward—standard canvas event handlers for mouse interactions.
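The logic behind those handlers can be sketched as two small pieces: a transform from screen coordinates to canvas coordinates, and a stroke recorder that the mousedown/mousemove/mouseup events drive. (The names below are illustrative, not the app's actual identifiers.)

```typescript
interface Point { x: number; y: number }

// Map a pointer event's client coordinates into canvas-space coordinates,
// compensating for the element's on-screen position and CSS scaling.
function toCanvasPoint(
  clientX: number,
  clientY: number,
  rect: { left: number; top: number; width: number; height: number },
  canvasWidth: number,
  canvasHeight: number
): Point {
  return {
    x: (clientX - rect.left) * (canvasWidth / rect.width),
    y: (clientY - rect.top) * (canvasHeight / rect.height),
  };
}

// The event wiring reduces to this state machine: mousedown starts a stroke,
// mousemove extends it while drawing, mouseup/mouseleave ends it.
class StrokeRecorder {
  private drawing = false;
  readonly strokes: Point[][] = [];

  start(p: Point) { this.drawing = true; this.strokes.push([p]); }
  move(p: Point) { if (this.drawing) this.strokes[this.strokes.length - 1].push(p); }
  end() { this.drawing = false; }
}
```

In the browser, each recorded point is also passed to `ctx.lineTo()`/`ctx.stroke()` to render immediately.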
Phase 2: Voice Integration
We integrated Web Speech API for voice recognition and ElevenLabs for text-to-speech.
We also created a toggleable interface where students can switch between voice and text modes.
Main challenge: managing state—ensuring the microphone stops when processing, avoiding feedback loops, and handling browser permission requests gracefully.
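The mic-handling logic boils down to a small state machine. This is our own simplification (not the app's exact code), but it captures the key guard: the microphone must be off while the AI is responding, or it picks up the tutor's own spoken reply and loops.

```typescript
type MicState = 'idle' | 'listening' | 'processing';
type MicEvent = 'toggle' | 'result' | 'responseDone';

// Pure transition function modeling the mic lifecycle:
// idle -> listening (user taps mic), listening -> processing (speech result
// received, mic stopped), processing -> idle (AI finished speaking).
function nextMicState(state: MicState, event: MicEvent): MicState {
  switch (state) {
    case 'idle':
      return event === 'toggle' ? 'listening' : 'idle';
    case 'listening':
      if (event === 'result') return 'processing'; // stop mic before the AI replies
      if (event === 'toggle') return 'idle';       // user cancels
      return 'listening';
    case 'processing':
      // Ignore all input mid-response to avoid feedback loops.
      return event === 'responseDone' ? 'idle' : 'processing';
  }
}
```

In the browser, the transitions are driven by `SpeechRecognition`'s `onresult` callback and the text-to-speech playback finishing.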
Phase 3: AI Tutoring System
We built an AITutorService class wrapping the Google Gemini API.
We implemented a "pushiness level" system (1–5 scale) that adjusts the AI’s teaching style via system prompts:
| Level | Description |
|---|---|
| 1 | Minimal intervention, lets students struggle productively |
| 5 | Highly supportive with step-by-step guidance |
The AI maintains conversation history for context-aware responses and speaks naturally for voice output.
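The mapping from pushiness level to system prompt looks roughly like this (the prompt text is illustrative; the actual wording in AITutorService differs):

```typescript
// Illustrative pushiness-to-prompt mapping (level 1 = hands-off, 5 = supportive).
const PUSHINESS_PROMPTS: Record<number, string> = {
  1: 'Intervene minimally. Let the student struggle productively; only answer direct questions.',
  2: 'Offer occasional nudges, but let the student lead.',
  3: 'Balance hints and questions; point out errors without solving them.',
  4: 'Guide actively with leading questions and partial steps.',
  5: 'Be highly supportive: walk the student through the problem step by step.',
};

function buildSystemPrompt(pushiness: number): string {
  // Clamp to the valid 1-5 range so bad settings values cannot break the prompt.
  const level = Math.min(5, Math.max(1, Math.round(pushiness)));
  return `You are a patient tutor looking at a student's whiteboard. ` +
    `${PUSHINESS_PROMPTS[level]} Keep replies short enough to read aloud.`;
}
```

The final instruction matters for voice output: without it, the model produces answers too long for comfortable text-to-speech playback.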
Major Challenges
Challenge 1: Sending Whiteboard Data to Gemini
The Problem:
Getting the whiteboard content into a format Gemini could analyze was significantly harder than expected.
The whiteboard has multiple layers:
- A background image (optional uploaded problem)
- A transparent drawing overlay
- Various coordinate systems due to responsive sizing
The Solution:
We created a composite canvas approach in the captureScreenshot() method:
```typescript
// Create a temporary canvas matching the drawing layer's pixel size
const compositeCanvas = document.createElement('canvas');
compositeCanvas.width = canvas.width;
compositeCanvas.height = canvas.height;
const ctx = compositeCanvas.getContext('2d')!;

// Fill it with a white background (the drawing layer is transparent)
ctx.fillStyle = 'white';
ctx.fillRect(0, 0, compositeCanvas.width, compositeCanvas.height);

// Draw the background image if present
if (backgroundImg) {
  ctx.drawImage(backgroundImg, 0, 0, compositeCanvas.width, compositeCanvas.height);
}

// Draw the student's drawing overlay on top
ctx.drawImage(canvas, 0, 0, compositeCanvas.width, compositeCanvas.height);

// Convert to a base64-encoded PNG
return compositeCanvas.toDataURL('image/png');
```
The tricky part was synchronizing dimensions across device pixel ratios and ensuring the image wasn’t corrupted.
We had to account for devicePixelRatio scaling and proper canvas sizing using getBoundingClientRect().
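The sizing rule is simple but easy to get wrong: the canvas backing store is scaled by `devicePixelRatio` while its CSS size stays at the layout size from `getBoundingClientRect()`. A minimal sketch of that calculation (an assumed helper, not our exact code):

```typescript
interface CanvasSize {
  cssWidth: number;   // applied to style.width, in CSS pixels
  cssHeight: number;  // applied to style.height
  pixelWidth: number; // applied to canvas.width, in device pixels
  pixelHeight: number;
}

// Compute backing-store and CSS sizes for a crisp canvas on high-DPI screens.
function sizeForDisplay(rectWidth: number, rectHeight: number, dpr: number): CanvasSize {
  return {
    cssWidth: rectWidth,
    cssHeight: rectHeight,
    pixelWidth: Math.round(rectWidth * dpr),
    pixelHeight: Math.round(rectHeight * dpr),
  };
}
```

After resizing, the 2D context is scaled once with `ctx.scale(dpr, dpr)` so drawing code can keep working in CSS-pixel coordinates.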
Challenge 2: Orchestrating Multiple AI Services
The Problem:
The app integrates three separate AI APIs (Gemini, ElevenLabs, Runware), each with unique:
- Authentication methods
- Request/response formats
- Rate limits and timing requirements
- Error handling patterns
Key challenges:
- Service initialization timing (missing keys, invalid credentials)
- Data flow coordination (user question → Gemini → ElevenLabs → Runware)
- Error cascading (one service failure breaking the flow)
- Race conditions (user asking new questions mid-response)
The Solution:
We created dedicated service classes—GeminiService, ElevenLabsService, and RunwareService—with graceful degradation:
```typescript
// Conditional initialization with fallback warnings
if (elevenLabsKey) {
  elevenLabsRef.current = new ElevenLabsService(elevenLabsKey);
} else {
  console.warn('Voice output disabled - check API key');
}
```
We also implemented a state machine in WhiteboardPage.tsx:
- `isProcessing` prevents overlapping requests
- `statusMessage` provides real-time feedback
- `try/catch` blocks isolate service errors
- Services fail independently
Example sequence:
```typescript
const response = await aiTutorRef.current.getResponse(message, imageUrl);
setAiResponse(response); // Store for text display
if (isSpeaking && elevenLabsRef.current) {
  await elevenLabsRef.current.speak(response, voiceId);
} else {
  // Graceful degradation: show text instead
}
```
The hardest part was debugging the image analysis pipeline—ensuring the base64 image data, MIME type, and Gemini input structure were all correct.
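The fix came down to converting the canvas data URL into Gemini's `inlineData` part format. A sketch of that conversion (an assumed helper; the surrounding request assembly is omitted):

```typescript
// toDataURL() yields "data:image/png;base64,...."; Gemini's inlineData part
// wants the raw base64 payload plus an explicit MIME type, so the data URL
// prefix must be stripped first.
function toGeminiImagePart(dataUrl: string) {
  const match = dataUrl.match(/^data:(image\/\w+);base64,(.*)$/);
  if (!match) throw new Error('Expected a base64 image data URL');
  return { inlineData: { mimeType: match[1], data: match[2] } };
}

// Sent alongside the text prompt, roughly as:
// contents: [{ role: 'user', parts: [{ text: question }, toGeminiImagePart(screenshot)] }]
```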
What We Learned
- 🎨 Canvas API mastery: composite operations, pixel ratios, and efficient drawing
- 🤖 Multimodal AI integration: combining vision, language, and speech models
- ⚙️ Graceful degradation: designing resilient systems
- 🔁 Real-time state management: coordinating async workflows
- 🧩 Secure RLS policies: user data isolation in Bolt Database
- 🧠 Prompt engineering: designing natural and pedagogically sound tutor responses
Future Improvements
- 📝 Persistent canvas history — save and reload sessions
- 👥 Real-time collaboration — multiple students drawing together
- ✍️ Handwriting recognition — convert handwritten math to LaTeX
- 📊 Progress tracking — analyze performance trends with ML
- 📱 Mobile support — touch drawing and responsive layouts
Takeaway:
Building truly useful AI applications isn’t just about calling APIs—it’s about orchestrating multiple systems into one seamless, human-centered experience.
Built With
- canvas-api-for-drawing
- elevenlabs-api
- frontend:-react
- google-gemini-api
- runware-api
- tailwind
- typescript
- vite
- web-speech-api