Inspiration

The best tutoring moments aren’t transactional. They happen when a professor leans over your work, notices something specific, and says, “Wait! Look at that sign right there.” That moment of shared attention over a shared surface is what makes one-on-one teaching so powerful.

Every AI tutor we tried felt more like a smarter search engine. You type a question, it types an answer. There’s no shared space, no gesture, no sense that you’re working through something together. We kept asking: what if the AI could actually see what you’re writing, talk to you naturally, and pick up a marker to write back?

That question became Professor KIA.


What it does

Professor KIA is a voice-first AI tutoring app built around a shared freehand whiteboard. You speak naturally, no forms, no typing, and KIA speaks back. You sketch your work with a mouse or stylus; KIA watches through computer vision and responds to what he sees. When he explains something, he picks up the marker himself and writes alongside you, with animated handwriting that appears stroke by stroke in real time.

The core experience:

  • Talk naturally: your mic is always live, no push-to-talk
  • Write freely: draw diagrams, work through problems, cross things out
  • KIA sees your work! The board is automatically shared with the AI, which interprets your handwriting using vision
  • KIA writes back, too! Hints in blue, corrections in red, confirmed work in green, all in his own handwriting
  • Interrupt anytime: start speaking mid-explanation and he stops instantly, both voice and pen

KIA is Socratic by design. He asks before he tells. He guides you toward the answer rather than handing it over.


How we built it

The system runs over a single WebSocket connection between a Next.js frontend and a FastAPI backend, with four subsystems operating in parallel:

Voice pipeline:
The student’s microphone streams raw audio to the backend, which forwards it to Deepgram Nova-2 for real-time transcription. Final transcripts are merged across a short buffer window (to avoid splitting one sentence into multiple responses) and sent to Claude. ElevenLabs streams TTS audio back in chunks, which the frontend queues into the Web Audio API and plays sequentially.

Vision loop:
The frontend composites two canvas layers (the student’s tldraw drawing and KIA’s transparent handwriting overlay) into a single PNG every few seconds. This snapshot is attached to the Claude API call as a vision input, so the model sees the full board alongside conversation history.
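Attaching the snapshot amounts to building a user turn with an image content block next to the transcript text. The block shape below follows the Anthropic Messages API; treat the helper itself as a sketch, not our exact backend code.

```python
import base64

def build_board_turn(png_bytes: bytes, user_text: str) -> dict:
    """Build one user turn that attaches the composited board PNG as a
    vision input alongside the latest student utterance.
    The content-block shape follows the Anthropic Messages API;
    this helper is an illustrative sketch."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": user_text},
        ],
    }
```

Appending this turn to the running message history is what lets the model correlate “that sign right there” with the pixels on the board.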

Handwriting synthesis:
KIA’s writing needed to feel human, not rendered. We extracted glyph outlines from the Caveat handwriting font using fonttools, sampled Bézier curves into ordered point sequences, and animated them with requestAnimationFrame. The timing is calibrated so he finishes writing at roughly the same moment he finishes speaking. A small ±1px jitter per point adds an organic feel.
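The curve-to-points step boils down to evaluating each quadratic Bézier segment (the curve type TrueType glyph outlines use) at evenly spaced parameters, then perturbing each point slightly. A minimal sketch of those two pieces, with illustrative parameter defaults:

```python
import random

def sample_quadratic(p0, p1, p2, n=16):
    """Sample a quadratic Bézier segment into n ordered points using
    B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2."""
    pts = []
    for i in range(n):
        t = i / (n - 1)
        x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
        y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
        pts.append((x, y))
    return pts

def jitter(points, amount=1.0, rng=None):
    """Add a small +/- 1px perturbation per point for an organic feel."""
    rng = rng or random.Random()
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount)) for x, y in points]
```

In the app the ordered points are replayed on the overlay canvas by the animation loop, with per-frame pacing tuned against the TTS duration.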

Structured LLM responses:
Claude returns structured JSON on every turn: speech text, board actions, tutor state, and a wait_for_student flag. A partial JSON parser detects the completed speech field during streaming and triggers TTS immediately, without waiting for board actions. This reduces perceived latency by the full duration of KIA’s speech.


Challenges we ran into

Echo suppression:
KIA’s own voice fed back into the microphone and got transcribed, causing him to respond to himself. We built a two-stage barge-in system: a Deepgram SpeechStarted event alone doesn’t trigger an interrupt, but SpeechStarted followed by a real transcript within 1.5 seconds does. Combined with a post-TTS cooldown window, this nearly eliminated false self-interruptions.
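The two-stage gate is easiest to see as a tiny state machine: SpeechStarted arms a pending window, and only a non-empty transcript inside that window confirms the interrupt, while a short post-TTS cooldown absorbs the echo tail. A sketch with an injected clock; the 0.5s cooldown value is an assumption.

```python
class BargeInGate:
    """Two-stage barge-in: SpeechStarted arms a 1.5s pending window;
    only a real transcript inside it confirms the interrupt. A cooldown
    right after TTS ends absorbs trailing echo. Illustrative sketch."""

    PENDING_S = 1.5
    COOLDOWN_S = 0.5  # assumed value

    def __init__(self, clock):
        self.clock = clock
        self.pending_at = None
        self.tts_ended_at = None

    def on_tts_end(self):
        self.tts_ended_at = self.clock()

    def on_speech_started(self):
        # Ignore speech onsets inside the post-TTS cooldown (likely echo).
        if (self.tts_ended_at is not None
                and self.clock() - self.tts_ended_at < self.COOLDOWN_S):
            return
        self.pending_at = self.clock()

    def on_transcript(self, text: str) -> bool:
        """Return True iff this transcript confirms an interrupt."""
        ok = (self.pending_at is not None
              and self.clock() - self.pending_at <= self.PENDING_S
              and text.strip() != "")
        self.pending_at = None
        return ok
```

Only a confirmed interrupt fans out to the stop mechanisms; a bare SpeechStarted from echo or room noise never touches playback.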

Canvas coordinate alignment:
KIA’s handwriting lives on a transparent overlay canvas, while the student’s drawing lives inside tldraw’s internal canvas system. Keeping both in the same coordinate space required locking the tldraw camera to the identity transform and syncing overlay dimensions on every resize event.

LLM positional drift:
Claude produced good content but unreliable placement, sometimes writing over existing text. We solved this server-side: an orchestrator rebases all write-action coordinates below the current content cursor. The LLM always targets a fixed Y value; the orchestrator handles geometry silently.
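The rebase itself is a simple pass over the turn’s actions: every write action gets reassigned a Y below the content cursor, which then advances. Field names and the 40px line height below are illustrative assumptions, not our exact schema.

```python
def rebase_write_actions(actions, content_cursor_y, line_height=40):
    """Server-side rebase: the LLM always targets a fixed Y; the
    orchestrator shifts each write action below the current content
    cursor so new writing never lands on existing board text.
    Returns the rebased actions and the advanced cursor. Sketch only."""
    y = content_cursor_y
    out = []
    for a in actions:
        if a.get("type") == "write":
            a = {**a, "y": y}   # copy, leaving the original action intact
            y += line_height
        out.append(a)
    return out, y
```

Because the geometry is handled here, the prompt never has to teach the model about board layout, which removed the drift entirely.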

Full-stack barge-in:
When the student starts speaking, four things must stop immediately: the audio currently playing, queued audio chunks, the handwriting animation loop, and backend board dispatch. Each required a different mechanism: audio stop, queue drain, frame-level cancellation flags, and backend interruption checks, all triggered by a single frontend event.
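Collapsed into one place, the fan-out looks like a single handler flipping four independent stop mechanisms. In reality these live on both sides of the WebSocket; the flag and queue names here are hypothetical.

```python
class InterruptController:
    """Fan one 'student started speaking' event out to the four stop
    mechanisms described above. Sketch: the real state is split across
    frontend and backend; names are hypothetical."""

    def __init__(self):
        self.audio_playing = True              # current audio source
        self.audio_queue = ["chunk1", "chunk2"]  # queued TTS chunks
        self.handwriting_cancelled = False     # checked per animation frame
        self.board_dispatch_cancelled = False  # checked by the backend loop

    def on_barge_in(self):
        self.audio_playing = False         # stop playback immediately
        self.audio_queue.clear()           # drain queued audio
        self.handwriting_cancelled = True  # frame-level cancellation flag
        self.board_dispatch_cancelled = True  # backend interruption check
```

The important property is that all four flips happen synchronously from one event, so no subsystem keeps going after the student starts talking.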


Accomplishments we’re proud of

The moment everything synchronized, KIA speaking while his handwriting appeared stroke by stroke, both finishing at the same time, felt genuinely magical. It’s the closest thing we’ve seen to a real professor at a real whiteboard.

We’re especially proud of the barge-in system. Interrupting an AI mid-sentence and mid-stroke, with both stopping instantly and the conversation continuing naturally, is technically subtle and experientially significant. It makes the interaction feel like a real conversation rather than a query-response loop.

We’re also proud of the handwriting pipeline. No external handwriting model, just font curves, Bézier sampling, and careful animation timing, producing something that reads as genuinely hand-drawn.


What we learned

Real-time multimodal systems have failure modes that only appear when all the pieces run together. STT, LLM, TTS, handwriting synthesis, and WebSocket state each work fine in isolation. The interesting bugs live in the transitions: what happens when a Deepgram chunk arrives while the LLM lock is held? What happens when the student starts drawing 300ms after KIA begins speaking?

We learned to think about the system as a timeline rather than a state machine: every event has a timestamp, and correctness depends on the ordering of those timestamps as much as on the values themselves.

We also learned that the system prompt matters more than model tier for personality. A well-crafted Socratic prompt on a smaller model produces a more useful tutor than a generic prompt on a larger one. The pedagogy lives in the instructions, not the weights.


What’s next for Professor KIA

LaTeX handwriting:
The math rendering pipeline (LaTeX → MathJax SVG → stroke paths) is partially built but not yet live. Enabling it would let KIA write real math notation on the board instead of text approximations.

Diagram primitives:
Arrows, boxes, and labeled nodes would allow KIA to draw linked lists, trees, and free-body diagrams, unlocking CS and physics as first-class subjects.

Session memory:
Right now everything resets on page refresh. Persisting conversation and board history across sessions would let KIA remember where you left off and track progress over time.

Multi-student mode:
A shared board visible to a small study group, with KIA facilitating: everyone draws, everyone hears him, and he mediates the discussion.
