Vox Diagram: Speak Your Diagram Into Existence

Inspiration

Watching students fight with graphing tools at 2 a.m. to visualize something like F⃗ = qv⃗ × B⃗, when they could just say it out loud in three seconds. A spoken sentence like "draw a force vector pointing down labeled mg" contains everything needed to render the diagram. The missing piece was a system connecting speech to structured rendering in real time, one that lets you explore the result with your hands instead of a mouse.

How We Built It

Three parallel pipelines feed a shared state store. ElevenLabs Scribe v2 transcribes speech over a WebSocket; the transcript hits Claude's API, which converts it into a strict JSON diagram schema (vectors, points, curves, equations, etc.); and Three.js renders the result with KaTeX labels. Simultaneously, MediaPipe HandLandmarker tracks 21 hand landmarks client-side at 30+ fps. Palm position maps to camera orbit angles, and thumb-to-index distance controls zoom. All camera values pass through exponential smoothing (camₜ = 0.15·rawₜ + 0.85·camₜ₋₁) with a 10% dead zone to kill jitter. The stack is React + Vite + TypeScript, react-three-fiber, Zustand for state, and a thin Express backend proxying API keys.
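The smoothing step above can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual code: the function and constant names (`smoothAxis`, `ALPHA`, `DEAD_ZONE`) are ours, and we assume the 10% dead zone is measured against the axis range.

```typescript
const ALPHA = 0.15;     // weight on the raw sample: camₜ = 0.15·rawₜ + 0.85·camₜ₋₁
const DEAD_ZONE = 0.10; // ignore changes smaller than 10% of the axis range

// Smooth one camera axis (orbit angle or zoom) from a raw hand-tracking sample.
function smoothAxis(raw: number, prev: number, range: number): number {
  // Dead zone: landmark jitter below the threshold leaves the camera untouched.
  if (Math.abs(raw - prev) < DEAD_ZONE * range) return prev;
  // Exponential smoothing toward the new raw value.
  return ALPHA * raw + (1 - ALPHA) * prev;
}
```

Calling this per frame for each camera axis gives the low-pass behavior described above: a still hand produces no movement at all, and a deliberate gesture is eased in rather than applied instantly.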

Challenges and Learnings

Three main problems. LLM output reliability: Claude doesn't always return clean JSON, so we added aggressive parsing, schema validation, and dozens of prompt revisions with concrete examples. Hand-tracking jitter: raw MediaPipe landmarks fluctuate 1-3% per frame even with a still hand; we solved this by combining smoothing with velocity thresholding. Latency: the Claude API call (1-3 s) is the bottleneck, mitigated by streaming the response and rendering diagram objects incrementally as they arrive. The biggest lesson: the diagram JSON schema is the entire product. The tighter and more constrained the output format, the more reliable everything becomes.
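The "aggressive parsing plus schema validation" idea can be sketched as follows. This is a minimal illustration, not the project's real schema: the object shape (`type`, `label`, `components`) and the helper names are hypothetical, and we assume the model returns a JSON array possibly wrapped in prose or code fences.

```typescript
type Vec3 = [number, number, number];

// Hypothetical diagram object; the real schema also covers curves, equations, etc.
interface DiagramObject {
  type: "vector" | "point" | "curve" | "equation";
  label?: string;      // KaTeX source, e.g. "mg"
  components?: Vec3;   // for vectors and points
}

// Runtime type guard: keep only objects with a recognized "type" field.
function isDiagramObject(o: unknown): o is DiagramObject {
  if (typeof o !== "object" || o === null) return false;
  const t = (o as { type?: unknown }).type;
  return t === "vector" || t === "point" || t === "curve" || t === "equation";
}

// Extract the outermost JSON array from a possibly prose-wrapped LLM reply,
// parse it, and silently drop entries that fail validation.
function parseDiagram(raw: string): DiagramObject[] {
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start === -1 || end === -1 || end < start) throw new Error("no JSON array found");
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (!Array.isArray(parsed)) throw new Error("expected a JSON array");
  return parsed.filter(isDiagramObject);
}
```

Filtering invalid entries rather than rejecting the whole reply pairs naturally with the incremental-rendering mitigation: each object that validates can be handed to the renderer without waiting for the rest of the stream.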

Built With

  • Languages: TypeScript, JavaScript
  • Frontend: React, Vite, Three.js (react-three-fiber, react-three-drei), Zustand, KaTeX
  • Backend: Node.js, Express
  • AI/NLP: Anthropic Claude API (claude-sonnet-4-20250514), Anthropic Messages API
  • Speech-to-text: ElevenLabs Scribe v2 realtime WebSocket API, ElevenLabs STT API
  • Hand tracking: Google MediaPipe HandLandmarker (WebAssembly, client-side)
  • Browser APIs: Web Audio API, MediaDevices