Vox Diagram: Speak Your Diagram Into Existence
Inspiration

Watching students fight with graphing tools at 2am to visualize something like F⃗ = q v⃗ × B⃗, when they could just say it out loud in three seconds. A spoken sentence like "draw a force vector pointing down labeled mg" contains everything needed to render the diagram. The missing piece was a system that connects speech to structured rendering in real time and lets you explore the result with your hands instead of a mouse.
How We Built It

Three parallel pipelines feed a shared state store. ElevenLabs Scribe v2 transcribes speech over WebSocket; the transcript hits Claude's API, which converts it into a strict JSON diagram schema (vectors, points, curves, equations, etc.); and Three.js renders the result with KaTeX labels. Simultaneously, MediaPipe HandLandmarker tracks 21 hand landmarks client-side at 30+ fps. Palm position maps to camera orbit angles, and thumb-to-index distance controls zoom. All camera values pass through exponential smoothing (cam_t = 0.15 · raw_t + 0.85 · cam_{t−1}) with a 10% dead zone to kill jitter. The stack is React + Vite + TypeScript, react-three-fiber, Zustand for state, and a thin Express backend proxying API keys.
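The smoothing step above can be sketched in a few lines of TypeScript. This is an illustrative reconstruction, not the project's actual code; the function and constant names are ours. Each raw value from the hand tracker is blended with the previous smoothed value, and fluctuations inside a 10% dead zone are dropped entirely:

```typescript
const ALPHA = 0.15;     // weight of the new raw sample
const DEAD_ZONE = 0.10; // relative change below which the old value is kept

// Hypothetical helper matching the smoothing described above:
// cam_t = 0.15 * raw_t + 0.85 * cam_{t-1}, with a 10% dead zone.
function smoothCamera(raw: number, prev: number): number {
  // Dead zone: ignore jitter smaller than 10% of the previous value.
  if (prev !== 0 && Math.abs(raw - prev) / Math.abs(prev) < DEAD_ZONE) {
    return prev;
  }
  // Exponential smoothing toward the new sample.
  return ALPHA * raw + (1 - ALPHA) * prev;
}

// A small wiggle around 1.0 is absorbed; a real move is tracked gradually.
let cam = 1.0;
cam = smoothCamera(1.05, cam); // within dead zone, cam stays 1.0
cam = smoothCamera(2.0, cam);  // real move: 0.15 * 2.0 + 0.85 * 1.0 = 1.15
```

The dead zone handles a still hand (landmarks fluctuate slightly every frame), while the low alpha keeps deliberate motion responsive but never twitchy.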
Challenges and Learnings

Three main problems:

- LLM output reliability: Claude doesn't always return clean JSON, so we added aggressive parsing, schema validation, and dozens of prompt revisions with concrete examples.
- Hand-tracking jitter: raw MediaPipe landmarks fluctuate 1-3% per frame even with a still hand; we solved this by combining smoothing with velocity thresholding.
- Latency: the Claude API call (1-3 s) is the bottleneck, mitigated by streaming the response and rendering diagram objects incrementally as they arrive.

The biggest lesson: the diagram JSON schema is the entire product. The tighter and more constrained the output format, the more reliable everything becomes.
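The "aggressive parsing + schema validation" step can be sketched as follows. This is an assumed, minimal version (the interface, the `objects` field, and the helper name are ours, not the project's real schema): the model's reply may wrap its JSON in prose or a markdown fence, so we cut to the outermost object before parsing, then keep only objects whose type the renderer knows.

```typescript
// Illustrative diagram-object shape (assumed; the real schema is richer).
interface DiagramObject {
  type: "vector" | "point" | "curve" | "equation";
  label?: string;
}

// Hypothetical helper: extract and validate a diagram from a raw LLM reply.
function extractDiagram(reply: string): DiagramObject[] {
  // Strip markdown code fences, then cut to the outermost {...} span.
  const stripped = reply.replace(/```(?:json)?/g, "");
  const start = stripped.indexOf("{");
  const end = stripped.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object found");

  const parsed = JSON.parse(stripped.slice(start, end + 1));
  if (!Array.isArray(parsed.objects)) throw new Error("missing objects array");

  // Minimal schema validation: drop anything the renderer can't draw.
  const known = new Set(["vector", "point", "curve", "equation"]);
  return parsed.objects.filter((o: DiagramObject) => known.has(o.type));
}

// Prose wrapping and an unknown type are tolerated; valid objects survive.
const reply =
  'Here you go:\n```json\n{"objects":[{"type":"vector","label":"mg"},{"type":"blob"}]}\n```';
console.log(extractDiagram(reply)); // [{ type: "vector", label: "mg" }]
```

Filtering rather than rejecting on unknown types pairs naturally with the streaming mitigation: each object that arrives and validates can be rendered immediately, without waiting for the full response.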
Built With
- Languages: TypeScript, JavaScript
- Frontend: React, Vite, Three.js (react-three-fiber, react-three-drei), Zustand, KaTeX
- Speech-to-text: ElevenLabs Scribe v2 realtime WebSocket API
- AI/NLP: Anthropic Claude API (claude-sonnet-4-20250514), anthropic-messages-api
- Backend: Node.js, Express
- APIs: ElevenLabs STT API, Web Audio API, MediaDevices
- Hand tracking: Google MediaPipe HandLandmarker (WebAssembly, client-side)