Inspiration

What happens if you reverse-engineer the mouth movements that a piece of audio implies? Real speech comes from a tongue, jaw, and lips that have physical speed limits. A human tongue tip tops out around 20 cm/s during natural speech. TTS models don't simulate a mouth. They learn acoustic patterns and reproduce them, and the articulator movements their audio implies are often biomechanically impossible.

LARYNX reconstructs those implied movements, measures their velocities, and renders the result inside a 3D skull cross-section. When a deepfake's implied tongue is moving at 9x the human maximum and clipping through the hard palate, you don't need a confidence score to understand what you're seeing.

What It Does

  • Audio Processing: You upload a voice clip. LARYNX extracts the formant trajectories (the resonant frequencies shaped by tongue position, jaw height, and lip rounding) and maps them backward to the articulatory movements that would have produced those sounds.
  • Velocity Computation: It computes velocities frame by frame to determine how fast each articulator would need to move between consecutive positions. Real speech stays within biomechanical limits. Deepfakes don't.
  • Real-Time 3D Visualization: The analysis streams in real time to a translucent sagittal cross-section of a human skull with a tongue model driven directly by the audio's formant data. The formant-to-morph-target mapping is intentionally unclamped. If the audio implies impossible positions, the tongue goes there: through bone, through the nasal cavity.
  • Side-by-Side Comparison: You can paste text, generate synthetic speech from it via TTS, and compare it side-by-side against a real recording of the same words, watching the articulatory differences play out in the same skull.
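
The velocity step can be sketched in a few lines of Python. This is a minimal illustration, not LARYNX's actual code: the function names, the (x, y) position format in centimeters, and the 100 Hz frame rate are assumptions, and the 20 cm/s limit is simply the tongue-tip figure quoted earlier.

```python
import math

# Assumed limit, taken from the ~20 cm/s tongue-tip figure quoted in the text.
TONGUE_TIP_LIMIT_CM_S = 20.0


def frame_speeds(positions_cm, frame_rate_hz):
    """Speed in cm/s between consecutive (x, y) positions of one articulator."""
    return [
        math.hypot(x1 - x0, y1 - y0) * frame_rate_hz
        for (x0, y0), (x1, y1) in zip(positions_cm, positions_cm[1:])
    ]


def fraction_over_limit(speeds_cm_s, limit_cm_s=TONGUE_TIP_LIMIT_CM_S):
    """Share of frame transitions that exceed the biomechanical limit."""
    if not speeds_cm_s:
        return 0.0
    return sum(s > limit_cm_s for s in speeds_cm_s) / len(speeds_cm_s)


# Hypothetical tongue-tip track sampled at 100 Hz (positions in cm).
track = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (2.1, 0.1)]
speeds = frame_speeds(track, frame_rate_hz=100.0)
```

In this made-up track the last step covers 2 cm in one 10 ms frame, which implies about 200 cm/s, so one of the three transitions lands far outside the limit.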

How I Built It

  • Audio Pipeline: Runs on Modal B200 GPUs. Audio is processed through HuBERT Large for learned speech representations, then passed through an Acoustic-to-Articulatory Inversion (AAI) model that maps acoustic features to six articulatory channels: tongue tip, tongue body, tongue dorsum, jaw, upper lip, and lower lip.
  • Feature Extraction: In parallel, Praat (via parselmouth) extracts F1 through F4 formant contours at high temporal resolution. Velocity is computed frame-to-frame across each articulator, producing 108 features including cross-correlations, jerk profiles, and velocity distribution statistics.
  • Streaming Delivery: Results stream back as Server-Sent Events so the frontend animates progressively. There is no loading spinner and no waiting for the full analysis to finish.
  • The Classifier: A HistGradientBoostingClassifier trained on articulatory features extracted from thousands of samples, covering real speech (LibriSpeech) and synthetic speech from 73 TTS architectures, including ElevenLabs and the WaveFake vocoders. The feature space targets generalized physics: velocity distributions, formant transition rates, and jerk profiles. Cross-validated accuracy is 89.2%, measured with StratifiedGroupKFold to prevent speaker leakage.
  • Frontend Stack: Built with React and Three.js via React Three Fiber. The head model is an ARKit 52-blendshape mesh rendered with mesh transmission materials for the translucent skull effect. GSAP handles animation, Zustand manages state, and Tone.js drives reactive sound design that pitch-shifts and distorts as velocity anomalies appear. All animation state lives in refs and transient stores to maintain 60fps.
  • Cloud Infrastructure: Seven Cloudflare products handle routing, storage, and inference at the edge: Workers (Hono API proxy), Pages (static hosting), D1 (analysis history), R2 (audio storage), Workers AI (BGE embeddings for audio fingerprinting), Vectorize (similarity search), and AI Gateway (rate limiting and observability).
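
For the streaming step, a Server-Sent Events response is just text: each event is an optional `event:` line, a `data:` line, and a blank line. The sketch below shows that framing in Python. The event names (`frame`, `done`) and the payload fields are hypothetical and chosen only for illustration; the real payload is not described here.

```python
import json


def sse_event(event, payload):
    """Format one Server-Sent Event: an event name, a JSON data line, a blank line."""
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


def stream_frames(frames):
    """Yield one SSE chunk per analysis frame, then a final 'done' event."""
    count = 0
    for index, frame in enumerate(frames):
        count += 1
        yield sse_event("frame", {"index": index, **frame})
    yield sse_event("done", {"frames": count})


# Hypothetical per-frame results; a web framework would write these chunks
# to a response with Content-Type: text/event-stream.
for chunk in stream_frames([{"tongue_tip_cm_s": 12.5}, {"tongue_tip_cm_s": 184.0}]):
    print(chunk, end="")
```

Because each chunk is self-delimiting, the frontend can start animating as soon as the first frame arrives instead of waiting for the full analysis.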

Challenges I Ran Into

  • Signal Processing: Formant extraction is inherently noisy. Praat's tracker produces discontinuities where F2 leaps 500+ Hz between consecutive 10 ms frames. The linear mapping amplified those jumps into velocity spikes that looked identical to real anomalies. Separating genuine articulatory anomalies from extraction artifacts, so that artifacts did not turn into false positives, was the hardest signal processing problem.
  • Pipeline Discrepancies: I lost time to a VELOCITY_SCALE mismatch between the training pipeline and the inference path. The training code computed raw velocities, while the inference code applied a 1.5x scaling factor. This caused a systematic distribution shift that quietly tanked accuracy until I audited the pipeline end-to-end.
  • Data Integrity: Long audio (15+ seconds) produced noisy articulatory trajectories that corrupted downstream features. I added duration filtering to keep inference clean rather than letting bad data poison the classifier.
  • Frontend Performance: Driving a physically-based 3D model from streaming data at 60fps required strict optimization. This meant no useState for animation, no object allocation inside useFrame, delta clamping for browser tab throttling, and pre-allocating Three.js objects. Tone.js also threw a RangeError from rampTo() when the current value equaled the target, which cost me an hour before I switched to linearRampTo().
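
To make the formant-jump problem concrete, here is one common way to suppress single-frame tracking errors: a short sliding median. This is an assumption for illustration, not necessarily the method LARYNX uses, and the function names and the sample F2 track are made up.

```python
from statistics import median


def median_smooth(track_hz, width=3):
    """Sliding median over a formant track; edge frames use a truncated window."""
    half = width // 2
    return [
        median(track_hz[max(0, i - half): i + half + 1])
        for i in range(len(track_hz))
    ]


def max_jump(track_hz):
    """Largest frame-to-frame change in Hz."""
    return max((abs(b - a) for a, b in zip(track_hz, track_hz[1:])), default=0.0)


# Hypothetical F2 track with one single-frame tracking error (2100 Hz).
f2 = [1500.0, 1510.0, 2100.0, 1520.0, 1530.0]
smoothed = median_smooth(f2)
```

On this track the raw signal contains a 590 Hz jump between adjacent frames, while the smoothed one never moves more than 10 Hz per frame. The tradeoff is that a median filter also removes genuine one-frame spikes, so the window width decides how much real anomaly signal survives.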

Accomplishments I'm Proud Of

  • The Visualization: The skull clip moment works. Watching a deepfake's implied tongue phase through the hard palate provides an immediate understanding of why the audio is fake. It validates the core thesis.
  • Classifier Performance: The classifier reaches 89.2% cross-validated accuracy across 73 TTS architectures using only articulatory physics features. No spectrograms or acoustic signatures tied to specific models were used.
  • Future-Proof Detection: The detection generalizes across TTS architectures the system hasn't been trained on because the features are grounded in physics. Acoustic fingerprinting breaks when models update; articulatory physics doesn't.
  • Edge Architecture: Seven Cloudflare products form one cohesive data flow in which each product handles one specific concern and every request touches all of them.

What I Learned

  • Explainability as a Product: 3D visualization turns explainability from a nice-to-have into the actual product. "Velocity exceeded 184 cm/s" is just a number, but watching it happen inside a skull creates conviction.
  • Data vs. Model: The hard engineering wasn't the classifier; it was the training pipeline. Processing thousands of samples consistently, keeping feature extraction deterministic, and catching silent distribution shifts is where the real hours go. The model is the easy part.
  • Demo Environments: 80dB of ambient noise in a convention hall will destroy formant extraction. Pre-recorded audio is strictly required for live demos.

Built With

  • cloudflare-ai-gateway
  • cloudflare-d1
  • cloudflare-pages
  • cloudflare-r2
  • cloudflare-vectorize
  • cloudflare-workers
  • cloudflare-workers-ai
  • gsap
  • librosa
  • modal
  • numpy
  • praat
  • python
  • react
  • react-three-fiber
  • three.js
  • tone.js
  • typescript
  • vite
  • zustand