The eye is listening

Inspiration

Music has always been a visual experience in our minds: a bass drop feels like an explosion, a soft piano looks like rippling water. We wanted to make that synesthesia real. The question was simple: what if Darude - Sandstorm didn't just hit your ears, but painted a fractal?

Every genre carries a different emotional geometry. A techno kick lives in a different mathematical universe than a jazz chord. We wanted to prove that visually.

What we built

A real-time pipeline that listens to music and generates unique fractal images driven by the physics of the sound itself.

The system extracts a rich set of audio features on every frame:

Feature	What it captures
Volume (dBFS)	Energy level, from silence (−80 dB) to full scale (0 dB)
Bass / Mid / Treble	Spectral energy per band, each normalised 0 → 1
Spectral centroid	Perceived brightness of the mix (Hz + log % scale)
Smoothness	Crest factor, how percussive vs. sustained the sound is
BPM	Tempo, recomputed every ~2 s from onset history
Chroma (12 bins)	Harmonic content per pitch class (C, C#, … B)
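
As a rough illustration of the band and centroid rows in the table, here is a minimal sketch in Python. It uses a naive DFT only to stay dependency-free (a real pipeline would call an FFT library), and the band edges in `BANDS` are hypothetical values, not the project's actual cut-offs.

```python
import cmath
import math

SAMPLE_RATE = 44_100  # Hz, as stated in the write-up
CHUNK = 512           # samples per frame, as stated in the write-up

# Hypothetical band edges in Hz; the project's real cut-offs are not stated.
BANDS = {
    "bass": (20.0, 250.0),
    "mid": (250.0, 4000.0),
    "treble": (4000.0, float("inf")),
}


def magnitude_spectrum(samples):
    """Naive O(n^2) DFT of a real signal: magnitudes for bins 0..n/2."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def band_energies(mags, sample_rate=SAMPLE_RATE):
    """Raw (not yet normalised) sum of squared magnitudes per band."""
    bin_hz = sample_rate / ((len(mags) - 1) * 2)
    energies = {name: 0.0 for name in BANDS}
    for k, mag in enumerate(mags):
        freq = k * bin_hz
        for name, (low, high) in BANDS.items():
            if low <= freq < high:
                energies[name] += mag * mag
    return energies


def spectral_centroid(mags, sample_rate=SAMPLE_RATE):
    """Magnitude-weighted mean frequency in Hz (0.0 for silence)."""
    bin_hz = sample_rate / ((len(mags) - 1) * 2)
    total = sum(mags)
    if total == 0.0:
        return 0.0
    return sum(k * bin_hz * mag for k, mag in enumerate(mags)) / total


# A test tone centred on bin 12 (about 1034 Hz) should land in the mid band.
tone = [math.sin(2 * math.pi * 12 * i / CHUNK) for i in range(CHUNK)]
mags = magnitude_spectrum(tone)
print(band_energies(mags), spectral_centroid(mags))
```

The raw band energies printed here are unbounded; mapping them into the 0 → 1 range listed in the table is a separate normalisation step.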

Each of these values becomes a parameter of the fractal renderer: zoom depth, iteration count, color palette, symmetry axis, animation speed. A heavy bass hit warps the geometry. A bright cymbal opens the color space. The BPM drives the rhythm of the visual evolution.
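
To make the idea of a feature-to-parameter mapping concrete, here is a purely illustrative sketch in Python. Every name, range, and pairing below (`AudioFrame`, `FractalParams`, `map_frame`, the numeric limits) is hypothetical; the real mapping lives in the C++ renderer and is not described in detail here.

```python
from dataclasses import dataclass


@dataclass
class AudioFrame:
    bass: float        # normalised 0..1
    treble: float      # normalised 0..1
    smoothness: float  # 0 = percussive, 1 = sustained
    bpm: float


@dataclass
class FractalParams:
    zoom: float
    iterations: int
    hue_spread: float
    animation_hz: float


def lerp(a, b, t):
    """Linear interpolation with t clamped to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return a + (b - a) * t


def map_frame(frame):
    """Hypothetical mapping: the ranges are illustrative, not the project's."""
    return FractalParams(
        zoom=lerp(1.0, 4.0, frame.bass),                  # bass warps geometry
        iterations=int(round(lerp(64, 512, 1.0 - frame.smoothness))),
        hue_spread=lerp(0.1, 1.0, frame.treble),          # treble opens colors
        animation_hz=frame.bpm / 60.0,                    # one cycle per beat
    )


print(map_frame(AudioFrame(bass=1.0, treble=0.0, smoothness=1.0, bpm=120.0)))
```

Clamping inside `lerp` is a defensive choice so that a feature value slightly outside 0..1 cannot push a renderer parameter out of its intended range.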

The result: no two genres look the same.

How we built it

The architecture is a clean producer → wire → consumer split:

Music / mic / loopback
↓
Python audio brain (FFT, chroma, BPM, smoothness)
↓
OSC over UDP (~86 Hz)
↓
C++ fractal renderer
↓
Live fractal image

Producer (Python): A FeatureExtractor runs one FFT per chunk (512 samples @ 44.1 kHz) and derives all features. A RollingPeak normaliser keeps band values in [0, 1] relative to recent history, so even quiet ambient music still drives the visuals: the scale adapts automatically. Features are packed into an OSC /audio/frame message and fired over UDP at ~86 Hz.
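
The running-peak idea can be sketched in a few lines of Python. This is an assumption about how such a normaliser might work, not the project's RollingPeak itself (which could, for example, use exponential decay instead of a fixed window); the class name `RollingPeakSketch` and the 430-frame window (about 5 s at ~86 frames/s) are made up for illustration.

```python
from collections import deque


class RollingPeakSketch:
    """Scale non-negative values by the largest value seen in a recent window."""

    def __init__(self, window=430, floor=1e-9):
        # window: number of recent frames remembered (assumed value)
        # floor: below this peak the input is treated as silence
        self._history = deque(maxlen=window)
        self._floor = floor

    def normalise(self, value):
        self._history.append(value)
        peak = max(self._history)
        if peak < self._floor:
            return 0.0
        return min(1.0, value / peak)


rp = RollingPeakSketch(window=3)
for raw in (2.0, 1.0, 0.5, 0.5, 0.5):
    print(raw, "->", rp.normalise(raw))
```

In this run the loud first frame initially pushes later values down, and once it falls out of the three-frame window the same quiet input is scaled back up to 1.0, which is the auto-adapting behaviour described above.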

Wire format: ~140 bytes per packet. A SequenceTracker on the receiver counts gaps, so dropped packets are visible in the dashboard. On localhost, the dropped count stays at 0.
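
Gap counting of this kind can be sketched as follows in Python. It assumes each frame carries a monotonically increasing integer sequence number, which the write-up implies but does not spell out; the class name `SequenceTrackerSketch` is hypothetical, and counter wraparound is ignored.

```python
class SequenceTrackerSketch:
    """Count packets that never arrived, based on gaps in sequence numbers."""

    def __init__(self):
        self.expected = None  # next sequence number we expect to see
        self.received = 0
        self.dropped = 0

    def observe(self, seq):
        self.received += 1
        if self.expected is None or seq >= self.expected:
            if self.expected is not None:
                # Every skipped number between expected and seq is a drop.
                self.dropped += seq - self.expected
            self.expected = seq + 1
        # Late or duplicate packets (seq < expected) leave the counters alone.


tracker = SequenceTrackerSketch()
for seq in (0, 1, 2, 5, 6):
    tracker.observe(seq)
print(tracker.received, tracker.dropped)
```

Because UDP may also reorder packets, a late arrival is deliberately not subtracted from the drop count here; a stricter tracker could reconcile those cases.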

Consumer (C++): UdpListener → OscParser → fractal parameter mapping → rendered frame. The C++ side is header-only and modular: drop the headers into any renderer.
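
For readers unfamiliar with OSC, the byte layout a parser has to handle is simple: a null-terminated address padded to a multiple of 4 bytes, a type-tag string padded the same way, then big-endian arguments. The Python sketch below encodes and decodes such a message with the standard library only. The argument list (one int32 sequence number followed by floats) is an assumption; the project's actual field order is not documented here.

```python
import struct


def _osc_string(text):
    """ASCII string, null-terminated and zero-padded to a multiple of 4 bytes."""
    raw = text.encode("ascii") + b"\x00"
    return raw + b"\x00" * (-len(raw) % 4)


def encode_frame(seq, values, address="/audio/frame"):
    """Assumed layout: int32 sequence number followed by float32 features."""
    type_tags = ",i" + "f" * len(values)
    return (_osc_string(address) + _osc_string(type_tags)
            + struct.pack(">i", seq)
            + struct.pack(">%df" % len(values), *values))


def _read_string(data, offset):
    end = data.index(b"\x00", offset)
    text = data[offset:end].decode("ascii")
    offset = end + 1
    offset += -offset % 4  # skip padding up to the next 4-byte boundary
    return text, offset


def decode_frame(data):
    address, offset = _read_string(data, 0)
    type_tags, offset = _read_string(data, offset)
    if not type_tags.startswith(",i") or set(type_tags[2:]) - {"f"}:
        raise ValueError("unexpected type tags: %r" % type_tags)
    (seq,) = struct.unpack_from(">i", data, offset)
    count = len(type_tags) - 2
    values = struct.unpack_from(">%df" % count, data, offset + 4)
    return address, seq, list(values)


packet = encode_frame(7, [0.5] * 20)
print(len(packet), decode_frame(packet)[:2])
```

With 20 float features plus the sequence number, this assumed layout comes to 124 bytes, the same order of magnitude as the ~140 bytes quoted above. Sending it is a single `sendto` call on a UDP socket.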

Infrastructure: The entire project ships as two Docker images orchestrated by a single ./play.sh command. One Ctrl-C stops everything. Works on Linux (PulseAudio Unix socket) and macOS (PulseAudio TCP via BlackHole).

The smoothness metric deserves a special mention. It is derived from the crest factor:

$$ \text{smoothness} = 1 - \log_{c_{\max}}\!\left(\frac{\text{peak}}{\text{RMS}}\right) $$

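Here $c_{\max}$ is the crest factor that maps to zero smoothness. A sharp snare hit (crest $\approx 10$) gives smoothness $\approx 0.0$, which implies $c_{\max} \approx 10$; a pure sine wave (crest $= \sqrt{2}$) then lands near the top of the range, at about $0.85$. This single value tells the renderer whether to show fluid organic forms or sharp geometric spikes.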
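
The formula translates directly into code. The Python sketch below assumes $c_{\max} = 10$ (chosen because a crest factor of about 10 is said to map to zero smoothness), clamps the result to [0, 1], and treats silence as fully smooth; the clamping and the silence rule are assumptions, not documented behaviour.

```python
import math


def smoothness(samples, c_max=10.0):
    """1 - log_{c_max}(peak / RMS), clamped to [0, 1]."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return 1.0  # assumption: silence counts as smooth
    crest = peak / rms  # always >= 1 for a non-silent signal
    return min(1.0, max(0.0, 1.0 - math.log(crest, c_max)))


# Eight full cycles of a sine wave: crest factor sqrt(2).
sine = [math.sin(2 * math.pi * 8 * i / 512) for i in range(512)]
# A single click in 100 samples: crest factor 10.
click = [1.0] + [0.0] * 99

print(round(smoothness(sine), 3), round(smoothness(click), 3))
```

With this choice of $c_{\max}$ the sine scores about 0.85 and the click about 0.0, so sustained and percussive material end up at opposite ends of the scale.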

Challenges

Latency vs. accuracy tradeoff: smaller chunks mean lower latency but a noisier FFT. We settled on 512 samples (~12 ms) as the sweet spot between reactivity and stable feature values.
BPM stability: librosa's tempo estimator needs ~4 s of onset history before it locks on, and half/double-tempo errors are common in the first few seconds. We display a placeholder until it stabilises.
Cross-platform audio routing: Docker can't see Core Audio or WASAPI directly. We bridge through PulseAudio TCP on macOS/Windows, which required careful ALSA → pulse configuration inside the container.
Chroma vs. note: early versions tried to send a single MIDI note name. Real music is made of chords, so the argmax of the chroma vector is meaningless most of the time. We switched to sending all 12 chroma bins and let the renderer decide what to do with harmony.
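
The chunk-size tradeoff from the first challenge above can be made concrete with a little arithmetic. The Python sketch below assumes non-overlapping chunks at 44.1 kHz, in which case the frame rate and the FFT bin width are both the sample rate divided by the chunk size; `chunk_stats` is a helper name made up for this example.

```python
SAMPLE_RATE = 44_100  # Hz, as stated in the write-up


def chunk_stats(chunk_size, sample_rate=SAMPLE_RATE):
    """Return (latency in ms, frames per second, FFT bin width in Hz)."""
    latency_ms = 1000.0 * chunk_size / sample_rate
    frames_per_second = sample_rate / chunk_size
    bin_width_hz = sample_rate / chunk_size
    return latency_ms, frames_per_second, bin_width_hz


for size in (256, 512, 1024, 2048):
    latency, fps, bin_hz = chunk_stats(size)
    print(f"{size:5d} samples: {latency:5.1f} ms, "
          f"{fps:6.1f} frames/s, {bin_hz:5.1f} Hz per bin")
```

For 512 samples this gives about 11.6 ms per chunk and about 86 frames per second, matching the ~12 ms and ~86 Hz figures quoted in this write-up. The bin width of about 86 Hz also shows the cost: the low end of the spectrum is covered by only a handful of bins, and halving the chunk size would halve that resolution again.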

What we learned

Real-time audio feature extraction is as much about normalisation as it is about the math. Raw FFT values are useless without a running-peak normaliser that adapts to the dynamic range of each song.
OSC over UDP is a surprisingly elegant wire protocol for this use case: stateless, low overhead, and trivially parseable in C++.
The most musically meaningful features for visuals are not always the obvious ones. Smoothness (crest factor) and spectral centroid consistently produce more interesting mappings than raw volume alone.

What's next for The eye is listening

Adding features such as scene creation for shows and spectacles would be a great way to immerse users in a full visual and audio experience!

Built With

arch
c++
docker
hlsl
hypr
metal
midi.js
python
sdl
tcp
udp
vulkan

Updates

MrD0nKey Desjardins started this project — May 24, 2026 10:57 AM EDT
