VenIQ

Inspiration

Music playback is static. The room isn't. We wanted to close that loop by giving a venue the ability to sense its own energy and respond without a human constantly managing the queue. The goal was not just face-by-face sentiment averaging but actual scene understanding: is the room packed and moving, or quiet and heads-down?

How We Built It

Two sensing layers run in parallel and feed an audio layer.

Local (MediaPipe, ~60fps): FaceLandmarker tracks 478 facial landmarks (blink rate, smile score, brow furrow) in Lock In mode for single-person focus detection. PoseLandmarker (full model, CPU delegate) tracks up to 6 people's skeletons and detects raised hands in Club mode. Both output natural-language context strings that are passed to Gemini with every frame sent for analysis.
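As an illustration of the Club-mode step, here is a minimal sketch of turning pose landmarks into a context string. It assumes MediaPipe's normalized image coordinates (y grows downward) and the standard 33-point pose indices. The "wrist above shoulder" rule, the visibility cutoff, the function names, and the wording of the string are our assumptions for illustration, not necessarily what VenIQ ships.

```typescript
// Minimal landmark shape. MediaPipe reports normalized image coordinates,
// where y grows downward (0 = top of the frame, 1 = bottom).
interface Landmark {
  x: number;
  y: number;
  visibility?: number;
}

// Indices from the 33-point MediaPipe pose topology.
const LEFT_SHOULDER = 11;
const RIGHT_SHOULDER = 12;
const LEFT_WRIST = 15;
const RIGHT_WRIST = 16;

// Hypothetical rule: a hand counts as raised when a sufficiently visible
// wrist sits above the shoulder on the same side.
function hasRaisedHand(pose: Landmark[], minVisibility = 0.5): boolean {
  const above = (wristIdx: number, shoulderIdx: number): boolean => {
    const wrist = pose[wristIdx];
    const shoulder = pose[shoulderIdx];
    if (!wrist || !shoulder) return false;
    if ((wrist.visibility ?? 1) < minVisibility) return false;
    return wrist.y < shoulder.y; // smaller y means higher in the frame
  };
  return above(LEFT_WRIST, LEFT_SHOULDER) || above(RIGHT_WRIST, RIGHT_SHOULDER);
}

// Builds the natural-language context string sent alongside the frame.
// The wording is illustrative only.
function buildClubContext(poses: Landmark[][]): string {
  const raised = poses.filter((pose) => hasRaisedHand(pose)).length;
  return `${poses.length} people detected, ${raised} with a hand raised.`;
}
```

In a setup like this, `poses` would be the per-person landmark arrays returned by PoseLandmarker for the current video frame.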
Cloud (Gemini 2.5 Flash, every 3s): receives the JPEG and the MediaPipe context, returns { sentiment, energy, description }. Change detection uses |ΔE| ≥ 3 (a shift of at least 3 in the energy score) with a 30-second cooldown. An isAnalyzing guard prevents frame stacking when Gemini is mid-response.
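The change-detection rule and the in-flight guard can be sketched as follows. The threshold and cooldown come from the description above. Everything else is an assumption for illustration: the class and function names, the choice to compare against the energy at the last accepted change (rather than the previous reading), and the field types in `Analysis`.

```typescript
const ENERGY_DELTA_THRESHOLD = 3; // |ΔE| ≥ 3
const COOLDOWN_MS = 30_000; // 30-second cooldown

// Assumed response shape; only the field names come from the write-up.
type Analysis = { sentiment: string; energy: number; description: string };

class ChangeDetector {
  private baseline: number | null = null;
  private lastChangeAt = -Infinity;

  // Returns true when the energy has moved far enough from the baseline
  // and the cooldown since the last accepted change has elapsed.
  shouldChange(energy: number, nowMs: number): boolean {
    if (this.baseline === null) {
      this.baseline = energy; // the first reading only sets the baseline
      return false;
    }
    const delta = Math.abs(energy - this.baseline);
    const cooledDown = nowMs - this.lastChangeAt >= COOLDOWN_MS;
    if (delta >= ENERGY_DELTA_THRESHOLD && cooledDown) {
      this.baseline = energy;
      this.lastChangeAt = nowMs;
      return true;
    }
    return false;
  }
}

// Wraps an async analyzer so a new frame is dropped (null) while an
// earlier request is still pending, instead of stacking up requests.
function createGuardedAnalyzer(
  analyze: (frame: string) => Promise<Analysis>,
): (frame: string) => Promise<Analysis | null> {
  let isAnalyzing = false;
  return async (frame) => {
    if (isAnalyzing) return null;
    isAnalyzing = true;
    try {
      return await analyze(frame);
    } finally {
      isAnalyzing = false; // always release the guard, even on errors
    }
  };
}
```

Releasing the flag in `finally` matters here: if a request fails and the flag is never reset, every later frame would be dropped.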
Audio (Tone.js): two Player nodes run through independent Filter and Volume chains. On track change, a transition is randomly selected (lowpass sweep, highpass sweep, or cut) and both volume nodes ramp simultaneously, so there is no bleed between tracks.
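A rough sketch of the transition logic, written against a small `Rampable` interface rather than Tone.js directly so it stays self-contained. The interface is a stand-in meant to mirror the `rampTo(value, rampTime)` method on Tone.js params and signals. The decibel levels, sweep target frequencies, fade times, and all names are assumed values for illustration.

```typescript
// Stand-in for anything with a Tone.js-style rampTo(value, rampTime).
interface Rampable {
  rampTo(value: number, rampTime: number): unknown;
}

// One playback chain: its volume control and its filter cutoff.
interface Deck {
  volume: Rampable;
  filterFrequency: Rampable;
}

type Transition = "lowpass" | "highpass" | "cut";
const TRANSITIONS: readonly Transition[] = ["lowpass", "highpass", "cut"];

// Picks a transition uniformly; the random source is injectable for testing.
function pickTransition(random: () => number = Math.random): Transition {
  return TRANSITIONS[Math.floor(random() * TRANSITIONS.length)] ?? "cut";
}

const SILENT_DB = -60; // assumed "off" level
const FULL_DB = 0; // assumed "on" level

// Sweeps the outgoing filter (if any), then ramps both volumes over the
// same duration so the two decks never sit at full level together.
function crossfade(outgoing: Deck, incoming: Deck, kind: Transition, seconds = 2): void {
  const fade = kind === "cut" ? 0.05 : seconds; // a cut is a very short ramp
  if (kind === "lowpass") outgoing.filterFrequency.rampTo(200, fade); // sweep cutoff down
  if (kind === "highpass") outgoing.filterFrequency.rampTo(8000, fade); // sweep cutoff up
  outgoing.volume.rampTo(SILENT_DB, fade);
  incoming.volume.rampTo(FULL_DB, fade);
}
```

With Tone.js, each `Deck` would presumably be built from a Volume node's `volume` and a Filter node's `frequency`, with the filter type set to match the chosen sweep before the ramp starts.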

What We Learned

Multimodal AI is surprisingly good at scene understanding. Gemini 2.5 Flash doesn't just see faces. It reads posture, density, lighting, and context. A single prompt can tell the difference between "students quietly typing" and "people jumping and cheering."
MediaPipe runs fast enough to be useful at 60fps in the browser. Running a 478-point face mesh and a 33-point pose skeleton entirely client-side, with no server round-trip, was genuinely impressive.
The Web Audio API has sharp edges. Tone.js abstracts a lot of pain, but getting true gapless crossfades with filter sweeps required careful volume isolation to prevent bleed between the two tracks.