To hear the final audio output of our demo, please click on the Google Drive link and listen to the audio files WITH HEADPHONES!!
Inspiration
Sound is the most underexplored modality in generative AI. Everyone is building image generators, but nobody is building for the ears. We kept asking: what if you could describe a place, or just photograph it, and actually hear it with full spatial dimension? That question became aux.scene.
What it does
aux.scene turns any text description or uploaded image into a fully spatial, interactive audio environment. Type "rainy Tokyo street" or drop a photo of a forest, and an AI pipeline decomposes the scene into individual sound elements, generates each one in parallel, and places them in a 2D space around the listener. You can drag sound sources in real time and hear them move. Abstract inputs like "peace and serenity" are grounded through live web search before generation, so even emotional or conceptual prompts produce accurate, high-quality results.
How we built it
The pipeline has four stages. First, Gemini classifies the input as abstract or concrete. Abstract inputs trigger Google Search grounding to extract real-world sonic descriptors before any audio is generated. Second, Gemini decomposes the scene into 3–4 spatially positioned sound elements with layer, reverb, and mix metadata. Third, all elements are sent to ElevenLabs Sound Effects V2 concurrently via asyncio.gather. Fourth, a custom DSP mixer applies distance attenuation, air-absorption EQ, stereo panning (equal-power law), per-layer noise gating, RMS normalization, and soft limiting before exporting a stereo WAV. The frontend spatial editor runs entirely on the Web Audio API: dragging a source updates PannerNode and GainNode parameters in real time without touching the backend.
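The concurrent third stage can be sketched like this. This is a minimal illustration of the asyncio.gather fan-out, with a stubbed stand-in for the ElevenLabs call; the function names and element schema here are hypothetical, not the project's actual code.

```python
import asyncio

async def generate_element(element: dict) -> bytes:
    """Hypothetical stand-in for one ElevenLabs Sound Effects V2 request.
    The real call would POST the element's prompt to the API and return audio."""
    await asyncio.sleep(0)  # placeholder for network latency
    return element["prompt"].encode()

async def generate_scene(elements: list[dict]) -> list[bytes]:
    # Fire all sound-effect generations concurrently; results come back
    # in the same order as the input elements.
    return await asyncio.gather(*(generate_element(e) for e in elements))

clips = asyncio.run(generate_scene([
    {"prompt": "rain on pavement"},
    {"prompt": "distant traffic"},
    {"prompt": "footsteps"},
]))
```

Because the elements are independent, total generation time is bounded by the slowest request rather than the sum of all of them.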
Challenges we ran into
Seamless looping was the hardest audio problem: ElevenLabs clips are short, and tiling them creates audible seams. We built a crossfade loop algorithm that blends each clip's tail into its own head before tiling. Noise gating was equally tricky: sparse sounds like seagulls need aggressive gating, while continuous beds must never be gated. We implemented per-layer RMS envelope detection with hold time and exponential attack/release curves. On the AI side, getting Gemini to extract sonic descriptors (not visual ones) from web search results required significant prompt engineering.
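The tail-into-head crossfade can be sketched like this. This is a minimal mono version with numpy, assuming an equal-power fade; the actual mixer handles stereo and more.

```python
import numpy as np

def crossfade_loop(clip: np.ndarray, fade_len: int, n_tiles: int) -> np.ndarray:
    """Blend the clip's tail into its own head, then tile the result.
    Because each tile ends exactly where the next one's blended head
    begins, the seam between repeats is inaudible."""
    head, tail = clip[:fade_len], clip[-fade_len:]
    t = np.linspace(0.0, 1.0, fade_len)
    fade_in, fade_out = np.sqrt(t), np.sqrt(1.0 - t)  # equal-power crossfade
    blended_head = head * fade_in + tail * fade_out
    # One loop body = blended head + untouched middle; the raw tail is
    # consumed by the blend, so tiling the body produces no seam.
    body = np.concatenate([blended_head, clip[fade_len:-fade_len]])
    return np.tile(body, n_tiles)
```

Each loop body is fade_len samples shorter than the source clip, which is the price of folding the tail into the head.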
Accomplishments that we're proud of
The abstract-to-concrete grounding pipeline is the contribution we're most proud of: it's not API plumbing, it's a genuine solution to a real limitation of text-to-audio generation. The spatial drag demo is the other: put on headphones, drag a wind chime from left to right, hear it cross. That moment lands every time. We're also proud of the DSP chain: the noise gate, layer-aware EQ, and seamless looping were all built from scratch.
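The per-layer noise gate mentioned above can be sketched as follows. This is an illustrative mono implementation of RMS envelope detection with hold time and exponential attack/release; all parameter values are guesses for illustration, not the project's tuned settings.

```python
import math
import numpy as np

def noise_gate(x: np.ndarray, sr: int, threshold: float = 0.02,
               hold_ms: float = 50.0, attack_ms: float = 5.0,
               release_ms: float = 120.0, win: int = 256) -> np.ndarray:
    """Gate quiet stretches of a mono signal using a block-RMS envelope.
    The gate opens fast (attack), stays open for hold_ms after the signal
    drops below threshold, then closes slowly (release)."""
    # Block RMS envelope, repeated back up to sample rate.
    pad = (-len(x)) % win
    blocks = np.pad(x, (0, pad)).reshape(-1, win)
    rms = np.sqrt((blocks ** 2).mean(axis=1))
    env = np.repeat(rms, win)[:len(x)]

    atk = math.exp(-1.0 / (sr * attack_ms / 1000.0))
    rel = math.exp(-1.0 / (sr * release_ms / 1000.0))
    hold = int(sr * hold_ms / 1000.0)

    gain = np.empty(len(x))
    g, held = 0.0, 0
    for i, e in enumerate(env):
        if e >= threshold:
            target, held = 1.0, hold       # signal present: open, reset hold
        elif held > 0:
            target, held = 1.0, held - 1   # hold the gate open briefly
        else:
            target = 0.0                   # silence: close the gate
        coef = atk if target > g else rel
        g = coef * g + (1.0 - coef) * target  # exponential smoothing
        gain[i] = g
    return x * gain
```

Sparse layers (seagulls) would get a low threshold and short hold, while continuous beds would bypass the gate entirely, matching the per-layer behavior described above.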
What we learned
Spatial audio perception depends on three independent cues: level (distance), pan (angle), and reverb (room size). All three must update together for the illusion to hold. We also learned that LLMs are surprisingly effective sound designers when given tight output schemas and domain-specific constraints. And practically: audio is the most memorable hackathon demo modality. Nobody forgets hearing a sound move around their head.
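The three-cue lesson can be made concrete with a small sketch that maps a 2D source position to all three parameters at once. This is a hypothetical mapping for illustration (inverse-distance level, equal-power pan from azimuth, reverb send growing with distance), not the project's exact curves.

```python
import math

def spatial_params(x: float, y: float, ref_dist: float = 1.0):
    """Map a source position (listener at origin, +y straight ahead)
    to (left_gain, right_gain, reverb_send). All three cues derive
    from the same position, so they always update together."""
    dist = max(math.hypot(x, y), ref_dist)
    level = ref_dist / dist                      # cue 1: distance attenuation
    angle = math.atan2(x, y)                     # azimuth, 0 = straight ahead
    pan = max(-1.0, min(1.0, angle / (math.pi / 2)))
    left = math.cos((pan + 1.0) * math.pi / 4)   # cue 2: equal-power pan law
    right = math.sin((pan + 1.0) * math.pi / 4)
    wet = 1.0 - level                            # cue 3: more reverb farther away
    return level * left, level * right, wet
```

Deriving all three cues from one position is what keeps the illusion coherent: moving a source can never change its pan without also changing its level and reverb.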
What's next for aux.scene
Export directly to Unity and Unreal audio formats with spatial metadata intact. A timeline editor for staggered sound entry and scene transitions. Multi-room scene graphs where moving between zones changes the ambient mix. And a filmmaker-specific mode that matches ambience to video frames in real time: upload a scene, get a temp track that moves with the picture.