Inspiration
EchoVision was inspired by my personal experience as a live sound engineer. I've worked in many venues where sound system decisions were made by guesswork; only after installation did problems like dead zones and poor coverage become obvious, and fixing them meant costly rework. I wanted a way to understand a space before touching any equipment, using visuals instead of trial and error. That desire to make sound planning clearer, faster, and more human is what led to EchoVision.
What it does
EchoVision is a multimodal AI-powered sound planning and visualization platform. Users upload photos of a venue, and EchoVision transforms them into interactive 2D floor plans, 3D spatial models, and VR walkthroughs.
Instead of producing a wall of text, the system generates structured spatial data—positions, angles, speaker roles, and room dimensions—allowing users to see how sound behaves and identify coverage issues before deployment.
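As a rough sketch, that structured data takes a shape like the following (the TypeScript types and field names here are simplified for illustration, not EchoVision's exact schema):

```typescript
// Illustrative shape of the structured spatial output (not the exact schema)
interface VenueAnalysis {
  room: { widthM: number; depthM: number; heightM: number }; // room dimensions in meters
  speakers: Array<{
    role: "main" | "fill" | "sub" | "delay";        // speaker role in the system
    position: { x: number; y: number; z: number };  // meters from the front-left corner
    aim: { yawDeg: number; pitchDeg: number };      // horizontal and vertical angles
  }>;
  coverageIssues: Array<{
    area: string;                                   // e.g. "rear-left seating"
    severity: "low" | "medium" | "high";
    note: string;                                   // suggested fix
  }>;
}
```

Because every view consumes the same typed payload, a coverage issue flagged by the model shows up consistently in the 2D plan, the 3D scene, and the VR walkthrough.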
How EchoVision was built
EchoVision was built with Next.js, React Three Fiber, Three.js, and WebXR.
At the core is Google Gemini, which performs multimodal image analysis and spatial reasoning. Gemini interprets venue images to infer geometry, speaker placement, and acoustic problem areas. Its structured output feeds directly into:
- 2D Schematic View: For layout clarity and quick overhead planning.
- 3D Environment: A full-scale digital twin for spatial accuracy using Three.js.
- VR Experience: A WebXR-powered walkthrough that places users inside the venue at human height.
All AI processing is handled securely server-side to protect API keys and ensure scalability.
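A minimal sketch of that server-side flow, assuming the official @google/generative-ai SDK and a Next.js App Router route (the model name, prompt, route path, and env var name are illustrative):

```typescript
// app/api/analyze/route.ts — runs server-side, so the API key never reaches the client
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!); // key from server env only

export async function POST(req: Request) {
  const { images } = await req.json(); // base64-encoded venue photos from the upload step

  const model = genAI.getGenerativeModel({
    model: "gemini-1.5-pro",
    // Ask for JSON so the response parses straight into the spatial data structures
    generationConfig: { responseMimeType: "application/json" },
  });

  const result = await model.generateContent([
    "Analyze these venue photos. Return room dimensions in meters, " +
      "recommended speaker positions and angles, and likely coverage " +
      "problem areas as JSON.",
    ...images.map((data: string) => ({
      inlineData: { data, mimeType: "image/jpeg" },
    })),
  ]);

  // The parsed JSON drives the 2D, 3D, and VR views downstream
  return Response.json(JSON.parse(result.response.text()));
}
```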
Challenges I ran into
One major challenge was translating AI reasoning into physically accurate 3D space. Aligning coordinate systems, camera logic, and VR entry points required careful iteration. Another was working within Gemini's limits on exact acoustic physics, which meant scoping EchoVision as a planning assistant rather than a replacement for real-world measurement tools.
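A minimal sketch of the kind of coordinate mapping involved (the conventions here are assumptions for illustration, not EchoVision's exact ones):

```typescript
import * as THREE from "three";

// Hypothetical plan coordinates: meters, origin at the venue's front-left
// corner, x running right and y running from the stage toward the back.
interface PlannedSpeaker {
  x: number;
  y: number;
  heightM: number;
  yawDeg: number; // 0 = facing the back of the room
}

// Three.js uses a right-handed space with y up and -z "forward", so the
// plan's y axis maps onto -z; getting a sign wrong here mirrors the venue.
function toWorld(s: PlannedSpeaker): THREE.Object3D {
  const obj = new THREE.Object3D();
  obj.position.set(s.x, s.heightM, -s.y);
  obj.rotation.y = THREE.MathUtils.degToRad(s.yawDeg);
  return obj;
}

// The VR entry point sits at average ear height so the walkthrough reads
// at human scale rather than as a floating overhead view.
const EYE_HEIGHT_M = 1.7;
function placeViewer(camera: THREE.PerspectiveCamera, entry: { x: number; y: number }): void {
  camera.position.set(entry.x, EYE_HEIGHT_M, -entry.y);
}
```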
Accomplishments that I'm proud of
- Visual Synthesis: Successfully turning flat venue photos into usable 3D spatial models.
- Seamless Flow: Creating a unified pipeline from AI reasoning to 2D, 3D, and VR visualization.
- Advanced AI Usage: Using Gemini not just for answers, but as a reasoning engine that identifies issues and recommends fixes with cost awareness.
What I learned
I learned how powerful multimodal AI becomes when paired with visualization. Gemini’s ability to reason across images enabled a new workflow where AI feels like a teammate, not a chatbot. I also gained deeper insight into spatial computing, WebXR constraints, and how engineers naturally interpret space when making critical sound decisions.
Panoramic images introduced unexpected complexity. Early prototypes attempted to use 360° panoramic photos for single-shot venue capture. However, Gemini's spatial analysis struggled with the geometric distortion inherent in panoramic projections, leading to inaccurate dimension estimates and speaker placement. Switching to multiple standard-view images (stage/front, left, right, back/ceiling) proved more reliable.
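A rough sketch of how the multi-view capture can be labeled for the model (the view names follow the list above; the types and function are illustrative):

```typescript
// The four standard views that proved reliable; labeling each photo with its
// vantage point gives the model orientation cues a distorted panorama lacked.
const REQUIRED_VIEWS = ["stage-front", "left-wall", "right-wall", "back-ceiling"] as const;
type VenueView = (typeof REQUIRED_VIEWS)[number];

interface VenueCapture {
  view: VenueView;
  imageBase64: string;
  mimeType: "image/jpeg" | "image/png";
}

// Interleave a text label before each image part so the model can
// associate geometry across views when estimating dimensions.
function toParts(captures: VenueCapture[]) {
  return captures.flatMap((c) => [
    { text: `View: ${c.view}` },
    { inlineData: { data: c.imageBase64, mimeType: c.mimeType } },
  ]);
}
```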
Complex venues exposed model limitations. Non-rectangular halls, curved walls, and multi-tier seating arrangements challenged Gemini's ability to infer accurate spatial models. These edge cases highlighted the importance of user verification and the need for future improvements in handling architectural complexity.
What's next for EchoVision
The roadmap includes:
- Frequency-Specific Modeling: Visualizing bass build-up versus high-end clarity.
- AR Speaker Placement: On-site augmented reality guidance for technicians.
- Complex Venue Support: Specialized modeling for balconies and tiered seating.
- Exportable Technical Reports: Automated PDF documentation for venue stakeholders.