Inspiration
We want to help people detect sounds better, whether to protect themselves from danger or to locate the source of a sound. This project addresses accessibility needs for hearing-impaired individuals and situational awareness for detecting critical environmental sounds like sirens, dog barks, explosions, and gunshots in real time. The goal is to create a low-latency system that not only classifies sounds but also estimates their direction, providing comprehensive audio intelligence for safety and accessibility applications. We also envision military-grade applications where enhanced detection capabilities are needed for very faint sounds.
The Problem: Over 47 million Americans miss emergency alerts because of hearing loss. 1 in 5 deaf people report missing emergency alerts. Imagine a deaf person crossing a busy road when an ambulance approaches—they hear nothing, keep walking, and are at risk. Our system changes that by making sounds visible instantly.
What it does
The Low-Latency Sound Disambiguator is a real-time Sound Alert & Direction Detection System that integrates deep learning, signal processing, and generative AI to perform:
- Real-time sound classification using Google's YAMNet model, recognizing 521 environmental sound classes with confidence scores
- Sound direction estimation using TDOA (Time Difference of Arrival) algorithms, calculating 2D sound angles using cross-correlation
- Simulated stereo mode for mono microphones, enabling direction detection even on devices without hardware stereo support (such as a MacBook)
- AI-generated natural summaries using the open-source Mistral model via Ollama, providing contextual insights about detected sounds
- Interactive Streamlit dashboard with dark/light themes, visual alerts, detection history, and confidence trend analysis
The system processes audio in 3-second chunks at 16kHz sample rate, providing near real-time feedback with visual alerts and directional compass visualization. When danger sounds are detected, the system displays colored screen flashes, custom alert images, giant onomatopoeia text (like "WOO WOO!" for sirens, "BOW WOW!" for dog barks, "WAA WAA!" for baby cries), and directional arrows using 2D TDOA spatial recognition. This same technology can extend to car horn direction detection while driving, providing quick visual indicators for safer navigation.
Privacy & Performance: Everything runs locally—no cloud, no internet required, no privacy risks. This ensures maximum speed, security, and complete data privacy.
How we built it
Technology Stack
- Frontend: Streamlit for interactive web UI
- Classification: TensorFlow Hub (YAMNet) for sound recognition
- Signal Processing: SciPy/NumPy for audio resampling and cross-correlation
- Direction Estimation: Custom TDOA implementation using cross-correlation
- AI Summary: Ollama + Mistral for local LLM-based summaries
- Visualization: Matplotlib for polar direction plots and confidence charts
Core Algorithms
TDOA (Time Difference of Arrival) Calculation:
The system uses cross-correlation to compute the time difference between two audio channels:
\( \text{TDOA} = \frac{\text{lag}_{\max}}{f_s} \)
where \( \text{lag}_{\max} \) is the lag at maximum correlation and \( f_s \) is the sample rate (16kHz).
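A minimal sketch of this step, assuming NumPy/SciPy and hypothetical function and variable names (the project keeps its actual implementation in tdoa_utils.py):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_tdoa(left: np.ndarray, right: np.ndarray, fs: int = 16000) -> float:
    """Estimate the time difference of arrival (seconds) between two channels."""
    # Full cross-correlation between the two channels
    corr = correlate(left, right, mode="full")
    lags = correlation_lags(len(left), len(right), mode="full")
    # Lag (in samples) at the correlation peak, converted to seconds
    lag_max = lags[np.argmax(corr)]
    return lag_max / fs
```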
Angle Estimation:
The direction angle is calculated from TDOA using the speed of sound:
\( \theta = \arcsin\left(\frac{c \cdot \text{TDOA}}{d}\right) \)
where:
- \( c = 343 \text{ m/s} \) (speed of sound)
- \( d \) is the microphone spacing (configurable, default 0.15m)
- \( \theta \) is the estimated angle, converted from radians to degrees for display
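A corresponding sketch of the angle conversion, including the ratio clipping mentioned under Challenges (names are illustrative; the constants match the defaults above):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def tdoa_to_angle(tdoa: float, mic_spacing: float = 0.15) -> float:
    """Convert a TDOA (seconds) into an arrival angle in degrees."""
    ratio = SPEED_OF_SOUND * tdoa / mic_spacing
    # Clip to [-1, 1] so arcsin stays defined when noise or reverb inflates the TDOA
    ratio = np.clip(ratio, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```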
Audio Processing Pipeline:
Audio Capture (sounddevice)
→ Preprocessing (resampling to 16kHz)
→ YAMNet Classification (TensorFlow Hub)
→ Confidence & Label Extraction
→ TDOA Direction Estimation (if stereo available)
→ UI Display (Streamlit)
→ AI Summary Generation (Ollama/Mistral)
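A condensed sketch of the capture-and-classify stages of this pipeline, assuming sounddevice for capture and YAMNet from TensorFlow Hub with 3-second chunks at 16 kHz; the function layout and the danger-label strings are illustrative and only approximate the AudioSet display names:

```python
import csv
import numpy as np
import sounddevice as sd
import tensorflow_hub as hub

FS = 16000           # YAMNet expects 16 kHz mono float32 audio
CHUNK_SECONDS = 3
DANGER_SOUNDS = {"Siren", "Dog", "Fire alarm", "Smoke detector, smoke alarm",
                 "Explosion", "Gunshot, gunfire", "Crying, sobbing", "Screaming", "Thunder"}

model = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet ships its 521-class map as a CSV (index, mid, display_name)
with open(model.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

while True:
    # Record one chunk (stereo if the device supports it)
    audio = sd.rec(int(CHUNK_SECONDS * FS), samplerate=FS, channels=2)
    sd.wait()
    mono = audio.mean(axis=1).astype(np.float32)

    # YAMNet returns per-frame scores; average them for a chunk-level label
    scores, _, _ = model(mono)
    mean_scores = scores.numpy().mean(axis=0)
    top = int(np.argmax(mean_scores))
    label, confidence = class_names[top], float(mean_scores[top])

    if label in DANGER_SOUNDS:
        print(f"ALERT: {label} ({confidence:.0%})")  # the UI layer flashes and draws arrows here
```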
Key Features
- Stereo Detection: Automatically detects hardware stereo support, falls back to mono with simulated stereo (2-6ms micro-delay)
- Real-time Processing: Continuous audio chunk processing with configurable duration (1-5 seconds)
- Alert System: Monitors 9 critical sound types (Siren, Dog bark, Fire alarm, Smoke alarm, Explosion, Gunshot, Crying, Scream, Thunder)
- Visual Alerts: Colored screen flashes, custom alert images, and giant onomatopoeia text for instant recognition
- History Logging: Maintains rolling detection history with timestamps and confidence scores
- Theme Support: Dark/light mode toggle for user preference
- Edge Case Handling: If no stereo microphone is available, the system automatically simulates one with a small inter-channel delay (see the sketch after this list); errors surface as clear warnings (e.g., "Stereo not available")
- 100% Local Processing: All computation happens locally—no cloud, no internet, maximum privacy and speed
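A small sketch of the stereo detection and fallback described above, assuming sounddevice's query_devices for the hardware check; the helper names are hypothetical, and the delay value follows the 2-6 ms range from the feature list:

```python
import numpy as np
import sounddevice as sd

def has_stereo_input() -> bool:
    """Check whether the default input device exposes at least two channels."""
    return sd.query_devices(kind="input")["max_input_channels"] >= 2

def simulate_stereo(mono: np.ndarray, fs: int = 16000, delay_ms: float = 4.0):
    """Fake a second channel by delaying the mono signal a few milliseconds."""
    shift = int(fs * delay_ms / 1000)
    right = np.concatenate([np.zeros(shift, dtype=mono.dtype), mono[:-shift]])
    return mono, right
```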
Challenges we ran into
Microphone Quality and Hardware Limitations: We wanted better quality microphones and were looking for something with Raspberry Pi compatibility, but we didn't find suitable options. This led us to use laptop microphones, which limited us to 2D detection since we only had stereo microphones. The direction detection was also not completely accurate due to reverb and environmental noise affecting the TDOA calculations. Noisy rooms can confuse the system, though better microphones would significantly improve accuracy. USB stereo arrays could cut latency by 50% compared to laptop mics.
Real-time Latency: Processing 3-second audio chunks through YAMNet inference while maintaining responsive UI required optimization of the audio pipeline and efficient use of Streamlit's caching mechanisms for model loading. Current 3-second latency works well for sirens but is too slow for fast dangers like gunshots. Future optimization with TensorFlow Lite could reduce this to sub-second response times.
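One piece of that optimization is keeping the model resident between Streamlit reruns; a sketch assuming Streamlit's cache_resource decorator (available in recent Streamlit releases):

```python
import streamlit as st
import tensorflow_hub as hub

@st.cache_resource  # load YAMNet once per session instead of on every rerun
def load_yamnet():
    return hub.load("https://tfhub.dev/google/yamnet/1")

model = load_yamnet()
```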
YAMNet Live Stream Integration: One of our main struggles was getting YAMNet working for live stream data. The model was designed for batch processing, and adapting it to work with continuous real-time audio streams required careful handling of audio chunking, preprocessing, and inference timing.
TDOA Accuracy: The accuracy of direction estimation depends on proper microphone spacing calibration and environmental factors. We implemented configurable mic spacing parameters and clipping of angle ratios to valid ranges (\( -1 \leq \text{ratio} \leq 1 \)).
AI Summary Integration: Integrating Ollama with the Streamlit app required subprocess management and timeout handling to ensure the UI remained responsive during LLM inference.
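A sketch of that subprocess-with-timeout pattern, assuming the Ollama CLI is installed and the mistral model has been pulled; the prompt text and function name are illustrative:

```python
import subprocess

def summarize(events: list[str], timeout_s: int = 20) -> str:
    """Ask the local Mistral model (via Ollama) for a short summary of recent detections."""
    prompt = "Summarize these detected sounds for a hearing-impaired user: " + ", ".join(events)
    try:
        result = subprocess.run(
            ["ollama", "run", "mistral", prompt],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return "Summary unavailable (model timed out)."
```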
Accomplishments that we're proud of
Successful Integration: Successfully integrated three distinct technologies (deep learning classification, signal processing for direction estimation, and generative AI for summarization) into a cohesive real-time system.
Accessibility Focus: Created a practical tool that addresses real-world accessibility needs, with potential applications for hearing-impaired individuals and safety monitoring.
Low-Latency Design: Achieved near real-time processing with minimal delay between audio capture and classification results, processing audio chunks in approximately 1-2 seconds.
Clean Architecture: Developed a modular codebase with separate utilities for TDOA calculations (tdoa_utils.py), making the system maintainable and extensible.
User Experience: Created an intuitive Streamlit interface with multiple visualization modes (live detection, history, confidence trends, AI summaries) and theme customization.
Open-Source Integration: Successfully integrated open-source tools (Ollama, Mistral) for local AI summarization, ensuring privacy and avoiding cloud API dependencies.
Privacy-First Architecture: Built entirely for local processing—no cloud dependencies, no data transmission, complete privacy and security. This local-first approach also provides faster response times compared to cloud-based solutions.
What we learned
Signal Processing Fundamentals: Deepened understanding of cross-correlation, TDOA algorithms, and audio signal processing techniques for direction estimation.
Real-time ML Inference: Learned to optimize TensorFlow model inference for real-time applications, including proper audio preprocessing and batch processing.
Streamlit Development: Gained expertise in building interactive dashboards with Streamlit, including state management, caching, and real-time updates.
Audio Hardware Limitations: Discovered the challenges of working with consumer-grade audio hardware. Initially, we thought our laptops only supported mono input, but learned they actually support stereo (2-channel) input, which enabled our direction detection capabilities.
Mathematical Modeling: Applied physics principles (speed of sound, trigonometry) to convert time differences into spatial angles, with proper handling of edge cases and numerical stability.
LLM Integration: Learned to integrate local LLM models (Ollama) into Python applications, including subprocess management and prompt engineering for contextual summaries.
System Architecture: Understood the importance of modular design, separating concerns (audio processing, ML inference, UI) for maintainability and testing.
What's next for Low-latency Sound Disambiguator
Beamforming for Simultaneous Sound Direction: Build a system that can show the direction of all different sounds simultaneously using beamforming techniques. This should be integrated as a mobile app or, ideally, as an AR application with platforms like Google Lens, featuring a clean UI that makes it easy for users to spot detected sounds. Additionally, implement amplitude-based weighting so that louder or closer sounds appear more prominent in the visualization.
Sub-Second Latency: Reduce current 3-second latency to sub-second response times using edge AI (TensorFlow Lite) for faster alerts, especially critical for fast dangers like gunshots.
Expanded Sound Detection: Add car horns, door knocks, and other custom sounds via custom YAMNet model training on datasets like AudioSet, expanding to 1000+ sound classes.
3D Spatial Localization: Upgrade from 2D to full 3D direction detection using 3-4 microphone arrays with multilateration techniques. Implement using libraries like Pyroomacoustics to calculate elevation angles, enabling detection of sounds above (e.g., drones) or below (e.g., children). This addresses the current limitation of only 2D detection—for 3D, we need a third microphone in a triangle array configuration.
Mobile App with Haptic Feedback: Develop a Flutter-based mobile app (Android/iOS) with haptic vibration alerts for smartwatches and wearables, enabling pocket alerts and on-the-go sound detection.
Smart Home Integration: Pair with smart home systems to notify via lights, vibrating floors, or other IoT devices for multi-sensory alerts.
AR Glasses Overlay: Integrate with AR platforms to show sound alerts and direction indicators in real-world view through AR glasses.
User Training Feature: Add capability for users to "teach" the app custom sounds (e.g., their dog's bark, specific doorbell) through user training and fine-tuning.
Commercial Earpiece Product: Develop a portable, battery-powered earpiece device with Bluetooth connectivity to phones, partnering with deaf organizations for real-user testing and validation.
Database Integration: Implement SQLite database for persistent event logging, enabling long-term analytics and pattern detection in sound events.
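A possible minimal schema for this planned logging (column names and the example row are illustrative only):

```python
import sqlite3

conn = sqlite3.connect("detections.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS detections (
           ts         TEXT,   -- ISO-8601 timestamp
           label      TEXT,   -- YAMNet class name
           confidence REAL,   -- 0..1 score
           angle_deg  REAL    -- estimated direction, NULL if mono
       )"""
)
# Example row showing the intended shape of a logged event
conn.execute("INSERT INTO detections VALUES (?, ?, ?, ?)",
             ("2025-01-01T12:00:00", "Siren", 0.87, -35.0))
conn.commit()
```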
Voice/TTS Feedback: Add text-to-speech capabilities to provide audio feedback for detected alerts, enhancing accessibility for visually impaired users.
Analytics Dashboard: Build comprehensive analytics dashboard with SQLite integration, showing detection trends, frequency analysis, and environmental sound patterns over time.
Custom Alert Configuration: Allow users to configure custom alert sounds and sensitivity thresholds based on their specific needs.
Multi-language Support: Extend AI summaries to support multiple languages for global accessibility.
