Inspiration

We were inspired by the idea of enabling machines to understand the world the way humans do: through sound. From emergency sirens to dog barks, environmental audio carries rich information for safety systems, smart cities, and assistive technologies. The ESC-50 dataset presented the perfect challenge: 50 classes of environmental sound to recognize within a hackathon's limited timeframe.

What it does

The Massive Audio Classifier takes in a short audio clip and identifies the most likely environmental sound class from a set of 50 predefined categories. It uses a pretrained model (YamNet) for feature extraction and a custom neural network classifier to make predictions. The model is also deployed as an interactive web app where users can upload their own .wav files for real-time classification.
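
Under the hood, the flow is: embed the clip with YamNet, pool the per-frame embeddings into one vector, and run that vector through the trained head. Here is a minimal sketch of that inference path; the mean-pooling step and the `classify_clip` helper are our illustration, since the writeup doesn't pin down the pooling strategy:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# YamNet expects mono float32 audio at 16 kHz, scaled to [-1.0, 1.0].
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def classify_clip(waveform_16k: np.ndarray, head: tf.keras.Model) -> int:
    """Return the predicted ESC-50 class index for one clip."""
    # YamNet returns (scores, embeddings, log_mel_spectrogram);
    # embeddings has shape (num_frames, 1024).
    _, embeddings, _ = yamnet(waveform_16k)
    # Collapse per-frame embeddings to one clip-level vector
    # (mean-pooling is our assumption, not stated in the writeup).
    clip_vec = tf.reduce_mean(embeddings, axis=0, keepdims=True)
    probs = head(clip_vec, training=False)  # (1, 50) softmax scores
    return int(tf.argmax(probs, axis=-1)[0])
```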

How we built it

  • We used the ESC-50 dataset: 2,000 labeled 5-second environmental audio clips, 40 per class across the 50 categories.
  • Each clip was processed through YamNet, a pretrained audio model from Google that outputs a 1024-dimensional embedding for each audio frame.
  • We built a compact feedforward neural network on top of those embeddings and trained it on the ESC-50 labels.
  • Training and evaluation were done in TensorFlow/Keras (a condensed version of the pipeline is sketched after this list).
  • We built a web app using Streamlit that accepts user-uploaded audio and displays classification results (also sketched below).
  • All of this was completed in under 8 hours as part of a hackathon.
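
A condensed sketch of the training pipeline described above. ESC-50 really does ship an esc50.csv metadata file with filename, fold, and target columns, and YamNet does require 16 kHz mono input; the layer sizes, dropout rate, epoch count, mean-pooling, and fold-5 split below are our assumptions:

```python
import numpy as np
import pandas as pd
import librosa
import tensorflow as tf
import tensorflow_hub as hub

SAMPLE_RATE = 16_000              # YamNet expects 16 kHz mono input
CLIP_SAMPLES = 5 * SAMPLE_RATE    # ESC-50 clips are 5 s long

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def embed_file(path: str) -> np.ndarray:
    """Resample/pad one clip and return a single 1024-d YamNet embedding."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    wav = librosa.util.fix_length(wav, size=CLIP_SAMPLES)  # pad/trim to 5 s
    _, embeddings, _ = yamnet(wav)          # (num_frames, 1024)
    return embeddings.numpy().mean(axis=0)  # mean-pool over frames (our choice)

meta = pd.read_csv("ESC-50-master/meta/esc50.csv")
X = np.stack([embed_file(f"ESC-50-master/audio/{name}") for name in meta.filename])
y = meta.target.to_numpy()

# Compact feedforward head on top of the frozen embeddings.
head = tf.keras.Sequential([
    tf.keras.Input(shape=(1024,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(50, activation="softmax"),
])
head.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])

# ESC-50's fold column gives a ready-made held-out split (fold 5 here).
train = (meta.fold != 5).to_numpy()
head.fit(X[train], y[train], epochs=30, batch_size=32,
         validation_data=(X[~train], y[~train]))
head.save("classifier_head.keras")
```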
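
And a minimal sketch of the Streamlit front end. Only the upload-and-classify flow comes from the project; the classifier_head.keras filename and the widget layout are illustrative:

```python
# app.py -- run with: streamlit run app.py
import librosa
import numpy as np
import streamlit as st
import tensorflow as tf
import tensorflow_hub as hub

@st.cache_resource  # load the heavy models once per server process
def load_models():
    yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
    head = tf.keras.models.load_model("classifier_head.keras")
    return yamnet, head

yamnet, head = load_models()

st.title("Massive Audio Classifier")
uploaded = st.file_uploader("Upload a .wav clip", type=["wav"])
if uploaded is not None:
    # librosa accepts file-like objects, so the upload can be read directly.
    wav, _ = librosa.load(uploaded, sr=16_000, mono=True)
    _, embeddings, _ = yamnet(wav)
    clip_vec = embeddings.numpy().mean(axis=0, keepdims=True)  # (1, 1024)
    probs = head.predict(clip_vec)[0]                          # (50,)
    st.audio(uploaded)
    st.write(f"Predicted class index: {int(np.argmax(probs))} "
             f"(confidence {probs.max():.1%})")
```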

Challenges we ran into

  • Preprocessing thousands of audio files quickly, including consistent resampling and padding.
  • Handling class imbalance and evaluating model generalization on a small test set.
  • Integrating the embedding pipeline with real-time user input in the web app.
  • Interpreting noisy predictions and debugging issues with metric reporting (e.g., confusion matrix dimensionality).
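
On the confusion-matrix dimensionality point: when a small test split is missing some classes entirely, the matrix silently comes back smaller than 50x50 unless the label set is pinned. A small sketch with scikit-learn (our choice of library; the toy y_true/y_pred arrays are stand-ins for real test labels and predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy stand-ins for test-set labels and model predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 50, size=400)
y_pred = rng.integers(0, 50, size=400)

# Without labels=, classes absent from this split are silently dropped
# and the matrix comes back smaller than 50x50.
cm = confusion_matrix(y_true, y_pred, labels=np.arange(50))
assert cm.shape == (50, 50)
```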

Accomplishments that we're proud of

  • Achieved strong classification performance (>XX% test accuracy) using only a few dense layers and transfer learning.
  • Built a working web app that handles real-time audio predictions.
  • Visualized results effectively with training curves and confusion matrices.
  • Went from raw dataset to deployable prototype in under 8 hours.

What we learned

  • The power of transfer learning, especially in domains like audio where feature engineering is complex.
  • How to work efficiently with large audio datasets and embeddings in a hackathon setting.
  • Practical experience deploying ML models in web apps using Streamlit.
  • Tricks to debug and evaluate multiclass classifiers effectively under time pressure.

What's next for Massive Audio Classifier

  • Deploy the web app to Hugging Face Spaces or Streamlit Cloud.
  • Improve model accuracy using data augmentation and ensemble methods.
  • Add live microphone input and streaming predictions.
  • Expand to multilingual speech or musical instrument recognition.
  • Create a mobile app that listens and classifies in real time for accessibility or safety use cases.

Built With

python, tensorflow, keras, yamnet, streamlit