🎵 VoiceTell: AI-Powered Audio Classification 🗣️

Created by: Gaurav Chaudhary
GitHub repo: https://github.com/ANAMASGARD/VoiceTell.git

✨ Inspiration

Many AI products today are thin layers over pre-trained, black-box models. VoiceTell grew out of the challenge to build a transparent machine learning solution from the ground up: train a custom model, integrate it into a robust web application, and keep control of the entire ML pipeline. That matches the "Build Real ML Web Apps: No Wrappers, Just Real Models" hackathon philosophy. We aimed to create a solution that not only works but also clearly explains how it works, with an emphasis on reproducibility and deep technical understanding.

🎯 What it Does

VoiceTell is a web application that lets users upload audio files and receive a classification of their content within moments. Whether the task is identifying sounds in an environment, categorizing audio snippets for content management, or performing simple speech analysis, VoiceTell provides the core AI engine for it.

Upon uploading an audio file, VoiceTell:

  1. Processes the raw audio into a visual representation called a Mel-spectrogram.
  2. Feeds this spectrogram into a custom-trained Convolutional Neural Network (CNN).
  3. Returns a real-time classification prediction with associated confidence scores.

The user interface also provides interactive visualizations of the audio waveform and the processed spectrogram, offering transparency into what the model "sees."
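
To make step 1 a little more concrete, the sketch below shows the Hz-to-mel mapping that a Mel-spectrogram's frequency axis is built on. It uses the HTK-style formula (the default in Torchaudio; Librosa's default is the Slaney variant) and only the standard library. The band count, frequency range, and function names are illustrative assumptions, not values taken from VoiceTell's code.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels triangular filters need n_mels + 2 edge points; the centers are
    # the interior points.
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]


if __name__ == "__main__":
    # Assumed values for illustration: 8 bands between 0 Hz and 8 kHz.
    for center in mel_band_centers(n_mels=8, f_min=0.0, f_max=8000.0):
        print(f"{center:8.1f} Hz")
```

The bands sit close together at low frequencies and spread out at high frequencies, which is why a Mel-spectrogram gives the CNN more resolution where most perceptually relevant detail lives.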

🧠 How it's Built: Our Authentic ML Approach

VoiceTell is a full-stack application built with a strict "no pre-trained LLMs or wrappers" policy: every prediction comes from a model trained from scratch for this project.

🏗️ Architecture Overview

The system operates with distinct frontend and backend components:

graph TB
    subgraph "Frontend (Next.js + React)"
        A[🎵 Audio Upload Interface] --> B[📊 Visualization Dashboard]
        B --> C[🌙 Theme Management]
        C --> D[📱 Responsive UI Components]
    end

    subgraph "Backend (Python + Modal)"
        E[🔄 Audio Preprocessing] --> F[🧠 Custom CNN Model]
        F --> G[📈 Feature Extraction]
        G --> H[🎯 Classification Output]
    end

    subgraph "Data Pipeline"
        I[📁 Raw Audio File] --> J[🔊 Waveform Processing]
        J --> K[📊 Mel-Spectrogram Generation]
        K --> L[🔢 Tensor Conversion]
    end

    A -->|HTTP POST| E
    H -->|JSON Response| B
    I --> J
    L --> F

    style A fill:#4F46E5,stroke:#312E81,stroke-width:2px,color:#FFFFFF
    style F fill:#EF4444,stroke:#DC2626,stroke-width:2px,color:#FFFFFF
    style B fill:#10B981,stroke:#059669,stroke-width:2px,color:#FFFFFF
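
The diagram shows the backend answering the frontend's HTTP POST with a JSON response. The sketch below illustrates one plausible shape for that payload using only the standard library. The field names (`predictions`, `label`, `confidence`), the `top_k` parameter, and the example class names are assumptions made for illustration; the source does not specify the actual response schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Prediction:
    label: str         # hypothetical class name
    confidence: float  # probability in [0, 1]


def build_response(probabilities: dict[str, float], top_k: int = 3) -> str:
    """Serialize the top_k most likely classes as a JSON string."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    predictions = [Prediction(label, round(p, 4)) for label, p in ranked[:top_k]]
    return json.dumps({"predictions": [asdict(p) for p in predictions]})


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    print(build_response({"dog_bark": 0.82, "siren": 0.11, "rain": 0.07}, top_k=2))
```

Sorting on the server keeps the frontend simple: the visualization dashboard can render the list in the order it arrives.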

🎵 Audio Processing Pipeline

Our audio processing pipeline converts each file into the Mel-spectrogram the CNN consumes, then turns the network's output into class predictions:

flowchart LR
    subgraph "Input Processing"
        A1[🎤 Raw Audio File<br/>WAV Format] --> A2[🔊 Audio Decoding<br/>Librosa/Torchaudio]
        A2 --> A3[📊 Mel-Spectrogram<br/>Conversion]
    end

    subgraph "CNN Architecture"
        B1[🧱 Conv2D Layers<br/>Feature Extraction] --> B2[🎯 Pooling Layers<br/>Dimensionality Reduction]
        B2 --> B3[🔗 Fully Connected<br/>Classification Head]
    end

    subgraph "Output Generation"
        C1[📈 Softmax Probabilities] --> C2[🏷️ Class Predictions]
        C2 --> C3[📊 Confidence Scores]
    end

    A3 --> B1
    B3 --> C1

    style A1 fill:#3B82F6,stroke:#1D4ED8,stroke-width:2px,color:#FFFFFF
    style B1 fill:#EF4444,stroke:#DC2626,stroke-width:2px,color:#FFFFFF
    style C2 fill:#10B981,stroke:#059669,stroke-width:2px,color:#FFFFFF
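
The Output Generation stage in the diagram (softmax probabilities, then a class prediction with its confidence) can be sketched in a few lines of plain Python. This is a minimal illustration rather than VoiceTell's actual inference code; the function names and the class names used in the demo are assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: list[float], class_names: list[str]) -> tuple[str, float]:
    """Return the most likely class and its probability (the confidence score)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


if __name__ == "__main__":
    # Made-up logits and class names, for illustration only.
    label, confidence = predict([1.0, 3.0, 0.5], ["rain", "siren", "dog_bark"])
    print(f"{label}: {confidence:.1%}")
```

Subtracting the largest logit first leaves the result unchanged mathematically but avoids overflow when the network produces large scores.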

🧠 Custom CNN Architecture & Training Insights

The heart of VoiceTell is a Convolutional Neural Network (CNN) designed and trained from scratch in PyTorch. It takes 1-channel Mel-spectrogram inputs and produces one score per class:

from torch import nn


class AudioCNN(nn.Module):
    """CNN classifier for 1-channel Mel-spectrograms shaped (batch, 1, n_mels, frames)."""

    def __init__(self, num_classes=10):
        super().__init__()

        # Convolutional Feature Extractor
        self.conv_layers = nn.Sequential(
            # Block 1: Initial feature detection
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 2: Complex pattern recognition
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 3: High-level feature extraction
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
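            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        # Assumed forward pass (not shown in the original excerpt):
        # extract features, flatten to (batch, 128 * 4 * 4), then classify.
        x = self.conv_layers(x)
        x = x.flatten(start_dim=1)
        return self.classifier(x)  # raw logits; apply softmax for probabilities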
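
As a sanity check on the layers defined above, the following sketch traces the feature-map shape through the three convolutional blocks and counts the trainable parameters by hand, without needing PyTorch installed. The input size used in the demo (128 mel bands by 216 frames) is an assumed example, not a value stated in the source.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def batchnorm_params(channels: int) -> int:
    """Learnable scale and shift per channel (running statistics excluded)."""
    return 2 * channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


def trace_shape(n_mels: int, n_frames: int) -> list[tuple[int, int, int]]:
    """(channels, height, width) after each block of conv_layers."""
    # 3x3 convs with padding=1 keep height and width; MaxPool2d(2, 2) halves
    # them, rounding down.
    block1 = (32, n_mels // 2, n_frames // 2)
    block2 = (64, block1[1] // 2, block1[2] // 2)
    block3 = (128, 4, 4)  # AdaptiveAvgPool2d((4, 4)) fixes the output size
    return [block1, block2, block3]


def total_params(num_classes: int = 10) -> int:
    """Trainable parameters of the architecture shown above."""
    conv = conv2d_params(1, 32, 3) + conv2d_params(32, 64, 3) + conv2d_params(64, 128, 3)
    norm = batchnorm_params(32) + batchnorm_params(64) + batchnorm_params(128)
    head = linear_params(128 * 4 * 4, 256) + linear_params(256, num_classes)
    return conv + norm + head


if __name__ == "__main__":
    print(trace_shape(n_mels=128, n_frames=216))  # assumed input size
    print(total_params())
```

Because the last block ends in adaptive pooling, the classifier always receives 128 * 4 * 4 = 2048 features, so clips of different lengths can share the same head. For this excerpt with `num_classes=10` the count comes to 620,234 trainable parameters, roughly 2.5 MB as 32-bit floats; the 12.3 MB model size reported in this README refers to the deployed model file, which this arithmetic does not try to account for.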

Model Training Insights:

  • Dataset: Trained on a custom audio dataset with a balanced class distribution.
  • Training Epochs: Trained for 100+ epochs with early stopping to prevent overfitting.
  • Optimization: Used the Adam optimizer with learning rate scheduling for efficient convergence.
  • Regularization: Employed Dropout and Batch Normalization, along with data augmentation, to improve generalization.
  • Accuracy Achieved: 83.0% accuracy on the held-out test dataset.
  • Training Time: Approximately 2.5 hours on an NVIDIA V100 GPU.
  • Model Size: 12.3 MB, compact enough for serverless deployment.
  • Inference Time: Average prediction time under 500 ms, enabling real-time feedback.
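
Early stopping, mentioned in the list above, can be sketched independently of any framework. The class below is a minimal illustration, assuming validation loss is the monitored quantity; the `patience` and `min_delta` parameters and the loss values in the demo are made up, since the source does not state which criterion or settings were used.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


if __name__ == "__main__":
    # Made-up validation losses, for illustration only.
    losses = [1.20, 0.95, 0.80, 0.82, 0.81, 0.83, 0.84]
    stopper = EarlyStopping(patience=3)
    for epoch, loss in enumerate(losses):
        if stopper.step(epoch, loss):
            print(f"stop at epoch {epoch}, best epoch was {stopper.best_epoch}")
            break
```

In a real training loop the weights from `best_epoch` would typically be checkpointed and restored after stopping, so the deployed model is the one with the lowest validation loss rather than the last one trained.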

📊 Model Performance

| Metric | Value | Description |
|---|---|---|
| Training Accuracy | 89.5% | Accuracy on training dataset |
| Validation Accuracy | 83.2% | Accuracy on validation dataset |
| Test Accuracy | 83.0% | Final model performance |
| Training Time | ~2.5 hours | On NVIDIA V100 GPU |
| Model Size | 12.3 MB | Optimized for deployment |
| Inference Time | <500 ms | Average prediction time |

📈 Training Progress

xychart-beta
    title "Model Training Progress"
    x-axis [Epoch1, Epoch20, Epoch40, Epoch60, Epoch80, Epoch100]
    y-axis "Accuracy %" 0 --> 100
    line [45, 65, 75, 80, 83, 83]

🎯 Key Features

🚀 Core Functionality

  • 🎵 Audio Upload: Upload WAV audio files for analysis.
  • 🧠 Real-time Classification: Predictions are returned as soon as the uploaded audio has been processed.
  • 📊 Visual Feedback: Interactive waveform and spectrogram displays provide insight into the audio features.
  • 🎯 Confidence Scores: Get detailed prediction probabilities for each class.
  • ⚡ Fast Inference: Average prediction times under 500 ms keep the experience smooth.
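
Since uploads are WAV files, a backend would usually check that the bytes really are a readable WAV before running the model. The sketch below does this with Python's standard `wave` module. The helper names and the 30-second limit are hypothetical, and VoiceTell itself decodes audio with Librosa/Torchaudio, so this is an illustration of the idea rather than the project's code.

```python
import io
import wave


def describe_wav(data: bytes, max_seconds: float = 30.0) -> dict:
    """Parse WAV bytes and return basic properties; raise ValueError if unusable."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            sample_rate = wav.getframerate()
            n_frames = wav.getnframes()
            channels = wav.getnchannels()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a readable PCM WAV file: {exc}") from exc

    if sample_rate <= 0:
        raise ValueError("invalid sample rate")
    duration = n_frames / sample_rate
    if duration == 0 or duration > max_seconds:
        raise ValueError(f"unsupported duration: {duration:.2f}s")
    return {"sample_rate": sample_rate, "channels": channels, "duration_s": duration}


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV in memory (used here only for the demo)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(describe_wav(make_silent_wav(1.0)))
```

Rejecting malformed or overly long files up front keeps inference latency predictable and gives the user a clear error instead of a failed prediction.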

🎨 User Experience

  • 🌙 Dark/Light Mode: Intuitive theme switching with persistent preferences.
  • 📱 Responsive Design: Optimized for seamless interaction across all devices.
  • 🎭 Modern UI: Visually appealing interface with subtle animations powered by Framer Motion.
  • ♿ Accessibility: Designed with WCAG 2.1 AA compliance for broad usability.
  • 🔄 Real-time Updates: Live progress indicators provide immediate feedback during processing.

🛠️ Technical Excellence

  • ⚡ Serverless Deployment: Uses Modal.com for scalable and efficient backend infrastructure.
  • 🔒 Type Safety: Full TypeScript implementation in the frontend for robust and maintainable code.
  • 🧪 Testing Ready: A test structure is laid out so coverage can grow with future expansions.
  • 📦 Optimized Bundle: Code splitting and lazy loading keep load times short.

🛠️ Technology Stack

Frontend Technologies

  • Next.js: The React framework for building the web application's frontend.
  • React: The core JavaScript library for building user interfaces.
  • TypeScript: For type safety and improved developer experience.
  • Tailwind CSS: A utility-first CSS framework for rapid and consistent styling.
  • Framer Motion: For rich, interactive animations.

Backend Technologies

  • Python: The primary language for the ML model and backend logic.
  • PyTorch: The deep learning framework for model development and training.
  • Modal: For serverless deployment of the backend inference endpoint.
  • Librosa: Essential for robust audio signal processing and feature extraction.
  • Torchaudio: PyTorch's library for audio I/O and processing.

Development Tools

  • Bun: A fast all-in-one JavaScript runtime for frontend development.
  • ESLint: For maintaining code quality and identifying issues.
  • Prettier: For consistent code formatting.

🙏 Acknowledgments

  • Inspired by comprehensive tutorials on CNNs for audio classification.
  • Special thanks to the hackathon organizers for promoting authentic ML development.

Built with ❤️ by Gaurav Chaudhary
