Voice-Tell

🎵 VoiceTell: AI-Powered Audio Classification 🗣️

Created by: Gaurav Chaudhary
GitHub repo: https://github.com/ANAMASGARD/VoiceTell.git

✨ Inspiration

Many AI products today are thin layers over pre-trained, black-box models. VoiceTell came out of the opposite challenge: build a transparent machine learning solution from the ground up. We wanted to show that a useful AI application can be built by training a custom model, integrating it into a robust web application, and keeping control of the entire ML pipeline, in line with the "Build Real ML Web Apps: No Wrappers, Just Real Models" hackathon philosophy. The goal was a solution that not only works but also explains how it works, with an emphasis on reproducibility and technical understanding.

🎯 What it Does

VoiceTell is a web application that lets users upload audio files and get a classification of their content in under a second. Possible uses include identifying sounds in an environment, categorizing audio snippets for content management, or simple speech analysis; VoiceTell provides the core AI engine for such tasks.

Upon uploading an audio file, VoiceTell:

Processes the raw audio into a visual representation called a Mel-spectrogram.
Feeds this spectrogram into a custom-trained Convolutional Neural Network (CNN).
Returns a real-time classification prediction with associated confidence scores.

The user interface also provides interactive visualizations of the audio waveform and the processed spectrogram, offering transparency into what the model "sees."
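To make the Mel-spectrogram step more concrete, here is a minimal sketch of the mel scale and of the spectrogram shape the model ends up seeing. It uses the HTK mel formula (the default in torchaudio; librosa defaults to the Slaney variant). The sample rate, clip length, number of mel bands, and hop length are assumptions chosen for illustration, not settings taken from the project.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Return n_mels + 2 frequencies (Hz) equally spaced on the mel scale.

    Each consecutive triple (left, center, right) defines one triangular filter.
    """
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + i * step) for i in range(n_mels + 2)]

def spectrogram_shape(n_samples, n_mels, hop_length):
    """Shape (n_mels, n_frames) of a mel-spectrogram built on a centered STFT."""
    return (n_mels, 1 + n_samples // hop_length)

# Assumed settings: a 5 s clip at 22,050 Hz, 128 mel bands, hop of 512 samples.
edges = mel_band_edges(n_mels=128, f_min=0.0, f_max=11025.0)
shape = spectrogram_shape(n_samples=5 * 22050, n_mels=128, hop_length=512)
print(shape)  # (128, 216)
```

Because the band edges are evenly spaced in mels rather than Hz, the filters are narrow at low frequencies and wide at high ones, which is what gives the spectrogram its perceptually motivated frequency axis.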

🧠 How it's Built: Our Authentic ML Approach

VoiceTell is a full-stack application built under a strict "no pre-trained LLMs or wrappers" policy: every prediction comes from a model trained for this project.

🏗️ Architecture Overview

The system operates with distinct frontend and backend components:

graph TB
    subgraph "Frontend (Next.js + React)"
        A[🎵 Audio Upload Interface] --> B[📊 Visualization Dashboard]
        B --> C[🌙 Theme Management]
        C --> D[📱 Responsive UI Components]
    end

    subgraph "Backend (Python + Modal)"
        E[🔄 Audio Preprocessing] --> F[🧠 Custom CNN Model]
        F --> G[📈 Feature Extraction]
        G --> H[🎯 Classification Output]
    end

    subgraph "Data Pipeline"
        I[📁 Raw Audio File] --> J[🔊 Waveform Processing]
        J --> K[📊 Mel-Spectrogram Generation]
        K --> L[🔢 Tensor Conversion]
    end

    A -->|HTTP POST| E
    H -->|JSON Response| B
    I --> J
    L --> F

    style A fill:#4F46E5,stroke:#312E81,stroke-width:2px,color:#FFFFFF
    style F fill:#EF4444,stroke:#DC2626,stroke-width:2px,color:#FFFFFF
    style B fill:#10B981,stroke:#059669,stroke-width:2px,color:#FFFFFF

🎵 Audio Processing Pipeline

The audio pipeline decodes each WAV file, converts it to a Mel-spectrogram, and passes the result through the CNN to produce class probabilities:

flowchart LR
    subgraph "Input Processing"
        A1[🎤 Raw Audio File<br/>WAV Format] --> A2[🔊 Audio Decoding<br/>Librosa/Torchaudio]
        A2 --> A3[📊 Mel-Spectrogram<br/>Conversion]
    end

    subgraph "CNN Architecture"
        B1[🧱 Conv2D Layers<br/>Feature Extraction] --> B2[🎯 Pooling Layers<br/>Dimensionality Reduction]
        B2 --> B3[🔗 Fully Connected<br/>Classification Head]
    end

    subgraph "Output Generation"
        C1[📈 Softmax Probabilities] --> C2[🏷️ Class Predictions]
        C2 --> C3[📊 Confidence Scores]
    end

    A3 --> B1
    B3 --> C1

    style A1 fill:#3B82F6,stroke:#1D4ED8,stroke-width:2px,color:#FFFFFF
    style B1 fill:#EF4444,stroke:#DC2626,stroke-width:2px,color:#FFFFFF
    style C2 fill:#10B981,stroke:#059669,stroke-width:2px,color:#FFFFFF
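The output stage of the diagram (softmax probabilities, then class predictions, then confidence scores) can be sketched in plain Python as follows. The class labels and logit values are hypothetical placeholders; the project's real label set is not listed here.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_predictions(logits, labels, k=3):
    """Return the k most likely (label, confidence) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and logits, for illustration only.
labels = ["class_0", "class_1", "class_2", "class_3"]
logits = [2.0, 0.5, -1.0, 0.1]
for label, confidence in top_predictions(logits, labels):
    print(f"{label}: {confidence:.1%}")
```

The model itself outputs raw logits; applying softmax only at this stage keeps training compatible with losses that expect logits while still giving the UI probabilities to display.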

🧠 Custom CNN Architecture & Training Insights

The core of VoiceTell is a Convolutional Neural Network (CNN) designed and trained from scratch in PyTorch. It takes single-channel Mel-spectrograms as input:

import torch.nn as nn


class AudioCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Convolutional Feature Extractor
        self.conv_layers = nn.Sequential(
            # Block 1: Initial feature detection
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 2: Complex pattern recognition
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 3: High-level feature extraction
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )

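        # Classification Head
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        # Assumed forward pass (not shown in the original snippet).
        # x: (batch, 1, n_mels, time)
        x = self.conv_layers(x)    # -> (batch, 128, 4, 4)
        x = x.flatten(1)           # -> (batch, 2048)
        return self.classifier(x)  # raw logits, one per class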
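As a rough check on the layer sizes above, the learnable parameters and the feature-map sizes can be worked out by hand. The sketch below does that arithmetic in plain Python; the helper names and the 128 x 216 input size are assumptions for illustration and are not part of the project code.

```python
def conv2d_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch

def batchnorm_params(channels):
    """Learnable scale and shift of a BatchNorm2d layer."""
    return 2 * channels

def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return in_features * out_features + out_features

def audio_cnn_params(num_classes=10):
    """Per-layer learnable parameter counts for the AudioCNN listing."""
    return {
        "conv1": conv2d_params(1, 32),
        "bn1": batchnorm_params(32),
        "conv2": conv2d_params(32, 64),
        "bn2": batchnorm_params(64),
        "conv3": conv2d_params(64, 128),
        "bn3": batchnorm_params(128),
        "fc1": linear_params(128 * 4 * 4, 256),
        "fc2": linear_params(256, num_classes),
    }

def feature_map_sizes(height, width):
    """Spatial size after each block.

    The 3x3 convolutions with padding=1 keep the size, MaxPool2d(2, 2)
    halves it (rounding down), and AdaptiveAvgPool2d forces 4x4.
    """
    block1 = (height // 2, width // 2)
    block2 = (block1[0] // 2, block1[1] // 2)
    block3 = (4, 4)
    return [block1, block2, block3]

counts = audio_cnn_params()
total = sum(counts.values())
print(total)                         # 620234
print(round(total * 4 / 1e6, 2))     # 2.48 (MB at 4 bytes per float32 parameter)
print(feature_map_sizes(128, 216))   # [(64, 108), (32, 54), (4, 4)]
```

Two things stand out. First, the adaptive pooling layer is what lets the classifier accept clips of different lengths, since the flattened size is always 128 * 4 * 4 = 2048. Second, with these layers and `num_classes=10` the weights alone come to roughly 2.5 MB in float32, so the 12.3 MB model size reported below presumably reflects more than this listing (for example a different configuration or extra checkpoint contents); the write-up does not say.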

Model Training Insights:

Dataset: Custom audio dataset with a balanced class distribution.
Training Epochs: 100+ epochs, with early stopping to prevent overfitting.
Optimization: Adam optimizer with learning rate scheduling.
Regularization: Dropout, Batch Normalization, and data augmentation to improve generalization.
Accuracy: 83% on the test dataset.
Training Time: Approximately 2.5 hours on an NVIDIA V100 GPU.
Model Size: 12.3 MB.
Inference Time: Under 500 ms on average per prediction.
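The early stopping mentioned above can be sketched independently of any framework. This is a minimal version under assumed settings: the `patience` and `min_delta` values are hypothetical, the class and function names are made up for this example, and the accuracy values are illustrative rather than logged results.

```python
class EarlyStopping:
    """Stop training once validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_until_stop(val_accuracies, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    stopper = EarlyStopping(patience, min_delta)
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if stopper.step(accuracy):
            return epoch
    return None


# Illustrative validation accuracies that plateau at 83%.
history = [45, 65, 75, 80, 83, 83, 83]
print(epochs_until_stop(history, patience=2))  # 7
```

In a real training loop, `step` would be called once per epoch after validation, and the weights from the best epoch would typically be saved and restored when training stops.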

📊 Model Performance

Metric	Value	Description
Training Accuracy	89.5%	Accuracy on training dataset
Validation Accuracy	83.2%	Accuracy on validation dataset
Test Accuracy	83.0%	Final model performance
Training Time	~2.5 hours	On NVIDIA V100 GPU
Model Size	12.3 MB	Optimized for deployment
Inference Time	<500ms	Average prediction time

📈 Training Progress

xychart-beta
    title "Model Training Progress"
    x-axis [Epoch1, Epoch20, Epoch40, Epoch60, Epoch80, Epoch100]
    y-axis "Accuracy %" 0 --> 100
    line [45, 65, 75, 80, 83, 83]

🎯 Key Features

🚀 Core Functionality

🎵 Audio Upload: Seamlessly upload WAV audio files for analysis.
🧠 Real-time Classification: Experience instant AI-powered audio analysis and predictions.
📊 Visual Feedback: Interactive waveform and spectrogram displays provide insight into the audio features.
🎯 Confidence Scores: Get detailed prediction probabilities for each class.
⚡ Fast Inference: Sub-second processing times ensure a smooth user experience.

🎨 User Experience

🌙 Dark/Light Mode: Intuitive theme switching with persistent preferences.
📱 Responsive Design: Optimized for seamless interaction across all devices.
🎭 Modern UI: Visually appealing interface with subtle animations powered by Framer Motion.
♿ Accessibility: Designed to meet WCAG 2.1 AA for broad usability.
🔄 Real-time Updates: Live progress indicators provide immediate feedback during processing.

🛠️ Technical Excellence

⚡ Serverless Deployment: Utilizes Modal.com for scalable and efficient backend infrastructure.
🔒 Type Safety: Full TypeScript implementation in the frontend for robust and maintainable code.
🧪 Testing Ready: Test structure is laid out so tests can be added as the project grows.
📦 Optimized Bundle: Code splitting and lazy loading keep load times short.

🛠️ Technology Stack

Frontend Technologies

Next.js: The React framework for building the web application's frontend.
React: The core JavaScript library for building user interfaces.
TypeScript: For type safety and improved developer experience.
Tailwind CSS: A utility-first CSS framework for rapid and consistent styling.
Framer Motion: For rich, interactive animations.

Backend Technologies

Python: The primary language for the ML model and backend logic.
PyTorch: The deep learning framework for model development and training.
Modal: For serverless deployment of the backend inference endpoint.
Librosa: Essential for robust audio signal processing and feature extraction.
Torchaudio: PyTorch's library for audio I/O and processing.

Development Tools

Bun: A fast all-in-one JavaScript runtime for frontend development.
ESLint: For maintaining code quality and identifying issues.
Prettier: For consistent code formatting.

🙏 Acknowledgments

Inspired by comprehensive tutorials on CNNs for audio classification.
Special thanks to the hackathon organizers for promoting authentic ML development.

Built with ❤️ by Gaurav Chaudhary

Built With

modal
nextjs
python
pytorch
react
tailwind
typescript
vercel

Updates

GAURAV CHAUDHARY started this project — Jul 25, 2025 11:00 AM EDT
