🎤 VocalAIAgent PRO: A Multimodal Conversational Vocal Coach

https://vocal-ai.site/
https://prezi.com/p/uulwqjra-tgm/enhancing-vocal-skills-with-voiceai/?present=1
Author: Vocal Coach AI PRO Team

Overview

This project is a comprehensive AI-powered vocal coaching system built for the Bolt.new Hackathon. The application provides a full-stack solution for personalized vocal training, combining real-time voice analysis, intelligent coaching, and conversational AI agents to create a holistic vocal development experience.

VocalAIAgent solves the problem of fragmented vocal training by combining multiple functionalities into one intuitive and multimodal assistant. This AI-powered system simplifies vocal coaching by:

  • Understanding natural voice patterns and interpreting audio recordings to gather vocal characteristics
  • Providing personalized recommendations based on vocal analysis and user preferences
  • Fetching real-time data for vocal metrics, progress tracking, and performance insights
  • Generating dynamic lesson plans tailored to the user's vocal type, skill level, and practice goals
  • Offering intelligent coaching conversations grounded in real-time voice analysis data

VocalAIAgent is not just an LLM chatbot. The agent is enhanced with numerous AI capabilities, including:

  • Voice Understanding: Real-time pitch detection and vocal analysis with metrics such as jitter, shimmer, and vibrato rate
  • Retrieval-Augmented Generation (RAG): Providing personalized coaching tips by retrieving relevant vocal techniques from a knowledge base
  • Few-Shot Prompting: Generating dynamic lesson plans and exercises based on minimal user input
  • Function Calling: Executing specific functions based on user commands, such as starting voice sessions, analyzing recordings, or generating progress reports
  • Long Context Window: Managing and retaining user vocal profiles and practice history across multiple sessions
  • Context Caching: Storing relevant vocal data temporarily to improve response speed and reduce redundant analysis
  • AI Evaluation: Using LLM-based evaluation to assess vocal progress and provide "Vocal Scores" based on improvement and consistency
  • Grounding: Ensuring that coaching recommendations are grounded in real-time vocal analysis data
  • Embeddings: Utilizing embeddings for effective vocal pattern matching and personalized exercise recommendations
  • Multimodal Integration: Understanding both voice inputs and conversational text for comprehensive coaching
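The RAG and embeddings capabilities above boil down to ranking knowledge-base entries by vector similarity to the user's query. A minimal sketch of that retrieval step, assuming hypothetical hand-written vectors in place of real model-generated embeddings (the names `technique_library` and `retrieve_techniques` are illustrative, not from the actual codebase):

```python
import math

# Hypothetical pre-computed embeddings for knowledge-base entries; in the
# real system these would come from an embedding model, not be hand-written.
technique_library = {
    "breath support drills": [0.9, 0.1, 0.2],
    "vibrato control": [0.1, 0.8, 0.3],
    "range extension": [0.2, 0.3, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve_techniques(query_embedding, top_k=2):
    """Rank knowledge-base entries by similarity to the query embedding."""
    ranked = sorted(
        technique_library.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# A query embedding close to the "breath support" region of the space
print(retrieve_techniques([0.85, 0.15, 0.25]))
```

The retrieved technique names (or their full text) are then injected into the coaching prompt so the LLM's advice is grounded in the knowledge base rather than generated from scratch.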

Problem Statement

Vocal training can be an isolated and inconsistent process. Singers and speakers often struggle with:

  • Lack of real-time feedback during practice sessions
  • Limited access to personalized coaching based on their specific vocal characteristics
  • Difficulty tracking progress and identifying improvement areas
  • Fragmented resources across multiple platforms and tools
  • Inconsistent practice routines without proper guidance

VocalAIAgent addresses these challenges by providing a unified, intelligent coaching platform that combines voice analysis, personalized AI coaching, and comprehensive progress tracking in one seamless experience.

🚀 Key Features

Core Vocal Analysis

🎵 Real-Time Pitch Detection: Instant feedback during practice sessions with live pitch visualization
📊 Deep Vocal Analysis: Advanced metrics including jitter, shimmer, vibrato rate, and vocal range
🎯 Voice Type Classification: Automatic classification of voice types (soprano, alto, tenor, bass)
📈 Progress Tracking: Comprehensive tracking of vocal improvements over time
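The core idea behind real-time pitch detection is estimating the signal's fundamental period. A simplified autocorrelation sketch, under the assumption of a clean monophonic input (the in-browser detector uses the Web Audio API and more refined algorithms such as YIN-style variants with windowing and interpolation):

```python
import math

SAMPLE_RATE = 8000  # Hz; illustrative rate, not the browser's actual rate

def detect_pitch(samples, sample_rate=SAMPLE_RATE, fmin=80, fmax=1000):
    """Estimate the fundamental frequency via naive autocorrelation.

    Searches for the lag (in samples) where the signal best matches a
    shifted copy of itself; that lag is the pitch period.
    """
    best_lag, best_corr = 0, 0.0
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    for lag in range(min_lag, max_lag):
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic 220 Hz tone (A3) as a stand-in for microphone input
tone = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(2048)]
print(round(detect_pitch(tone)))  # ≈ 222, limited by integer-lag resolution
```

Production detectors interpolate between lags and apply confidence thresholds, which is why the live visualization can track pitch smoothly.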

AI-Powered Coaching System

🤖 Dual-AI Architecture: Proactive Fetch.ai Agent + Reactive Letta Conversational Agent
💬 Stateful Conversations: AI coach that remembers context and discusses specific progress
📋 Personalized Lesson Plans: Dynamic lesson generation based on vocal analysis and user goals
🎓 Exercise Recommendations: Tailored vocal exercises based on analysis results

Advanced Features

🗣️ VAPI Voice Integration: Real-time voice conversations with AI coach
📱 Multimodal Interface: Support for voice input, text chat, and visual feedback
🔄 Lesson Feedback Loop: Comprehensive storage and analysis of lesson completion data
📊 AI-Generated Reports: Daily summaries of performance trends and insights
🎪 Community Features: Progress sharing and vocal challenges

Data & Memory Management

💾 Persistent Memory: User preferences, vocal characteristics, and practice history retention
📤 Export Capabilities: Save vocal analyses, lesson plans, and progress reports
🔐 Secure Data Storage: Supabase integration with proper authentication and RLS
📋 Session Management: Comprehensive tracking of practice sessions and improvements
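The export capability above amounts to serializing the user's profile and session history into a portable format. A hedged sketch, with illustrative field names (the real schema lives in Supabase):

```python
import json
from datetime import datetime, timezone

def export_progress_report(profile: dict, sessions: list) -> str:
    """Serialize a vocal profile and session history as a JSON export.

    Field names here are illustrative placeholders, not the production schema.
    """
    report = {
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "vocal_profile": profile,
        "session_count": len(sessions),
        "sessions": sessions,
    }
    return json.dumps(report, indent=2)

print(export_progress_report(
    {"voice_type": "tenor", "skill_level": "intermediate"},
    [{"date": "2025-01-10", "avg_pitch_hz": 220.5}],
))
```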

Why This Matters

Vocal training today lacks the personalized, data-driven approach that modern AI can provide. VocalAIAgent brings together voice science, conversational AI, and personalized coaching into one intelligent system, offering a more effective, engaging, and accessible vocal training experience. By combining real-time voice analysis, stateful AI conversations, and comprehensive progress tracking, this tool showcases the potential of Generative AI in revolutionizing music education and vocal development.

🛠 Tech Stack

Frontend

  • React: Modern component-based UI framework
  • TypeScript: Type-safe development with enhanced IDE support
  • Vite: Fast build tool and development server
  • Tailwind CSS: Utility-first CSS framework for responsive design
  • Framer Motion: Smooth animations and transitions

Backend

  • FastAPI: High-performance Python web framework with automatic API docs
  • Python: Core backend language with extensive AI/ML libraries
  • Uvicorn: ASGI server for production deployment

AI Services

  • Letta: Stateful conversational AI with long-term memory capabilities
  • Fetch.ai: Autonomous agent system for proactive analysis and reporting
  • VAPI: Voice AI platform for real-time voice conversations

Database & Authentication

  • Supabase: PostgreSQL database with built-in authentication and real-time features
  • Row Level Security (RLS): Secure user data isolation

Voice Processing

  • Web Audio API: Real-time audio processing and pitch detection
  • Custom Voice Analyzer: Advanced vocal metrics calculation

Hosting & Deployment

  • Netlify: Frontend hosting with automatic deployments
  • Google Cloud Run: Scalable backend container hosting
  • Docker: Containerized backend for consistent deployments

How It Works

Intent Recognition and Routing System

VocalAIAgent uses a sophisticated routing system that directs user requests to appropriate handlers based on vocal coaching context:

def interpret_vocal_request(user_input, session_context):
    prompt = (
        "You are a vocal coaching function router. Based on the user message and session context, "
        "output ONLY a Python dictionary.\n"
        "- If asking for voice analysis: {'intent': 'analyze_voice', 'session_type': '<TYPE>'}\n"  
        "- If requesting lesson: {'intent': 'start_lesson', 'category': '<CATEGORY>', 'level': '<LEVEL>'}\n"
        "- If seeking feedback: {'intent': 'get_feedback', 'aspect': '<VOCAL_ASPECT>'}\n"
        f"Session Context: {session_context}\n"
        f"User: {user_input}"
    )
    # Route to appropriate vocal coaching handler
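Since the router prompt asks the model to emit a Python dictionary as text, the reply has to be parsed defensively before routing. A minimal sketch of that parsing step, assuming `ast.literal_eval` for safety (the function name `parse_router_output` and the `'chat'` fallback intent are illustrative, not from the actual codebase):

```python
import ast

def parse_router_output(raw: str) -> dict:
    """Safely parse the dictionary the router prompt asks the LLM to emit.

    ast.literal_eval only evaluates literals, so a malformed or malicious
    reply cannot execute code; on failure we fall back to a default intent.
    """
    try:
        parsed = ast.literal_eval(raw.strip())
        if isinstance(parsed, dict) and "intent" in parsed:
            return parsed
    except (ValueError, SyntaxError):
        pass
    return {"intent": "chat"}  # hypothetical fallback handler

print(parse_router_output("{'intent': 'analyze_voice', 'session_type': 'warmup'}"))
print(parse_router_output("Sorry, I can't help with that."))
```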

Vocal Analysis Pipeline

The voice analysis system combines multiple AI techniques:

  1. Real-time Processing: Web Audio API captures and processes audio in real-time
  2. Feature Extraction: Advanced algorithms extract vocal characteristics (pitch, formants, etc.)
  3. AI Classification: Machine learning models classify voice type and detect patterns
  4. Contextual Analysis: Results are interpreted within the user's vocal development context
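Two of the metrics extracted in step 2 have simple textbook definitions: local jitter is the average cycle-to-cycle variation in pitch period, and local shimmer is the same calculation applied to peak amplitudes. A sketch of those formulas on hypothetical measurements (the production analyzer extracts periods and amplitudes from the audio itself):

```python
def jitter_percent(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, as a percentage of the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return 100 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_percent(amplitudes):
    """Local shimmer: the same calculation applied to peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100 * (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# Hypothetical measurements: pitch periods in ms and peak amplitudes
periods = [4.50, 4.52, 4.49, 4.51, 4.50]
amps = [0.80, 0.78, 0.81, 0.79, 0.80]
print(round(jitter_percent(periods), 2), round(shimmer_percent(amps), 2))
# → 0.44 2.51
```

Healthy voices typically show jitter around or below 1% and shimmer below a few percent, which gives the coaching layer concrete thresholds to comment on.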

Dual-AI Architecture

Proactive Fetch.ai Agent

class VocalCoachAgent:
    """Autonomous agent for vocal analysis and report generation"""

    async def generate_daily_reports(self):
        """Automatically analyze practice sessions and generate insights"""
        users = await self.get_active_users()
        for user_id in users:
            # Analyze vocal progress
            report = await self.analyze_vocal_progress(user_id)
            # Store insights for Letta conversations
            await self.store_insights(user_id, report)

Reactive Letta Conversational Agent

class LettaVocalCoach:
    """Stateful conversational coach with long-term memory"""

    async def generate_response(self, context, user_message):
        """Generate contextual coaching based on vocal analysis data"""
        # Retrieve user's vocal history and analysis
        vocal_context = await self.build_vocal_context(context.user_id)

        # Generate personalized coaching response
        response = await self.letta_client.agents.messages.create(
            agent_id=self.agent_id,
            messages=[{
                "role": "user",
                "content": f"Vocal Context: {vocal_context}\nUser: {user_message}"
            }]
        )
        return response

Session Management and Memory

VocalAIAgent maintains comprehensive session state and user memory:

async def start_vocal_session():
    session_memory = {
        'vocal_profile': {
            'voice_type': user.voice_type,
            'skill_level': user.skill_level,
            'practice_goals': user.goals,
            'vocal_range': user.range_analysis
        },
        'practice_history': [],
        'current_focus_areas': [],
        'last_analysis_results': {}
    }

    # Main coaching loop with state management
    session_active = True
    while session_active:
        user_input = await get_user_input()
        action = interpret_vocal_request(user_input, session_memory)

        if action['intent'] == 'analyze_voice':
            await handle_voice_analysis(action, session_memory)
        elif action['intent'] == 'start_lesson':
            await handle_lesson_start(action, session_memory)
        # ... additional handlers

Current Capabilities Demonstrated

Voice Analysis & Processing

  • Real-time pitch detection with Web Audio API
  • Advanced vocal metrics (jitter, shimmer, vibrato)
  • Voice type classification and range analysis
  • Session recording and playback capabilities

AI-Powered Coaching

  • Fetch.ai autonomous agents for progress analysis
  • Letta conversational AI with stateful memory
  • VAPI real-time voice conversations
  • Personalized lesson and exercise generation

Data Management & Persistence

  • Comprehensive user vocal profiles
  • Session history and progress tracking
  • Lesson feedback storage and retrieval
  • Secure multi-user data isolation

User Experience Features

  • Modern, responsive React interface
  • Real-time visual feedback during voice sessions
  • Progress dashboards and analytics
  • Community features and challenges

Current Errors and Solutions

Issues Identified:

  1. URL Construction Error: Double slash in API endpoints causing malformed URLs
  2. Database Connection Issues: Lesson feedback storage failing due to Supabase credential problems
  3. Error Handling: Generic error messages making debugging difficult

Solutions Implemented:

  1. Fixed URL Construction: Added trailing slash removal in frontend API calls
  2. Enhanced Error Logging: Improved backend error reporting with detailed messages
  3. Database Health Checks: Added endpoints to verify service connectivity
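The URL construction fix described above can be sketched as a small join helper that normalizes slashes on both sides (the helper name and the `api.example.com` base are illustrative):

```python
def api_url(base: str, path: str) -> str:
    """Join a base URL and endpoint path without producing a double slash,
    regardless of whether the caller includes leading/trailing slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")

print(api_url("https://api.example.com/", "/sessions/analyze"))
# → https://api.example.com/sessions/analyze
```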

Limitations & Future Work

Current Limitations:

  • Voice Processing Accuracy: Browser-based analysis has limitations compared to specialized hardware
  • AI Model Training: Limited training data for vocal coaching specific AI models
  • Scalability: Current architecture needs optimization for large-scale deployment

Future Enhancements:

High Priority:

  1. Enhanced Voice Processing: Integrate professional-grade voice analysis libraries
  2. Advanced AI Models: Fine-tune models specifically for vocal coaching contexts
  3. Mobile Applications: Native iOS/Android apps with enhanced voice processing

Medium Priority:

  1. Social Features: Enhanced community aspects with vocal challenges and peer learning
  2. Integration Ecosystem: Connect with music learning platforms and DAWs
  3. Offline Capabilities: Voice analysis and basic coaching without internet connection

Built With

Core Technologies

  • React 18 with TypeScript for modern, type-safe frontend development
  • FastAPI for high-performance Python backend with automatic API documentation
  • Supabase for PostgreSQL database, authentication, and real-time features
  • Tailwind CSS for responsive, utility-first styling

AI & Voice Technologies

  • Letta for stateful conversational AI with long-term memory
  • Fetch.ai for autonomous agent systems and proactive analysis
  • VAPI for real-time voice AI conversations
  • Web Audio API for browser-based voice processing

DevOps & Deployment

  • Docker for containerized backend deployment
  • Google Cloud Run for scalable, serverless backend hosting
  • Netlify for frontend hosting with automatic deployments

VocalAIAgent demonstrates the transformative potential of AI in music education, combining cutting-edge voice processing, conversational AI, and personalized coaching to create a comprehensive vocal training platform.

Inspirations:

Developer 1: I’ve always loved singing, and the idea for Vocal AI came to me a long time ago, when I realized I wanted to improve my vocal skills but didn’t have enough time for in-person lessons. I thought: why not create a website or app that helps with singing practice anytime?

Developer 2: I realized that no such product exists online, and that AI is more accessible to people than real coaches. I think Vocal AI can solve this problem.

Challenges I ran into:

Developer 1: Balancing late nights without sleep, working on the website while preparing for exams, and pushing myself to stay focused and motivated despite the tiredness. It was tough, but I kept reminding myself that the final result would be worth it.

Developer 2: Backend integration was tough, and it took me some time to find the best solutions for the ideas I had. I did not want to leave the frontend static. Changing the images on the project generated by Bolt was also tough.

Accomplishments I’m proud of:

Developer 1: I’m really happy with the design: it’s clean, not overloaded, and easy on the eyes. There are no harsh or bright elements that distract or tire the viewer, which makes the user experience pleasant.

Developer 2: I'm proud that we connected three distinct AI systems in this project and integrated numerous services.

What I learned:

We learned that ideas which seem impossible at first can be brought to life with effort and teamwork. Dedication is important; sometimes working late nights is necessary to achieve great results. We also learned to be patient, manage our emotions, and focus on simplicity in design rather than adding too many elements.

What’s next for Vocal AI:

We plan to keep improving Vocal AI by enhancing both the design and the functionality. Our goal is to make it a user-friendly, helpful app that many people can use to improve their singing skills. This version of the project is just the beginning, and we're excited to continue developing it and deploying it on other platforms. We're open to investment.
