🎤 VocalAIAgent: A Multimodal Conversational Vocal Coach

Author: Vocal Coach AI Team
Repository: Berkeley Hack 2025

Overview

This project is a comprehensive AI-powered vocal coaching system built for the Berkeley Hackathon 2025. The application provides a full-stack solution for personalized vocal training, combining real-time voice analysis, intelligent coaching, and conversational AI agents to create a holistic vocal development experience.

VocalAIAgent solves the problem of fragmented vocal training by combining multiple functionalities into one intuitive and multimodal assistant. This AI-powered system simplifies vocal coaching by:

  • Understanding natural voice patterns and interpreting audio recordings to gather vocal characteristics
  • Providing personalized recommendations based on vocal analysis and user preferences
  • Fetching real-time data for vocal metrics, progress tracking, and performance insights
  • Generating dynamic lesson plans tailored to the user's vocal type, skill level, and practice goals
  • Offering intelligent coaching conversations grounded in real-time voice analysis data

VocalAIAgent is not just an LLM chatbot. The agent is enhanced with numerous AI capabilities, including:

  • Voice Understanding: Real-time pitch detection and vocal analysis with metrics such as jitter, shimmer, and vibrato rate
  • Retrieval-Augmented Generation (RAG): Providing personalized coaching tips by retrieving relevant vocal techniques from a knowledge base
  • Few-Shot Prompting: Generating dynamic lesson plans and exercises based on minimal user input
  • Function Calling: Executing specific functions based on user commands, such as starting voice sessions, analyzing recordings, or generating progress reports
  • Long Context Window: Managing and retaining user vocal profiles and practice history across multiple sessions
  • Context Caching: Storing relevant vocal data temporarily to improve response speed and reduce redundant analysis
  • AI Evaluation: Using LLM-based evaluation to assess vocal progress and provide "Vocal Scores" based on improvement and consistency
  • Grounding: Ensuring that coaching recommendations are grounded in real-time vocal analysis data
  • Embeddings: Utilizing embeddings for effective vocal pattern matching and personalized exercise recommendations
  • Multimodal Integration: Understanding both voice inputs and conversational text for comprehensive coaching
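
The RAG and embeddings capabilities above reduce to similarity search over an embedded knowledge base. A minimal sketch with toy 3-dimensional vectors standing in for real embedding-model output (the knowledge-base entries are illustrative, not the project's actual data):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_technique(query_vec, knowledge_base, top_k=1):
    """Return the top_k technique tips most similar to the query embedding."""
    ranked = sorted(
        knowledge_base,
        key=lambda item: cosine_similarity(query_vec, item["embedding"]),
        reverse=True,
    )
    return [item["tip"] for item in ranked[:top_k]]

# Toy "embeddings" standing in for real model output
kb = [
    {"tip": "Use lip trills to reduce jitter", "embedding": [0.9, 0.1, 0.0]},
    {"tip": "Slide exercises to widen range", "embedding": [0.1, 0.9, 0.2]},
]
print(retrieve_technique([0.8, 0.2, 0.1], kb))
```

The retrieved tips are then injected into the coaching prompt, grounding the LLM's advice in the knowledge base rather than its general training data.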

Problem Statement

Vocal training can be an isolated and inconsistent process. Singers and speakers often struggle with:

  • Lack of real-time feedback during practice sessions
  • Limited access to personalized coaching based on their specific vocal characteristics
  • Difficulty tracking progress and identifying improvement areas
  • Fragmented resources across multiple platforms and tools
  • Inconsistent practice routines without proper guidance

VocalAIAgent addresses these challenges by providing a unified, intelligent coaching platform that combines voice analysis, personalized AI coaching, and comprehensive progress tracking in one seamless experience.

🚀 Key Features

Core Vocal Analysis

🎵 Real-Time Pitch Detection: Instant feedback during practice sessions with live pitch visualization
📊 Deep Vocal Analysis: Advanced metrics including jitter, shimmer, vibrato rate, and vocal range
🎯 Voice Type Classification: Automatic classification of voice types (soprano, alto, tenor, bass)
📈 Progress Tracking: Comprehensive tracking of vocal improvements over time
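
Voice type classification can be sketched as bucketing the detected vocal range by its midpoint frequency. The Hz boundaries below are illustrative approximations, not the thresholds the project's classifier actually uses:

```python
# Approximate midpoint boundaries in Hz (illustrative, not the real classifier)
VOICE_TYPES = [
    ("bass", 0.0, 165.0),
    ("tenor", 165.0, 260.0),
    ("alto", 260.0, 350.0),
    ("soprano", 350.0, float("inf")),
]

def classify_voice_type(low_hz, high_hz):
    """Classify by the geometric midpoint of the detected vocal range.

    The geometric mean is used because pitch perception is logarithmic.
    """
    midpoint = (low_hz * high_hz) ** 0.5
    for name, lo, hi in VOICE_TYPES:
        if lo <= midpoint < hi:
            return name

print(classify_voice_type(100.0, 300.0))  # → tenor
```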

AI-Powered Coaching System

🤖 Dual-AI Architecture: Proactive Fetch.ai Agent + Reactive Letta Conversational Agent
💬 Stateful Conversations: AI coach that remembers context and discusses specific progress
📋 Personalized Lesson Plans: Dynamic lesson generation based on vocal analysis and user goals
🎓 Exercise Recommendations: Tailored vocal exercises based on analysis results

Advanced Features

🗣️ VAPI Voice Integration: Real-time voice conversations with AI coach
📱 Multimodal Interface: Support for voice input, text chat, and visual feedback
🔄 Lesson Feedback Loop: Comprehensive storage and analysis of lesson completion data
📊 AI-Generated Reports: Daily summaries of performance trends and insights
🎪 Community Features: Progress sharing and vocal challenges

Data & Memory Management

💾 Persistent Memory: User preferences, vocal characteristics, and practice history retention
📤 Export Capabilities: Save vocal analyses, lesson plans, and progress reports
🔐 Secure Data Storage: Supabase integration with proper authentication and RLS
📋 Session Management: Comprehensive tracking of practice sessions and improvements

Why This Matters

Vocal training today lacks the personalized, data-driven approach that modern AI can provide. VocalAIAgent brings together voice science, conversational AI, and personalized coaching into one intelligent system, offering a more effective, engaging, and accessible vocal training experience. By combining real-time voice analysis, stateful AI conversations, and comprehensive progress tracking, this tool showcases the potential of Generative AI in revolutionizing music education and vocal development.

🛠 Tech Stack

Frontend

  • React: Modern component-based UI framework
  • TypeScript: Type-safe development with enhanced IDE support
  • Vite: Fast build tool and development server
  • Tailwind CSS: Utility-first CSS framework for responsive design
  • Framer Motion: Smooth animations and transitions

Backend

  • FastAPI: High-performance Python web framework with automatic API docs
  • Python: Core backend language with extensive AI/ML libraries
  • Uvicorn: ASGI server for production deployment

AI Services

  • Letta: Stateful conversational AI with long-term memory capabilities
  • Fetch.ai: Autonomous agent system for proactive analysis and reporting
  • VAPI: Voice AI platform for real-time voice conversations

Database & Authentication

  • Supabase: PostgreSQL database with built-in authentication and real-time features
  • Row Level Security (RLS): Secure user data isolation

Voice Processing

  • Web Audio API: Real-time audio processing and pitch detection
  • Custom Voice Analyzer: Advanced vocal metrics calculation

Hosting & Deployment

  • Netlify: Frontend hosting with automatic deployments
  • Google Cloud Run: Scalable backend container hosting
  • Docker: Containerized backend for consistent deployments

How It Works

Intent Recognition and Routing System

VocalAIAgent uses a sophisticated routing system that directs user requests to appropriate handlers based on vocal coaching context:

def interpret_vocal_request(user_input, session_context):
    prompt = (
        "You are a vocal coaching function router. Based on the user message and session context, "
        "output ONLY a Python dictionary.\n"
        "- If asking for voice analysis: {'intent': 'analyze_voice', 'session_type': '<TYPE>'}\n"  
        "- If requesting lesson: {'intent': 'start_lesson', 'category': '<CATEGORY>', 'level': '<LEVEL>'}\n"
        "- If seeking feedback: {'intent': 'get_feedback', 'aspect': '<VOCAL_ASPECT>'}\n"
        f"Session Context: {session_context}\n"
        f"User: {user_input}"
    )
    # Route to appropriate vocal coaching handler
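
Because the router asks the model to emit a Python dict literal, the reply should be parsed defensively before dispatching. A minimal sketch using `ast.literal_eval` with a fallback intent (the `clarify_request` fallback name is illustrative):

```python
import ast

def parse_router_output(raw):
    """Safely parse the LLM's dict literal; never eval() untrusted output."""
    try:
        action = ast.literal_eval(raw.strip())
        if isinstance(action, dict) and "intent" in action:
            return action
    except (ValueError, SyntaxError):
        pass
    # Fall back to a clarification turn when the reply is malformed
    return {"intent": "clarify_request"}

print(parse_router_output("{'intent': 'analyze_voice', 'session_type': 'warmup'}"))
print(parse_router_output("sure, let me help!"))  # malformed → fallback intent
```

`ast.literal_eval` only accepts Python literals, so a model reply containing arbitrary code raises instead of executing.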

Vocal Analysis Pipeline

The voice analysis system combines multiple AI techniques:

  1. Real-time Processing: Web Audio API captures and processes audio in real-time
  2. Feature Extraction: Advanced algorithms extract vocal characteristics (pitch, formants, etc.)
  3. AI Classification: Machine learning models classify voice type and detect patterns
  4. Contextual Analysis: Results are interpreted within the user's vocal development context
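
Steps 2 and 3 can be illustrated with the standard local definitions of jitter (cycle-to-cycle pitch-period variation) and shimmer (cycle-to-cycle amplitude variation). The input arrays below stand in for values extracted from the audio stream; this is a sketch of the metric, not the project's analyzer code:

```python
def jitter_percent(periods):
    """Mean absolute difference between consecutive pitch periods,
    relative to the mean period (local jitter, in percent)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_diff / mean_period

def shimmer_percent(amplitudes):
    """Same definition applied to per-cycle peak amplitudes
    (local shimmer, in percent)."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_amp = sum(amplitudes) / len(amplitudes)
    return 100.0 * mean_diff / mean_amp

# Pitch periods in ms from a steady sung note (illustrative numbers)
print(round(jitter_percent([5.0, 5.1, 4.9, 5.0, 5.05]), 2))
```

Healthy sustained phonation typically shows low jitter and shimmer; elevated values flag instability the coach can target with specific exercises.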

Dual-AI Architecture

Proactive Fetch.ai Agent

class VocalCoachAgent:
    """Autonomous agent for vocal analysis and report generation"""

    async def generate_daily_reports(self):
        """Automatically analyze practice sessions and generate insights"""
        users = await self.get_active_users()
        for user_id in users:
            # Analyze vocal progress
            report = await self.analyze_vocal_progress(user_id)
            # Store insights for Letta conversations
            await self.store_insights(user_id, report)

Reactive Letta Conversational Agent

class LettaVocalCoach:
    """Stateful conversational coach with long-term memory"""

    async def generate_response(self, context, user_message):
        """Generate contextual coaching based on vocal analysis data"""
        # Retrieve user's vocal history and analysis
        vocal_context = await self.build_vocal_context(context.user_id)

        # Generate personalized coaching response
        # Generate personalized coaching response
        response = await self.letta_client.agents.messages.create(
            agent_id=self.agent_id,
            messages=[{
                "role": "user",
                "content": f"Vocal Context: {vocal_context}\nUser: {user_message}"
            }]
        )
        return response

Session Management and Memory

VocalAIAgent maintains comprehensive session state and user memory:

async def start_vocal_session():
    session_memory = {
        'vocal_profile': {
            'voice_type': user.voice_type,
            'skill_level': user.skill_level,
            'practice_goals': user.goals,
            'vocal_range': user.range_analysis
        },
        'practice_history': [],
        'current_focus_areas': [],
        'last_analysis_results': {}
    }

    # Main coaching loop with state management
    session_active = True
    while session_active:
        user_input = await get_user_input()
        action = interpret_vocal_request(user_input, session_memory)

        if action['intent'] == 'analyze_voice':
            await handle_voice_analysis(action, session_memory)
        elif action['intent'] == 'start_lesson':
            await handle_lesson_start(action, session_memory)
        # ... additional handlers

Current Capabilities Demonstrated

Voice Analysis & Processing

  • Real-time pitch detection with Web Audio API
  • Advanced vocal metrics (jitter, shimmer, vibrato)
  • Voice type classification and range analysis
  • Session recording and playback capabilities

AI-Powered Coaching

  • Fetch.ai autonomous agents for progress analysis
  • Letta conversational AI with stateful memory
  • VAPI real-time voice conversations
  • Personalized lesson and exercise generation

Data Management & Persistence

  • Comprehensive user vocal profiles
  • Session history and progress tracking
  • Lesson feedback storage and retrieval
  • Secure multi-user data isolation

User Experience Features

  • Modern, responsive React interface
  • Real-time visual feedback during voice sessions
  • Progress dashboards and analytics
  • Community features and challenges

Current Errors and Solutions

Issues Identified:

  1. URL Construction Error: Double slash in API endpoints causing malformed URLs
  2. Database Connection Issues: Lesson feedback storage failing due to Supabase credential problems
  3. Error Handling: Generic error messages making debugging difficult

Solutions Implemented:

  1. Fixed URL Construction: Added trailing slash removal in frontend API calls
  2. Enhanced Error Logging: Improved backend error reporting with detailed messages
  3. Database Health Checks: Added endpoints to verify service connectivity
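
The double-slash fix boils down to normalizing the seam between the API base URL and the request path. The frontend applies this in its fetch helper; the same normalization sketched in Python:

```python
def join_api_url(base, path):
    """Join a base URL and a path without producing '//' at the seam."""
    return base.rstrip("/") + "/" + path.lstrip("/")

print(join_api_url("https://api.example.com/", "/v1/analyze"))
# → https://api.example.com/v1/analyze
```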

Limitations & Future Work

Current Limitations:

  • Voice Processing Accuracy: Browser-based analysis has limitations compared to specialized hardware
  • AI Model Training: Limited training data for vocal coaching specific AI models
  • Scalability: Current architecture needs optimization for large-scale deployment

Future Enhancements:

High Priority:

  1. Enhanced Voice Processing: Integrate professional-grade voice analysis libraries
  2. Advanced AI Models: Fine-tune models specifically for vocal coaching contexts
  3. Mobile Applications: Native iOS/Android apps with enhanced voice processing

Medium Priority:

  1. Social Features: Enhanced community aspects with vocal challenges and peer learning
  2. Integration Ecosystem: Connect with music learning platforms and DAWs
  3. Offline Capabilities: Voice analysis and basic coaching without internet connection

Built With

Core Technologies

  • React 18 with TypeScript for modern, type-safe frontend development
  • FastAPI for high-performance Python backend with automatic API documentation
  • Supabase for PostgreSQL database, authentication, and real-time features
  • Tailwind CSS for responsive, utility-first styling

AI & Voice Technologies

  • Letta for stateful conversational AI with long-term memory
  • Fetch.ai for autonomous agent systems and proactive analysis
  • VAPI for real-time voice AI conversations
  • Web Audio API for browser-based voice processing

DevOps & Deployment

  • Docker for containerized backend deployment
  • Google Cloud Run for scalable, serverless backend hosting
  • Netlify for frontend hosting with automatic deployments

VocalAIAgent demonstrates the transformative potential of AI in music education, combining cutting-edge voice processing, conversational AI, and personalized coaching to create a comprehensive vocal training platform.
