Inspiration
Throughout my childhood, I struggled with communication. I could not talk to other kids fluidly, I mumbled and mispronounced words, and worst of all, my voice had no tone. I remember the hopeless feeling of trying to make conversation, only to fail. I was lucky enough to be enrolled in speech therapy, where I built speaking skills over a long period of time. Even with this intervention, I still struggle with my speech to this day. Recently, I had a virtual speech assignment that I struggled with for hours because my delivery was off. I realized that without someone actively coaching you, it is really difficult to see the shortcomings in your own speeches. I also started to think about those who do not have access to speech therapy, and how difficult it must be for them too. I then understood what needed to be made: a program that offers universal speech therapy by analyzing emotion, speaking speed, and speech intelligibility. A program like this would let anybody improve their speaking skills at no cost. I became increasingly passionate about this idea and decided this hackathon was the perfect place to bring it to life. Thus, Speechable was born.
What it does
Speechable offers speech coaching through speech analysis, transcription, and AI models. The user uploads an MP4 video file under 200 MB, and the program processes it. The user can then view which emotions were prevalent at different timestamps, along with feedback on their speech tailored by Gemini. If the user needs further help, an AI coach is ready to answer any questions!
How I built it
This program was built with Python, using Streamlit for an interactive front-end. I used a combination of technologies to analyze different aspects of speech:
- FFmpeg for video processing and audio extraction
- OpenAI's Whisper for accurate speech transcription
- Speech emotion recognition to detect emotional states throughout the speech
- Google's Gemini 1.5 Pro for generating personalized coaching and feedback
- Plotly for creating interactive visualizations of emotion patterns and speech metrics
- Streamlit for building an intuitive and responsive user interface
- Pre-trained models (speech emotion recognition, Whisper, Gemini) for analyzing speech patterns and generating feedback
The application follows a modular architecture with separate components for audio segmentation, speech analysis, and user interface, making it maintainable and extensible.
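As an illustration of the audio extraction step, here is a minimal sketch of how an FFmpeg command could be assembled from Python. The function name `build_extract_command`, the WAV output, and the 16 kHz mono settings are assumptions made for this example, not necessarily what Speechable uses.

```python
from pathlib import Path


def build_extract_command(video_path, audio_path, sample_rate=16000):
    """Return an FFmpeg argument list that extracts mono audio from a video.

    Hypothetical helper: the name and default sample rate are assumptions.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", str(Path(video_path)),
        "-vn",                    # drop the video stream, keep audio only
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the requested rate in Hz
        str(Path(audio_path)),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.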
Challenges I ran into
I encountered several technical challenges during development:
Audio segmentation issues: One of the most frustrating bugs was in the audio segmentation logic, where the final segment was occasionally not created because of floating-point precision errors in the duration calculations.
Emotion detection reliability: Getting consistent and accurate emotion detection across different speakers, accents, and recording qualities required extensive tuning and testing.
Integration of multiple AI models: Coordinating between the Whisper model for transcription and Gemini for coaching required careful handling of data formats and error cases.
Performance optimization: Processing large video files required optimizing the pipeline to prevent memory issues and reduce processing time.
UI responsiveness: Creating a responsive interface that could handle the dynamic nature of speech analysis results while maintaining a smooth user experience was challenging.
Cross-platform compatibility: Ensuring the application worked consistently across different operating systems required adapting my approach to file handling and path management.
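The segmentation bug described above is a common one. One way to avoid it, sketched here under assumptions, is to compute segment boundaries in integer milliseconds so that accumulated floating-point error cannot cause the last partial segment to be skipped. The function name `segment_bounds` and the millisecond approach are illustrative choices, not necessarily the fix used in Speechable.

```python
def segment_bounds(duration_s, segment_s):
    """Split [0, duration_s] into consecutive (start, end) windows in seconds.

    Hypothetical helper: boundaries are computed in whole milliseconds so the
    loop runs on integers and the final, shorter segment is always emitted.
    """
    total_ms = round(duration_s * 1000)
    step_ms = round(segment_s * 1000)
    if total_ms <= 0 or step_ms <= 0:
        return []

    bounds = []
    for start_ms in range(0, total_ms, step_ms):
        # Clamp the end so the last window stops at the end of the audio.
        end_ms = min(start_ms + step_ms, total_ms)
        bounds.append((start_ms / 1000, end_ms / 1000))
    return bounds
```

For a 10-second clip cut into 3-second windows, this yields three full segments plus a final 1-second segment covering 9.0 to 10.0.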
Accomplishments that I'm proud of
Despite the challenges, I achieved several notable accomplishments:
Holistic speech analysis: Successfully combined emotion detection, speech rate analysis, and content transcription into a single cohesive application.
Intuitive visualization system: Created an interactive visualization system that helps users understand their emotional patterns and speaking tendencies over time.
Personalized AI coaching: Implemented an AI coaching system that provides customized feedback based on individual speech patterns and offers interactive guidance.
Accessible design: Built an application that makes professional-level speech analysis accessible to anyone with an internet connection and a device with a camera.
Fast processing: Achieved reasonably fast processing times for video analysis, making the tool practical for regular use.
Privacy-first approach: Video and audio processing happens locally, with Gemini called only to generate feedback, which helps protect user privacy while still providing advanced speech analysis capabilities.
What I learned
This project provided numerous learning opportunities:
AI model integration: I gained practical experience integrating multiple AI models (Whisper, emotion detection, Gemini) into a single application.
Audio processing techniques: I developed a deeper understanding of audio segmentation, feature extraction, and analysis.
UI/UX design principles: I learned how to create an intuitive interface for a complex analytical tool, balancing functionality with usability.
Speech metrics: I researched and implemented important speech quality metrics like words per second (WPS), emotional variety, and clarity.
Error handling in ML pipelines: I developed robust error handling approaches for machine learning pipelines where failures can occur at multiple stages.
Cross-functional collaboration: I improved my ability to work across different domains (audio processing, ML, UI development) in a cohesive manner.
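The words-per-second metric mentioned above can be sketched as follows. This example assumes the transcript is a list of dictionaries with `start`, `end`, and `text` keys (the shape of Whisper's segment output). The function name `words_per_second` is hypothetical, and the real implementation may differ.

```python
def words_per_second(segments):
    """Compute the overall speaking rate from timestamped transcript segments.

    Assumes each segment is a dict with "start" and "end" times in seconds
    and a "text" string. Segments with no positive duration are skipped.
    """
    total_words = 0
    total_time = 0.0
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if duration <= 0:
            continue
        total_words += len(seg["text"].split())
        total_time += duration
    # Avoid dividing by zero when there is no usable speech.
    return total_words / total_time if total_time > 0 else 0.0
```

Calling the same function on a single segment gives a per-segment rate, which is one way to chart pace over time alongside the emotion timeline.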
What's next for Speechable
I have ambitious plans for the future of Speechable:
Mobile application: Develop a mobile version to make speech practice and analysis even more accessible.
Real-time feedback: Implement capabilities for providing feedback during live speeches or conversations.
Expanded metrics: Add additional speech quality metrics like filler word detection, pause analysis, and gesture recognition (for video).
Custom training programs: Create personalized training regimens based on a user's specific speech patterns and goals.
Community features: Build a supportive community where users can share experiences, practice together, and provide peer feedback.
Integration with presentation tools: Create plugins for PowerPoint, Google Slides, and other presentation tools to provide rehearsal feedback.
Multilingual support: Expand the system to support multiple languages and help non-native speakers improve their pronunciation and delivery.
Accessibility features: Add specialized analysis and coaching for users with speech impediments or other speech-related challenges.
By continuing to develop Speechable, I hope to make effective speech therapy and coaching available to everyone who needs it, regardless of their location, background, or resources.