Inspiration
Public speaking and debate are skills that can make or break opportunities — whether it's acing a presentation, winning a debate competition, or simply communicating ideas confidently. But most people don't have access to a personal coach who can give them real-time, objective feedback. We wanted to build an AI-powered debate coach that's available 24/7, analyzing not just what you say, but how you say it — tone, emotions, and body language. Polly AI was born from the idea that everyone deserves personalized coaching to become a better communicator.
What it does
Polly AI is a real-time debate coaching platform. When you connect, you're greeted and assigned a random debate topic. You can then practice your argument by typing or recording your voice. As you speak, Polly AI:
- Tracks your facial expressions frame-by-frame using your webcam to detect emotions (happy, sad, confident, nervous, etc.)
- Analyzes your voice for pitch, energy, confidence score, and tone characteristics
- Transcribes your speech to text
- Evaluates your argument structure, persuasiveness, and delivery using Google Gemini AI
You receive instant, personalized feedback on your performance — including what you did well, where you can improve, and actionable tips for your next practice session. It's like having a debate coach who never sleeps.
How we built it
Frontend: React.js with WebSocket integration for real-time communication, React Webcam for live video streaming, and React Markdown for formatted AI responses.
Backend: FastAPI handles WebSocket connections and concurrent processing. We built a modular service architecture:
- Emotion Service: Uses DeepFace and OpenCV to analyze facial expressions from video frames
- Voice Analysis Service: Leverages librosa for pitch detection, energy measurement, and confidence scoring
- Speech Service: Converts audio to text (currently a mock, designed for Google Cloud Speech-to-Text integration)
- Chat Service: Integrates Google Gemini AI for intelligent coaching and feedback
- Topic Service: Generates random debate topics from a database
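For a flavor of how the Voice Analysis Service approaches confidence scoring, here is a minimal sketch of energy-based analysis. The real service uses librosa; this version uses only NumPy, and the confidence heuristic (louder, steadier speech scores higher) is a simplified stand-in for our actual scoring:

```python
import numpy as np

def analyze_energy(samples: np.ndarray, frame_len: int = 2048) -> dict:
    """Compute per-frame RMS energy and a naive confidence heuristic.

    `samples` is a mono float waveform in [-1, 1]. The confidence score
    here is a hypothetical placeholder, not our production formula.
    """
    # Split the waveform into non-overlapping frames.
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # RMS energy per frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    mean_energy = float(rms.mean())
    # Steadiness: low relative variation in energy -> higher confidence.
    steadiness = 1.0 - min(1.0, float(rms.std() / (mean_energy + 1e-9)))
    confidence = round(0.5 * min(1.0, mean_energy * 10) + 0.5 * steadiness, 2)
    return {"mean_energy": mean_energy, "confidence_score": confidence}

# Example: a steady 440 Hz tone should score as fairly "confident".
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = 0.3 * np.sin(2 * np.pi * 440 * t)
result = analyze_energy(tone)
```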
Challenges we ran into
WebSocket synchronization: Managing real-time video frame processing while keeping the chat interface responsive was tricky. We had to carefully balance frame processing intervals to avoid overwhelming the backend.
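The balancing idea boils down to dropping incoming frames while a previous one is still being analyzed, so slow processing never backs up the WebSocket loop. A simplified sketch (the class and names are illustrative, not our exact code):

```python
import asyncio

class FrameThrottler:
    """Skip webcam frames that arrive while analysis is still in flight."""

    def __init__(self):
        self._busy = False
        self.processed = 0
        self.dropped = 0

    async def submit(self, frame, analyze):
        if self._busy:
            self.dropped += 1          # drop the frame instead of queueing it
            return
        self._busy = True
        try:
            await analyze(frame)
            self.processed += 1
        finally:
            self._busy = False

async def demo():
    throttler = FrameThrottler()

    async def slow_analyze(frame):
        await asyncio.sleep(0.05)      # pretend emotion detection takes 50 ms

    # Simulate ~30 fps arrival: a new frame roughly every 10 ms.
    tasks = []
    for i in range(20):
        tasks.append(asyncio.create_task(throttler.submit(i, slow_analyze)))
        await asyncio.sleep(0.01)
    await asyncio.gather(*tasks)
    return throttler.processed, throttler.dropped

processed, dropped = asyncio.run(demo())
```

Because analysis takes several frame intervals, most frames are dropped and only a steady trickle reaches the backend, which is exactly the behavior we wanted.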
Audio encoding issues: Converting browser-recorded audio (WebM) to a format suitable for speech analysis required handling multiple codec formats and base64 encoding correctly.
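A minimal sketch of the decoding step, assuming the browser sends base64-encoded audio that may be wrapped in a data URL (the exact payload shape here is illustrative):

```python
import base64

def decode_browser_audio(payload: str) -> bytes:
    """Decode audio sent from the browser over the WebSocket.

    Handles both raw base64 and a full data URL like
    'data:audio/webm;codecs=opus;base64,GkXf...'.
    """
    if payload.startswith("data:"):
        # Split off the 'data:<mime>;base64,' header.
        _, payload = payload.split(",", 1)
    return base64.b64decode(payload)

# The decoded bytes are still WebM/Opus; converting them to WAV/PCM for
# analysis typically requires ffmpeg (e.g. via pydub's
# AudioSegment.from_file(io.BytesIO(webm_bytes), format="webm")).

raw = decode_browser_audio(
    "data:audio/webm;base64," + base64.b64encode(b"\x1aE\xdf\xa3").decode()
)
```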
Emotion detection accuracy: DeepFace sometimes struggled with varying lighting conditions and camera angles. We had to add robust error handling and fallback mechanisms when faces weren't detected.
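The fallback logic can be sketched like this. The detector is injected as a callable (in our backend it wraps DeepFace.analyze) so the error handling can be shown without the heavy dependency:

```python
def analyze_emotion_safely(frame, detector, fallback="neutral"):
    """Run an emotion detector over one video frame, degrading gracefully
    when no face is found or the detector raises (poor lighting, odd angle).

    `detector` is any callable returning a dict of emotion scores.
    """
    try:
        scores = detector(frame)
        if not scores:                 # no face detected in this frame
            return {"dominant": fallback, "scores": {}, "detected": False}
        dominant = max(scores, key=scores.get)
        return {"dominant": dominant, "scores": scores, "detected": True}
    except Exception:
        # Never let a detection failure crash the WebSocket loop.
        return {"dominant": fallback, "scores": {}, "detected": False}

# Fake detectors standing in for DeepFace in this demo:
ok = analyze_emotion_safely(None, lambda f: {"happy": 0.9, "sad": 0.1})
missing = analyze_emotion_safely(None, lambda f: {})
crashed = analyze_emotion_safely(None, lambda f: 1 / 0)
```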
Context management: Making sure the AI understood the full context of a debate session (topic, previous messages, emotion state) while generating relevant feedback required careful prompt engineering.
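Roughly, each feedback turn assembles the session context into a single prompt before calling Gemini. A simplified sketch (field names and wording are illustrative, not our production prompt):

```python
def build_coaching_prompt(topic, transcript, emotion, history, max_turns=4):
    """Assemble the context the LLM needs for one feedback turn.

    Keeps only the last few exchanges so the prompt stays compact.
    """
    recent = history[-max_turns:]
    lines = [
        "You are a supportive debate coach.",
        f"Debate topic: {topic}",
        f"Speaker's dominant emotion right now: {emotion}",
        "Recent conversation:",
    ]
    lines += [f"  {role}: {text}" for role, text in recent]
    lines += [
        f"Speaker's latest argument: {transcript}",
        "Give feedback on structure, persuasiveness, and delivery, "
        "then one actionable tip.",
    ]
    return "\n".join(lines)

prompt = build_coaching_prompt(
    topic="Social media does more harm than good",
    transcript="Studies show screen time correlates with anxiety...",
    emotion="nervous",
    history=[("coach", "Welcome! Here's your topic."),
             ("user", "I'll argue the affirmative.")],
)
```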
CORS and WebSocket configuration: Getting the frontend and backend to communicate smoothly across different ports during development took significant debugging.
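For the HTTP side, FastAPI's CORSMiddleware covers the cross-port setup. A typical development configuration (port 3000 is the create-react-app default; adjust to your ports):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# During development the React dev server runs on a different port than
# FastAPI, so the browser blocks requests unless CORS allows that origin.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Note: native WebSocket handshakes are not subject to CORS preflight,
# so the server should still validate the Origin header on /ws routes.
```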
Accomplishments that we're proud of
- Built a fully functional real-time coaching system with live emotion detection running at 1 frame per second
- Successfully integrated multiple AI services (computer vision, audio analysis, and LLM) into a cohesive user experience
- Created an intuitive chat interface with markdown support that makes AI feedback easy to read and actionable
- Implemented a complete speech-to-text pipeline ready for production API integration
- Designed a modular backend architecture that's scalable and easy to extend with new features
- Got the entire tech stack working together seamlessly — from webcam capture to AI-generated feedback in under 5 seconds
What we learned
Technical skills: We deepened our understanding of WebSocket architecture, asynchronous Python programming, real-time video processing, audio signal analysis, and LLM prompt engineering.
AI integration: We learned how to combine multiple AI models (computer vision, audio analysis, NLP) into a single application and handle their different latency requirements.
User experience: Real-time feedback is powerful, but it needs to be presented in a way that's encouraging rather than overwhelming. We learned to balance detailed metrics with actionable insights.
System design: Building a system that processes video, audio, and text simultaneously taught us about resource management, concurrent processing, and graceful error handling.
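The "simultaneously" part maps naturally onto asyncio: independent analyses run concurrently, so per-turn latency is the slowest stage rather than the sum of all three. A toy sketch with stand-in coroutines (the real pipeline calls DeepFace, librosa, and speech-to-text):

```python
import asyncio

async def analyze_frame(frame):
    await asyncio.sleep(0.03)          # stand-in for the emotion pass
    return {"emotion": "happy"}

async def analyze_audio(chunk):
    await asyncio.sleep(0.02)          # stand-in for voice metrics
    return {"confidence_score": 0.8}

async def transcribe(chunk):
    await asyncio.sleep(0.04)          # stand-in for speech-to-text
    return "I believe that..."

async def process_turn(frame, chunk):
    # Run the three independent analyses concurrently instead of serially.
    emotion, voice, text = await asyncio.gather(
        analyze_frame(frame), analyze_audio(chunk), transcribe(chunk)
    )
    return {**emotion, **voice, "transcript": text}

result = asyncio.run(process_turn(b"frame", b"audio"))
```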
What's next for Polly-AI
- Real Speech-to-Text: Integrate Google Cloud Speech-to-Text API to replace mock transcription with production-grade accuracy.
- Performance tracking: Build a dashboard showing progress over time — tracking improvements in confidence, reductions in filler words, and emotion consistency — with a simple UI.
- Advanced metrics: Add gesture recognition, body language analysis, and speaking pace visualization.
- Practice modes: Different coaching modes for debates, presentations, mock interviews, and casual conversation practice.
- Social features: Peer comparison, leaderboards, and the ability to share practice sessions with friends or coaches.
- Mobile app: iOS/Android apps for practicing on the go.
- Custom topics: Allow users to create and practice with their own debate topics or upload presentation scripts that can be graded.