Inspiration

Public speaking and debate are skills that can make or break opportunities — whether it's acing a presentation, winning a debate competition, or simply communicating ideas confidently. But most people don't have access to a personal coach who can give them real-time, objective feedback. We wanted to build an AI-powered debate coach that's available 24/7, analyzing not just what you say, but how you say it — tone, emotions, and body language. Polly AI was born from the idea that everyone deserves personalized coaching to become a better communicator.

What it does

Polly AI is a real-time debate coaching platform. When you connect, you're greeted and assigned a random debate topic. You can then practice your argument by typing or recording your voice. As you speak, Polly AI:

  • Tracks your facial expressions frame-by-frame using your webcam to detect emotions (happy, sad, surprised, neutral, and so on)
  • Analyzes your voice for pitch, energy, confidence score, and tone characteristics
  • Transcribes your speech to text
  • Evaluates your argument structure, persuasiveness, and delivery using Google Gemini AI

You receive instant, personalized feedback on your performance — including what you did well, where you can improve, and actionable tips for your next practice session. It's like having a debate coach who never sleeps.

How we built it

Frontend: React.js with WebSocket integration for real-time communication, React Webcam for live video streaming, and React Markdown for formatted AI responses.

Backend: FastAPI handles WebSocket connections and concurrent processing. We built a modular service architecture:

  • Emotion Service: Uses DeepFace and OpenCV to analyze facial expressions from video frames
  • Voice Analysis Service: Leverages librosa for pitch detection, energy measurement, and confidence scoring
  • Speech Service: Converts audio to text (currently mocked, designed for Google Cloud Speech-to-Text integration)
  • Chat Service: Integrates Google Gemini AI for intelligent coaching and feedback
  • Topic Service: Generates random debate topics from a database

Challenges we ran into

WebSocket synchronization: Managing real-time video frame processing while keeping the chat interface responsive was tricky. We had to carefully balance frame processing intervals to avoid overwhelming the backend.
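The throttling idea can be sketched in a few lines: process at most one frame per interval and silently drop the rest, so the backend never floods. The class name is hypothetical; the 1-second default matches the ~1 fps processing rate we settled on.

```python
import time

class FrameThrottle:
    """Admit at most one frame per `interval_s`; drop everything in between."""

    def __init__(self, interval_s: float = 1.0):
        self.interval_s = interval_s
        self._last = float("-inf")

    def should_process(self, now=None) -> bool:
        # `now` can be injected for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        if now - self._last >= self.interval_s:
            self._last = now
            return True
        return False
```

Dropping frames client-side (before they ever hit the WebSocket) is cheaper still, but the same gate works on either end.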

Audio encoding issues: Converting browser-recorded audio (WebM) to a format suitable for speech analysis required handling multiple codec formats and base64 encoding correctly.
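The decoding half of that fix looks roughly like this, assuming the browser ships audio as a base64 data URL (e.g. `data:audio/webm;codecs=opus;base64,...`); the WebM-to-WAV conversion that follows (via ffmpeg) is omitted.

```python
import base64

def decode_audio_payload(data_url: str) -> bytes:
    """Strip a data-URL header and return the raw recorded bytes."""
    # Drop the "data:...;base64," prefix if present.
    b64 = data_url.split(",", 1)[1] if "," in data_url else data_url
    # Browsers sometimes omit base64 padding; restore it before decoding.
    b64 += "=" * (-len(b64) % 4)
    return base64.b64decode(b64)
```

The padding repair was the subtle part for us: a payload whose length is not a multiple of four raises `binascii.Error` without it.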

Emotion detection accuracy: DeepFace sometimes struggled with varying lighting conditions and camera angles. We had to add robust error handling and fallback mechanisms when faces weren't detected.
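The fallback pattern is simple in outline. In our stack the detector is `DeepFace.analyze`, which can raise when no face is found; here it is an injected callable so the sketch stays self-contained, and the neutral default is illustrative.

```python
NEUTRAL = {"dominant_emotion": "neutral", "face_detected": False}

def safe_detect_emotion(frame, detector):
    """Run `detector` on one frame; degrade to a neutral result on failure."""
    try:
        result = detector(frame)
        return {"dominant_emotion": result["dominant_emotion"],
                "face_detected": True}
    except Exception:
        # No face, bad lighting, or a decode error: fall back gracefully
        # instead of breaking the session.
        return dict(NEUTRAL)
```

Returning a neutral reading (rather than an error) lets the coaching loop keep running through the occasional bad frame.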

Context management: Making sure the AI understood the full context of a debate session (topic, previous messages, emotion state) while generating relevant feedback required careful prompt engineering.
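A hypothetical sketch of that prompt assembly: topic, the last few transcript turns, and the latest emotion/voice readings get serialized into one structured prompt for Gemini. Field names and wording are illustrative, not our production prompt.

```python
def build_coaching_prompt(topic, history, emotion, voice):
    """Fold session context into a single structured coaching prompt."""
    recent = "\n".join(f"- {turn}" for turn in history[-5:])  # cap context size
    return (
        f"You are a debate coach. Topic: {topic}\n"
        f"Recent speaker turns:\n{recent}\n"
        f"Detected emotion: {emotion}; "
        f"voice confidence: {voice['confidence']:.2f}\n"
        "Give concise, encouraging feedback on structure, "
        "persuasiveness, and delivery."
    )
```

Capping the history window was the practical lesson: sending every prior turn blew up latency without improving the feedback.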

CORS and WebSocket configuration: Getting the frontend and backend to communicate smoothly across different ports during development took significant debugging.

Accomplishments that we're proud of

  • Built a fully functional real-time coaching system with live emotion detection running at 1 frame per second
  • Successfully integrated multiple AI services (computer vision, audio analysis, and LLM) into a cohesive user experience
  • Created an intuitive chat interface with markdown support that makes AI feedback easy to read and actionable
  • Implemented a complete speech-to-text pipeline ready for production API integration
  • Designed a modular backend architecture that's scalable and easy to extend with new features
  • Got the entire tech stack working together seamlessly — from webcam capture to AI-generated feedback in under 5 seconds

What we learned

Technical skills: We deepened our understanding of WebSocket architecture, asynchronous Python programming, real-time video processing, audio signal analysis, and LLM prompt engineering.

AI integration: We learned how to combine multiple AI models (computer vision, audio analysis, NLP) into a single application and handle their different latency requirements.

User experience: Real-time feedback is powerful, but it needs to be presented in a way that's encouraging rather than overwhelming. We learned to balance detailed metrics with actionable insights.

System design: Building a system that processes video, audio, and text simultaneously taught us about resource management, concurrent processing, and graceful error handling.

What's next for Polly-AI

  • Real Speech-to-Text: Integrate Google Cloud Speech-to-Text API to replace mock transcription with production-grade accuracy.
  • Performance tracking: Build a dashboard showing progress over time: improvements in confidence, reductions in filler words, and emotion consistency, all in a simple UI.
  • Advanced metrics: Add gesture recognition, body language analysis, and speaking pace visualization.
  • Practice modes: Different coaching modes for debates, presentations, mock interviews, and casual conversation practice.
  • Social features: Peer comparison, leaderboards, and the ability to share practice sessions with friends or coaches.
  • Mobile app: iOS/Android apps for practicing on the go.
  • Custom topics: Allow users to create and practice with their own debate topics or upload presentation scripts that can be graded.

Built With

  • React.js (React Webcam, React Markdown)
  • FastAPI (Python) with WebSockets
  • DeepFace and OpenCV
  • librosa
  • Google Gemini
