StageWise: Master the stage before you step on it.
Inspiration
Public speaking is often cited as one of the most common phobias (glossophobia). Whether it's a student defending a thesis, a startup founder pitching to VCs, or an employee giving a quarterly update, the pressure is immense.
Most people practice in one of two ways:
- The Mirror: Talking to a reflection that cannot give feedback.
- The Recording: Filming themselves, which is passive and often painful to watch back.
We were inspired to build StageWise to bridge this gap. We wanted to create a supportive, objective, and real-time coach that acts like a human expert sitting in the front row—someone who can see your body language, hear your tone, and gently guide you to improvement before the stakes are high.
What it does
StageWise is a real-time, multimodal AI presentation coach. It turns your browser into a private rehearsal studio.
- Listens: It captures your speech, analyzing tone, clarity, and pacing.
- Watches: It uses your webcam to analyze non-verbal cues like eye contact, posture, and facial expressions.
- Coaches: Unlike standard voice assistants, StageWise interrupts intelligently. If you are speaking too fast or slouching, it offers a gentle verbal nudge.
- Analyzes: After the session, it generates a comprehensive "Report Card" covering:
- Quantitative Metrics: Words Per Minute (WPM), filler word usage counts (um, uh, like), and weak language detection.
- Qualitative Insights: Specific feedback on strengths, growth areas, and pronunciation.
How we built it
We built StageWise using a modern web stack centered around the Google Gemini Multimodal Live API.
The Core Stack
- Frontend: React (TypeScript) for a type-safe, component-based UI.
- Styling: Tailwind CSS for a sleek, dark-mode "studio" aesthetic.
- AI Engine: Google Gemini (gemini-2.5-flash-native-audio-preview-12-2025) via the @google/genai SDK (see the connection sketch below).
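To give a sense of the wiring, here is a minimal sketch of how a Live session is opened with the @google/genai SDK. The system instruction text is illustrative, and exact config field shapes can vary between SDK versions, so treat this as a sketch rather than our exact production code:

```typescript
import { GoogleGenAI, Modality } from '@google/genai';

// Sketch only: key handling and config are simplified; check your SDK version
// for the exact option shapes.
const ai = new GoogleGenAI({ apiKey: 'YOUR_GEMINI_API_KEY' });

const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: [Modality.AUDIO], // the coach replies with spoken audio
    systemInstruction:
      'You are a presentation coach. Listen and watch; interrupt only when the speaker needs a gentle nudge.',
  },
  callbacks: {
    onopen: () => console.log('Live session opened'),
    onmessage: (msg) => {
      // msg carries audio chunks, transcriptions, turn signals, etc.
      console.log(msg);
    },
    onerror: (err) => console.error(err),
    onclose: () => console.log('Live session closed'),
  },
});
```

From there, microphone PCM and webcam JPEGs are pushed through the session's realtime input channel, and the model streams audio back for playback.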
The Engineering
The biggest technical feat was establishing a low-latency, bidirectional audio/video stream directly from the browser.
- Audio Processing (The Ears): Browsers capture audio as Float32 arrays, but Gemini expects 16 kHz Int16 PCM data. We built a custom audio pipeline using the Web Audio API's ScriptProcessorNode to downsample and convert the raw audio stream in real time (first sketch after this list): $$ S_{pcm} = \max(-1, \min(1, S_{input})) \times 32767 $$
- Visual Context (The Eyes): We didn't just want a voice bot. We capture frames from the user's webcam via an off-screen HTML5 <canvas>, convert them to low-resolution JPEGs, and stream them to the model's realtimeInput channel at ~1 FPS, which lets Gemini "see" a smile or a frown (second sketch below).
- Real-time Analytics: We calculate pace (Words Per Minute) on the fly using a sliding-window algorithm (third sketch below): $$ \text{WPM} = \frac{\sum_{t=\text{now}-15\text{s}}^{\text{now}} \text{words}(t)}{15\,\text{s}} \times 60 $$
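Here is a minimal sketch of the downsample-and-convert step, assuming a default 44.1/48 kHz AudioContext and naive decimation (a production resampler would low-pass filter first; buffer sizes and helper names are illustrative):

```typescript
// Capture mic audio, downsample to 16 kHz, and convert Float32 -> Int16 PCM.
async function startMicCapture(onPcmChunk: (pcm: Int16Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext(); // typically 44.1 or 48 kHz
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);

  const targetRate = 16000;
  const ratio = ctx.sampleRate / targetRate;

  processor.onaudioprocess = (event) => {
    const input = event.inputBuffer.getChannelData(0); // Float32Array in [-1, 1]
    const outLength = Math.floor(input.length / ratio);
    const pcm = new Int16Array(outLength);

    for (let i = 0; i < outLength; i++) {
      // Naive nearest-sample decimation down to 16 kHz.
      const sample = input[Math.floor(i * ratio)];
      // S_pcm = clamp(S_input, -1, 1) * 32767
      pcm[i] = Math.max(-1, Math.min(1, sample)) * 32767;
    }
    onPcmChunk(pcm);
  };

  source.connect(processor);
  processor.connect(ctx.destination); // ScriptProcessorNode must be connected to fire
}
```

Each Int16 chunk is then base64-encoded and sent to the model as 16 kHz PCM over the realtime input channel.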
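The webcam side is simpler. A sketch of the ~1 FPS frame grab, assuming a <video> element already attached to the camera stream (the sendFrame callback, canvas resolution, and JPEG quality are illustrative):

```typescript
// Grab a low-resolution JPEG from the webcam roughly once per second.
function startFrameCapture(
  video: HTMLVideoElement,
  sendFrame: (base64Jpeg: string) => void
) {
  const canvas = document.createElement('canvas'); // off-screen, never added to the DOM
  canvas.width = 320;
  canvas.height = 240;
  const ctx = canvas.getContext('2d')!;

  return setInterval(() => {
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    // toDataURL returns "data:image/jpeg;base64,..."; strip the prefix before sending.
    const dataUrl = canvas.toDataURL('image/jpeg', 0.6);
    sendFrame(dataUrl.split(',')[1]);
  }, 1000); // ~1 FPS keeps bandwidth low while still letting the model "see" posture
}
```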
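And the pace meter itself, as a plain sliding-window helper. Word timestamps come from the incoming transcription events; the 15-second window matches the formula above, and the function names are illustrative:

```typescript
// Sliding-window Words Per Minute over the last 15 seconds of speech.
const WINDOW_MS = 15_000;
const wordTimestamps: number[] = [];

// Call this once per transcribed word (or per transcript chunk with its word count).
function recordWords(count: number, now = Date.now()) {
  for (let i = 0; i < count; i++) wordTimestamps.push(now);
}

function currentWpm(now = Date.now()): number {
  // Drop words that have fallen out of the 15 s window.
  while (wordTimestamps.length && wordTimestamps[0] < now - WINDOW_MS) {
    wordTimestamps.shift();
  }
  // words-in-window / 15 s, scaled up to a full minute.
  return (wordTimestamps.length / (WINDOW_MS / 1000)) * 60;
}
```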
Challenges we ran into
- Audio Encoding/Decoding: The Gemini Live API sends raw PCM chunks without headers. We had to manually decode these binary chunks into AudioBuffers and schedule them precisely on the browser's AudioContext timeline to prevent "clicking" or gaps in the AI's voice (see the playback sketch after this list).
- Race Conditions: Managing state between the React component lifecycle (re-renders) and the persistent WebSocket connection was tricky. We had to use useRef hooks extensively to ensure the audio processor didn't send data to a closed socket (second sketch below).
- Prompt Engineering: Getting the AI to be a "Coach" and not a "Chatbot" required careful system instructions. We had to tune it to interrupt less and listen more, speaking up only when necessary.
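For the playback challenge, the core trick is to keep a running "next start time" cursor on the AudioContext clock and queue every decoded chunk back-to-back. A sketch, assuming the model's audio arrives as Int16 PCM and that OUTPUT_RATE matches the stream's actual sample rate (we treat that rate as a configuration detail here):

```typescript
// Schedule raw PCM chunks back-to-back on the AudioContext timeline so the
// coach's voice plays without clicks or gaps.
const playbackCtx = new AudioContext();
const OUTPUT_RATE = 24000; // assumption: the Live stream's output sample rate
let nextStartTime = 0;

function enqueuePcmChunk(pcm: Int16Array) {
  // Convert Int16 back to Float32 in [-1, 1].
  const floats = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) floats[i] = pcm[i] / 32768;

  const buffer = playbackCtx.createBuffer(1, floats.length, OUTPUT_RATE);
  buffer.copyToChannel(floats, 0);

  const source = playbackCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(playbackCtx.destination);

  // Never schedule in the past; otherwise butt each chunk against the previous one.
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  source.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```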
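The race-condition fix boils down to reading the live session through a ref inside long-lived callbacks instead of capturing stale state in a closure. A simplified sketch; the LiveSession shape and sendAudio helper are illustrative, not the SDK's actual types:

```typescript
import { useEffect, useRef } from 'react';

// Illustrative shape: whatever object your Live SDK returns for an open session.
type LiveSession = { sendAudio: (pcm: Int16Array) => void; close: () => void };

function useLiveAudio(createSession: () => Promise<LiveSession>) {
  const sessionRef = useRef<LiveSession | null>(null);

  useEffect(() => {
    let cancelled = false;

    createSession().then((session) => {
      if (cancelled) { session.close(); return; }
      sessionRef.current = session;
    });

    return () => {
      cancelled = true;
      sessionRef.current?.close();
      sessionRef.current = null; // audio callbacks see null and stop sending
    };
  }, [createSession]);

  // The audio processor calls this on every chunk; it never touches a closed socket.
  return (pcm: Int16Array) => sessionRef.current?.sendAudio(pcm);
}
```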
Accomplishments that we're proud of
- True Multimodality: We aren't just sending text. We are sending raw audio and images simultaneously, creating a genuinely immersive experience.
- The Pace Meter: Building a visual speedometer that reacts instantly to how fast the user is talking was a fun UI challenge that adds immediate value.
- Zero-Server Latency: By connecting directly from the client to Gemini, we eliminated the need for an intermediate backend server, reducing latency to the absolute minimum.
What we learned
- Web Audio API is powerful but raw: Working with raw PCM data taught us a lot about digital signal processing (DSP) in JavaScript.
- Multimodal is the future: Seeing the AI respond to a visual gesture (like a wave) while analyzing speech context proved that multimodal models are a paradigm shift from text-only LLMs.
- Feedback loops matter: Providing visual feedback (the WPM gauge) alongside verbal feedback creates a much more effective learning loop for the user.
What's next for StageWise
- Screen Sharing Analysis: Using getDisplayMedia to let the AI see your PowerPoint slides and critique text density or visual design (see the sketch after this list).
- Custom Personas: Allowing users to choose their coach's style, e.g., "The Drill Sergeant," "The Supportive Friend," or "The TED Talk Curator."
- Longitudinal Tracking: Adding user accounts to track improvement in public speaking metrics over weeks or months.
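Screen sharing would reuse the existing frame pipeline; only the media source changes. A rough sketch of how the capture might start, assuming the shared track is fed into the same canvas-grab loop shown earlier:

```typescript
// Ask the user to share a screen/window and attach it to a <video> element,
// so the same ~1 FPS JPEG capture loop can feed slides to the model.
async function startScreenShare(video: HTMLVideoElement) {
  const stream = await navigator.mediaDevices.getDisplayMedia({
    video: { frameRate: 5 }, // a low frame rate is plenty for static slides
    audio: false,
  });
  video.srcObject = stream;
  await video.play();
  return stream;
}
```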
Built With
- google-gemini-api
- react
- tailwindcss
- typescript
- video-api
- web-audio-api