StageWise: Master the stage before you step on it.
Inspiration
Public speaking is often cited as one of the most common phobias (glossophobia). Whether it's a student defending a thesis, a startup founder pitching to VCs, or an employee giving a quarterly update, the pressure is immense.
Most people practice in one of two ways:
- The Mirror: Talking to a reflection that cannot give feedback.
- The Recording: Filming themselves, which is passive and often painful to watch back.
We were inspired to build StageWise to bridge this gap. We wanted to create a supportive, objective, and real-time coach that acts like a human expert sitting in the front row—someone who can see your body language, hear your tone, and gently guide you to improvement before the stakes are high.
What it does
StageWise is a real-time, multimodal AI presentation coach. It turns your browser into a private rehearsal studio.
- Listens: It captures your speech, analyzing tone, clarity, and pacing.
- Watches: It uses your webcam to analyze non-verbal cues like eye contact, posture, and facial expressions.
- Coaches: Unlike standard voice assistants, StageWise interrupts intelligently. If you are speaking too fast or slouching, it offers a gentle verbal nudge.
- Analyzes: After the session, it generates a comprehensive "Report Card" covering:
- Quantitative Metrics: Words Per Minute (WPM), filler word usage counts (um, uh, like), and weak language detection.
- Qualitative Insights: Specific feedback on strengths, growth areas, and pronunciation.
How we built it
We built StageWise using a modern web stack centered around the Google Gemini Multimodal Live API.
The Core Stack
- Frontend: React (TypeScript) for a type-safe, component-based UI.
- Styling: Tailwind CSS for a sleek, dark-mode "studio" aesthetic.
- AI Engine: Google Gemini (gemini-2.5-flash-native-audio-preview-12-2025) via the @google/genai SDK (see the connection sketch below).
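To give a sense of the wiring, here is a minimal sketch of how a Live session is opened with the @google/genai SDK. The system instruction text is illustrative, and exact config field shapes can vary between SDK versions, so treat this as a sketch rather than our exact production code:

```typescript
import { GoogleGenAI, Modality } from '@google/genai';

// Sketch only: key handling and config are simplified; check your SDK version
// for the exact option shapes.
const ai = new GoogleGenAI({ apiKey: 'YOUR_GEMINI_API_KEY' });

const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: [Modality.AUDIO], // the coach replies with spoken audio
    systemInstruction:
      'You are a presentation coach. Listen and watch; interrupt only when the speaker needs a gentle nudge.',
  },
  callbacks: {
    onopen: () => console.log('Live session opened'),
    onmessage: (msg) => {
      // msg carries audio chunks, transcriptions, turn signals, etc.
      console.log(msg);
    },
    onerror: (err) => console.error(err),
    onclose: () => console.log('Live session closed'),
  },
});
```

From there, microphone PCM and webcam JPEGs are pushed through the session's realtime input channel, and the model streams audio back for playback.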
The Engineering
The biggest technical feat was establishing a low-latency, bidirectional audio/video stream directly from the browser.
- Audio Processing (The Ears): Browsers capture audio as Float32 arrays, but Gemini expects 16 kHz Int16 PCM data. We built a custom audio pipeline using the Web Audio API's ScriptProcessorNode to downsample and convert the raw audio stream in real time (first sketch after this list): $$ S_{pcm} = \max(-1, \min(1, S_{input})) \times 32767 $$
- Visual Context (The Eyes): We didn't just want a voice bot. We capture frames from the user's webcam via an off-screen HTML5 <canvas>, convert them to low-resolution JPEGs, and stream them to the model's realtimeInput channel at ~1 FPS, which lets Gemini "see" a smile or a frown (second sketch below).
- Real-time Analytics: We calculate pace (Words Per Minute) on the fly using a sliding-window algorithm (third sketch below): $$ \text{WPM} = \frac{\sum_{t=\text{now}-15\text{s}}^{\text{now}} \text{words}(t)}{15\,\text{s}} \times 60 $$
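Here is a minimal sketch of the downsample-and-convert step, assuming a default 44.1/48 kHz AudioContext and naive decimation (a production resampler would low-pass filter first; buffer sizes and helper names are illustrative):

```typescript
// Capture mic audio, downsample to 16 kHz, and convert Float32 -> Int16 PCM.
async function startMicCapture(onPcmChunk: (pcm: Int16Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext(); // typically 44.1 or 48 kHz
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);

  const targetRate = 16000;
  const ratio = ctx.sampleRate / targetRate;

  processor.onaudioprocess = (event) => {
    const input = event.inputBuffer.getChannelData(0); // Float32Array in [-1, 1]
    const outLength = Math.floor(input.length / ratio);
    const pcm = new Int16Array(outLength);

    for (let i = 0; i < outLength; i++) {
      // Naive nearest-sample decimation down to 16 kHz.
      const sample = input[Math.floor(i * ratio)];
      // S_pcm = clamp(S_input, -1, 1) * 32767
      pcm[i] = Math.max(-1, Math.min(1, sample)) * 32767;
    }
    onPcmChunk(pcm);
  };

  source.connect(processor);
  processor.connect(ctx.destination); // ScriptProcessorNode must be connected to fire
}
```

Each Int16 chunk is then base64-encoded and sent to the model as 16 kHz PCM over the realtime input channel.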
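The webcam side is simpler. A sketch of the ~1 FPS frame grab, assuming a <video> element already attached to the camera stream (the sendFrame callback, canvas resolution, and JPEG quality are illustrative):

```typescript
// Grab a low-resolution JPEG from the webcam roughly once per second.
function startFrameCapture(
  video: HTMLVideoElement,
  sendFrame: (base64Jpeg: string) => void
) {
  const canvas = document.createElement('canvas'); // off-screen, never added to the DOM
  canvas.width = 320;
  canvas.height = 240;
  const ctx = canvas.getContext('2d')!;

  return setInterval(() => {
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    // toDataURL returns "data:image/jpeg;base64,..."; strip the prefix before sending.
    const dataUrl = canvas.toDataURL('image/jpeg', 0.6);
    sendFrame(dataUrl.split(',')[1]);
  }, 1000); // ~1 FPS keeps bandwidth low while still letting the model "see" posture
}
```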
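And the pace meter itself, as a plain sliding-window helper. Word timestamps come from the incoming transcription events; the 15-second window matches the formula above, and the function names are illustrative:

```typescript
// Sliding-window Words Per Minute over the last 15 seconds of speech.
const WINDOW_MS = 15_000;
const wordTimestamps: number[] = [];

// Call this once per transcribed word (or per transcript chunk with its word count).
function recordWords(count: number, now = Date.now()) {
  for (let i = 0; i < count; i++) wordTimestamps.push(now);
}

function currentWpm(now = Date.now()): number {
  // Drop words that have fallen out of the 15 s window.
  while (wordTimestamps.length && wordTimestamps[0] < now - WINDOW_MS) {
    wordTimestamps.shift();
  }
  // words-in-window / 15 s, scaled up to a full minute.
  return (wordTimestamps.length / (WINDOW_MS / 1000)) * 60;
}
```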
Challenges we ran into
- Audio Encoding/Decoding: The Gemini Live API sends raw PCM chunks without headers. We had to manually decode these binary chunks into AudioBuffers and schedule them precisely on the browser's AudioContext timeline to prevent "clicking" or gaps in the AI's voice (see the playback sketch after this list).
- Race Conditions: Managing state between the React component lifecycle (re-renders) and the persistent WebSocket connection was tricky. We had to use useRef hooks extensively to ensure the audio processor didn't send data to a closed socket (second sketch below).
- Prompt Engineering: Getting the AI to be a "Coach" and not a "Chatbot" required careful system instructions. We had to tune it to interrupt less and listen more, speaking up only when necessary.
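For the playback challenge, the core trick is to keep a running "next start time" cursor on the AudioContext clock and queue every decoded chunk back-to-back. A sketch, assuming the model's audio arrives as Int16 PCM and that OUTPUT_RATE matches the stream's actual sample rate (we treat that rate as a configuration detail here):

```typescript
// Schedule raw PCM chunks back-to-back on the AudioContext timeline so the
// coach's voice plays without clicks or gaps.
const playbackCtx = new AudioContext();
const OUTPUT_RATE = 24000; // assumption: the Live stream's output sample rate
let nextStartTime = 0;

function enqueuePcmChunk(pcm: Int16Array) {
  // Convert Int16 back to Float32 in [-1, 1].
  const floats = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) floats[i] = pcm[i] / 32768;

  const buffer = playbackCtx.createBuffer(1, floats.length, OUTPUT_RATE);
  buffer.copyToChannel(floats, 0);

  const source = playbackCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(playbackCtx.destination);

  // Never schedule in the past; otherwise butt each chunk against the previous one.
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  source.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```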
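The race-condition fix boils down to reading the live session through a ref inside long-lived callbacks instead of capturing stale state in a closure. A simplified sketch; the LiveSession shape and sendAudio helper are illustrative, not the SDK's actual types:

```typescript
import { useEffect, useRef } from 'react';

// Illustrative shape: whatever object your Live SDK returns for an open session.
type LiveSession = { sendAudio: (pcm: Int16Array) => void; close: () => void };

function useLiveAudio(createSession: () => Promise<LiveSession>) {
  const sessionRef = useRef<LiveSession | null>(null);

  useEffect(() => {
    let cancelled = false;

    createSession().then((session) => {
      if (cancelled) { session.close(); return; }
      sessionRef.current = session;
    });

    return () => {
      cancelled = true;
      sessionRef.current?.close();
      sessionRef.current = null; // audio callbacks see null and stop sending
    };
  }, [createSession]);

  // The audio processor calls this on every chunk; it never touches a closed socket.
  return (pcm: Int16Array) => sessionRef.current?.sendAudio(pcm);
}
```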
Accomplishments that we're proud of
- True Multimodality: We aren't just sending text. We are sending raw audio and images simultaneously, creating a genuinely immersive experience.
- The Pace Meter: Building a visual speedometer that reacts instantly to how fast the user is talking was a fun UI challenge that adds immediate value.
- Zero-Server Latency: By connecting directly from the client to Gemini, we eliminated the need for an intermediate backend server, reducing latency to the absolute minimum.
What we learned
- Web Audio API is powerful but raw: Working with raw PCM data taught us a lot about digital signal processing (DSP) in JavaScript.
- Multimodal is the future: Seeing the AI respond to a visual gesture (like a wave) while analyzing speech context proved that multimodal models are a paradigm shift from text-only LLMs.
- Feedback loops matter: Providing visual feedback (the WPM gauge) alongside verbal feedback creates a much more effective learning loop for the user.
What's next for StageWise
- Screen Sharing Analysis: Using getDisplayMedia to let the AI see your PowerPoint slides and critique text density or visual design (see the sketch after this list).
- Custom Personas: Allowing users to choose their coach's style, e.g., "The Drill Sergeant," "The Supportive Friend," or "The TED Talk Curator."
- Longitudinal Tracking: Adding user accounts to track improvement in public speaking metrics over weeks or months.
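Screen sharing would reuse the existing frame pipeline; only the media source changes. A rough sketch of how the capture might start, assuming the shared track is fed into the same canvas-grab loop shown earlier:

```typescript
// Ask the user to share a screen/window and attach it to a <video> element,
// so the same ~1 FPS JPEG capture loop can feed slides to the model.
async function startScreenShare(video: HTMLVideoElement) {
  const stream = await navigator.mediaDevices.getDisplayMedia({
    video: { frameRate: 5 }, // a low frame rate is plenty for static slides
    audio: false,
  });
  video.srcObject = stream;
  await video.play();
  return stream;
}
```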
Built With
- google-gemini-api
- react
- tailwindcss
- typescript
- video-api
- web-audio-api