Inspiration

Public speaking is consistently ranked as one of the top human fears — yet professional pitch coaching costs hundreds of dollars per session and isn't accessible to most people. We saw an opportunity when Gemini 3.0 introduced powerful multimodal video understanding. What if anyone with a webcam could get expert-level pitch feedback instantly, for free? That's how PitchPerfect was born — an AI coach that actually watches your presentation and gives you honest, actionable feedback.

What it does

PitchPerfect lets you record a pitch video (up to 5 minutes) directly in your browser, then analyzes it using Gemini's multimodal AI. The analysis covers three core dimensions:

  • Delivery — voice tone, pacing, clarity, enthusiasm, and filler words
  • Body Language — eye contact, gestures, posture, facial expressions, and overall confidence
  • Content Structure — opening hook, logical flow, key messaging, storytelling, and call to action

After analysis, you receive:

  • An overall score (0–100) with a visual circular progress chart
  • Category breakdowns with specific, moment-referenced feedback
  • Top 2 strengths — what you're already doing well
  • Top 3 improvements — actionable steps you can apply immediately

How we built it

Frontend — Built with Next.js 14, TypeScript, and TailwindCSS. We used the browser's native MediaRecorder API to capture webcam video in WebM format. A custom React hook (useVideoRecorder) manages the full recording lifecycle including live preview, timer, auto-stop at 5 minutes, and blob management.

Backend — Python FastAPI server that accepts video uploads via multipart form data. The video is temporarily saved, uploaded to Gemini's File API, and then analyzed with a carefully crafted structured prompt. The response is parsed into a validated Pydantic model and returned as clean JSON.
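A simplified sketch of the payload shape (illustrative field names, shown with stdlib dataclasses for brevity; the real backend validates with a Pydantic model and the actual schema may differ):

```python
from dataclasses import dataclass

# Sketch of the analysis payload returned to the frontend.
# Field names here are illustrative, not the production schema.

@dataclass
class CategoryFeedback:
    score: int            # 0-100 for this dimension
    feedback: list[str]   # moment-referenced observations

@dataclass
class PitchAnalysis:
    overall_score: int    # 0-100 composite score
    delivery: CategoryFeedback
    body_language: CategoryFeedback
    content_structure: CategoryFeedback
    strengths: list[str]      # top 2
    improvements: list[str]   # top 3
```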

AI Integration — We use Gemini 3.0 Flash with multimodal input. The entire video file is passed directly to the model — no frame extraction, no transcription preprocessing. Gemini watches the video natively and evaluates both verbal and non-verbal communication in a single inference call.
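The structured prompt is the heart of the single-call design. A condensed sketch of what it looks like (the production prompt is longer and more carefully worded; this wording is illustrative):

```python
# Illustrative coaching prompt. The key technique: spell out the exact
# JSON shape the model must return, field by field, so the response can
# be parsed directly. This wording is a sketch, not the production prompt.
COACHING_PROMPT = """You are an expert pitch coach. Watch the attached video
and evaluate the speaker's delivery, body language, and content structure.

Respond with ONLY a JSON object in exactly this shape:
{
  "overall_score": <int 0-100>,
  "delivery": {"score": <int 0-100>, "feedback": [<strings>]},
  "body_language": {"score": <int 0-100>, "feedback": [<strings>]},
  "content_structure": {"score": <int 0-100>, "feedback": [<strings>]},
  "strengths": [<exactly 2 strings>],
  "improvements": [<exactly 3 strings>]
}
Reference specific moments in the video in every feedback item."""
```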

Architecture flow:

  1. Browser records webcam → WebM blob
  2. Blob uploaded to FastAPI backend
  3. Backend uploads video to Gemini File API
  4. Gemini analyzes video with structured coaching prompt
  5. JSON response parsed and returned to frontend
  6. Frontend renders scores, feedback, and recommendations
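In code, the backend half of this flow reduces to a small pipeline. The stage functions are stubbed out and injected here (names are illustrative placeholders, not the project's real helpers):

```python
# Backend pipeline sketch covering steps 2-5 above. Each stage is passed
# in as a callable so the flow is readable and testable in isolation;
# the parameter names are placeholders, not real project functions.

def analyze_pitch(webm_blob: bytes,
                  upload_to_gemini,   # bytes -> Gemini file handle
                  run_gemini,         # handle -> raw model text
                  clean_json,         # raw text -> parsed dict
                  ) -> dict:
    handle = upload_to_gemini(webm_blob)   # step 3: push to the File API
    raw = run_gemini(handle)               # step 4: multimodal analysis
    return clean_json(raw)                 # step 5: JSON for the frontend
```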

Challenges we ran into

  • Reliable JSON parsing from Gemini — The model occasionally wraps its JSON output in markdown code fences (```json ... ```). We implemented response-cleaning logic to strip these wrappers so that every formatting variation we observed parses consistently.
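A minimal version of that cleaning step (the production logic handles a few more variants):

```python
import json
import re

def clean_model_json(raw: str) -> dict:
    """Strip the optional markdown code fence (``` or ```json) that the
    model sometimes wraps around its JSON output, then parse it.
    A sketch of the cleaning step; the real logic covers more cases."""
    text = raw.strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)  # leading ``` or ```json
    text = re.sub(r"\s*```$", "", text)           # trailing ```
    return json.loads(text)
```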

  • Video upload pipeline — Handling large video files across three stages (browser → FastAPI → Gemini File API) required careful async management. We also had to implement proper cleanup to delete temporary files and Gemini uploads after analysis.
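The local cleanup guarantee boils down to a try/finally around the temp file (a sketch; the real code also deletes the Gemini File API upload in the same cleanup path):

```python
import os
import tempfile

def with_temp_video(data: bytes, process):
    """Persist the uploaded blob to a temp file, hand the path to
    `process`, and guarantee the file is removed afterwards, even if
    analysis raises. Sketch of the cleanup pattern described above."""
    fd, path = tempfile.mkstemp(suffix=".webm")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return process(path)
    finally:
        os.remove(path)
```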

  • Gemini File API processing delays — Uploaded videos enter a "PROCESSING" state before they're ready for analysis. We built a polling mechanism that waits for processing to complete while handling failure states gracefully.
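The polling loop, sketched with the status check and sleep injected so it can be exercised without the real File API (state names match the ACTIVE/FAILED values the File API reports):

```python
import time

def wait_for_processing(get_state, timeout_s=120, interval_s=2.0,
                        sleep=time.sleep):
    """Poll until the uploaded file leaves the PROCESSING state.
    `get_state` stands in for a File API status check; injecting it
    (and `sleep`) keeps the loop testable. Sketch of the mechanism
    described above."""
    waited = 0.0
    while True:
        state = get_state()
        if state == "ACTIVE":
            return
        if state == "FAILED":
            raise RuntimeError("Gemini failed to process the uploaded video")
        if waited >= timeout_s:
            raise TimeoutError("video still PROCESSING after timeout")
        sleep(interval_s)
        waited += interval_s
```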

  • Browser codec compatibility — MediaRecorder codec support (VP9/Opus in WebM) varies across browsers. We had to account for encoding differences to ensure the output was consistently accepted by the Gemini API.

Accomplishments that we're proud of

  • Fully working end-to-end prototype — From clicking "Record" to seeing detailed AI feedback, the entire flow works seamlessly in one session.

  • True multimodal analysis — We leverage Gemini's native video understanding rather than extracting frames or transcribing audio separately. The model analyzes visual body language and audio delivery simultaneously, just like a real coach would.

  • Clean, intuitive UX — Real-time recording indicator with timer, animated progress bar, smooth state transitions, circular score visualization, and color-coded category cards make the results easy to understand at a glance.

  • Structured, actionable output — Rather than generic advice, the AI references specific behaviors observed in the video and provides immediately applicable suggestions.

What we learned

  • Gemini's multimodal capabilities are production-ready — The model's ability to understand both verbal delivery and physical body language from raw video exceeded our expectations. It catches subtle details like posture shifts and pacing changes.

  • Prompt engineering matters enormously — Requesting explicit JSON schemas with specific field descriptions and example formats dramatically improved the consistency and quality of the AI's structured output.

  • The File API enables new use cases — Being able to upload full video files and reference them in prompts opens up possibilities that aren't feasible with frame-based approaches.

What's next for PitchPerfect

  • Real-time streaming feedback using Gemini's Live API — get coaching while you're still presenting
  • Progress tracking — compare multiple pitch attempts side-by-side to visualize improvement over time
  • Industry-specific modes — tailored coaching for investor pitches, sales demos, job interviews, and academic presentations
  • Exportable PDF reports — shareable feedback summaries for mentors, teams, or personal records
  • Multi-language support — analyze pitches in any language Gemini understands

Built With

  • Next.js 14
  • TypeScript
  • TailwindCSS
  • Python
  • FastAPI
  • Gemini 3.0 Flash
