Inspiration

Public speaking is consistently ranked as one of the top human fears — yet professional pitch coaching costs hundreds of dollars per session and isn't accessible to most people. We saw an opportunity when Gemini 3.0 introduced powerful multimodal video understanding. What if anyone with a webcam could get expert-level pitch feedback instantly, for free? That's how PitchPerfect was born — an AI coach that actually watches your presentation and gives you honest, actionable feedback.

What it does

PitchPerfect lets you record a pitch video (up to 5 minutes) directly in your browser, then analyzes it using Gemini's multimodal AI. The analysis covers three core dimensions:

  • Delivery — voice tone, pacing, clarity, enthusiasm, and filler words
  • Body Language — eye contact, gestures, posture, facial expressions, and overall confidence
  • Content Structure — opening hook, logical flow, key messaging, storytelling, and call to action

After analysis, you receive:

  • An overall score (0–100) with a visual circular progress chart
  • Category breakdowns with specific, moment-referenced feedback
  • Top 2 strengths — what you're already doing well
  • Top 3 improvements — actionable steps you can apply immediately

How we built it

Frontend — Built with Next.js 14, TypeScript, and TailwindCSS. We used the browser's native MediaRecorder API to capture webcam video in WebM format. A custom React hook (useVideoRecorder) manages the full recording lifecycle including live preview, timer, auto-stop at 5 minutes, and blob management.

Backend — Python FastAPI server that accepts video uploads via multipart form data. The video is temporarily saved, uploaded to Gemini's File API, and then analyzed with a carefully crafted structured prompt. The response is parsed into a validated Pydantic model and returned as clean JSON.
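A simplified sketch of the payload shape (illustrative field names, shown with stdlib dataclasses for brevity; the real backend validates with a Pydantic model and the actual schema may differ):

```python
from dataclasses import dataclass

# Sketch of the analysis payload returned to the frontend.
# Field names here are illustrative, not the production schema.

@dataclass
class CategoryFeedback:
    score: int            # 0-100 for this dimension
    feedback: list[str]   # moment-referenced observations

@dataclass
class PitchAnalysis:
    overall_score: int    # 0-100 composite score
    delivery: CategoryFeedback
    body_language: CategoryFeedback
    content_structure: CategoryFeedback
    strengths: list[str]      # top 2
    improvements: list[str]   # top 3
```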

AI Integration — We use Gemini 3.0 Flash with multimodal input. The entire video file is passed directly to the model — no frame extraction, no transcription preprocessing. Gemini watches the video natively and evaluates both verbal and non-verbal communication in a single inference call.
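The structured prompt is the heart of the single-call design. A condensed sketch of what it looks like (the production prompt is longer and more carefully worded; this wording is illustrative):

```python
# Illustrative coaching prompt. The key technique: spell out the exact
# JSON shape the model must return, field by field, so the response can
# be parsed directly. This wording is a sketch, not the production prompt.
COACHING_PROMPT = """You are an expert pitch coach. Watch the attached video
and evaluate the speaker's delivery, body language, and content structure.

Respond with ONLY a JSON object in exactly this shape:
{
  "overall_score": <int 0-100>,
  "delivery": {"score": <int 0-100>, "feedback": [<strings>]},
  "body_language": {"score": <int 0-100>, "feedback": [<strings>]},
  "content_structure": {"score": <int 0-100>, "feedback": [<strings>]},
  "strengths": [<exactly 2 strings>],
  "improvements": [<exactly 3 strings>]
}
Reference specific moments in the video in every feedback item."""
```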

Architecture flow:

  1. Browser records webcam → WebM blob
  2. Blob uploaded to FastAPI backend
  3. Backend uploads video to Gemini File API
  4. Gemini analyzes video with structured coaching prompt
  5. JSON response parsed and returned to frontend
  6. Frontend renders scores, feedback, and recommendations
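In code, the backend half of this flow reduces to a small pipeline. The stage functions are stubbed out and injected here (names are illustrative placeholders, not the project's real helpers):

```python
# Backend pipeline sketch covering steps 2-5 above. Each stage is passed
# in as a callable so the flow is readable and testable in isolation;
# the parameter names are placeholders, not real project functions.

def analyze_pitch(webm_blob: bytes,
                  upload_to_gemini,   # bytes -> Gemini file handle
                  run_gemini,         # handle -> raw model text
                  clean_json,         # raw text -> parsed dict
                  ) -> dict:
    handle = upload_to_gemini(webm_blob)   # step 3: push to the File API
    raw = run_gemini(handle)               # step 4: multimodal analysis
    return clean_json(raw)                 # step 5: JSON for the frontend
```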

Challenges we ran into

  • Reliable JSON parsing from Gemini — The model occasionally wraps its JSON output in markdown code fences (```json ... ```). We implemented response-cleaning logic to strip these wrappers so that every formatting variation we observed parses consistently.
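A minimal version of that cleaning step (the production logic handles a few more variants):

```python
import json
import re

def clean_model_json(raw: str) -> dict:
    """Strip the optional markdown code fence (``` or ```json) that the
    model sometimes wraps around its JSON output, then parse it.
    A sketch of the cleaning step; the real logic covers more cases."""
    text = raw.strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)  # leading ``` or ```json
    text = re.sub(r"\s*```$", "", text)           # trailing ```
    return json.loads(text)
```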

  • Video upload pipeline — Handling large video files across three stages (browser → FastAPI → Gemini File API) required careful async management. We also had to implement proper cleanup to delete temporary files and Gemini uploads after analysis.
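The local cleanup guarantee boils down to a try/finally around the temp file (a sketch; the real code also deletes the Gemini File API upload in the same cleanup path):

```python
import os
import tempfile

def with_temp_video(data: bytes, process):
    """Persist the uploaded blob to a temp file, hand the path to
    `process`, and guarantee the file is removed afterwards, even if
    analysis raises. Sketch of the cleanup pattern described above."""
    fd, path = tempfile.mkstemp(suffix=".webm")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return process(path)
    finally:
        os.remove(path)
```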

  • Gemini File API processing delays — Uploaded videos enter a "PROCESSING" state before they're ready for analysis. We built a polling mechanism that waits for processing to complete while handling failure states gracefully.
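The polling loop, sketched with the status check and sleep injected so it can be exercised without the real File API (state names match the ACTIVE/FAILED values the File API reports):

```python
import time

def wait_for_processing(get_state, timeout_s=120, interval_s=2.0,
                        sleep=time.sleep):
    """Poll until the uploaded file leaves the PROCESSING state.
    `get_state` stands in for a File API status check; injecting it
    (and `sleep`) keeps the loop testable. Sketch of the mechanism
    described above."""
    waited = 0.0
    while True:
        state = get_state()
        if state == "ACTIVE":
            return
        if state == "FAILED":
            raise RuntimeError("Gemini failed to process the uploaded video")
        if waited >= timeout_s:
            raise TimeoutError("video still PROCESSING after timeout")
        sleep(interval_s)
        waited += interval_s
```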

  • Browser codec compatibility — MediaRecorder codec support (VP9/Opus in WebM) varies across browsers. We had to account for encoding differences to ensure the output was consistently accepted by the Gemini API.

Accomplishments that we're proud of

  • Fully working end-to-end prototype — From clicking "Record" to seeing detailed AI feedback, the entire flow works seamlessly in one session.

  • True multimodal analysis — We leverage Gemini's native video understanding rather than extracting frames or transcribing audio separately. The model analyzes visual body language and audio delivery simultaneously, just like a real coach would.

  • Clean, intuitive UX — Real-time recording indicator with timer, animated progress bar, smooth state transitions, circular score visualization, and color-coded category cards make the results easy to understand at a glance.

  • Structured, actionable output — Rather than generic advice, the AI references specific behaviors observed in the video and provides immediately applicable suggestions.

What we learned

  • Gemini's multimodal capabilities are production-ready — The model's ability to understand both verbal delivery and physical body language from raw video exceeded our expectations. It catches subtle details like posture shifts and pacing changes.

  • Prompt engineering matters enormously — Requesting explicit JSON schemas with specific field descriptions and example formats dramatically improved the consistency and quality of the AI's structured output.

  • The File API enables new use cases — Being able to upload full video files and reference them in prompts opens up possibilities that aren't feasible with frame-based approaches.

What's next for PitchPerfect

  • Real-time streaming feedback using Gemini's Live API — get coaching while you're still presenting
  • Progress tracking — compare multiple pitch attempts side-by-side to visualize improvement over time
  • Industry-specific modes — tailored coaching for investor pitches, sales demos, job interviews, and academic presentations
  • Exportable PDF reports — shareable feedback summaries for mentors, teams, or personal records
  • Multi-language support — analyze pitches in any language Gemini understands

Built With

  • Next.js 14
  • TypeScript
  • TailwindCSS
  • Python
  • FastAPI
  • Gemini 3.0 Flash
