Shark Vision helps presenters catch issues the moment they occur. It watches the slides and listens to the speaker simultaneously, using Google Gemini and Google Speech-to-Text to give timestamped feedback, a shared transcript, and a 10-point oral delivery score.
Inspiration
Friends kept asking for “one more dry run” before big pitches and thesis defenses. The pinch point was always twofold: does the deck work, and am I delivering it well? We wanted a single teammate that could flag typography mistakes and call out filler words without burning a human reviewer’s time.
How we built it
- Real-time capture: the recorder page grabs webcam video and microphone audio every few seconds.
- Slide analysis: each frame goes to Gemini for slide clarity, design, and content feedback; only new issues are surfaced, each with a timestamp.
- Speech coach: audio chunks flow to Google Speech-to-Text, producing an aligned transcript.
- Delivery scoring: the transcript feeds custom analytics for dialect clarity, grammar, filler words, and pace.
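To make the delivery-scoring step concrete, here is a minimal sketch of the filler-word and pace measurements. The filler list, the `DeliveryStats` type, and the `delivery_stats` function are all hypothetical names chosen for illustration; this write-up does not specify how the project computes these numbers or how they roll up into the 10-point score.

```python
import re
from dataclasses import dataclass

# Hypothetical filler list for illustration; the project's real list is not given.
FILLERS = {"um", "uh", "like", "basically", "actually"}


@dataclass
class DeliveryStats:
    words: int
    fillers: int
    words_per_minute: float


def delivery_stats(transcript: str, duration_seconds: float) -> DeliveryStats:
    """Count words and filler words, and estimate speaking pace."""
    # Lowercase and keep only word-like tokens so punctuation is ignored.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    fillers = sum(1 for token in tokens if token in FILLERS)
    minutes = duration_seconds / 60.0
    # Guard against a zero-length chunk to avoid dividing by zero.
    wpm = len(tokens) / minutes if minutes > 0 else 0.0
    return DeliveryStats(words=len(tokens), fillers=fillers, words_per_minute=wpm)


stats = delivery_stats("Um, so this is, like, our demo", duration_seconds=30.0)
print(stats)
```

A simple set lookup like this flags every "like", including legitimate uses, so a real coach would probably need context-aware filtering before deducting points.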
Challenges we ran into
- Latency: keeping audio transcription and frame analysis in step without overwhelming API quotas.
- False positives: Gemini loves to speculate, so we built fingerprinting to suppress duplicate slide issues.
- Credential management: securely wiring Google’s SDK so local development is easy but keys never leak.
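The fingerprinting idea can be sketched as follows. This is one plausible approach, not the project's actual scheme: it assumes each issue arrives as a short text string, normalizes it, and hashes the result so that reworded-only-by-punctuation repeats are dropped. `fingerprint` and `IssueDeduplicator` are hypothetical names.

```python
import hashlib
import re


def fingerprint(issue_text: str) -> str:
    """Hash a normalized form of the issue so trivial variations collide."""
    # Lowercase, strip punctuation, and collapse whitespace before hashing.
    normalized = re.sub(r"[^a-z0-9 ]+", "", issue_text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class IssueDeduplicator:
    """Remembers fingerprints already surfaced during a session."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def new_issues(self, issues: list[str]) -> list[str]:
        """Return only the issues whose fingerprint has not been seen yet."""
        fresh = []
        for issue in issues:
            key = fingerprint(issue)
            if key not in self._seen:
                self._seen.add(key)
                fresh.append(issue)
        return fresh


dedup = IssueDeduplicator()
print(dedup.new_issues(["Title font is too small.", "title font is too small"]))
print(dedup.new_issues(["Title font is too small!", "Low contrast text"]))
```

Exact-match hashing only catches near-identical wording. If the model rephrases the same issue differently on each frame, a fuzzier key (for example, slide identifier plus issue category) would likely be needed.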
What we learned
Pairing multimodal AI (vision + speech) yields richer feedback than running them separately. Users trust feedback more when every deduction links to a timestamped recommendation. A shared transcript becomes a bridge between slide critiques and delivery coaching: it is one source of truth that both systems can reason about. Shark Vision became the practice partner we wanted: relentless, specific, and available at midnight before demo day.
What's next for Shark Vision
- Multi-speaker awareness: detect hand-offs between presenters, attribute issues to the right speaker, and show collaboration metrics.
- Script-assist for practice: preview the transcript as structured speaker notes, highlight sections that drift from the intended message, and suggest rewrites before going live.
- Team dashboards: let teams manage shared presentations, assign reviewers, and sync feedback into project management tools (Notion, Jira, Slack).
- Inferred CTA alignment: map the talk’s content and pacing to the desired call-to-action, and flag black holes where the audience might disengage.
Built With
- fastapi
- gemini
- javascript
- mediapipe
- next.js
- numpy
- opencv
- python
- rag
- react
- tensorflow
- typescript
- uvicorn
- websockets

