Valsea Snip & Sense — Project Story
Inspiration
As students from Southeast Asia, we've experienced the frustration of watching English lectures and missing key technical terms, or watching lectures in our native language but struggling to extract structured notes. Existing tools only transcribe — they don't understand the linguistic nuances of code-switching between Vietnamese and English, or Singlish and technical jargon. We wanted a tool that not only captures what's being said, but helps us truly learn from it.
What It Does
Valsea Snip & Sense is a Chrome extension that sits quietly in the corner of any lecture page (YouTube, Coursera, Udemy, Google Drive). With one click, it captures the tab audio, transcribes it with accent-aware AI, and then lets you:
- Extract and explain technical terms from the lecture
- Analyze the speaker's tone (positive, negative, neutral)
- Format raw transcripts into structured notes (key quotes, meeting minutes, action items)
- Translate transcripts to English (or vice versa)
- Measure speech emotion — frustration, stress, politeness, hesitation, urgency
- Chat with Gemini AI to ask questions about the lecture content
All without leaving the video — the UI floats directly on the page.
How We Built It
- Plasmo framework for the Chrome extension scaffold (React 18 + TypeScript)
- Tailwind CSS for the glassmorphism dark UI
- Valsea AI API for the full analysis pipeline:
  - /v1/audio/transcriptions: speech-to-text with language hints
  - /v1/clarifications: fix misheard technical terms (e.g., "Zavas" → "JavaScript")
  - /v1/annotations: extract semantic tags and explanations
  - /v1/sentiment: tone analysis with reasoning
  - /v1/formatting: turn transcripts into structured documents
  - /v1/translations: cross-language translation
  - /v1/prosody: emotion detection from raw audio (async job polling)
- Google Gemini 2.0 Flash for contextual Q&A — the transcript is sent as context, so answers are grounded in the lecture material
- getDisplayMedia for tab audio capture, MediaRecorder for WebM encoding
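The capture step in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: the names `pickMimeType` and `recordTabAudio`, the candidate MIME list, and the fixed recording duration are all assumptions made for the example.

```typescript
// Candidate WebM audio formats, most specific first (an assumed preference order).
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm"];

// Returns the first candidate the environment claims to support, if any.
function pickMimeType(isSupported: (mimeType: string) => boolean): string | undefined {
  return CANDIDATE_MIME_TYPES.find(isSupported);
}

// Hypothetical helper: records the shared tab's audio for a fixed duration.
async function recordTabAudio(durationMs: number): Promise<Blob> {
  // The browser shows its own picker; the user has to opt in to sharing tab audio.
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true, audio: true });
  const audioTracks = stream.getAudioTracks();
  if (audioTracks.length === 0) {
    stream.getTracks().forEach((track) => track.stop());
    throw new Error("No audio track was shared");
  }

  // Record only the audio tracks; the video track stays live until we finish.
  const audioOnly = new MediaStream(audioTracks);
  const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(audioOnly, mimeType ? { mimeType } : undefined);

  const chunks: Blob[] = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };
  const stopped = new Promise<void>((resolve) => {
    recorder.onstop = () => resolve();
  });

  recorder.start();
  setTimeout(() => recorder.stop(), durationMs);
  await stopped;

  // Release the capture so the browser's sharing indicator goes away.
  stream.getTracks().forEach((track) => track.stop());
  return new Blob(chunks, { type: recorder.mimeType });
}
```

The resulting Blob is what would then be uploaded to the transcription endpoint; how that upload is shaped is not covered here.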
Challenges We Faced
Keyboard event leakage was a surprising challenge: typing in our extension's input fields would trigger YouTube's keyboard shortcuts (space for play/pause, t for theater mode, i for miniplayer). We solved this with native event listeners in the capture phase, calling stopImmediatePropagation() so that no keystroke reaches the host page's handlers.
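A minimal sketch of that idea is shown below. The helper name `shieldKeyboard` and the choice of which node to attach it to are assumptions; the right node depends on where the host page registers its own shortcut handlers.

```typescript
// The three keyboard event types a host page might listen for.
const KEY_EVENTS = ["keydown", "keypress", "keyup"] as const;

// Hypothetical helper: stops keyboard events at `target` during the capture phase.
// Returns a function that removes the listeners again.
function shieldKeyboard(target: EventTarget): () => void {
  const stop = (event: Event): void => {
    // Halts propagation and skips any listeners registered later on this node.
    event.stopImmediatePropagation();
  };
  for (const type of KEY_EVENTS) {
    target.addEventListener(type, stop, { capture: true });
  }
  return () => {
    for (const type of KEY_EVENTS) {
      target.removeEventListener(type, stop, { capture: true });
    }
  };
}
```

Because stopImmediatePropagation() also skips listeners added later on the same node, the shield should be attached somewhere that does not cut off the extension's own key handlers.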
API referrer restrictions: Google Gemini API keys with HTTP referrer restrictions block requests originating from YouTube pages. We added referrerPolicy: "no-referrer" to fetch calls and documented the API key configuration steps.
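As an illustration, a request built with that option might look like the sketch below. The function names, the prompt wording, and the way the transcript is packed into a single text part are assumptions; the endpoint path and body shape follow Google's public REST format for generateContent, but should be checked against the current API reference.

```typescript
const GEMINI_ENDPOINT =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent";

// Hypothetical helper: builds the URL and fetch options for one grounded question.
function buildGeminiRequest(apiKey: string, transcript: string, question: string) {
  // Assumed prompt layout: transcript first, then the user's question.
  const prompt = `Answer using only this lecture transcript.\n\nTranscript:\n${transcript}\n\nQuestion: ${question}`;
  return {
    url: `${GEMINI_ENDPOINT}?key=${encodeURIComponent(apiKey)}`,
    init: {
      method: "POST" as const,
      headers: { "Content-Type": "application/json" },
      // Ask the browser not to send a Referer header with this request.
      referrerPolicy: "no-referrer" as const,
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  };
}

// Sends the request and returns the parsed JSON response as-is.
async function askGemini(apiKey: string, transcript: string, question: string): Promise<unknown> {
  const { url, init } = buildGeminiRequest(apiKey, transcript, question);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Gemini request failed with status ${response.status}`);
  }
  return response.json();
}
```

Splitting request construction from the network call keeps the referrer setting easy to inspect without hitting the API.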
Singlish tag noise — the Valsea annotation model occasionally returned Singlish slang tags (intensity_emphasis, playful_exaggeration) even when the input was Vietnamese. We added a language-aware tag filter and ensured the language parameter was correctly wired through all three API endpoints.
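A language-aware filter of this kind could be sketched as below. The `Tag` shape, the language labels ("singlish", "vietnamese"), and the mapping table are assumptions for illustration; only the two tag names come from the responses we observed.

```typescript
// Assumed shape of one annotation tag; the real response may carry more fields.
type Tag = { name: string; explanation?: string };

// Hypothetical table: tags that only make sense for certain input languages.
const LANGUAGE_SPECIFIC_TAGS = new Map<string, Set<string>>([
  ["intensity_emphasis", new Set(["singlish"])],
  ["playful_exaggeration", new Set(["singlish"])],
]);

// Keeps a tag unless it is language-specific and the language does not match.
function filterTagsForLanguage(tags: Tag[], language: string): Tag[] {
  const normalized = language.toLowerCase();
  return tags.filter((tag) => {
    const allowedLanguages = LANGUAGE_SPECIFIC_TAGS.get(tag.name);
    return allowedLanguages === undefined || allowedLanguages.has(normalized);
  });
}
```

Tags that are not listed in the table pass through unchanged, so the filter only removes the known noise.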
Async prosody polling — the prosody endpoint is job-based (submit → poll → fetch result). We implemented a polling loop with timeout and loading states so the UI stays responsive during the ~5-10 second analysis window.
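The polling loop can be sketched generically, independent of the prosody API's actual request and response shapes. Everything here is an assumption for illustration: the `JobStatus` type, the `pollUntilDone` name, and the idea that the caller wraps the status request in a `check` function.

```typescript
// Assumed result of one status check: either still running, or done with a result.
type JobStatus<T> = { done: false } | { done: true; result: T };

// Calls `check` repeatedly until the job is done or the time budget runs out.
async function pollUntilDone<T>(
  check: () => Promise<JobStatus<T>>,
  options: { intervalMs: number; timeoutMs: number }
): Promise<T> {
  const deadline = Date.now() + options.timeoutMs;
  while (true) {
    const status = await check();
    if (status.done) return status.result;
    // Give up if waiting one more interval would pass the deadline.
    if (Date.now() + options.intervalMs > deadline) {
      throw new Error(`Polling timed out after ${options.timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
  }
}
```

In the UI, a loading state would be set before calling `pollUntilDone` and cleared in a `finally` block, so a timeout surfaces as an error message rather than a stuck spinner.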
Technologies Used
| Category | Technology |
|---|---|
| Framework | Plasmo (Chrome Extension), React 18, TypeScript |
| Styling | Tailwind CSS |
| AI / APIs | Valsea AI (transcription, annotation, sentiment, formatting, translation, prosody), Google Gemini 2.0 Flash |
| Browser APIs | getDisplayMedia, MediaRecorder, fetch |
| Icons | Lucide React |
| Dev tools | Node.js, npm, Git, Prettier |
Built With
- plasmo-(chrome-extension)
- react-18