Presently.ai: Our Story

What Inspired Us

Students struggle with presentation skills, not because they lack ideas, but because they rarely get concrete, actionable feedback. Practice without feedback doesn’t improve performance; it reinforces habits like filler words and poor slide-speech alignment. We saw an opportunity to turn AI into a coach: real-time transcription plus slide context could give feedback on both how you speak and what you say.

We were motivated by a simple idea: practice with feedback is practice that works. Whether preparing for a class presentation or a job talk, learners need to know:

How often they use filler words such as “um” and “like,” or fall into long pauses.
Whether their spoken content matches the key points on their slides.

What We Learned

We deepened our understanding across several areas:

AI and prompt design. Getting consistent, structured output from Gemini required careful prompt engineering. We learned to specify exact JSON schemas, give clear evaluation criteria, and separate “filler words” vs. “content alignment” so the model returns usable feedback instead of free-form text.
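
As a rough illustration of that approach, the sketch below builds a prompt that keeps the two criteria separate, pins the output to a JSON shape, and then validates the reply before the UI uses it. The field names (fillerWords, contentAlignment, missedPoints, summary) and the prompt wording are assumptions made for this example, not the exact schema or prompt we shipped, and the actual call to Gemini is left out.

```typescript
// Hypothetical feedback shape; the field names are illustrative only.
interface Feedback {
  fillerWords: { word: string; count: number }[];
  contentAlignment: {
    score: number;          // 0-100, higher means closer to the slides
    missedPoints: string[]; // slide points the speaker never covered
  };
  summary: string;
}

// Build a prompt that states each criterion separately and fixes the output format.
function buildPrompt(transcript: string, slideText: string): string {
  return [
    "You are a presentation coach. Evaluate the transcript against the slides.",
    "1. Filler words: count each filler word or phrase (um, uh, like, you know).",
    "2. Content alignment: list slide key points the speaker did not cover.",
    "Respond with JSON only, matching exactly this shape:",
    '{"fillerWords":[{"word":string,"count":number}],',
    ' "contentAlignment":{"score":number,"missedPoints":[string]},',
    ' "summary":string}',
    "SLIDES:",
    slideText,
    "TRANSCRIPT:",
    transcript,
  ].join("\n");
}

// Parse the reply defensively: strip an optional Markdown code fence, then check types.
function parseFeedback(raw: string): Feedback {
  const cleaned = raw
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");
  const data = JSON.parse(cleaned);
  const ok =
    data !== null &&
    typeof data === "object" &&
    Array.isArray(data.fillerWords) &&
    data.fillerWords.every(
      (f: any) =>
        f !== null &&
        typeof f === "object" &&
        typeof f.word === "string" &&
        typeof f.count === "number"
    ) &&
    data.contentAlignment !== null &&
    typeof data.contentAlignment === "object" &&
    typeof data.contentAlignment.score === "number" &&
    Array.isArray(data.contentAlignment.missedPoints) &&
    typeof data.summary === "string";
  if (!ok) {
    throw new Error("Model response does not match the expected feedback shape");
  }
  return data as Feedback;
}
```

Validating on the client side means a malformed reply surfaces as one clear error instead of a half-rendered feedback screen.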

Web APIs and browser constraints. The Web Speech API is free and runs in the browser, but it has limits, notably uneven browser support. We explored the trade-offs between the Web Speech API and ElevenLabs for transcription, and chose Web Speech for real-time transcription to avoid extra API calls and cost.
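
For readers unfamiliar with how real-time results arrive, here is a minimal sketch of separating committed text from the still-changing interim tail. The small interfaces mirror only the parts of the Web Speech API result objects that the sketch reads (isFinal and the first alternative's transcript); splitTranscript is a hypothetical helper name, not a function from our codebase.

```typescript
// Minimal structural types for the parts of a recognition result used here.
interface Alternative {
  transcript: string;
}
interface RecognitionResult {
  isFinal: boolean;
  0: Alternative; // the top-ranked alternative
}
interface RecognitionResultList {
  length: number;
  [index: number]: RecognitionResult;
}

interface SplitTranscript {
  finalText: string;   // results the recognizer will no longer revise
  interimText: string; // results that may still change
}

// Walk the result list and split it into final and interim text.
function splitTranscript(results: RecognitionResultList): SplitTranscript {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    if (result.isFinal) {
      finalText += result[0].transcript;
    } else {
      interimText += result[0].transcript;
    }
  }
  return { finalText, interimText };
}

// Browser wiring, shown as comments because it only runs in a browser
// (assumes continuous mode with interim results enabled):
//   const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
//   recognition.continuous = true;
//   recognition.interimResults = true;
//   recognition.onresult = (event) => render(splitTranscript(event.results));
```

Only finalText would be sent on for analysis; interimText is useful for showing live captions while the speaker is still talking.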

Audio-only recording. Switching from video to audio-only reduced complexity (no camera, smaller payloads) and focused feedback on voice and content. We found that voice + slide text is enough to give meaningful coaching.

How We Built It

Our stack:

Layer	Technology	Purpose
Frontend	Next.js 16, React 19, Tailwind	Single-page app and UI
Recording	MediaRecorder API	Audio capture
Transcription	Web Speech API	Real-time speech-to-text
Slide parsing	JSZip	Extract text from PPTX
Analysis	Gemini 2.0 Flash	Filler words + content alignment
Voice feedback	ElevenLabs	TTS for feedback summary
Storage	MongoDB Atlas	Optional history (in-memory fallback)

Flow:

User uploads PPTX slides → text is extracted via JSZip.
User records voice → MediaRecorder captures audio, Web Speech API transcribes.
On stop → transcript and slide content are sent to Gemini.
Gemini returns structured feedback on filler words and content alignment.
Optional: ElevenLabs voice reads the summary aloud.

We used a stepper-style flow: Upload → Record → Analyze → Feedback, with a progress chart to show improvement over time.

Put together, Presently.ai is a Next.js + React frontend where speakers upload their slide deck and record their talk, with session history optionally stored in MongoDB Atlas. The app sends the extracted slide text and the transcript to Gemini (with AssemblyAI for enhanced speech analysis) to generate structured coaching, and ElevenLabs then reads the feedback back to the user as a clear voice summary.

Challenges We Faced

1. Gemini API rate limits (429). During development, we hit the free-tier quota. We mitigated this with retry logic, clearer error messages, and by exploring alternative models (e.g., gemini-1.5-flash).
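
The retry logic can be sketched roughly as follows: retry only on a 429, wait longer after each failure, and give up after a fixed number of attempts. The status field on the error, the helper name withRetry, and the default delays are assumptions for this example; how a 429 surfaces depends on the client library in use.

```typescript
// Hypothetical error shape: we assume the failed call exposes an HTTP status code.
interface HttpError extends Error {
  status?: number;
}

interface RetryOptions {
  maxAttempts?: number; // total tries, including the first one
  baseDelayMs?: number; // wait before the first retry; doubles after each failure
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not have to wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `call`, retrying with exponential backoff when it fails with status 429.
async function withRetry<T>(
  call: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const { maxAttempts = 4, baseDelayMs = 1000, sleep = defaultSleep } = options;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as HttpError | undefined)?.status;
      // Only rate-limit errors are retried; anything else fails immediately.
      if (status !== 429 || attempt >= maxAttempts) {
        throw err;
      }
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Once the attempts are exhausted the original error is rethrown, which is where a clearer "quota exceeded, try again shortly" message can be shown to the user.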

2. Recording start timing. Calling startRecording() immediately after startMicrophone() sometimes ran before the stream was available. We fixed this by moving the recording start into a useEffect that runs only when stream is present, avoiding race conditions.

3. PPTX parsing. node-pptx-parser pulled in the AWS SDK and caused build issues. We switched to JSZip plus a regex over the slide XML to extract text, which keeps dependencies light and the build simple.
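
The regex step can be sketched like this. It assumes the slide XML has already been read out of the archive as a string (with JSZip, the slide files sit under ppt/slides/ inside the PPTX) and relies on the PresentationML convention that paragraphs are a:p elements whose visible text lives in a:t runs. The function names are hypothetical, and a regex is a pragmatic shortcut here rather than a full XML parser.

```typescript
// Decode the five predefined XML entities that can appear inside text runs.
function decodeXmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // done last so "&amp;lt;" becomes "&lt;", not "<"
}

// Return one string per non-empty paragraph found in a slide's XML.
function extractSlideText(slideXml: string): string[] {
  const lines: string[] = [];
  // Matches <a:p> or <a:p ...> but not <a:pPr>, since "p" must be followed
  // by whitespace or ">".
  const paragraphPattern = /<a:p(?:\s[^>]*)?>([\s\S]*?)<\/a:p>/g;
  let paragraph: RegExpExecArray | null;
  while ((paragraph = paragraphPattern.exec(slideXml)) !== null) {
    // Same idea for runs: <a:t> matches, <a:tbl> or <a:tab/> do not.
    const runPattern = /<a:t(?:\s[^>]*)?>([\s\S]*?)<\/a:t>/g;
    let text = "";
    let run: RegExpExecArray | null;
    while ((run = runPattern.exec(paragraph[1])) !== null) {
      text += run[1];
    }
    const line = decodeXmlEntities(text).trim();
    if (line.length > 0) {
      lines.push(line);
    }
  }
  return lines;
}
```

Joining runs within a paragraph matters because PowerPoint often splits one visible sentence across several runs when formatting changes mid-line.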

4. Loading state confusion. The Record button showed “loading” even when nothing was loading. The cause was reading isReady from the mic hook before the mic had been requested. We now show the loading state only while an actual async operation is in progress.

Reflection

We started with a broad vision (video + body language + voice) and narrowed it to voice + slides. That scope reduction let us ship something useful and focused: filler-word feedback and content alignment are the two things we believe matter most for student presenters.

In the future, we’d add: practice mode with real-time “um” alerts, timestamps for filler words, and tighter MongoDB integration for persistent history.