Inspiration
Many English learners from India, China, and the Spanish-speaking world want to speak clearer, more neutral English—for work, travel, or daily life—but lack affordable, personalized feedback. Generic apps rarely account for which sounds and patterns are hardest for their first language. We wanted an AI coach that listens, transcribes, scores, and gives accent-aware tips (e.g. /v/–/w/ for Hindi speakers, /θ/–/ð/ for Mandarin) while respecting their identity. Vocal AI is that coach: record once, get instant, tailored feedback and a native pronunciation to mimic.
What it does
Vocal AI is an AI pronunciation coach for Hindi, Mandarin, and Spanish speakers. Users can:
- Practice: Pick a sentence, set "My first language," record, and get a score (0–100), transcription, tips (e.g. stress, /v/–/w/, rhythm), corrected text, and a practice sentence targeting their mistakes. They can play "Your recording" and "Play native pronunciation" (ElevenLabs TTS with Web Speech fallback).
- Passage: Read longer passages aloud, then record. The app highlights mispronounced words in the passage and gives the same structured feedback and score.
- My Tips: Persisted, personalized tips from past sessions.
- Profile: Time practiced, day streak, pronunciation scores, and recent recordings (with labels) when signed in.
Sign-in (MySQL) lets progress, sessions, and audio be saved; guests can still practice without persistence.
How we built it
- Frontend: Next.js 14 (App Router), React, TypeScript. In-browser recording with `MediaRecorder` (OGG/Opus when supported, otherwise WebM). We convert WebM → WAV client-side via the Web Audio API (`decodeAudioData` + manual 16-bit mono WAV) because Gemini does not accept WebM.
- Feedback pipeline: Audio (base64 WAV/OGG/MP3) is sent to Gemini (`gemini-2.5-flash`) with `responseMimeType: "application/json"` and a `responseSchema` for `transcription`, `feedback`, `correctedText`, `practiceSentence`, `score`, and `errorWords`. A system instruction and accent profile (Hindi, Mandarin, Spanish) tailor the model to L1-specific patterns (e.g. retroflex /t/ /d/, /θ/ /ð/, syllable timing).
- Native playback: ElevenLabs TTS for the corrected or practice sentence; we fall back to the Web Speech API and a visible `<audio>` control if ElevenLabs fails.
- Backend: Next.js API routes for `/api/feedback`, `/api/speak`, `/api/session`, `/api/stats`, `/api/audio/[id]`, and auth (login, signup, logout, me). MySQL (`languageai`) stores `user`, `practice_sessions`, and `user_audio`; sessions and stats aggregate time, scores, and recent recordings.
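The recording-format choice above (OGG/Opus when supported, otherwise WebM) can be sketched as a small helper. `pickRecordingMime` is a hypothetical name, and the support check is injected so the logic runs outside a browser; in the app it would be `(t) => MediaRecorder.isTypeSupported(t)`:

```typescript
// Choose the best supported recording MIME type, preferring OGG/Opus.
// `isSupported` is injected; in the browser, pass
// (t) => MediaRecorder.isTypeSupported(t).
export function pickRecordingMime(
  isSupported: (mime: string) => boolean
): string {
  const preferred = [
    "audio/ogg;codecs=opus",
    "audio/webm;codecs=opus",
    "audio/webm",
  ];
  for (const mime of preferred) {
    if (isSupported(mime)) return mime;
  }
  return ""; // empty string lets MediaRecorder pick its own default
}
```

The returned string would then be passed as `{ mimeType }` in the `MediaRecorder` options.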
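A sketch of how the structured-output request to Gemini might be assembled, following the public `generateContent` REST shape (`contents`/`parts`/`inlineData`, `generationConfig`); the function name, prompt text, and the exact `required` fields are our assumptions, while the schema property names mirror the ones listed above:

```typescript
// Build a generateContent request body asking Gemini for structured JSON
// feedback on an audio clip. Schema type names follow the REST API's
// Type enum ("OBJECT", "STRING", ...).
export function buildFeedbackRequest(
  base64Audio: string,
  mimeType: string,
  l1: string
) {
  return {
    contents: [
      {
        parts: [
          { inlineData: { mimeType, data: base64Audio } },
          { text: `Score this recording by a native ${l1} speaker.` },
        ],
      },
    ],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "OBJECT",
        properties: {
          transcription: { type: "STRING" },
          feedback: { type: "STRING" },
          correctedText: { type: "STRING" },
          practiceSentence: { type: "STRING" },
          score: { type: "NUMBER" },
          errorWords: { type: "ARRAY", items: { type: "STRING" } },
        },
        required: ["transcription", "feedback", "score"],
      },
    },
  };
}
```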
Challenges we ran into
- Gemini and WebM: Gemini supports WAV, MP3, OGG, etc., but not WebM, while browsers often record in WebM. We tried server-side ffmpeg; it failed (`spawn ffmpeg ENOENT`) in our environment, so we switched to client-side WebM → WAV using the Web Audio API, which needs no server binaries.
- Structured JSON from Gemini: Despite `responseMimeType: "application/json"` and `responseSchema`, we occasionally saw "Gemini returned invalid JSON" (e.g. markdown wrapping, extra text, or unescaped characters in string fields). We added extraction (strip markdown code fences, slice from the first `{` to the last `}`) and instrumentation to debug; further hardening (e.g. schema validation, retries) is in progress.
- Audio formats across browsers: We prefer OGG/Opus; when only WebM is available, we convert to WAV. We also normalize MIME types (e.g. `audio/ogg`, `audio/mp3`, `audio/wav`) before sending to Gemini.
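A minimal sketch of the 16-bit mono WAV step: in the app, the samples would come from `decodeAudioData` on the WebM recording followed by a channel mixdown, but the encoder itself is pure, so it runs anywhere. `encodeWav16Mono` is our name for it:

```typescript
// Encode Float32 PCM samples (range [-1, 1]) as a 16-bit mono WAV file.
export function encodeWav16Mono(
  samples: Float32Array,
  sampleRate: number
): ArrayBuffer {
  const buf = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeStr(36, "data");
  view.setUint32(40, samples.length * 2, true); // data chunk size
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to signed 16-bit little-endian.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buf;
}
```

The resulting buffer can be wrapped in a `Blob` with type `audio/wav` and base64-encoded for the feedback request.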
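The extraction step described above (strip fences, slice from the first `{` to the last `}`) might look like the following; `extractJson` is our name for it, and real hardening would add schema validation on top:

```typescript
// Best-effort recovery of a JSON object from model output that may be
// wrapped in markdown code fences or surrounded by extra prose.
export function extractJson(raw: string): unknown {
  // Strip markdown fences such as ``` or ```json.
  const text = raw.replace(/```(?:json)?/gi, "").trim();
  // Slice from the first "{" to the last "}".
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("Gemini returned invalid JSON");
  }
  return JSON.parse(text.slice(start, end + 1));
}
```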
Accomplishments that we're proud of
- L1-aware feedback: Accent profiles for Hindi, Mandarin, and Spanish so tips target real trouble spots (e.g. /v/–/w/, /θ/–/ð/, rhythm) instead of generic advice.
- End-to-end flow: Record → transcribe → score → tips → corrected text → native playback, with a clean Practice and Passage UX.
- Profile and persistence: Time practiced, streaks, scores, and recent recordings with labels, all backed by MySQL and file storage.
- Client-side WebM→WAV: No ffmpeg or extra server deps; works in Chrome and Firefox with the Web Audio API.
What we learned
- Model vs. format: Always check an API's supported inputs (e.g. Gemini's audio MIME list); browser defaults (WebM) may not match, so a conversion layer is essential.
- LLM JSON reliability:
responseMimeType: "application/json"helps but isn't perfect; defensive parsing, extraction heuristics, and logging are needed when the app depends on structured output. - Fallbacks for TTS: Relying only on ElevenLabs can break in free-tier or network issues; layering in Web Speech and a visible
<audio>control improved robustness.
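The layered playback above can be expressed as a generic fallback chain. The helper is plain TypeScript; the players wired into it (`playElevenLabs`, `playWebSpeech`, `showAudioControl`) are hypothetical names for the app's actual functions:

```typescript
// Try async playback attempts in order; resolve with the label of the
// first one that succeeds, or throw with all collected errors.
export async function firstWorking(
  attempts: Array<{ label: string; run: () => Promise<void> }>
): Promise<string> {
  const errors: string[] = [];
  for (const { label, run } of attempts) {
    try {
      await run();
      return label;
    } catch (e) {
      errors.push(`${label}: ${String(e)}`);
    }
  }
  throw new Error(`all playback options failed: ${errors.join("; ")}`);
}

// Roughly how the app would wire it up (names hypothetical):
// await firstWorking([
//   { label: "elevenlabs", run: () => playElevenLabs(text) },
//   { label: "web-speech", run: () => playWebSpeech(text) },
//   { label: "audio-tag",  run: () => showAudioControl(audioUrl) },
// ]);
```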
What's next for Vocal AI
- Harden JSON handling: Improve extraction and validation for Gemini's response; add retries or a fallback prompt when JSON is malformed.
- More L1s: Add accent profiles for more first languages (e.g. Arabic, Portuguese, French).
- Voice cloning (ElevenLabs): Use a cloned version of the user's voice with a softened accent for "Play native"—so they hear themselves, but clearer.
- Mobile and PWA: A mobile-friendly UI and optional PWA for practice on the go.
- Spaced repetition for tips: Surface past tips and `practiceSentence`s in a review flow to reinforce weak sounds.
Built With
- Languages: TypeScript, JavaScript, SQL
- Frameworks: Next.js 14 (App Router), React 18
- Platform: Node.js
- Database: MySQL
- APIs: Google Gemini (gemini-2.5-flash) for audio transcription and feedback; ElevenLabs for TTS
- Browser APIs: MediaRecorder, Web Audio (client-side WebM→WAV), Web Speech (TTS fallback)
- Libraries: jose (JWT), bcryptjs, mysql2
- Deploy: DigitalOcean