Inspiration
Many English learners from India, China, and the Spanish-speaking world want to speak clearer, more neutral English—for work, travel, or daily life—but lack affordable, personalized feedback. Generic apps rarely account for which sounds and patterns are hardest for their first language. We wanted an AI coach that listens, transcribes, scores, and gives accent-aware tips (e.g. /v/–/w/ for Hindi speakers, /θ/–/ð/ for Mandarin) while respecting their identity. Vocal AI is that coach: record once, get instant, tailored feedback and a native pronunciation to mimic.
What it does
Vocal AI is an AI pronunciation coach for Hindi, Mandarin, and Spanish speakers. Users can:
- Practice: Pick a sentence, set "My first language," record, and get a score (0–100), transcription, tips (e.g. stress, /v/–/w/, rhythm), corrected text, and a practice sentence targeting their mistakes. They can play "Your recording" and "Play native pronunciation" (ElevenLabs TTS with Web Speech fallback).
- Passage: Read longer passages aloud, then record. The app highlights mispronounced words in the passage and gives the same structured feedback and score.
- My Tips: Persisted, personalized tips from past sessions.
- Profile: Time practiced, day streak, pronunciation scores, and recent recordings (with labels) when signed in.
Sign-in (MySQL) lets progress, sessions, and audio be saved; guests can still practice without persistence.
How we built it
- Frontend: Next.js 14 (App Router), React, TypeScript. In-browser recording with `MediaRecorder` (OGG/Opus when supported, otherwise WebM). We convert WebM → WAV client-side via the Web Audio API (`decodeAudioData` + manual 16-bit mono WAV) because Gemini does not accept WebM.
- Feedback pipeline: Audio (base64 WAV/OGG/MP3) is sent to Gemini (`gemini-2.5-flash`) with `responseMimeType: "application/json"` and a `responseSchema` for `transcription`, `feedback`, `correctedText`, `practiceSentence`, `score`, and `errorWords`. A system instruction and accent profile (Hindi, Mandarin, Spanish) tailor the model to L1-specific patterns (e.g. retroflex /t/ /d/, /θ/ /ð/, syllable timing).
- Native playback: ElevenLabs TTS for the corrected or practice sentence; we fall back to the Web Speech API and a visible `<audio>` control if ElevenLabs fails.
- Backend: Next.js API routes for `/api/feedback`, `/api/speak`, `/api/session`, `/api/stats`, `/api/audio/[id]`, and auth (login, signup, logout, me). MySQL (`languageai`) stores `user`, `practice_sessions`, and `user_audio`; sessions and stats aggregate time, scores, and recent recordings.
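The recording-format choice above (OGG/Opus when supported, otherwise WebM) can be sketched as a small helper. `pickRecordingMime` is a hypothetical name, and the support check is injected so the logic runs outside a browser; in the app it would be `(t) => MediaRecorder.isTypeSupported(t)`:

```typescript
// Choose the best supported recording MIME type, preferring OGG/Opus.
// `isSupported` is injected; in the browser, pass
// (t) => MediaRecorder.isTypeSupported(t).
export function pickRecordingMime(
  isSupported: (mime: string) => boolean
): string {
  const preferred = [
    "audio/ogg;codecs=opus",
    "audio/webm;codecs=opus",
    "audio/webm",
  ];
  for (const mime of preferred) {
    if (isSupported(mime)) return mime;
  }
  return ""; // empty string lets MediaRecorder pick its own default
}
```

The returned string would then be passed as `{ mimeType }` in the `MediaRecorder` options.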
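A sketch of how the structured-output request to Gemini might be assembled, following the public `generateContent` REST shape (`contents`/`parts`/`inlineData`, `generationConfig`); the function name, prompt text, and the exact `required` fields are our assumptions, while the schema property names mirror the ones listed above:

```typescript
// Build a generateContent request body asking Gemini for structured JSON
// feedback on an audio clip. Schema type names follow the REST API's
// Type enum ("OBJECT", "STRING", ...).
export function buildFeedbackRequest(
  base64Audio: string,
  mimeType: string,
  l1: string
) {
  return {
    contents: [
      {
        parts: [
          { inlineData: { mimeType, data: base64Audio } },
          { text: `Score this recording by a native ${l1} speaker.` },
        ],
      },
    ],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "OBJECT",
        properties: {
          transcription: { type: "STRING" },
          feedback: { type: "STRING" },
          correctedText: { type: "STRING" },
          practiceSentence: { type: "STRING" },
          score: { type: "NUMBER" },
          errorWords: { type: "ARRAY", items: { type: "STRING" } },
        },
        required: ["transcription", "feedback", "score"],
      },
    },
  };
}
```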
Challenges we ran into
- Gemini and WebM: Gemini supports WAV, MP3, OGG, etc., but not WebM, while browsers often record in WebM. We tried server-side ffmpeg; it failed (`spawn ffmpeg ENOENT`) in our environment, so we switched to client-side WebM → WAV using the Web Audio API, which needs no server binaries.
- Structured JSON from Gemini: Despite `responseMimeType: "application/json"` and `responseSchema`, we occasionally saw "Gemini returned invalid JSON" (e.g. markdown wrapping, extra text, or unescaped characters in string fields). We added extraction (strip markdown code fences, slice from the first `{` to the last `}`) and instrumentation to debug; further hardening (e.g. schema validation, retries) is in progress.
- Audio formats across browsers: We prefer OGG/Opus; when only WebM is available, we convert to WAV. We also normalize MIME types (e.g. `audio/ogg`, `audio/mp3`, `audio/wav`) before sending to Gemini.
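A minimal sketch of the 16-bit mono WAV step: in the app, the samples would come from `decodeAudioData` on the WebM recording followed by a channel mixdown, but the encoder itself is pure, so it runs anywhere. `encodeWav16Mono` is our name for it:

```typescript
// Encode Float32 PCM samples (range [-1, 1]) as a 16-bit mono WAV file.
export function encodeWav16Mono(
  samples: Float32Array,
  sampleRate: number
): ArrayBuffer {
  const buf = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeStr(36, "data");
  view.setUint32(40, samples.length * 2, true); // data chunk size
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to signed 16-bit little-endian.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buf;
}
```

The resulting buffer can be wrapped in a `Blob` with type `audio/wav` and base64-encoded for the feedback request.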
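The extraction step described above (strip fences, slice from the first `{` to the last `}`) might look like the following; `extractJson` is our name for it, and real hardening would add schema validation on top:

```typescript
// Best-effort recovery of a JSON object from model output that may be
// wrapped in markdown code fences or surrounded by extra prose.
export function extractJson(raw: string): unknown {
  // Strip markdown fences such as ``` or ```json.
  const text = raw.replace(/```(?:json)?/gi, "").trim();
  // Slice from the first "{" to the last "}".
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("Gemini returned invalid JSON");
  }
  return JSON.parse(text.slice(start, end + 1));
}
```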
Accomplishments that we're proud of
- L1-aware feedback: Accent profiles for Hindi, Mandarin, and Spanish so tips target real trouble spots (e.g. /v/–/w/, /θ/–/ð/, rhythm) instead of generic advice.
- End-to-end flow: Record → transcribe → score → tips → corrected text → native playback, with a clean Practice and Passage UX.
- Profile and persistence: Time practiced, streaks, scores, and recent recordings with labels, all backed by MySQL and file storage.
- Client-side WebM→WAV: No ffmpeg or extra server deps; works in Chrome and Firefox with the Web Audio API.
What we learned
- Model vs. format: Always check an API's supported inputs (e.g. Gemini's audio MIME list); browser defaults (WebM) may not match, so a conversion layer is essential.
- LLM JSON reliability:
responseMimeType: "application/json"helps but isn't perfect; defensive parsing, extraction heuristics, and logging are needed when the app depends on structured output. - Fallbacks for TTS: Relying only on ElevenLabs can break in free-tier or network issues; layering in Web Speech and a visible
<audio>control improved robustness.
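The layered playback above can be expressed as a generic fallback chain. The helper is plain TypeScript; the players wired into it (`playElevenLabs`, `playWebSpeech`, `showAudioControl`) are hypothetical names for the app's actual functions:

```typescript
// Try async playback attempts in order; resolve with the label of the
// first one that succeeds, or throw with all collected errors.
export async function firstWorking(
  attempts: Array<{ label: string; run: () => Promise<void> }>
): Promise<string> {
  const errors: string[] = [];
  for (const { label, run } of attempts) {
    try {
      await run();
      return label;
    } catch (e) {
      errors.push(`${label}: ${String(e)}`);
    }
  }
  throw new Error(`all playback options failed: ${errors.join("; ")}`);
}

// Roughly how the app would wire it up (names hypothetical):
// await firstWorking([
//   { label: "elevenlabs", run: () => playElevenLabs(text) },
//   { label: "web-speech", run: () => playWebSpeech(text) },
//   { label: "audio-tag",  run: () => showAudioControl(audioUrl) },
// ]);
```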
What's next for Vocal AI
- Harden JSON handling: Improve extraction and validation for Gemini's response; add retries or a fallback prompt when JSON is malformed.
- More L1s: Add accent profiles for more first languages (e.g. Arabic, Portuguese, French).
- Voice cloning (ElevenLabs): Use a cloned version of the user's voice with a softened accent for "Play native"—so they hear themselves, but clearer.
- Mobile and PWA: A mobile-friendly UI and optional PWA for practice on the go.
- Spaced repetition for tips: Surface past tips and `practiceSentence`s in a review flow to reinforce weak sounds.
Built With
- Languages: TypeScript, JavaScript, SQL
- Frameworks: Next.js 14 (App Router), React 18
- Platform: Node.js
- Database: MySQL
- APIs: Google Gemini (gemini-2.5-flash) for audio transcription and feedback; ElevenLabs for TTS
- Browser APIs: MediaRecorder, Web Audio (client-side WebM→WAV), Web Speech (TTS fallback)
- Libraries: jose (JWT), bcryptjs, mysql2
- Deploy: DigitalOcean