About the Project
ReVoice is an AI-powered real-time speech assistant designed to help people who stutter communicate fluently — without losing their natural voice. As the user speaks, ReVoice listens, detects stutters or speech blocks, and instantly regenerates the same words in the speaker’s own cloned voice, preserving tone, rhythm, and emotion. The regenerated audio is then seamlessly lip-synced to the person’s face, so it looks and sounds as if they spoke it themselves.
Inspiration
We were inspired by how people with speech disorders are often interrupted or misunderstood — not because of what they say, but how it comes out. Most speech tools try to “fix” speech by replacing it with robotic text-to-speech voices. We wanted something more human — a system that doesn’t erase someone’s identity, but amplifies it. Because when you hear someone’s real voice — without the struggle — it can completely change how they’re perceived, and how they feel about speaking.
What it does
Here’s how ReVoice works — step by step:
You speak or upload an audio or video file.
ReVoice listens and analyzes your speech in real time (or from the uploaded clip) using Whisper — understanding your words, timing, and rhythm.
It detects moments of stuttering, blocks, or repeated syllables — and identifies the type of stutter you’re experiencing.
If it’s a video, ReVoice lip-syncs your fluent voice seamlessly to your face, so it looks as if you said it yourself.
You choose how to use it:
- Fluency Mode: ReVoice instantly regenerates your full sentence in your own cloned voice, keeping your natural tone and emotion.
- Practice Mode: The built-in AI assistant joins you, pronouncing difficult words or phrases with you in real time using your ElevenLabs voice — helping you practice and build confidence.

You can listen to or watch your fluent, expressive version instantly — or use the AI assistant anytime for guided speech practice.
How we built it
ReVoice brings together real-time speech AI, voice cloning, and lip-sync technology into one seamless system:
Frontend (React): The user interface was built with React, allowing people to record or upload audio/video files. It sends those files to the Flask backend, which handles all processing and connects to AI services through custom API endpoints.
Speech Recognition (Whisper): We used OpenAI’s Whisper to transcribe speech and detect timing, pauses, and repeated syllables. This data helps pinpoint the exact moments where stuttering occurs.
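A rough sketch of this step, assuming the open-source `openai-whisper` package (the helper below is illustrative, not ReVoice's actual code; model size and file names are placeholders):

```python
# Transcribe with word-level timestamps, then flatten them into simple
# (word, start, end) tuples for downstream stutter detection.
def extract_word_timings(result):
    """Flatten a Whisper result dict into (word, start, end) tuples."""
    timings = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            timings.append((w["word"].strip().lower(), w["start"], w["end"]))
    return timings

if __name__ == "__main__":
    import whisper  # pip install openai-whisper
    model = whisper.load_model("base")
    # word_timestamps=True gives per-word start/end times, not just segments
    result = model.transcribe("input.wav", word_timestamps=True)
    print(extract_word_timings(result))
```

The per-word start and end times are what make stutter localization possible: gaps and durations, not just the transcript text.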
Voice Cloning (ElevenLabs): Using ElevenLabs, ReVoice clones each user’s natural voice — keeping their tone, rhythm, and emotional depth. This cloned voice is used both for fluent re-synthesis and the interactive assistant.
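For readers curious what the re-synthesis call looks like, here is a minimal sketch of the ElevenLabs text-to-speech REST endpoint. The voice ID and API key are placeholders, and the `stability`/`similarity_boost` values are illustrative tuning knobs, not ReVoice's real settings:

```python
# Build (and optionally send) the POST request that synthesizes text in the
# user's cloned voice via the ElevenLabs HTTP API.
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(text, voice_id, api_key):
    """Build the POST request that speaks `text` in the cloned voice."""
    body = json.dumps({
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_tts_request("Hello, world", "YOUR_VOICE_ID", "YOUR_API_KEY")
    with urllib.request.urlopen(req) as resp:  # response body is audio bytes
        open("fluent.mp3", "wb").write(resp.read())
```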
Fluency Assistant (Powered by ElevenLabs): The built-in AI voice assistant from ElevenLabs listens to the user’s speech, detects stutter patterns, and speaks alongside them using their cloned voice. It pronounces difficult words or phrases in real time — helping users practice, improve fluency, and build confidence.
Lip Syncing (Wav2Lip ONNX): For video inputs, we used Wav2Lip-ONNX to synchronize the regenerated fluent voice perfectly with the user’s face, making it appear as if they spoke the corrected version themselves.
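Since we ran the model locally, the lip-sync step is essentially a subprocess call. The flag names below follow the original Wav2Lip inference script; the ONNX fork we used may expose slightly different options, so treat this as a sketch:

```python
# Invoke Wav2Lip locally to merge the regenerated voice with the input video.
import subprocess

def build_lipsync_cmd(face_video, fluent_audio, out_path,
                      checkpoint="checkpoints/wav2lip.pth"):
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_video,       # input video of the speaker
        "--audio", fluent_audio,    # regenerated fluent voice track
        "--outfile", out_path,      # lip-synced result
    ]

if __name__ == "__main__":
    subprocess.run(build_lipsync_cmd("user.mp4", "fluent.wav", "result.mp4"),
                   check=True)
```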
Backend Integration (Flask): A Python-based pipeline orchestrates Whisper, Gemini, and ElevenLabs APIs, automatically cleaning speech, generating new audio, and handling video post-processing.
The final audio or video is sent back through Flask to the React frontend for playback or download.
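The stages above can be sketched as one orchestration function. Every helper name here (`transcribe`, `detect_stutters`, `synthesize_fluent`, `lip_sync`) is hypothetical; in the real backend these stages are wired to Whisper, ElevenLabs, and Wav2Lip respectively:

```python
# End-to-end pipeline sketch: stages are injected as callables so the flow
# itself stays independent of any one AI service.
def run_pipeline(media_path, transcribe, detect_stutters,
                 synthesize_fluent, lip_sync=None):
    """Run one upload through the ReVoice stages and return the output path."""
    transcript = transcribe(media_path)                 # words + timings
    events = detect_stutters(transcript)                # blocks, repetitions...
    audio_path = synthesize_fluent(transcript, events)  # cloned-voice audio
    if lip_sync is not None:                            # only for video input
        return lip_sync(media_path, audio_path)
    return audio_path
```

Keeping the stages as plain callables made it easy to swap implementations while debugging the handoffs between services.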
Challenges we faced
Real-time processing: Getting Whisper and ElevenLabs to work together fast enough for near real-time feedback was tough. We had to optimize audio streaming, caching, and backend response times to keep latency low.
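One caching trick of the kind mentioned above: memoize synthesized audio by content hash so a repeated phrase never hits the TTS API twice. A minimal sketch, where `synthesize` stands in for the real network call:

```python
# Cache cloned-voice audio keyed by (voice, text) so repeats are free.
import hashlib

_cache = {}

def cached_tts(text, voice_id, synthesize):
    key = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = synthesize(text, voice_id)  # the expensive API call
    return _cache[key]
```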
Lip-sync accuracy: Wav2Lip-ONNX sometimes struggled when the user’s face wasn’t centered or lighting conditions changed. Fine-tuning the frame extraction and syncing outputs took multiple iterations.
Maintaining emotion and tone: Many TTS systems sound flat, but we wanted users to sound like themselves. Preserving natural expression in the regenerated voice using ElevenLabs’ emotional models required experimentation with model parameters and timing alignment.
Detecting different stutter types: Not all stutters are the same — blocks, repetitions, and prolongations need different handling. We had to design logic that could detect these subtle differences and guide the assistant’s response appropriately.
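To give a flavor of that logic, here is an illustrative heuristic over per-word timings of the kind Whisper provides; the thresholds are made up for the sketch and the real detection is more involved:

```python
# Classify stutter events from (word, start, end) tuples:
# repetition = same word twice in a row, block = long silent gap before a
# word, prolongation = a word stretched well beyond its usual duration.
def classify_stutters(timings, block_gap=0.8, prolongation_len=1.2):
    events = []
    for i, (word, start, end) in enumerate(timings):
        if i > 0:
            prev_word, _, prev_end = timings[i - 1]
            if word == prev_word:
                events.append(("repetition", word, start))
            elif start - prev_end > block_gap:
                events.append(("block", word, start))
        if end - start > prolongation_len:
            events.append(("prolongation", word, start))
    return events
```

Each event type then routes to a different assistant behavior: repetitions trigger co-pronunciation, blocks trigger a gentle prompt, and so on.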
Integrating multiple APIs: Whisper, Gemini, ElevenLabs, and Wav2Lip all have different formats and dependencies. Coordinating audio/video handoffs between them without breaking the pipeline was one of the biggest engineering challenges. Wav2Lip also has an open-source API, but we had to run the model locally because the documentation was outdated.
Designing a calm, human experience: People who stutter often feel anxious about hearing themselves. We wanted ReVoice to feel like a supportive companion — not a correction tool — so designing that balance in tone, pacing, and interface was crucial.
Built With
- elevenlabs
- gemini
- javascript
- python
- react
- react-native
- whisper
