Ruva – AI-Powered Speech Tutor
Inspiration
Public speaking is universally terrifying. Whether it's the pressure of a group discussion, the sudden panic of going blank during a presentation, or the frustration of stuttering when the spotlight hits — speech anxiety holds brilliant people back.
We realized that existing solutions either just record your voice or offer generic advice. Even typical Discord-style practice communities lack structure and flexibility, especially for people with little to no exposure to public speaking.
So we asked:
What if practice felt like training with a coach who actually knows you?
Ruva was inspired by the idea of creating a safe, adaptive, intelligent sandbox where users could practice real-world speaking scenarios — from one-on-one debates to rapid-fire JAM sessions — with an AI that remembers their struggles and tracks their growth over time.
What It Does
Ruva is a modern AI-powered speech tutor designed to dismantle speech anxiety through personalized coaching powered by a native RAG architecture.
Instead of static feedback, Ruva:
- Tracks historical strengths and weaknesses
- Identifies filler word usage patterns
- Detects drops in vocal intensity
- Measures pacing and pauses
- Monitors improvement across sessions
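As one concrete illustration of the filler-word tracking above, here is a minimal, self-contained sketch (the filler lexicon and function names are our illustrative assumptions, not Ruva's actual implementation):

```python
from collections import Counter

# Hypothetical filler lexicon for illustration; Ruva's real list may differ.
FILLERS = {"um", "uh", "like", "basically", "actually"}

def filler_stats(transcript: str) -> Counter:
    """Count filler-word occurrences in one session transcript."""
    words = (w.strip(".,!?") for w in transcript.lower().split())
    return Counter(w for w in words if w in FILLERS)

def trend(prev: Counter, curr: Counter) -> dict:
    """Per-filler change between two sessions (negative = improvement)."""
    return {k: curr[k] - prev[k] for k in set(prev) | set(curr)}
```

Comparing two sessions this way is what lets the coach say whether a habit is shrinking or growing over time.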
Users can practice inside four distinct training rooms:
Debate Mode
Face off against an AI opponent or debate another human while an AI acts as the judge.
Group Discussion Mode
Join multiplayer rooms (2+ participants) guided by an AI facilitator that manages flow and engagement.
JAM (Just-A-Minute) Mode
A high-pressure single-player mode designed to improve spontaneous speaking ability.
Reading Mode
Practice pronunciation, pacing, and clarity in a structured solo environment.
Behind the scenes, Ruva performs real-time analysis of:
- Speech transcription
- Prosody (pitch, jitter, shimmer)
- Pauses
- Sentiment
- Speaking confidence indicators
All to generate actionable, personalized coaching feedback.
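To make the pacing and pause analysis concrete, a simplified sketch of the kind of metrics involved, computed from per-word timestamps such as a transcriber can emit (the `Word` type and thresholds are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def pacing_metrics(words: list[Word], pause_threshold: float = 0.5):
    """Words-per-minute plus any inter-word gaps long enough to count
    as pauses. A toy stand-in for Ruva's real-time analysis."""
    duration = words[-1].end - words[0].start
    wpm = len(words) / duration * 60.0
    pauses = [
        (a.end, b.start)
        for a, b in zip(words, words[1:])
        if b.start - a.end >= pause_threshold
    ]
    return wpm, pauses
```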
How We Built It
We redesigned the system architecture from the ground up to support real-time, low-latency interactions.
Frontend
Built using:
- React
- TypeScript
- Vite
- Redux (state management)
The UI supports responsive multiplayer sessions and live feedback visualization.
Backend
Powered by:
- Python
- FastAPI
- WebSockets
WebSockets enable real-time bidirectional communication required for:
- multiplayer rooms
- live transcription
- AI facilitation
- audio streaming pipelines
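The core pattern behind those multiplayer rooms is fan-out: anything one participant says must reach everyone else in the room. A minimal stdlib sketch of that pattern (our real rooms sit behind FastAPI WebSocket endpoints; the `RoomHub` class here is an illustrative stand-in):

```python
import asyncio
from collections import defaultdict

class RoomHub:
    """Minimal fan-out hub: each participant gets a queue, and any
    message sent to a room reaches every current member."""
    def __init__(self):
        self.rooms: dict[str, set[asyncio.Queue]] = defaultdict(set)

    def join(self, room_id: str) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.rooms[room_id].add(q)
        return q

    def leave(self, room_id: str, q: asyncio.Queue) -> None:
        self.rooms[room_id].discard(q)

    async def broadcast(self, room_id: str, message: dict) -> None:
        for q in self.rooms[room_id]:
            await q.put(message)
```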
Data & Memory Layer
We implemented a hybrid storage architecture:
- MongoDB → persistent storage for user progress (core to RAG memory)
- Redis → high-speed session state caching during live rooms
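The split above follows the classic cache-aside pattern. A schematic sketch, with plain dicts standing in for Redis and MongoDB (the class and method names are illustrative, not Ruva's actual data layer):

```python
class HybridStore:
    """Cache-aside sketch: hot session state in a fast cache,
    durable user progress in a persistent store."""
    def __init__(self):
        self.cache = {}       # stands in for Redis
        self.persistent = {}  # stands in for MongoDB

    def get_progress(self, user_id: str) -> dict:
        if user_id in self.cache:          # fast path during a live room
            return self.cache[user_id]
        record = self.persistent.get(user_id, {"sessions": []})
        self.cache[user_id] = record
        return record

    def end_session(self, user_id: str, summary: dict) -> None:
        record = self.get_progress(user_id)
        record["sessions"].append(summary)
        self.persistent[user_id] = record  # flush to durable storage
        self.cache.pop(user_id, None)      # evict hot state
```

During a live room everything stays in the fast layer; only session summaries are flushed to durable storage, which is what the RAG memory reads from later.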
AI & Audio Engine
Ruva’s intelligence stack includes:
- Google Gemini API (core reasoning engine)
- Whisper (speech-to-text transcription)
- Silero VAD (voice activity detection)
- Parselmouth (scientific prosody analysis)
Together, they enable real-time speech understanding and personalized coaching.
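To give a feel for what "jitter" means in that prosody analysis: local jitter is the mean absolute difference between consecutive glottal periods, divided by the mean period. A simplified pure-Python stand-in (real analysis goes through Parselmouth/Praat, which extracts the periods from audio):

```python
def local_jitter(periods: list[float]) -> float:
    """Local jitter: mean absolute difference between consecutive
    glottal periods over the mean period. Higher values indicate a
    shakier, less steady voice."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))
```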
Challenges We Ran Into
Handling real-time audio streaming was one of the toughest challenges.
We had to:
- synchronize frontend audio streams through WebSockets
- segment speech efficiently using Silero VAD
- pipeline audio into Whisper transcription
- minimize latency without breaking conversation flow
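The VAD segmentation step above can be sketched like this: turn per-frame speech probabilities (the kind of output a VAD such as Silero produces) into utterance boundaries. The frame size, threshold, and hangover length are illustrative assumptions, not our production values:

```python
def segment_speech(probs: list[float], frame_ms: int = 32,
                   threshold: float = 0.5, min_silence_frames: int = 3):
    """Convert per-frame speech probabilities into (start_ms, end_ms)
    segments. A segment closes only after `min_silence_frames`
    consecutive non-speech frames, so brief pauses don't split it."""
    segments, start, silence = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start * frame_ms, (i - silence + 1) * frame_ms))
                start, silence = None, 0
    if start is not None:  # speech ran to end of buffer
        segments.append((start * frame_ms, len(probs) * frame_ms))
    return segments
```

Each closed segment can then be handed to the transcription stage, which keeps Whisper from waiting on an unbounded audio stream.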
Another major challenge was building multiplayer facilitation logic.
For Group Discussion and Debate Mode, Gemini needed to:
- listen to multiple speakers
- track conversation context
- identify speaker turns
- intervene naturally as a moderator or judge
All without disrupting human interaction dynamics.
Accomplishments We're Proud Of
Our biggest achievement is the native RAG-based coaching memory system.
Instead of analyzing speech in isolation, Ruva remembers things like:
“You struggled with filler words last Tuesday — let's check improvement today.”
That transforms Ruva from a tool into a mentor-like experience.
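The retrieval half of that loop can be sketched very simply: before coaching, pull the past session notes most relevant to the current focus and feed them into the prompt. A toy word-overlap ranker stands in here for the real retrieval step (production RAG systems typically use embeddings):

```python
def retrieve_memories(query: str, notes: list[str], k: int = 2) -> list[str]:
    """Rank stored coaching notes by word overlap with the current
    focus and return the top k. Illustrative stand-in for Ruva's
    actual retrieval over MongoDB-backed session history."""
    q = set(query.lower().split())
    scored = sorted(notes,
                    key=lambda n: len(q & set(n.lower().split())),
                    reverse=True)
    return scored[:k]
```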
We're also proud of:
- migrating to a scalable React + FastAPI + WebSocket architecture
- enabling real-time multiplayer speaking environments
- implementing experimental body-language tracking using periodic visual snapshots
What We Learned
This project became a masterclass in real-time system engineering.
We gained hands-on experience with:
- WebSocket lifecycle management
- distributed real-time state synchronization
- audio streaming pipelines
- low-latency speech processing architectures
- advanced prompt engineering with Gemini
We also explored designing specialized AI personas that act as:
- judges
- facilitators
- coaches
inside different speaking environments.
What's Next for Ruva
Our immediate roadmap includes:
- integrating Gemini Live Multimodal APIs
- reducing response latency
- supporting interruption-aware conversation handling
- introducing additional structured speaking rooms
- implementing natural voice support via third-party providers for deeper personalization
Our long-term vision:
Launch Ruva as a full mobile application and make personalized speech coaching accessible anywhere.