Ruva – AI-Powered Speech Tutor

Inspiration

Public speaking is universally terrifying. Whether it's the pressure of a group discussion, the sudden panic of going blank during a presentation, or the frustration of stuttering when the spotlight hits — speech anxiety holds brilliant people back.

We realized that existing solutions either just record your voice or offer generic advice. Even typical Discord-style practice communities lack structure and flexibility, especially for people with little to no exposure to public speaking.

So we asked:

What if practice felt like training with a coach who actually knows you?

Ruva was inspired by the idea of creating a safe, adaptive, intelligent sandbox where users could practice real-world speaking scenarios — from one-on-one debates to rapid-fire JAM sessions — with an AI that remembers their struggles and tracks their growth over time.


What It Does

Ruva is a modern AI-powered speech tutor designed to dismantle speech anxiety through personalized coaching powered by a native RAG architecture.

Instead of static feedback, Ruva:

  • Tracks historical strengths and weaknesses
  • Identifies filler word usage patterns
  • Detects drops in vocal intensity
  • Measures pacing and pauses
  • Monitors improvement across sessions
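As a flavor of the first two items, here is a minimal sketch of filler-word tracking across sessions. The filler inventory and function names are illustrative, not Ruva's actual implementation:

```python
import re
from collections import Counter

# Hypothetical filler inventory; the real list is likely larger.
FILLERS = {"um", "uh", "like", "you know", "basically"}

def filler_stats(transcript: str) -> Counter:
    """Count filler-word occurrences in one session transcript."""
    text = transcript.lower()
    counts = Counter()
    for filler in FILLERS:
        # Word boundaries so "like" doesn't match "unlike".
        counts[filler] = len(re.findall(rf"\b{re.escape(filler)}\b", text))
    return +counts  # drop zero-count entries

def trend(previous: Counter, current: Counter) -> dict:
    """Per-filler change between sessions (negative = improvement)."""
    keys = set(previous) | set(current)
    return {k: current[k] - previous[k] for k in keys}
```

Persisting the per-session `Counter` is what lets later sessions report improvement rather than an isolated score.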

Users can practice inside four distinct training rooms:

Debate Mode

Face off against an AI opponent or debate another human while an AI acts as the judge.

Group Discussion Mode

Join multiplayer rooms (2+ participants) guided by an AI facilitator that manages flow and engagement.

JAM (Just-A-Minute) Mode

A high-pressure single-player mode designed to improve spontaneous speaking ability.

Reading Mode

Practice pronunciation, pacing, and clarity in a structured solo environment.

Behind the scenes, Ruva performs real-time analysis of:

  • Speech transcription
  • Prosody (pitch, jitter, shimmer)
  • Pauses
  • Sentiment
  • Speaking confidence indicators

All to generate actionable, personalized coaching feedback.
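For instance, pacing and pause metrics can fall out of word-level timestamps like those Whisper can emit. This sketch assumes a simple `(word, start_sec, end_sec)` tuple format; the threshold is illustrative:

```python
def pacing_metrics(words, pause_threshold=0.5):
    """Compute words-per-minute and long-pause count from
    (word, start_sec, end_sec) timestamps, e.g. from Whisper."""
    if not words:
        return {"wpm": 0.0, "long_pauses": 0}
    duration = words[-1][2] - words[0][1]
    wpm = len(words) / duration * 60 if duration > 0 else 0.0
    long_pauses = sum(
        1
        for (_, _, prev_end), (_, nxt_start, _) in zip(words, words[1:])
        if nxt_start - prev_end >= pause_threshold
    )
    return {"wpm": round(wpm, 1), "long_pauses": long_pauses}
```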


How We Built It

We redesigned the system architecture from the ground up to support real-time, low-latency interactions.

Frontend

Built using:

  • React
  • TypeScript
  • Vite
  • Redux (state management)

The UI supports responsive multiplayer sessions and live feedback visualization.

Backend

Powered by:

  • Python
  • FastAPI
  • WebSockets

WebSockets enable real-time bidirectional communication required for:

  • multiplayer rooms
  • live transcription
  • AI facilitation
  • audio streaming pipelines
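The multiplayer-room fan-out behind those features can be sketched as a small room manager; real FastAPI WebSocket endpoints would call its methods. The class and field names are illustrative, not Ruva's actual code:

```python
import asyncio
from collections import defaultdict

class RoomManager:
    """Tracks live WebSocket connections per room and fans
    messages out to every participant."""

    def __init__(self):
        self.rooms = defaultdict(set)  # room_id -> set of websockets

    def join(self, room_id, ws):
        self.rooms[room_id].add(ws)

    def leave(self, room_id, ws):
        self.rooms[room_id].discard(ws)

    async def broadcast(self, room_id, message: dict):
        # Send concurrently so one slow client doesn't stall the room.
        await asyncio.gather(
            *(ws.send_json(message) for ws in self.rooms[room_id])
        )
```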

Data & Memory Layer

We implemented a hybrid storage architecture:

  • MongoDB → persistent storage for user progress (core to RAG memory)
  • Redis → high-speed session state caching during live rooms
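The Redis side of the split might look like the sketch below: session state is cached under a TTL key, while durable progress lives in MongoDB. The key scheme and TTL are assumptions; `client` is anything with the redis-py `setex`/`get` interface (a real `redis.Redis` in production):

```python
import json

SESSION_TTL = 3600  # seconds a live room's cached state survives (assumed)

def cache_room_state(client, room_id: str, state: dict) -> str:
    """Cache live-session state under a TTL key so it expires
    on its own after the room ends."""
    key = f"room:{room_id}:state"
    client.setex(key, SESSION_TTL, json.dumps(state))
    return key

def load_room_state(client, room_id: str):
    raw = client.get(f"room:{room_id}:state")
    return json.loads(raw) if raw else None
```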

AI & Audio Engine

Ruva’s intelligence stack includes:

  • Google Gemini API (core reasoning engine)
  • Whisper (speech-to-text transcription)
  • Silero VAD (voice activity detection)
  • Parselmouth (scientific prosody analysis)

Together, they enable real-time speech understanding and personalized coaching.


Challenges We Ran Into

Handling real-time audio streaming was one of the toughest challenges.

We had to:

  • synchronize frontend audio streams through WebSockets
  • segment speech efficiently using Silero VAD
  • pipeline audio into Whisper transcription
  • minimize latency without breaking conversation flow
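The VAD-to-Whisper handoff can be sketched as a gate that buffers speech frames and flushes a segment once enough silence follows it. Here `is_speech` stands in for Silero VAD and `transcribe` for Whisper; both are injected so the control flow, not the models, is the point:

```python
def segment_and_transcribe(frames, is_speech, transcribe, max_silence=3):
    """Gate streamed audio frames through a VAD and flush each
    speech segment to the transcriber after `max_silence`
    consecutive silent frames."""
    transcripts, buffer, silence = [], [], 0
    for frame in frames:
        if is_speech(frame):
            buffer.append(frame)
            silence = 0
        elif buffer:
            silence += 1
            if silence >= max_silence:      # end of utterance
                transcripts.append(transcribe(b"".join(buffer)))
                buffer, silence = [], 0
    if buffer:                              # flush trailing speech
        transcripts.append(transcribe(b"".join(buffer)))
    return transcripts
```

Tuning `max_silence` is exactly the latency trade-off described above: flush too eagerly and utterances get split; wait too long and the conversation stalls.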

Another major challenge was building multiplayer facilitation logic.

For Group Discussion and Debate Mode, Gemini needed to:

  • listen to multiple speakers
  • track conversation context
  • identify speaker turns
  • intervene naturally as a moderator or judge

All without disrupting human interaction dynamics.
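One simple signal a facilitator can act on is talk-time dominance: step in only when one speaker has held the floor too long. This heuristic and its thresholds are illustrative, not Ruva's actual tuning:

```python
def should_intervene(talk_time: dict, dominance=0.6, min_total=30.0):
    """Return a facilitator nudge if one speaker holds too large a
    share of total talk time (seconds), else None."""
    total = sum(talk_time.values())
    if total < min_total or len(talk_time) < 2:
        return None  # too early in the session, or not a group setting
    speaker, seconds = max(talk_time.items(), key=lambda kv: kv[1])
    if seconds / total >= dominance:
        return f"Invite quieter participants to respond to {speaker}."
    return None
```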


Accomplishments We're Proud Of

Our biggest achievement is the native RAG-based coaching memory system.

Instead of analyzing speech in isolation, Ruva remembers things like:

“You struggled with filler words last Tuesday — let's check improvement today.”

That transforms Ruva from a tool into a mentor-like experience.
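Concretely, that memory can be folded into the prompt sent to the reasoning model. This is a minimal sketch; the field names and wording are hypothetical, not Ruva's actual retrieval schema:

```python
def build_coaching_prompt(user_name: str, history: list, transcript: str) -> str:
    """Fold retrieved session history (the RAG memory) into the
    coaching prompt. `history` items are dicts like
    {"date": ..., "weaknesses": [...]} (assumed shape)."""
    memory_lines = [
        f"- {h['date']}: weak on {', '.join(h['weaknesses'])}"
        for h in history
    ] or ["- no prior sessions"]
    return (
        f"You are a speech coach for {user_name}.\n"
        "Known history:\n" + "\n".join(memory_lines) + "\n\n"
        "Today's transcript:\n" + transcript + "\n\n"
        "Give feedback that explicitly compares today against the history."
    )
```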

We're also proud of:

  • migrating to a scalable React + FastAPI + WebSocket architecture
  • enabling real-time multiplayer speaking environments
  • implementing experimental body-language tracking using periodic visual snapshots

What We Learned

This project became a masterclass in real-time system engineering.

We gained hands-on experience with:

  • WebSocket lifecycle management
  • distributed real-time state synchronization
  • audio streaming pipelines
  • low-latency speech processing architectures
  • advanced prompt engineering with Gemini

We also explored designing specialized AI personas that act as:

  • judges
  • facilitators
  • coaches

inside different speaking environments.


What's Next for Ruva

Our immediate roadmap includes:

  • integrating Gemini Live Multimodal APIs
  • reducing response latency
  • supporting interruption-aware conversation handling
  • introducing additional structured speaking rooms
  • implementing natural voice support via third-party providers for more personalization

Our long-term vision:

Launch Ruva as a full mobile application and make personalized speech coaching accessible anywhere.

Built With

  • React, TypeScript, Vite, Redux
  • Python, FastAPI, WebSockets
  • MongoDB, Redis
  • Google Gemini API, Whisper, Silero VAD, Parselmouth
