Echoes

Inspiration

Speech therapy is often repetitive and inaccessible outside clinics. Patients who struggle with articulation, or who are recovering from a stroke or living with Parkinson’s disease, traumatic brain injury, apraxia of speech, or autism, need feedback even when a therapist isn’t physically present. We wanted to create an AI companion that helps them practice pronunciation visually, track emotions, and automatically share progress with their doctor.

What it does

Our system captures the patient’s live video feed, detects the lip region using MediaPipe, and compares the extracted mouth embeddings against reference templates using cosine-similarity-based matching. We integrated the HuBERT-433h pretrained lip-reading model, augmented with our own 20-word custom dataset, to improve recognition accuracy for therapy-specific vocabulary.
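The template-matching step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the toy 2-D embeddings, and the 0.8 acceptance threshold are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def match_word(embedding, templates, threshold=0.8):
    """Return (word, score) for the closest template, or (None, score)
    when even the best match falls below the (assumed) threshold."""
    best_word, best_score = None, -1.0
    for word, template in templates.items():
        score = cosine_similarity(embedding, template)
        if score > best_score:
            best_word, best_score = word, score
    if best_score < threshold:
        return None, best_score
    return best_word, best_score


# Hypothetical reference templates; real embeddings would be much longer.
templates = {"hello": [1.0, 0.0], "water": [0.0, 1.0]}
word, score = match_word([0.9, 0.1], templates)
print(word, round(score, 3))
```

Returning the score alongside the word lets the caller log it as the confidence value for that attempt.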

Each session logs:
Words spoken and their frequency.
Real-time emotional state via DeepFace.
Session duration and confidence scores.

At the end of a session, the app auto-generates a detailed therapy report (words practiced, accuracy, emotional timeline) and emails it directly to the doctor and patient.
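One way the per-session log could be aggregated into report figures is sketched below. The `SessionLog` class, its field layout, and the summary keys are hypothetical names chosen for illustration; the emotion labels are assumed to arrive as plain strings from the emotion classifier.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    # Each event is (timestamp_seconds, word, confidence, emotion).
    events: list = field(default_factory=list)

    def record(self, timestamp_s, word, confidence, emotion):
        self.events.append((timestamp_s, word, confidence, emotion))

    def summary(self):
        """Aggregate the raw events into the figures a report needs."""
        if not self.events:
            return {"word_counts": {}, "emotion_counts": {},
                    "mean_confidence": 0.0, "duration_s": 0.0}
        words = Counter(word for _, word, _, _ in self.events)
        emotions = Counter(emotion for _, _, _, emotion in self.events)
        mean_conf = sum(c for _, _, c, _ in self.events) / len(self.events)
        duration = self.events[-1][0] - self.events[0][0]
        return {"word_counts": dict(words),
                "emotion_counts": dict(emotions),
                "mean_confidence": mean_conf,
                "duration_s": duration}


log = SessionLog()
log.record(0.0, "water", 0.9, "neutral")
log.record(2.0, "water", 0.7, "happy")
log.record(5.0, "hello", 0.8, "happy")
print(log.summary())
```

Keeping the raw timestamped events, rather than only running totals, is what makes an emotional timeline possible in the final report.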

How We Built It

Frontend (React + Tailwind): Handles camera input, bounding-box overlays, real-time feedback, and report visualization.
Backend (Flask): Processes each video frame, performs face and mouth detection, runs the lip-reading model, and manages email automation using SMTP.
Machine Learning: Uses pretrained visual speech recognition (LRS3) for phoneme-to-word decoding, and a DeepFace model for emotion classification.
Reports: Aggregated analytics (accuracy, emotion distribution) are exported to TXT/PDF and emailed via a secure SMTP microservice.
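The report-emailing step might look something like the sketch below, which uses only Python's standard `email` and `smtplib` modules. The function names, subject line, attachment filename, and the choice of STARTTLS are assumptions for illustration; the host, port, and credentials are placeholders supplied by the caller.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(sender, recipients, report_text,
                       filename="therapy_report.txt"):
    """Build a message with the plain-text report attached."""
    msg = EmailMessage()
    msg["Subject"] = "Echoes therapy session report"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("The report for the latest session is attached.")
    # A str payload is attached as text/plain, UTF-8 encoded.
    msg.add_attachment(report_text, filename=filename)
    return msg


def send_report(msg, host, port, username, password):
    """Send over an encrypted connection (assumes the server supports STARTTLS)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(msg)


msg = build_report_email("echoes@example.com",
                         ["doctor@example.com", "patient@example.com"],
                         "Words practiced: 3\nMean confidence: 0.80\n")
print(msg["To"])
```

Separating message construction from sending keeps the first half testable without a mail server.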

Challenges

Pre-trained checkpoint was over 1.2 GB: downloading and setting it up locally was time-consuming and slowed iteration, especially when switching model versions.
Multiple dependencies (fairseq, av-hubert, sentencepiece): managing compatibility between these research libraries required careful environment setup and version pinning.
Checkpoint format incompatible with standard PyTorch loading: we had to write custom loading functions to map the model weights correctly and avoid deserialization errors.
Path issues between model files and vocabulary: relative-path mismatches often broke loading scripts, forcing us to manually realign model configs and tokenizer files.
AV-HuBERT’s custom transformer encoder architecture: understanding its multi-stream fusion layers was essential to integrating it properly with our simpler pipeline.
Masked-prediction pretraining approach: we had to study how HuBERT predicts masked speech tokens in its original pretraining logic before we could adapt it for inference-only use.
Strict input format (Batch, Channels, Frames, Height, Width): reshaping and normalizing video tensors into the model’s expected 5D format was nontrivial for live webcam frames.
Fairseq is Meta’s research framework, not a production-ready one: documentation gaps and rapidly changing APIs meant we had to debug directly through the source code.
Breaking changes between versions: even minor upgrades caused runtime errors or shape mismatches, so we froze the environment to a stable working state.
Accurate lip detection under variable lighting and webcam quality: ensuring consistent face and mouth tracking across users and lighting conditions was one of our biggest real-world challenges.
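To make the 5D input-format challenge concrete, here is a minimal sketch of packing a short run of mouth crops into a (Batch, Channels, Frames, Height, Width) array. It uses NumPy rather than PyTorch to stay lightweight; the function name, the single grayscale channel, the 88x88 crop size, and the mean/std normalization constants are assumptions for illustration and should be checked against the model's own configuration.

```python
import numpy as np


def frames_to_model_input(frames, mean=0.421, std=0.165):
    """Stack T grayscale HxW uint8 crops into a (1, 1, T, H, W) float32 array.

    mean and std are assumed normalization constants, not verified values.
    """
    clip = np.stack(frames).astype(np.float32) / 255.0  # (T, H, W), scaled to [0, 1]
    clip = (clip - mean) / std                           # normalize pixel intensities
    return clip[np.newaxis, np.newaxis, ...]             # add batch and channel axes


# Five hypothetical 88x88 mouth crops from a webcam stream.
frames = [np.zeros((88, 88), dtype=np.uint8) for _ in range(5)]
batch = frames_to_model_input(frames)
print(batch.shape, batch.dtype)
```

With a live webcam, the list of frames would typically be a sliding window over the most recent crops, so the frame dimension stays fixed from call to call.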

Accomplishments that we're proud of

Integrated a 1.2 GB pretrained HuBERT-433h lip-reading model with our own custom 20-word dataset for therapy-specific vocabulary.
Implemented cosine-similarity-based template matching that recognizes words reliably even without full GPU acceleration.
Designed a real-time web interface in React that provides visual feedback (lip bounding boxes, live emotions, and predictions) at under 200 ms latency.
Implemented an auto-generated clinical report system that analyzes spoken words, session duration, and emotional states, then securely emails the report to both doctor and patient.
Overcame major dependency and compatibility issues with Fairseq and AV-HuBERT, customizing the architecture for single-CPU inference on macOS.
Created a platform that could genuinely assist patients recovering from stroke, Parkinson’s, cleft palate, or speech apraxia, giving them accessible practice from home.

What We Learned

Integrating real-time MediaPipe with a web client requires efficient frame throttling and backpressure management to avoid lag. Apple’s MPS/Metal backend behaves differently from CUDA; debugging GPU graph placeholders taught us a lot about cross-platform ML deployment. Building accessible UIs for therapy showed us the value of clear visual feedback (bounding boxes, emotion icons, session summaries).
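One common way to handle backpressure for live video is to keep only the newest frames and drop stale ones when the model falls behind. The sketch below shows that idea with Python's standard `queue` module; the `LatestFrameBuffer` class is a hypothetical helper, not the project's actual implementation, and it assumes a single producer feeding frames.

```python
import queue


class LatestFrameBuffer:
    """Bounded buffer that discards the oldest frame when full,
    so a slow consumer always sees recent frames instead of a backlog."""

    def __init__(self, maxsize=1):
        self._queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # number of stale frames discarded so far

    def put(self, frame):
        while True:
            try:
                self._queue.put_nowait(frame)
                return
            except queue.Full:
                try:
                    self._queue.get_nowait()  # evict the oldest frame
                    self.dropped += 1
                except queue.Empty:
                    pass  # consumer emptied it first; retry the put

    def get(self, timeout=None):
        return self._queue.get(timeout=timeout)


buf = LatestFrameBuffer(maxsize=2)
for frame_id in range(5):  # the producer outpaces the consumer
    buf.put(frame_id)
print(buf.dropped, buf.get(), buf.get())
```

Dropping old frames trades completeness for latency, which suits live feedback: a prediction about what the patient did two seconds ago is less useful than one about what they are doing now.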

What's next for Echoes

Expand the dataset to cover multilingual speech patterns, enabling support for regional languages and diverse patient groups.
Integrate real-time audio–visual fusion using AV-HuBERT fine-tuned with synchronized audio input to boost prediction accuracy.
Deploy lightweight inference with ONNX or Core ML optimization, ensuring the system runs efficiently on mobile and low-power devices.
Collaborate with speech therapists and medical institutions to collect anonymized feedback data and refine our model for real-world clinical use.
Introduce personalized progress dashboards and emotion-based therapy recommendations, allowing doctors to tailor exercises for each patient.
Eventually release Echoes as an open-source toolkit for digital health researchers and rehabilitation centers, encouraging further innovation in speech therapy accessibility.