Omoi: Unlocking the Silent World

Inspiration

40% of autistic children never develop functional speech. 1 in 36 US children is diagnosed with autism. Yet existing AAC (augmentative and alternative communication) devices cost $300-$10,000 and don't adapt to their users.

We saw a mother at a grocery store, unable to understand her non-verbal son's distress. He was crying not from pain, but from the frustration of having no voice.

Communication is a human right. That's why we built Omoi.

What it does

Omoi is a free AI-powered AAC platform that works in 3 steps:

  1. Record - Caregiver records their voice for personalized synthesis
  2. Select - User taps icons in any order to build sentences
  3. Speak - AI constructs the sentence and speaks it in the caregiver's voice with appropriate emotion

Unlike traditional AAC systems that rely on rigid menus, Omoi uses AI to predict intent, recognize gestures through computer vision, and detect emotions in real time.
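
As a rough illustration of the Select → Speak flow, here is a minimal sketch of icon-to-sentence construction with Gemini. The model name, prompt wording, and the icons_to_sentence helper are illustrative assumptions, not our exact production code.

```python
# Minimal sketch: turn tapped icon labels (any order) into a spoken sentence.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

def icons_to_sentence(icons, emotion="neutral"):
    """Combine icon labels into one short, natural first-person sentence."""
    prompt = (
        "You are an AAC assistant. Combine these icon labels into one short, "
        f"natural first-person sentence, spoken with a {emotion} tone: "
        + ", ".join(icons)
    )
    response = model.generate_content(prompt)
    return response.text.strip()

print(icons_to_sentence(["water", "want", "now"], emotion="urgent"))
# e.g. "I want some water right now."
```

Passing the icons in whatever order they were tapped is the point: the model, not the user, handles word order and grammar.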

How we built it

Tech Stack:

  • Google Cloud Platform for backend
  • Gemini AI for natural language prediction
  • CNN models for gesture recognition
  • Custom emotion detection trained on RAVDESS dataset
  • Eye-tracking integration using WebGazer.js
  • Support Vector Machines for emotion classification (sketched after this list)
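
For context, a minimal sketch of the SVM emotion classifier idea: per-clip MFCC features feeding an RBF SVM. The feature settings, helper names, and hyperparameters are illustrative, not our trained RAVDESS model.

```python
# Sketch: mean-MFCC features per clip -> scaled RBF SVM.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_features(path, sr=16000, n_mfcc=40):
    """Summarize one audio clip as the mean of its MFCC frames."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def train_emotion_svm(clip_paths, clip_labels):
    """Fit an RBF SVM on per-clip features; labels like 'happy', 'sad', 'angry'."""
    X = np.array([mfcc_features(p) for p in clip_paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, np.array(clip_labels))
    return clf
```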

Key Innovation: We're building the world's first dataset of non-verbal autistic communication gestures from platform usage (with consent). Every interaction makes the system smarter - more users means more data, better models, and more accurate predictions.

Architecture:

  • React frontend with offline-first design
  • Real-time voice synthesis using transfer learning
  • Computer vision pipeline processing at 30fps
  • Hybrid approach: lightweight processing on-device, complex AI in cloud (sketched below)
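
The hybrid split fits in a few lines. The confidence threshold and the local_model / cloud_client interfaces below are placeholders, not our actual services.

```python
LOCAL_CONFIDENCE_THRESHOLD = 0.85  # illustrative cut-off

def classify_frame(frame, local_model, cloud_client):
    """Run the quantized on-device model first; escalate to the cloud
    only when local confidence is too low to trust."""
    label, confidence = local_model.classify(frame)
    if confidence >= LOCAL_CONFIDENCE_THRESHOLD:
        return label                       # no network round trip
    return cloud_client.classify(frame)    # heavier model, higher latency
```

Keeping the confident cases on-device is what makes the offline-first design and the battery budget workable.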

Challenges we ran into

Eye-tracking computational demands: Eye tracking required massive processing power; our initial implementation hit only 15fps and drained batteries in 20 minutes. We solved it with the hybrid on-device/cloud split, model quantization (75% size reduction, sketched below), and frame skipping.
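
The quantization step is standard TensorFlow Lite post-training quantization; the saved-model path below is illustrative.

```python
import tensorflow as tf

# Post-training dynamic-range quantization: weights stored as int8,
# which is where most of the ~75% size reduction comes from.
converter = tf.lite.TFLiteConverter.from_saved_model("models/gaze_cnn")  # illustrative path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("models/gaze_cnn_quant.tflite", "wb") as f:
    f.write(tflite_model)
```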

Voice quality from limited data: Commercial systems need hours of audio; we had 10-15 minutes. We used data augmentation and transfer learning to raise quality ratings from 2.1/5 to 4.2/5.
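
A sketch of the kind of augmentation we mean: small pitch, tempo, and noise perturbations that multiply 10-15 minutes of recordings into a larger training set. The parameter ranges are illustrative.

```python
import numpy as np
import librosa

def augment(y, sr):
    """Yield perturbed copies of one caregiver clip for voice-model training."""
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
    yield librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    yield y + 0.005 * np.random.randn(len(y))   # light background noise
```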

Gesture recognition accuracy: Accuracy started at 67%, largely due to false positives from involuntary movements (stimming). Multi-frame verification, confidence thresholding, and personalized calibration brought it to 94%.
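
The multi-frame verification logic is simple in sketch form: accept a gesture only when the same label stays above a confidence threshold for several consecutive frames. The threshold and window length below are illustrative, and the per-user calibration is omitted.

```python
from collections import deque

CONFIDENCE_THRESHOLD = 0.8   # illustrative; tuned per user during calibration
REQUIRED_FRAMES = 5          # ~170ms of agreement at 30fps

recent = deque(maxlen=REQUIRED_FRAMES)

def verified_gesture(label, confidence):
    """Return a gesture label only after N consecutive confident detections."""
    recent.append(label if confidence >= CONFIDENCE_THRESHOLD else None)
    if len(recent) == REQUIRED_FRAMES and len(set(recent)) == 1 and recent[0]:
        return recent[0]   # stable, confident gesture
    return None            # likely stimming or a transient false positive
```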

Platform-scale data collection: Building a proprietary gesture dataset meant solving privacy (COPPA compliance), data quality, and annotation-pipeline challenges while maintaining user trust.
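
A minimal sketch of what consent-gated, pseudonymous logging looks like; the field names and hashing scheme are illustrative, not our exact pipeline.

```python
import hashlib
import time

def log_gesture_sample(user_id, gesture_label, confidence, consented):
    """Record a gesture sample only if the caregiver opted in, keyed by a
    one-way hash so no directly identifying data leaves the device."""
    if not consented:
        return None
    return {
        "user": hashlib.sha256(user_id.encode()).hexdigest(),
        "gesture": gesture_label,
        "confidence": round(confidence, 3),
        "timestamp": int(time.time()),
    }
```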

Accomplishments that we're proud of

  • Completely free when competitors charge $300-$10,000
  • 94% gesture recognition accuracy matching expensive specialized hardware
  • 340% increase in communication attempts during beta testing (23 users, 3 schools)
  • One 8-year-old constructed their first multi-word sentence after 3 years of single words
  • World's first dataset of real non-verbal autistic communication patterns - our competitive moat
  • Sub-100ms prediction latency for natural conversation

What we learned

Edge AI is hard: Model quantization, pruning, and knowledge distillation are essential. In our profiling, 80% of processing time went to memory allocation, not computation.
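
The fix was mostly about reusing buffers instead of allocating new arrays every frame; the shapes below are illustrative.

```python
import numpy as np

FRAME_SHAPE = (224, 224, 3)                               # illustrative input size
frame_buffer = np.empty(FRAME_SHAPE, dtype=np.float32)    # allocated once, reused

def preprocess_into(frame_uint8, out=frame_buffer):
    """Cast and normalize a camera frame into the reused buffer,
    avoiding a fresh allocation on every frame."""
    np.copyto(out, frame_uint8)   # uint8 -> float32 cast into existing memory
    out /= 255.0
    return out
```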

Voice synthesis is complex: Intelligible speech ≠ natural speech. Prosody, phoneme timing, and emotional modulation took months to get right.

Real users are unpredictable: Lab accuracy doesn't translate to real-world performance. Lighting, background noise, and device diversity required adaptive solutions.

Communication is prediction: Traditional AAC treats it like navigating menus. We realized it's an AI prediction problem - and built accordingly.

Most important: We learned from speech therapists, special educators, autism advocates, and families. Technology alone isn't the solution - empathy-driven design is.

What's next

Near-term (6 months):

  • Native iOS/Android apps
  • Personalized communication analytics dashboard
  • Multi-caregiver voice support
  • Reduce training audio from 15 to 5 minutes

Long-term vision:

  • Neuralink integration for direct thought-to-speech
  • MND Association partnership for progressive conditions like ALS
  • Voice preservation for users before speech loss
  • Cross-platform expansion (smart home, wearables)

Goal: 100,000 users by 2026 through school partnerships, multi-language support, and an open-source developer toolkit.

Our mission: Communication for everyone, everywhere. Making voice a fundamental right, not a luxury.


Built With

google-cloud gemini-ai tensorflow python javascript react cnn machine-learning computer-vision svm eye-tracking accessibility aac emotion-detection


Unlocking the silent world, one voice at a time. 💙
