Inspiration

Communication is not just a utility; it is a fundamental human right. Yet, for millions of individuals living with motor speech disorders—such as Dyspraxia, Apraxia, ALS, and Parkinson's—the simple act of being understood is a daily, often agonizing, battle. Traditional speech therapy is the gold standard, but it faces critical limitations: it is expensive, geographically inaccessible for many, and limited to brief scheduled sessions. We asked ourselves: What if we could build a speech therapist that never sleeps? And further, what if we could build a bridge that allows someone to speak with their own voice, fluently and clearly, in real-time? VoxLift was born from this vision. We wanted to leverage the latest advancements in Multimodal AI to create a dual-purpose platform: one that rehabilitates speech over time, and one that empowers immediate communication today.

What it does

VoxLift is a comprehensive, accessibility-first platform designed to restore the power of voice. It functions through two distinct, high-impact modes:

  1. Therapy Mode (The AI Clinician):
    • Acts as an intelligent, always-available speech therapist.
    • Utilizes advanced audio processing to provide real-time, multimodal feedback on pronunciation, rhythm, and clarity.
    • Offers visual cues for tongue and lip placement, gamifying the rehabilitation process to encourage consistent practice.
    • Delivers actionable, phoneme-level insights to help users rebuild their motor speech planning pathways.
  2. Bridge Mode (The Real-Time Companion):
    • Serves as an assistive communication layer for daily interactions.
    • Users speak naturally—regardless of stuttering, slurring, or pauses—and our AI engine instantaneously interprets the semantic intent.
    • It reconstructs the fragmented speech into fluent, grammatically correct sentences.
    • The corrected text is immediately synthesized into ultra-realistic, empathetic speech using ElevenLabs, allowing the user to participate in conversations with confidence and dignity.
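The Bridge Mode flow described above can be sketched as a small orchestration: capture, interpret, reconstruct, synthesize. The sketch below is illustrative only; the stage names and types are our own, and each stage is injected so the orchestration can be exercised with stubs standing in for the real STT, Gemini, and ElevenLabs calls.

```typescript
// Hypothetical Bridge Mode pipeline. Each stage is injected so the
// orchestration (and its latency accounting) can run without real
// network calls. Names are illustrative, not VoxLift's actual API.
type Stage = (input: string) => Promise<string>;

interface BridgeStages {
  transcribe: Stage; // raw audio reference -> fragmented transcript
  repair: Stage;     // fragmented transcript -> fluent sentence (LLM)
  synthesize: Stage; // fluent sentence -> audio reference (TTS)
}

async function runBridgeMode(
  audioRef: string,
  stages: BridgeStages
): Promise<{ spoken: string; ms: number }> {
  const start = Date.now();
  const transcript = await stages.transcribe(audioRef);
  const fluent = await stages.repair(transcript);
  const spoken = await stages.synthesize(fluent);
  return { spoken, ms: Date.now() - start };
}

// Stub stages standing in for the real services:
const stubStages: BridgeStages = {
  transcribe: async () => "I... w-want... go st-store",
  repair: async () => "I want to go to the store.",
  synthesize: async (text) => `audio://${encodeURIComponent(text)}`,
};

runBridgeMode("mic-capture-001", stubStages).then(({ spoken, ms }) => {
  console.log(spoken, `(${ms}ms)`);
});
```

Keeping the stages behind an interface like this also makes the latency of each hop measurable in isolation, which matters once real API calls replace the stubs.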

How we built it

VoxLift is not just a wrapper; it is a complex orchestration of state-of-the-art AI technologies, built on a robust, scalable architecture:

  • Core Intelligence (Google Cloud Vertex AI): We architected our NLP pipeline around Gemini 1.5 Pro. Its multimodal capabilities allow us to process audio and text simultaneously, enabling a deeper understanding of "disordered" speech patterns that traditional STT models miss.
  • Voice Synthesis (ElevenLabs): To ensure the voice output feels human and personal, we integrated ElevenLabs' low-latency API. This transforms the corrected text into emotive audio that captures the user's intended tone, not just their words.
  • Frontend Engineering (Next.js 14 & Tailwind): We built a highly responsive, accessible interface using Next.js App Router. The UI is designed with Framer Motion to provide calming, fluid interactions that reduce the anxiety often associated with speech therapy tools.
  • Data Layer (Prisma & PostgreSQL): A secure backend manages sensitive user data, therapy progress tracking, and personalized configuration settings.
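To make the multimodal step above concrete: sending audio and text to Gemini in one turn amounts to building a single user message with an audio part alongside a text instruction, following Vertex AI's `generateContent` request schema. The helper below is a sketch of how that payload might be assembled; the function name is our own illustration.

```typescript
// Sketch: shape a single multimodal request for Gemini, pairing one
// base64-encoded audio part with a text instruction in the same user
// turn. The contents/role/parts/inlineData layout follows Vertex AI's
// generateContent schema; buildMultimodalRequest is illustrative.
interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string };
}

function buildMultimodalRequest(audioBase64: string, instruction: string) {
  const parts: Part[] = [
    { inlineData: { mimeType: "audio/wav", data: audioBase64 } },
    { text: instruction },
  ];
  return { contents: [{ role: "user", parts }] };
}

const request = buildMultimodalRequest(
  "UklGRg==", // placeholder base64 audio
  "Transcribe this speech and interpret the speaker's intended meaning."
);
console.log(JSON.stringify(request, null, 2));
```

Because the audio travels in the same turn as the instruction, the model can weigh prosody and pauses directly rather than working from a lossy text-only transcript.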

Challenges we ran into

  • The "Latency vs. Quality" Trade-off: In Bridge Mode, every millisecond counts. Orchestrating a pipeline that captures audio, transcribes it, corrects it via LLM, and synthesizes audio—all while maintaining conversational speed—required aggressive optimization of our API chains and edge function implementation.
  • Deciphering Non-Standard Speech: Standard Speech-to-Text models are trained on fluent speech. They frequently fail for our target demographic. We had to engineer a prompt strategy that instructs Gemini to look for intent amidst phonological errors, effectively using surrounding context to "repair" the broken speech input.
  • Designing Empathetic UI: Medical interfaces are often sterile and discouraging. Our challenge was to design a UI that felt premium and empowering—"consumer-grade," not "clinical-grade"—while still providing dense technical feedback on speech performance.
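The "repair" prompt strategy mentioned above can be illustrated with a small builder. The wording below is a sketch of the idea, not our production prompt: the key moves are telling the model what kinds of disfluencies to expect, asking it to recover intent rather than transcribe literally, and pinning down details (names, numbers) that must survive the rewrite.

```typescript
// Illustrative "repair" prompt: instruct the LLM to recover the
// speaker's intent from a fragmented transcript instead of echoing
// or lightly editing it. Wording is a sketch, not the exact prompt.
function buildRepairPrompt(fragmentedTranscript: string): string {
  return [
    "You are assisting a speaker with a motor speech disorder.",
    "The transcript below may contain repetitions, substitutions,",
    "omissions, and long pauses. Do not imitate or correct the",
    "speaker's style; recover their intended meaning and restate it",
    "as one fluent, grammatically correct sentence. Preserve names",
    "and numbers exactly as spoken.",
    "",
    `Transcript: "${fragmentedTranscript}"`,
    "Intended sentence:",
  ].join("\n");
}

console.log(buildRepairPrompt("w-want... wa- water p-please"));
```

Framing the task as intent recovery rather than transcription cleanup is what lets the model tolerate phonological errors that would derail a conventional STT post-processor.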

Accomplishments that we're proud of

  • Multimodal Integration: Successfully implementing a pipeline where audio input directly informs the LLM's context, a step beyond the text-only flows of most accessibility tools.
  • Real-Time Performance: Achieving conversational latency in Bridge Mode, making it a viable tool for actual face-to-face conversation.
  • Empowering Design: Creating a product that users want to use. The feedback on our visual cues and "glowing" interaction design has proven that accessibility tools can be beautiful.

What we learned

  • The Era of Intent-Based Computing: We learned that accessibility isn't just about recognizing words; it's about recognizing intent. LLMs are the key to unlocking this, serving as a translation layer between different cognitive and motor capabilities.
  • Simplicity is Complex: Hiding the immense complexity of our AI pipeline behind a single "Record" button was our biggest UX lesson. The user shouldn't care about the tech stack; they just want to be heard.

What's next for VoxLift

  • Personal Voice Cloning: We plan to integrate voice cloning so users with degenerative conditions (like ALS) can "bank" their healthy voice and use it in Bridge Mode forever.
  • Therapist Portal: We are building a dashboard for certified speech-language pathologists (SLPs) to remotely assign exercises and monitor their patients' progress through VoxLift.
  • Mobile Native: Porting our optimized web core to React Native to ensure VoxLift is accessible anywhere, even offline.

Built With

  • amazon-web-services
  • elevenlabs
  • gemini
  • google-cloud
  • next.js
  • prisma
  • tailwindcss
  • typescript
  • vertex-ai
  • vox