1. Project Name

FluencyNet: Real-Time Stutter-to-Fluent Speech & Clinical AI

2. Elevator Pitch

A real-time AI assistant that instantly converts stuttered speech into fluent audio and generates clinical SOAP notes for therapy, supporting Indian languages and code-mixing (Hinglish/Tanglish).

3. Project Story

Over 79.5 million people globally suffer from stuttering, a condition often leading to social isolation and severe communication anxiety. While Speech-Language Pathologists (SLPs) provide life-changing care, therapy is expensive, time-consuming, and inaccessible to many. We noticed a critical gap: existing ASR tools (like standard Siri or Google Assistant) fail people who stutter by either cutting them off or misinterpreting disfluencies as "noise". Furthermore, for non-English speakers—especially those in India who code-mix (e.g., "Hinglish")—there are virtually no accessible tools. We built FluencyNet to bridge this gap: an AI that listens with patience and speaks with fluency.

What it does

FluencyNet is a dual-purpose AI platform designed for both People Who Stutter (PWS) and Clinicians:

  1. Real-Time Fluency Conversion: It listens to dysfluent speech (blocks, prolongations, repetitions) and instantly synthesizes a fluent, natural-sounding version of the user's intended message.
  2. Automated Clinical Analysis: It acts as a virtual SLP, analyzing speech patterns to generate SOAP Notes (Subjective, Objective, Assessment, Plan) and calculating disfluency metrics such as %SS (Percentage of Syllables Stuttered; a small calculation sketch follows this list).
  3. Multilingual Support: Unlike standard models, it is optimized for Indian languages (Hindi, Telugu, Kannada, Bengali) and handles code-switching seamlessly.
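To make the %SS metric concrete: it is stuttered syllables divided by total syllables spoken, times 100. Below is a minimal sketch of that calculation; the SyllableEvent structure is a hypothetical stand-in for our internal event format, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class SyllableEvent:
    text: str
    stuttered: bool  # True if a block, prolongation, or repetition hit this syllable

def percent_syllables_stuttered(events: list[SyllableEvent]) -> float:
    """%SS = (stuttered syllables / total syllables spoken) * 100."""
    if not events:
        return 0.0
    stuttered = sum(1 for e in events if e.stuttered)
    return 100.0 * stuttered / len(events)

# Example: 3 stuttered syllables out of 12 spoken -> 25.0 %SS
```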
How we built it

We moved away from the error-prone "sequential pipeline" concept and built a robust, low-latency architecture:

  1. Transcription (The Ear): We implemented Faster-Whisper (Large-v3) with Int8 quantization. This allows us to capture the verbatim speech, including stutters, without the model "hallucinating" or filtering them out prematurely (a transcription sketch follows this list).
  2. The Brain (Reasoning): We integrated Llama 3.1 (8B) via Ollama. This LLM serves two roles (a prompt sketch follows this list):
       • Semantic Correction: It filters disfluencies to create an "Intended Speech Transcript".
       • Clinical Coding: It classifies specific events (blocks vs. repetitions) to generate the clinical SOAP notes.
  3. The Voice (Synthesis): For the output, we used Kokoro-ONNX for high-quality English synthesis and Microsoft Edge-TTS for natural-sounding Indian languages.
  4. Real-Time Core: The entire stack is wrapped in FastAPI with WebSockets to ensure minimal latency during live conversation (a synthesis/WebSocket sketch follows this list).
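Here is a minimal sketch of the transcription step, assuming a pre-recorded clip rather than the live WebSocket stream; the decoding options shown are illustrative, not our exact server configuration:

```python
from faster_whisper import WhisperModel

# Int8 quantization keeps Large-v3 responsive on consumer hardware.
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

# condition_on_previous_text=False discourages the decoder from "smoothing
# over" repetitions, and vad_filter trims long silences (e.g., around blocks)
# without deleting any speech.
segments, info = model.transcribe(
    "clip.wav",
    language=None,                      # autodetect, needed for code-mixed audio
    condition_on_previous_text=False,
    vad_filter=True,
    word_timestamps=True,
)
verbatim = " ".join(seg.text.strip() for seg in segments)
print(info.language, "->", verbatim)
```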
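And a sketch of the reasoning step, assuming a local Ollama server with llama3.1:8b already pulled; the prompt is illustrative and deliberately compressed, not our production prompt:

```python
import ollama  # pip install ollama; assumes `ollama pull llama3.1:8b` was run

PROMPT = """You are assisting a speech-language pathologist.
Given this verbatim transcript, which may contain blocks, prolongations,
and repetitions, return two sections:
INTENDED: the fluent sentence the speaker meant to say.
EVENTS: each disfluency, labeled block, prolongation, or repetition.

Transcript: {transcript}"""

def analyze(verbatim: str) -> str:
    # One call serves both roles: the INTENDED line feeds the synthesizer,
    # the EVENTS lines feed the SOAP-note generator.
    response = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": PROMPT.format(transcript=verbatim)}],
    )
    return response["message"]["content"]

print(analyze("I w-w-want to go to the ssssstore store"))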
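Finally, a stripped-down sketch of the real-time core: a FastAPI WebSocket that accepts an intended transcript and replies with Edge-TTS audio. The endpoint path, message format, and voice name (hi-IN-SwaraNeural) are illustrative; the production server streams raw audio in both directions:

```python
import edge_tts
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def synthesize(text: str, voice: str = "hi-IN-SwaraNeural") -> bytes:
    """Collect Edge-TTS MP3 chunks for one utterance into a single buffer."""
    audio = b""
    async for chunk in edge_tts.Communicate(text, voice).stream():
        if chunk["type"] == "audio":
            audio += chunk["data"]
    return audio

@app.websocket("/ws/fluency")
async def fluency_socket(ws: WebSocket):
    await ws.accept()
    while True:
        # Simplified: receive the cleaned "intended" text, reply with fluent audio.
        intended = await ws.receive_text()
        await ws.send_bytes(await synthesize(intended))
```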
Challenges we ran into

  1. The "Verbatim" vs. "Intended" Conflict: Standard ASR models try to "fix" speech automatically, which destroys the data needed for clinical analysis. We had to tune our implementation to capture the raw stuttering events (for the clinician) while simultaneously generating clean text (for the speaker).
  2. Latency in Code-Mixed Speech: Processing "Hinglish" usually slows down models. We optimized our pipeline using Int8 quantization and Voice Activity Detection (VAD) to handle barge-ins and interruptions smoothly.
  3. Data Scarcity: Finding high-quality labeled data for Indian-language stuttering was difficult. We relied on the SEP-28k dataset structure to train our event detection logic (a label-schema sketch follows this list).
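For reference, our event detection mirrors SEP-28k's five core disfluency labels. The enum below is a simplified sketch of that mapping, not our full training code, and the exact label strings are our reading of the published annotation scheme:

```python
from enum import Enum

class DisfluencyEvent(str, Enum):
    """Core stuttering labels following the SEP-28k annotation scheme."""
    BLOCK = "Block"
    PROLONGATION = "Prolongation"
    SOUND_REPETITION = "SoundRep"
    WORD_REPETITION = "WordRep"
    INTERJECTION = "Interjection"

# Counting events per label feeds the "Objective" section of the SOAP note.
counts = {event: 0 for event in DisfluencyEvent}
```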
Accomplishments that we're proud of

  1. Successfully implementing automated SOAP note generation, a feature rarely found even in expensive clinical software.
  2. Achieving robust performance on South Indian languages (Telugu/Kannada), which are typically under-served by major tech platforms.
  3. Building a latency-optimized pipeline that can run on consumer hardware (using quantized models) without needing a massive data center.

What we learned

We learned that stuttering is not noise; it is data. By treating disfluencies as valuable information rather than errors to be deleted, we could build a system that empowers the speaker rather than erasing them. We also discovered the power of Small Language Models (SLMs) and quantization, proving that complex medical AI can run efficiently on the edge.

What's next for FluencyNet

  • Multimodal Analysis (Fluency-Net-Vision): We plan to integrate video input to detect "secondary behaviors" (like eye blinking or jaw tension) which often accompany stuttering blocks.
  • On-Device Privacy: Moving the entire pipeline to a distilled SLM (like Distil-Whisper) to run fully offline on a smartphone for HIPAA compliance.
  • End-to-End (E2E) Model: Transitioning from our current pipeline to a single "StutterFormer"-style model that performs transcription, analysis, and conversion in a single forward pass.

4. Built With (Select these tags in the submission form)
  • Python
  • FastAPI
  • WebSockets
  • Ollama
  • Llama 3.1 (8B)
  • Faster-Whisper
  • Kokoro-ONNX
  • Edge-TTS
  • Docker
  • Agno
  • Web Audio API / MediaRecorder API
  • HTML5 / Tailwind / Jinja
