Dementia, particularly Alzheimer's Disease (AD), is a growing global health crisis. According to the World Health Organization, over 55 million people live with dementia worldwide. The traditional path to diagnosis is often invasive (lumbar punctures), expensive (PET scans), or requires lengthy cognitive tests administered by specialists. Consequently, diagnosis often comes too late for effective intervention or lifestyle management.

Speech as a Biomarker, speech production involves complex cognitive planning. Even in early stages, neurodegenerative diseases manifest subtle acoustic and linguistic anomalies—such as hesitations, changes in speech rate, and intonation flattening—that are imperceptible to the human ear but detectable by AI

Our goal was to build a robust, non-invasive screening tool capable of detecting dementia from short audio clips of spontaneous speech. We aimed to overcome the "small data" limitation by engineering a Hybrid Architecture that combines the feature extraction power of Self-Supervised Learning (SSL) with the stability of classical Machine Learning.

Data Scarcity in Medical AI While Deep Learning models like Transformers have revolutionized audio analysis, they typically require massive datasets to learn effectively. In the medical domain, high-quality labeled data is scarce. In our initial experiments, we found that standard fine-tuning of large Transformer models (like Wav2Vec2) on small datasets (<300 samples) leads to severe instability and overfitting, resulting in poor generalization (accuracy hovering around ~53%, equivalent to random guessing).

This project demonstrates that Wav2Vec2 embeddings combined with SVM provide a powerful method for detecting dementia from spontaneous speech, achieving 75% accuracy on a challenging "in-the-wild" dataset.

In our initial experiments, standard fine-tuning of Transformers (Wav2Vec2) on small datasets (<300 samples) failed due to severe overfitting, yielding near-random accuracy (~53%). Our objective is to bridge this gap by engineering a Hybrid Architecture (Wav2Vec2 + SVM). This approach leverages the powerful feature extraction of Self-Supervised Learning while maintaining the stability required for robust screening on limited medical data.

Multimodal Analysis: Combining acoustic features with text transcripts (NLP) to analyze linguistic complexity (vocabulary richness). Edge Deployment: Distilling the model to run on mobile devices for accessible at-home screening. Explainability: Analyzing which specific acoustic features (e.g., pauses vs. pitch) contribute most to the model's decision to aid clinical interpretation.

Built With

Share this project:

Updates