ISCA Archive Interspeech 2025

Voice-Based Dysphagia Detection: Leveraging Self-Supervised Speech Representation

Injune Hwang, Jung-Min Kim, Ju Seok Ryu, Kyogu Lee

This study introduces a framework for diagnosing dysphagia using self-supervised speech representation learning (SSL) models. Previously reported methods typically rely on mel spectrograms; however, because medical data are scarce, they struggle to diagnose dysphagia accurately from these low-dimensional features. In contrast, SSL models, which are trained on large-scale speech data, are well suited to tasks with small datasets. Employing SSL features significantly enhances performance, allowing the model size to be reduced while still outperforming larger models based on mel spectrograms. Although specificity decreased, recall, a crucial metric for disease diagnosis, improved markedly, leading to an overall improvement in diagnostic accuracy. Among the SSL models evaluated, features from the 10th layer of WavLM yielded the highest performance. Additionally, increasing the filter size in the convolutional layers did not yield performance gains.
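The relationship between recall, specificity, and accuracy mentioned above can be illustrated with a minimal sketch. The function name `diagnostic_metrics` and all confusion-matrix counts below are hypothetical and chosen only for illustration; they are not results from this study.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (recall, specificity, accuracy) from confusion-matrix counts.

    tp/fn: patients with the condition, correctly/incorrectly classified.
    tn/fp: healthy subjects, correctly/incorrectly classified.
    """
    recall = tp / (tp + fn)            # fraction of true patients detected
    specificity = tn / (tn + fp)       # fraction of healthy subjects cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, specificity, accuracy


# Hypothetical counts (NOT taken from the paper): 50 patients, 50 healthy subjects.
baseline = diagnostic_metrics(tp=30, fn=20, tn=45, fp=5)
ssl_based = diagnostic_metrics(tp=45, fn=5, tn=40, fp=10)

for name, (rec, spec, acc) in [("baseline", baseline), ("ssl_based", ssl_based)]:
    print(f"{name}: recall={rec:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

In this made-up case, specificity falls while recall rises by a larger margin, so overall accuracy still improves. This is the same qualitative pattern the abstract describes.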