ISCA Archive Interspeech 2024

Whister: Using Whisper’s representations for Stuttering detection

Vrushank Changawala, Frank Rudzicz

In this paper, we empirically investigate the influence of different factors on the performance of dysfluency detection. Specifically, we examine the impact of data splits, data quality, and the learned representations of large pre-trained models. In our experiments, we use a frozen Whisper model with two trainable heads, along with MFCC features extracted from the input audio. We train on different data splits and evaluate performance using a cross-corpora testing strategy. We find that longer audio segments, specifically 5 seconds as opposed to the conventional 3-second segments, lead to improved performance. We also show that our architecture design generalizes to multilingual data. We attain 9.3% and 22% relative improvements in the average F1 score on the FluencyBank and KSoF-test datasets, respectively, surpassing the previous state of the art.
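The abstract describes training lightweight heads on top of a frozen Whisper encoder together with MFCC features. The exact head architecture is not specified here, so the following is a hypothetical minimal sketch in PyTorch: a single trainable linear classifier over mean-pooled encoder states concatenated with MFCCs. The dimensions (`whisper_dim=768`, `n_mfcc=13`) and the pooling choice are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a trainable detection head over frozen Whisper
# encoder states plus MFCC features; dimensions and pooling are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, whisper_dim: int = 768, n_mfcc: int = 13, n_classes: int = 2):
        super().__init__()
        # Only this projection is trained; the Whisper encoder producing
        # `whisper_states` would be kept frozen upstream.
        self.proj = nn.Linear(whisper_dim + n_mfcc, n_classes)

    def forward(self, whisper_states: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # whisper_states: (batch, frames, whisper_dim), from the frozen encoder
        # mfcc:           (batch, frames, n_mfcc), aligned to the same frames
        x = torch.cat([whisper_states, mfcc], dim=-1)
        x = x.mean(dim=1)        # mean-pool over time (illustrative choice)
        return self.proj(x)      # (batch, n_classes) logits

# Example: a batch of four 5-second segments. Whisper's encoder emits
# ~50 frames per second, so 5 s is roughly 250 frames.
head = DetectionHead()
states = torch.randn(4, 250, 768)
mfcc = torch.randn(4, 250, 13)
logits = head(states, mfcc)
```

The paper uses two such trainable heads; this sketch shows only one to illustrate the frozen-backbone pattern, where gradients update just the head while the pre-trained representations stay fixed.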