ISCA Archive Interspeech 2024

Comparing ASR Systems in the Context of Speech Disfluencies

Maria Teleki, Xiangjue Dong, Soohwan Kim, James Caverlee

In this work, we evaluate the disfluency capabilities of two automatic speech recognition systems, Google ASR and WhisperX, through a study of 10 human-annotated podcast episodes and a larger set of 82,601 podcast episodes. We employ a state-of-the-art disfluency annotation model to perform a fine-grained analysis of the disfluencies in both the scripted and non-scripted podcasts. On the set of 10 podcasts, we find that while WhisperX tends to perform better overall, Google ASR achieves better WIL (word information lost) and BLEU scores on non-scripted podcasts. We also find that Google ASR’s transcripts tend to contain counts of edited-type disfluent nodes closer to the ground truth, while WhisperX’s transcripts are closer for interjection-type disfluent nodes; the same pattern holds in the larger set. Our findings have implications for the choice of an ASR model when building a larger system: the choice should depend on the distribution of disfluent nodes present in the data.
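For readers unfamiliar with the WIL metric mentioned above: word information lost is commonly defined as WIL = 1 − (H/N_ref)·(H/N_hyp), where H is the number of matched words between the reference and hypothesis transcripts. The sketch below is an illustrative, standalone implementation using a stdlib word-level alignment; it is not the paper's evaluation code, and tooling such as jiwer may align words slightly differently.

```python
from difflib import SequenceMatcher

def wil(reference: str, hypothesis: str) -> float:
    """Word information lost between two transcripts (0 = identical, 1 = no overlap).

    Illustrative sketch: words are aligned with difflib's longest-matching-blocks
    heuristic, which approximates the edit-distance alignment used by ASR toolkits.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 1.0
    # H = total number of matched (hit) words across the alignment.
    hits = sum(block.size for block in SequenceMatcher(None, ref, hyp).get_matching_blocks())
    # WIL = 1 - (H / N_ref) * (H / N_hyp)
    return 1.0 - (hits / len(ref)) * (hits / len(hyp))
```

For example, `wil("the cat sat", "the cat sat")` is 0.0, while a one-word substitution such as `wil("the cat sat", "the dog sat")` gives 1 − (2/3)·(2/3) ≈ 0.556.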