ISCA Archive Interspeech 2024

The Processing of Stress in End-to-End Automatic Speech Recognition Models

Martijn Bentum, Louis ten Bosch, Tom Lentz

Listeners use stress to facilitate word recognition and speech segmentation. Classical ASR systems did not incorporate stress in their recognition process. In contrast, end-to-end ASR systems may use the information carried by stress. The present study shows that Wav2vec 2.0 is indeed sensitive to stress, and that this sensitivity is not a mere reflection of the acoustic correlates of stress. Diagnostic classifiers trained on the CNN output reveal vowel-specific stress representations that perform on par with acoustic features. Stress classifiers trained on transformer layers outperform classifiers based on acoustic correlates, but degrade when context is removed, showing that higher layers take the relative nature of stress into account. Testing a stress classifier on a vowel it was not trained on shows that stress processing is to some extent abstract, i.e., the classifier does not simply detect a set of stressed vowel representations but rather their common denominator.
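The cross-vowel test described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses synthetic vectors instead of Wav2vec 2.0 representations and a plain logistic regression instead of whatever diagnostic classifier the study used. The names `make_vowel`, `train_logreg`, and `accuracy`, the embedding size, and all numeric settings are hypothetical. The assumption built into the data is that stress shifts every vowel along one shared direction, so a classifier trained on two vowels should still perform above chance on a third.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8            # hypothetical embedding size
N_PER_CLASS = 200  # tokens per stress class and vowel


def make_vowel(vowel_axis):
    """Synthetic tokens for one vowel: (features, stress labels)."""
    mean = np.zeros(DIM)
    mean[vowel_axis] = 1.0       # vowel-specific offset
    stress_shift = np.zeros(DIM)
    stress_shift[0] = 3.0        # assumed stress direction shared by all vowels
    unstressed = mean + 0.5 * rng.standard_normal((N_PER_CLASS, DIM))
    stressed = mean + stress_shift + 0.5 * rng.standard_normal((N_PER_CLASS, DIM))
    X = np.vstack([unstressed, stressed])
    y = np.concatenate([np.zeros(N_PER_CLASS), np.ones(N_PER_CLASS)])
    return X, y


def train_logreg(X, y, lr=0.5, steps=1000):
    """Logistic regression fitted with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def accuracy(w, b, X, y):
    """Fraction of tokens whose predicted stress label is correct."""
    return float((((X @ w + b) > 0) == (y == 1)).mean())


# Train on two vowels, test on a third that the classifier never saw.
X_a, y_a = make_vowel(1)
X_b, y_b = make_vowel(2)
X_c, y_c = make_vowel(3)

X_train = np.vstack([X_a, X_b])
y_train = np.concatenate([y_a, y_b])
w, b = train_logreg(X_train, y_train)

train_acc = accuracy(w, b, X_train, y_train)
held_out_acc = accuracy(w, b, X_c, y_c)
print(f"seen vowels: {train_acc:.2f}, unseen vowel: {held_out_acc:.2f}")
```

In this toy setting, above-chance accuracy on the unseen vowel indicates that the classifier relies on the shared stress direction rather than on memorised vowel-specific clusters, which is the logic behind the abstractness claim in the abstract.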