ISCA Archive Interspeech 2024

Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers’ ability to model hesitation phenomena

Vincenzo Norman Vitale, Loredana Schettino, Francesco Cutugno

Modern automatic speech recognition systems can achieve remarkable performance. However, they usually neglect characteristic speech phenomena such as fillers ( ) or segmental prolongations (the), which are still treated only as disruptions to be detected and removed, despite their acknowledged regularity and procedural value. This study investigates the ability of state-of-the-art systems based on end-to-end models (E2E-ASRs) to model distinctive features of hesitation phenomena. Two types of pre-trained systems are evaluated, sharing the same Conformer-based encoder architecture but differing in their decoders: one uses a Connectionist Temporal Classification (CTC) decoder and the other a Transducer decoder. The ability of E2E-ASRs to model the acoustic information tied to such phenomena can be exploited rather than disregarded as a source of noise, which would not only improve transcription and support linguistic annotation processes, but also deepen our understanding of how these systems work.