ISCA Archive Interspeech 2023
ISCA Archive Interspeech 2023

Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Maren Wallbaum, Ulrike Lüdtke, Joern Ostermann

Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80% of a wav2vec2.0 model's errors on MCV by selecting 10% of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.