ISCA Archive Interspeech 2023

A Neural Time Alignment Module for End-to-End Automatic Speech Recognition

Dongcheng Jiang, Chao Zhang, Philip C. Woodland

End-to-end trainable (E2E) automatic speech recognition (ASR) systems achieve low word error rates, but unlike hidden Markov model (HMM)-based systems they do not model word timings or silence by default. In this paper, an extra neural aligner module is proposed for E2E ASR models, which labels word timings in a post-processing stage. Pre-trained neural transducer and attention-based encoder-decoder models are adopted as the ASR backbones for the experiments. The aligner module uses self-attention and cross-attention, and takes the hidden representations from the backbone to predict the duration of each word and any intervening silences. A novel loss is proposed for training the aligner with the backbone frozen. Experimental results show that, when trained on reference alignments from an existing HMM-based forced aligner, the proposed methods predict word timings with about 95% accuracy for correctly recognised words, and about 99% accuracy for utterances of up to 10 s when the reference text is given, both with a 200 ms tolerance.
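The abstract describes the aligner as a self-attention plus cross-attention module that maps backbone hidden states to per-word durations. The following is a minimal numpy sketch of that idea, not the paper's actual implementation: all dimensions, weight names, and the softplus duration head are illustrative assumptions, and the weights are random where a trained model would learn them.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16        # hidden size (hypothetical)
T, N = 50, 5  # encoder frames, words in the hypothesis

H = rng.standard_normal((T, d))  # hidden states from the frozen ASR backbone
E = rng.standard_normal((N, d))  # embeddings of the hypothesised words

# Random projection weights standing in for trained parameters.
def proj():
    return rng.standard_normal((d, d)) / np.sqrt(d)

Wq1, Wk1, Wv1 = proj(), proj(), proj()  # self-attention over the word sequence
Wq2, Wk2, Wv2 = proj(), proj(), proj()  # cross-attention into encoder frames
w_dur = rng.standard_normal(d) / np.sqrt(d)  # linear duration head

# Self-attention lets each word attend to its neighbours in the hypothesis.
S = attention(E @ Wq1, E @ Wk1, E @ Wv1)
# Cross-attention pools the acoustically relevant backbone frames per word.
C = attention(S @ Wq2, H @ Wk2, H @ Wv2)
# Softplus keeps the predicted per-word durations (in frames) non-negative.
durations = np.log1p(np.exp(C @ w_dur))
print(durations.shape)  # one duration per word: (5,)
```

Cumulative sums of these durations (plus any predicted silence gaps) would then give word start and end times relative to the utterance start.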