ISCA Archive Interspeech 2022

Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition

Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao

Inspired by EfficientTTS, a recently proposed speech synthesis model, we propose a new way to train end-to-end speech recognition models with an additional training objective that allows the models to learn monotonic alignments effectively and efficiently. The introduced training objective is differentiable, computationally cheap and, most importantly, imposes no constraints on network structure. It can therefore be conveniently incorporated into any speech recognition model. Through extensive experiments, we observed that our models significantly outperform the baseline models. Specifically, our best-performing model achieves a WER (Word Error Rate) of 3.18% on the LibriSpeech test-clean benchmark and 8.41% on test-other. Compared with a strong baseline obtained with WeNet, the proposed model achieves a 7.6% relative WER reduction on test-clean and a 6.9% relative WER reduction on test-other.
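To make the reported relative reductions concrete, the following minimal sketch shows how a relative WER reduction is computed and back-computes the baseline WERs implied by the figures above. The function names are hypothetical, and the baseline values are derived here from rounded numbers rather than stated in the abstract, so they should be read as approximate.

```python
def relative_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction of the model with respect to the baseline."""
    return (baseline_wer - model_wer) / baseline_wer


def implied_baseline(model_wer: float, rel_reduction: float) -> float:
    """Baseline WER implied by a model WER and a relative reduction.

    Inverts relative_reduction: model = baseline * (1 - rel_reduction).
    """
    return model_wer / (1.0 - rel_reduction)


# Figures quoted in the abstract (WER in percent, reductions as fractions).
clean_baseline = implied_baseline(3.18, 0.076)  # assumption: derived, not reported
other_baseline = implied_baseline(8.41, 0.069)  # assumption: derived, not reported

print(f"implied test-clean baseline WER: {clean_baseline:.2f}%")
print(f"implied test-other baseline WER: {other_baseline:.2f}%")
```

Under these assumptions the implied baselines are roughly 3.44% on test-clean and 9.03% on test-other; because the quoted percentages are rounded, the actual baseline values may differ slightly.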