ISCA Archive Interspeech 2023

Automatic Speech Recognition Transformer with Global Contextual Information Decoder

Yukun Qian, Xuyi Zhuang, Mingjiang Wang

Most current automatic speech recognition (ASR) models use decoders that have no access to global contextual information at the token level. We therefore propose a decoder structure that incorporates text-level global contextual information. We construct the global information encoder on the basis of non-autoregressive recognition and, to eliminate the non-autoregressive independence assumption, add a self-attention layer with rotary position encoding. The resulting text-level global contextual information is fused into the decoder through cross-attention, yielding a decoder with contextual information. Our model achieves a character error rate of 3.92% on the AISHELL-1 validation set and 4.35% on the test set, reducing the error rate by 1.72% (dev) / 2.13% (test) relative to the baseline model and achieving state-of-the-art (SOTA) performance. Finally, we use visualization techniques to explain the role of global information in the decoder.
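The abstract's self-attention layer relies on rotary position encoding (RoPE), whose key property is that attention scores between two rotated vectors depend only on their relative positions. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function name, the half-split pairing of dimensions, and the base frequency are illustrative assumptions.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position encoding to a sequence of vectors.

    x: array of shape (seq_len, dim), dim even; row t is the vector
    at position t. Dimension i is paired with dimension i + dim//2,
    and each pair is rotated by an angle t * base**(-i / (dim//2)).
    (Illustrative sketch; layout and base are assumptions.)
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per dimension pair.
    freqs = base ** (-np.arange(half) / half)            # (half,)
    angles = np.outer(np.arange(seq_len), freqs)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied independently to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each position is encoded as a pure rotation, norms are preserved and the dot product between positions m and n depends only on m - n, which is what lets the added self-attention layer model token order without an absolute position table.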