ISCA Archive Interspeech 2022

Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR

Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe

Transformer-based encoder-decoder models are widely used for end-to-end automatic speech recognition. However, it has been found that the self-attention weight matrix can be too peaky and biased toward the diagonal. Such an attention weight matrix contains little useful context information, which may result in poor speech recognition performance. We therefore propose two attention weight smoothing methods, based on the hypothesis that an attention weight matrix whose diagonal components are not peaky can capture more context information. The first linearly interpolates the attention weights with a learnable truncated prior distribution. The second uses the attention weights from the previous layer as the prior distribution, given that lower-layer weights tend to be less peaky and less diagonal. Experiments on LibriSpeech and Wall Street Journal show that the proposed approach achieves 2.9% and 7.9% relative improvement, respectively, over a vanilla Transformer model.
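As a rough illustration of the two smoothing ideas, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formulation, so the interpolation weight `lam`, the window size `half_width`, the renormalization at sequence edges, and all function names are assumptions made for this example. In a real model the prior logits would be learned parameters; here they are fixed values.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def truncated_prior(n, logits, half_width):
    """Build an n x n row-stochastic prior (hypothetical form).

    Row i places mass only on positions j with |i - j| <= half_width.
    `logits` holds 2 * half_width + 1 scores, one per relative offset,
    and would be learnable in a real model (assumption).
    """
    prior = []
    for i in range(n):
        row = [0.0] * n
        # Keep only the offsets that fall inside the sequence.
        cols = [j for j in range(i - half_width, i + half_width + 1) if 0 <= j < n]
        # Renormalize over the in-range offsets (assumption for edge rows).
        probs = softmax([logits[j - i + half_width] for j in cols])
        for j, p in zip(cols, probs):
            row[j] = p
        prior.append(row)
    return prior


def smooth(attn, prior, lam):
    """Linear interpolation: (1 - lam) * attn + lam * prior, element-wise."""
    return [
        [(1.0 - lam) * a + lam * p for a, p in zip(attn_row, prior_row)]
        for attn_row, prior_row in zip(attn, prior)
    ]


# A peaky, diagonal-heavy attention matrix for a 4-frame sequence.
peaky = [[0.97 if i == j else 0.01 for j in range(4)] for i in range(4)]

# Method 1 (sketch): interpolate with a truncated prior of width 3.
window_prior = truncated_prior(4, [0.0, 0.0, 0.0], half_width=1)
smoothed_by_prior = smooth(peaky, window_prior, lam=0.3)

# Method 2 (sketch): use a flatter previous-layer attention matrix as the prior.
previous_layer = [[0.25] * 4 for _ in range(4)]
smoothed_by_previous = smooth(peaky, previous_layer, lam=0.3)
```

Because both inputs to `smooth` are row-stochastic, each output row still sums to one, so no extra normalization is needed after interpolation. In this toy setting the diagonal entries shrink while neighbouring positions gain weight, which is the effect the hypothesis above relies on.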