ISCA Archive Interspeech 2022

Bayesian Transformer Using Disentangled Mask Attention

Jen-Tzung Chien, Yu-Han Huang

The transformer performs self-attention, which has achieved state-of-the-art performance in many applications. Multi-head attention in the transformer gathers features from individual tokens in the input sequence to form the mapping to the output sequence. The transformer has two weaknesses. First, because the attention mechanism naturally mixes the features of different tokens across the input and output sequences, the representations of the input tokens are likely to contain redundant information. Second, the attention weight patterns across different heads tend to be similar, so the model capacity is bounded. To strengthen sequential learning, this paper presents a variational disentangled mask attention in the transformer, where the redundant features are enhanced with semantic information. Latent disentanglement in multi-head attention is learned. The attention weights are filtered by a mask that is optimized through semantic clustering. The proposed attention mechanism is then implemented through Bayesian learning for clustered disentanglement. Experiments on machine translation show the merit of the disentangled mask attention.
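The following is a minimal sketch of the core idea of filtering attention weights with a learned mask, not the authors' implementation: the mask parameterization, its shape, and the renormalization step are assumptions for illustration, and the Bayesian (variational) objective and semantic clustering used to optimize the mask in the paper are omitted.

```python
# Minimal sketch of a single attention head whose weights are filtered by a
# learned soft mask (illustrative only; not the paper's exact method).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAttentionHead(nn.Module):
    def __init__(self, d_model, d_head, max_len):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        # Hypothetical mask parameters; in the paper the mask is optimized
        # by semantic clustering under a variational / Bayesian objective.
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.scale = d_head ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        weights = F.softmax(scores, dim=-1)
        # Filter the attention weights with a soft mask and renormalize, so
        # each output token attends to a less redundant subset of inputs.
        n = x.size(1)
        mask = torch.sigmoid(self.mask_logits[:n, :n])
        filtered = weights * mask
        filtered = filtered / filtered.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return torch.matmul(filtered, v)

# Usage example with arbitrary dimensions.
x = torch.randn(2, 10, 64)                                  # (batch, seq_len, d_model)
head = MaskedAttentionHead(d_model=64, d_head=16, max_len=128)
out = head(x)                                               # (2, 10, 16)
```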