ISCA Archive Interspeech 2022

WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation

Yang Wang, Chenxing Li, Feng Deng, Shun Lu, Peng Yao, Jianchao Tan, Chengru Song, Xiaorui Wang

The standard Conformer adopts convolution layers to exploit local features. However, one-dimensional convolution ignores the correlation between adjacent time-frequency features. In this paper, we design a two-dimensional window attention block with dilation and propose a window attention-based Transformer network (named WA-Transformer) for multi-task audio source separation. The proposed WA-Transformer adopts self-attention and window attention blocks to model global dependencies and local correlations in a parameter-efficient way. In addition, it follows a two-stage pipeline, in which the first stage separates the mixture into three types of audio signals (speech, music, and noise), and the second stage performs signal compensation. Experiments demonstrate the effectiveness of WA-Transformer: it achieves 13.86 dB, 12.22 dB, and 11.21 dB signal-to-distortion ratio improvements on the speech, music, and noise tracks, respectively, and shows advantages over several well-known models.
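To make the local-modeling idea concrete, the sketch below shows one plausible form of a dilated two-dimensional window attention block over time-frequency features: the input is partitioned into dilated windows along both the time and frequency axes, and self-attention is computed within each window. This is a minimal illustrative sketch under assumed shapes and hyperparameters (single head, window size 4, dilation 2), not the authors' implementation.

```python
# Minimal sketch of a dilated 2D window attention block over time-frequency
# features, assuming an input of shape (batch, time, freq, channels).
# All names, window sizes, and the single-head design are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class WindowAttention2D(nn.Module):
    def __init__(self, channels: int, window: int = 4, dilation: int = 1):
        super().__init__()
        self.window = window
        self.dilation = dilation
        self.qkv = nn.Linear(channels, channels * 3)
        self.proj = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, F, C); partition into dilated (window x window) patches.
        b, t, f, c = x.shape
        w, d = self.window, self.dilation
        span = w * d
        assert t % span == 0 and f % span == 0, "pad T and F to a multiple of window*dilation"
        # Split each axis into (blocks, window, dilation) so that elements with
        # the same dilation offset form one local window.
        x = x.view(b, t // span, w, d, f // span, w, d, c)
        x = x.permute(0, 1, 4, 3, 6, 2, 5, 7)            # (B, tb, fb, d, d, w, w, C)
        x = x.reshape(-1, w * w, c)                       # each window is a short sequence
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # attention within each window
        out = attn.softmax(dim=-1) @ v
        out = self.proj(out)
        # Reverse the partitioning back to (B, T, F, C).
        out = out.view(b, t // span, f // span, d, d, w, w, c)
        out = out.permute(0, 1, 5, 3, 2, 6, 4, 7).reshape(b, t, f, c)
        return out


if __name__ == "__main__":
    block = WindowAttention2D(channels=64, window=4, dilation=2)
    feats = torch.randn(2, 32, 32, 64)                    # (batch, time, freq, channels)
    print(block(feats).shape)                              # torch.Size([2, 32, 32, 64])
```

In the full model such a block would complement the standard self-attention layers, which capture global dependencies across the whole time-frequency plane, while the windowed attention focuses on local 2D correlations that one-dimensional convolution misses.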