The standard Conformer adopts convolution layers to exploit local features. However, one-dimensional convolution ignores the correlation between adjacent time-frequency features. In this paper, we design a two-dimensional window attention block with dilation and propose a window attention-based Transformer network (named WA-Transformer) for multi-task audio source separation. The proposed WA-Transformer adopts self-attention and window attention blocks to model global dependencies and local correlations in a parameter-efficient way. It follows a two-stage pipeline: the first stage separates the mixture into three types of audio signals (speech, music, and noise), and the second stage performs signal compensation. Experiments demonstrate the effectiveness of WA-Transformer, which achieves signal-to-distortion ratio improvements of 13.86 dB, 12.22 dB, and 11.21 dB on the speech, music, and noise tracks, respectively, and outperforms several well-known models.
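
The abstract only names the dilated two-dimensional window attention component; as a rough illustration, the following is a minimal PyTorch sketch of what such a block over a time-frequency feature map could look like. The window and dilation sizes, the pre-norm residual layout, and the class name `DilatedWindowAttention2D` are assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DilatedWindowAttention2D(nn.Module):
    """Multi-head self-attention inside dilated 2-D windows of a
    time-frequency feature map (illustrative sketch, not the paper's code)."""
    def __init__(self, dim, num_heads=4, window=(4, 4), dilation=(2, 2)):
        super().__init__()
        self.window = window        # (wt, wf): window size along time and frequency
        self.dilation = dilation    # (dt, df): sampling stride inside each window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, F, C); T must be divisible by wt*dt and F by wf*df.
        B, T, F, C = x.shape
        (wt, wf), (dt, df) = self.window, self.dilation
        nt, nf = T // (wt * dt), F // (wf * df)
        # Factor each axis so that index t = n*wt*dt + i*dt + r; fixing
        # (n, r) and varying i yields a stride-dt (dilated) window.
        x = x.view(B, nt, wt, dt, nf, wf, df, C)
        x = x.permute(0, 1, 3, 4, 6, 2, 5, 7)      # (B, nt, dt, nf, df, wt, wf, C)
        win = x.reshape(-1, wt * wf, C)            # one dilated window per row
        h = self.norm(win)
        out, _ = self.attn(h, h, h, need_weights=False)
        out = (win + out).view(B, nt, dt, nf, df, wt, wf, C)
        out = out.permute(0, 1, 5, 2, 3, 6, 4, 7)  # undo the window partition
        return out.reshape(B, T, F, C)
```

A quick shape check under these assumptions: an input of shape `(2, 64, 64, 32)` (batch, time, frequency, channels) passes through unchanged, with attention restricted to 4x4 windows whose elements are spaced two frames/bins apart, so the block mixes local time-frequency context at low cost compared with full self-attention.

```python
x = torch.randn(2, 64, 64, 32)
block = DilatedWindowAttention2D(dim=32)
y = block(x)     # -> torch.Size([2, 64, 64, 32])
```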