In this study, we propose a transformer-based architecture for talker-independent audiovisual speaker separation in the time domain. The inputs to the proposed architecture are the noisy mixture of multiple talkers and the corresponding cropped face of each talker. The two streams are fused using a cross-attention mechanism. The fusion layer is followed by a masking network that estimates one mask per talker and multiplies the mixed feature matrix by each mask to separate the speaker features. Finally, the separated features are converted back to the time domain by the decoder layer. Moreover, we propose a novel training strategy that increases the role of the video stream: training starts with relatively noisy audio, and the quality of the audio stream is gradually increased over the course of training. Experimental results demonstrate that the proposed method outperforms existing techniques on multiple metrics across several commonly used audiovisual datasets.
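To make the data flow concrete, the sketch below shows one way the described pipeline (audio and video encoding, cross-attention fusion, per-talker mask estimation, and a time-domain decoder) could be assembled in PyTorch. The class name `AVCrossAttentionSeparator`, the layer sizes, the choice of a convolutional encoder/decoder, and the number of transformer layers are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AVCrossAttentionSeparator(nn.Module):
    """Minimal sketch of the described pipeline: encode the mixture and the
    face-track features, fuse them with cross-attention, estimate one mask
    per talker, and decode the masked features back to waveforms."""

    def __init__(self, n_talkers=2, d_model=256, n_heads=8, win=16, hop=8):
        super().__init__()
        self.n_talkers = n_talkers
        # 1-D conv encoder: mixture waveform -> feature matrix (assumed front end).
        self.audio_enc = nn.Conv1d(1, d_model, kernel_size=win, stride=hop)
        # Project per-frame visual embeddings (e.g., from a face/lip network)
        # into the model dimension; the 512-dim input is an assumption.
        self.video_proj = nn.Linear(512, d_model)
        # Cross-attention fusion: audio frames attend to video frames.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Masking network: transformer encoder producing one mask per talker.
        self.mask_net = nn.Sequential(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
                num_layers=4),
            nn.Linear(d_model, n_talkers * d_model),
        )
        # Transposed-conv decoder: masked features -> time-domain waveforms.
        self.decoder = nn.ConvTranspose1d(d_model, 1, kernel_size=win, stride=hop)

    def forward(self, mixture, video):
        # mixture: (B, samples); video: (B, T_v, 512) per-frame face embeddings.
        a = self.audio_enc(mixture.unsqueeze(1)).transpose(1, 2)   # (B, T_a, D)
        v = self.video_proj(video)                                 # (B, T_v, D)
        fused, _ = self.fusion(query=a, key=v, value=v)            # (B, T_a, D)
        masks = self.mask_net(fused)                               # (B, T_a, S*D)
        masks = torch.sigmoid(
            masks.view(a.size(0), a.size(1), self.n_talkers, -1))  # (B, T_a, S, D)
        # Multiply the mixed feature matrix by each talker's mask.
        separated = a.unsqueeze(2) * masks                         # (B, T_a, S, D)
        # Decode each masked feature stream back to a waveform.
        outs = [self.decoder(separated[:, :, s].transpose(1, 2))   # (B, 1, samples)
                for s in range(self.n_talkers)]
        return torch.cat(outs, dim=1)                              # (B, S, samples)
```

Under this reading, the proposed curriculum would amount to mixing the input audio with additional noise whose level is scheduled to decrease as training progresses, which pushes the network to rely on the video stream early on.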