Audio-only speech separation methods cannot fully exploit the correlation between the audio and visual information of a speaker, which limits separation performance. In addition, audio-visual separation methods typically fuse audio and visual features through simple feature concatenation and linear mapping, leaving the fusion process itself largely underexplored. Therefore, in this paper, we propose a time-domain audio-visual temporal convolution attention speech separation method (AVTA) that incorporates the changes of the speaker's mouth landmarks. In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better capture the contextual dependencies of temporal sequences. We then build a separation model for the speech separation task from a sequence learning and fusion network composed of MTCA blocks. AVTA achieves competitive performance on different datasets and, compared with baseline methods, strikes a better balance among training cost, computational complexity, and separation performance.
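To make the MTCA idea concrete, the following is a minimal, hedged sketch of one possible multiscale temporal convolutional attention block: parallel dilated 1-D convolutions gather context at several temporal scales, and a lightweight gating attention re-weights the fused features before a residual connection. The class name, dilation settings, and gating design are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConvAttention(nn.Module):
    """Illustrative sketch (not the paper's exact MTCA): parallel dilated
    1-D convolutions capture context at several temporal scales; a simple
    sigmoid-gated attention re-weights time steps before a residual add."""

    def __init__(self, channels: int, kernel_size: int = 3, dilations=(1, 2, 4)):
        super().__init__()
        # one branch per temporal scale (dilation rate)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size,
                          padding=d * (kernel_size - 1) // 2, dilation=d),
                nn.PReLU(),
                nn.GroupNorm(1, channels),
            )
            for d in dilations
        ])
        # 1x1 convolution fuses the multiscale branches back to `channels`
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)
        # per-time-step gating attention (an assumed, common simple choice)
        self.attn = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, channels, time)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        fused = self.fuse(feats)
        return x + fused * self.attn(fused)     # gated residual output

# usage on a dummy feature sequence
block = MultiScaleTemporalConvAttention(channels=64)
y = block(torch.randn(2, 64, 1000))             # -> shape (2, 64, 1000)
```

In a time-domain separation model, blocks of this kind would be stacked in the sequence learning and fusion network operating on encoded audio (and visual) feature sequences; the exact stacking depth and fusion points are specific to the paper's architecture.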