Visual features offer important cues for understanding speech against noisy backgrounds. Audio-visual speech enhancement (AVSE) improves speech quality and intelligibility by combining audio and visual features, leveraging their complementary nature for effective SE. The transformer architecture demonstrates an impressive ability to learn long-range dependencies and performs effectively across various domains. This paper presents a multi-modal dual-transformer that uses the attention mechanism to capture correlations between features for audio-visual speech enhancement. The two transformers process the audio and visual features independently before fusing them in a self-supervised manner. Experiments on the AVSEC-3 noise dataset demonstrate the effectiveness of the dual-transformer for AVSE. Furthermore, on the GRID dataset, the proposed AVSE model achieves improvements of 0.76 in STOI, 16% in PESQ, and 6.73 dB in SI-SDR over the noisy mixtures.
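A minimal PyTorch sketch of the dual-transformer idea is given below: two modality-specific transformer encoders followed by attention-based fusion. The layer sizes, cross-attention fusion, and masking head are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DualTransformerAVSE(nn.Module):
    """Illustrative dual-transformer fusion block for AVSE.

    Assumptions (not from the paper): feature dimensions, encoder
    depth, cross-attention fusion, and a sigmoid masking head.
    """
    def __init__(self, audio_dim=256, visual_dim=256, num_layers=4, num_heads=4):
        super().__init__()
        # Independent transformer encoders, one per modality.
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=audio_dim, nhead=num_heads,
                                       batch_first=True),
            num_layers)
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=visual_dim, nhead=num_heads,
                                       batch_first=True),
            num_layers)
        # Cross-modal attention: audio queries attend to visual keys/values.
        self.fusion = nn.MultiheadAttention(
            embed_dim=audio_dim, kdim=visual_dim, vdim=visual_dim,
            num_heads=num_heads, batch_first=True)
        # Project fused features to a mask applied to the noisy audio features.
        self.mask_head = nn.Sequential(nn.Linear(audio_dim, audio_dim), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, audio_dim)
        # visual_feats: (batch, T_video, visual_dim)
        a = self.audio_encoder(audio_feats)
        v = self.visual_encoder(visual_feats)
        fused, _ = self.fusion(query=a, key=v, value=v)
        # Mask the noisy audio features with the fused estimate.
        return audio_feats * self.mask_head(fused)

# Usage with dummy features (hypothetical shapes).
model = DualTransformerAVSE()
noisy_audio = torch.randn(2, 100, 256)  # e.g. spectrogram frame embeddings
lip_frames = torch.randn(2, 25, 256)    # e.g. visual embeddings at 25 fps
enhanced = model(noisy_audio, lip_frames)
print(enhanced.shape)  # torch.Size([2, 100, 256])
```

In this sketch, cross-attention lets each audio frame select the visual frames most relevant to it, which is one common way to fuse modalities sampled at different rates.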