Binaural sound source localization (BSSL) aims to localize sound sources the way humans do, but it falls short in the presence of acoustic interference. While Convolutional Neural Networks (CNNs) have shown promise in localizing sounds corrupted by noise, their large parameter counts and training-data requirements make them unsuitable for real-time processing on devices such as hearing aids and robots. In this paper, we propose an adapted Vision Transformer (ViT) model for BSSL in noisy environments. Inspired by the Duplex Theory, our model applies selective attention to the frequency ranges of binaural features to aid sound localization. Our model outperformed recent CNNs and standard audio ViT models in localizing speech under unseen noises and speakers, even in challenging conditions with limited training data and few parameters. The attention heatmaps suggest differences in how humans and machines process binaural cues, opening avenues for further investigation.
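To make the architectural idea concrete, the following is a minimal sketch of a ViT that attends across frequency bands of stacked binaural features (e.g., ILD/IPD maps). All names, dimensions, cue choices, and the azimuth grid here are illustrative assumptions for exposition, not the paper's actual implementation or hyperparameters.

```python
# A minimal sketch (assumed design, not the authors' released code) of a ViT
# adapted for binaural localization: each token is one frequency band, so
# self-attention weighs frequency regions against one another, loosely
# mirroring the Duplex Theory's frequency-dependent use of binaural cues.
import torch
import torch.nn as nn

class BinauralViT(nn.Module):
    def __init__(self, n_freq=64, n_time=32, n_cues=2,
                 dim=128, depth=4, heads=4, n_azimuths=37):
        super().__init__()
        # One patch per frequency band, spanning all time frames and cues.
        self.embed = nn.Linear(n_cues * n_time, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_freq + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_azimuths)  # azimuth as classification

    def forward(self, x):  # x: (batch, n_cues, n_freq, n_time)
        b, c, f, t = x.shape
        # Flatten cues and time into one vector per frequency band.
        tokens = self.embed(x.permute(0, 2, 1, 3).reshape(b, f, c * t))
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], dim=1)
        z = self.encoder(tokens + self.pos)
        return self.head(z[:, 0])  # logits over azimuth classes

# Example: a batch of 8 binaural feature maps (2 cues x 64 bands x 32 frames).
model = BinauralViT()
logits = model(torch.randn(8, 2, 64, 32))  # -> (8, 37)
```

The 37-class output would correspond to, say, a 5-degree azimuth grid over the frontal half-plane; the actual output space in the paper may differ.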