ISCA Archive Interspeech 2023

Binaural Sound Localization in Noisy Environments Using Frequency-Based Audio Vision Transformer (FAViT)

Waradon Phokhinanan, Nicolas Obin, Sylvain Argentieri

Binaural sound source localization (BSSL) aims to locate sound the way humans do, but it falls short in the presence of acoustic interference. While Convolutional Neural Networks (CNNs) have shown promise in localizing sounds corrupted by noise, their large parameter counts and training data requirements make them unsuitable for real-time processing on devices such as hearing aids and robots. In this paper, we propose an adapted Vision Transformer (ViT) model for BSSL in noisy environments. Inspired by the Duplex Theory, our model applies selective attention to the frequency ranges of binaural features to aid sound localization. Our model outperformed recent CNNs and standard audio ViT models in localizing speech with unseen noises and speakers, even in challenging conditions with little training data and few parameters. The attention heatmaps suggest differences in how humans and machines process binaural cues, opening avenues for further investigation.
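The binaural cues the abstract alludes to via the Duplex Theory are the interaural level difference (ILD, dominant at high frequencies) and the interaural time/phase difference (ITD/IPD, dominant at low frequencies). As a minimal, hypothetical sketch (not the paper's feature pipeline), these per-frequency cues can be computed from one STFT frame of a stereo signal with NumPy:

```python
import numpy as np

def binaural_cues(left, right, n_fft=512):
    """Per-frequency ILD (dB) and IPD (rad) from one windowed frame of a
    binaural signal -- the cues the Duplex Theory associates with high-
    and low-frequency localization, respectively. Illustrative only."""
    win = np.hanning(n_fft)
    L = np.fft.rfft(left[:n_fft] * win)
    R = np.fft.rfft(right[:n_fft] * win)
    eps = 1e-12  # avoid log/division by zero in silent bins
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))  # phase of left relative to right
    return ild, ipd

# Toy example: right channel attenuated (-6 dB) and delayed (0.2 ms)
# relative to the left, as for a source on the listener's left side.
fs, n = 16000, 512
t = np.arange(n) / fs
left = np.sin(2 * np.pi * 1000 * t)
right = 0.5 * np.sin(2 * np.pi * 1000 * (t - 2e-4))
ild, ipd = binaural_cues(left, right)
k = int(1000 * n / fs)  # FFT bin at exactly 1 kHz
print(round(ild[k], 1))  # -> 6.0 dB, i.e. the left channel is louder
```

A frequency-based model in the spirit of FAViT would attend over such cues band by band, weighting ILD-like evidence at high frequencies and phase evidence at low frequencies.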


doi: 10.21437/Interspeech.2023-2015

Cite as: Phokhinanan, W., Obin, N., Argentieri, S. (2023) Binaural Sound Localization in Noisy Environments Using Frequency-Based Audio Vision Transformer (FAViT). Proc. INTERSPEECH 2023, 3704-3708, doi: 10.21437/Interspeech.2023-2015

@inproceedings{phokhinanan23_interspeech,
  author={Waradon Phokhinanan and Nicolas Obin and Sylvain Argentieri},
  title={{Binaural Sound Localization in Noisy Environments Using Frequency-Based Audio Vision Transformer (FAViT)}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={3704--3708},
  doi={10.21437/Interspeech.2023-2015},
  issn={2308-457X}
}