ISCA Archive Interspeech 2024

On the Usefulness of Speaker Embeddings for Speaker Retrieval in the Wild: A Comparative Study of x-vector and ECAPA-TDNN Models

Erfan Loweimi, Mengjie Qian, Kate Knill, Mark Gales

In this paper, we investigate the efficacy of the widely used x-vector and ECAPA-TDNN speaker embeddings for speaker retrieval on the BBC Rewind corpus. In this archival collection, each file is briefly described by a synopsis. Our objective is to develop a speaker retrieval system, treating the names mentioned in the synopses as speakers. However, the provided labels exhibit significant noise, which poses challenges for model training. Further, the dataset encompasses diverse acoustic conditions, ranging from clean to highly noisy environments. To address these challenges, we develop a speaker retrieval system "in the wild" by leveraging pre-trained x-vector and ECAPA-TDNN embedding models from the SpeechBrain and NeMo toolkits. We assess the effectiveness of these embeddings and explore the usefulness of their combination. Additionally, we evaluate the models' robustness against additive noise and reverberation, as well as variations in bit depth and sampling rate.
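
To make the retrieval setting concrete, the following is a minimal sketch of embedding-based speaker retrieval with a simple embedding combination. It is an illustration under stated assumptions, not the system evaluated in the paper: the abstract does not specify the scoring function or the combination method, so cosine scoring and concatenation of length-normalised embeddings are assumed here, and the function names (`l2_normalize`, `cosine`, `combine`, `rank_files`) are hypothetical. The embeddings are plain lists of floats standing in for vectors that would in practice come from a pre-trained x-vector or ECAPA-TDNN model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (an all-zero vector is returned as zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings of equal dimension."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def combine(emb_a: Sequence[float], emb_b: Sequence[float]) -> List[float]:
    """Assumed combination: length-normalise each embedding, then concatenate.

    Normalising first keeps one model from dominating the score
    simply because its embeddings have a larger norm.
    """
    return l2_normalize(emb_a) + l2_normalize(emb_b)


def rank_files(
    query: Sequence[float], gallery: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Rank gallery files by cosine similarity to the query, best match first."""
    scored = [(file_id, cosine(query, emb)) for file_id, emb in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings; real speaker embeddings have hundreds of dimensions.
    gallery = {"file_a": [1.0, 0.0], "file_b": [0.0, 1.0], "file_c": [1.0, 1.0]}
    for file_id, score in rank_files([1.0, 0.0], gallery):
        print(file_id, round(score, 3))
```

In this sketch, a query embedding for a target speaker would be compared against one embedding per archive file (or per segment), and the ranked list would then be evaluated against the names in the synopses. To score with combined embeddings, `combine` would be applied to both the query and every gallery entry before calling `rank_files`.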