ISCA Archive Interspeech 2024

Centroid Estimation with Transformer-Based Speaker Embedder for Robust Target Speaker Extraction

Woon-Haeng Heo, Joongyu Maeng, Yoseb Kang, Namhyun Cho

Target speaker extraction (TSE) is a technique for separating a target speaker's speech from mixed speech using a speaker embedding. However, in addition to speaker information, speaker embeddings may contain text-dependent information and environmental information, such as noise, microphone characteristics, and reverberation. This can degrade TSE performance, especially when the enrollment and target utterances are recorded in different environments. To address this issue, we propose a Transformer-based embedder for centroid estimation and a role division training method to enhance the training stability of the TSE separator. The embedder estimates the speaker centroid from the enrollment utterance, aiding the separator in extracting the target speaker. The proposed methods considerably improve speech quality and speech recognition performance compared to the baseline.
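The abstract does not define the speaker centroid. As a minimal illustration, assume it is the normalized mean of a speaker's length-normalized embeddings over several utterances, which is a common convention in speaker recognition and not necessarily the definition used in this work. The function names and toy vectors below are hypothetical. The sketch shows only the quantity being targeted, not the proposed Transformer-based embedder, which estimates the centroid from a single enrollment utterance.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def speaker_centroid(embeddings):
    """Assumed definition: normalized mean of unit-length utterance embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy data: the first coordinate stands in for speaker identity, the second
# for an utterance-specific (text or environment) offset.
utterance_embeddings = [[1.0, 0.5], [1.0, -0.5]]
centroid = speaker_centroid(utterance_embeddings)
similarity_to_first = cosine_similarity(centroid, utterance_embeddings[0])
```

In this toy case the utterance-specific offsets cancel in the mean, so the centroid points along the speaker coordinate while each single-utterance embedding is tilted away from it. This is the intuition behind preferring a centroid over a raw enrollment embedding when the enrollment and target environments differ.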