ISCA Archive Interspeech 2023

SEF-Net: Speaker Embedding Free Target Speaker Extraction Network

Bang Zeng, Suo Hongbin, Yulong Wan, Ming Li

Most target speaker extraction methods use a target speaker embedding as reference information. However, a speaker embedding extracted by a speaker recognition module may not be optimal for the target speaker extraction task. In this paper, we propose the Speaker Embedding Free target speaker extraction Network (SEF-Net), a novel target speaker extraction model that does not rely on speaker embeddings. SEF-Net uses cross multi-head attention in the transformer decoder to implicitly utilize the speaker information in the conformer encoding outputs of the reference speech. Experimental results show that our proposed model achieves performance comparable to that of other target speaker extraction models. SEF-Net provides a feasible new solution for performing target speaker extraction without a speaker embedding extractor or a speaker recognition loss function.
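As a rough illustration of the cross-attention idea described in the abstract, the sketch below lets each frame of a mixture encoding attend over the frames of a reference encoding, so that reference information is mixed in without ever being pooled into a single speaker embedding. This is not the authors' implementation: it assumes a single head with no learned projections, stands in hand-written toy vectors for the conformer encoder outputs, and the names `cross_attention`, `mixture`, and `reference` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(mixture_frames, reference_frames):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Queries come from the mixture; keys and values come from the reference.
    Learned query/key/value projections are omitted (assumption) to keep
    the sketch short.
    """
    dim = len(reference_frames[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in mixture_frames:
        # Similarity of this mixture frame to every reference frame.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in reference_frames
        ]
        weights = softmax(scores)
        # Weighted sum of reference frames: one output per mixture frame.
        attended.append([
            sum(w * value[d] for w, value in zip(weights, reference_frames))
            for d in range(dim)
        ])
    return attended


# Toy stand-ins for encoder outputs (hypothetical values).
mixture = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 mixture frames
reference = [[2.0, 0.0], [0.0, 2.0]]             # 2 reference frames
out = cross_attention(mixture, reference)
```

The output has one vector per mixture frame regardless of the reference length, which is why a variable-length reference utterance can be used directly in this kind of scheme. A multi-head version would run several such attentions on separately projected subspaces and concatenate the results.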