ISCA Archive Interspeech 2025

Speaker Separation for an Unknown Number of Speakers with Encoder-Decoder-Based Contextual Information Module

Xue Yang, Guiru Shen, Yu Yang

Many speaker separation methods make the impractical assumption that the number of speakers is known in advance. To tackle this issue, an encoder-decoder-based attractor module was previously proposed to generate multiple speaker attractors. However, the attractors are compact vectors that discard contextual information. In this paper, an encoder-decoder-based contextual information module is proposed. During training, the LSTM decoder performs multiple iterations to sequentially derive the frame-level representations of the different speakers and then of the mixed signal. During inference, a sufficiently large number of iterations is assumed. The LSTM decoder can autonomously determine at which iteration to derive the frame-level representation of the mixed signal, thereby determining the appropriate channel for its estimation. The number of speakers can then be determined through a similarity computation. Experimental results show that high performance is achieved in both speaker counting and separation.
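The speaker-counting step can be illustrated with a minimal sketch. The abstract does not specify which quantities are compared or which similarity measure is used, so everything below is an assumption made for illustration: each decoder iteration is taken to yield a frame-level representation (a list of frames, each a list of floats), a reference representation of the mixture is assumed to be available, and the iteration whose output has the highest mean frame-wise cosine similarity to that reference is treated as the mixture channel. The names `cosine`, `mean_frame_similarity`, and `count_speakers` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_frame_similarity(rep, ref):
    """Average the frame-wise cosine similarity between two frame sequences."""
    sims = [cosine(frame, ref_frame) for frame, ref_frame in zip(rep, ref)]
    return sum(sims) / len(sims)


def count_speakers(iteration_reps, mixture_ref):
    """Assumed counting rule, not the authors' stated method.

    The iteration most similar to the mixture reference is taken as the
    mixture channel; every iteration before it is counted as one speaker.
    Returns (speaker_count, per-iteration similarity scores).
    """
    scores = [mean_frame_similarity(rep, mixture_ref) for rep in iteration_reps]
    mixture_index = max(range(len(scores)), key=scores.__getitem__)
    return mixture_index, scores
```

With toy inputs in which the first two iterations resemble individual sources and the third matches the mixture reference, `count_speakers` reports two speakers, and any later iterations are ignored. A trained system would presumably operate on learned representations rather than hand-written vectors.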