ISCA Archive Interspeech 2023

Investigation of Training Mute-Expressive End-to-End Speech Separation Networks for an Unknown Number of Speakers

Younggwan Kim, Hyungjun Lim, Kiho Yeom, Eunjoo Seo, Hoodong Lee, Stanley Jungkyu Choi, Honglak Lee

In speech separation, there has been only a limited amount of prior work handling speech mixtures with an unknown number of speakers. One simple solution to this situation is to provide a number of output channels greater than or equal to the maximum expected number of speakers and to ignore the invalid outputs, which contain meaningless signals whenever the actual number of speakers is smaller than the number of output channels. To make such invalid outputs easy to detect, the ideal scenario is for the meaningless signals to be muted. In this paper, we investigate several training methods by which separation models can mute the invalid outputs. We first introduce an on-the-fly data mixing scheme that adds small random noise to the speech mixtures. As for the training criterion, we analyze why the well-known scale-invariant signal-to-noise ratio (SI-SNR) is unsuitable for muting the invalid outputs because of its power amplification problem, and explain why we instead use the signal-to-noise ratio (SNR) criterion to avoid this problem.
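The abstract's contrast between the SI-SNR and SNR criteria can be made concrete with a small numerical sketch. The code below is an illustrative assumption, not the authors' implementation: it computes both criteria in their standard forms and evaluates them on a near-silent reference, the situation of an invalid output channel. Because SI-SNR is invariant to the scale of the estimate, an arbitrarily amplified output receives the same score as a nearly muted one, whereas SNR penalizes the amplified output and thus pushes the model toward silence.

```python
# Minimal sketch (assumed, not from the paper) contrasting the SNR and
# SI-SNR criteria on a near-silent reference, i.e. an invalid output channel.
import torch

EPS = 1e-8

def snr(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Plain SNR in dB: 10 log10(||ref||^2 / ||est - ref||^2)."""
    num = torch.sum(ref ** 2, dim=-1)
    den = torch.sum((est - ref) ** 2, dim=-1)
    return 10.0 * torch.log10((num + EPS) / (den + EPS))

def si_snr(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Scale-invariant SNR in dB: the reference is rescaled by the optimal
    projection coefficient, so scaling the estimate leaves the score unchanged."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = torch.sum(est * ref, dim=-1, keepdim=True) / (
        torch.sum(ref ** 2, dim=-1, keepdim=True) + EPS)
    target = alpha * ref
    noise = est - target
    return 10.0 * torch.log10(
        (torch.sum(target ** 2, dim=-1) + EPS) /
        (torch.sum(noise ** 2, dim=-1) + EPS))

# An invalid channel: the reference is only a tiny random noise.
torch.manual_seed(0)
ref = 1e-4 * torch.randn(16000)         # near-silent target
quiet_est = 1e-4 * torch.randn(16000)   # nearly muted estimate
loud_est = 100.0 * quiet_est            # same estimate amplified by 40 dB

# SI-SNR scores the quiet and amplified estimates identically (scale
# invariance), so it cannot penalize power amplification of invalid outputs.
print(si_snr(quiet_est, ref), si_snr(loud_est, ref))
# SNR drops sharply for the amplified estimate, favoring a muted output.
print(snr(quiet_est, ref), snr(loud_est, ref))
```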