ISCA Archive Interspeech 2023

Using Semi-supervised Learning for Monaural Time-domain Speech Separation with a Self-supervised Learning-based SI-SNR Estimator

Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Hiroaki Kudo

Speech separation aims to decompose mixed speech into independent signals. Prior research on monaural time-domain speech separation has made great progress with supervised methods. Almost all of these works are trained on simulated mixed speech signals, since obtaining ground truth for real-world mixtures is problematic. To this end, we propose a novel semi-supervised learning method for speech separation (SSLM-SS) that leverages mixed speech without ground truth. For this type of data, we further put forward a non-intrusive separated-speech quality prediction network (SSQP-Net) based on self-supervised learning. According to the results, the linear correlation coefficient between the predictions of SSQP-Net and the ground truth reaches 0.9. Moreover, SSLM-SS equipped with SSQP-Net outperforms mixture invariant training (MixIT) by 0.2 dB and 1.1 dB when 10% and 50% of the data are labeled, respectively, and rivals supervised learning.
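The metric that SSQP-Net is trained to predict, SI-SNR (scale-invariant signal-to-noise ratio), has a standard closed-form definition: the estimate is projected onto the target, and the energy of that projection is compared against the residual. A minimal NumPy sketch of this standard metric (not the paper's code; function name and `eps` guard are our own choices):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    Sketch of the standard intrusive metric; the paper's SSQP-Net
    predicts this quantity non-intrusively (without the target).
    """
    # Zero-mean both signals so DC offsets do not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Scale-invariant target: projection of the estimate onto the target.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    # Residual noise after removing the target component.
    e_noise = estimate - s_target
    # Energy ratio in dB; eps avoids division by zero for a perfect estimate.
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant.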