ISCA Archive Interspeech 2020

A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments

Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu

Speech recognition technology has matured for single-talker scenarios in recent years. In noisy environments, however, and especially in multi-talker scenarios, recognition performance degrades significantly. To address this cocktail party problem, we propose a unified time-domain target speaker extraction framework. In this framework, we first obtain a voiceprint from clean speech of the target speaker and then use that voiceprint to extract the same speaker's voice from mixed speech. Conditioning on voiceprint information avoids the permutation problem, and operating in the time domain avoids the phase reconstruction problem of traditional time-frequency domain models. The framework suits scenarios where the set of speakers is relatively fixed and their voiceprints are easy to register, such as cars, homes, and meeting rooms. Our proposed global model, built on dual-path recurrent neural network (DPRNN) blocks, achieved state-of-the-art results on the speaker extraction task on the WSJ0-2mix dataset. We also built corresponding low-latency models; these showed comparable performance and a much shorter upper bound on latency than time-frequency domain models. We found that the performance of the low-latency models decreased gradually as latency decreased, a tradeoff that matters when deploying models in real application scenarios.
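To make the two-stage idea concrete, the sketch below shows a minimal PyTorch version of the pipeline: an enrollment utterance is encoded into a fixed voiceprint, which then conditions a learnable time-domain encoder/masker/decoder applied to the mixture. All module names, layer sizes, and the plain LSTM separator are illustrative assumptions; the paper's actual global model stacks DPRNN blocks, which are not reproduced here.

```python
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Encodes an enrollment (clean) waveform into a fixed-size voiceprint."""

    def __init__(self, emb_dim: int = 128):
        super().__init__()
        # 1-D conv front end over raw samples, then mean-pool over time.
        self.frontend = nn.Conv1d(1, emb_dim, kernel_size=40, stride=20)

    def forward(self, enroll: torch.Tensor) -> torch.Tensor:
        feats = torch.relu(self.frontend(enroll))  # (B, emb_dim, T')
        return feats.mean(dim=-1)                  # (B, emb_dim)


class TimeDomainExtractor(nn.Module):
    """Masks learned mixture frames, conditioned on the target voiceprint."""

    def __init__(self, feat_dim: int = 128, emb_dim: int = 128):
        super().__init__()
        # Short-stride learnable encoder in place of an STFT; no phase
        # reconstruction is needed because masking happens on these frames.
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        # A unidirectional LSTM stands in for the paper's DPRNN blocks
        # (an assumption here); it is also causal, as a low-latency
        # variant would require.
        self.separator = nn.LSTM(feat_dim + emb_dim, feat_dim, batch_first=True)
        self.mask = nn.Linear(feat_dim, feat_dim)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor, voiceprint: torch.Tensor) -> torch.Tensor:
        frames = torch.relu(self.encoder(mixture))               # (B, F, T)
        # Broadcast the voiceprint across time and concatenate per frame.
        cond = voiceprint.unsqueeze(-1).expand(-1, -1, frames.shape[-1])
        hidden, _ = self.separator(
            torch.cat([frames, cond], dim=1).transpose(1, 2))    # (B, T, F)
        mask = torch.sigmoid(self.mask(hidden)).transpose(1, 2)  # (B, F, T)
        return self.decoder(frames * mask)  # estimated target waveform (B, 1, ~L)


# Usage: extract the registered speaker's voice from a mixture.
spk_enc, extractor = SpeakerEncoder(), TimeDomainExtractor()
enroll = torch.randn(1, 1, 16000)   # 1 s of clean enrollment speech at 16 kHz
mixture = torch.randn(1, 1, 16000)  # 1 s of multi-talker mixture
estimate = extractor(mixture, spk_enc(enroll))
```

Note the latency intuition this sketch carries: the encoder frames span only a few milliseconds of audio, whereas a time-frequency model must buffer a full STFT window before producing any output, which is one plausible reading of why the time-domain models admit a much shorter latency bound. How the paper's low-latency variants actually constrain lookahead is not specified in the abstract.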