ISCA Archive Interspeech 2020

Learning Better Speech Representations by Worsening Interference

Jun Wang

Can better representations be learnt from worse interfering scenarios? To verify this apparent paradox, we propose a novel framework, RISE, that performs compositional learning across the traditionally independent tasks of speech separation and speaker identification. In this framework, generic pre-training and compositional fine-tuning are proposed to mimic the bottom-up and top-down processes of the human cocktail party effect. Moreover, we investigate schemes that prevent the model from collapsing onto the easier identity-prediction task. Highly discriminative and generalizable representations can thus be learnt under severely interfering conditions. Experimental results on downstream tasks show that our learnt representations have greater discriminative power than a standard speaker verification method. Meanwhile, RISE consistently achieves higher SI-SNRi across different inference modes than DPRNN, a state-of-the-art speech separation system.
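For readers unfamiliar with the separation metric reported above, SI-SNRi is the scale-invariant signal-to-noise ratio of the separated estimate minus that of the unprocessed mixture, both measured against the clean source. The sketch below is a minimal reference implementation of that standard definition (it is not code from the paper); function names and the `eps` stabilizer are our own choices.

```python
import numpy as np

def si_snr(estimate, source, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a clean source."""
    # Remove DC offsets so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    source = source - source.mean()
    # Project the estimate onto the source: the scale-invariant "target" part.
    s_target = (np.dot(estimate, source) / (np.dot(source, source) + eps)) * source
    # Everything orthogonal to the source counts as noise.
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(estimate, mixture, source):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, source) - si_snr(mixture, source)
```

Because the metric is scale-invariant, rescaling a perfect estimate does not change its score, and any estimate closer to the source than the mixture yields a positive improvement.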