Can better representations be learnt from more severely interfering scenarios? To answer this seemingly paradoxical question, we propose RISE, a novel framework that performs compositional learning on the traditionally independent tasks of speech separation and speaker identification. Within this framework, generic pre-training and compositional fine-tuning are proposed to mimic the bottom-up and top-down processes of the human cocktail party effect. Moreover, we investigate schemes that prevent the model from collapsing to the easier task of identity prediction. As a result, highly discriminative and generalizable representations can be learnt under severely interfering conditions. Experimental results on downstream tasks show that the learnt representations have greater discriminative power than those of a standard speaker verification method. Meanwhile, RISE consistently achieves higher SI-SNRi across different inference modes than DPRNN, a state-of-the-art speech separation system.