ISCA Archive Interspeech 2025

Speaker-Aware Multi-Task Learning for Speech Emotion Recognition

Xiaohan Shi, Xingfeng Li, Tomoki Toda

Speaker representations play a crucial role in achieving accurate speech emotion recognition (SER). Previous studies have primarily relied on generic speaker recognition (SR) models to extract speaker representations. However, these approaches struggle in speaker-dependent SER tasks, as they fail to capture speaker-specific characteristics effectively. To address this limitation, we propose a Speaker-Aware Multi-Task (SAMT) model, designed to jointly learn speaker-specific and emotion-specific representations for SER. Additionally, we introduce a speaker-emotion disentanglement loss to explicitly separate speaker and emotion information, further enhancing speaker representation. Extensive experiments demonstrate the effectiveness of our approach, achieving performance gains of 1.76% in speaker-dependent and 1.99% in speaker-independent settings over the baseline. Moreover, the speaker-emotion disentanglement loss further improves SER performance.
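The abstract does not state the exact form of the speaker-emotion disentanglement loss, but one common way to "explicitly separate" two sets of representations is an orthogonality penalty: penalize the (squared) cosine similarity between each speaker embedding and its paired emotion embedding so the two subspaces carry non-overlapping information. The sketch below is a minimal, hypothetical illustration of that idea in plain Python; the function names and the specific loss form are assumptions, not the paper's actual formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def disentanglement_loss(speaker_embs, emotion_embs):
    """Hypothetical disentanglement loss: mean squared cosine similarity
    between paired speaker and emotion embeddings. It is 0 when every
    pair is orthogonal (fully disentangled) and 1 when they coincide."""
    sims = [cosine(s, e) ** 2 for s, e in zip(speaker_embs, emotion_embs)]
    return sum(sims) / len(sims)

# Orthogonal embeddings incur no penalty; identical ones incur the maximum.
print(disentanglement_loss([[1.0, 0.0]], [[0.0, 1.0]]))  # 0.0
print(disentanglement_loss([[1.0, 0.0]], [[1.0, 0.0]]))  # 1.0
```

In a multi-task setup such as SAMT, a term like this would typically be added to the speaker-classification and emotion-classification losses with a weighting coefficient; the abstract does not specify how the terms are combined.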