ISCA Archive Interspeech 2025

Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion

Honghong Wang, Jing Deng, Fanqin Meng, Rong Zheng

This study investigates fine-tuning self-supervised learning (SSL) models with multi-task learning (MTL) to enhance speech emotion recognition (SER). The framework jointly handles four related tasks: emotion recognition, gender recognition, speaker verification, and automatic speech recognition. A co-attention module dynamically captures interactions between features from the primary emotion classification task and the auxiliary tasks, enabling context-aware fusion. In addition, we introduce the Sample Weighted Focal Contrastive (SWFC) loss, which addresses class imbalance and semantic confusion by up-weighting difficult and minority-class samples. The method is validated on the Categorical Emotion Recognition task of the Speech Emotion Recognition in Naturalistic Conditions Challenge, showing significant performance improvements.
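The co-attention fusion described above can be illustrated with a minimal cross-attention sketch: primary-task (emotion) features act as queries over an auxiliary-task feature stream, and the attended context is fused back residually. This is a hypothetical numpy illustration, not the paper's exact module; the function name `co_attention` and the residual-sum fusion are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(emotion_feats, aux_feats):
    """Sketch of context-aware fusion via cross-attention.

    emotion_feats: (T, d) primary-task features (queries)
    aux_feats:     (T, d) auxiliary-task features (keys/values)
    Returns (T, d): emotion features plus context gathered from the
    auxiliary stream, weighted by scaled dot-product similarity.
    """
    d = emotion_feats.shape[-1]
    scores = emotion_feats @ aux_feats.T / np.sqrt(d)  # (T, T) similarities
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    context = attn @ aux_feats                         # attended aux context
    return emotion_feats + context                     # residual fusion
```

In a real model the queries, keys, and values would pass through learned projections; the sketch keeps only the attention-and-fuse pattern.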
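The SWFC loss combines three familiar ingredients: a supervised contrastive objective, a focal term that up-weights hard anchors, and inverse-frequency class weights for minority classes. The sketch below is an illustrative reconstruction under those assumptions, not the authors' exact formulation; the temperature `tau`, focal exponent `gamma`, and the mean-positive-probability form are all assumed.

```python
import numpy as np

def swfc_loss(features, labels, gamma=2.0, tau=0.1):
    """Illustrative Sample Weighted Focal Contrastive loss.

    features: (N, d) L2-normalized embeddings
    labels:   (N,) integer class labels
    gamma:    focal exponent; hard (low-probability) anchors weigh more
    tau:      softmax temperature
    """
    N = len(labels)
    sims = features @ features.T / tau
    np.fill_diagonal(sims, -np.inf)            # exclude self-similarity
    logits = sims - sims.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    prob = exp / exp.sum(axis=1, keepdims=True)

    # inverse-frequency class weights up-weight minority classes
    counts = np.bincount(labels)
    class_w = N / (len(counts) * counts)

    loss = 0.0
    for i in range(N):
        pos = labels == labels[i]
        pos[i] = False                          # positives: same class, not self
        if not pos.any():
            continue
        p = prob[i, pos].mean()                 # mean probability of positives
        focal = (1.0 - p) ** gamma              # focal modulation for hard anchors
        loss += class_w[labels[i]] * focal * (-np.log(p + 1e-12))
    return loss / N
```

Easy, well-separated anchors contribute little (focal term near zero), while confusable or rare-class samples dominate the gradient, matching the stated goal of handling class imbalance and semantic confusion.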