ISCA Archive Interspeech 2025

Advancing Emotion Recognition via Ensemble Learning: Integrating Speech, Context, and Text Representations

Xiaohan Shi, Jinyi Mi, Xingfeng Li, Tomoki Toda

Speech Emotion Recognition (SER) in real-world scenarios aims to identify a speaker's emotional states from spontaneous speech. While prior research has focused on noise reduction techniques within individual domains, integrating multi-domain noise-robust representations for SER remains underexplored. To address this challenge, we propose a novel Speech-Context-Text (SCT) model, which integrates speech, context, and text representations via ensemble learning. Specifically, we introduce the Mamba method for speech representation, employ a layer adapter to capture context representation, and adopt ASR correction to refine text representation. Extensive experiments demonstrate the effectiveness of SCT, achieving a 7.4% Macro-F1 improvement over the official baseline of the Speech Emotion Recognition in Naturalistic Conditions Challenge at INTERSPEECH 2025, securing 6th place in the competition. Additionally, SCT yields 7.37% and 7.95% gains on MSP-Podcast and IEMOCAP, respectively.
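The abstract does not specify how the three branches are combined, but a common ensemble strategy for this kind of multi-representation setup is late fusion: averaging the class probabilities produced by each branch. The sketch below is a minimal illustration of that idea, not the authors' actual SCT implementation; the function names, the uniform weights, and the use of per-branch logits are all assumptions for the example.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def ensemble_predict(speech_logits: np.ndarray,
                     context_logits: np.ndarray,
                     text_logits: np.ndarray,
                     weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> np.ndarray:
    """Late-fusion ensemble: weighted average of each branch's
    class probabilities, then argmax over emotion classes.

    Hypothetical helper for illustration; the real SCT model's
    fusion scheme and branch outputs may differ.
    """
    probs = (weights[0] * softmax(speech_logits)
             + weights[1] * softmax(context_logits)
             + weights[2] * softmax(text_logits))
    return probs.argmax(axis=-1)


# Toy example: 1 utterance, 3 emotion classes.
# Two of the three branches favor class 0, so the ensemble does too.
speech = np.array([[2.0, 0.1, 0.1]])
context = np.array([[0.1, 2.0, 0.1]])
text = np.array([[2.0, 0.1, 0.1]])
pred = ensemble_predict(speech, context, text)
```

A weighted average like this lets a branch that is more reliable under a given noise condition dominate the final decision, which is one motivation for combining noise-robust representations from several domains.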