ISCA Archive Interspeech 2025

EmoJudge: LLM Based Post-Hoc Refinement for Multimodal Speech Emotion Recognition

Prabhav Singh, Jesus Villalba

A significant challenge in speech emotion recognition (SER) is building systems that accurately interpret emotions in naturalistic conditions. To address this, we present EmoJudge, our submission to the SER in Naturalistic Conditions Challenge. For the categorical SER task, we propose a novel LLM-refined multimodal approach; for the dimensional SER task, we propose a robust multimodal architecture. In both submissions, acoustic features are extracted with WavLM-Large followed by attentive pooling aided by residual networks, and RoBERTa-Large captures linguistic nuances from the text. In our experiments, late fusion with logistic regression proved the most effective method for integrating the modalities. For the categorical task, our novel contribution is to pass transcripts, speaker indicators, and audio descriptions to an LLM for post-hoc correction of conflicting predictions. Results show improvements over the baseline in both tasks, highlighting the effectiveness of the proposed approach.
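The late-fusion step mentioned above can be illustrated with a minimal sketch. It assumes each unimodal model (audio and text) emits a vector of class posteriors, and that a multinomial logistic regression is trained on their concatenation. The toy posteriors, the three-class setup, the function names, and the training hyperparameters are all illustrative assumptions, not values or code from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(audio_probs, text_probs):
    """Late-fusion input: concatenated unimodal posteriors plus a bias term."""
    return list(audio_probs) + list(text_probs) + [1.0]

def class_scores(W, x):
    """Linear score of each class for one fused feature vector."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]

def train_fusion(X, y, n_classes, lr=0.5, epochs=300):
    """Fit multinomial logistic regression with full-batch gradient descent."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = softmax(class_scores(W, x))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(dim):
                    grad[k][j] += err * x[j]
        for k in range(n_classes):
            for j in range(dim):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def predict(W, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(W, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical (audio posteriors, text posteriors, label) triples.
# In rows 2 and 4 the audio model alone would pick the wrong class.
toy = [
    ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], 0),
    ([0.4, 0.5, 0.1], [0.8, 0.1, 0.1], 0),
    ([0.2, 0.7, 0.1], [0.3, 0.5, 0.2], 1),
    ([0.5, 0.4, 0.1], [0.1, 0.8, 0.1], 1),
    ([0.1, 0.2, 0.7], [0.2, 0.2, 0.6], 2),
    ([0.3, 0.3, 0.4], [0.1, 0.1, 0.8], 2),
]
X = [fuse(a, t) for a, t, _ in toy]
y = [label for _, _, label in toy]
W = train_fusion(X, y, n_classes=3)
predictions = [predict(W, x) for x in X]
```

Fusing at the posterior level keeps the unimodal models independent, so either one can be retrained or replaced without touching the other. A practical system would more likely use a library implementation of logistic regression and fit the fusion weights on held-out data.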