ISCA Archive Interspeech 2025

Multimodal Emotion Diarization: Frame-Wise Integration of Text and Audio Representations

Ziv Tamir, Thomas Thebaud, Jesus Villalba, Najim Dehak, Oren Kurland

Speech emotion diarization (SED) is the task of segmenting an audio stream into time-continuous emotional states, akin to speaker diarization but for emotions. While traditional speech emotion recognition (SER) assigns a single emotion label to a given utterance, real-world conversations exhibit dynamic emotional transitions that require a more granular approach. In this work, we propose a novel multimodal SED framework that integrates text and audio embeddings frame-wise via temporal synchronization and direct concatenation, followed by a context-aware sliding-window smoothing mechanism. Audio representations are extracted with WavLM, and EmoBERTa generates text embeddings aligned to spoken words. We evaluate our approach using the Emotion Diarization Error Rate (EDER), a metric designed for SED. Experimental results show that our proposed method significantly improves diarization performance compared to score-fusion and cross-attention baselines, yielding an EDER of 25%.
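The fusion described above can be illustrated with a minimal sketch: word-level text embeddings are broadcast over their aligned frame spans, concatenated with frame-wise audio embeddings, and frame-level emotion predictions are then smoothed with a sliding window. The function names, the majority-vote smoothing rule, and the `(start_frame, end_frame)` span format are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def align_and_fuse(audio_frames, text_embs, word_spans):
    """Frame-wise fusion of audio and text embeddings (illustrative sketch).

    audio_frames: (n_frames, d_a) frame-level audio embeddings (e.g. from WavLM)
    text_embs:    (n_words, d_t) word-level text embeddings (e.g. from EmoBERTa)
    word_spans:   list of (start_frame, end_frame) giving each word's time span
    """
    n_frames = audio_frames.shape[0]
    text_frames = np.zeros((n_frames, text_embs.shape[1]))
    for emb, (s, e) in zip(text_embs, word_spans):
        text_frames[s:e] = emb  # repeat the word embedding over its frames
    # direct concatenation along the feature dimension
    return np.concatenate([audio_frames, text_frames], axis=1)

def smooth_labels(labels, window=5):
    """Sliding-window majority vote over frame-level emotion labels
    (a hypothetical stand-in for the context-aware smoothing step)."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        seg = labels[max(0, i - half): i + half + 1]
        vals, counts = np.unique(seg, return_counts=True)
        out.append(vals[np.argmax(counts)])
    return np.array(out)
```

For example, fusing 10 audio frames of dimension 4 with two word embeddings of dimension 3 yields a (10, 7) fused sequence, and a single-frame label flip inside a stable emotion segment is removed by the vote.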