Speech emotion diarization (SED) is the task of segmenting an audio stream into time-continuous emotional states, analogous to speaker diarization but for emotions. While traditional speech emotion recognition (SER) assigns a single emotion label to an entire utterance, real-world conversations exhibit dynamic emotional transitions that demand a more granular approach. In this work, we propose a novel multimodal SED framework that integrates text and audio embeddings frame by frame through temporal synchronization and direct concatenation, followed by a context-aware sliding-window smoothing mechanism. Audio representations are extracted with WavLM, while EmoBERTa generates text embeddings aligned to the spoken words. We evaluate our approach using the Emotion Diarization Error Rate (EDER), a metric designed specifically for SED. Experimental results show that our proposed method significantly improves diarization performance relative to score-fusion and cross-attention methods, yielding an EDER of 25%.
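To make the frame-wise fusion and smoothing steps concrete, the sketch below shows one plausible realization under stated assumptions: a fixed 20 ms audio frame hop (WavLM-like), word-level text embeddings paired with word start/end timestamps from a forced aligner, and majority voting as the sliding-window smoothing rule. The function names, the 20 ms hop, and the frame-level EDER proxy are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Assumption: one audio embedding every 20 ms (typical for WavLM-style encoders).
FRAME_HOP_S = 0.02

def align_word_embeddings(word_embs, word_spans, n_frames, dim):
    """Broadcast each word-level text embedding (e.g., from EmoBERTa) onto
    the audio frames its word occupies, given (start, end) times in seconds.
    Frames outside any word receive a zero vector."""
    frame_text = np.zeros((n_frames, dim), dtype=np.float32)
    for emb, (start, end) in zip(word_embs, word_spans):
        lo = int(start / FRAME_HOP_S)
        hi = min(int(np.ceil(end / FRAME_HOP_S)), n_frames)
        frame_text[lo:hi] = emb
    return frame_text

def fuse(audio_frames, frame_text):
    """Direct frame-wise concatenation of the two modalities."""
    return np.concatenate([audio_frames, frame_text], axis=-1)

def smooth_labels(labels, window=25):
    """Sliding-window smoothing: replace each frame's emotion label with
    the majority vote over a centered window of neighboring frames."""
    half = window // 2
    padded = np.pad(labels, half, mode="edge")
    out = np.empty_like(labels)
    for i in range(len(labels)):
        out[i] = np.bincount(padded[i:i + window]).argmax()
    return out

def frame_eder_proxy(ref, hyp):
    """Simplified frame-level proxy for EDER: the fraction of frames whose
    hypothesized emotion disagrees with the reference. The published EDER
    is computed over segment durations rather than individual frames."""
    return float(np.mean(np.asarray(ref) != np.asarray(hyp)))
```

In this sketch, temporal synchronization amounts to mapping word time spans onto the audio frame grid before concatenation, so every fused frame carries both acoustic and lexical context; the smoothing window then suppresses spurious single-frame emotion flips.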