ISCA Archive Interspeech 2025

DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion

Xiaosu Su, BoWen Yang, Xiaowei Yi, Yun Cao

Emotional Voice Conversion (EVC) plays a vital role in improving human-computer interaction but faces challenges due to the complexity of emotion features, which are entangled with speaker and content characteristics. To overcome these challenges, we propose DiffEmotionVC, a diffusion-based framework for any-to-any EVC. Our approach integrates a dual-granularity emotion encoder that captures both utterance-level emotional context and frame-level acoustic details. It also employs an orthogonality-constrained condition encoder that disentangles emotion features through gated cross-attention while preserving feature independence with an orthogonal loss. Additionally, multi-objective diffusion training enhances both reconstruction fidelity and emotion discriminability via contrastive learning. Experimental results show a UTMOS score of 4.04 and 80% emotion recognition accuracy, demonstrating the framework's effectiveness in preserving speech quality while enhancing emotional expression.
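The abstract does not specify the exact form of the orthogonal loss. A common way to realize such a constraint is to penalize the cosine similarity between the emotion embedding and the speaker/content embedding for each utterance, driving the two representations toward orthogonal subspaces. The following is a minimal sketch under that assumption; the function name and shapes are hypothetical, not taken from the paper:

```python
import numpy as np

def orthogonality_loss(emotion_emb, other_emb):
    """Hypothetical orthogonal penalty (not the paper's exact formulation):
    mean squared cosine similarity between paired emotion embeddings and
    speaker/content embeddings, each of shape (batch, dim)."""
    # Normalize each row so the penalty depends only on the angle
    # between the two embeddings, not their magnitudes.
    e = emotion_emb / np.linalg.norm(emotion_emb, axis=1, keepdims=True)
    s = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    # Row-wise cosine similarity; squaring penalizes alignment in
    # either direction, and the loss is zero when the rows are orthogonal.
    cos = np.sum(e * s, axis=1)
    return float(np.mean(cos ** 2))
```

In a multi-objective setup like the one described, such a term would be added to the diffusion reconstruction and contrastive losses with its own weight, so gradient updates keep the emotion branch from absorbing speaker or content information.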