ISCA Archive Interspeech 2025

EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech

Jingyuan Xing, Zhipeng Li, Shuaiqi Chen, Xiaofen Xing, Xiangmin Xu

Zero-shot text-to-speech (TTS) supports diverse speech synthesis without speaker-specific data but struggles to accurately transfer emotions from the reference speech to the target text. Traditional approaches treat emotion as part of a global style, leading to inconsistent emotional expressiveness. To address this, we propose EATS-Speech, an Emotion-Adaptive Transformation Synthesis framework. EATS-Speech employs Emotion Priority Synthesis through a parallel pipeline that decomposes speech into non-emotion style, emotion, and content, and prioritizes emotion generation to enhance expressiveness. Furthermore, it introduces Emotion-Adaptive Transformation Synthesis, where an LLM-based converter learns text-emotion mapping patterns from the reference speech and transfers them to the target text. Experiments on the LibriTTS dataset demonstrate improvements in emotional expressiveness and in the accuracy of emotion adaptation.