ISCA Archive Interspeech 2024

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie

Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks, without relying on manual labels or reference speech. To address this, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech-style space. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we integrate the context encoder into two typical TTS models: a VITS-based TTS and a language model-based TTS. Experimental results show that our proposed approach can effectively capture diverse styles and coherent prosody, and thus improve naturalness and expressiveness in audiobook speech synthesis.
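As a rough illustration of how a text-side style space can be supervised by a speech-side style space through contrastive learning, the sketch below computes a symmetric InfoNCE-style loss over paired text and speech style embeddings. The abstract does not specify the loss, so this is an assumption and not the authors' formulation: the function names, the cosine-similarity scoring, and the `temperature` value are all hypothetical choices made for the example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cosine_logits(text_emb, speech_emb, temperature):
    """Pairwise cosine similarities between every text and speech embedding,
    divided by the temperature. Row i corresponds to text item i."""
    text_unit = [l2_normalize(v) for v in text_emb]
    speech_unit = [l2_normalize(v) for v in speech_emb]
    return [
        [sum(a * b for a, b in zip(t, s)) / temperature for s in speech_unit]
        for t in text_unit
    ]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: item i's true partner sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def contrastive_loss(text_emb, speech_emb, temperature=0.1):
    """Symmetric contrastive loss (assumed form): text-to-speech plus
    speech-to-text, averaged. Matched pairs share the same batch index."""
    logits = cosine_logits(text_emb, speech_emb, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    # Toy 2-D embeddings: each text embedding points the same way as its speech pair.
    text = [[1.0, 0.0], [0.0, 1.0]]
    speech_matched = [[2.0, 0.0], [0.0, 3.0]]
    speech_swapped = [[0.0, 3.0], [2.0, 0.0]]
    print(contrastive_loss(text, speech_matched))
    print(contrastive_loss(text, speech_swapped))
```

In this toy setup the loss is small when each text embedding aligns with its own speech-style embedding and large when the pairs are swapped, which is the property that pulls the text-derived style space toward the speech-derived one during training.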