ISCA Archive Interspeech 2025

Generating Consistent Prosodic Patterns from Open-Source TTS Systems

Ha Eun Shim, Olivia Yung, Paige Tuttösí, Boey Kwan, Angelica Lim, Yue Wang, H. Henny Yeung

Text-to-Speech (TTS) systems now closely approximate human speech prosody. Yet current deep learning-based TTS systems may struggle to accurately represent some prosodic patterns, such as the phrase boundaries used to signal syntactic distinctions. Such prosodic parsing can reflect differences in meaning; inconsistencies in synthesis can therefore lead to miscommunication. In this study, we conduct a qualitative assessment of five open-source TTS systems and show that, when given punctuation contrasts, they fail to produce acoustic signals that accurately convey distinct prosodic boundaries (Study 1). To address this gap, we propose a pipeline for improving output using a customized dataset (Study 2), which successfully generates predictable acoustic cues, but only in certain cases. These results suggest that TTS systems require additional training to effectively capture such prosodic subtleties. We conclude by discussing how TTS systems can better generate fine-grained prosodic distinctions.