ISCA Archive Interspeech 2024

PL-TTS: A Generalizable Prompt-based Diffusion TTS Augmented by Large Language Model

Shuhua Li, Qirong Mao, Jiatong Shi

With the increasing demand for style-controlled speech synthesis, traditional TTS methods that control acoustic features show significant limitations. Consequently, using textual style descriptions to achieve style-controlled TTS has become an active research topic. However, existing methods often perform poorly on unseen style descriptions, and they overlook the fact that adding various style conditions to the model can degrade the training performance of the original model. To address these issues, we propose PL-TTS, an enhanced diffusion-based TTS system combined with prompts embedded by a large language model. To improve synthesis quality and style control, we propose an enhanced diffusion-based framework and a method for fine-tuning large language models. Experimental results on LibriTTS-R validate the effectiveness of PL-TTS in fine-grained style control and generalization.