ISCA Archive SSW 2023

Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling

Tuomo Raitio, Javier Latorre, Andrea Davis, Tuuli Morrill, Ladan Golipour

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the overall TTS quality, 2) the proposed MSMS approach outperforms the pre-training and fine-tuning approach when utilizing additional multi-speaker data, and 3) long-form speaking style is highly rated regardless of the target text domain.