ISCA Archive Interspeech 2022

Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS

Min-Kyung Kim, Joon-Hyuk Chang

This study presents a method for improving the performance of text-to-speech (TTS) models by using three global speech-style representations: language, speaker, and prosody. With these representations, speech can be synthesized in a given speaker's voice in a different language and with a different prosody, regardless of the language and prosody of that speaker's own recordings. To make the embedding of each representation on which the TTS model is conditioned independent of the other representations, we propose an adversarial training method applicable to the general architecture of TTS models. Furthermore, we introduce a sequential training method that includes rehearsal-based continual learning, so that the model can learn from complex data available only in small amounts without forgetting previously learned information. The experimental results show that the proposed method can generate good-quality speech and yield high speaker and prosody similarity, even for representations that are not present in that speaker's data in the dataset.
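The abstract does not describe how the rehearsal step is implemented. As a rough illustration of the general idea of rehearsal-based continual learning, the sketch below mixes samples from the current training stage with samples replayed from an earlier stage in every batch. The function name `make_rehearsal_batches`, the `replay_ratio` parameter, and the tuple-based data layout are all hypothetical and are not taken from the paper.

```python
import random


def make_rehearsal_batches(new_data, replay_buffer, batch_size, replay_ratio, seed=0):
    """Build batches that mix new-stage samples with replayed earlier samples.

    Hypothetical helper for illustration only; the paper's actual
    sampling scheme is not specified in the abstract.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    rng = random.Random(seed)
    replay_buffer = list(replay_buffer)

    # Number of replayed (old) and fresh (new) samples per batch.
    n_replay = min(int(batch_size * replay_ratio), len(replay_buffer))
    n_new = batch_size - n_replay

    # Shuffle the new-stage data once, then walk through it in chunks.
    shuffled = list(new_data)
    rng.shuffle(shuffled)

    batches = []
    for start in range(0, len(shuffled), n_new):
        fresh = shuffled[start:start + n_new]
        # Draw old samples without replacement within a batch.
        replayed = rng.sample(replay_buffer, n_replay)
        batches.append(fresh + replayed)
    return batches


# Toy usage: stage 1 is previously learned data, stage 2 is a small new set.
stage1 = [("stage1", i) for i in range(10)]
stage2 = [("stage2", i) for i in range(6)]
batches = make_rehearsal_batches(stage2, stage1, batch_size=4, replay_ratio=0.5)
```

Each batch here carries a fixed share of earlier-stage samples, so a model trained on these batches keeps seeing old data while it learns from the small new set. The ratio is a tunable assumption rather than a value reported by the authors.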