ISCA Archive Interspeech 2022

Advanced Speaker Embedding with Predictive Variance of Gaussian Distribution for Speaker Adaptation in TTS

Jaeuk Lee, Joon-Hyuk Chang

Speaker adaptation in text-to-speech (TTS) has three goals: high-quality audio, adaptation to a new speaker from only a small amount of data, and fine-tuning of only a few parameters, which keeps storage costs low in commercial custom-voice services. In this paper, we introduce a novel adaptation method that addresses all three goals. First, we estimate variances from a speaker embedding and add them back to the speaker embedding. This operation widens the distribution of each speaker in the latent space. Moreover, we design a prediction model that generates a speaker embedding approximating the new speaker's timbre. By combining this prediction model with a search process for the starting point of fine-tuning, we obtain a speaker embedding that represents the new speaker's timbre well. We also examine how performance changes with the number of fine-tuned parameters. Finally, we evaluate the proposed method using the mean opinion score (MOS) to demonstrate its effectiveness.