ISCA Archive Interspeech 2024

Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent Disentanglement

Jing Wu, Ting Chen, Minchuan Chen, Wei Hu, Shaojun Wang, Jing Xiao

Code-switching and accent control are particularly valuable in multilingual text-to-speech (TTS) systems, as both improve authenticity and comprehensibility. However, two problems remain unsolved: seamlessly integrating multiple languages within a single utterance, and thoroughly disentangling different attributes without bilingual data. To address these problems, this paper proposes a computation-efficient model. First, a Mixture of Language Experts (MoLE) module serves as the encoder, extracting language-specific features and fusing intra-utterance semantic information. Second, embedding methods, together with several regularization strategies and speaker consistency constraints, ensure that the generated speech matches the desired accent. Experiments show that the proposed model improves on the baseline model in fluency and naturalness for code-switching, accent-controllable multilingual TTS.
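To make the idea of language-specific experts concrete, the following is a minimal sketch of one possible reading of such an encoder; it is not the architecture of the proposed model. It assumes hard routing of each token to one expert by a per-token language tag, a single linear map per expert, and fusion by adding the utterance-level mean to every token. The names `MixtureOfLanguageExperts`, `make_linear`, `apply_linear`, and `encode` are hypothetical, and the weights are random rather than trained.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Build a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class MixtureOfLanguageExperts:
    """Illustrative encoder with one linear expert per language (assumption)."""

    def __init__(self, languages, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One independent expert per language tag.
        self.experts = {lang: make_linear(dim, dim, rng) for lang in languages}

    def encode(self, token_vecs, token_langs):
        if len(token_vecs) != len(token_langs):
            raise ValueError("each token needs exactly one language tag")
        if not token_vecs:
            return []
        # Hard routing: each token is processed only by its language's expert.
        routed = [apply_linear(self.experts[lang], vec)
                  for vec, lang in zip(token_vecs, token_langs)]
        # Simple fusion (assumption): add the utterance-level mean so every
        # token sees context from the other language(s) in the utterance.
        n = len(routed)
        mean = [sum(v[d] for v in routed) / n for d in range(self.dim)]
        return [[x + m for x, m in zip(v, mean)] for v in routed]


# Usage: a code-switched utterance with English and Mandarin tokens.
encoder = MixtureOfLanguageExperts(["en", "zh"], dim=4)
same_vec = [0.5, -0.2, 0.1, 0.9]
outputs = encoder.encode([same_vec, same_vec, [0.0, 1.0, 0.0, 1.0]],
                         ["en", "zh", "en"])
```

In this sketch the first two tokens share an input vector but carry different language tags, so they pass through different experts and receive different encodings. A trained system would more likely use learned gating and attention-based fusion; the hard routing and mean fusion here are chosen only to keep the example short.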