ISCA Archive Interspeech 2024

Bilingual and Code-switching TTS Enhanced with Denoising Diffusion Model and GAN

Huai-Zhe Yang, Chia-Ping Chen, Shan-Yun He, Cheng-Ruei Li

In this paper, we propose a Mandarin-English bilingual and code-switching text-to-speech (TTS) system that uses a diffusion model and a generative adversarial network (GAN) to improve the output speech. To address speaker consistency, we employ a feature separation architecture that converts language and speaker IDs into embeddings as input to the encoder. We then employ two adversarial classifiers and two classifiers to separate language and speaker features. We integrate a modified diffusion model and discriminators to improve speech quality and speaker consistency, especially in code-switching scenarios. In terms of mean opinion score (MOS), the proposed TTS system differs only slightly from the ground-truth data on monolingual speech and achieves a MOS of 3.83 on code-switching speech.
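
As a rough illustration of how language and speaker IDs might be converted into embeddings for the encoder input, the following minimal sketch looks up one language embedding per token and one speaker embedding per utterance, then concatenates them with the token embedding. The abstract does not specify these details: the table sizes, the use of concatenation, the per-token language IDs, and all names (`make_table`, `build_encoder_input`, and the three tables) are assumptions made here for illustration, not the authors' implementation.

```python
import random

def make_table(num_ids, dim, seed):
    """Build a hypothetical embedding table: one random vector per ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_ids)]

# Assumed sizes, chosen only for illustration (not taken from the paper).
TOKEN_DIM, LANG_DIM, SPK_DIM = 4, 2, 3
token_table = make_table(num_ids=10, dim=TOKEN_DIM, seed=0)
lang_table = make_table(num_ids=2, dim=LANG_DIM, seed=1)    # assumed: 0 = Mandarin, 1 = English
speaker_table = make_table(num_ids=3, dim=SPK_DIM, seed=2)

def build_encoder_input(token_ids, lang_ids, speaker_id):
    """Concatenate token, per-token language, and utterance-level speaker embeddings."""
    if len(token_ids) != len(lang_ids):
        raise ValueError("token_ids and lang_ids must have the same length")
    spk = speaker_table[speaker_id]
    # List concatenation yields one vector of size TOKEN_DIM + LANG_DIM + SPK_DIM per token.
    return [token_table[t] + lang_table[l] + spk for t, l in zip(token_ids, lang_ids)]

# A code-switching utterance: the language ID changes mid-sentence, the speaker stays fixed.
frames = build_encoder_input(token_ids=[1, 5, 7, 2], lang_ids=[0, 0, 1, 1], speaker_id=2)
```

In this sketch the speaker slice of every frame is identical while the language slice changes at the switch point, which is one simple way to keep speaker identity independent of the language being spoken.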