ISCA Archive Interspeech 2023

STEN-TTS: Improving Zero-shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework

Chung Tran, Chi Mai Luong, Sakriani Sakti

Personalized multilingual tools play an important role in learning aids and virtual assistants. Existing work on multilingual adaptive text-to-speech (TTS) mainly focuses on fine-tuning models or extracting personal styles, such as prosody, emotion, and identity, in order to adapt to new speakers. This paper introduces the Style-Enhanced Normalization TTS (STEN-TTS) approach, which synthesizes multilingual speech while maintaining personal style from only 3 seconds of reference audio. By integrating a style-enhanced normalization module (STEN) into the diffusion model, the proposed method can simulate the speaker's style and eliminate white noise in the synthesized speech. The experimental results show that our model achieves good performance, scoring above 3.5 on SMOS for cross-lingual switching. Furthermore, when speaker verification is used to assess the similarity between the ground-truth and synthesized voices, the accuracy reaches 82.4% with 3 seconds of reference audio.