ISCA Archive Interspeech 2020

Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis

Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang

Most current end-to-end speech synthesis systems assume that the input text is in a single language. However, code-switching, in which speakers switch between languages within the same utterance, occurs frequently in everyday speech. Moreover, building a large mixed-language speech database is difficult and uneconomical. In this paper, both a windowing technique and style token modeling are designed for code-switching end-to-end speech synthesis. To improve the consistency of speaking style in bilingual situations, a dynamic attention reweighting soft windowing mechanism is proposed to ensure smooth transitions during code-switching, in contrast to conventional windowing techniques that use fixed constraints. To compensate for the shortage of mixed-language training data, a language dependent style token is designed for cross-language multi-speaker acoustic modeling, in which both Mandarin and English monolingual data are used as the extended training data set. An attention gating mechanism is proposed to adjust the style token dynamically based on the language and the attended context information. Experimental results show that the proposed methods improve intelligibility, naturalness and similarity.
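
As a rough illustration of what soft windowing over attention weights can look like, the following minimal sketch multiplies one decoder step's attention weights by a Gaussian window and renormalizes them, so that distant positions are down-weighted rather than cut off by a fixed hard limit. It is not the authors' formulation: the abstract gives no equations, so the Gaussian shape, the choice of centering the window on the previous step's expected position, and the names `soft_window_reweight`, `expected_position`, `center` and `width` are all assumptions made for illustration.

```python
import math


def expected_position(weights):
    """Return the attention-weighted mean index of the input positions."""
    return sum(i * w for i, w in enumerate(weights))


def soft_window_reweight(weights, center, width):
    """Reweight attention with a Gaussian window, then renormalize.

    Assumption: a Gaussian window is used here only as an example of a
    soft constraint; the paper's actual window is not given in the abstract.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    windowed = [
        w * math.exp(-0.5 * ((i - center) / width) ** 2)
        for i, w in enumerate(weights)
    ]
    total = sum(windowed)
    if total == 0.0:
        # Nothing survives the window; fall back to the raw weights.
        return list(weights)
    return [w / total for w in windowed]


# Hypothetical attention weights over six input tokens.
prev = [0.0, 0.1, 0.8, 0.1, 0.0, 0.0]      # previous decoder step
raw = [0.3, 0.05, 0.25, 0.3, 0.05, 0.05]   # current step, before windowing

# Assumption: the window follows the previous step's focus instead of
# staying at a fixed offset.
center = expected_position(prev)
smoothed = soft_window_reweight(raw, center, width=1.0)
```

In this toy example the spurious mass on the first token is reduced and the mass near the previous focus grows, while the weights still sum to one. A smaller `width` tightens the constraint and a larger one approaches the unconstrained attention.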