ISCA Archive Interspeech 2020

Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space

Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

We present a method for improving the performance of cross-lingual text-to-speech synthesis. Previous methods can model speaker individuality in a speaker space via a speaker encoder, but their performance degrades when synthesizing cross-lingual speech. This is because the speaker space formed by all speaker embeddings is completely language-dependent. To construct a language-independent speaker space, we regard cross-lingual speech synthesis as a domain adaptation problem and propose a training method that lets the speaker encoder map speaker embeddings of different languages into the same space. Furthermore, to improve speaker individuality and construct a human-interpretable speaker space, we propose a regression method that constructs a perceptually correlated speaker space. Experimental results demonstrate that our method not only improves the performance of both cross-lingual and intra-lingual speech synthesis but also finds perceptually similar speakers across languages.
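As a rough illustration of the two ideas in the abstract, the sketch below scores a set of speaker embeddings in two ways. It is not the authors' implementation: the abstract does not specify the adaptation objective or the regression target, so both functions are assumptions chosen for simplicity. The hypothetical `language_gap` uses the squared distance between per-language centroids as a stand-in for a domain adaptation criterion (a language-independent space would drive it toward zero), and the hypothetical `similarity_regression_loss` regresses the cosine similarity of each speaker pair onto a perceptual similarity score, assumed here to lie on the same scale as cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def language_gap(embeddings, languages):
    """Sum of squared distances between per-language embedding centroids.

    Assumed stand-in for a domain adaptation objective: the value is zero
    when every language has the same mean speaker embedding.
    """
    groups = {}
    for vec, lang in zip(embeddings, languages):
        groups.setdefault(lang, []).append(vec)
    centroids = [
        [sum(col) / len(vecs) for col in zip(*vecs)]
        for vecs in groups.values()
    ]
    gap = 0.0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            gap += sum((a - b) ** 2 for a, b in zip(centroids[i], centroids[j]))
    return gap


def similarity_regression_loss(embeddings, perceptual_scores):
    """Mean squared error between embedding similarity and perceptual scores.

    perceptual_scores[i][j] is an assumed listener-rated similarity between
    speakers i and j; only pairs with i < j are used.
    """
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = cosine(embeddings[i], embeddings[j]) - perceptual_scores[i][j]
            total += diff * diff
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Toy data: three speakers, two languages (labels are illustrative only).
    emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    langs = ["ja", "ja", "en"]
    scores = [[1.0, 0.1, 0.7], [0.1, 1.0, 0.6], [0.7, 0.6, 1.0]]
    print("language gap:", language_gap(emb, langs))
    print("similarity loss:", similarity_regression_loss(emb, scores))
```

In a training setup, quantities of this kind would be added to the speaker encoder's loss so that embeddings from different languages overlap while their pairwise similarities track listener judgments; how the terms are weighted against each other is left open here.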