ISCA Archive Interspeech 2020

Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis

Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis. General text-to-speech synthesis frameworks for reading-style speech use text-dependent information referred to as context. However, to achieve more human-like speech synthesis, paralinguistic and nonlinguistic features should also be taken into account. We focus on adding contextual features to the input features of DNN-based speech synthesis, using a spontaneous speech corpus with rich tags that include paralinguistic and nonlinguistic features such as prosody, disfluency, and morphological features. Through experimental evaluations, we investigate the effectiveness of these additional contextual factors and show which of them enhance the naturalness of synthesized spontaneous speech. The results can serve as a guide to data collection for speech synthesis.