ISCA Archive SSW 2023

Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, the method utilizes pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
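
As a rough illustration of the pseudo-FP idea described above, the following sketch samples an FP label at each word position and inserts it before that word. It is not the authors' implementation: the function name `insert_pseudo_fps`, the `NO_FP` label, the English example words, and the hand-written probability tables (which stand in for the output of an FP word prediction model) are all assumptions made for illustration.

```python
import random

NO_FP = ""  # hypothetical label meaning "no filled pause at this position"


def insert_pseudo_fps(words, fp_dists, rng):
    """Sample one FP label per word position and insert it before that word.

    words: list of linguistic (non-FP) tokens.
    fp_dists: one dict per word, mapping an FP label (or NO_FP) to a
        probability; a stand-in for an FP word prediction model's output.
    rng: a random.Random instance, so sampling is reproducible.
    Returns (tokens, is_fp), where is_fp marks the inserted pseudo-FPs.
    """
    if len(words) != len(fp_dists):
        raise ValueError("need one FP distribution per word")
    tokens, is_fp = [], []
    for word, dist in zip(words, fp_dists):
        labels = list(dist)
        weights = [dist[label] for label in labels]
        sampled = rng.choices(labels, weights=weights, k=1)[0]
        if sampled != NO_FP:
            tokens.append(sampled)  # pseudo-FP goes before the word
            is_fp.append(True)
        tokens.append(word)
        is_fp.append(False)
    return tokens, is_fp


# Toy example with made-up words and probabilities (assumptions).
words = ["I", "went", "home"]
fp_dists = [
    {NO_FP: 1.0},            # never insert an FP before "I"
    {"um": 1.0},             # always insert "um" before "went"
    {NO_FP: 0.5, "uh": 0.5}  # insert "uh" before "home" half the time
]
tokens, is_fp = insert_pseudo_fps(words, fp_dists, random.Random(0))

# The mask recovers the linguistic (non-FP) elements unchanged.
linguistic = [t for t, fp in zip(tokens, is_fp) if not fp]
```

The `is_fp` mask separates inserted FPs from the linguistic elements. A training setup along the lines of the abstract could plausibly use such a mask to apply regularization only to the non-FP positions, although the abstract does not specify how this is done.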