ISCA Archive Interspeech 2022

Production characteristics of obstruents in WaveNet and older TTS systems

Ayushi Pandey, Sébastien Le Maguer, Julie Carson-Berndsen, Naomi Harte

Segmental properties of Text-to-Speech (TTS) synthesizers have been studied for their influence on various perceived attributes of synthesized speech. However, they have received very limited attention for modern, neural vocoder-based TTS. In this paper, we compare segmental properties of WaveNet vocoder voices with those of a natural voice and of the best-performing non-neural synthesizers of the 2013 Blizzard Challenge. We extended the 2013 dataset with two new voices generated using a WaveNet vocoder. Acoustic-phonetic features of obstruent consonants and their neighbouring vowels were compared between the natural voice and each of these TTS systems. Statistical analysis was conducted using the Kruskal-Wallis test and Dunn's test. Compared to the reference natural voice, we find that the WaveNet vocoder performs very well in modelling vowels, but features like F0 at onset and spectral tilt show significant deviations from the natural voice. Among consonants, neural voices deviate most from natural in the context of voiceless fricatives. Several features (such as vowel dispersions and consonant duration) that had shown strong deviations from natural in the other TTS systems were found not to differ from natural in the WaveNet vocoder systems.
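As an illustration of the statistical procedure named above, the following is a minimal sketch of a Kruskal-Wallis test with pairwise Dunn comparisons, written in plain Python. It is not the authors' analysis code. The function names (`rank_with_ties`, `kruskal_dunn`), the group labels, the toy measurements, and the Bonferroni adjustment are all assumptions made for illustration; the abstract does not state which multiple-comparison correction, if any, was used.

```python
import math
from itertools import combinations


def rank_with_ties(values):
    """Return 1-based average ranks and the size of each tie group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j over all observations equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_dunn(groups):
    """groups: dict mapping a (hypothetical) system name to its measurements.

    Returns (H, pairwise), where H is the tie-corrected Kruskal-Wallis
    statistic and pairwise maps (a, b) to (z, Bonferroni-adjusted p).
    """
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n_total = len(pooled)
    ranks, tie_sizes = rank_with_ties(pooled)
    tie_term = sum(t ** 3 - t for t in tie_sizes)

    sizes, mean_rank = {}, {}
    rank_sum_term = 0.0
    start = 0
    for name in names:
        size = len(groups[name])
        group_ranks = ranks[start:start + size]
        start += size
        sizes[name] = size
        mean_rank[name] = sum(group_ranks) / size
        rank_sum_term += sum(group_ranks) ** 2 / size

    # Kruskal-Wallis H with the standard correction for ties.
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    h /= 1 - tie_term / (n_total ** 3 - n_total)

    # Dunn's test: z statistic on mean-rank differences, two-sided p-value
    # from the standard normal, adjusted here by Bonferroni (an assumption).
    pairs = list(combinations(names, 2))
    base_var = n_total * (n_total + 1) / 12.0 - tie_term / (12.0 * (n_total - 1))
    pairwise = {}
    for a, b in pairs:
        se = math.sqrt(base_var * (1 / sizes[a] + 1 / sizes[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = math.erfc(abs(z) / math.sqrt(2))
        pairwise[(a, b)] = (z, min(1.0, p * len(pairs)))
    return h, pairwise


if __name__ == "__main__":
    # Toy values only; these are not measurements from the paper.
    toy = {"natural": [1, 2, 3], "system_a": [4, 5, 6], "system_b": [7, 8, 9]}
    h_stat, results = kruskal_dunn(toy)
    print("H =", h_stat)
    for pair, (z, p_adj) in results.items():
        print(pair, "z =", round(z, 3), "adjusted p =", round(p_adj, 4))
```

The sketch stops at the H statistic for the omnibus test; its p-value would come from a chi-square distribution with (number of groups minus one) degrees of freedom, which a statistics library can supply. In a setting like the one described, each acoustic-phonetic feature would be tested separately, with the natural voice and each TTS system forming the groups.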