ISCA Archive Interspeech 2015

Data-driven foot-based intonation generator for text-to-speech synthesis

Mahsa Sadat Elyasi Langarani, Jan van Santen, Seyed Hamidreza Mohammadi, Alexander Kain

We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.
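The synthesis step described above amounts to a search over stored accent curves, balancing how well a candidate's foot features match the target foot against how much successive accent-curve heights differ. The sketch below is a minimal illustration of one way such a search could be organized; the foot features, the feature-space distance, the squared-height join cost, and all identifiers are assumptions made for clarity, not the authors' implementation, and the fitting of accent and phrase curves is not shown.

```python
# Illustrative sketch only: the exact features, distances, and cost weights
# used by the paper are not given in the abstract; everything below is assumed.

from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Foot:
    """Foot-level features mentioned in the abstract."""
    syl_starts: List[float]   # syllable start times within the foot (s)
    syl_ends: List[float]     # syllable end times within the foot (s)
    position: int             # position of the foot in its phrase
    length: int               # number of syllables in the foot


@dataclass
class StoredAccentCurve:
    """An accent curve fitted during training, indexed by its foot."""
    foot: Foot
    height: float             # fitted accent-curve height
    curve: np.ndarray         # fitted accent-curve samples


def foot_distance(a: Foot, b: Foot) -> float:
    """Hypothetical distance between two feet in the feature space."""
    dur_a = a.syl_ends[-1] - a.syl_starts[0]
    dur_b = b.syl_ends[-1] - b.syl_starts[0]
    return abs(a.position - b.position) + abs(a.length - b.length) + abs(dur_a - dur_b)


def select_accent_curves(targets: List[Foot],
                         store: List[StoredAccentCurve],
                         join_weight: float = 1.0) -> List[StoredAccentCurve]:
    """Choose one stored accent curve per target foot by dynamic programming,
    trading off feature-space match against differences between successive
    accent-curve heights."""
    if not targets:
        return []
    n, m = len(targets), len(store)
    heights = np.array([c.height for c in store])
    target_cost = np.array([[foot_distance(t, c.foot) for c in store]
                            for t in targets])               # n x m match costs
    best = target_cost[0].copy()                             # cumulative cost per candidate
    back = np.zeros((n, m), dtype=int)                       # backpointers
    for i in range(1, n):
        # Join cost: squared difference between successive accent-curve heights.
        join = join_weight * (heights[:, None] - heights[None, :]) ** 2  # prev x cur
        total = best[:, None] + join                                     # prev x cur
        back[i] = np.argmin(total, axis=0)
        best = total[back[i], np.arange(m)] + target_cost[i]
    # Trace back the lowest-cost sequence of stored accent curves.
    path = [int(np.argmin(best))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    path.reverse()
    # In a superpositional model, the final F0 contour would add the selected,
    # time-aligned accent curves to a phrase curve (not shown here).
    return [store[j] for j in path]
```

The join term mirrors the abstract's constraint of "minimizing differences between successive accent curve heights", while the per-foot match term stands in for the feature-space search over start and end times of syllables, foot position, and foot length.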