ISCA Archive SpeechProsody 2004

Synthesis by recombination of segmental and prosodic information

Jan P. H. van Santen, Alexander Kain, Esther Klabbers

Generating meaningful and natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge is how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In corpus-based synthesis, the challenge is the sheer size of the speech corpus needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. A new method is proposed, based on the following concepts. The set of phone sequences in a language can be partitioned according to the manner of production of their constituent phonemes. For each sub-class in this partition (e.g., vowel-nasal-unvoiced fricative), a representative sequence is chosen (e.g., [e]-[n]-[s]) and recorded in a wide variety of prosodic contexts. The remaining sequences in this sub-class are recorded in a much smaller number of contexts, potentially only one. The method provides a procedure for generating sequences in prosodic contexts in which they have not been recorded, by transplanting the prosodic contours of sequences in the same sub-class that have been recorded in these contexts. The method uses time-warping algorithms in a superpositional framework.
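
The two core ideas of the abstract, grouping phone sequences by manner of production and transplanting a contour between sequences of the same sub-class by time warping, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tiny `MANNER` table, the function names, the representation of a contour as (time, value) pairs, and the choice of a piecewise-linear warp aligned at phone boundaries. The superpositional decomposition of the contour is not modeled; only the warping step is shown.

```python
from bisect import bisect_right

# Hypothetical, deliberately tiny phone-to-manner table (illustrative only).
MANNER = {
    "e": "vowel", "a": "vowel", "i": "vowel",
    "n": "nasal", "m": "nasal",
    "s": "unvoiced fricative", "f": "unvoiced fricative",
}

def manner_class(phones):
    """Return the sub-class key of a phone sequence as a tuple of manner labels."""
    return tuple(MANNER[p] for p in phones)

def warp_time(t, donor_bounds, target_bounds):
    """Map a time t on the target timeline to the donor timeline.

    Both bounds lists hold the phone boundary times of sequences with the
    same number of phones; the mapping is linear within each phone.
    """
    i = bisect_right(target_bounds, t) - 1
    i = max(0, min(i, len(target_bounds) - 2))  # clamp to a valid segment
    frac = (t - target_bounds[i]) / (target_bounds[i + 1] - target_bounds[i])
    return donor_bounds[i] + frac * (donor_bounds[i + 1] - donor_bounds[i])

def sample(contour, t):
    """Linearly interpolate a contour of sorted (time, value) pairs at time t."""
    times = [p[0] for p in contour]
    if t <= times[0]:
        return contour[0][1]
    if t >= times[-1]:
        return contour[-1][1]
    j = bisect_right(times, t)
    (t0, v0), (t1, v1) = contour[j - 1], contour[j]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def transplant(donor_contour, donor_bounds, target_bounds, step=0.01):
    """Resample the donor contour onto the target timeline at a fixed step."""
    n = int(round((target_bounds[-1] - target_bounds[0]) / step))
    out = []
    for k in range(n + 1):
        t = target_bounds[0] + k * step
        out.append((t, sample(donor_contour, warp_time(t, donor_bounds, target_bounds))))
    return out

# Usage: [e]-[n]-[s] (donor) and [a]-[m]-[f] (target) share a sub-class,
# so the donor's contour may be warped onto the target's phone durations.
donor_phones, target_phones = ["e", "n", "s"], ["a", "m", "f"]
donor_bounds = [0.0, 0.2, 0.3, 0.5]    # made-up boundary times in seconds
target_bounds = [0.0, 0.1, 0.2, 0.3]   # made-up boundary times in seconds
donor_contour = [(0.0, 100.0), (0.2, 120.0), (0.5, 90.0)]  # made-up values in Hz

if manner_class(donor_phones) == manner_class(target_phones):
    warped = transplant(donor_contour, donor_bounds, target_bounds)
```

Aligning the warp at phone boundaries keeps contour events attached to the same segments in the target as in the donor, which is one plausible reason to restrict transplantation to sequences within a single sub-class.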