Frame-by-frame representation is ill-suited to prosodic features, which are tightly related to speech units spanning long time intervals, such as words and phrases. This causes an inherent problem in fundamental frequency (F0) contour generation by HMM-based speech synthesis. Our previously developed method, which modifies generated F0 contours in the framework of the generation process model, is improved to allow multiple phrase components in a breath group. Since the model's commands can be clearly related to linguistic (and para-/non-linguistic) information, the method further enables flexible control of prosody through manipulation of the model commands. Prosodic focus is realized in HMM-based speech synthesis as a supplemental process based on the differences in command magnitudes/amplitudes between utterances with and without focus. The validity of the method was confirmed by listening experiments on synthetic speech.
Index Terms: fundamental frequency contour, generation process model, HMM-based speech synthesis, prosodic focus