In an HMM-based text-to-speech (TTS) system, contextual features, including phonetic and prosodic factors, have a significant influence on the spectrum, F0 and duration of the synthetic voice. This paper proposes prosodic features aimed at improving the naturalness of an HMM-based TTS system (VTed) for a tonal language, Vietnamese. ToBI (Tones and Break Indices) features are used to learn two crucial prosodic cues, namely intonation (boundary tones) and pauses (break indices), together with another set of features. The results of a MOS test showed that the general quality of the synthetic voice is rather good, 1.21 points lower than that of the natural voice. In about 55% of cases, the voice trained with the ToBI boundary tone feature was perceived as similar to the voice trained without this feature, while a 10% difference in favour of the voice trained without this ToBI feature was observed. This may be linked to the F0 contour being lowered or raised regardless of lexical tones, which caused two main problems in the synthetic voice: discontinuity in spectrum and F0, and unexpected voice quality. This paper therefore concludes that much more work is needed on intonation modeling that takes the Vietnamese tones into account. A new prosody model could be designed, possibly building on the ToBI model, with respect to the lexical tones and the syntactic structure of Vietnamese.
Index Terms: Text-to-speech (TTS), speech synthesis, tonal language, Vietnamese, HMM-based speech synthesis, intonation, ToBI