Emphasis is an important form of expressiveness in speech. Hidden Markov model (HMM) based synthesis has shown great flexibility in generating expressive speech. This paper proposes a hierarchical HMM-based model that aims to synthesize emphatic speech with both high emphasis quality and high naturalness from limited data. A decision tree (DT) is first constructed with non-emphasis questions using both neutral and emphasis corpora. The data in each leaf of the DT are then classified into 6 emphasis categories according to emphasis-related questions. Data of the same emphasis category are grouped into one sub-node and used to train one HMM. Because some leaves of the DT may contain no data for a specific emphasis category, a cost-based method is proposed to select a suitable HMM, trained on the data of another sub-node in the same leaf, for predicting parameters. Furthermore, a compensation model is proposed to adjust the predicted parameters. Experiments show that the proposed hierarchical model can synthesize emphatic speech of high quality in terms of both naturalness and emphasis, using a limited amount of training data.
Index Terms: emphatic speech synthesis, hidden Markov model (HMM), hierarchy, compensation model
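
The leaf and sub-node lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the cost calculation, so `category_cost` (absolute distance between category indices) is purely an assumption, as are the names `Leaf`, `select_model`, and the string placeholders standing in for trained HMMs. The compensation step is not modeled; the sketch only reports whether a fallback model was chosen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

NUM_CATEGORIES = 6  # the abstract states 6 emphasis categories


@dataclass
class Leaf:
    """One DT leaf (built from non-emphasis questions).

    sub_nodes maps an emphasis category index (0..5) to the model trained
    on that sub-node. A string stands in for a trained HMM (hypothetical).
    """
    sub_nodes: Dict[int, str] = field(default_factory=dict)


def category_cost(target: int, candidate: int) -> float:
    """Assumed cost: index distance between categories (not from the paper)."""
    return float(abs(target - candidate))


def select_model(
    leaf: Leaf,
    target: int,
    cost: Callable[[int, int], float] = category_cost,
) -> Tuple[int, str, bool]:
    """Return (category used, model, used_fallback) for a target category."""
    if not 0 <= target < NUM_CATEGORIES:
        raise ValueError(f"category must be in 0..{NUM_CATEGORIES - 1}")
    if target in leaf.sub_nodes:
        # Data for this category existed in the leaf: use its own model.
        return target, leaf.sub_nodes[target], False
    if not leaf.sub_nodes:
        raise LookupError("leaf has no trained sub-node models")
    # No data for the target category: pick the lowest-cost sub-node in the
    # same leaf, breaking ties by the smaller category index.
    best = min(leaf.sub_nodes, key=lambda c: (cost(target, c), c))
    return best, leaf.sub_nodes[best], True
```

For example, with `leaf = Leaf({0: "hmm_0", 3: "hmm_3"})`, a request for category 3 returns its own model, while a request for category 4 falls back to the model of category 3 under the assumed cost. In the proposed method, parameters predicted from such a fallback model would then be adjusted by the compensation model.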