A tone generation method that maximizes the joint likelihood of syllabic HMMs is proposed to improve Mandarin speech synthesis. The F0 sequence is generated by jointly maximizing the likelihoods of the state-level F0 model and the syllable-level tone model under the constraint of the mean F0 of adjacent units. The optimal weight of the tone component is searched for in terms of the parameter generation error and the correlation coefficient. Objective and subjective evaluations both demonstrate the positive effect of this method: the generation error is reduced by 26.7%, the correlation coefficient is increased by 6.5%, and prosody perception is significantly improved.
Index Terms: speech synthesis, F0 contour, tone generation, speech prosody