In parametric text-to-speech synthesis based on hidden Markov models (HMMs), modelling the fundamental frequency (F0) is important because it directly affects the prosody of synthetic speech. F0 is typically modelled with a multi-space distribution (MSD), which uses a discrete distribution for unvoiced speech and a continuous distribution for voiced speech. However, F0 modelling with the MSD-HMM is not accurate around voiced-to-unvoiced (V-UV) and unvoiced-to-voiced (UV-V) transitions, and it is affected by voicing decision errors of the F0 estimation algorithm. To reduce this problem, HMM-based speech synthesisers have been proposed that model F0 using continuous HMMs. This approach usually obtains continuous F0 contours by interpolating F0 in unvoiced regions. The problem with this method is that it is also affected by voicing decision errors during speech analysis. For example, if voiced speech segments are incorrectly classified as unvoiced, the F0 contour in those regions is obtained by interpolation, which may be a poor estimate of the natural F0. This paper proposes using an F0 estimation method that does not require a hard voiced/unvoiced classification and produces a reasonably smooth F0 contour. The robustness of this method was studied under two conditions: high-quality recorded speech and recorded speech with additive noise. The motivation for using noisy speech was to study the effect of voicing decision errors on the quality of the synthetic speech.
Index Terms: continuous F0 modelling, voicing strength, HMM-based speech synthesis
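
The interpolation-based approach described in the abstract, and its sensitivity to voicing decision errors, can be illustrated with a minimal sketch. It assumes linear interpolation between neighbouring voiced frames, which is only one possible choice; the abstract does not specify the interpolation scheme. The function name `interpolate_unvoiced_f0` and the F0 values are hypothetical and chosen purely for illustration.

```python
import numpy as np


def interpolate_unvoiced_f0(f0, voiced):
    """Fill unvoiced frames by linear interpolation between voiced frames.

    Assumption: linear interpolation; other schemes (e.g. splines) are possible.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    if not voiced.any():
        raise ValueError("at least one voiced frame is needed to interpolate")
    frames = np.arange(len(f0))
    # np.interp holds the first/last voiced value constant at the edges.
    return np.interp(frames, frames[voiced], f0[voiced])


# Hypothetical natural F0 contour (Hz) with a pitch peak at frame 3.
natural_f0 = np.array([100.0, 110.0, 120.0, 150.0, 120.0, 110.0, 100.0])

# Suppose the analysis wrongly labels the voiced frames 2 to 4 as unvoiced.
voiced = np.array([True, True, False, False, False, True, True])
observed_f0 = np.where(voiced, natural_f0, 0.0)

continuous_f0 = interpolate_unvoiced_f0(observed_f0, voiced)

# Deviation of the interpolated contour from the natural one, per frame.
error_hz = np.abs(continuous_f0 - natural_f0)
```

In this toy example the interpolated contour is flat across the misclassified frames, so the pitch peak at frame 3 is lost. This is the kind of poor estimate that an F0 estimator without a hard voiced/unvoiced decision is intended to avoid.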