ISCA Archive ICSLP 2002

Acoustic measures vs. phonetic features as predictors of audible discontinuity in concatenative speech synthesis

Hisashi Kawai, Minoru Tsuzaki

Most concatenative speech synthesizers employ both acoustic measures and phonetic features to predict the perceptual damage caused by concatenating two waveform segments, because no reliable acoustic measure has been found so far. This paper compares the predictive ability of the two kinds of predictor variables. First, we conduct a perceptual experiment to measure the naturalness degradation due to the signal discontinuity introduced by concatenating waveform segments. Second, we predict the naturalness degradation score from acoustic measures derived from MFCC and/or from phonetic features, using statistical models such as a multiple regression model. Based on an investigation of the multiple regression coefficients, we found that (1) the phonetic features are more effective and that (2) the acoustic measures do not provide useful information in addition to the phonetic features.
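The kind of comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean MFCC distance at the join, the dummy coding of phonetic classes, the use of R^2 as the comparison criterion, and all function names (mfcc_distance, one_hot, r_squared, compare_predictors) are assumptions made for illustration only.

```python
import numpy as np

def mfcc_distance(left_frame, right_frame):
    """Euclidean distance between the MFCC vectors on either side of a join.
    (Assumed acoustic measure; the abstract does not specify which is used.)"""
    return float(np.linalg.norm(np.asarray(left_frame) - np.asarray(right_frame)))

def one_hot(labels, n_classes):
    """Dummy-code integer phonetic class labels, dropping class 0 as the baseline."""
    labels = np.asarray(labels)
    return np.eye(n_classes)[labels][:, 1:]

def r_squared(X, y):
    """Fit a multiple regression (ordinary least squares with intercept); return R^2."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def compare_predictors(acoustic, phonetic, scores):
    """R^2 for acoustic-only, phonetic-only, and combined predictor sets."""
    return {
        "acoustic": r_squared(acoustic, scores),
        "phonetic": r_squared(phonetic, scores),
        "both": r_squared(np.column_stack([acoustic, phonetic]), scores),
    }
```

If the combined model's R^2 is barely higher than that of the phonetic-only model, the acoustic measures add little information beyond the phonetic features, which is the pattern the abstract reports. In-sample R^2 never decreases when predictors are added, so a careful comparison would use held-out data or an adjusted criterion.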