ISCA Archive Interspeech 2015

Double-ended prediction of the naturalness ratings of the Blizzard Challenge 2008-2013

Lukas Latacz, Werner Verhelst

In this paper we describe a double-ended (i.e. reference-based or intrusive) approach to objective quality estimation of synthetic speech that uses a linear regression model whose parameters can easily be interpreted. The model was trained and evaluated on English data from the 2008 to 2013 Blizzard Challenges (BC) [1], which is the largest publicly available resource of listener-evaluated synthetic speech. To our knowledge, this is the first attempt to train and evaluate a speech quality predictor on the whole data set. Predicting the naturalness of the different participating systems in the BC is not an easy task because some of the systems are quite close in quality. In a leave-one-system-out evaluation, our best results correspond to Pearson correlation coefficients of 0.60 at the sentence level and 0.84 at the system level. On this data, these results outperformed by far the ITU-T standard PESQ [2] for double-ended speech quality evaluation.
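The evaluation protocol summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: the two input features, the synthetic ratings, and all function names (`leave_one_system_out`, `system_level_means`, `make_synthetic_data`) are assumptions made for illustration only. The sketch fits an ordinary least-squares linear regression on all systems but one, predicts the sentences of the held-out system, and then reports the Pearson correlation both per sentence and per system (after averaging over each system's sentences).

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def leave_one_system_out(features, ratings, system_ids):
    """Predict each system's sentences with a model fit on all other systems."""
    features = np.asarray(features, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    system_ids = np.asarray(system_ids)
    # Prepend a constant column so the regression has an intercept.
    design = np.column_stack([np.ones(len(ratings)), features])
    predictions = np.empty_like(ratings)
    for held_out in np.unique(system_ids):
        test = system_ids == held_out
        # Ordinary least squares on the remaining systems.
        weights, *_ = np.linalg.lstsq(design[~test], ratings[~test], rcond=None)
        predictions[test] = design[test] @ weights
    return predictions


def system_level_means(values, system_ids):
    """Average sentence-level values over each system."""
    values = np.asarray(values, dtype=float)
    system_ids = np.asarray(system_ids)
    return np.array([values[system_ids == s].mean() for s in np.unique(system_ids)])


def make_synthetic_data(n_systems=12, n_sentences=30, seed=0):
    """Hypothetical data: two distance-like features and noisy listener ratings."""
    rng = np.random.default_rng(seed)
    system_ids = np.repeat(np.arange(n_systems), n_sentences)
    quality = rng.uniform(1.5, 4.5, n_systems)[system_ids]
    n = len(system_ids)
    features = np.column_stack([
        -quality + rng.normal(0.0, 0.8, n),        # assumed feature 1
        -0.5 * quality + rng.normal(0.0, 0.8, n),  # assumed feature 2
    ])
    ratings = quality + rng.normal(0.0, 0.7, n)
    return features, ratings, system_ids


features, ratings, system_ids = make_synthetic_data()
predicted = leave_one_system_out(features, ratings, system_ids)

r_sentence = pearson(predicted, ratings)
r_system = pearson(system_level_means(predicted, system_ids),
                   system_level_means(ratings, system_ids))
print(f"sentence-level r = {r_sentence:.2f}, system-level r = {r_system:.2f}")
```

Holding out whole systems, rather than random sentences, keeps every sentence of the test system unseen during training, so the reported correlations reflect how the model generalises to a new synthesiser.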