This paper presents an evaluation protocol for the subjective assessment of text-to-speech in audiobook reading tasks. We developed a questionnaire with 11 scales and tested it on TTS data from 4 different synthetic voices, plus one optimized version.
A MANOVA on the data gathered with the questionnaire showed that the text type has a significant influence on 7 of the 11 scales, whereas the level of familiarity has no influence on the ratings.
A subsequent Principal Axis Factor (PAF) analysis with Promax rotation yielded 2 underlying dimensions. The first represents the listening pleasure achieved by the tested systems. The second comprises scales that evaluate the prosody of the synthesized speech signal.
Based on the analysis of the results, we propose slight modifications to the developed questionnaire.