Listening tests and Mean Opinion Scores (MOS) are the most commonly
used techniques for the evaluation of speech synthesis quality and
naturalness. These are invaluable in the assessment of subjective qualities
of machine-generated stimuli. However, there are a number of challenges
in understanding the MOS scores that come out of listening tests.
First, we advocate for the use of non-parametric statistical
tests when assessing the statistical significance of differences
between listening test results. Additionally, based on the
results of 46 legacy listening tests, we measure the impact of two
sources of bias: individual participants and synthesized text. Both
can have a dramatic impact on observed MOS scores. For example, we
find that, on average, the difference between the highest- and
lowest-scoring raters is over 2 MOS points (on a 5-point scale). From this
observation, we caution against using any statistical test without
adjusting for this bias, and provide specific non-parametric recommendations.
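
As a minimal sketch of the kind of analysis described above, the following Python example computes the spread between the highest- and lowest-scoring raters and then compares two systems with a paired, non-parametric test. Everything in it is an assumption made for illustration: the toy ratings, the system labels "A" and "B", and the helper names are hypothetical, and the exact sign-flip permutation test is only one possible non-parametric choice, not necessarily the specific test recommended here. Pairing scores within each rater is one simple way to remove a rater's overall offset before testing.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Hypothetical toy data: (rater_id, system, score on a 1-5 scale).
TOY_RATINGS = [
    ("r1", "A", 5), ("r1", "B", 4),
    ("r2", "A", 4), ("r2", "B", 3),
    ("r3", "A", 3), ("r3", "B", 2),
    ("r4", "A", 2), ("r4", "B", 1),
    ("r5", "A", 5), ("r5", "B", 3),
    ("r6", "A", 3), ("r6", "B", 2),
]


def rater_range(ratings):
    """Difference between the highest and lowest per-rater mean score."""
    by_rater = defaultdict(list)
    for rater, _system, score in ratings:
        by_rater[rater].append(score)
    means = [mean(scores) for scores in by_rater.values()]
    return max(means) - min(means)


def per_rater_diffs(ratings, sys_a, sys_b):
    """Per-rater mean(sys_a) - mean(sys_b).

    Differencing within a rater cancels that rater's overall offset,
    so a harsh rater and a lenient rater contribute comparably.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for rater, system, score in ratings:
        scores[rater][system].append(score)
    return [
        mean(s[sys_a]) - mean(s[sys_b])
        for s in scores.values()
        if sys_a in s and sys_b in s  # keep only raters who scored both systems
    ]


def paired_sign_flip_p(diffs):
    """Exact two-sided sign-flip permutation p-value for paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for a
    small number of raters.
    """
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * d for sign, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


print("rater range:", rater_range(TOY_RATINGS))
print("p-value:", paired_sign_flip_p(per_rater_diffs(TOY_RATINGS, "A", "B")))
```

In this toy data the raters disagree widely on absolute scores, yet every rater prefers system A. The paired test picks this up because it looks only at within-rater differences.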