ISCA Archive Interspeech 2022

Back to the Future: Extending the Blizzard Challenge 2013

Sébastien Le Maguer, Simon King, Naomi Harte

Nowadays, speech synthesis technology is synonymous with the use of Deep Learning. Understanding how synthesis systems have progressed since the advent of Deep Learning requires open-source speech resources that connect past and present technologies and allow direct comparisons between them. This paper presents such a resource by extending the 2013 edition of the Blizzard Challenge. Using this extension, we compare top-tier systems from the past with modern technologies in a controlled setting. From the 2013 edition, we selected the best representative of each historical synthesis technology, and we added four systems that combine modern acoustic models with neural vocoders. A large-scale subjective evaluation was conducted to assess naturalness. Our results show that, as expected, modern technologies generate more natural synthetic speech. However, these systems are still not perceived to be as natural as the human voice. Crucially, we also observed that the Mean Opinion Score (MOS) of the historical systems dropped by a full MOS point relative to their scores in the original edition. This demonstrates the relative nature of MOS: it should generally not be reported as an absolute value, despite its origin as an absolute category rating.