ISCA Archive Interspeech 2014

Intelligibility analysis of fast synthesized speech

Cassia Valentini-Botinhao, Markus Toman, Michael Pucher, Dietmar Schabus, Junichi Yamagishi

In this paper we analyse the effect of speech corpus and compression method on the intelligibility of synthesized speech at fast rates. We recorded English and German language voice talents at a normal and a fast speaking rate and trained an HSMM-based synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to generated speech. Word recognition results for the English voices show that generating speech at normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, for both blind and sighted participants. These results indicate that using fast speech data does not necessarily create more intelligible voices and that linear compression can more reliably provide higher intelligibility, particularly at higher rates.
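The three compression methods compared in the abstract can be illustrated at the level of state durations. The following is a minimal sketch, not the authors' implementation. It assumes each HSMM state duration is modelled by a Gaussian with a mean and a variance (in frames), and that variance scaling follows the common rule d = mean + rho * variance. The class and function names are hypothetical, and linear compression is represented here only as uniform scaling of durations, a stand-in for time-scale modification applied to the generated waveform.

```python
from dataclasses import dataclass


@dataclass
class StateDuration:
    """Gaussian duration model of one HSMM state (hypothetical container)."""
    mean: float  # mean duration in frames
    var: float   # duration variance in frames^2


def rho_for_total(states, target_total):
    """Scaling factor rho such that the durations sum to target_total.

    Assumes d_k = mean_k + rho * var_k and ignores any minimum-duration floor.
    """
    total_mean = sum(s.mean for s in states)
    total_var = sum(s.var for s in states)
    return (target_total - total_mean) / total_var


def scale_by_variance(states, rho):
    """Variance scaling: a negative rho shortens high-variance states the most."""
    return [s.mean + rho * s.var for s in states]


def interpolate(normal, fast, alpha):
    """Linear interpolation of duration models.

    alpha = 0 gives the normal voice, alpha = 1 the fast voice;
    alpha > 1 extrapolates beyond the fast voice.
    """
    return [
        StateDuration(
            mean=(1 - alpha) * n.mean + alpha * f.mean,
            var=(1 - alpha) * n.var + alpha * f.var,
        )
        for n, f in zip(normal, fast)
    ]


def linear_compress(durations, rate):
    """Uniform compression: every duration is divided by the same rate."""
    return [d / rate for d in durations]


# Toy values, invented for illustration only.
normal = [StateDuration(10.0, 4.0), StateDuration(20.0, 16.0)]
fast = [StateDuration(6.0, 2.0), StateDuration(14.0, 8.0)]

rate = 1.5
target = sum(s.mean for s in normal) / rate  # 30 frames compressed to 20

rho = rho_for_total(normal, target)
variance_scaled = scale_by_variance(normal, rho)
halfway = interpolate(normal, fast, 0.5)
uniform = linear_compress([s.mean for s in normal], rate)
```

With these toy values, variance scaling and uniform compression reach the same total duration but distribute it differently across states: variance scaling removes more frames from the state with the larger variance, whereas uniform compression shortens every state by the same factor.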

doi: 10.21437/Interspeech.2014-197

Cite as: Valentini-Botinhao, C., Toman, M., Pucher, M., Schabus, D., Yamagishi, J. (2014) Intelligibility analysis of fast synthesized speech. Proc. Interspeech 2014, 2922-2926, doi: 10.21437/Interspeech.2014-197

  author={Cassia Valentini-Botinhao and Markus Toman and Michael Pucher and Dietmar Schabus and Junichi Yamagishi},
  title={{Intelligibility analysis of fast synthesized speech}},
  booktitle={Proc. Interspeech 2014},