ISCA Archive Interspeech 2023

Controlling formant frequencies with neural text-to-speech for the manipulation of perceived speaker age

Ziya Khan, Lovisa Wihlborg, Cassia Valentini-Botinhao, Oliver Watts

In this paper, we present a framework for formant-controllable neural text-to-speech. We train a model that predicts formant frequencies, which then condition mel-spectrogram generation. We apply this framework to manipulate perceived speaker age indirectly, by modifying the predicted formants in a way that changes the perceived vocal tract length. Our ultimate goal is to enable control of perceived ageing in children's text-to-speech voices, since ageing in natural child speech is strongly linked to the growth of the child's vocal tract. However, our experiments indicate that the method also provides strong age control for adult speech.
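
The link between formant frequencies and vocal tract length can be illustrated with the textbook approximation of the vocal tract as a uniform tube closed at one end, whose resonances are F_n = (2n - 1)c / (4L). The sketch below is not the method of this paper: the abstract does not state how the predicted formants are modified, so uniform scaling by a single factor is an assumption made here for illustration, and the names `tube_formants`, `scale_formants`, and `alpha` are hypothetical.

```python
import numpy as np

# Approximate speed of sound in warm, humid air, in cm/s (assumed value).
SPEED_OF_SOUND_CM_S = 35000.0


def tube_formants(length_cm, n_formants=3):
    """Resonances of a uniform tube closed at one end.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    n = np.arange(1, n_formants + 1)
    return (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)


def scale_formants(formants_hz, alpha):
    """Multiply all formant values by alpha.

    Under the uniform-tube assumption, alpha > 1 corresponds to a shorter
    vocal tract and alpha < 1 to a longer one.
    """
    return np.asarray(formants_hz, dtype=float) * alpha


# Formants of a 17.5 cm tube, then scaled to mimic a 14 cm tube.
adult_like = tube_formants(17.5)
shorter_tract = scale_formants(adult_like, alpha=17.5 / 14.0)
```

In this idealised model, scaling every formant by the ratio of the two tube lengths gives exactly the formants of the shorter tube. Real vocal tracts are not uniform tubes, so in practice such a factor would only approximate a change in perceived vocal tract length.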