ISCA Archive, ISSP 2024

Articulatory speech synthesis without phones?

Konstantin Sering, Harald Baayen

With this work we show how speech production can be modelled at the word level without any symbolic units: neither on the acoustic side, such as phonemes, nor on the semantic side, such as word types, nor on the motor side, such as gestures or articulatory targets. We present and discuss a computational model of articulatory speech production that transfers a predictive planning approach, known from hand and arm movements, to the articulatory domain. This computational model is named Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE). As the articulatory speech synthesizer, the VocalTractLab synthesizer is used, which simulates the human speech system at a geometrical level with 30 control parameters (channels) and a time resolution of 401 Hertz. As PAULE achieves decent synthesis quality, we conclude that human speech production can be modelled at the word level without the use of any symbolic units such as phones and gestures.
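To make the control-signal format concrete, the following minimal sketch (not from the paper; the array layout and names are illustrative assumptions) represents a word's articulatory plan as a continuous trajectory over the 30 VocalTractLab control parameters sampled at 401 Hertz, with no phone, gesture, or target symbols anywhere in the representation.

```python
import numpy as np

SAMPLING_RATE_HZ = 401  # control-parameter rate stated in the abstract
N_CHANNELS = 30         # VocalTractLab control parameters (channels)

# Hypothetical trajectory for a word of roughly half a second:
# one row per time step, one column per control parameter.
# A planning model like PAULE would predict such a matrix directly,
# rather than concatenating symbolic units.
n_steps = int(0.5 * SAMPLING_RATE_HZ)
trajectory = np.zeros((n_steps, N_CHANNELS))

# The word's duration is implicit in the number of rows.
duration_s = trajectory.shape[0] / SAMPLING_RATE_HZ
print(trajectory.shape, round(duration_s, 3))  # (200, 30) 0.499
```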