ISCA Archive SpeechProsody 2002

Soft input feature selection within neural prosody generation

Caglayan Erdem, Hans Georg Zimmermann

Analyzing and selecting input features is an important problem in machine learning whenever a new system has to be built or an existing system has to be trained for a new task. In a Text-to-Speech (TTS) application, this problem arises when a system is adapted to a new language or a new speaker. In this paper, a parameterized, data-driven weight decay [1] is presented and applied to systematically analyze the phonetic and linguistic input features of a neural network (NN). The NN models an acoustic prosody generation module of our TTS system Papageno. The original NN is extended by an additional preprocessing unit: the input features are propagated through a diagonal matrix to a preprocessing cluster. This diagonal matrix is the only weight matrix to which the weight decay is applied, so its elements describe a weighting of the input features. The application yielded an evaluation of the input parameters and a strong reduction of the input features without loss of performance. Within our F0-contour generation module, the squared error of the NN is reduced by 13%.
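The architecture described above can be illustrated with a minimal sketch. It is not the Papageno implementation. It assumes a single tanh hidden layer, a mean-squared-error loss, and a plain L2 penalty standing in for the parameterized data-driven weight decay of [1], whose exact form is not given here. All names (`forward`, `loss_and_grads`, `train`, `rank_features`) and all hyperparameters are hypothetical.

```python
import numpy as np

def forward(params, X):
    """Diagonal input weighting followed by a small tanh network."""
    d, W1, b1, W2, b2 = params
    Z = X * d                      # same as X @ np.diag(d): one weight per input feature
    H = np.tanh(Z @ W1 + b1)       # hidden layer (assumed architecture)
    y = H @ W2 + b2                # linear output, e.g. one F0 target value
    return Z, H, y

def loss_and_grads(params, X, t, decay):
    """MSE plus an L2 penalty on the diagonal weights d only."""
    d, W1, b1, W2, b2 = params
    n = X.shape[0]
    Z, H, y = forward(params, X)
    err = y - t
    # Assumption: plain L2 decay replaces the decay term of [1].
    loss = np.mean(err ** 2) + decay * np.sum(d ** 2)
    g_y = 2.0 * err / n
    g_W2 = H.T @ g_y
    g_b2 = g_y.sum(axis=0)
    g_A = (g_y @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
    g_W1 = Z.T @ g_A                      # no decay on the dense weights
    g_b1 = g_A.sum(axis=0)
    g_Z = g_A @ W1.T
    g_d = (g_Z * X).sum(axis=0) + 2.0 * decay * d
    return loss, (g_d, g_W1, g_b1, g_W2, g_b2)

def train(X, t, hidden=8, decay=1e-2, lr=0.05, steps=2000, seed=0):
    """Plain gradient descent; hyperparameters are illustrative only."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    params = [
        np.ones(n_in),                        # diagonal starts as the identity
        rng.normal(0.0, 0.5, (n_in, hidden)),
        np.zeros(hidden),
        rng.normal(0.0, 0.5, (hidden, 1)),
        np.zeros(1),
    ]
    for _ in range(steps):
        _, grads = loss_and_grads(params, X, t, decay)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

def rank_features(d):
    """Feature indices ordered from largest to smallest |d|."""
    return np.argsort(-np.abs(d))
```

After training, `rank_features(params[0])` orders the input features by the magnitude of their diagonal weight. In this sketch, features at the end of that ordering would be the candidates for removal.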