ISCA Archive AVSP 2013

Predicting head motion from prosodic and linguistic features

Angelika Hönemann, Diego Evin, Alejandro J. Hadad, Hansjörg Mixdorff, Sascha Fagel

This paper describes an approach to predict non-verbal cues from speech-related features. Our previous investigations of audiovisual speech showed that there are strong correlations between the two modalities. In this work we developed two models using different kinds of Recurrent Artificial Neural Networks, Elman and NARX, to predict head-motion activity parameters from linguistic and prosodic inputs, and compared their performance. Prosodic inputs comprised F0 and intensity, while linguistic inputs comprised these plus additional information such as the types of syllables and phrases and various relations between them. Using speaker-specific models for six subjects, performance measures in terms of root mean square error (RMSE) showed significant differences between the models with respect to the input parameters, and that the NARX network outperformed the Elman network on the prediction task.
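To make the NARX architecture concrete, the following is a minimal sketch, not the authors' implementation: a NARX-style predictor maps tapped delay lines of exogenous inputs (here standing in for F0 and intensity) and of its own fed-back past outputs through a hidden layer to the next head-motion value. All names, delay depths, and the synthetic data are illustrative assumptions; the weights are untrained, and RMSE is computed against a placeholder target only to mirror the paper's evaluation measure.

```python
import numpy as np

# Hypothetical NARX-style sketch: predict a scalar head-motion parameter y[t]
# from delayed exogenous prosodic inputs u[t] and delayed past outputs.
rng = np.random.default_rng(0)

T, n_in = 200, 2             # frames; prosodic features (e.g. F0, intensity)
du, dy, n_hidden = 3, 2, 16  # input delay depth, output delay depth, hidden units

u = rng.standard_normal((T, n_in))                       # placeholder prosodic features
W_in = rng.standard_normal((n_hidden, n_in * du + dy)) * 0.1  # untrained weights
b_in = np.zeros(n_hidden)
W_out = rng.standard_normal(n_hidden) * 0.1

y = np.zeros(T)
for t in range(max(du, dy), T):
    # concatenate the tapped delays of the exogenous inputs and fed-back outputs
    x = np.concatenate([u[t - du:t].ravel(), y[t - dy:t]])
    h = np.tanh(W_in @ x + b_in)   # single hidden layer with tanh activation
    y[t] = W_out @ h               # predicted head-motion activity at frame t

# RMSE against a placeholder target, mirroring the paper's performance measure
target = rng.standard_normal(T)
rmse = np.sqrt(np.mean((y - target) ** 2))
print(f"RMSE: {rmse:.3f}")
```

The output feedback (`y[t - dy:t]` entering the input vector) is what distinguishes a NARX network from an Elman network, which instead feeds back its hidden-layer state; in practice such a model would be trained, e.g. by backpropagation through the unrolled sequence, rather than run with random weights as here.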

Index Terms: predicting head motion, audiovisual speech, time-delayed NARX, Elman NN, linguistic vs. prosodic features