ISCA Archive Interspeech 2014

Speech-driven head motion synthesis using neural networks

Chuang Ding, Pengcheng Zhu, Lei Xie, Dongmei Jiang, Zhong-Hua Fu

This paper presents a neural network approach for speech-driven head motion synthesis, which automatically predicts a speaker's head movement from his/her speech. Specifically, we realize the speech-to-head-motion mapping by learning a multi-layer perceptron from audio-visual broadcast news data. First, we show that a generatively pre-trained neural network significantly outperforms both a randomly initialized network and the hidden Markov model (HMM) approach. Second, we demonstrate that the feature combination of log Mel-scale filter-bank (FBank), energy and fundamental frequency (F0) performs best in head motion prediction. Third, we discover that using long-context acoustic information further improves performance. Finally, using extra unlabeled training data in the pre-training stage yields additional performance gains. The proposed speech-driven head motion synthesis approach increases the canonical correlation analysis (CCA) score from 0.299 (the HMM approach) to 0.565, and it can be effectively used in expressive talking avatar animation.
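Two ingredients of the abstract are easy to make concrete: stacking long-context acoustic frames as network input, and scoring predicted head motion against ground truth with the first canonical correlation. The sketch below (a minimal NumPy toy; the window size, feature dimensions and regularization are illustrative assumptions, not the paper's configuration) shows both pieces.

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """Stack each frame with +/- context frames (edge-padded),
    giving the long-context input vector fed to the network."""
    T, D = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    width = left + right + 1
    return np.stack([padded[t:t + width].ravel() for t in range(T)])

def first_cca(X, Y, eps=1e-8):
    """First canonical correlation between two frame-aligned
    sequences X (T x Dx) and Y (T x Dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / len(X)
    # Whiten both sides; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    Lx = np.linalg.cholesky(np.linalg.inv(Sxx))
    Ly = np.linalg.cholesky(np.linalg.inv(Syy))
    return np.linalg.svd(Lx.T @ Sxy @ Ly, compute_uv=False)[0]
```

A perfectly linear relation between the two sequences drives the first canonical correlation toward 1, while unrelated sequences score low; the paper's 0.299 vs. 0.565 comparison is on this scale.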

doi: 10.21437/Interspeech.2014-186

Cite as: Ding, C., Zhu, P., Xie, L., Jiang, D., Fu, Z.-H. (2014) Speech-driven head motion synthesis using neural networks. Proc. Interspeech 2014, 2303-2307, doi: 10.21437/Interspeech.2014-186

@inproceedings{ding14_interspeech,
  author={Chuang Ding and Pengcheng Zhu and Lei Xie and Dongmei Jiang and Zhong-Hua Fu},
  title={{Speech-driven head motion synthesis using neural networks}},
  year={2014},
  booktitle={Proc. Interspeech 2014},
  pages={2303--2307},
  doi={10.21437/Interspeech.2014-186}
}