ISCA Archive AVSP 2013
ISCA Archive AVSP 2013

Objective and subjective feature evaluation for speaker-adaptive visual speech synthesis

Dietmar Schabus, Michael Pucher, Gregor Hofer

This paper describes an evaluation of a feature extraction method for visual speech synthesis that is suitable for speaker-adaptive training of a Hidden Semi-Markov Model (HSMM)-based visual speech synthesizer. An audio-visual corpus from three speakers was recorded. While the features used for the auditory modality are well understood, we propose to use a standard Principal Component Analysis (PCA) approach to extract suitable features for training and synthesis of the visual modality. A PCA-based approach provides dimensionality reduction and component de-correlation on the 3D facial marker data which was recorded using a facial motion capturing system. Enabling visual average “voice” training and speaker-adaptation brings a key strength of the HMM framework into both the visual and the audio-visual domain. An objective evaluation based on reconstruction error calculations, as well as a perceptual evaluation with 40 test subjects, show that PCA is well suited for feature extraction from multiple speakers, even in a challenging adaptation scenario where no data from the target speaker is available during PCA.