Natural movement plays a significant role in realistic speech animation.
Numerous studies have demonstrated the contribution that visual cues make
to the degree to which we, as human observers, find an animation acceptable.
Rigid head motion is one visual mode that universally co-occurs
with speech, and so it is a reasonable strategy to seek a transformation
from the speech mode with which to predict head pose. Several previous authors
have shown that prediction is possible, but experiments are typically
confined to rigidly produced dialogue. Natural, expressive, emotive
and prosodic speech exhibits motion patterns that are far more difficult
to predict, with considerable variation in the expected head pose.
Recently, Long Short-Term Memory (LSTM) networks have become an important
tool for modelling speech and natural language tasks. We employ Deep
Bi-Directional LSTMs (BLSTMs), capable of learning long-term structure in
language, to model the relationship that speech has with rigid head motion.
We then extend this model by conditioning it on prior motion.
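As a rough illustration of this regression setup, the sketch below shows a stacked bi-directional LSTM mapping per-frame acoustic features to a 6-DOF head pose, with an optional input channel for the prior pose. This is a minimal PyTorch sketch under assumed feature dimensions and layer sizes; the class name, 40-D audio features, and 6-D pose are illustrative assumptions rather than the exact configuration used.

```python
import torch
import torch.nn as nn

class SpeechToHeadPoseBLSTM(nn.Module):
    """Deep bi-directional LSTM mapping per-frame audio features to head pose.

    Dimensions are illustrative: 40-D acoustic features and a 6-D pose
    (3 rotations + 3 translations). When a prior pose is supplied, it is
    appended to the audio features at each frame, conditioning the
    prediction on prior motion.
    """

    def __init__(self, audio_dim=40, pose_dim=6, hidden=256, layers=3,
                 condition_on_prior=False):
        super().__init__()
        in_dim = audio_dim + (pose_dim if condition_on_prior else 0)
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, pose_dim)  # 2x for the two directions

    def forward(self, audio, prior_pose=None):
        # audio: (batch, frames, audio_dim); prior_pose: (batch, frames, pose_dim)
        x = audio if prior_pose is None else torch.cat([audio, prior_pose], dim=-1)
        h, _ = self.blstm(x)
        return self.head(h)  # per-frame pose prediction


# Training would minimise a regression loss between predicted and captured pose:
# loss = nn.functional.mse_loss(model(audio, prior_pose), target_pose)
```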
Finally, we introduce a generative head motion model, conditioned on audio
features, using a Conditional Variational Autoencoder (CVAE). Each approach
mitigates the problems of the one-to-many mapping that a speech-to-head-pose
model must accommodate.
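The sketch below illustrates the CVAE idea: an encoder infers a latent distribution from head pose together with the audio condition, and a decoder reconstructs pose from a latent sample and the same condition, so that sampling the latent at synthesis time yields multiple plausible motions for the same speech. It is a minimal PyTorch sketch; the dimensions, layer sizes, and names are illustrative assumptions, not the exact configuration used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    """CVAE over head pose, conditioned on audio features (illustrative sizes)."""

    def __init__(self, pose_dim=6, audio_dim=40, latent_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim + audio_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + audio_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, pose_dim))

    def forward(self, pose, audio):
        # Encode pose together with its audio condition into a latent Gaussian.
        h = self.enc(torch.cat([pose, audio], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        recon = self.dec(torch.cat([z, audio], dim=-1))
        return recon, mu, logvar

    def sample(self, audio):
        # Draw a novel head pose for given audio by sampling the latent prior;
        # different samples give different plausible motions for the same speech.
        z = torch.randn(audio.shape[0], self.mu.out_features)
        return self.dec(torch.cat([z, audio], dim=-1))


def cvae_loss(recon, pose, mu, logvar):
    # Reconstruction term plus KL divergence to the unit-Gaussian prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return F.mse_loss(recon, pose) + kl
```

Because the decoder is driven by both the audio condition and a sampled latent, the model does not collapse the many head motions consistent with one utterance onto a single average trajectory, which is the sense in which it mitigates the one-to-many mapping.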