The present work is part of a framework to design and implement a language laboratory for speech reading (lip reading) in multiple languages. It is based on the interdisciplinary project LIPPS at the Technical University of Berlin, Germany, which aims to develop a training aid for speech reading by generating text-driven facial animation from a single passport photo with the help of 2D image morphing. The LIPPS system may be particularly helpful for patients with sudden profound hearing loss, enabling them to start learning speech reading while still in hospital after surgery or during subsequent rehabilitation.
The present project uses dynamic models for the changes of important visual features. We apply two ideas: (i) specific characteristic images are related to the sounds or phonemes of an utterance, and (ii) visemes are related to the phonemes and are represented by the dynamics of linear second-order models.
We aim to extend the idea that visemes are related to single characteristic images or poses of the face towards temporally varying units, as is the case for their auditory counterparts, the phonemes.
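To illustrate what a viseme represented by the dynamics of a linear second-order model could look like, the following minimal sketch simulates a single visual feature (for example, lip opening) moving towards a target value. The function name, the parameters omega (natural frequency) and zeta (damping ratio), and the simple time-stepping scheme are assumptions made for illustration only; they are not taken from the LIPPS system.

```python
def simulate_second_order(target, x0=0.0, omega=20.0, zeta=1.0, dt=0.01, steps=100):
    """Simulate x'' = omega^2 * (target - x) - 2 * zeta * omega * x'.

    Hypothetical illustration: x is one visual feature, target is the
    value associated with a characteristic instance of a viseme.
    Returns the list of feature values, one per time step.
    """
    x, v = x0, 0.0
    trajectory = [x]
    for _ in range(steps):
        # Acceleration from the linear second-order model
        a = omega ** 2 * (target - x) - 2.0 * zeta * omega * v
        # Semi-implicit Euler step: update velocity first, then position
        v += dt * a
        x += dt * v
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    traj = simulate_second_order(target=1.0)
    print(f"feature value after {len(traj) - 1} steps: {traj[-1]:.4f}")
```

With zeta = 1 (critical damping) the feature approaches its target smoothly and without oscillation, which is one plausible way to obtain a temporally varying unit rather than a single static pose.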
We analyzed video clips of moving faces and modeled certain visual features both at the locations of the characteristic images (the characteristic instances) and during the transitional changes of the feature sets between neighboring characteristic instances. Contextual modulations of the visual features are described with the help of a dominance model. High dominance is given to visemes with indispensable features such as complete or partial lip closure (e.g., bilabial or fricative visemes), whereas low dominance is given to practically invisible visemes (e.g., guttural visemes), during which the lips mainly prepare the transition towards later dominant phonemes.
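The effect of a dominance model can be sketched as a dominance-weighted average of the viseme targets. The source does not specify the functional form, so the exponentially decaying dominance function, the decay rate theta, and all numeric values below are assumptions chosen only to show how a high-dominance viseme can shape the features of a low-dominance neighbor.

```python
import math


def dominance(t, center, strength, theta=10.0):
    """Hypothetical dominance function: peaks at the characteristic
    instance (center) and decays exponentially with temporal distance."""
    return strength * math.exp(-theta * abs(t - center))


def blend_feature(t, visemes, theta=10.0):
    """Dominance-weighted average of viseme targets at time t.

    visemes: list of (center_time, target_value, strength) tuples,
    where strength must be positive.
    """
    weights = [dominance(t, c, s, theta) for c, _, s in visemes]
    total = sum(weights)
    return sum(w * target for w, (_, target, _) in zip(weights, visemes)) / total


if __name__ == "__main__":
    # Illustrative values: a guttural viseme (low dominance, lip opening 0.5)
    # followed by a bilabial viseme (high dominance, lip opening 0.0).
    visemes = [(0.1, 0.5, 0.1), (0.2, 0.0, 1.0)]
    for t in (0.1, 0.15, 0.2):
        print(f"t={t:.2f}s lip opening={blend_feature(t, visemes):.3f}")
```

In this sketch the lip opening at the guttural viseme's own characteristic instance is already pulled well below its nominal target, mirroring the description that the lips mainly prepare the transition towards the later dominant viseme.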
The described method may also be applied to other types of facial animation systems, for example to the control parameters of anatomical face models.