We describe preliminary work towards an objective method for identifying visemes. Active appearance model (AAM) features are used to parameterise a speaker’s lips and jaw during speech. The temporal behaviour of the AAM features between automatically identified salient points is used to represent visual speech gestures, and visemes are created by clustering these gestures using dynamic time warping (DTW) as a cost function. This method produces a significantly more structured model of visual speech than one obtained by assuming a typical phoneme-to-viseme mapping.
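The core of the clustering step is a DTW distance between pairs of gesture trajectories. The following is a minimal illustrative sketch, not the paper's implementation: the function name, the Euclidean local cost, and the toy 2-D "AAM-like" feature vectors are all assumptions made for the example.

```python
# Hypothetical sketch: DTW distance between two visual-speech gesture
# trajectories (sequences of fixed-dimension feature vectors), usable
# as a pairwise cost for clustering gestures into viseme candidates.
import math

def dtw_distance(seq_a, seq_b):
    """Classic dynamic-time-warping distance with Euclidean local cost."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # dp[i][j] = minimum accumulated cost aligning seq_a[:i] with seq_b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(seq_a[i - 1], seq_b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]

# Toy data: two similar lip-opening gestures of different lengths,
# and one dissimilar gesture (each a list of 2-D feature vectors).
g1 = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.2)]
g2 = [(0.0, 0.0), (0.4, 0.1), (0.6, 0.15), (1.0, 0.2)]
g3 = [(1.0, 1.0), (0.5, 1.2), (0.0, 1.4)]

print(dtw_distance(g1, g2))  # small: likely the same viseme cluster
print(dtw_distance(g1, g3))  # large: a different gesture class
```

Because DTW warps the time axis, the two similar gestures score a low cost despite having different lengths, which is what makes it a natural cost function for grouping variable-duration speech gestures.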
Index Terms: Visemes, visual speech encoding