ISCA Archive AVSP 2008

Patch-based analysis of visual speech from multiple views

Patrick Lucey, Gerasimos Potamianos, Sridha Sridharan

Obtaining a robust feature representation of visual speech is crucial to the design of audio-visual automatic speech recognition systems. In the literature, when appearance-based visual features are employed for this purpose, they are typically extracted using a "holistic" approach: a transformation of the pixel values of the entire region-of-interest (ROI) is obtained, with the ROI covering the speaker’s mouth and often the surrounding facial area. In this paper, we instead consider a "patch"-based visual feature extraction approach within the appearance-based framework. In particular, we conduct a novel analysis to determine which areas (patches) of the mouth ROI are the most informative for visual speech. Furthermore, we extend this analysis beyond the traditional frontal views by investigating profile views as well. Not surprisingly, for both frontal and profile views, we conclude that the central mouth patches are the most informative, but less so than the holistic features of the entire ROI. Nevertheless, fusing the holistic features with the best patch-based features further improves visual speech recognition performance compared to either feature set alone. Finally, we discuss scenarios where the patch-based approach may be preferable to holistic features.
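
The distinction between holistic and patch-based extraction can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `split_into_patches`, `flatten`, and `fuse` are hypothetical, the patches form a regular non-overlapping grid, the "transformation" is plain flattening of pixel values, and fusion is simple concatenation of the two feature vectors.

```python
from typing import List

Image = List[List[float]]  # a grayscale ROI stored as rows of pixel values


def split_into_patches(roi: Image, n_rows: int, n_cols: int) -> List[Image]:
    """Split a 2-D ROI into an n_rows x n_cols grid of equal,
    non-overlapping patches, returned in row-major order.
    (Hypothetical helper; the grid layout is an assumption.)"""
    height, width = len(roi), len(roi[0])
    if height % n_rows or width % n_cols:
        raise ValueError("ROI size must be divisible by the grid size")
    patch_h, patch_w = height // n_rows, width // n_cols
    patches = []
    for r in range(n_rows):
        for c in range(n_cols):
            rows = roi[r * patch_h:(r + 1) * patch_h]
            patches.append([row[c * patch_w:(c + 1) * patch_w] for row in rows])
    return patches


def flatten(image: Image) -> List[float]:
    """Stand-in for a feature transformation: concatenate the pixel rows."""
    return [pixel for row in image for pixel in row]


def fuse(holistic: List[float], patch: List[float]) -> List[float]:
    """Feature-level fusion by concatenation (an assumed fusion scheme)."""
    return holistic + patch


# Toy 4x6 ROI whose pixel value encodes its position.
roi = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
patches = split_into_patches(roi, n_rows=2, n_cols=3)
holistic_features = flatten(roi)        # features from the entire ROI
central_features = flatten(patches[1])  # features from one central-column patch
fused_features = fuse(holistic_features, central_features)
```

In practice, a compact transform of each patch would likely replace the raw flattening used here, and the most informative patches would be selected by recognition experiments rather than fixed in advance.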