In this paper, we propose a speech pattern discovery approach based on audio-visual information fusion. We first align the audio and visual feature sequences using canonical correlation analysis (CCA) to account for the temporal asynchrony between the audio and visual speech modalities. We then search for potential patterns, called paths, by applying unbounded dynamic time warping (UDTW) separately to the inter-utterance audio and visual similarity matrices. Finally, the audio and visual paths are fused, and the reliable ones are retained as the discovered speech patterns. Experiments on an audio-visual corpus show, for the first time, that the performance of speech pattern discovery can be improved by using visual information when the speaker's facial information is available. In particular, the proposed path fusion approach outperforms feature concatenation and similarity weighting, and CCA-based audio-visual synchronization plays an important role in the performance improvement.
Index Terms: Speech pattern discovery, canonical correlation analysis, audio-visual speech processing, dynamic time warping