ISCA Archive AVSP 2007

An audio-visual speech recognition framework based on articulatory features

Tian Gan, Wolfgang Menzel, Shiqiang Yang

This paper presents an audio-visual speech recognition framework based on articulatory features, which aims to combine the advantages of both approaches and achieves better recognition accuracy than a phone-based recognizer. In our approach, we use HMMs to model abstract articulatory classes, which are extracted in parallel from both the speech signal and the video frames. The N-best outputs of these independent classifiers are combined to decide on the best articulatory feature tuples. By mapping these tuples to phones, a phone stream can be generated. A lexical search finally maps this phone stream to meaningful word transcriptions. We demonstrate the potential of our approach in a preliminary experiment on the GRID database, which contains continuous English voice commands for a small-vocabulary task.
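The fusion step described above (combining N-best outputs from the audio and video classifiers into the best articulatory feature tuple, then mapping that tuple to a phone) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the feature classes, score values, and the tuple-to-phone table are invented for the example.

```python
from itertools import product

# Toy N-best lists (label, log-score) per articulatory feature (AF)
# class, one list per modality. Values are hypothetical.
audio_nbest = {
    "voicing": [("voiced", -0.2), ("unvoiced", -1.8)],
    "place":   [("bilabial", -0.5), ("alveolar", -1.1)],
}
video_nbest = {
    "voicing": [("voiced", -0.4), ("unvoiced", -1.2)],
    "place":   [("bilabial", -0.3), ("alveolar", -1.5)],
}

# Illustrative AF-tuple -> phone mapping (not taken from the paper).
tuple_to_phone = {
    ("voiced", "bilabial"): "b",
    ("unvoiced", "bilabial"): "p",
    ("voiced", "alveolar"): "d",
    ("unvoiced", "alveolar"): "t",
}

def best_phone(audio, video, af_order=("voicing", "place")):
    """Score each candidate AF tuple by summing the audio and video
    log-scores of its labels; map the best tuple to a phone."""
    def score(nbest, af, label):
        return dict(nbest[af]).get(label, float("-inf"))

    candidates = product(*(
        [label for label, _ in audio[af]] for af in af_order
    ))
    best = max(
        candidates,
        key=lambda tup: sum(
            score(audio, af, lab) + score(video, af, lab)
            for af, lab in zip(af_order, tup)
        ),
    )
    return tuple_to_phone.get(best)

print(best_phone(audio_nbest, video_nbest))  # -> "b"
```

Here both modalities favor the tuple ("voiced", "bilabial"), so the combined decision yields the phone "b"; the resulting phone stream would then feed the lexical search.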