Voice activity detection (VAD) is one of the most critical factors affecting speech recognition performance in noisy-environment applications. A real-time VAD was developed that uses facial parameters (eye and lip contours) as a front-end to the traditional speech/noise (audio-only) GMM-based method. Speech recognition performance with the audio-visual VAD is shown to be comparable to that with the audio-only VAD under shopping-mall background noise. The advantages and limitations of introducing visual information are discussed.
Index Terms: voice activity detection, audio-visual, speech recognition, noisy environment, real-time.
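To make the described pipeline concrete, below is a minimal sketch, under stated assumptions, of a visual front-end gating a speech/noise GMM likelihood-ratio test. The feature choices, function names, thresholds, and training data (e.g. `visual_front_end`, `lip_opening`, the random stand-in features) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: a visual front-end gating a speech/noise GMM-based VAD.
# All names, thresholds, and data below are illustrative, not from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

# Train one GMM on speech-frame features and one on noise-frame features
# (e.g., MFCCs); random data stands in here for real training features.
rng = np.random.default_rng(0)
speech_gmm = GaussianMixture(n_components=4, random_state=0).fit(
    rng.normal(1.0, 1.0, (500, 13)))
noise_gmm = GaussianMixture(n_components=4, random_state=0).fit(
    rng.normal(-1.0, 1.0, (500, 13)))

def visual_front_end(lip_opening: float, threshold: float = 0.2) -> bool:
    """Return True when lip-contour motion suggests the speaker may be talking."""
    return lip_opening > threshold

def audio_visual_vad(frame_features: np.ndarray, lip_opening: float) -> bool:
    """Declare speech only if the visual gate is open and the speech GMM
    out-scores the noise GMM on the audio frame (log-likelihood ratio > 0)."""
    if not visual_front_end(lip_opening):
        return False  # visual front-end rejects the frame outright
    llr = (speech_gmm.score_samples(frame_features[None, :])[0]
           - noise_gmm.score_samples(frame_features[None, :])[0])
    return llr > 0.0

# Example: one 13-dimensional audio feature frame plus a lip-opening measurement.
frame = rng.normal(1.0, 1.0, 13)
print(audio_visual_vad(frame, lip_opening=0.35))
```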