ISCA Archive AVSP 2010

Real-time audio-visual voice activity detection for speech recognition in noisy environments

Carlos T. Ishi, Miki Sato, Norihiro Hagita, Shihong Lao

Voice activity detection (VAD) is one of the most critical factors in the performance degradation of speech recognition in noisy environments. A real-time VAD was developed that uses face parameters (eye and lip contours) as a front-end for the traditional GMM-based method, which models speech and noise from the audio signal. For shopping mall background noise, the speech recognition performance of the audio-visual VAD is shown to be comparable to that of audio-only VAD. Advantages and limitations of introducing the visual information are discussed.

Index Terms: voice activity detection, audio-visual, speech recognition, noisy environment, real-time.
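
As an illustration of the architecture summarized in the abstract, the following is a minimal sketch of one plausible reading: a visual front-end gates a speech/noise GMM log-likelihood-ratio test on each audio frame. The abstract does not specify how the two stages are combined, so the hard gating, the function names, the `face_active` flags (assumed to be derived elsewhere from eye and lip contours), and the `threshold` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import math


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def audio_visual_vad(audio_frames, face_active, speech_gmm, noise_gmm,
                     threshold=0.0):
    """Return one speech/non-speech decision per frame.

    Assumption: face_active[i] is a boolean from a hypothetical visual
    front-end. Frames where it is False are rejected without consulting
    the audio models; this hard gate is an illustrative choice.
    """
    decisions = []
    for x, active in zip(audio_frames, face_active):
        if not active:
            decisions.append(False)
            continue
        # Audio stage: log-likelihood ratio of speech model to noise model.
        llr = gmm_loglik(x, speech_gmm) - gmm_loglik(x, noise_gmm)
        decisions.append(llr > threshold)
    return decisions
```

In this sketch the visual stage can only remove frames, so it reduces false alarms from background speech but cannot recover frames that the audio models miss; a soft combination of the two scores would be an alternative design.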