This paper introduces CENSREC-1-AV, an audio-visual speech corpus for noisy speech recognition. CENSREC-1-AV consists of an audio-visual database and a baseline system for bimodal speech recognition, which uses both audio and visual information. The database contains 3,234 utterances by 42 speakers as a training set and 1,963 utterances by 51 speakers as a test set. Each utterance consists of a speech signal together with color and infrared images of the region around the speaker’s mouth. The baseline system is provided so that users can evaluate their own bimodal speech recognizers against it. In the baseline system, multi-stream HMMs are trained on the training data. A preliminary experiment evaluated the baseline on acoustically noisy test data. The results show a relative error reduction of roughly 35% in low-SNR conditions compared with an audio-only ASR method.
Index Terms: audio-visual database, bimodal speech recognition, noise robustness, eigenface, optical flow.
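
Two quantities named in the abstract, the multi-stream HMM score and the relative error reduction, can be illustrated with a minimal sketch. It assumes the common formulation in which a state's observation log-likelihood is a weighted sum of the audio and visual stream log-likelihoods, with weights summing to one; the abstract does not specify the weighting scheme used in the baseline. The function names and all numbers below are hypothetical and are not taken from the CENSREC-1-AV experiments.

```python
def multistream_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Stream-weighted observation log-likelihood for a single HMM state.

    Assumption: weights are constrained so that audio + visual = 1,
    a common convention that the abstract itself does not confirm.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + visual_weight * log_b_visual


def relative_error_reduction(baseline_error, proposed_error):
    """Fraction of the baseline error rate removed by the proposed system."""
    if baseline_error <= 0.0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - proposed_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical per-frame log-likelihoods and stream weight.
    score = multistream_log_likelihood(-10.0, -20.0, audio_weight=0.7)
    print(f"combined log-likelihood: {score:.2f}")

    # Hypothetical error rates (percent), chosen only for illustration.
    rer = relative_error_reduction(baseline_error=50.0, proposed_error=40.0)
    print(f"relative error reduction: {rer:.0%}")
```

In such a scheme, lowering the audio weight shifts reliance toward the visual stream, which is the usual motivation for bimodal recognition when the acoustic signal is degraded.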