In this paper we present the Bilingual Audio-Visual Corpus with Depth information (BAVCD). The database contains utterances of connected digits, spoken by 15 subjects in English and 6 subjects in Greek, collected using multiple audio-visual sensors. Of particular interest among these is the Microsoft Kinect device, which captures facial depth images via the structured-light technique in addition to traditional RGB video. The database supports research on multiple aspects of small-vocabulary audio-visual automatic speech recognition, such as the use of visual depth information for speechreading, the fusion of multiple video and audio streams, and the language dependence of the task. Preliminary recognition results on the corpus are also presented.
Index Terms: audio-visual speech recognition, corpora, multisensory fusion, depth information, languages.