ISCA Archive RSR 1997

Audio-visual talker localization for hand-free speech recognition

Harouna Kabre

A system that tracks a talker using both sound and video is presented for Hand-Free Speech Recognition. A four-microphone array locates the talker acoustically, using the Cross-Power Spectrum method to estimate the time delays between the four microphone signals. A histogram model of skin color locates the talker in the video path. The two position estimates are then combined to improve the precision of talker localization before an acoustic signal is beamformed for speech recognition. The system was evaluated on 50 French logatomes in a computer room with a reverberation time of 0.7 seconds and showed a 30% decrease in error compared with the audio-only system. The methods and algorithms are described.

Key Words: Localization, Hand-Free Speech Recognition, Audio-Visual.
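As an illustration of the time-delay estimation step named in the abstract, the following is a minimal sketch of delay estimation between one pair of microphone signals via the cross-power spectrum. It is not the author's implementation: the function names `dft` and `estimate_delay`, the frame length, the zero-padding, and the absence of any spectral weighting are all assumptions made for this example, and a naive transform is used only to keep the sketch free of external libraries.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def estimate_delay(ref, sig, max_lag):
    """Return the lag, in samples, by which `sig` trails `ref`.

    Hypothetical helper: both frames are zero-padded to avoid circular
    wrap-around, the cross-power spectrum SIG * conj(REF) is formed, and
    its inverse transform (the cross-correlation) is searched for its
    peak within +/- max_lag samples.
    """
    n = len(ref) + len(sig)
    ref_f = dft(list(ref) + [0.0] * (n - len(ref)))
    sig_f = dft(list(sig) + [0.0] * (n - len(sig)))
    cross = [s * r.conjugate() for s, r in zip(sig_f, ref_f)]
    corr = dft(cross, inverse=True)
    lags = range(-max_lag, max_lag + 1)
    # Negative lags live at the end of the circular correlation buffer.
    return max(lags, key=lambda lag: corr[lag % n].real)


if __name__ == "__main__":
    # Synthetic test frame: white noise and a copy delayed by 5 samples.
    rng = random.Random(0)
    ref = [rng.gauss(0.0, 1.0) for _ in range(64)]
    sig = [0.0] * 5 + ref[:-5]
    print(estimate_delay(ref, sig, max_lag=10))
```

Dividing the returned lag by the sampling rate gives the delay in seconds. With four microphones, the same estimate would be repeated for each microphone pair; how the resulting delays are mapped to a talker position is not specified in the abstract and is left out of this sketch.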