ISCA Archive Interspeech 2009
ISCA Archive Interspeech 2009

An audio-visual attention system for online association learning

Martin Heckmann, Holger Brandl, Xavier Domont, Bram Bolder, Frank Joublin, Christian Goerick

We present an audio-visual attention system for speech based interaction with a humanoid robot where a tutor can teach visual properties/locations (e.g “left”) and corresponding, arbitrary speech labels. The acoustic signal is segmented via the attention system and speech labels are learned from a few repetitions of the label by the tutor. The attention system integrates bottom-up stimulus driven saliency calculation (delay-and-sum beamforming, adaptive noise level estimation) and top-down modulation (spectral properties, segment length, movement and interaction status of the robot). We evaluate the performance of different aspects of the system based on a small dataset.