ISCA Archive Interspeech 2024

Active Speaker Detection in Fisheye Meeting Scenes with Scene Spatial Spectrums

Xinghao Huang, Weiwei Jiang, Long Rao, Wei Xu, Wenqing Cheng

Active Speaker Detection (ASD) plays a crucial role in scene understanding by determining whether an on-screen person in a given scene is speaking. In this work, we address ASD in multi-party roundtable meetings with a novel approach that fuses the spatial information of the scene. To leverage the multiple data sources available, our method generates audio spatial spectrum heatmaps from the multi-channel audio and integrates them with the panoramic images. We also introduce the FisheyeMeeting dataset, which combines fisheye panoramic video recordings with multi-channel audio captured by a six-channel circular microphone array. By enabling the multi-modal model to capture audio-visual cues in multi-party meeting scenes, our approach achieves 89.11% mAP on the FisheyeMeeting dataset, outperforming the current state-of-the-art (SOTA) methods by 2.3% mAP.
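To make the idea of an audio spatial spectrum concrete, the following is a minimal sketch of how a per-azimuth power profile could be computed from a six-channel circular array. The abstract does not state which localization algorithm is used, so SRP-PHAT (steered response power with phase transform) is swapped in here purely as an illustration. The array radius, sampling rate, far-field assumption, and all function names are hypothetical and do not come from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def circular_array_positions(num_mics=6, radius=0.05):
    """Mic coordinates (num_mics, 2) on a circle; the radius is an assumed value."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)


def simulate_far_field(azimuth_deg, mic_xy, fs, num_samples, seed=0):
    """Synthetic test signal: white noise arriving as a plane wave from azimuth_deg."""
    rng = np.random.default_rng(seed)
    source = rng.standard_normal(num_samples)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    az = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(az), np.sin(az)])
    # Mics closer to the source hear it earlier (negative delay vs. array centre).
    delays = -(mic_xy @ direction) / SPEED_OF_SOUND
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    shifted = np.fft.rfft(source)[None, :] * phase
    return np.fft.irfft(shifted, n=num_samples, axis=1)


def srp_phat_spectrum(frames, mic_xy, fs, azimuths_deg):
    """Steered response power per candidate azimuth for frames of shape (mics, samples)."""
    num_samples = frames.shape[1]
    spec = np.fft.rfft(frames, axis=1)
    spec = spec / (np.abs(spec) + 1e-12)  # PHAT weighting: keep phase only
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    az = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    directions = np.stack([np.cos(az), np.sin(az)], axis=1)  # (A, 2)
    delays = -(directions @ mic_xy.T) / SPEED_OF_SOUND       # (A, M)
    # Undo the hypothesised delay of each mic, then sum the channels coherently.
    steering = np.exp(2j * np.pi * freqs[None, None, :] * delays[:, :, None])
    beam = (steering * spec[None, :, :]).sum(axis=1)         # (A, F)
    return (np.abs(beam) ** 2).sum(axis=1)                   # (A,)


def to_heatmap_row(power):
    """Scale the spectrum to [0, 1] so it can be rendered as one heatmap row."""
    power = power - power.min()
    return power / (power.max() + 1e-12)
```

Under these assumptions, calling `srp_phat_spectrum` on frames produced by `simulate_far_field` gives a profile that peaks at the simulated source direction. How such a spectrum is mapped onto fisheye image coordinates and fused with the visual stream is specific to the authors' method and is not reproduced here.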