The goal of this work is to determine ‘who spoke when’
in real-world meetings. The method takes surround-view video and
single- or multi-channel audio as inputs, and generates robust diarisation
outputs.
To achieve this, we propose a novel iterative approach that first
enrolls speaker models using audio-visual correspondence, then uses
the enrolled models together with the visual information to determine
the active speaker.
We show strong quantitative
and qualitative performance on a dataset of real-world meetings. The
method is also evaluated on the public AMI meeting corpus, on which
we demonstrate results that exceed those of all comparable methods. We also
show that beamforming can be used together with the video to further
improve performance when multi-channel audio is available.
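
The two-stage idea described above (enrol speaker models from audio-visual correspondence, then combine those models with the visual evidence to pick the active speaker) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `enrol_speakers` and `diarise`, the per-frame embedding and score arrays, the averaging of embeddings, the linear fusion weight, and both thresholds are all assumptions made for illustration only.

```python
import numpy as np

def enrol_speakers(audio_emb, av_scores, threshold=0.5):
    """Build one speaker model per visible person (hypothetical helper).

    audio_emb: (T, D) per-frame audio embeddings (assumed given).
    av_scores: (T, S) audio-visual correspondence score per frame and person.
    A frame is used for enrolment only if one person clearly matches the audio.
    """
    n_speakers = av_scores.shape[1]
    models = np.zeros((n_speakers, audio_emb.shape[1]))
    for s in range(n_speakers):
        confident = (av_scores.argmax(axis=1) == s) & (av_scores[:, s] > threshold)
        if confident.any():
            # Assumption: a speaker model is the mean of confident embeddings.
            models[s] = audio_emb[confident].mean(axis=0)
    return models

def diarise(audio_emb, av_scores, models, weight=0.5, silence_threshold=0.3):
    """Fuse speaker-model similarity with visual scores; -1 means no speaker."""
    eps = 1e-8
    a = audio_emb / (np.linalg.norm(audio_emb, axis=1, keepdims=True) + eps)
    m = models / (np.linalg.norm(models, axis=1, keepdims=True) + eps)
    similarity = a @ m.T  # cosine similarity, shape (T, S)
    # Assumption: a simple weighted sum of the two cues.
    fused = weight * similarity + (1.0 - weight) * av_scores
    labels = fused.argmax(axis=1)
    labels[fused.max(axis=1) < silence_threshold] = -1
    return labels

# Toy data: two speakers; in the last two frames the visual cue is uninformative.
audio_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                      [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
av_scores = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9],
                      [0.1, 0.9], [0.2, 0.2], [0.2, 0.2]])
models = enrol_speakers(audio_emb, av_scores)
labels = diarise(audio_emb, av_scores, models)
```

In the toy data the last two frames carry no useful visual evidence, so the enrolled speaker models are what resolve them. In an iterative variant, the resulting labels could be fed back to refine the enrolment, though how that is done here is not specified in the text above.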