ISCA Archive Interspeech 2022

End-to-End Audio-Visual Neural Speaker Diarization

Mao-Kui He, Jun Du, Chin-Hui Lee

In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our audio-visual model takes audio features (e.g., FBANKs), multi-speaker lip regions of interest (ROIs), and multi-speaker i-vector embeddings as multimodal inputs, and a set of binary classification output layers produces the activity of each speaker. With this carefully designed end-to-end structure, the proposed method can explicitly handle overlapping speech and accurately distinguish speech from non-speech using multimodal information. The i-vectors are key to resolving the alignment problem caused by visual modality errors (e.g., occlusions, off-screen speakers, or unreliable detection). Moreover, our audio-visual model is robust to the absence of the visual modality, a condition under which the performance of a visual-only model degrades significantly. Evaluated on the datasets of the first multimodal information based speech processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on the development/evaluation sets with reference voice activity detection (VAD) information, while the audio-only and video-only systems yielded DERs of 27.9%/29.0% and 14.6%/13.1%, respectively.