ISCA Archive Interspeech 2025

Leveraging Self-Supervised Learning Based Speaker Diarization for MISP 2025 AVSD Challenge

Zeyan Song, Tianchi Sun, Ronghui Hu, Kai Chen, Jing Lu

This paper presents our team's submission to the audio-visual speaker diarization (AVSD) track of the Multimodal Information Based Speech Processing (MISP) 2025 Challenge. The submitted system is adapted from the DiariZen pipeline, with a primary focus on optimizing it for the challenge dataset. The pipeline consists of a WavLM-based local end-to-end neural diarization module followed by two different clustering methods. To further refine the output, DOVER-Lap is employed to fuse the results obtained from different input channels and clustering methods. Our final submission achieves a diarization error rate (DER) of 8.33% on the evaluation set, a relative improvement of 46.3% over the baseline, and ranks 3rd in the AVSD track of the challenge.
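As a reading aid for the reported figures, the following minimal sketch shows how DER and relative improvement are conventionally computed. It assumes the standard definition of DER as the sum of missed speech, false alarm, and speaker confusion durations divided by the total reference speech duration; the function names are illustrative and not taken from the submitted system. The baseline DER is not stated in this section, so the value obtained below is only back-computed from the two rounded figures in the abstract and should be treated as approximate.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER under the standard definition (all arguments are durations in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_improvement(baseline_der, system_der):
    """Fractional reduction in DER relative to the baseline."""
    return (baseline_der - system_der) / baseline_der


def implied_baseline(system_der, rel_improvement):
    """Invert relative_improvement to recover the baseline DER it implies."""
    return system_der / (1.0 - rel_improvement)


# Figures reported in the abstract: 8.33% DER, 46.3% relative improvement.
# The baseline below is an inference from rounded numbers, not a reported value.
baseline = implied_baseline(8.33, 0.463)
print(f"implied baseline DER: about {baseline:.1f}%")
```

Under these assumptions the implied baseline DER is roughly 15.5%, which is consistent with the stated 46.3% relative reduction.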