ISCA Archive Interspeech 2025

Count Your Speakers! Multitask Learning for Multimodal Speaker Diarization

Prabhav Singh, Jesus Villalba, Najim Dehak

Recent advances in speaker diarization have explored diverse clustering methods, particularly in multimodal frameworks. The clustering stage nevertheless remains a critical limitation: heuristic methods often fail to exploit the full potential of multimodal data. Threshold-based clustering, for example, frequently over-clusters, causing incorrect speaker assignments and an elevated diarization error rate (DER). To address this, we propose CYS-MSD, a novel framework that fuses the audio and visual modalities through a trainable cross-modal attention mechanism. The embeddings are fine-tuned with a multitask objective that jointly predicts the speaker count and assigns speaker labels, enabling data-driven clustering that adapts to varying speaker scenarios. In addition, a modality-masking mechanism provides robustness to missing inputs in real-world conditions. We evaluate CYS-MSD on the AVA-AVD corpus and report a 5% reduction in DER over the baseline and an average 2% reduction compared to various state-of-the-art (SOTA) systems.
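
As a rough illustration of the multitask objective and the modality masking described above, the following minimal sketch combines a speaker-count loss with a per-segment speaker-label loss and randomly blanks one modality. It is not the CYS-MSD implementation: the function names, the `count_weight` mixing factor, the `p_drop` masking probability, and the choice of zeroing a modality are all assumptions made for illustration.

```python
import math
import random


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(count_logits, count_target, label_logits, label_targets,
                   count_weight=0.5):
    """Weighted sum of a speaker-count loss and a mean speaker-label loss.

    `count_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    count_loss = cross_entropy(count_logits, count_target)
    label_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(label_logits, label_targets)
    ) / len(label_targets)
    return count_weight * count_loss + (1.0 - count_weight) * label_loss


def mask_modality(audio, video, p_drop=0.3, rng=random):
    """With probability `p_drop`, zero out exactly one modality at random.

    Zeroing is an assumed masking strategy; `p_drop` is a hypothetical value.
    """
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            audio = [0.0] * len(audio)
        else:
            video = [0.0] * len(video)
    return audio, video


if __name__ == "__main__":
    # Toy inputs: 3 candidate speaker counts, 2 segments with 2 candidate labels.
    audio, video = mask_modality([0.2, 0.7], [0.9, 0.1], p_drop=0.3)
    loss = multitask_loss(
        count_logits=[0.1, 1.2, -0.3], count_target=1,
        label_logits=[[2.0, 0.5], [0.3, 1.1]], label_targets=[0, 1],
    )
    print(audio, video, round(loss, 4))
```

Training on masked inputs of this kind is one common way to encourage a model to tolerate a missing audio or video stream at inference time; the weighting between the two loss terms would normally be tuned on held-out data.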