Recent advances in speaker diarization have explored diverse clustering methods, particularly within multimodal frameworks. A critical limitation remains in the clustering stage, where heuristic methods often fail to exploit the full potential of multimodal data: threshold-based clustering, for example, frequently over-clusters, producing incorrect speaker assignments and an elevated diarization error rate (DER). To address this, we propose CYS-MSD, a novel framework that fuses the audio and visual modalities through a trainable cross-modal attention mechanism. The fused embeddings are fine-tuned with a multitask objective that jointly predicts the number of speakers and assigns speaker labels, enabling data-driven clustering that adapts to varying speaker scenarios. A modality-masking mechanism further ensures robustness to missing inputs in real-world conditions. We evaluate CYS-MSD on the AVA-AVD corpus, reporting a 5% DER reduction over the baseline and an average 2% reduction relative to several state-of-the-art systems.
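
To make the fusion, multitask, and masking ideas concrete, the following is a minimal PyTorch sketch, not the paper's actual architecture: the class name `CrossModalFusion`, the embedding dimension, the masking probability `p_mask`, and the two prediction heads are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative sketch: audio frames attend over visual features via
    cross-modal attention, with random modality masking during training
    and two heads for the multitask objective (labels + speaker count)."""

    def __init__(self, dim=256, num_heads=4, max_speakers=8, p_mask=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned placeholder used when the visual stream is masked/missing.
        self.missing_visual = nn.Parameter(torch.zeros(1, 1, dim))
        self.p_mask = p_mask
        # Multitask heads: per-frame speaker labels and utterance-level count.
        self.label_head = nn.Linear(dim, max_speakers)
        self.count_head = nn.Linear(dim, max_speakers)

    def forward(self, audio, visual=None):
        # audio: (B, T_audio, dim); visual: (B, T_visual, dim) or None.
        B = audio.size(0)
        if visual is None or (self.training and torch.rand(()) < self.p_mask):
            # Modality masking: substitute the learned "missing" token so the
            # model stays robust when visual input is absent at test time.
            visual = self.missing_visual.expand(B, 1, -1)
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        fused = fused + audio  # residual connection preserves the audio stream
        labels = self.label_head(fused)           # (B, T_audio, max_speakers)
        count = self.count_head(fused.mean(dim=1))  # (B, max_speakers)
        return labels, count
```

Under this sketch, fine-tuning would minimize a weighted sum of cross-entropy losses over the label and count targets, so the predicted speaker count, rather than a hand-tuned similarity threshold, drives the clustering decision.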