ISCA Archive Interspeech 2023

Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization

Shutong Niu, Jun Du, Maokui He, Chin-Hui Lee, Baoxiang Li, Jiakui Li

We propose an unsupervised adaptation approach that improves target-speaker voice activity detection (TS-VAD) for speaker diarization (SD). The approach uses quality-aware masking (QM) to reduce potential errors in the generated pseudo-labels. Furthermore, the QM-TS-VAD adapted model can serve as a teacher model to fine-tune a student SD model through knowledge distillation (KD), which further mitigates over-fitting. Evaluated on the eight domains of the DIHARD-III evaluation corpus, the proposed QM-TS-VAD approach effectively improves SD performance, and the introduced KD method further reduces errors in seven of the eight domains. Finally, the proposed framework outperforms the unsupervised adaptation approach used in the top-ranked system submitted to the DIHARD-III Challenge.
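
The abstract does not specify how quality is measured or how the mask enters training, so the following is only a minimal sketch of the general idea of masking unreliable pseudo-labels out of a frame-level loss. The function name `masked_bce`, the per-frame `quality` scores, and the `threshold` value are all hypothetical illustrations and are not taken from the paper.

```python
import math


def masked_bce(probs, pseudo_labels, quality, threshold=0.8):
    """Mean binary cross-entropy over frames that pass a quality check.

    probs:         predicted speech probabilities, one per frame
    pseudo_labels: 0/1 pseudo-labels, one per frame
    quality:       hypothetical per-frame reliability scores in [0, 1]
    threshold:     hypothetical cut-off below which a frame is masked out

    Returns 0.0 when every frame is masked out.
    """
    eps = 1e-7  # keeps log() away from 0
    total, kept = 0.0, 0
    for p, y, q in zip(probs, pseudo_labels, quality):
        if q < threshold:
            # Assumed masking rule: unreliable frames contribute no loss.
            continue
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0


# The third frame has a low quality score, so only the first two count.
loss = masked_bce([0.9, 0.1, 0.5], [1, 0, 1], [0.95, 0.9, 0.2])
```

Under the same assumptions, a distillation step could reuse this function by treating a teacher model's thresholded outputs as `pseudo_labels` for the student; the paper's actual KD objective may differ.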