ISCA Archive Interspeech 2024

Efficient Speaker Embedding Extraction Using a Twofold Sliding Window Algorithm for Speaker Diarization

Jeong-Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang

This paper proposes an efficient speaker embedding (SE) extraction method that employs a twofold sliding window algorithm (SWA) for speaker diarization (SD) systems. The first SWA splits the input into non-overlapping short segments, which are fed into the frame-level neural networks of a pre-trained SE model to extract frame-level representations. The second SWA concatenates neighboring frame-level representations along the time axis, which introduces overlap between the concatenated representations. Multiple SEs are then extracted from these concatenated representations. Additionally, we propose a fine-tuning strategy that applies a residual adapter and knowledge distillation to a pre-trained SE model to refine the frame-level representations. Experimental results on two SD benchmarks show that the proposed extraction method, combined with a fine-tuned SE model, reduces floating-point operations while maintaining the diarization error rate.
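The twofold SWA described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy `frame_level_net` (a stand-in for the frame-level layers of a pre-trained SE model), the mean/std pooling, and all window sizes are assumptions chosen only to make the data flow concrete.

```python
import numpy as np

def first_swa(signal, seg_len):
    """First SWA: split a 1-D signal into non-overlapping segments (tail dropped)."""
    n = len(signal) // seg_len
    return signal[: n * seg_len].reshape(n, seg_len)

def frame_level_net(segment, frame_len=4):
    """Hypothetical stand-in for the frame-level layers of an SE model.
    Assumes len(segment) is a multiple of frame_len; returns (frames, 2)."""
    frames = segment.reshape(-1, frame_len)
    return np.stack([frames.mean(axis=1), frames.std(axis=1)], axis=1)

def second_swa(reps, win, hop):
    """Second SWA: concatenate `win` neighbouring segment representations
    along the time axis, advancing by `hop` segments (hop < win gives overlap)."""
    return [np.concatenate(reps[i:i + win], axis=0)
            for i in range(0, len(reps) - win + 1, hop)]

def pool(rep):
    """Assumed segment-level step: mean and std pooling over time."""
    return np.concatenate([rep.mean(axis=0), rep.std(axis=0)])

def twofold_swa_embeddings(signal, seg_len=8, win=3, hop=1):
    segments = first_swa(signal, seg_len)
    # Each short segment passes through the frame-level network exactly once.
    reps = [frame_level_net(s) for s in segments]
    # Overlap is created on the representations, not on the raw audio.
    return np.stack([pool(r) for r in second_swa(reps, win, hop)])

embeddings = twofold_swa_embeddings(np.arange(40, dtype=float))
print(embeddings.shape)
```

In this sketch, each segment is processed by the frame-level network only once, and the overlap comes from reusing its output across neighbouring windows. A conventional overlapping sliding window would instead recompute the frame-level features for the shared audio, which is presumably where the savings in floating-point operations arise.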