ISCA Archive Interspeech 2023

Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization

Ye-Rin Jeoung, Jeong-Hwan Choi, Ju-Seok Seong, Jehyun Kyung, Joon-Hyuk Chang

In this study, we explore self-distillation (SD) techniques to improve the performance of the transformer-encoder-based self-attentive (SA) end-to-end neural speaker diarization (EEND) model. We first apply SD approaches introduced in the automatic speech recognition field to the SA-EEND model to confirm their potential for speaker diarization. We then propose two novel SD methods for SA-EEND, which distill either the prediction output of the model or the SA heads of the upper blocks into the SA heads of the lower blocks. In this way, the high-level speaker-discriminative knowledge learned by the upper blocks is shared with the lower blocks, enabling the SA heads of the lower blocks to capture the discriminative patterns of overlapped speech from multiple speakers more effectively. Experimental results on the simulated and CALLHOME datasets show that SD generally improves the baseline performance and that the proposed methods outperform the conventional SD approaches.
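To make the block-to-block distillation idea concrete, the following is a minimal sketch of one plausible formulation: the attention maps of an upper encoder block act as a detached teacher for the attention maps of a lower block, and the resulting KL term is added to the usual diarization loss. The function names, tensor shapes, and loss weighting below are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of distilling upper-block self-attention heads into lower-block heads.
# Shapes, names, and the weighting scheme are assumptions for illustration.
import torch
import torch.nn.functional as F


def head_distillation_loss(student_attn: torch.Tensor,
                           teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence from a (detached) upper-block teacher attention map
    to a lower-block student attention map.

    Both tensors are assumed to be row-normalized attention probabilities
    of shape (batch, heads, frames, frames).
    """
    teacher = teacher_attn.detach()                 # upper block acts as a fixed teacher
    log_student = torch.log(student_attn.clamp_min(1e-8))
    return F.kl_div(log_student, teacher, reduction="batchmean")


def total_loss(diarization_loss: torch.Tensor,
               attn_maps: list[torch.Tensor],
               distill_weight: float = 0.1) -> torch.Tensor:
    """Combine the EEND diarization loss (e.g. permutation-invariant BCE)
    with a distillation term from the top block into the bottom block.

    `attn_maps` is assumed to hold per-block attention tensors ordered
    from the lowest to the highest encoder block.
    """
    sd_loss = head_distillation_loss(attn_maps[0], attn_maps[-1])
    return diarization_loss + distill_weight * sd_loss
```

In this sketch, detaching the teacher attention keeps gradients from flowing into the upper blocks, so only the lower blocks are pushed toward the upper blocks' attention patterns; the prediction-output variant described above would instead compare the model's final frame-level speaker posteriors against an intermediate block's auxiliary output.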