ISCA Archive Interspeech 2024

Specializing Self-Supervised Speech Representations for Speaker Segmentation

Séverin Baroudi, Thomas Pellegrini, Hervé Bredin

Self-supervised speech representation learning has been shown to be very effective for a wide range of speech processing downstream tasks. However, most of these models have been pretrained on clean, pre-segmented, single-speaker utterances, which is not representative of tasks involving realistic multi-speaker conversational speech, such as speaker diarization. WavLM pretraining mitigates this domain mismatch by using artificial mixtures of single-speaker utterances, and outperforms other pretrained models such as wav2vec2 or HuBERT for speaker diarization. We propose to further specialize WavLM for speaker diarization in two ways: pretraining on real-world multi-speaker conversational speech, and crafting the targets of the pretraining pretext task to best benefit speaker diarization. When finetuned with the recently proposed powerset multi-class cross-entropy loss, we outperform the state of the art, often by a large margin, on most speaker diarization benchmarks.
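The powerset formulation recasts per-frame multi-label speaker activity (any subset of speakers may be active) as a single multi-class target over all speaker subsets up to a maximum overlap, so a standard cross-entropy loss applies. The sketch below is an illustration of that idea only, not the paper's implementation; the function names and the pure-Python softmax are our own assumptions.

```python
import itertools
import math

def powerset_classes(num_speakers, max_overlap):
    # Enumerate every subset of speakers with at most `max_overlap`
    # members, including the empty set (non-speech). Each subset
    # becomes one class of the multi-class problem.
    classes = []
    for k in range(max_overlap + 1):
        classes.extend(itertools.combinations(range(num_speakers), k))
    return classes

def powerset_cross_entropy(logits, active_speakers, classes):
    # logits: one score per powerset class for a single frame.
    # active_speakers: set of speaker indices active in that frame.
    # The multi-label target collapses to a single class index,
    # so this is ordinary softmax cross-entropy.
    target = classes.index(tuple(sorted(active_speakers)))
    m = max(logits)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# With 3 speakers and at most 2 overlapping, there are 7 classes:
# {}, {0}, {1}, {2}, {0,1}, {0,2}, {1,2}.
classes = powerset_classes(3, 2)
loss = powerset_cross_entropy([0.0] * len(classes), {0, 1}, classes)
```

With uniform logits, the loss reduces to log of the number of classes, which is a convenient sanity check when wiring up a diarization head this way.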