Self-supervised speech representation learning has proven very effective for a wide range of downstream speech processing tasks. However, most of these models are pretrained on clean, pre-segmented, single-speaker utterances, which are not representative of tasks like speaker diarization that involve realistic multi-speaker conversational speech. WavLM pretraining mitigates this domain mismatch by using artificial mixtures of single-speaker utterances, and outperforms other pretrained models such as wav2vec2 or HuBERT on speaker diarization. We propose to further specialize WavLM for speaker diarization in two ways: pretraining on real-world multi-speaker conversational speech, and designing the targets of the pretraining pretext task so that they benefit speaker diarization the most. When the resulting model is fine-tuned with the recently proposed powerset multi-class cross-entropy loss, it outperforms the state of the art on most speaker diarization benchmarks, often by a large margin.
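To make the powerset formulation concrete, the sketch below shows the general idea in PyTorch: each frame is assigned a single class corresponding to the exact set of active speakers, which turns multi-label speaker activity detection into standard multi-class classification. This is an illustration under assumed tensor shapes, not the paper's implementation (the reference code lives in pyannote.audio); the function names powerset_classes and powerset_cross_entropy are hypothetical.

```python
import itertools

import torch
import torch.nn.functional as F


def powerset_classes(num_speakers=3, max_simultaneous=2):
    """Enumerate every subset of speakers up to the overlap limit.

    With 3 speakers and at most 2 active at once, this yields 7 classes:
    (), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2).
    """
    classes = []
    for k in range(max_simultaneous + 1):
        classes.extend(itertools.combinations(range(num_speakers), k))
    return classes


def powerset_cross_entropy(logits, multilabel, classes):
    """Standard cross-entropy over powerset classes.

    logits:     (batch, frames, num_classes) network outputs
    multilabel: (batch, frames, num_speakers) 0/1 per-speaker activity
    """
    # (num_classes, num_speakers) 0/1 matrix describing each class
    mapping = torch.zeros(len(classes), multilabel.shape[-1])
    for i, subset in enumerate(classes):
        mapping[i, list(subset)] = 1.0
    # a frame's target class is the subset matching its speaker set exactly
    target = (multilabel.unsqueeze(-2) == mapping).all(-1).float().argmax(-1)
    # cross_entropy expects (batch, num_classes, frames)
    return F.cross_entropy(logits.transpose(1, 2), target)
```

The appeal of this formulation is that overlapped speech becomes just another class, so fine-tuning needs no per-speaker detection thresholds: a single argmax over the powerset classes decodes each frame.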