This paper proposes an efficient speaker embedding (SE) extraction method that employs a twofold sliding window algorithm (SWA) for speaker diarization (SD) systems. Non-overlapping short segments are obtained through the first SWA and fed into the frame-level neural networks of a pre-trained SE model to extract frame-level representations. The neighboring frame-level representations are then concatenated along the time axis through the second SWA, which enables overlap between representations. The concatenated representations are used to extract multiple SEs. Additionally, we propose a fine-tuning strategy that employs a residual adapter and knowledge distillation on the pre-trained SE model to refine the frame-level representations. Experimental results on two SD benchmarks show that the proposed extraction method with a fine-tuned SE model reduces floating-point operations while maintaining the diarization error rate.
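
The twofold sliding window idea can be illustrated with a minimal sketch. All window sizes, hop lengths, and the stand-in `frame_level` function below are hypothetical placeholders, not the paper's actual pre-trained network or configuration; the sketch only shows how the first SWA yields non-overlapping segments and the second SWA forms overlapping concatenations of their frame-level representations.

```python
import numpy as np

def first_swa(signal, seg_len):
    """First SWA: split the signal into non-overlapping short segments."""
    n = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n)]

def frame_level(segment, dim=8):
    """Hypothetical stand-in for the pre-trained frame-level network:
    maps one segment to one frame-level representation vector."""
    rng = np.random.default_rng(len(segment))  # deterministic toy output
    return rng.standard_normal(dim)

def second_swa(frames, win, hop):
    """Second SWA: concatenate `win` neighboring frame-level vectors along
    the time axis; hop < win makes consecutive windows overlap. Each
    concatenated window would feed the SE extraction stage."""
    return [np.concatenate(frames[s:s + win])
            for s in range(0, len(frames) - win + 1, hop)]

signal = np.arange(1600, dtype=float)        # toy 1-D "audio"
segments = first_swa(signal, seg_len=160)    # 10 non-overlapping segments
frames = [frame_level(s) for s in segments]  # 10 frame-level vectors (dim 8)
windows = second_swa(frames, win=4, hop=2)   # overlapping concatenations
print(len(windows), windows[0].shape)        # 4 windows, each of dim 4*8=32
```

Because the hop (2) is smaller than the window (4), consecutive windows share half of their frame-level vectors, so multiple SEs can be extracted from a single frame-level pass over the audio.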