This paper presents our team's submission to the audio-visual speaker diarization (AVSD) track of the Multimodal Information Based Speech Processing (MISP) 2025 Challenge. The submitted system is adapted from the DiariZen pipeline, with a primary focus on optimizing it for the challenge dataset. The pipeline consists of a WavLM-based local end-to-end neural diarization module followed by two different clustering methods. To further refine the results, DOVER-Lap is used to fuse the outputs obtained from different input channels and clustering methods. Our final system achieves a diarization error rate (DER) of 8.33% on the evaluation set, a 46.3% relative improvement over the baseline, and ranks third in the AVSD track of the challenge.
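
As a sanity check on the reported figures, the relative improvement can be read as a relative DER reduction. The worked equation below is a sketch that assumes this standard definition, which the text above does not state explicitly. The symbols $\mathrm{DER}_{\mathrm{base}}$, $\mathrm{DER}_{\mathrm{sys}}$, and $\Delta_{\mathrm{rel}}$ are introduced here for illustration only, and the baseline value is back-computed from the two reported numbers rather than quoted from the challenge.

```latex
% Assumption: "relative improvement" means relative DER reduction.
% DER_sys = 8.33% and Delta_rel = 46.3% are the reported values;
% DER_base is implied by them, not a separately reported result.
\begin{align}
  \Delta_{\mathrm{rel}}
    &= \frac{\mathrm{DER}_{\mathrm{base}} - \mathrm{DER}_{\mathrm{sys}}}
            {\mathrm{DER}_{\mathrm{base}}}, \\
  \mathrm{DER}_{\mathrm{base}}
    &= \frac{\mathrm{DER}_{\mathrm{sys}}}{1 - \Delta_{\mathrm{rel}}}
     = \frac{8.33\%}{1 - 0.463}
     \approx 15.5\%.
\end{align}
```

Under this reading, the reported 8.33% corresponds to an absolute reduction of roughly 7.2 percentage points relative to the implied baseline.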