This paper describes the BIT (Beijing Institute of Technology) system submitted to the Conversational Speaker Diarization Challenge. We first present the details of the front-end system, which comprises a Speech Activity Detection (SAD) module and a speaker embedding extraction module. Then, building on the results of the clustering-based module, we investigate two iterative back-end models that use a multi-scale similarity measure: a Support Vector Classifier (SVC) system and a U-Net system. Finally, the DOVER algorithm is adopted for model fusion. Experimental results show that our system yields a diarization error rate (DER) of 5.18% in the challenge, a relative improvement of 34% over the baseline system provided by the organizer. Our system ranked first among all submitted systems without using any additional embedding extraction model.
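
To make the reported figures concrete, the following minimal sketch shows how a relative DER improvement is conventionally computed, and what baseline DER the two numbers above would imply. The function names are hypothetical, and the baseline value is only an estimate back-computed from the rounded 34% figure; the abstract does not state it.

```python
def relative_improvement(baseline_der: float, system_der: float) -> float:
    """Relative DER reduction of a system with respect to a baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der


def implied_baseline(system_der: float, rel_improvement: float) -> float:
    """Baseline DER implied by a system DER and a relative improvement."""
    if not 0 <= rel_improvement < 1:
        raise ValueError("rel_improvement must be in [0, 1)")
    return system_der / (1.0 - rel_improvement)


system_der = 5.18     # percent, as reported in the abstract
reported_gain = 0.34  # relative improvement, as reported (rounded)

# Estimate only: the abstract does not give the baseline DER directly.
baseline_estimate = implied_baseline(system_der, reported_gain)
print(f"implied baseline DER: about {baseline_estimate:.2f}%")
```

Because the 34% figure is rounded, the implied baseline should be read as approximate rather than as the organizer's reported value.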