Unsupervised learning has received significant attention in recent years. An important question in this area is whether a high-quality speaker verification model can be trained from large quantities of unlabeled speech data. Unsupervised methods such as data clustering often play a central role here, since they can uncover latent patterns in the data without any supervision. In this paper, we focus on developing an effective clustering method for speech data. We propose the locality constrained transitive distance, a distance measure that better models speech data with arbitrarily shaped clusters. We also propose a robust top-down clustering framework built on this distance measure to generate accurate cluster labels. Experimental results show that the proposed method performs well.