This paper describes a technique for selecting similar speakers based on distance metric learning. Our aim is to select a perceptually similar speaker from a multi-speaker database using acoustic features. The novelty of the proposed technique is that a transform matrix for the acoustic feature space is trained on the perceptual voice quality similarity between many speakers, obtained from a subjective evaluation. Given input speech, its acoustic features are transformed with the trained transform matrix, and speaker selection is then performed based on the Euclidean distance in the transformed acoustic feature space. We conduct speaker selection experiments and compare the results with those of speaker selection in the acoustic feature space without feature space transformation. The results indicate that the transformation based on distance metric learning achieves an error reduction rate of about 60%.
Index Terms: speaker selection, perceptual similarity, voice quality, distance metric learning
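
The selection step summarized in the abstract can be illustrated with a minimal sketch. It assumes the transform matrix has already been trained; the matrix values, feature vectors, speaker IDs, and function names below are hypothetical toy choices for illustration and are not taken from the paper.

```python
import math

def transform(matrix, features):
    """Apply a linear transform (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]

def select_similar_speaker(matrix, query, database):
    """Return the ID of the database speaker closest to `query`,
    measured by Euclidean distance in the transformed feature space."""
    q = transform(matrix, query)
    return min(
        database,
        key=lambda spk: math.dist(q, transform(matrix, database[spk])),
    )

# Hypothetical 2-D acoustic features for two database speakers.
database = {"spk_a": [0.0, 5.0], "spk_b": [2.0, 0.0]}
query = [0.0, 0.0]

# Baseline: identity matrix, i.e. no feature space transformation.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Assumed "trained" matrix that down-weights the second dimension,
# standing in for a dimension that matters little perceptually.
trained = [[1.0, 0.0], [0.0, 0.1]]

baseline_choice = select_similar_speaker(identity, query, database)
learned_choice = select_similar_speaker(trained, query, database)
print(baseline_choice, learned_choice)
```

In this toy setup the two matrices select different speakers, which shows how a learned transform can change the nearest neighbor even though the distance itself remains Euclidean.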