Speaker variability is a known challenge for emotion recognition; however, little work has examined how speaker similarity contributes to performance on the emotion classification task. In this paper, we investigate this question and find a clear link between speaker proximity and recognition accuracy. Motivated by this result, we propose emotion-based speaker clustering as a new speaker adaptation strategy: speaker proximity is used to cluster individual speakers' emotion models in the training set on a per-emotion basis, and the test speaker's emotion models are adapted from the closest cluster. A series of experiments explores how system performance varies with the clustering method, the number of clusters, and the amount of adaptation data. Results on the LDC Emotional Prosody and FAU Aibo corpora show that this method outperforms speaker bootstrap, both in reducing computational load and in achieving higher accuracy.
Index Terms: speaker clustering, emotion recognition, acoustic adaptation
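
To make the per-emotion clustering step concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes each training speaker's emotion model is summarized as a fixed-dimensional vector and uses scikit-learn's KMeans for clustering; the variable names, dimensionalities, and cluster count are our assumptions for illustration.

# Illustrative sketch of emotion-based speaker clustering (assumed details,
# not the paper's code): per-speaker emotion models are reduced to vectors,
# clustered separately for each emotion, and a test speaker is adapted from
# the centroid of the closest cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_speaker_models(speaker_vectors, n_clusters):
    # Cluster per-speaker model vectors for ONE emotion category.
    # speaker_vectors: (n_speakers, dim) array, one row per training speaker.
    # The fitted centroids serve as cluster-level emotion models.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(speaker_vectors)

def closest_cluster_model(kmeans, test_vector):
    # Pick the centroid nearest to the test speaker's adaptation vector.
    idx = kmeans.predict(test_vector.reshape(1, -1))[0]
    return kmeans.cluster_centers_[idx]

# Usage: fit one KMeans per emotion; the selected centroid would then seed
# adaptation (e.g., MAP) of the test speaker's model for that emotion.
emotions = ["angry", "happy", "neutral", "sad"]      # hypothetical label set
rng = np.random.default_rng(0)
train = {e: rng.normal(size=(20, 16)) for e in emotions}   # 20 speakers, 16-dim vectors
test_speaker = {e: rng.normal(size=16) for e in emotions}  # adaptation data per emotion

for e in emotions:
    km = cluster_speaker_models(train[e], n_clusters=4)
    base_model = closest_cluster_model(km, test_speaker[e])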