In this paper, we consider both speaker-dependent and listener-dependent aspects in the assessment of emotions in speech. We model the speaker dependencies in emotional speech production by two parameters that describe the individual's emotional expression behavior. Similarly, we model the listener's emotion perception behavior by a simple parametric model. These models form a basis for improving current automatic emotion recognition schemes, for example in man-machine interaction applications.
For this task, an emotional speech database covering the four emotion categories angry, happy, neutral, and sad was evaluated by 18 human listeners. For each of the 680 sentences, the evaluators rated the values of three emotion primitives (valence, activation, and dominance), each on a 5-point scale. The assessment results were used to calculate the distributions (centroids and covariances) of the emotion classes in the space spanned by the three emotion primitives. The individual classes formed separable clusters in the emotion space. Based on these results, we analyzed the variations of the emotion clusters as a function of speaker and listener.
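The cluster statistics described here can be computed directly from the per-sentence ratings. The sketch below is a minimal illustration in Python/NumPy, assuming the ratings have already been averaged over listeners into a single (valence, activation, dominance) point per sentence; the function and variable names are ours, chosen for illustration, not the paper's.

```python
import numpy as np

def class_distributions(points, labels):
    """Centroid and covariance of each emotion class in the 3-D
    (valence, activation, dominance) primitive space.

    points : (n_sentences, 3) array of per-sentence primitive ratings,
             e.g. each sentence's mean rating over the listeners
    labels : (n_sentences,) array of emotion category names
    """
    stats = {}
    for cat in np.unique(labels):
        pts = points[labels == cat]
        stats[cat] = (pts.mean(axis=0),            # 3-D class centroid
                      np.cov(pts, rowvar=False))   # 3x3 covariance matrix
    return stats

# Toy usage with synthetic ratings standing in for the evaluated corpus
rng = np.random.default_rng(0)
points = rng.uniform(1.0, 5.0, size=(100, 3))
labels = rng.choice(["angry", "happy", "neutral", "sad"], size=100)
clusters = class_distributions(points, labels)
```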
Across different speakers, we found that the main differences in the emotional speech were the position of the neutral cluster and the scaling of the emotions in the emotion primitive space. To capture this effect, we introduced the speaker-dependent parameters Emotion Expression Bias and Emotion Expression Amplification within this model representation and showed that the original class centroids could be reconstructed fairly accurately from them. From the perception viewpoint, we found that the listeners' ratings of emotional speech could be described as realizations of a normally distributed random variable. Based on this result, we propose the correlation of an individual listener's ratings with the mean ratings as the listener-dependent parameter, which could in turn be incorporated into the model training for automatic recognition.
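As an illustration of how these two results might be used, the sketch below shows (i) a hypothetical reconstruction rule in which a speaker's class centroids are derived from the global ones via the bias and amplification parameters, and (ii) the listener-dependent parameter computed as the correlation of one listener's ratings with the mean ratings. The functional form of the reconstruction and all names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def reconstruct_centroids(global_centroids, bias, amplification):
    """Hypothetical reconstruction of a speaker's class centroids from the
    speaker-dependent Emotion Expression Bias and Amplification: the global
    centroids are scaled about the global neutral centroid and shifted by
    the speaker's bias (an assumed form, not the paper's exact equations).

    global_centroids : dict mapping category -> (3,) global centroid
    bias             : (3,) offset of the speaker's neutral cluster
    amplification    : scalar (or (3,)) expression scale factor
    """
    neutral = np.asarray(global_centroids["neutral"])
    return {cat: neutral + np.asarray(bias)
                 + amplification * (np.asarray(c) - neutral)
            for cat, c in global_centroids.items()}

def listener_weight(listener_ratings, mean_ratings):
    """Listener-dependent parameter: Pearson correlation between one
    listener's ratings and the mean ratings over all listeners, computed
    per primitive (both arguments are (n_sentences,) arrays)."""
    return np.corrcoef(listener_ratings, mean_ratings)[0, 1]
```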