This paper investigates non-F0 cues to emotional speech. Two tokens of the word "leave" were taken from spontaneous speech: one spoken with emotion (sad) and the other without emotion. Using the morphing algorithm of STRAIGHT [1], we generated a continuum of 12 utterances running from the non-emotional "leave" to the emotional "leave", with F0 held constant at 300 Hz. Perception test results show that the morphed stimuli could be identified as sad, with stimulus 12 heard as the most emotional. A simple correlation analysis, together with a principal component analysis (PCA) of listeners' perceptual responses, suggests that formant frequencies are important cues for the perception of emotional (sad) speech, specifically lowered F2, F3, and F4.
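The sketch below is a simplified illustration of two steps described above. The first step lays out a 12-step continuum between a non-emotional and an emotional endpoint with F0 held fixed. The second step correlates a per-stimulus acoustic measure with listener scores. It is not STRAIGHT's morphing algorithm, which interpolates in a time-frequency-aligned representation [1]. Here, each scalar parameter is simply interpolated linearly. The function names `morph_continuum` and `pearson_r`, the parameter keys, and the endpoint values in the usage section are hypothetical placeholders, not measurements from this study.

```python
from typing import Dict, List, Sequence


def morph_continuum(neutral: Dict[str, float],
                    emotional: Dict[str, float],
                    n_steps: int = 12) -> List[Dict[str, float]]:
    """Linearly interpolate each scalar parameter between two endpoints.

    Simplified stand-in for STRAIGHT morphing (assumption): stimulus 1 equals
    the neutral endpoint and stimulus n_steps equals the emotional endpoint.
    """
    if n_steps < 2:
        raise ValueError("need at least two steps")
    if neutral.keys() != emotional.keys():
        raise ValueError("endpoints must define the same parameters")
    continuum = []
    for i in range(n_steps):
        alpha = i / (n_steps - 1)  # morphing rate: 0.0 (neutral) .. 1.0 (emotional)
        continuum.append({k: (1.0 - alpha) * neutral[k] + alpha * emotional[k]
                          for k in neutral})
    return continuum


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between an acoustic measure and perceptual scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    # Hypothetical placeholder endpoints (Hz); F0 is identical at both ends,
    # so it stays at 300 Hz across the continuum, as in the study design.
    neutral = {"F0": 300.0, "F2": 2000.0, "F3": 2900.0, "F4": 3900.0}
    emotional = {"F0": 300.0, "F2": 1800.0, "F3": 2700.0, "F4": 3600.0}
    for idx, stim in enumerate(morph_continuum(neutral, emotional), start=1):
        print(idx, {k: round(v, 1) for k, v in stim.items()})
```

With real data, the F2 (or F3, F4) values of the 12 stimuli would be paired with each stimulus's mean "sad" rating and passed to `pearson_r`. A negative coefficient would be consistent with lower formants being heard as sadder.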
[1] Kawahara, H.; Matsui, H. (2003). Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation. Proc. IEEE ICASSP 2003.