ISCA Archive Interspeech 2023

Speech Emotion Recognition using Decomposed Speech via Multi-task Learning

Jia-Hao Hsu, Chung-Hsien Wu, Yu-Hung Wei

Most recent studies in speech emotion recognition use powerful models to obtain robust features without considering disentangled speech components, which carry diverse, emotion-rich information useful for the task. In this study, an autoencoder is used as a speech decomposition model to obtain disentangled components, including content, timbre, pitch, and rhythm features, which are treated as emotion-rich features for speech emotion recognition. Multi-task training is then applied to jointly train speech emotion recognition, speaker recognition, speech recognition, and spectral reconstruction, exploiting the commonalities and differences across tasks. The proposed model achieved an accuracy of 77.50% on the four-class emotion recognition task of IEMOCAP. Experiments showed that the proposed methods effectively improve speech emotion recognition performance, outperforming state-of-the-art approaches.
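
The abstract describes four decomposed feature streams feeding task-specific heads trained under a joint objective. Below is a minimal PyTorch sketch of that idea; the layer sizes, head names, and loss weights are illustrative assumptions, not the paper's actual configuration, and the decomposition autoencoder itself is assumed to be provided separately.

# Hypothetical sketch of the multi-task setup described above.
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, feat_dim=256, n_emotions=4, n_speakers=10,
                 vocab_size=1000, n_mels=80):
        super().__init__()
        # The four streams (content, timbre, pitch, rhythm) are assumed to
        # come from a separate speech-decomposition autoencoder.
        self.emotion_head = nn.Linear(4 * feat_dim, n_emotions)   # emotion recognition
        self.speaker_head = nn.Linear(feat_dim, n_speakers)       # speaker recognition
        self.asr_head = nn.Linear(feat_dim, vocab_size)           # frame-level ASR logits
        self.recon_head = nn.Linear(4 * feat_dim, n_mels)         # spectral reconstruction

    def forward(self, content, timbre, pitch, rhythm):
        # Each stream: (batch, frames, feat_dim)
        fused = torch.cat([content, timbre, pitch, rhythm], dim=-1)
        emo = self.emotion_head(fused.mean(dim=1))    # utterance-level emotion logits
        spk = self.speaker_head(timbre.mean(dim=1))   # speaker identity from timbre
        asr = self.asr_head(content)                  # token logits from content stream
        mel = self.recon_head(fused)                  # reconstructed mel frames
        return emo, spk, asr, mel

# Joint objective: weighted sum over the four tasks (weights are assumptions).
model = MultiTaskSER()
B, T, D = 2, 100, 256
streams = [torch.randn(B, T, D) for _ in range(4)]
emo, spk, asr, mel = model(*streams)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
loss = (1.0 * ce(emo, torch.randint(0, 4, (B,)))
        + 0.3 * ce(spk, torch.randint(0, 10, (B,)))
        + 0.3 * ce(asr.reshape(-1, 1000), torch.randint(0, 1000, (B * T,)))
        + 0.5 * mse(mel, torch.randn(B, T, 80)))
loss.backward()

Weighting the auxiliary losses below the emotion loss reflects the common practice of keeping the main task dominant; the actual balancing used by the paper is not given in this abstract.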