ISCA Archive Interspeech 2022

Speech Emotion Recognition in the Wild using Multi-task and Adversarial Learning

Jack Parry, Eric DeMattos, Anita Klementiev, Axel Ind, Daniela Morse-Kopp, Georgia Clarke, Dimitri Palaz

Speech Emotion Recognition (SER) is an important and challenging task, especially when deploying systems in the wild, i.e. on unseen data, where they tend to generalise poorly. One promising approach to improving the generalisation capabilities of SER systems is to incorporate attributes of the speech signal, such as corpus or speaker information, which can be a source of overfitting or confusion for the model. In this paper, we investigate multi-task learning, where attribute prediction is given to the model as an auxiliary task, and adversarial learning, where the model is explicitly trained to predict attributes incorrectly. We compare two adversarial learning approaches: gradient reversal and an adversarial discriminator. We evaluate these approaches in a cross-corpus training setting, using two unseen corpora as test sets. We use four attributes (corpus, speaker, gender and language) and evaluate all possible combinations of these attributes. We show that both multi-task learning and adversarial learning improve SER performance in the wild, with the gradient reversal approach being the most consistent across attributes and test sets.
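As an illustration of the ideas named in the abstract, the sketch below shows how the gradient reaching a shared encoder could differ between multi-task learning and gradient reversal, and how the combinations of the four attributes can be enumerated. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `weight` parameter, and the choice to count only non-empty attribute combinations are all hypothetical, and gradients are represented as plain lists of floats rather than framework tensors.

```python
from itertools import combinations

# The four attributes named in the abstract.
ATTRIBUTES = ("corpus", "speaker", "gender", "language")


def attribute_subsets(attributes=ATTRIBUTES):
    """Return every non-empty combination of the given attributes.

    Assumption: the attribute-free baseline is not counted here.
    """
    return [
        combo
        for size in range(1, len(attributes) + 1)
        for combo in combinations(attributes, size)
    ]


def encoder_gradient(grad_emotion, grad_attribute, weight=1.0, mode="multitask"):
    """Combine the gradients that reach the shared encoder (hypothetical helper).

    multitask: the attribute gradient is added, so the encoder learns
               features that help predict the attribute.
    grl:       the attribute gradient is negated (gradient reversal), so the
               encoder is pushed towards features that hurt attribute
               prediction, while the attribute head itself trains normally.
    """
    if mode == "multitask":
        sign = 1.0
    elif mode == "grl":
        sign = -1.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [ge + sign * weight * ga for ge, ga in zip(grad_emotion, grad_attribute)]


if __name__ == "__main__":
    print(len(attribute_subsets()))  # number of non-empty attribute combinations
    print(encoder_gradient([1.0, 2.0], [0.5, -1.0], weight=0.5, mode="grl"))
```

In an automatic-differentiation framework, the same sign flip is usually implemented as a layer that acts as the identity in the forward pass and scales the gradient by a negative factor in the backward pass. The adversarial discriminator variant mentioned in the abstract is a separate technique and is not covered by this sketch.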