ISCA Archive Interspeech 2023

Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining

Yuan Gao, Chenhui Chu, Tatsuya Kawahara

This paper addresses effective pretraining with automatic speech recognition (ASR) and gender recognition to improve wav2vec 2.0 embeddings for speech emotion recognition (SER). Specifically, we propose a two-stage finetuning method that first pretrains the self-supervised learning (SSL) model on ASR to learn linguistic information, thereby avoiding the gradient conflict problem of conventional multi-task learning (MTL). Experimental results on the IEMOCAP dataset show that ASR pretraining significantly outperforms simple MTL with ASR, demonstrating the effectiveness of the two-stage finetuning method. We also investigate how to combine gender recognition with ASR pretraining to derive more effective embeddings for SER. Since the upper layers of the SSL model specialize in ASR, incorporating a skip-connection effectively embeds the gender information. Compared with the single-task learning baseline, our method achieves an unweighted accuracy (UA) of 76.10%, an absolute improvement of 3.97%.
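The two-stage pipeline and the skip-connection idea can be illustrated with a minimal, purely conceptual sketch. Everything below is an assumption for illustration: the encoder is a toy stack of dense layers standing in for wav2vec 2.0's transformer blocks, the stage-1 ASR finetuning is a placeholder, and the layer indices and dimensions (`DIM`, `N_LAYERS`, `skip_layer`) are hypothetical, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the wav2vec 2.0 encoder: a stack of dense layers.
# (The real model has 12 or 24 transformer layers; DIM here is arbitrary.)
DIM = 16
N_LAYERS = 4

def init_encoder():
    return [rng.normal(0, 0.1, (DIM, DIM)) for _ in range(N_LAYERS)]

def encode(x, layers):
    """Return the hidden state of every layer (needed for skip-connections)."""
    states, h = [], x
    for W in layers:
        h = np.tanh(h @ W)
        states.append(h)
    return states

def stage1_asr_finetune(layers):
    """Stage 1: finetune the encoder with an ASR objective (placeholder).
    In the real method this backpropagates an ASR loss through the encoder
    so the upper layers learn linguistic information."""
    # ... ASR finetuning would update `layers` here ...
    return layers

def ser_features(x, layers, skip_layer=1):
    """Stage 2: build SER input features from the ASR-finetuned encoder.
    Because the upper layers specialize for ASR, a skip-connection
    concatenates a lower layer (carrying gender/speaker cues) with the
    top-layer output before the emotion classifier."""
    states = encode(x, layers)
    return np.concatenate([states[skip_layer], states[-1]], axis=-1)

layers = stage1_asr_finetune(init_encoder())
x = rng.normal(0, 1, (10, DIM))      # 10 frames of toy input features
feats = ser_features(x, layers)      # shape: (10, 2 * DIM)
assert feats.shape == (10, 2 * DIM)
```

The point of the sketch is the data flow, not the modeling: stage 1 shapes the encoder with ASR alone (sidestepping MTL gradient conflict), and stage 2 reads features from two depths of the same encoder so that gender-relevant lower-layer information survives alongside the ASR-specialized top layer.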