ISCA Archive Interspeech 2022

Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition

Wei Yang, Satoru Fukayama, Panikos Heracleous, Jun Ogata

Speech-based multimodal affective computing has recently attracted significant research attention. Previous experimental results have shown that audio-only approaches perform worse than text-only approaches on sentiment analysis and emotion recognition tasks. In this paper, we propose a new strategy to improve the performance of uni-modal and bi-modal affective computing systems by fine-tuning two pre-trained self-supervised learning models (Text-RoBERTa and Speech-RoBERTa). We fine-tune the models on sentiment analysis and emotion recognition tasks using a shallow architecture, and apply cross-modal attention fusion to the models for further learning and final prediction or classification. We evaluate the proposed method on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets. The experimental results demonstrate that our approach outperforms existing state-of-the-art results on all benchmarks, establishing the effectiveness of the proposed method.
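The abstract does not give implementation details, so the following is only a minimal, hypothetical sketch of the cross-modal attention fusion idea it describes: features from a fine-tuned text encoder and a fine-tuned speech encoder attend over each other, and the attended representations are pooled and concatenated for a final prediction. The class name, dimensions, pooling, and classifier head below are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Hypothetical sketch of bi-directional cross-modal attention fusion.

    The actual layer sizes, the Text-RoBERTa / Speech-RoBERTa encoders,
    and the shallow fine-tuning heads from the paper are not reproduced.
    """
    def __init__(self, dim=768, num_heads=8, num_outputs=1):
        super().__init__()
        # Text queries attend over speech features, and vice versa.
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_outputs)

    def forward(self, text_feats, speech_feats):
        # text_feats:   (batch, T_text, dim)   from a fine-tuned text encoder
        # speech_feats: (batch, T_speech, dim) from a fine-tuned speech encoder
        t2s, _ = self.text_to_speech(text_feats, speech_feats, speech_feats)
        s2t, _ = self.speech_to_text(speech_feats, text_feats, text_feats)
        # Mean-pool each attended sequence and concatenate for prediction.
        fused = torch.cat([t2s.mean(dim=1), s2t.mean(dim=1)], dim=-1)
        return self.head(fused)

# Example with random tensors standing in for encoder outputs.
fusion = CrossModalAttentionFusion()
score = fusion(torch.randn(2, 50, 768), torch.randn(2, 120, 768))
print(score.shape)  # torch.Size([2, 1]), e.g. a sentiment regression score
```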