ISCA Archive Interspeech 2022

Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition

Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, Yizhuo Dong

Speech emotion recognition has made significant progress in recent years, and feature representation learning has received increasing attention, but extracting discriminative emotional features remains an open problem. In this paper, we propose MDSCM, a Multi-attention based Depthwise Separable Convolutional Model for speech emotional feature extraction, which reduces feature redundancy by separating spatial-wise convolution from channel-wise convolution. MDSCM also enhances feature discriminability through a multi-attention module that focuses on learning features carrying more emotional information. In addition, we propose an Audio-Visual Domain Adaptation Learning paradigm (AVDAL) to learn an audio-visual emotion-identity space. A shared audio-visual representation encoder is built to transfer emotional knowledge learned from the visual domain to complement and enhance the emotional features extracted from speech alone. A domain classifier and an emotion classifier are used to train the encoder, reducing the mismatch between domain features and enhancing the discriminability of the features for emotion recognition. Experimental results on the IEMOCAP dataset demonstrate that our proposed method outperforms other state-of-the-art speech emotion recognition systems, achieving 72.43% weighted accuracy and 73.22% unweighted accuracy. The code is available at https://github.com/Janie1996/AV4SER.
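To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's released implementation: a depthwise-separable block that separates spatial-wise (depthwise) from channel-wise (pointwise) convolution with a single channel-attention gate standing in for the multi-attention module, and a shared encoder feeding an emotion head and a domain (audio vs. visual) head as in AVDAL. All class names, layer sizes, and the number of emotion classes are illustrative assumptions; see the repository above for the authors' code.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableBlock(nn.Module):
    """Illustrative block: spatial-wise (depthwise) convolution followed by
    channel-wise (pointwise) convolution, plus a simple channel-attention gate.
    Sizes and structure are assumptions, not the released MDSCM code."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Spatial-wise convolution: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Channel-wise convolution: 1x1 mixing across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Single channel-attention gate (stand-in for the multi-attention module).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.pointwise(self.depthwise(x)))
        return x * self.attn(x)


class SharedEncoderWithHeads(nn.Module):
    """Illustrative AVDAL-style setup: a shared representation encoder whose
    output is scored by both an emotion classifier and a domain classifier."""

    def __init__(self, feat_dim=128, num_emotions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            DepthwiseSeparableBlock(1, 32),
            DepthwiseSeparableBlock(32, 64),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.emotion_head = nn.Linear(feat_dim, num_emotions)
        self.domain_head = nn.Linear(feat_dim, 2)  # audio vs. visual domain

    def forward(self, x):
        z = self.encoder(x)
        return self.emotion_head(z), self.domain_head(z)


if __name__ == "__main__":
    model = SharedEncoderWithHeads()
    spectrogram = torch.randn(8, 1, 64, 100)  # (batch, channel, mel bins, frames)
    emo_logits, dom_logits = model(spectrogram)
    print(emo_logits.shape, dom_logits.shape)  # torch.Size([8, 4]) torch.Size([8, 2])
```

In training, the emotion head would be optimized to be discriminative while the domain head drives the encoder to reduce the mismatch between audio and visual features, as described in the abstract; the exact losses and training schedule are specified in the paper and its repository.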