ISCA Archive Interspeech 2022

Audio-Visual Scene Classification Based on Multi-modal Graph Fusion

Han Lei, Ning Chen

The Audio-Visual Scene Classification (AVSC) task aims to classify scenes through joint analysis of the audio and video modalities. Most existing AVSC models rely on feature-level or decision-level fusion, which raises two problems: i) because the distributions of corresponding features differ substantially across modalities, directly concatenating them in feature-level fusion may not perform well; ii) decision-level fusion cannot fully exploit the common and complementary properties of the features, and of the corresponding similarities, across modalities. To address these problems, a Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for the AVSC task. First, a Deep Neural Network (DNN) is trained to extract essential features from each modality. Then, a Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed from the features of each modality. Finally, the DynaMic GCN (DM-GCN) and the ATtention GCN (AT-GCN) are introduced to realize feature-level and similarity-level fusion, respectively, ensuring classification accuracy. Experimental results on the TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN, achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.
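To make the described pipeline concrete, below is a minimal sketch, not the authors' implementation, of one plausible reading of the abstract: per-modality embeddings feed a sample-to-sample cross-similarity graph, which a single GCN propagation step then uses for fusion. The function names (`build_sscsg`, `gcn_layer`), the shapes, the cosine-similarity measure, and the single-layer symmetric normalization are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def build_sscsg(feat_a: torch.Tensor, feat_v: torch.Tensor) -> torch.Tensor:
    """Cross-similarity adjacency between paired audio/visual embeddings.

    feat_a, feat_v: (N, d) embeddings from the two modality DNNs.
    Returns an (N, N) non-negative cosine cross-similarity matrix.
    """
    a = F.normalize(feat_a, dim=1)
    v = F.normalize(feat_v, dim=1)
    return (a @ v.t()).clamp(min=0)  # keep edge weights non-negative for normalization

def gcn_layer(adj: torch.Tensor, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One symmetrically normalized GCN step: relu(D^-1/2 (A+I) D^-1/2 X W)."""
    adj = adj + torch.eye(adj.size(0))               # self-loops keep every degree >= 1
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    return F.relu(norm_adj @ x @ w)

# Toy usage: 8 paired samples, 128-d embeddings per modality (assumed sizes).
N, d, h = 8, 128, 64
feat_audio, feat_video = torch.randn(N, d), torch.randn(N, d)
adj = build_sscsg(feat_audio, feat_video)            # similarity-level view
x = torch.cat([feat_audio, feat_video], dim=1)       # (N, 2d) feature-level fusion
fused = gcn_layer(adj, x, torch.randn(2 * d, h))     # (N, h) graph-smoothed features
```

Propagating concatenated features over a similarity graph is one way feature-level and similarity-level information can interact; the paper's DM-GCN and AT-GCN presumably refine this basic step with dynamic graph construction and attention weighting.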