Recently, a novel task called audio-visual segmentation (AVS) has emerged, focusing on pixel-wise segmentation of sounding objects in videos. The task is particularly challenging because it requires pixel-level predictions conditioned on both the visual content of video frames and the accompanying audio. We propose a Motion-Based Audio-Visual Segmentation model, which incorporates motion information from optical flow maps into the AVS task for the first time. A Motion-Vision Attention (MVA) module is proposed to fuse motion and visual features and thereby exploit motion cues. Additionally, a Cross-Modal Bilateral-Attention (CMBA) module is introduced to integrate multimodal features through cross-modal attention. The proposed model is evaluated on two distinct datasets, S4 and MS3, on both of which its superior performance demonstrates its effectiveness and feasibility for the AVS task.
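For illustration, below is a minimal PyTorch sketch of bilateral cross-modal attention in the spirit of the CMBA module described above. The class name, tensor shapes, head count, and residual design are assumptions for the sake of a self-contained example; they are not the paper's actual implementation.

```python
# Minimal sketch of bilateral cross-modal attention between audio and
# visual features. All design details here (shapes, residuals, head
# count) are illustrative assumptions, not the paper's CMBA module.
import torch
import torch.nn as nn


class CrossModalBilateralAttention(nn.Module):
    """Attends visual features to audio and audio features to visual."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Two attention directions: vision queries audio, audio queries vision.
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual: (B, N_pixels, dim); audio: (B, N_audio, dim)
        v_out, _ = self.v2a(query=visual, key=audio, value=audio)
        a_out, _ = self.a2v(query=audio, key=visual, value=visual)
        # Residual connections preserve each stream's original information.
        return visual + v_out, audio + a_out


if __name__ == "__main__":
    B, n_pix, n_aud, dim = 2, 196, 5, 128
    cmba = CrossModalBilateralAttention(dim)
    v = torch.randn(B, n_pix, dim)
    a = torch.randn(B, n_aud, dim)
    v_fused, a_fused = cmba(v, a)
    print(v_fused.shape, a_fused.shape)  # (2, 196, 128) (2, 5, 128)
```

The bilateral structure lets each modality be refined by the other in a single module; an analogous query-key-value fusion could be sketched for the MVA module by treating optical-flow features as one of the two streams.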
In the Acknowledgements section, "... in part by Beijing Natural Science Foundation (No. L223032 and No. L223033)" should be corrected to "... in part by Beijing Natural Science Foundation (No. L233032 and No. L223033)."