ISCA Archive Interspeech 2022

Visually-aware Acoustic Event Detection using Heterogeneous Graphs

Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

Perception of auditory events is inherently multimodal, relying on both audio and visual cues. Most existing multimodal approaches process each modality with a modality-specific model and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and to represent detailed information and rich semantics. We propose a heterogeneous graph approach to the task of visually-aware acoustic event detection, in which graphs serve as a compact, efficient and scalable way to represent data. Through heterogeneous graphs, we efficiently model intra- and inter-modality relationships in both the spatial and temporal domains. Our model can easily be adapted to different scales of events through relevant hyperparameters. Experiments on AudioSet, a large benchmark dataset, show that our model achieves state-of-the-art performance.
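
As an illustration of the graph structure described above, the following is a minimal sketch of a heterogeneous graph with audio and video nodes connected by intra- and inter-modality temporal edges. It is not the authors' implementation. The names `HeteroGraph`, `temporal_edges` and `build_graph`, the timestamp-based linking rule, and the `intra_gap` and `inter_gap` hyperparameters are all assumptions made for this example; the abstract does not specify how edges are constructed.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Hypothetical container: two node types and typed edge lists."""
    num_audio: int
    num_video: int
    # Edge lists keyed by (source node type, target node type).
    edges: dict = field(default_factory=dict)


def temporal_edges(src_times, dst_times, max_gap, skip_self=False):
    """Link node i to node j when their timestamps differ by at most max_gap."""
    pairs = []
    for i, ts in enumerate(src_times):
        for j, td in enumerate(dst_times):
            if skip_self and i == j:
                continue  # no self-loops within a single modality
            if abs(ts - td) <= max_gap:
                pairs.append((i, j))
    return pairs


def build_graph(audio_times, video_times, intra_gap, inter_gap):
    """Assumed construction rule: temporal proximity decides every edge."""
    g = HeteroGraph(len(audio_times), len(video_times))
    # Intra-modality edges connect temporally close nodes of the same type.
    g.edges[("audio", "audio")] = temporal_edges(
        audio_times, audio_times, intra_gap, skip_self=True)
    g.edges[("video", "video")] = temporal_edges(
        video_times, video_times, intra_gap, skip_self=True)
    # Inter-modality edges connect audio segments to nearby video frames.
    g.edges[("audio", "video")] = temporal_edges(
        audio_times, video_times, inter_gap)
    return g


# Four audio segments and two video frames; timestamps (seconds) are made up.
graph = build_graph([0.0, 1.0, 2.0, 3.0], [0.5, 2.5],
                    intra_gap=1.0, inter_gap=0.5)
print(graph.edges[("audio", "video")])
```

In this sketch, widening `intra_gap` or `inter_gap` links nodes that are further apart in time, which is one plausible way that hyperparameters could adapt such a graph to events of different durations.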