Perception of auditory events is inherently multimodal, relying on both audio and visual cues. Most existing multimodal approaches process each modality with a modality-specific model and then fuse the resulting embeddings to encode the joint information. In contrast, we employ a heterogeneous graph that explicitly captures the spatial and temporal relationships between the modalities, preserving detailed information and rich semantics. We propose a heterogeneous graph approach to the task of visually-aware acoustic event detection, with graphs serving as a compact, efficient, and scalable representation of the data. Through the heterogeneous graphs, we efficiently model intra- and inter-modality relationships in both the spatial and temporal domains. Our model can easily be adapted to events of different scales through the relevant hyperparameters. Experiments on AudioSet, a large benchmark dataset, show that our model achieves state-of-the-art performance.
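As a rough illustration of the kind of structure described above, the sketch below assembles a heterogeneous audio-visual graph using PyTorch Geometric's HeteroData. The node features, dimensions, and edge types (temporal edges within each modality and time-aligned edges across modalities) are hypothetical placeholders, not the paper's actual construction or hyperparameters.

```python
import torch
from torch_geometric.data import HeteroData

def temporal_edges(num_nodes):
    # connect each node to its successor (and back) along the time axis
    src = torch.arange(num_nodes - 1)
    dst = src + 1
    return torch.stack([torch.cat([src, dst]), torch.cat([dst, src])], dim=0)

T = 10           # number of temporal segments (hypothetical)
audio_dim = 128  # per-segment audio embedding size (hypothetical)
video_dim = 512  # per-frame visual embedding size (hypothetical)

graph = HeteroData()
graph['audio'].x = torch.randn(T, audio_dim)   # audio segment nodes
graph['video'].x = torch.randn(T, video_dim)   # video frame nodes

# intra-modality temporal edges within each modality
graph['audio', 'temporal', 'audio'].edge_index = temporal_edges(T)
graph['video', 'temporal', 'video'].edge_index = temporal_edges(T)

# inter-modality edges linking time-aligned audio and video nodes
idx = torch.arange(T)
graph['audio', 'aligned_with', 'video'].edge_index = torch.stack([idx, idx], dim=0)
graph['video', 'aligned_with', 'audio'].edge_index = torch.stack([idx, idx], dim=0)
```

Such a graph could then be passed to a heterogeneous message-passing model; the specific edge types and neighborhood structure act as the hyperparameters that adapt the representation to events of different temporal and spatial scales.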