ISCA Archive Interspeech 2025

Video-to-Audio Generation with Fine-grained Temporal Semantics

Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu

With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains challenging to produce temporally aligned audio that matches the generated video, given its complex semantic information. In this work, inspired by the recent success of text-to-audio generation, we investigate video-to-audio (VTA) generation based on the latent diffusion model (LDM). Similar to pioneering explorations, our preliminary results show the great potential of LDM for the VTA task, but challenges remain in temporal consistency. To this end, we propose to enhance the temporal alignment of VTA with frame-level semantic information. Using the popular Grounding Segment Anything Model (Grounding SAM), we extract fine-grained semantics from video frames, enabling VTA to produce better-aligned audio signals. Experiments show the effectiveness of our approach in both objective and subjective evaluations, improving both audio quality and temporal alignment.
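The abstract does not spell out how the frame-level semantics are extracted, but a minimal sketch of Grounding-SAM-style extraction is given below, assuming the Hugging Face transformers ports of Grounding DINO (open-vocabulary detection) and SAM (promptable segmentation). The model checkpoints, thresholds, and the sound-source text prompt are illustrative assumptions, not details from the paper.

# Hedged sketch: per-frame semantic extraction with Grounding DINO + SAM.
# Checkpoints and prompt format are assumptions for illustration only.
import torch
from PIL import Image
from transformers import (
    AutoProcessor,
    GroundingDinoForObjectDetection,
    SamModel,
    SamProcessor,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Open-vocabulary detector: grounds text phrases to boxes in each frame.
dino_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
dino = GroundingDinoForObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny"
).to(device)

# Promptable segmenter: refines each grounded box into a pixel mask.
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base").to(device)

def frame_semantics(frame: Image.Image, prompt: str):
    """Return (labels, boxes, masks) for one video frame.

    `prompt` lists candidate sound sources as lowercase phrases
    separated by periods, e.g. "a dog. a person. a car.".
    """
    inputs = dino_processor(images=frame, text=prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = dino(**inputs)
    results = dino_processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[frame.size[::-1]],
    )[0]
    boxes = results["boxes"]  # (num_objects, 4) in xyxy pixel coordinates
    if boxes.numel() == 0:
        return [], boxes, None

    # Feed the detected boxes to SAM as prompts for this frame.
    sam_inputs = sam_processor(
        frame, input_boxes=[boxes.tolist()], return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        sam_outputs = sam(**sam_inputs)
    masks = sam_processor.image_processor.post_process_masks(
        sam_outputs.pred_masks.cpu(),
        sam_inputs["original_sizes"].cpu(),
        sam_inputs["reshaped_input_sizes"].cpu(),
    )[0]
    return results["labels"], boxes, masks

Running frame_semantics over frames sampled at a fixed rate yields a per-frame sequence of grounded labels and masks; in a pipeline like the one described, such sequences would serve as the fine-grained temporal conditioning for the LDM-based VTA generator.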