ISCA Archive Interspeech 2025

FoleyMaster: High-Quality Video-to-Audio Synthesis via MLLM-Augmented Prompt Tuning and Joint Semantic-Temporal Adaptation

Liming Liang, Luo Chen, Yuehan Jin, Xianwei Zhuang, Yuxin Xie, Yongkang Yin, Yuexian Zou

We study video-to-audio (V2A) generation, a critical task for automatically creating high-quality sound effects synchronized with silent video. Current V2A methods face three limitations: (1) inadequate textual annotations in existing datasets, (2) over-reliance on global video features, and (3) coarse temporal synchronization. To address these, we propose FoleyMaster with three key innovations: (1) we introduce the VGGSound Plus dataset, comprising 197,955 videos annotated by Qwen2-VL-7B with fine-grained event descriptions; (2) we develop a cross-attention semantic adapter that integrates token-level text embeddings with global video features via prompt learning, enabling precise alignment between visual events and sound; and (3) we design a probabilistic temporal adapter that modulates audio generation according to action prominence, replacing binary synchronization. Extensive experiments demonstrate that FoleyMaster achieves state-of-the-art V2A performance across all metrics. Demo and dataset are available.
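The cross-attention fusion of token-level text embeddings with a global video feature could be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, dimensions, and the choice of the video feature as query over text-token keys are all our assumptions.

```python
import numpy as np

def semantic_adapter(text_tokens, video_global, Wq, Wk, Wv):
    """Hypothetical cross-attention adapter: a global video feature
    queries token-level text embeddings and is residually fused with
    the attended result (illustrative only, not the paper's design)."""
    q = video_global @ Wq                       # (1, d) query from video
    K = text_tokens @ Wk                        # (T, d) keys from text tokens
    V = text_tokens @ Wv                        # (T, d) values from text tokens
    scores = (q @ K.T) / np.sqrt(Wq.shape[1])   # (1, T) scaled similarity
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over the T tokens
    return video_global + w @ V                 # residually fused conditioning

# Toy shapes for demonstration; real embedding sizes would differ.
rng = np.random.default_rng(0)
d, T = 16, 5
text = rng.normal(size=(T, d))                  # token-level text embeddings
video = rng.normal(size=(1, d))                 # global video feature
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = semantic_adapter(text, video, Wq, Wk, Wv)
```

The fused vector keeps the video feature's shape, so it can condition an audio generator wherever the original global feature was used.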