ISCA Archive Interspeech 2025

TVC-MusicGen: Time-Varying Structure Control for Background Music Generation via Self-Supervised Training

Chenyu Yang, Hangting Chen, Shuai Wang, Haina Zhu, Haizhou Li

Current text-to-music generation models typically cannot generate music with specified structures, and thus fail to meet certain customization needs in practical applications. To address this limitation, we propose a self-supervised Time-Varying Control method (TVC-MusicGen). Given the temporal boundaries and a text description for each segment, it effectively generates music that adheres to the corresponding structure. TVC-MusicGen supports generation from both text (text-to-music) and existing music clips (music-to-music), enabling structure editing and local style transfer. Additionally, we propose a generation-based approach to bridge the gap between the text and audio modalities in cross-modal models, which are typically used as feature extractors in text-to-music systems. Experiments on both language-model-based and diffusion-based models demonstrate that our approach achieves effective control without compromising overall quality.