ISCA Archive Interspeech 2023

Large Dataset Generation of Synchronized Music Audio and Lyrics at Scale using Teacher-Student Paradigm

Cristian Chivriga, Rinita Roy

Large models (e.g., GPT-3, CLIP, DALL-E) show remarkable few-shot and zero-shot capabilities when trained on hundreds of millions of samples. Despite this trend, no publicly available synchronized music audio and lyrics dataset of sufficient scale exists, nor does a reliable evaluation benchmark for assessing model performance. To address this issue, we build and release MusicLyric, a large public dataset of over 320k paired audio sequences and lyrics, totaling 1,200 hours of audio drawn from a collection of over 32,000 songs. The generation process follows the teacher-student paradigm, in which the student seeks to surpass the teacher by training on the larger pool of newly generated pseudo-alignments. The method is efficient and straightforward, requiring at least 3 iterations to produce high-quality data that can be scaled to a hundred thousand samples. We make our dataset, toolkit, and pre-trained models open-source.
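The abstract describes an iterative teacher-student loop for growing the pool of synchronized audio-lyrics pairs. Below is a minimal Python sketch of such a loop under stated assumptions: the function names (train_aligner, align, filter_by_confidence), the Song/AlignedPair structures, and the confidence threshold are illustrative placeholders, not the released toolkit's API.

```python
# Hypothetical sketch of an iterative teacher-student pseudo-alignment loop.
# All helper functions and data structures are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Song:
    audio_path: str   # path to the song audio
    lyrics: str       # raw, unsynchronized lyrics text


@dataclass
class AlignedPair:
    audio_segment: Tuple[float, float]  # (start, end) in seconds
    lyric_line: str


def train_aligner(pairs: List[AlignedPair]):
    """Train a lyrics-to-audio alignment model on synchronized pairs (placeholder)."""
    raise NotImplementedError


def align(model, song: Song) -> List[AlignedPair]:
    """Produce pseudo-alignments for an unlabeled song with a trained model (placeholder)."""
    raise NotImplementedError


def filter_by_confidence(pairs: List[AlignedPair], threshold: float) -> List[AlignedPair]:
    """Keep only pseudo-alignments above a confidence threshold (placeholder)."""
    raise NotImplementedError


def teacher_student_generation(seed_pairs: List[AlignedPair],
                               unlabeled_songs: List[Song],
                               iterations: int = 3) -> List[AlignedPair]:
    """Iteratively grow the synchronized dataset: each student is trained on the
    seed data plus the teacher's filtered pseudo-alignments, then becomes the
    teacher for the next iteration."""
    dataset = list(seed_pairs)
    teacher = train_aligner(dataset)
    for _ in range(iterations):
        pseudo: List[AlignedPair] = []
        for song in unlabeled_songs:
            pseudo.extend(filter_by_confidence(align(teacher, song), threshold=0.9))
        dataset = list(seed_pairs) + pseudo   # student sees more data than the teacher did
        teacher = train_aligner(dataset)      # student becomes the new teacher
    return dataset
```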