ISCA Archive Interspeech 2020

Speech-to-Singing Conversion Based on Boundary Equilibrium GAN

Da-Yi Wu, Yi-Hsuan Yang

This paper investigates the use of generative adversarial network (GAN)-based models for converting a speech signal into a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input and the F0 contour of the target singing output, the proposed model generates the spectrogram of a singing signal with a progressive-growing encoder/decoder architecture. Moreover, the model uses a boundary equilibrium GAN (BEGAN) loss term so that it can learn from both paired and unpaired data. The spectrogram is finally converted into a waveform with a separate GAN-based vocoder. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
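For readers unfamiliar with the boundary equilibrium GAN objective mentioned above, the following is a minimal sketch of the standard BEGAN update as it is generally formulated, not the exact loss used in this paper. It assumes the discriminator is an autoencoder whose reconstruction losses on real and generated spectrograms are already available as scalars. The function name `began_step` and the default values of `gamma` and `lambda_k` are illustrative assumptions.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One bookkeeping step of the standard BEGAN objective (hypothetical helper).

    loss_real: autoencoder reconstruction loss on a real sample, L(x)
    loss_fake: autoencoder reconstruction loss on a generated sample, L(G(z))
    k:         current balance term k_t, kept in [0, 1]
    gamma:     target ratio L(G(z)) / L(x) (assumed value, not from the paper)
    lambda_k:  step size for updating k (assumed value, not from the paper)
    """
    # The discriminator lowers L(x) and, weighted by k, raises L(G(z)).
    d_loss = loss_real - k * loss_fake
    # The generator lowers the reconstruction loss of its own samples.
    g_loss = loss_fake
    # Proportional control keeps the two losses near the ratio gamma.
    k_next = k + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)
    # Global convergence measure commonly tracked during BEGAN training.
    convergence = loss_real + abs(gamma * loss_real - loss_fake)
    return d_loss, g_loss, k_next, convergence


if __name__ == "__main__":
    d_loss, g_loss, k, m = began_step(loss_real=1.0, loss_fake=0.25, k=0.0)
    print(d_loss, g_loss, k, m)
```

In a training loop, `d_loss` and `g_loss` would be computed as tensors and backpropagated through the discriminator and generator respectively, while `k` is carried from one iteration to the next.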