ISCA Archive Interspeech 2024

X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning

Ji-Sang Hwang, Hyeongrae Noh, Yoonseok Hong, Insoo Oh

Singing voice synthesis (SVS) systems have exhibited a remarkable ability to synthesize natural singing voices. However, existing methods still depend on phoneme annotations in the musical score (MS) and are limited in their ability to generate code-mixed singing voices. Therefore, we propose X-Singer, a code-mixed SVS system that uses cross-lingual learning. First, we introduce an MS encoder to handle a realistic MS comprising code-mixed lyrics without phoneme annotations. The MS encoder adopts language code-switching to encode code-mixed lyrics, and mixture alignment to reduce the dependency on phoneme annotations. Furthermore, we use a conditional flow matching-based decoder to achieve high-quality SVS in a few sampling steps. We observe that X-Singer outperforms the baseline models in terms of naturalness for both intra- and cross-lingual SVS. Moreover, the proposed model can synthesize code-mixed singing voices through cross-lingual learning using a mixture of monolingual SVS datasets.
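To illustrate why a conditional flow matching-based decoder can work with only a few sampling steps, the following is a minimal sketch of fixed-step Euler sampling along a flow-matching ODE. It is not the X-Singer decoder: the function names `euler_sample` and `make_toy_velocity` are hypothetical, and the learned, score-conditioned velocity network is replaced here by an assumed closed-form velocity field that points straight at a known target vector. Under that assumption the sampling path is a straight line, so a coarse step count already lands on the target.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1.0 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned velocity network.

    Returns the straight-line conditional field (target - x) / (1 - t),
    which moves any point toward `target` along a straight path.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]   # assumed toy "acoustic feature" vector
    start = [0.0, 0.0, 0.0]     # stands in for a noise sample
    print(euler_sample(make_toy_velocity(target), start, num_steps=4))
```

In a real system the velocity field would be a trained neural network conditioned on the encoded musical score, and the paths would only be approximately straight, so the number of steps trades synthesis quality against speed.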