ISCA Archive Interspeech 2025

Augment Mandarin to Cantonese Speech Databases via Retrieval-Augmented Generation and Speech Synthesis

Fan Liu, Cheng Gong, Boyu Zhu, Ruihao Jing, Chunyu Qiang, Tianrui Wang, Xiao-Lei Zhang, Xuelong Li

Large-scale training data has significantly driven recent advances in speech recognition models. However, the scarcity of corpora for low-resource languages such as Cantonese remains a bottleneck for speech processing. With the continued development of large language models (LLMs) and speech synthesis technologies, it has become feasible to expand Cantonese corpora with automatically generated text and speech. In this paper, we use LLMs and text-to-speech (TTS) techniques to extend Mandarin data into a Mandarin-Cantonese parallel speech database (MCPSD). We propose a novel retrieval-augmented generation (RAG) method that incorporates a Cantonese knowledge base to improve both the diversity and the accuracy of the generated text. We then use TTS models to synthesize speech and design a data filter to ensure quality. Experimental results show that the generated parallel database is effective for fine-tuning Cantonese speech recognition and translation models.
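To make the retrieval step concrete, the following is a minimal sketch of how a Cantonese knowledge base could feed an LLM prompt. It is an illustration under stated assumptions, not the method of this paper: the toy `KNOWLEDGE_BASE`, the character-overlap score, and the names `retrieve` and `build_prompt` are all hypothetical, and a real system would more plausibly use an embedding-based retriever over a much larger lexicon.

```python
from collections import Counter

# Hypothetical toy knowledge base of (Mandarin phrase, Cantonese equivalent) pairs.
# The paper's actual knowledge base and its format are not specified here.
KNOWLEDGE_BASE = [
    ("什么", "乜嘢"),
    ("没有", "冇"),
    ("他们", "佢哋"),
    ("喜欢", "钟意"),
    ("现在", "而家"),
]


def char_overlap(a: str, b: str) -> int:
    """Count the characters two strings share (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())


def retrieve(sentence: str, k: int = 2) -> list[tuple[str, str]]:
    """Return up to k entries whose Mandarin side overlaps most with the sentence.

    Character overlap is an assumed stand-in for a real retriever.
    """
    scored = [(char_overlap(sentence, m), (m, c)) for m, c in KNOWLEDGE_BASE]
    scored = [s for s in scored if s[0] > 0]  # drop entries with no overlap
    scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep list order
    return [entry for _, entry in scored[:k]]


def build_prompt(sentence: str, k: int = 2) -> str:
    """Assemble an LLM prompt that includes the retrieved vocabulary hints."""
    hints = retrieve(sentence, k)
    lines = ["Translate the Mandarin sentence into written Cantonese."]
    if hints:
        lines.append("Reference vocabulary:")
        lines += [f"- {m} -> {c}" for m, c in hints]
    lines.append(f"Mandarin: {sentence}")
    lines.append("Cantonese:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("他们现在没有时间", k=3))
```

The resulting prompt would be passed to an LLM, and the Cantonese output would then go to the TTS and filtering stages. Those stages are omitted here because the abstract does not describe them in enough detail to sketch.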