ISCA Archive Interspeech 2025

Speech-guided Grapheme-to-Phoneme Conversion for Cantonese Text-to-Speech

Timothy Shin Heng Mak, King Yiu Suen, Albert Y.S. Lam

Grapheme-to-Phoneme (G2P) conversion is a crucial component of neural Text-to-Speech. Cantonese G2P is especially difficult and suffers from two pain points. First, a large number of commonly used characters have multiple correct pronunciations (jyutpings) that cannot be distinguished based on textual context. Second, there is a lack of accurate jyutping-labelled data for training character-to-jyutping (C2J) models. In this study, we propose a speech-guided C2J method that augments an off-the-shelf Automatic Speech Recognition model with a speech-guided C2J module. To overcome the data scarcity problem, we trained the model on speech generated by a commercial Text-to-Speech model. We show that this simple approach achieved a jyutping error rate (JER) of 2% on unseen, clean speech, an absolute improvement of 2.6 percentage points over the best text-based C2J method. For common polyphonic characters, the improvement was even greater, with the JER falling from 10.6% to 2.0%.
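
The abstract does not define how the JER is computed. As a minimal sketch, assuming it is defined analogously to word error rate (the edit distance between the predicted and reference jyutping sequences, divided by the reference length), it could be calculated as follows. The function name `jyutping_error_rate` and the example syllables are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def jyutping_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance between two jyutping sequences, divided by the reference length.

    Assumption: JER is treated like word error rate, with one jyutping
    syllable per token. The paper may use a different definition.
    """
    if not reference:
        raise ValueError("reference must contain at least one jyutping")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_syl in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_syl in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_syl == hyp_syl else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Illustrative only: one of two syllables is wrong.
print(jyutping_error_rate(["ngan4", "hong4"], ["ngan4", "hang4"]))  # 0.5
```

When the predicted and reference sequences have the same length, as they do when exactly one jyutping is emitted per character, this reduces to the fraction of characters whose jyutping is wrong.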