ISCA Archive Interspeech 2024

G2PA: G2P with Aligned Audio for Mandarin Chinese

Xingxing Yang

Training a Mandarin Chinese Text-To-Speech (TTS) system requires preprocessing the training speech data, which consists mainly of paired audio and text. When the system takes phonemes as input, one crucial preprocessing step is grapheme-to-phoneme (G2P) conversion, which maps Chinese characters to their phoneme representations. However, G2P conversion based on text alone may produce inaccurate results because of polyphones, that is, characters with multiple pronunciations. Previous research has attempted to address this ambiguity, but most approaches rely solely on text and disregard the pronunciation information available only in the raw audio. To overcome this limitation, we propose a G2P pipeline that leverages both audio and text inputs to resolve pronunciation ambiguity. The code and model weights for our approach are publicly available at the following GitHub repository: https://github.com/iooops/G2PA.
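To make the polyphone problem concrete, the following minimal sketch shows why text alone can leave a reading undetermined and how a per-character score derived from audio could break the tie. It is an illustration under stated assumptions, not the pipeline from the repository: the toy lexicon `CANDIDATES`, the functions `text_only_candidates` and `resolve_with_audio`, and the `audio_scores` values are all hypothetical stand-ins for whatever lexicon and acoustic evidence a real system would use.

```python
# Toy lexicon (illustrative only): character -> candidate pinyin readings.
# "行" is a polyphone: xing2 (as in 行人) or hang2 (as in 银行).
CANDIDATES = {
    "行": ["xing2", "hang2"],
    "银": ["yin2"],
    "人": ["ren2"],
}


def text_only_candidates(text):
    """Return the list of candidate readings for each character in text."""
    return [CANDIDATES.get(ch, ["<unk>"]) for ch in text]


def resolve_with_audio(text, audio_scores):
    """Pick, for each character, the candidate reading with the highest score.

    audio_scores is a hypothetical list with one dict per character, mapping
    a reading to a score (for example, from an acoustic model applied to the
    audio segment aligned with that character). Readings without a score are
    treated as least likely.
    """
    resolved = []
    for cands, scores in zip(text_only_candidates(text), audio_scores):
        best = max(cands, key=lambda r: scores.get(r, float("-inf")))
        resolved.append(best)
    return resolved


if __name__ == "__main__":
    # Text alone leaves two candidates for the second character.
    print(text_only_candidates("银行"))
    # Made-up scores standing in for evidence extracted from the audio.
    scores = [{"yin2": 0.9}, {"xing2": 0.1, "hang2": 0.8}]
    print(resolve_with_audio("银行", scores))
```

In this sketch the text-only lookup returns both readings of 行, and only the (assumed) audio scores select between them; how such scores are actually obtained and combined with text is the subject of the proposed pipeline.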