ISCA Archive Interspeech 2023

Dual Audio Encoders Based Mandarin Prosodic Boundary Prediction by Using Multi-Granularity Prosodic Representations

Ruishan Li, Yingming Gao, Yanlu Xie, Dengfeng Ke, Jinsong Zhang

Prosodic boundary prediction plays an important role in speech synthesis, phonetic understanding, and related tasks. In previous studies, supra-segmental features such as pitch, energy, and duration have been widely used to model Mandarin prosodic boundaries explicitly. In this paper, we propose to refine implicit prosodic representations with fine-grained information from complex acoustic features, including the mel-spectrogram and context vectors obtained from a pre-trained model. Pitch and energy are encoded as explicit prosodic representations. The two representations, extracted by dual audio encoders, are fused by a decoder composed mainly of cross-attention layers, and the fused representations are then used to predict Mandarin prosodic boundaries. The results indicate that the proposed method outperforms the baselines on the Mandarin prosodic boundary prediction task, particularly for minor prosodic phrases (#2).
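To make the fusion step concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention between two encoder outputs. It is an illustration only, not the authors' implementation: the random matrices stand in for the outputs of the two audio encoders, the sequence length and feature dimension are arbitrary, and the choice of the implicit representation as the query and the explicit representation as the key and value is an assumption that the abstract does not specify. Learned projections, multiple heads, and the boundary classifier are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries attend over keys/values."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]           # transpose keys
    scores = matmul(queries, keys_t)                     # (n_q x n_k)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, values), weights              # (n_q x dim), (n_q x n_k)


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for an encoder output."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_steps, dim = 6, 4  # arbitrary sizes chosen for illustration

# Placeholder for the encoder fed with mel-spectrogram and context vectors.
implicit_repr = random_matrix(n_steps, dim, rng)
# Placeholder for the encoder fed with pitch and energy.
explicit_repr = random_matrix(n_steps, dim, rng)

# Assumption: implicit representation queries the explicit one.
fused, attn_weights = cross_attention(implicit_repr, explicit_repr, explicit_repr)

print(len(fused), len(fused[0]))
```

In a full model, the fused sequence would be passed to a classification layer that assigns a boundary label to each position; that layer is left out here because the abstract gives no details about it.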