ISCA Archive Interspeech 2023

PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts

Yunxiang Li, Pengfei Liu, Xixin Wu, Helen Meng

Punctuation restoration from unsegmented speech transcripts is an essential task that improves the readability of transcripts and can facilitate various downstream NLP tasks. However, there is still a lack of systematic studies on punctuation restoration for Cantonese, a low-resource language. This paper introduces a new Cantonese punctuation corpus named PunCantonese, which consists of annotated spoken transcripts and written-style Wikipedia sentences, covering the major punctuation marks such as “,.?!” as well as code-switched sentences in Cantonese and English. We also propose a Transformer-based punctuation model that exploits pre-trained multilingual language models, adopts multitask learning for style and punctuation prediction, and introduces a novel Jyutping embedding layer to inject phonetic features that are not explicitly available in Cantonese characters. Experimental results show that these methods are effective in improving punctuation restoration, and that the Jyutping embedding layer brings an absolute F1 increase of more than 2%.