ISCA Archive Interspeech 2024

Prompting Large Language Models with Mispronunciation Detection and Diagnosis Abilities

Minglin Wu, Jing Xu, Xixin Wu, Helen Meng

Large Language Models (LLMs) have demonstrated significant achievements across diverse modalities. In this paper, we propose ATP-LLM, a framework that utilizes Audio and Text to Prompt LLMs to perform mispronunciation detection and diagnosis (MDD) in second language (L2) English. ATP-LLM consists of an audio encoder and an LLM decoder. The audio encoder converts L2 English speech into speech representations digestible by the LLM. These speech representations, together with the corresponding canonical pronunciation, serve as audio and text prompts that enable the LLM decoder to generate the phones actually articulated by L2 English learners. Experiments show that our proposed ATP-LLM achieves new state-of-the-art (SOTA) performance on the CU-CHLOE corpus with a Phone Error Rate (PER) of 8.56% and an F1 score of 82.02%, outperforming the existing wav2vec2-CTC method, whose PER and F1 are 8.98% and 80.93%, respectively.
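To make the prompting flow concrete, the following is a minimal sketch of the audio-and-text prompting idea described in the abstract: speech representations from an audio encoder and embedded canonical-pronunciation tokens are concatenated as a prompt, and a decoder predicts the phones produced by the learner. All module choices, dimensions, and names (AudioToTextPromptMDD, the adapter projection, the toy GRU encoder and single decoder layer) are assumptions for illustration only, not the paper's actual encoder, LLM, or prompt format.

```python
# Hypothetical sketch of audio+text prompting for MDD; not the authors' implementation.
import torch
import torch.nn as nn

class AudioToTextPromptMDD(nn.Module):
    def __init__(self, n_mels=80, audio_dim=256, llm_dim=512, vocab_size=100):
        super().__init__()
        # Toy stand-in for the audio encoder (a pretrained speech encoder
        # such as wav2vec2 would normally fill this role).
        self.audio_encoder = nn.GRU(n_mels, audio_dim, batch_first=True)
        # Adapter projecting speech representations into the decoder's embedding space.
        self.adapter = nn.Linear(audio_dim, llm_dim)
        # Placeholder "LLM decoder": embedding + one transformer decoder layer + output head.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.decoder = nn.TransformerDecoderLayer(llm_dim, nhead=8, batch_first=True)
        self.head = nn.Linear(llm_dim, vocab_size)

    def forward(self, fbank, canonical_ids):
        # 1) Encode L2 speech into representations usable as an audio prompt.
        speech_repr, _ = self.audio_encoder(fbank)        # (B, T, audio_dim)
        audio_prompt = self.adapter(speech_repr)          # (B, T, llm_dim)
        # 2) Embed the canonical-pronunciation token IDs as the text prompt.
        text_prompt = self.text_embed(canonical_ids)      # (B, L, llm_dim)
        # 3) Concatenate audio and text prompts; the decoder predicts
        #    the phones actually articulated by the learner.
        prompt = torch.cat([audio_prompt, text_prompt], dim=1)
        hidden = self.decoder(prompt, prompt)
        return self.head(hidden)                          # (B, T+L, vocab_size)

# Usage: filterbank features plus canonical phone IDs yield logits over output phones.
model = AudioToTextPromptMDD()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 100, (2, 40)))
print(logits.shape)  # torch.Size([2, 160, 100])
```

The dimensions here are deliberately small so the sketch runs quickly on CPU; a real system would use a pretrained speech encoder and a full LLM in place of the toy modules.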