ISCA Archive Interspeech 2023
ISCA Archive Interspeech 2023

A Joint Model for Pronunciation Assessment and Mispronunciation Detection and Diagnosis with Multi-task Learning

Hyungshin Ryu, Sunhee Kim, Minhwa Chung

Empirical studies report a strong correlation between pronunciation proficiency scores and phonetic errors in non-native speech assessments of human evaluators. However, the existing system of computer-assisted pronunciation training (CAPT) regards automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as independent and focuses on individual performance improvement. Motivated by the correlation between two tasks, we propose a novel architecture that jointly tackles APA and MDD using CTC and cross-entropy criteria with a multi-task learning scheme to benefit both tasks. To leverage additional knowledge transfer, Wav2Vec2-robust finetuned on TIMIT is used for the joint optimization. The integrated model significantly outperforms single-task learning, with a mean of 0.057 PCC increase for APA and 0.004 F1 increase for MDD on Speechocean762, which reveals that proficiency scores and phonetic errors are correlated for both human and model assessments.