ISCA Archive SLaTE 2023

Multi-task wav2vec2 Serving as a Pronunciation Training System for Children

Yaroslav Getman, Ragheb Al-Ghezi, Tamas Grosz, Mikko Kurimo

Computer-assisted pronunciation training (CAPT) tools increasingly rely on AI. Recent studies have demonstrated that neural systems pre-trained in a self-supervised fashion, such as wav2vec2, can overcome the data scarcity problem of most CAPT systems, especially when the target users are young children. Most current work, however, focuses on fine-tuning these models on a single task, which often leads to catastrophic forgetting and severely limits the capabilities of the fine-tuned model. In this work, we propose the use of multi-task learning and demonstrate how a single wav2vec2 model can simultaneously generate transcripts and assess the pronunciation of Swedish children with speech sound disorder and of child second-language learners of Finnish. We also investigate which layer is the most informative for the rating task. Our multi-task solutions provide higher pronunciation classification performance and competitive ASR accuracy compared with the corresponding single-task systems.
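As a rough illustration of the multi-task idea, the sketch below shows how a transcription loss and a pronunciation-rating loss computed from one shared encoder might be combined into a single training objective. It is a minimal sketch in plain Python and not the authors' implementation: the function names, the interpolation weight `alpha`, and the choice of mean pooling over the frames of one encoder layer are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time into one utterance-level vector.

    `frames` stands in for the hidden states of one encoder layer
    (hypothetical input; which layer to use is a design choice).
    """
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one example, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(asr_loss: float, rating_loss: float, alpha: float = 0.5) -> float:
    """Interpolate the two task losses; `alpha` is an assumed hyperparameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * asr_loss + (1.0 - alpha) * rating_loss


if __name__ == "__main__":
    # Toy values only: two frames of 2-dimensional features and 3 rating classes.
    utterance_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
    rating_loss = cross_entropy([0.2, 1.5, -0.3], target=1)
    # The ASR loss (e.g. a CTC loss) is assumed to be computed elsewhere.
    print(utterance_vector, multitask_loss(asr_loss=2.0, rating_loss=rating_loss))
```

With `alpha = 1.0` the objective reduces to the transcription loss alone and with `alpha = 0.0` to the rating loss alone, so a single-task system is a special case of this weighting.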