ISCA Archive Interspeech 2024

LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection

Tuan Nguyen, Huy Dat Tran

Pronunciation error detection algorithms rely on both acoustic and linguistic information to identify errors. However, these algorithms face challenges due to limited training data, often just a few hours, which is insufficient for building robust phoneme recognition models. This has led to the adoption of self-supervised learning models such as wav2vec 2.0. We propose an innovative approach that combines canonical text and audio inputs to balance the trade-off between accurate phoneme recognition and pronunciation scoring. This is achieved by feeding audio-encoded features and normalized canonical phoneme embeddings into a linguistic encoder comprising a multi-head attention (MHA) layer and a specially designed feed-forward network (FFN). Our system, with only 4.3 million parameters on top of pretrained wav2vec 2.0, achieved top-1 performance at the VLSP Vietnamese Mispronunciation Detection 2023 challenge, with a 9.72% relative improvement in F1 score over the previous state of the art.
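The fusion described above can be illustrated with a minimal sketch: canonical phoneme embeddings act as queries that attend over audio-encoded frames via scaled dot-product attention. This is a simplified single-head illustration of the general cross-attention idea, not the paper's actual multi-head implementation; the function names and toy dimensions are assumptions for demonstration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(phoneme_embs, audio_feats):
    """Single-head cross-attention sketch (hypothetical, for illustration):
    queries = canonical phoneme embeddings, keys/values = audio frames.
    Each phoneme embedding is mapped to a weighted average of audio frames."""
    d = len(audio_feats[0])
    scale = math.sqrt(d)  # standard scaled dot-product normalization
    out = []
    for q in phoneme_embs:
        weights = softmax([dot(q, k) / scale for k in audio_feats])
        out.append([sum(w * v[j] for w, v in zip(weights, audio_feats))
                    for j in range(d)])
    return out
```

In the paper's setting, the attention output would then pass through the FFN before pronunciation scoring; here the sketch stops at the fused representation.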