Pronunciation error detection algorithms rely on both acoustic and linguistic information to identify errors. However, these algorithms face challenges due to limited training data, often just a few hours, which is insufficient for building robust phoneme recognition models. This has led to the adoption of self-supervised learning models such as wav2vec 2.0. We propose an innovative approach that combines canonical text and audio inputs to balance the trade-off between accurate phoneme recognition and pronunciation scoring. This is done by feeding audio-encoded features and normalized canonical phoneme embeddings into a linguistic encoder comprising a multi-head attention (MHA) layer and a specifically designed feed-forward network (FFN) module. Our system, with only 4.3 million parameters on top of pretrained wav2vec 2.0, achieved top-1 performance in the VLSP Vietnamese Mispronunciation Detection 2023 challenge, with a 9.72% relative improvement in F1 score over the previous state-of-the-art.
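The sketch below (PyTorch) illustrates the kind of linguistic encoder described above: normalized canonical phoneme embeddings attend over wav2vec 2.0 audio features through an MHA layer followed by an FFN module. Dimensions, layer counts, and the residual/normalization layout are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn


class LinguisticEncoder(nn.Module):
    """Minimal sketch of an MHA + FFN linguistic encoder (assumed layout)."""

    def __init__(self, d_model=768, n_heads=8, d_ffn=2048, n_phonemes=100):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d_model)  # canonical phoneme embeddings
        self.emb_norm = nn.LayerNorm(d_model)                # "normalized" phoneme embeddings
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                            # assumed FFN design
            nn.Linear(d_model, d_ffn),
            nn.GELU(),
            nn.Linear(d_ffn, d_model),
        )
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, canonical_phones, audio_feats):
        # canonical_phones: (batch, n_phones) phoneme IDs from the canonical text
        # audio_feats:      (batch, n_frames, d_model) wav2vec 2.0 encoder outputs
        q = self.emb_norm(self.phone_emb(canonical_phones))
        # Each canonical phoneme attends over the audio frames.
        attn_out, _ = self.mha(q, audio_feats, audio_feats)
        x = q + attn_out                      # residual connection (assumed)
        x = self.out_norm(x + self.ffn(x))
        return x                              # per-phoneme features for pronunciation scoring
```

In this reading, the per-phoneme outputs would feed a lightweight scoring head, which keeps the added parameter count small relative to the frozen or fine-tuned wav2vec 2.0 backbone.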