ISCA Archive Interspeech 2023

Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model

Jaeyoung Lee, Masato Mimura, Tatsuya Kawahara

Knowledge about phonemes and their articulatory attributes can help improve automatic speech recognition (ASR) for low-resource languages. In this study, we propose a simple and effective approach to embed prior knowledge about phonemes into end-to-end ASR based on a large pre-trained model. An articulatory attribute prediction layer is constructed by embedding articulatory constraints in the layer initialization, which allows articulatory attributes to be predicted without explicit training. The final ASR transcript is inferred by combining the output of this layer with the encoded speech features. We apply our method to fine-tune a pre-trained XLS-R model on Ainu and Mboshi corpora, and achieve a 12% relative improvement when only 1 hour of target data is available. This demonstrates that incorporating phonetic prior knowledge is useful when combined with a large pre-trained model.
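The core idea of embedding articulatory constraints in layer initialization can be sketched as follows. This is a minimal illustration, not the paper's implementation: the phoneme inventory, attribute set, and the binary phoneme-to-attribute matrix below are all hypothetical, and the real system combines this layer's output with encoded speech features inside a fine-tuned XLS-R model.

```python
import numpy as np

# Hypothetical phoneme inventory and articulatory attributes (illustrative only).
phonemes = ["p", "b", "m", "a"]
attributes = ["bilabial", "voiced", "nasal", "vowel"]

# Binary phoneme-to-attribute matrix A: A[i, j] = 1 if phoneme i has attribute j.
A = np.array([
    [1, 0, 0, 0],  # /p/: bilabial
    [1, 1, 0, 0],  # /b/: bilabial, voiced
    [1, 1, 1, 0],  # /m/: bilabial, voiced, nasal
    [0, 1, 0, 1],  # /a/: voiced, vowel
], dtype=np.float32)

# The attribute prediction layer's weights are initialized from A, so
# attribute posteriors follow from phoneme posteriors without extra training.
def attribute_posteriors(phoneme_posteriors: np.ndarray) -> np.ndarray:
    """Map phoneme posteriors (T, n_phonemes) to attribute posteriors (T, n_attributes)."""
    return phoneme_posteriors @ A

# Example: a frame that is certainly /m/ activates bilabial, voiced, and nasal.
frame = np.array([[0.0, 0.0, 1.0, 0.0]], dtype=np.float32)
print(attribute_posteriors(frame))  # → [[1. 1. 1. 0.]]
```

Because the mapping is fixed by prior phonetic knowledge rather than learned, the layer provides a useful training signal even when the target-language data is as small as 1 hour.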