ISCA Archive Interspeech 2022

Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR

Han Zhu, Li Wang, Gaofeng Cheng, Jindong Wang, Pengyuan Zhang, Yonghong Yan

Self-supervised pre-training can effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training methods are task-agnostic, i.e., they can be applied to various downstream tasks. Although this enlarges their scope of application, the capacity of the pre-trained model is not fully utilized for the ASR task, and the learned representations may not be optimal for ASR. In this work, in order to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, in which task-specific semi-supervised pre-training refines the self-supervised pre-trained model for the ASR task and thus more effectively exploits the capacity of the pre-trained model to generate task-specific representations for ASR. Experiments show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase in pre-training time but significantly improves ASR performance on in-domain, cross-domain and cross-lingual datasets. Average relative WER reductions are 24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, we show through canonical correlation analysis that semi-supervised pre-training closes the representation gap between the self-supervised pre-trained model and the corresponding fine-tuned model.
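The abstract measures the representation gap with canonical correlation analysis (CCA). As a minimal sketch of how such a layer-wise comparison could be computed, the function below estimates the mean canonical correlation between two frame-aligned activation matrices via the standard QR/SVD route. The function name, array shapes, and synthetic inputs are illustrative assumptions, not the paper's implementation, which may use a projection-weighted or other CCA variant.

```python
import numpy as np

def cca_similarity(x, y):
    """Mean canonical correlation between two representation matrices.

    x: (n_frames, dim_x), y: (n_frames, dim_y) -- frame-aligned layer
    activations, e.g. from a pre-trained and a fine-tuned model.
    Assumes n_frames >= max(dim_x, dim_y).
    """
    # Center each view.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Orthonormalize each view; canonical correlations are then the
    # singular values of Qx^T Qy.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    corrs = np.linalg.svd(qx.T @ qy, compute_uv=False)

    # Clip tiny numerical overshoots above 1 and average.
    return float(np.clip(corrs, 0.0, 1.0).mean())

# Hypothetical usage with synthetic activations standing in for
# pre-trained vs. fine-tuned layer outputs.
rng = np.random.default_rng(0)
reps_pretrained = rng.standard_normal((2000, 256))
reps_finetuned = reps_pretrained @ rng.standard_normal((256, 256)) * 0.1 \
    + rng.standard_normal((2000, 256))
print(cca_similarity(reps_pretrained, reps_finetuned))
```

A higher mean correlation between corresponding layers would indicate that the pre-trained representations already lie close to those of the fine-tuned model, which is the kind of gap-closing effect the abstract attributes to semi-supervised pre-training.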