ISCA Archive Interspeech 2024

Self-training ASR Guided by Unsupervised ASR Teacher

Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe

Self-training has gained increasing attention due to its notable performance improvements in speech recognition. However, conventional self-training techniques have two key limitations: (1) a labeled dataset is required to train a teacher that produces pseudo-targets, and (2) the first teacher, trained on the small labeled dataset, suffers from over-fitting and generates noisy pseudo-targets for unseen datasets. Our approach adopts an unsupervised automatic speech recognition (UASR) model as the teacher, thus relying solely on unlabeled data. Because the proposed model also learns phonetic information from the UASR teacher at an intermediate layer, the pseudo-target at the higher layer contains more ASR-related information than that of Data2vec2. Experimental results on LibriSpeech show that our model outperforms Data2vec2, the state-of-the-art self-supervised learning model, achieving 8.9% and 4.3% relative word error rate reductions on test-clean and test-other, respectively.
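The abstract describes a student objective with two terms: phonetic guidance from the UASR teacher at an intermediate layer, and a pseudo-target prediction loss at a higher layer. The paper itself does not give the loss formulation here, so the following is a minimal, hypothetical sketch of such a combined objective; the function names, the cross-entropy/MSE choice of losses, and the weighting `alpha` are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of pseudo phone labels (shape: [T, V] vs [T])
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-9))

def self_training_loss(inter_logits, phone_pseudo, top_repr, teacher_repr, alpha=0.5):
    """Hypothetical two-term objective.

    inter_logits : intermediate-layer phone logits of the student, [T, V]
    phone_pseudo : pseudo phone labels from the UASR teacher, [T]
    top_repr     : student representation at a higher layer, [T, D]
    teacher_repr : teacher pseudo-target at that layer, [T, D]
    """
    # intermediate layer: learn phonetic information from the UASR teacher
    l_phone = cross_entropy(inter_logits, phone_pseudo)
    # higher layer: regress onto the teacher's pseudo-target (Data2vec2-style)
    l_target = np.mean((top_repr - teacher_repr) ** 2)
    return alpha * l_phone + (1 - alpha) * l_target
```

The intuition the sketch captures: because the intermediate layer is pushed toward phone labels, the representations feeding the higher layer, and hence the pseudo-targets derived from them, carry more ASR-related information than a purely self-supervised target.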