ISCA Archive Interspeech 2024

Self-Supervised Learning for ASR Pre-Training with Uniquely Determined Target Labels and Controlling Cepstrum Truncation for Speech Augmentation

Akihiro Kato, Hiroyuki Nagano, Kohei Chike, Masaki Nose

Utilizing a pre-trained large-scale model is an effective approach to developing automatic speech recognition (ASR) under limited-data conditions. However, pre-training in a supervised manner incurs high costs, particularly for transcription. To address this problem, recent research has introduced self-supervised learning, which has performed successfully on ASR tasks. For further improvement, we study a new approach to self-supervised learning for ASR, including methods for generating self-supervised labels and for data augmentation. Experimental results on the Libri-Light and LibriSpeech corpora, without any external language model, demonstrate that our proposed method outperforms a non-pre-trained Conformer under limited-data conditions in terms of character error rate (CER). Furthermore, the proposed method exhibits performance comparable to HuBERT, one of the state-of-the-art models for self-supervised representation learning.
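The cepstrum-truncation augmentation named in the title corresponds, in general signal-processing terms, to low-quefrency liftering of a frame's spectral envelope. The sketch below illustrates that generic operation on a single windowed frame; it is a minimal NumPy illustration, not the authors' implementation, and the truncation length L, the frame length, and the noise stand-in for speech are all assumptions introduced here for demonstration.

```python
# Minimal sketch of cepstrum truncation (low-quefrency liftering) used as a
# speech-augmentation step. Illustrative only; L is a hypothetical control
# parameter, not the paper's configuration.
import numpy as np

def truncate_cepstrum(frame: np.ndarray, L: int) -> np.ndarray:
    """Smooth the spectral envelope of one windowed frame by keeping only
    the first L real-cepstrum coefficients (and their symmetric mirror),
    then resynthesizing with the original phase. Assumes L >= 2."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # log-magnitude spectrum
    cepstrum = np.fft.irfft(log_mag)                # real cepstrum (symmetric)
    lifter = np.zeros_like(cepstrum)
    lifter[:L] = 1.0                                # keep low-quefrency part
    lifter[-(L - 1):] = 1.0                         # and its mirrored half
    smoothed_log_mag = np.fft.rfft(cepstrum * lifter).real
    # Recombine the smoothed magnitude with the original phase and invert.
    augmented_spectrum = np.exp(smoothed_log_mag) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(augmented_spectrum, n=len(frame))

# Example: apply a randomly chosen truncation length to a 25 ms frame at 16 kHz.
# White noise stands in for a speech frame purely for demonstration.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400) * np.hanning(400)
augmented = truncate_cepstrum(frame, L=int(rng.integers(20, 60)))
```

Varying L controls how strongly the spectral fine structure is smoothed, which is one plausible way a truncation parameter could act as an augmentation knob; how the paper actually controls truncation is described in the full text, not in this sketch.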