ISCA Archive Interspeech 2021

Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation

Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh

Reducing the length of the input speech-feature sequence through sub-sampling eases the alignment between speech features and the text transcript, and is an important way to improve end-to-end (E2E) automatic speech recognition (ASR) systems. This matters even more in Transformer-based ASR, because the self-attention mechanism in Transformers has O(n^2) complexity in the sequence length n during both training and inference. In this paper, we propose a Transformer-based ASR model that incorporates a time-reduction layer inside the Transformer encoder layers, in addition to the traditional sub-sampling applied to the input features, to further reduce the frame rate. This reduces the computational cost of self-attention in training and inference while also improving performance. Moreover, we introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD), which further improves the performance of our ASR model. Experiments on the LibriSpeech datasets show that our proposed methods outperform all other Transformer-based ASR systems. Furthermore, with language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models with just 30 million parameters, trained without any external data.
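
To illustrate why lowering the frame rate helps self-attention, the following is a minimal sketch and not the implementation used in the paper. It assumes that time reduction is done by concatenating adjacent frames, which is one common form of such a layer; the abstract does not specify the exact operation. The names `time_reduce` and `attention_score_count`, the zero-padding of trailing frames, and the reduction factor of 2 are all hypothetical choices made for illustration.

```python
def time_reduce(frames, factor=2):
    """Concatenate each group of `factor` consecutive frames into one frame.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Assumption: trailing frames are zero-padded so the length divides evenly.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    padded = list(frames)
    while len(padded) % factor != 0:
        padded.append([0.0] * dim)
    reduced = []
    for start in range(0, len(padded), factor):
        merged = []
        for frame in padded[start:start + factor]:
            merged.extend(frame)  # stack neighbouring frames along the feature axis
        reduced.append(merged)
    return reduced


def attention_score_count(n):
    """Number of pairwise scores one self-attention head computes over n frames."""
    return n * n


# Toy input: 7 frames with 2 features each (hypothetical values).
frames = [[float(t), float(t) + 0.5] for t in range(7)]
reduced = time_reduce(frames, factor=2)

print(len(frames), len(reduced), len(reduced[0]))
print(attention_score_count(len(frames)), attention_score_count(len(reduced)))
```

In this toy case, halving the number of frames from 7 to 4 (after padding) cuts the number of attention scores from 49 to 16, roughly the fourfold saving that the O(n^2) cost implies. The feature dimension doubles, so a real layer would typically follow the concatenation with a projection back to the model dimension; that step is omitted here.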