ISCA Archive Interspeech 2022

Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning

Sanli Tian, Keqi Deng, Zehan Li, Lingxuan Ye, Gaofeng Cheng, Ta Li, Yonghong Yan

Recently, end-to-end ASR models based on connectionist temporal classification (CTC) have achieved impressive results, but their performance is limited in lightweight models. Knowledge distillation (KD) is a popular model-compression method for improving the performance of lightweight models. However, CTC models emit spiky posterior distributions, which make the KL-divergence loss hard to converge and thus hinder the application of KD. To address this issue, we propose a new frame-level KD method that significantly improves the performance of lightweight CTC-based ASR models. First, we design a blank-frame-elimination mechanism that addresses the difficulty of applying KL-divergence to CTC posterior distributions. Second, we propose a consistent acoustic representation learning (CARL) method to improve the representation ability of the student model. Instead of matching the student model's features to the teacher model's features directly, CARL passes the outputs of both the teacher and student encoders through the teacher's pre-trained classifier and, with blank-frame elimination, encourages them to produce similar outputs, so that teacher and student represent acoustic features in a consistent way. Third, we introduce a two-stage training process that further improves ASR accuracy: stage 1 performs feature-level KD via cosine similarity, and stage 2 performs softmax-level KD via CARL. Compared to the vanilla CTC baseline model, our method reduces CER by a relative 16.1% on Aishell-1 and WER by a relative 26.0% on TED-LIUM 2.
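To make the two losses concrete, the following minimal PyTorch sketch shows one plausible reading of the abstract: frame-level KL distillation restricted to non-blank frames (blank-frame elimination) and the stage-1 cosine-similarity feature loss. The function names, the blank index blank_id=0, the temperature, and the "teacher's argmax is blank" criterion are our assumptions for illustration, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def blank_frame_elimination_kd(teacher_logits, student_logits,
                                   blank_id=0, temperature=1.0):
        # teacher_logits, student_logits: (batch, time, vocab) CTC logits.
        with torch.no_grad():
            teacher_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
            # Treat a frame as "blank" when the teacher's top symbol is <blank>;
            # those spiky blank-dominated frames are excluded from the KD loss.
            keep = (teacher_log_probs.argmax(dim=-1) != blank_id).float()  # (B, T)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # Per-frame KL(teacher || student), summed over the vocabulary axis.
        kl = F.kl_div(student_log_probs, teacher_log_probs,
                      log_target=True, reduction='none').sum(dim=-1)  # (B, T)
        # Average only over the retained (non-blank) frames.
        return (kl * keep).sum() / keep.sum().clamp(min=1.0)

    def feature_kd_loss(teacher_feat, student_feat):
        # Stage-1 feature-level KD: maximize per-frame cosine similarity between
        # encoder outputs (assumes matching feature dimensions; otherwise a
        # projection layer would be needed).
        return (1.0 - F.cosine_similarity(student_feat, teacher_feat, dim=-1)).mean()

In stage 2, CARL would feed both encoders' outputs through the teacher's frozen, pre-trained classifier before applying blank_frame_elimination_kd (e.g., student_logits = teacher_classifier(student_encoder_out)), so teacher and student are compared in the same output space rather than matched directly at the feature level.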