ISCA Archive Interspeech 2024

Learning from Back Chunks: Acquiring More Future Knowledge for Streaming ASR Models via Self Distillation

Yuting Yang, Guodong Ma, Yuke Li, Binbin Du, Haoqi Zhu, Liang Ruan

The performance of streaming automatic speech recognition (ASR) is often inferior to that of non-streaming recognition due to the absence of complete contextual information. However, the model cannot be improved simply by accessing more future frames, as this would incur considerable latency. In this paper, we propose the Future-aware Transformer (FaT), which models long-distance future contextual dependencies by transferring information from later chunks to earlier ones through look-ahead windows. Specifically, audio sequence features are encoded with chunk-based context. On this basis, the look-ahead window supplies additional context to each chunk and acts as a bridge that progressively transfers long-distance future information from later chunks to earlier ones via a future-aware distillation mechanism. Experiments on AISHELL-1 and AISHELL-2 demonstrate that the proposed method achieves higher accuracy and lower streaming latency than several strong baselines.
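As a rough illustration of the two ingredients the abstract describes, the sketch below builds (1) a chunk-based attention mask with a configurable look-ahead window and (2) a stop-gradient distillation loss that pulls a low-latency (small look-ahead) encoding toward a teacher encoding computed with a larger look-ahead. This is a minimal PyTorch sketch, not the authors' implementation: the names `chunk_size` and `lookahead` are illustrative assumptions, and the progressive chunk-to-chunk transfer in FaT is collapsed here into a single student/teacher pair that differ only in look-ahead size.

```python
# Minimal sketch (assumptions, not the paper's code): chunk-based attention
# with a look-ahead window, plus a simple future-aware distillation loss.
import torch
import torch.nn.functional as F


def chunk_lookahead_mask(T: int, chunk_size: int, lookahead: int) -> torch.Tensor:
    """Boolean mask where mask[t, s] is True if frame t may attend to frame s:
    every frame sees up to the end of its own chunk plus `lookahead` extra
    future frames."""
    idx = torch.arange(T)
    chunk_end = (idx // chunk_size + 1) * chunk_size          # exclusive end of each frame's chunk
    visible_end = torch.clamp(chunk_end + lookahead, max=T)   # extend by the look-ahead window
    return idx.unsqueeze(0) < visible_end.unsqueeze(1)


def future_aware_distill_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """L1 distance between the streaming (student) encoding and a
    stop-gradient teacher encoding that saw more future context."""
    return F.l1_loss(student, teacher.detach())


# Usage: one self-attention layer stands in for the encoder.
T, D = 16, 32
attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
x = torch.randn(1, T, D)

# Student: small look-ahead (low latency); teacher: larger look-ahead.
m_student = chunk_lookahead_mask(T, chunk_size=4, lookahead=2)
m_teacher = chunk_lookahead_mask(T, chunk_size=4, lookahead=8)

# In PyTorch, True entries of a boolean attn_mask are *blocked*, so invert.
h_student, _ = attn(x, x, x, attn_mask=~m_student)
h_teacher, _ = attn(x, x, x, attn_mask=~m_teacher)

loss = future_aware_distill_loss(h_student, h_teacher)
print(loss.item())
```

The design point the sketch tries to capture is that distillation, rather than a wider attention window at inference time, carries the future information: the student mask (and hence the streaming latency) stays small, while the teacher's larger look-ahead influences the student only through the training loss.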