ISCA Archive Interspeech 2025

Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR

Carlos Carvalho, Jinchuan Tian, William Chen, Yifan Peng, Alberto Abad, Shinji Watanabe

While transformer-based models excel in short-form (SF) automatic speech recognition, the quadratic complexity of the self-attention mechanism introduces significant challenges for long-form (LF) audio. To address this, we compare strong linear transformer variants: Fastformer, SummaryMixing, BiMamba, and E-Branchformer with Flash Attention (EBranch-FA). For the latter, we also explore rotary positional encodings. Additionally, we propose a new, challenging LF benchmark derived from the LibriHeavy corpus, featuring development and test sets with varying average durations to enable comprehensive evaluation across different temporal scales. Furthermore, we propose KNN-MAN, a memory system for inference that can be applied to any existing encoder-decoder model without additional training. For example, with BiMamba, we reduce the word error rate from 18.8% to 17.5% on our LF Test Clean set derived from LibriSpeech.
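The abstract does not detail the mechanics of KNN-MAN; the sketch below illustrates the general idea of training-free k-NN memory inference in the style of kNN-LM (Khandelwal et al., 2020), where a datastore of cached decoder states is queried at each decoding step and the retrieved token distribution is interpolated with the model's own output. The function name, hyperparameters `k` and `lam`, and the brute-force search are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def knn_interpolated_probs(hidden, keys, values, model_probs,
                           k=8, lam=0.25, temperature=1.0):
    """Sketch of kNN-LM-style memory inference (assumed, not KNN-MAN's
    exact formulation): interpolate the base model's token probabilities
    with a k-NN distribution retrieved from a datastore of decoder states.

    hidden:      (d,) query decoder state at the current step
    keys:        (N, d) cached decoder states from reference data
    values:      (N,) integer token ids paired with each cached state
    model_probs: (V,) softmax output of the base encoder-decoder model
    """
    # Brute-force L2 search; a real system would use an ANN index (e.g. FAISS).
    dists = np.linalg.norm(keys - hidden, axis=1)
    nn = np.argsort(dists)[:k]

    # Turn negative distances into a normalized weighting over the k neighbors.
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()

    # Scatter neighbor weights onto the vocabulary (duplicate tokens accumulate).
    knn_probs = np.zeros_like(model_probs)
    np.add.at(knn_probs, values[nn], weights)

    # Linear interpolation, applied at inference only, with no extra training.
    return lam * knn_probs + (1.0 - lam) * model_probs
```

Because the interpolation touches only the output distribution at decode time, a mechanism of this kind can wrap any existing encoder-decoder ASR model, consistent with the abstract's claim that no additional training is required.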