ISCA Archive Interspeech 2025

Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation

Zhao Yang, Rui Jiang, Yue Heng Yeo, Xiao Fu, Wei Xi, Jizhong Zhao

Recent developments in large-scale speech foundation models have further pushed the boundaries of automatic speech recognition (ASR), making these models excellent candidates for multi-modal extension. In this work, we propose AVWhisper-LoRA, an extension of the Whisper model that incorporates an auxiliary visual encoder to enable audio-visual speech recognition (AVSR) with only lightweight trainable parameters. Our approach capitalizes on the existing attention mechanisms of the well-trained Whisper model, integrating visual information through both self-attention and cross-attention interactions. Additionally, we introduce lightweight trainable LoRA adapters into the frozen Whisper model to enable effective adaptation to the multi-modal target domain during training. Experimental results on the LRS3-TED dataset demonstrate that our method consistently outperforms state-of-the-art methods, particularly in challenging acoustic environments.
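The core parameter-efficient idea, keeping the pretrained weight frozen and learning only a low-rank residual, can be sketched as below. This is a minimal illustration of a generic LoRA linear layer, not the paper's actual adapter placement inside Whisper's attention blocks (the class name, dimensions, and hyperparameters are illustrative assumptions):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update scale * B @ A.

    Hypothetical minimal sketch: in LoRA, A is a small down-projection,
    B a zero-initialized up-projection, so training starts from the
    unmodified pretrained behavior and only A and B are updated.
    """

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
        self.A = 0.01 * rng.standard_normal((r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))                    # trainable up-projection (zero init)
        self.scale = alpha / r                           # standard LoRA scaling

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T; with B = 0 this equals the frozen base output
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

x = np.ones((2, 16))
layer = LoRALinear(16, 8)
base_out = x @ layer.W.T
# Zero-initialized B means the adapter is a no-op before training
assert np.allclose(layer.forward(x), base_out)
```

Only `A` and `B` (rank `r` each) would receive gradients, which is what keeps the number of trainable parameters small relative to the frozen backbone.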