ISCA Archive Interspeech 2024

Integrating Speech Self-Supervised Learning Models and Large Language Models for ASR

Ling Dong, Zhengtao Yu, Wenjun Wang, Yuxin Huang, Shengxiang Gao, Guojiang Zhou

The integration of Large Language Models (LLMs) and speech Self-Supervised Learning (SSL) models has garnered increasing attention due to their potential to enhance tasks such as Automatic Speech Recognition (ASR) and Speech Translation (ST), thereby improving a model’s “listening and writing” capabilities without requiring large amounts of labeled data. However, effectively aligning speech representations with the LLM remains a challenge. In this paper, we explore the potential of connecting a speech pretrained model with a decoder-only LLM for the ASR task under the encoder-decoder framework. We employ a word boundary-aware compression method along with the optimal transport algorithm to mitigate the modality gap between speech and text in both length and semantics. Experiments conducted on the LibriSpeech dataset demonstrate that our proposed method achieves satisfactory results compared to mainstream End-to-End ASR models.
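The abstract does not specify the paper's exact compression or alignment objective. As a rough illustration of the optimal-transport idea it references, the sketch below computes an entropic-OT (Sinkhorn) transport plan between a compressed speech feature sequence and a sequence of text embeddings; the cost function, marginals, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinkhorn_alignment(speech_feats, text_feats, eps=0.1, n_iters=50):
    """Entropic-OT transport plan between speech and text feature sequences.
    Cost = pairwise squared Euclidean distance; uniform marginals.
    (Illustrative sketch only -- not the paper's exact objective.)"""
    # Cost matrix: C[i, j] = ||s_i - t_j||^2
    C = ((speech_feats[:, None, :] - text_feats[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)  # Gibbs kernel
    n, m = C.shape
    a = np.full(n, 1.0 / n)  # uniform marginal over speech frames
    b = np.full(m, 1.0 / m)  # uniform marginal over text tokens
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):  # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]  # transport plan (soft alignment)
    ot_loss = (P * C).sum()          # entropic approximation of the OT cost
    return P, ot_loss

# Toy example: 6 compressed speech frames vs. 4 text token embeddings.
rng = np.random.default_rng(0)
S = rng.normal(size=(6, 8))   # hypothetical compressed speech features
T = rng.normal(size=(4, 8))   # hypothetical text embeddings
P, loss = sinkhorn_alignment(S, T)
```

In a training setup such a transport cost could serve as an auxiliary alignment loss, encouraging the compressed speech representations to land near the LLM's text embedding space.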