ISCA Archive Interspeech 2023

WhiSLU: End-to-End Spoken Language Understanding with Whisper

Minghan Wang, Yinglu Li, Jiaxin Guo, Xiaosong Qiao, Zongyao Li, Hengchao Shang, Daimeng Wei, Shimin Tao, Min Zhang, Hao Yang

Spoken Language Understanding (SLU) systems commonly use cascaded architectures, which are prone to error propagation, information loss, high cost, and latency. This has made end-to-end (E2E) SLU an active research topic. E2E SLU, however, suffers from insufficient training data, so most previous work relies on pretrained acoustic models. Because the solution spaces of the pre-training task and the SLU task often differ substantially, E2E SLU models have difficulty surpassing cascaded models. To address this, we propose using OpenAI's Whisper model for SLU tasks. We employ the Sequence-level Multitask Learning (SML) paradigm, which encodes multiple ASR-related tasks into a sequence for learning. Our method outperforms the E2E baseline by a large margin (a 10% improvement in exact match (EM) score) and even outperforms cascaded models, achieving a 77% EM score on the STOP dataset.
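
As a rough illustration of the idea of encoding several tasks into one target sequence, the sketch below joins a transcript and a semantic parse with task-marker tokens and scores predictions by exact match. It is a minimal sketch under stated assumptions, not the authors' implementation: the marker `<|slu|>`, the ordering of the segments, the helper names (`build_sml_target`, `extract_parse`, `exact_match`), and the example parse string are all hypothetical, since the abstract does not specify the actual sequence format.

```python
from typing import List

# Task-marker tokens. These are assumptions for illustration only;
# the abstract does not state which markers the method actually uses.
ASR_TAG = "<|transcribe|>"
SLU_TAG = "<|slu|>"        # hypothetical marker for the SLU segment
EOS_TAG = "<|endoftext|>"


def build_sml_target(transcript: str, parse: str) -> str:
    """Join an ASR transcript and an SLU parse into one target sequence."""
    return f"{ASR_TAG} {transcript.strip()} {SLU_TAG} {parse.strip()} {EOS_TAG}"


def extract_parse(sequence: str) -> str:
    """Recover the SLU segment from a sequence; empty string if no marker."""
    _, _, tail = sequence.partition(SLU_TAG)
    return tail.replace(EOS_TAG, "").strip()


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to their reference after
    whitespace normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(
        " ".join(p.split()) == " ".join(r.split())
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Illustrative example, not taken from the STOP dataset.
    target = build_sml_target(
        "what is the weather in boston",
        "[IN:GET_WEATHER what is the weather in [SL:LOCATION boston ] ]",
    )
    print(target)
    print(exact_match([extract_parse(target)],
                      ["[IN:GET_WEATHER what is the weather in [SL:LOCATION boston ] ]"]))
```

Placing the transcript before the parse in this sketch lets a decoder condition the parse on its own transcription; whether the paper orders the segments this way is not stated in the abstract.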