ISCA Archive Interspeech 2024

Using Large Language Model for End-to-End Chinese ASR and NER

Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang

Mapping speech tokens into the same feature space as text tokens has become the dominant paradigm for integrating the speech modality into decoder-only large language models (LLMs). An alternative is to use an encoder-decoder architecture that incorporates speech features through cross-attention. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches on Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. We evaluate their performance using the F1 score and a fine-grained taxonomy of ASR-NER errors. Our experiments reveal that the encoder-decoder model outperforms the decoder-only model when the context is short, while the decoder-only model benefits from a long context because it fully exploits all layers of the LLM. Additionally, we obtain a state-of-the-art F1 score of 0.805 on the AISHELL-NER test set by using chain-of-thought NER, which first infers long-form ASR transcriptions and then predicts NER labels.
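
To make the decoder-only integration concrete, the sketch below shows the general idea of projecting Whisper encoder outputs into an LLM's text-embedding space and prepending them to the text-token embeddings. This is a minimal illustration, not the authors' exact implementation: the feature dimensions, downsampling factor, and module names (SpeechProjector, build_decoder_only_inputs) are assumptions chosen for clarity.

import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps Whisper encoder features to the LLM embedding dimension (illustrative)."""
    def __init__(self, speech_dim: int = 1280, llm_dim: int = 4096, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, T, speech_dim); stack adjacent frames to shorten the sequence
        b, t, d = speech_feats.shape
        t = t - t % self.downsample
        x = speech_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(x)  # (batch, T / downsample, llm_dim)

def build_decoder_only_inputs(speech_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # Concatenate projected speech "tokens" with text-token embeddings along the time axis,
    # so the decoder-only LLM attends to speech and text in a single sequence.
    return torch.cat([speech_embeds, text_embeds], dim=1)

if __name__ == "__main__":
    projector = SpeechProjector()
    whisper_out = torch.randn(2, 1500, 1280)  # placeholder for Whisper encoder output
    text_emb = torch.randn(2, 32, 4096)       # placeholder for ChatGLM3 token embeddings
    speech_emb = projector(whisper_out)
    llm_inputs = build_decoder_only_inputs(speech_emb, text_emb)
    print(llm_inputs.shape)  # torch.Size([2, 407, 4096])

In this style of integration the full stack of LLM decoder layers processes the speech-derived embeddings, whereas an encoder-decoder variant would instead feed the Whisper features through cross-attention at each decoder layer; the abstract's comparison concerns exactly this trade-off.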