ISCA Archive Interspeech 2022

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with multi-talker transcription at low latency. We evaluate the proposed model on a joint task of ASR and SID/SD using the LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.

doi: 10.21437/Interspeech.2022-253

Cite as: Kanda, N., Wu, J., Wu, Y., Xiao, X., Meng, Z., Wang, X., Gaur, Y., Chen, Z., Li, J., Yoshioka, T. (2022) Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings. Proc. Interspeech 2022, 521-525, doi: 10.21437/Interspeech.2022-253

@inproceedings{kanda22_interspeech,
  author={Naoyuki Kanda and Jian Wu and Yu Wu and Xiong Xiao and Zhong Meng and Xiaofei Wang and Yashesh Gaur and Zhuo Chen and Jinyu Li and Takuya Yoshioka},
  title={{Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings}},
  booktitle={Proc. Interspeech 2022},
  year={2022},
  pages={521--525},
  doi={10.21437/Interspeech.2022-253}
}