ISCA Archive Interspeech 2023

Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach

Midia Yousefi, Naoyuki Kanda, Dongmei Wang, Zhuo Chen, Xiaofei Wang, Takuya Yoshioka

This paper considers applying speaker diarization (SD) to the output tokens of automatic speech recognition (ASR). We formulate the task as a sequence classification problem in which we estimate the correct speaker label for each ASR output token from a sequence of token-level speaker embeddings and a set of candidate speaker profiles. To leverage information from the ASR model, we use the recently proposed t-vector for speaker embedding estimation. Whereas previous studies classified t-vectors using cosine similarities with ad hoc post-processing, we propose a sequence classification model that exploits the sequential nature of the task more effectively. To handle a variable number of speakers, we use a classification model inspired by transformer-based target-speaker voice activity detection. We conduct experiments on the AMI meeting corpus in both speaker identification and diarization settings and show the effectiveness of our approach.
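
For illustration, the cosine-similarity classification that the abstract attributes to previous studies can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `cosine` and `assign_speakers`, the 2-dimensional toy vectors, and the per-token argmax rule are all hypothetical, and no post-processing is included. Each token is labeled independently of its neighbors, which is the limitation a sequence classification model is meant to address.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_speakers(token_embeddings, profiles):
    """Label each token with the index of its most similar speaker profile.

    Assumption: every token embedding and every profile has the same dimension.
    Ties are resolved in favor of the lowest profile index.
    """
    labels = []
    for emb in token_embeddings:
        scores = [cosine(emb, p) for p in profiles]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy data, invented for illustration only (not taken from the paper).
profiles = [[1.0, 0.0], [0.0, 1.0]]  # two candidate speaker profiles
tokens = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]  # token-level embeddings
labels = assign_speakers(tokens, profiles)
print(labels)
```

Because the number of profiles is simply the length of the `profiles` list, this baseline handles a variable number of speakers trivially; a learned sequence model needs an explicit mechanism for that, which is the role the abstract assigns to the design inspired by target-speaker voice activity detection.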