ISCA Archive Interspeech 2023
ISCA Archive Interspeech 2023

Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura

This paper proposes autoregressive modeling of the joint multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a simple and promising approach. However, it does not predict utterance timestamp information despite its being important in practice. To address this problem, our key idea is to extend autoregressive-modeling-based multi-talker ASR to predict quantized timestamp tokens representing the start and end time of an utterance. Our method estimates transcription and utterance-level timestamp tokens of multiple speakers one after another. This enables joint modeling of multi-talker ASR and timestamps prediction without changing the simple autoregressive modeling of the conventional multi-talker ASR. Experimental results show that our method outperforms the ASR performance of conventional autoregressive multi-talker ASR without timestamp prediction and achieves promising timestamp prediction accuracy.