ISCA Archive Interspeech 2024

SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization

Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura

This paper proposes SOMSRED, which jointly models multi-talker automatic speech recognition (ASR) and speaker diarization (SD) for fully overlapped speech from unknown speakers. The conventional method for jointly estimating ASR and SD requires non-overlapping speech and a separate clustering-based SD component to identify speakers accurately. However, speech is often overlapped, which degrades speaker identification performance, and the separate component makes the overall system sub-optimal. To address this problem, we build a sequential output model that recursively outputs transcriptions, timestamps, and newly introduced speaker identifiers from overlapped speech. Since speaker identifiers do not fully represent the speaker characteristics of unknown speakers, SOMSRED uses its intermediate features as speaker embeddings. Experimental results show the efficacy of the proposed method in speaker recognition, SD, and multi-talker ASR.
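To make the idea of a sequential output concrete, the following is a minimal sketch of how a single token stream carrying speaker identifiers, timestamps, and words could be decoded back into per-speaker segments. The token formats (`<spk:N>`, `<t:SECONDS>`), the `Segment` class, and the function `parse_serialized_output` are hypothetical and chosen only for illustration; the abstract does not specify the actual serialization, and this sketch does not cover the model or the speaker embeddings.

```python
import re
from dataclasses import dataclass

# Hypothetical token formats, assumed for illustration only.
SPK = re.compile(r"^<spk:(\w+)>$")
TIME = re.compile(r"^<t:(\d+(?:\.\d+)?)>$")


@dataclass
class Segment:
    speaker: str
    start: float
    end: float
    text: str


def _finish(current):
    """Turn the segment under construction into a Segment."""
    if len(current["times"]) != 2:
        raise ValueError("each segment needs exactly a start and an end timestamp")
    start, end = current["times"]
    return Segment(current["speaker"], start, end, " ".join(current["words"]))


def parse_serialized_output(tokens):
    """Group a flat token stream into one segment per speaker token."""
    segments = []
    current = None  # segment being built, or None before the first speaker token
    for tok in tokens:
        spk = SPK.match(tok)
        tim = TIME.match(tok)
        if spk:
            # A speaker token closes the previous segment and opens a new one.
            if current is not None:
                segments.append(_finish(current))
            current = {"speaker": spk.group(1), "times": [], "words": []}
        elif current is None:
            continue  # ignore anything before the first speaker token
        elif tim:
            current["times"].append(float(tim.group(1)))
        else:
            current["words"].append(tok)
    if current is not None:
        segments.append(_finish(current))
    return segments


tokens = [
    "<spk:1>", "<t:0.00>", "hello", "world", "<t:1.20>",
    "<spk:2>", "<t:0.50>", "good", "morning", "<t:1.80>",
]
segments = parse_serialized_output(tokens)
```

In this example the second segment starts before the first one ends, so overlapped speech is represented without any separate diarization pass over non-overlapping audio.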