ISCA Archive Interspeech 2025

SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization

Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura

We have developed a sequential output model with speaker vector constraints for the joint modeling of multi-talker automatic speech recognition (ASR) and speaker diarization. The conventional approach to this joint modeling, called SOMSRED, enables the estimation of speaker embeddings from fully overlapped speech by discretizing the speaker embedding space and treating the speaker embeddings as tokens. However, because of this discretization, the predicted speaker embeddings become less distinctive than those obtained directly from non-overlapping speech. To address this problem, we add a new training objective that optimizes speaker embeddings in continuous space without discretization. Experimental results show that the proposed method avoids overfitting to the discretized speaker tokens and outperforms SOMSRED in both ASR performance and speaker embedding performance.
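As a rough illustration of why a continuous-space objective can complement discretized speaker tokens, the sketch below pairs a token-level cross-entropy with a cosine-distance term. It is not the authors' implementation: the abstract does not state the form of the loss, so the cosine distance, the nearest-entry quantization, the `weight` parameter, and all function names here are assumptions made for illustration only.

```python
import numpy as np


def quantize(embedding, codebook):
    """Return the index of the codebook entry closest in cosine similarity.

    Hypothetical stand-in for discretizing the speaker embedding space.
    """
    e = embedding / np.linalg.norm(embedding)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return int(np.argmax(c @ e))


def token_cross_entropy(logits, target_index):
    """Cross-entropy of the predicted speaker-token distribution."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target_index])


def cosine_distance(predicted, reference):
    """1 - cosine similarity; an assumed choice of continuous-space loss."""
    denom = np.linalg.norm(predicted) * np.linalg.norm(reference)
    return float(1.0 - (predicted @ reference) / denom)


def combined_loss(logits, predicted, reference, codebook, weight=1.0):
    """Token loss plus a weighted continuous-space term (assumed form)."""
    target = quantize(reference, codebook)
    return token_cross_entropy(logits, target) + weight * cosine_distance(
        predicted, reference
    )


# Two different reference speakers that fall into the same discrete token.
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
speaker_a = np.array([1.0, 0.0])
speaker_b = np.array([0.8, 0.6])
logits = np.array([2.0, 0.0])
predicted = np.array([1.0, 0.0])

loss_a = combined_loss(logits, predicted, speaker_a, codebook)
loss_b = combined_loss(logits, predicted, speaker_b, codebook)
```

In this toy setup both reference speakers quantize to the same token, so the token loss alone cannot tell them apart; only the continuous term penalizes the prediction for being farther from `speaker_b` than from `speaker_a`.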