ISCA Archive Interspeech 2025

Enhancing Serialized Output Training for Multi-Talker ASR with Soft Monotonic Alignment and Utterance-level Timestamp

Fengyun Tan, Tao Wei, Kun Zou, Ning Cheng, Shaojun Wang, Jing Xiao

Multi-talker automatic speech recognition (ASR) has gained significant attention due to its broad applications in conference settings. Among the many proposed approaches, Serialized Output Training (SOT) stands out. However, when multiple speakers talk simultaneously, predicting the speaker change symbol becomes more challenging. Boundary-Aware SOT (BA-SOT) uses multi-task learning to predict speaker change points, which increases model complexity and training cost, and it does not predict timestamps. We therefore propose an enhanced SOT, called Soft Monotonic Alignment SOT (SMA-SOT), which introduces an SMA Loss and utterance-level timestamps. These two components complement each other: the timestamps promote learning of the monotonic alignment constraint, and in turn the SMA Loss makes timestamp prediction more accurate. Experimental results on the AliMeeting test set show that SMA-SOT achieves relative CER reductions of 5.4% and 0.35% over SOT and BA-SOT, respectively, and reaches a Speaker Change Accuracy (SCA) of 81.3%.