ISCA Archive Interspeech 2025

A Hybrid Approach to Combining Role Diarization with ASR for Professional Conversations

Bongjun Kim, Arindam Ghosh, Mark C. Fuhs, Anurag Chowdhury, Deblin Bagchi, Monika Woszczyna

In professional settings, conversations often involve participants with defined roles (doctor, patient, lawyer, client, etc.), and a conversational transcript may be made more intelligible by annotating each conversational turn with the speaker's role, e.g., "Doctor: How are you feeling? Patient: I sprained my ankle." We propose a novel hybrid architecture that combines a d-vector-based diarization system with an ASR model augmented to label the speaker's role at each speaker change point. This system outperforms modular and fully integrated baselines by 12% and 28%, respectively. We also show that, when an ASR transducer model is trained to predict role or speaker-change tokens as part of the transcript, the timings of these tokens can improve diarization more than the timings of the adjacent word tokens can, even though no explicit training signal conveys precise speaker change points.
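As an illustration of the role-token idea only, the following minimal sketch groups a timed token stream into role-labeled turns and renders it in the "Doctor: ... Patient: ..." style quoted above. The token spellings (`<doctor>`, `<patient>`), the `Turn` type, the function names, and the example timings are all assumptions made for this sketch. They are not taken from the paper, and the d-vector diarization component is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical role-token inventory; the actual token set is not given in the abstract.
ROLE_TOKENS = {"<doctor>": "Doctor", "<patient>": "Patient"}


@dataclass
class Turn:
    role: str
    start: float  # emission time of the token that opened the turn, in seconds
    words: List[str] = field(default_factory=list)


def tokens_to_turns(tokens: List[Tuple[str, float]]) -> List[Turn]:
    """Group (token, time) pairs into turns, opening a new turn at each role token."""
    turns: List[Turn] = []
    for token, time in tokens:
        if token in ROLE_TOKENS:
            # A role token marks a speaker change point and names the new speaker's role.
            turns.append(Turn(ROLE_TOKENS[token], time))
        elif turns:
            turns[-1].words.append(token)
        else:
            # Words seen before any role token are kept under an "Unknown" role.
            turns.append(Turn("Unknown", time, [token]))
    return turns


def format_transcript(turns: List[Turn]) -> str:
    """Render turns as 'Role: words' segments joined by spaces."""
    return " ".join(f"{t.role}: {' '.join(t.words)}" for t in turns)


# Made-up token stream with made-up timings, mirroring the example in the abstract.
example_tokens = [
    ("<doctor>", 0.0), ("How", 0.1), ("are", 0.3), ("you", 0.4), ("feeling?", 0.6),
    ("<patient>", 1.2), ("I", 1.3), ("sprained", 1.5), ("my", 1.8), ("ankle.", 1.9),
]
example_turns = tokens_to_turns(example_tokens)
print(format_transcript(example_turns))
```

In this sketch, each turn's `start` is the emission time of its role token, which is the kind of timing the abstract reports as more useful for diarization than adjacent word timings. How those timings are actually passed to the diarization system is not described in the abstract and is left out here.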