ISCA Archive Interspeech 2023

Cascaded encoders for fine-tuning ASR models on overlapped speech

Richard Rose, Oscar Chang, Olivier Siohan

Multi-talker automatic speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. While MT-ASR models have typically been trained from scratch on simulated overlapping speech datasets, there is generally an underlying goal that these models also achieve state-of-the-art performance on single-speaker utterances. This implies that they must be competitive with the best available fine-tuned speech models trained on massive datasets collected from a wide variety of task domains. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascaded configuration improves WER on overlapping speech utterances relative to a baseline multi-talker model without sacrificing the performance achievable by the foundation model on non-overlapping utterances.
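
As a rough illustration of the cascaded-encoder idea described in the abstract, the Python (PyTorch) sketch below chains a frozen, pre-trained "foundation" encoder with a trainable multi-talker stage, so that only the added stage is fine-tuned on overlapped speech while single-speaker behavior of the foundation model is preserved. All class names, layer choices, and dimensions here are illustrative assumptions, not the paper's actual RNN-T architecture.

import torch
import torch.nn as nn

class FoundationEncoder(nn.Module):
    """Stand-in for a large pre-trained ASR encoder (kept frozen). Hypothetical."""
    def __init__(self, input_dim=80, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, feats):
        out, _ = self.lstm(feats)
        return out

class MaskEncoder(nn.Module):
    """Additional multi-talker stage cascaded on top of the foundation output. Hypothetical."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, enc_out):
        out, _ = self.lstm(enc_out)
        return out

class CascadedMTEncoder(nn.Module):
    """Cascade: foundation encoder -> multi-talker encoder.

    Only the multi-talker stage is fine-tuned on simulated overlapped speech;
    the foundation encoder stays frozen so single-speaker performance is preserved.
    """
    def __init__(self, foundation, mask_encoder):
        super().__init__()
        self.foundation = foundation
        self.mask_encoder = mask_encoder
        for p in self.foundation.parameters():
            p.requires_grad = False  # freeze the pre-trained encoder

    def forward(self, feats, use_cascade=True):
        with torch.no_grad():
            base = self.foundation(feats)
        # Overlapped speech goes through the cascaded multi-talker stage;
        # clean single-speaker input can use the foundation output directly.
        return self.mask_encoder(base) if use_cascade else base

# Toy usage: a batch of 4 utterances, 200 frames of 80-dim features.
encoder = CascadedMTEncoder(FoundationEncoder(), MaskEncoder())
feats = torch.randn(4, 200, 80)
enc_out = encoder(feats)   # cascaded path for overlapped speech
print(enc_out.shape)       # torch.Size([4, 200, 512])

In an RNN-T system, the resulting encoder output would feed the prediction network and joint network; that part is omitted here since the abstract does not specify those details.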