ISCA Archive Interspeech 2022

Streaming Target-Speaker ASR with Neural Transducer

Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki

Although recent advances in deep learning have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which multiple voices overlap. One conventional approach to this problem is to cascade a speech separation or target speech extraction front-end with an ASR back-end. However, the extra computation cost of the front-end module is critical for a quick response, especially in streaming ASR. In this paper, we propose a target-speaker ASR (TS-ASR) system that implicitly integrates the target speech extraction functionality within a streaming end-to-end (E2E) ASR system, i.e., a recurrent neural network-transducer (RNNT). Our system uses a similar idea to target speech extraction but implements it directly at the level of the RNNT encoder. This allows us to realize TS-ASR without the extra computation cost of a front-end. Note that our work differs from prior studies on E2E TS-ASR in two major ways: we investigate streaming models and base our study on Conformer models, whereas prior studies used RNN-based systems and dealt only with offline processing. Experiments confirm that our TS-ASR achieves recognition performance comparable to that of a conventional cascade system in the offline setting, while reducing computation costs and enabling streaming TS-ASR.
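The core idea of conditioning the ASR encoder on a target-speaker embedding, rather than running a separate extraction front-end, can be sketched as follows. This is a minimal illustration of SpeakerBeam-style multiplicative adaptation applied to encoder features; the function and variable names are hypothetical, and the paper's exact conditioning scheme and layer placement may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_on_speaker(frame_feats, spk_embedding, W):
    """Bias encoder frame features toward the target speaker with a
    multiplicative gate derived from the speaker embedding.
    (Illustrative sketch only; not the paper's exact formulation.)"""
    gate = 1.0 / (1.0 + np.exp(-W @ spk_embedding))  # sigmoid gate, shape (D,)
    return frame_feats * gate                        # broadcast over time frames

# Hypothetical dimensions: T frames, D-dim features, E-dim speaker embedding
T, D, E = 50, 16, 8
feats = rng.standard_normal((T, D))   # output of an intermediate encoder layer
spk = rng.standard_normal(E)          # target-speaker embedding (e.g. a d-vector)
W = rng.standard_normal((D, E))       # learned projection (random here)

adapted = condition_on_speaker(feats, spk, W)
print(adapted.shape)  # (50, 16)
```

Because the gating is a single per-layer elementwise operation inside the encoder, it adds negligible cost per frame and is compatible with frame-synchronous streaming inference, in contrast to a cascaded extraction front-end that must process the signal before ASR begins.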