ISCA Archive Interspeech 2025

Joint Target-Speaker ASR and Activity Detection

Chikara Maeda, Muhammad Shakeel, Yui Sudo

Target-speaker automatic speech recognition (TS-ASR) has shown promise in transcribing speech in multi-speaker scenarios by focusing on a specific speaker. However, existing approaches employ a cascaded design in which voice activity detection (VAD) and TS-ASR are optimized separately. This separation leads to unstable training and error accumulation, limiting overall performance. We address these issues by proposing TS-ASR-AD, a joint end-to-end model that integrates VAD with TS-ASR, enabling stable training and reducing error accumulation. Moreover, the improved training stability yields better connectionist temporal classification (CTC) alignment in the token probabilities, further enhancing transcription accuracy. Our approach outperforms previous studies, achieving word error rates (WERs) of 6.61% and 14.81% and diarization error rates (DERs) of 1.23% and 2.66% on the Libri2Mix and Libri3Mix datasets, respectively.