ISCA Archive Interspeech 2021

Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition

Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

This paper proposes a method to improve the performance of multilingual automatic speech recognition (ASR) systems through language- and speaker-independent feature transformation within an end-to-end (E2E) ASR framework. Specifically, we propose a multi-task training method that combines a language recognizer and a speaker recognizer with an E2E ASR system based on the connectionist temporal classification (CTC) loss function. We add language recognition and speaker recognition as sub-tasks of the E2E ASR network and place a gradient reversal layer (GRL) in front of each sub-task to achieve language- and speaker-independent feature transformation. Evaluation on a multilingual ASR system covering six languages shows that, by introducing multi-task training and GRLs, the proposed method achieves higher accuracy than the ASR models for each individual language.
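
The role of the gradient reversal layer can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scaling factors `lam_lang` and `lam_spk`, and the toy gradient values are all assumptions made for illustration. A GRL is generally defined as an identity mapping in the forward pass that flips the sign of the gradient (optionally scaled) in the backward pass, so the shared features are pushed to make the language and speaker sub-tasks harder while still serving the CTC objective.

```python
from typing import List


def grl_forward(features: List[float]) -> List[float]:
    """Forward pass: the GRL passes the features through unchanged."""
    return list(features)


def grl_backward(grad_output: List[float], lam: float = 1.0) -> List[float]:
    """Backward pass: the GRL negates the incoming gradient, scaled by lam."""
    return [-lam * g for g in grad_output]


def shared_feature_gradient(
    grad_ctc: List[float],
    grad_lang: List[float],
    grad_spk: List[float],
    lam_lang: float = 1.0,  # hypothetical weight, not taken from the paper
    lam_spk: float = 1.0,   # hypothetical weight, not taken from the paper
) -> List[float]:
    """Combine the gradients that reach the shared features.

    The CTC gradient arrives unchanged; the language and speaker gradients
    arrive through their GRLs and are therefore sign-flipped.
    """
    rev_lang = grl_backward(grad_lang, lam_lang)
    rev_spk = grl_backward(grad_spk, lam_spk)
    return [c + l + s for c, l, s in zip(grad_ctc, rev_lang, rev_spk)]


if __name__ == "__main__":
    # Toy gradients with respect to a two-dimensional shared feature.
    print(shared_feature_gradient([1.0, 1.0], [0.5, 0.0], [0.0, 0.25]))
```

In an automatic-differentiation framework the same behaviour would normally be written as a custom backward function rather than computed by hand; the manual version above only makes the sign flip explicit.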