ISCA Archive Interspeech 2025

Switch Conformer with Universal Phonetic Experts for Multilingual ASR

Masato Mimura, Jaeyoung Lee, Tatsuya Kawahara

Multilingual end-to-end ASR presents significant challenges due to the need to accommodate diverse writing systems, lexicons, and grammatical structures. Existing methods often rely on large models with high computational costs to achieve adequate cross-language performance. To address this, we propose the Switch Conformer, which enhances model capacity while maintaining nearly the same inference cost as a standard Conformer. Our approach replaces the FFN module in each Conformer block with a sparse mixture of independent experts, activating only one expert per input to enable efficient language-specific feature learning. In addition, a shared expert trained with phonetic supervision captures language-universal speech characteristics. Experiments on streaming ASR using the CommonVoice dataset demonstrate that these experts work synergistically to achieve better performance than the baseline Conformer, with minimal additional active parameters.
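To make the routing idea concrete, below is a minimal NumPy sketch of a switch-style FFN layer with one additional always-active shared expert. The dimensions, ReLU FFN form, and weight initializations are illustrative assumptions, not the authors' implementation, and the phonetic supervision of the shared expert is not modeled here; the sketch only shows how top-1 routing keeps active parameters close to a single FFN while a shared expert is always applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def ffn(x, W1, W2):
    # Position-wise feed-forward network: ReLU MLP (assumed form)
    return np.maximum(x @ W1, 0.0) @ W2

d, d_ff, n_experts = 8, 16, 4  # hypothetical sizes
# Independent (language-specific) experts: only one is active per frame
experts = [(rng.standard_normal((d, d_ff)) * 0.1,
            rng.standard_normal((d_ff, d)) * 0.1) for _ in range(n_experts)]
# Shared expert, always active (stand-in for the phonetically supervised one)
W1_s = rng.standard_normal((d, d_ff)) * 0.1
W2_s = rng.standard_normal((d_ff, d)) * 0.1
W_router = rng.standard_normal((d, n_experts)) * 0.1

def switch_ffn(X):
    # Top-1 ("switch") routing: each frame activates a single expert,
    # so the active parameter count stays close to one FFN's.
    logits = X @ W_router
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    idx = probs.argmax(-1)          # chosen expert per frame
    out = np.empty_like(X)
    for e in range(n_experts):
        mask = idx == e
        if mask.any():
            W1, W2 = experts[e]
            # Scale the expert output by its routing probability
            out[mask] = probs[mask, e:e + 1] * ffn(X[mask], W1, W2)
    # Add the always-active shared expert's output
    return out + ffn(X, W1_s, W2_s)

X = rng.standard_normal((5, d))  # 5 frames of d-dim features
Y = switch_ffn(X)
print(Y.shape)  # (5, 8)
```

Because only one of the four independent experts runs per frame, the per-frame compute is roughly that of two FFNs (the routed one plus the shared one), regardless of how many experts are added.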