ISCA Archive Interspeech 2020
ISCA Archive Interspeech 2020

Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition

Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng

In this paper, we conduct data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, which is aimed for a real CSSR contest in China. The overall training sets have three subsets, i.e., a code-switching data set, an English (LibriSpeech) and a Mandarin data set respectively. The code-switching data are Mandarin dominated. First of all, it is found using the overall data yields worse results, and hence data selection study is necessary. Then to exploit monolingual data, we find data matching is crucial. Mandarin data is closely matched with the Mandarin part in the code-switching data, while English data is not. However, Mandarin data only helps on those utterances that are significantly Mandarin-dominated. Besides, there is a balance point, over which more monolingual data will divert the CSSR system, degrading results. Finally, we analyze the effectiveness of combining monolingual data to train a CSSR system with the HMM-DNN hybrid framework. The CSSR system can perform within-utterance code-switch recognition, but it still has a margin with the one trained on code-switching data.