To address the problem of data scarcity in training language models
(LMs) for code-switching (CS) speech recognition, we propose an approach
to obtain augmentation texts from three different viewpoints. The first
enhances monolingual LMs by selecting sentences that match existing
conversational corpora; the second performs syntactically constrained
word replacements on a monolingual Chinese corpus, with the help of an
aligned word list obtained from a pseudo-parallel corpus and the
part-of-speech (POS) tags of words; the third generates text with a
pointer-generator network equipped with a copy mechanism, trained on
real CS text data. Sentences from all three approaches improve CS LMs,
and they are finally fused into a single LM for CS ASR tasks.
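The second viewpoint can be illustrated with a minimal sketch of POS-constrained replacement. The aligned word list, the POS tag set, and the restriction to nouns below are all illustrative assumptions, not the paper's actual resources:

```python
import random

# Hypothetical aligned word list (Chinese word -> English counterpart).
# A real system would derive this from a pseudo-parallel corpus and
# filter candidates by their part-of-speech tags.
ALIGNED_NOUNS = {
    "项目": "project",
    "报告": "report",
    "会议": "meeting",
}

def augment(tokens, pos_tags, p=0.5, rng=None):
    """Create a synthetic code-switching sentence by replacing some
    Chinese nouns with aligned English words. The POS tag acts as the
    syntactic constraint: only tokens tagged 'NN' are eligible."""
    rng = rng or random.Random(0)
    out = []
    for tok, pos in zip(tokens, pos_tags):
        if pos == "NN" and tok in ALIGNED_NOUNS and rng.random() < p:
            out.append(ALIGNED_NOUNS[tok])
        else:
            out.append(tok)
    return out

tokens = ["我们", "的", "项目", "需要", "一个", "报告"]
pos    = ["PN",   "DEG", "NN",  "VV",  "CD",  "NN"]
print(" ".join(augment(tokens, pos, p=1.0)))
# with p=1.0 every eligible noun is switched:
# 我们 的 project 需要 一个 report
```

In practice the replacement probability and the eligible POS classes would be tuned so the synthetic sentences match the switching statistics of real CS data.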
LMs built with the above augmented data were evaluated on two
Mandarin-English CS speech sets, DTANG and SEAME. Perplexities were
greatly reduced with all kinds of augmented text, and speech recognition
performance improved steadily: the mixed word error rate (MER) on the
DTANG and SEAME evaluation sets was relatively reduced by 9.10% and
29.73%, respectively.
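The perplexity figures reported above follow the standard definition, sketched here for reference (the toy probabilities are illustrative only):

```python
import math

def perplexity(log_probs):
    """Perplexity over a sequence from its per-token natural-log
    probabilities: exp(-mean log p). Lower is better; a reduction in
    perplexity indicates the augmented text helps the LM."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Toy example: if every token gets probability 1/8, perplexity is 8,
# regardless of sequence length.
lp = [math.log(1 / 8)] * 10
print(round(perplexity(lp), 2))  # 8.0
```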