India is home to multiple languages, and training automatic speech
recognition (ASR) systems for them is challenging. Over time, each language
has adopted words from other languages, such as English, leading to
code-mixing. Most Indian languages also have their own unique scripts,
which is a major obstacle to training multilingual and code-switching
ASR systems.
In this paper, inspired by results in text-to-speech synthesis,
we use an in-house rule-based phoneme-level common label set (CLS)
representation to train multilingual and code-switching ASR for Indian
languages. We propose two end-to-end (E2E) ASR systems. In the first
system, the E2E model is trained on the CLS representation, and we
use a novel data-driven backend to recover the native language script.
In the second system, we propose a modification to the E2E model, wherein
the CLS representation and the native language characters are used
simultaneously for training. We report results on the multilingual
and code-switching (MUCS) ASR challenge 2021. Our best systems achieve
≈6% and 5% improvements in word error rate over the baseline system
for the multilingual and code-switching tasks, respectively, on the
challenge development data.