ISCA Archive Interspeech 2020

On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model

Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupeshkumar Mehta

Regional entities often occur in code-mixed text in the non-native Roman script, and synthesizing them with the correct pronunciation and accent is a challenging problem. English grapheme-to-phoneme (G2P) rules fail for such entities because of orthographic errors and phonological differences between English and the regional languages. The traditional approach to this problem involves language identification, followed by transliteration of the regional entities to their native script, and then passing them through a native G2P. In this work, we simplify this module-based architecture by learning an end-to-end mixlingual G2P in a multi-task setting. Also, rather than mapping the output phone sequences from our mixlingual G2P to the English phoneset or using a “shared” phoneset, we use polyglot data and a “separated” phoneset to train a mixlingual synthesizer that improves the accent of the synthesized voice for regional entities. We use Hindi-English as the code-mixed scenario and show absolute incremental gains of up to 28% in pronunciation accuracy and a 0.9 gain in “overall impression” mean opinion score (MOS) over a standard English monolingual text-to-speech (TTS) system.
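As a rough illustration of the traditional module-based pipeline described above (language identification, then transliteration, then a native G2P), the following minimal sketch chains three toy lookup tables. Everything in it is an assumption made for illustration: the tables, the function names `identify_language` and `modular_g2p`, the phone symbols, and the `en_`/`hi_` prefixes used to mimic a “separated” phoneset are hypothetical and are not taken from the paper. It is not the proposed end-to-end mixlingual model, which replaces these separate stages with a single learned G2P.

```python
from typing import Dict, List

# Toy lexicons. All entries are hypothetical placeholders, not data from the paper.
ENGLISH_G2P: Dict[str, List[str]] = {
    "play": ["p", "l", "ey"],
    "song": ["s", "ao", "ng"],
}
# Romanized Hindi word -> native-script form (stand-in for a transliteration model).
TRANSLITERATE: Dict[str, str] = {"dil": "दिल"}
# Native-script Hindi word -> phones (stand-in for a native Hindi G2P).
HINDI_G2P: Dict[str, List[str]] = {"दिल": ["d", "i", "l"]}


def identify_language(word: str) -> str:
    """Toy language ID: a word counts as Hindi if the transliteration table knows it."""
    return "hi" if word.lower() in TRANSLITERATE else "en"


def modular_g2p(sentence: str, separated: bool = True) -> List[str]:
    """Run the three-stage pipeline word by word.

    With separated=True each phone carries a language prefix, so the English
    and Hindi inventories stay distinct; with separated=False the two
    languages share symbols such as "l". Words missing from the toy tables
    raise KeyError.
    """
    phones: List[str] = []
    for word in sentence.lower().split():
        lang = identify_language(word)
        if lang == "hi":
            word_phones = HINDI_G2P[TRANSLITERATE[word]]
        else:
            word_phones = ENGLISH_G2P[word]
        if separated:
            word_phones = [f"{lang}_{p}" for p in word_phones]
        phones.extend(word_phones)
    return phones
```

Calling `modular_g2p("play dil song")` tags the phones of “dil” with the Hindi prefix and the rest with the English prefix, whereas `separated=False` returns untagged symbols, so a phone like `l` is the same unit in both languages. An error in any one stage (for example, a misidentified language) propagates to the final phone sequence, which is the weakness of the modular design that an end-to-end model aims to avoid.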