Grapheme-to-phoneme (G2P) models are a key component of Automatic Speech
Recognition (ASR) systems, such as the ASR system in Alexa, because they
generate pronunciations for out-of-vocabulary words that are missing from
the pronunciation lexicons (mappings like “e c h o” → “E k oU”).
Most G2P systems are monolingual and based on traditional joint-sequence
n-gram models
[1, 2]. As an alternative, we present a single end-to-end trained neural
G2P model that shares the same encoder and decoder across multiple languages.
This allows the model to utilize a combination of universal symbol
inventories of Latin-like alphabets and cross-linguistically shared
feature representations. Such a model is especially useful for
low-resource languages and for code-switching or foreign words, where
the pronunciations in one language need to be adapted to other locales
or accents. We further experiment with a word language distribution vector
as an additional training target in order to improve system performance
by helping the model decouple pronunciations across a variety of languages
in the parameter space. We show a 7.2% average improvement in phoneme
error rate on low-resource languages and no degradation on
high-resource ones compared to monolingual baselines.
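
For readers unfamiliar with the metric, phoneme error rate can be sketched
as follows. This is a minimal illustration that assumes the common
definition (total Levenshtein edit distance between predicted and
reference phoneme sequences, divided by the total number of reference
phonemes); the exact scoring used in our experiments is not specified
here, and the function names edit_distance and phoneme_error_rate are
hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of str)."""
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """Assumed definition: total edits / total reference phonemes."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_ref = sum(len(ref) for ref, _ in pairs)
    if total_ref == 0:
        raise ValueError("no reference phonemes to score against")
    return total_edits / total_ref


# Reference pronunciation from the lexicon example, and a hypothetical
# model output with one substituted phoneme.
reference = "E k oU".split()
hypothesis = "E g oU".split()
per = phoneme_error_rate([(reference, hypothesis)])
```

Scoring is done over phoneme tokens rather than characters, so a
multi-character symbol such as “oU” counts as a single unit.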