Efficient grapheme-to-phoneme (G2P) conversion models are indispensable
components for achieving state-of-the-art performance in modern automatic
speech recognition (ASR) and text-to-speech (TTS) systems. These models
provide such systems with a means of generating accurate pronunciations
for unseen words. Recent work in this domain is based on recurrent neural
networks (RNNs) that can translate grapheme sequences into phoneme
sequences while taking the full grapheme context into account. Explicit
alignment information has been found essential for achieving high
performance with these models, and the quality of a G2P model depends
heavily on the imposed alignment constraints.
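As a hypothetical illustration (the example words and chunking below are ours, not drawn from the paper), a many-to-many alignment pairs grapheme chunks with phoneme chunks, so that one grapheme may emit several phonemes and several graphemes may emit a single phoneme:

    # Hypothetical many-to-many alignments (CMUDict-style phonemes,
    # stress markers omitted). "x" -> "K S" is one-to-many;
    # "ph" -> "F" is many-to-one; the silent final "e" emits nothing.
    alignments = {
        "boxer": [("b", "B"), ("o", "AA"), ("x", "K S"), ("er", "ER")],
        "phone": [("ph", "F"), ("o", "OW"), ("n", "N"), ("e", "")],
    }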
In this paper, we propose a novel
approach that uses complex many-to-many G2P alignments to improve
the performance of G2P models based on deep bidirectional long short-term
memory (BLSTM) RNNs. Extensive experiments cover models with varying
numbers of hidden layers, optional projection layers, different input
splicing windows, and several alignment schemes. We observe that complex
alignments significantly improve performance on the publicly available
CMUDict US English dataset. We also compare our results with previously
published work.
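To make the architecture concrete, the following is a minimal sketch of such a model, assuming PyTorch. The class name, layer sizes, splicing window, and the placement of the projection layer after the BLSTM stack are our illustrative assumptions, not the paper's published configuration. Under a many-to-many alignment, each grapheme position is trained to emit its aligned phoneme chunk, which may contain zero, one, or several phonemes.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BLSTMG2P(nn.Module):
        """Illustrative BLSTM G2P model with input splicing and a
        linear projection layer (hypothetical configuration)."""

        def __init__(self, num_graphemes, num_chunks, embed_dim=64,
                     hidden_dim=256, proj_dim=128, num_layers=2, splice=1):
            super().__init__()
            self.splice = splice  # graphemes of left/right context to splice
            self.embed = nn.Embedding(num_graphemes, embed_dim)
            # Splicing concatenates a window of embeddings at each step,
            # multiplying the input feature dimension by the window size.
            spliced_dim = embed_dim * (2 * splice + 1)
            self.blstm = nn.LSTM(spliced_dim, hidden_dim,
                                 num_layers=num_layers,
                                 bidirectional=True, batch_first=True)
            # Projection layer between the BLSTM outputs and the softmax.
            self.proj = nn.Linear(2 * hidden_dim, proj_dim)
            self.out = nn.Linear(proj_dim, num_chunks)

        def forward(self, graphemes):              # (batch, time)
            x = self.embed(graphemes)              # (batch, time, embed)
            # Zero-pad the word boundaries, then splice neighbouring
            # embeddings into each time step.
            padded = F.pad(x, (0, 0, self.splice, self.splice))
            t = graphemes.size(1)
            x = torch.cat([padded[:, i:i + t]
                           for i in range(2 * self.splice + 1)], dim=-1)
            h, _ = self.blstm(x)                   # (batch, time, 2*hidden)
            # One phoneme-chunk prediction per grapheme position.
            return self.out(torch.tanh(self.proj(h)))

    # Usage: 30 grapheme symbols, 120 phoneme-chunk classes (made-up sizes).
    model = BLSTMG2P(num_graphemes=30, num_chunks=120)
    logits = model(torch.randint(0, 30, (8, 12)))  # -> (8, 12, 120)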