ISCA Archive Interspeech 2011

Letter-to-phoneme conversion based on two-stage neural network focusing on letter and phoneme contexts

Kheang Seng, Yurie Iribe, Tsuneo Nitta

Improving Letter-To-Phoneme (L2P) conversion so that it can output phoneme strings for Out-Of-Vocabulary (OOV) words, especially in English, has become one of the most important issues in Text-To-Speech (TTS) research. In this paper, we propose a two-stage Neural Network (NN) based approach to solve the problem of conflicting outputs at the phonemic level. Letter and phoneme context-dependent models are combined and implemented in the first-stage NN, which converts several letters into several phonemes. The second-stage NN then predicts the final output phoneme by observing a combination of several consecutive phoneme sequences obtained from the first-stage NN. Our L2P conversion module therefore takes a sequence of letters as input and outputs one phoneme at a time. Evaluated mainly on the word accuracy of OOV words, the new approach generally provides higher performance.
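The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: both neural networks are replaced by toy stand-ins (a lookup table for the first stage and a majority vote for the second), and the names `letter_windows`, `first_stage`, `second_stage`, `convert`, the window width of 3, and the padding symbol are all assumptions made for illustration. It only shows how several overlapping first-stage phoneme sequences can be combined into a single phoneme per letter position.

```python
from collections import Counter

PAD = "_"  # hypothetical padding symbol for word boundaries

# Toy table standing in for a trained first-stage network (assumption).
TOY_TABLE = {"c": "k", "a": "ae", "t": "t", PAD: PAD}


def letter_windows(word, width=3):
    """Return one fixed-width letter window per letter, centered on it."""
    half = width // 2
    padded = PAD * half + word + PAD * half
    return [padded[i:i + width] for i in range(len(word))]


def first_stage(window):
    """Stand-in for the first-stage NN: several letters -> several phonemes."""
    return [TOY_TABLE.get(ch, "?") for ch in window]


def second_stage(candidates):
    """Stand-in for the second-stage NN: pick one phoneme from the candidates."""
    return Counter(candidates).most_common(1)[0][0]


def convert(word, width=3):
    """Run both stages and emit exactly one phoneme per input letter."""
    half = width // 2
    sequences = [first_stage(w) for w in letter_windows(word, width)]
    output = []
    for pos in range(len(word)):
        # Gather every first-stage prediction that covers this position.
        candidates = []
        for offset in range(-half, half + 1):
            src = pos + offset
            if 0 <= src < len(word):
                candidates.append(sequences[src][half - offset])
        output.append(second_stage(candidates))
    return output
```

With the toy table above, `convert("cat")` combines three overlapping phoneme sequences and yields one phoneme per letter. In a real system the two stand-in functions would be replaced by trained networks, and the second stage would see the consecutive phoneme sequences as a combined input rather than as a bag of votes.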