ISCA Archive Interspeech 2005

Stochastic pronunciation modeling by ergodic-HMM of acoustic sub-word units

V. Ramasubramanian, P. Srinivas, T. V. Sreenivas

We propose a stochastic pronunciation model based on an ergodic hidden Markov model (EHMM) of automatically derived acoustic sub-word units (SWUs). The proposed EHMM discovers the pronunciation structure inherent in the acoustic training data of a word without any a priori phonetic transcriptions. The EHMM is an HMM of HMMs: its states are SWU HMMs, and its state transitions compose the various possible lexicons. The EHMM parameters are estimated by an iterative segmental K-means procedure that jointly optimizes the sub-word units (states) and the pronunciation structure parameters (state transitions). The EHMM-based pronunciation model is evaluated on an English isolated word recognition task with 70 speakers drawn from 8 highly different first languages. Results show that the EHMM learns the lexicon distribution over the population of speakers for each word, thereby effectively modeling inter-speaker pronunciation variability. The EHMM improves word recognition accuracy by 8% (absolute) over the performance of a single most likely lexicon.
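
The alternation between segmentation and re-estimation described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation, and it rests on several assumptions: features are scalar, each sub-word unit is reduced to a single unit-variance Gaussian mean rather than a full SWU HMM, initial state probabilities are uniform, and transitions use add-one smoothing. The names `viterbi`, `segmental_kmeans`, and `unit_sequence` are hypothetical and chosen only for illustration.

```python
import math


def viterbi(obs, means, log_trans):
    """Best state path through a fully connected (ergodic) model."""
    k = len(means)

    def emit(x, s):
        # Assumption: unit-variance Gaussian log-likelihood, constants dropped.
        return -0.5 * (x - means[s]) ** 2

    score = [emit(obs[0], s) for s in range(k)]  # uniform initial probabilities
    back = []
    for x in obs[1:]:
        prev, ptr, score = score, [], []
        for s in range(k):
            best = max(range(k), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + emit(x, s))
        back.append(ptr)
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace back-pointers from the last frame
        state = ptr[state]
        path.append(state)
    return path[::-1]


def segmental_kmeans(seqs, means, n_iter=5):
    """Alternate Viterbi segmentation with re-estimation of units and transitions."""
    k = len(means)
    means = list(means)
    log_trans = [[math.log(1.0 / k)] * k for _ in range(k)]
    paths = []
    for _ in range(n_iter):
        # Segmentation step: assign every frame to a unit (state).
        paths = [viterbi(o, means, log_trans) for o in seqs]
        sums, counts = [0.0] * k, [0] * k
        trans = [[1.0] * k for _ in range(k)]  # add-one smoothing (assumption)
        for o, p in zip(seqs, paths):
            for x, s in zip(o, p):
                sums[s] += x
                counts[s] += 1
            for a, b in zip(p, p[1:]):
                trans[a][b] += 1.0
        # Re-estimation step: update unit means and transition probabilities.
        means = [sums[s] / counts[s] if counts[s] else means[s] for s in range(k)]
        log_trans = [[math.log(c / sum(row)) for c in row] for row in trans]
    return means, log_trans, paths  # paths come from the final segmentation step


def unit_sequence(path):
    """Collapse repeated states into the sequence of units visited."""
    return [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]


# Toy data: two "utterances" of the same word with different unit sequences.
seqs = [[0.0, 0.1, -0.1, 5.0, 5.1, 4.9], [0.2, 0.0, 5.2, 5.0, 10.0, 10.1]]
means, log_trans, paths = segmental_kmeans(seqs, [1.0, 4.0, 9.0])
print([unit_sequence(p) for p in paths])  # [[0, 1], [0, 1, 2]]
```

In this toy setting, the learned transition probabilities play the role of the pronunciation structure: the two utterances follow different unit sequences, and both remain likely under the same ergodic model.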